Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0710

D toc

This section describes techniques to normalize numeric values in your datasets. Ideally, your source systems are configured to capture and deliver data using a consistent set of units in a standardized structure and format. In practice, data from multiple systems can illuminate differences in the level of precision used in numeric data or differences in text entries that reference the same thing. Within

D s product
rtrue
, you can use the following techniques to address some of the issues you might encounter in the standardization of units and values for numeric types.

Numeric precision

In 

D s product
, mathematical computations are performed using 64-bit floating point operations to 15 decimals of precision. However, due to rounding off, truncation, and other technical factors, small discrepancies in outputs can be expected. Example:

-636074.22
-2465086.34

Suppose you apply the following transformation:

D trans
RawWrangletrue
p03ValueMySum
Typestep
WrangleTextderive type: single value: (-636074.22 + -2465086.34) as: 'MySum'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value(-636074.22 + -2465086.34)
p03NameNew column name
SearchTermNew formula

The expected output in the MySum column: -3101160.56

The actual output for in the MySum column: -3101160.5599999996

Info

NOTE: For 64-bit floating point mathematical operations, deviations like the above are intrinsic to the Decimal data type and how the platform performs computations.

Depending on your precision requirements, you can manage precision across your columns using a transformation like the following, which rounds off MySum to three digits:

D trans
RawWrangletrue
Typestep
WrangleTextset col: WM_Week value: round($col, 3)
p01NameColumns
p01ValueMySum
p02NameFormula
p02ValueROUND($col,3)
SearchTermEdit column with formula

For more information on floating point computations, see https://en.wikipedia.org/wiki/Numeric_precision_in_Microsoft_Excel.

Standardize decimal precision

If decimal values in a column are of varying levels of precision, you can standardize to a single level of precision.

Steps:

  1. From the column menu, select Column Details.
  2. .In the Column Details panel, select the Patterns tab.Among the patterns, select the following:

    Code Block
    {digit}.{digit}
  3. In the Suggestions panel on the right, locate the Edit Column transformation suggestion that uses the ROUND function. Click Edit.
  4. Change the second parameter of the ROUND function to match the number of digits of precision.

You can generalize this formatting across multiple columns by applying the $col reference in the transformation's function, as in the following:

D trans
Typestep
p01NameColumns
p01ValuecolA, colB, colC
p02NameFormula
p02ValueIFVALID($col, ['Float'], ROUND($col, 2))
SearchTermEdit column with formula

See Column Details Panel

For more information, see ROUND Function.

Standardize units

Tip

Tip: Each column that contains numeric values should have an identified unit of measurement. Ideally, this information is embedded in the name of the column data. If the unit of measurement is not included, it can be difficult to properly interpret the data.

D s product
 does not impose any units on imported data. For example, a column of values in floating point format could represent centimeters, ounces, or any other unit of measurement. As long as the data conforms to the specified data type for the column, then
D s product
 can work with it. 

However, this flexibility can present issues for users of the dataset. If data is not clearly labeled and converted to a standardized set of units, its users are forced to make assumptions about the data, which can lead to misuse of it. 

Info

Tip: The meaning of some units of measure can change over time. For example, a US Dollar in 2010 does not have the same value as a dollar in 2015. When you standardize shifting units of measure, you should account for any time-based differences, if possible.

Example - Fixed conversion factors

In many cases, units can be converted to other units by applying a fixed conversion factor to a column of data. For example, your dataset has the following three columns of measured data:

PersonHeight_ftWeight_kgArm_Length_in
Jack5'10"92 kg32
Jill5'2"56 kg29
Joe6'3"101 kg35

The above data has the following issues:

  1. The Weight and Height columns contain unit identifiers, which forces the values to be treated as strings.
  2. Metric data (kg) is mixed with English unit data (ft and in).
  3. The Height data is non-numeric.

Problem 1 - remove units

The Weight_kg column contains a unit identifier. On import, these values are treated as strings, which limits their use for analysis.

Steps:

  1. In the data grid, select an instance of " kg". Note that the space should be selected, too.
  2. Among the suggestion cards, select the Replace card. 
  3. It should automatically choose to replace with nothing, effectively deleting the content. To check, click Modify
  4. The transformation should look like the following:

    D trans
    p03Value''
    Typestep
    p01NameColumn
    p01ValueWeight_kg
    p02NameFind
    p02Value' kg'
    p03NameReplace with
    p04Valuetrue
    p04NameMatch all occurrences
    SearchTermReplace text or patterns

  5. Add it to your recipe.
  6. Verify that the column's data type has been changed to Integer or Decimal, depending on the values in it.

Problem 2 - convert English to metric units

To normalize to English units, the first issue is easily corrected by multiplying the Weight values by 2.2, since 1 kg = 2.2 lb:

D trans
Typestep
p01NameColumns
p01ValueWeight_kg
p02NameFormula
p02Value(Weight_kg * 2.2)
SearchTermEdit column with formula

If you want to round the value to the nearest integer, use the following:

D trans
Typestep
p01NameColumns
p01ValueWeight_kg
p02NameFormula
p02ValueROUND((Weight_kg * 2.2))
SearchTermEdit column with formula

After the above is added to the recipe, you should rename the column: Weight_lbs.

Problem 3 - convert ft/in to in

The final issue involves converting the Height_ft values to a single value for inches, so that these values can be used consistently with the other columns in the dataset.

On import, your data for the column might actually look like the following:

Height_ft
"5'10""
"5'2""
"6'3""

 

Steps:

  1. Select the first quote mark in one of the entries. 
  2. In the suggestion cards, select the Replace card.
  3. Select the variant that deletes all quotes in the column. 

  4. The full command should look like the following:

    D trans
    p03Value''
    Typestep
    p01NameColumn
    p01ValueHeight_ft
    p02NameFind
    p02Value`"`
    p03NameReplace with
    p04Valuetrue
    p04NameMatch all occurrences
    SearchTermReplace text or patterns

  5. Add it to your recipe.
  6. The remaining steps compute the number of inches. Multiply the feet by 12, and then add the number of inches, using new columns of data.
  7. Select the single quote mark, and choose the Split suggestion card. This transformation step should split the column into two columns: Height_ft1 and Height_ft2
  8. Derive the value in inches:

    D trans
    p03ValueHeight_in
    Typestep
    p01NameFormula type
    p01ValueSingle row formula
    p02NameFormula
    p02Value((Height_ft1 *12)+Height_ft2)
    p03NameNew column name
    SearchTermNew formula

  9. You can delete the other, interim columns.

Dynamic conversion factors

In some cases, the conversion rate between two different units of measures is dynamic. A common example involves mismatches between currency. For example, one dataset can be using U.S. dollars while another represents values in Euros.

Within a column

If you have inconsistent units within a column, it might be possible to correct these values by applying a multiple. For example, you might be able to determine that some values are in kilometers, instead of meters, based on their much smaller values. Multiplying the kilometer values by 1000 should standardize your units. The following multiplies all values in the column Distance that are less than 1000 by 1000. 

D trans
Typestep
p01NameColumns
p01ValueDistance
p02NameFormula
p02ValueIF((Distance < 1000,(Distance * 1000), Distance)
SearchTermEdit column with formula

Note the implied assumption that there are no distances in kilometers that are over 1000. 

Info

NOTE: Inconsistency in units within a column indicates a problem in either the source data or how the column data was modified after import. Where possible, you should try to fix these issues in the source data first, as they can introduce problems when the data is used.

Adjust level of precision

For numeric values that are used for measurement, you can adjust the level of precision within and across columns of values. For example, you have the following columns of data:

NameWidth_cmHeight_cm
Object 123.355.5512
Object 265.2102.4024
Object 354.212.22

In the above, you can see the following precision mismatches:

  • The Height column contains one value with only two digits of arithmetic precision in measurement.
  • The Width column uses two digits of arithmetic precision, while the Height column contains more digits of precision.

Where precision in measurement is important, you should consider rounding to the lowest level of precision. In this case, within the Height column, that level is to two significant digits after the decimal point (e.g. 12.22). However, across all of the columns of the dataset, the level of precision is to one significant digit after the decimal point, as the Width values are all restricted to this level of precision. While you could choose to round off to four digits across all columns, the extra values of 0 do not accurately reflect measurement and are therefore misleading.

You can use the following transformations to perform rounding functions within these columns:

D trans
Typestep
p01NameColumns
p01ValueWidth_cm
p02NameFormula
p02ValueNUMFORMAT(Width_cm '#.#')
SearchTermEdit column with formula

D trans
Typestep
p01NameColumns
p01ValueHeight_cm
p02NameFormula
p02ValueNUMFORMAT(Height_cm '#.#')
SearchTermEdit column with formula

Info

NOTE: The above assumes that the number of significant digits remains fixed in the source data. If this varies over times or uses in your recipe, you might need to revisit these specific transformation steps.

Info

NOTE: The above formatting option drops the zero for values like 4.0. As an alternative, you can use a format of '#.0', which always inserts a zero, even in cases where the zero is not present.

Results:

NameWidth_cmHeight_cm
Object 123.355.5
Object 265.2102.4
Object 354.212.2

Adjust data granularity by aggregation

For data hierarchies, you can use aggregations to adjust the granularity of your data to the appropriate grouping level. For example, you want to join a dataset that is organized by individual products with a dataset that is organized by brand. In most cases, you should aggregate the product-level data in the first dataset to the brand level.

Info

NOTE: When aggregation is applied, a new table of data is generated with the columns that you specifically select for inclusion.

For more information, see Pivot Data.