Excerpt |
---|
This section describes techniques to normalize numeric values in your datasets. |
...
Ideally, your source systems are configured to capture and deliver data using a consistent set of units in a standardized structure and format. In practice, data from multiple systems can illuminate differences in the level of precision used in numeric data or differences in text entries that reference the same thing. Within
, you can use the following techniques to address some of the issues you might encounter in the standardization of units and values for numeric types
.Numeric precision
In
, mathematical computations are performed using 64-bit floating point operations to 15 decimals of precision. However, due to rounding off, truncation, and other technical factors, small discrepancies in outputs can be expected. Example:Suppose you apply the following transformation:
D trans |
---|
RawWrangle | true |
---|
p03Value | MySum |
---|
Type | step |
---|
WrangleText | derive type: single value: (-636074.22 + -2465086.34) as: 'MySum' |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | (-636074.22 + -2465086.34) |
---|
p03Name | New column name |
---|
SearchTerm | New formula |
---|
|
The expected output in the MySum column: -3101160.56
The actual output for in the MySum column: -3101160.5599999996
Info |
---|
NOTE: For 64-bit floating point mathematical operations, deviations like the above are intrinsic to the Decimal data type and how the platform performs computations. |
Depending on your precision requirements, you can manage precision across your columns using a transformation like the following, which rounds off MySum to three digits:
D trans |
---|
RawWrangle | true |
---|
Type | step |
---|
WrangleText | set col: WM_Week value: round($col, 3) |
---|
p01Name | Columns |
---|
p01Value | MySum |
---|
p02Name | Formula |
---|
p02Value | ROUND($col,3) |
---|
SearchTerm | Edit column with formula |
---|
|
For more information on floating point computations, see https://en.wikipedia.org/wiki/Numeric_precision_in_Microsoft_Excel.
Standardize decimal precision
...
- From the column menu, select Column Details.
.In the Column Details panel, select the Patterns tab.Among the patterns, select the following:
Code Block |
---|
{digit}.{digit} |
- In the Suggestions panel on the right, locate the
set
transform suggestion Edit Column transformation suggestion that uses the ROUND
function. Click Edit. - Change the second parameter of the
ROUND
function to match the number of digits of precision.
You can generalize this formatting across multiple columns by applying the $col
reference in the transformtransformation's functions function, as in the following:
codeset col:trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | colA, colB, colC |
---|
|
value: | p02Name | Formula |
---|
p02Value | IFVALID($col, ['Float'], ROUND($col, 2)) |
---|
|
See Column Details Panel. For more information, see ROUND Function.SearchTerm | Edit column with formula |
---|
|
Standardize units
Tip |
---|
Tip: Each column that contains numeric values should have an identified unit of measurement. Ideally, this information is embedded in the name of the column data. If the unit of measurement is not included, it can be difficult to properly interpret the data. |
...
- In the data grid, select an instance of " kg". Note that the space should be selected, too.
- Among the suggestion cards, select the Replace card.
- It should automatically choose to replace with nothing, effectively deleting the content. To check, click Modify.
The transform transformation should look like the following:
code |
replace col: Weight_kg on: ` kg` with: '' global: truetrans |
---|
p03Value | '' |
---|
Type | step |
---|
p01Name | Column |
---|
p01Value | Weight_kg |
---|
p02Name | Find |
---|
p02Value | ' kg' |
---|
p03Name | Replace with |
---|
p04Value | true |
---|
p04Name | Match all occurrences |
---|
SearchTerm | Replace text or patterns |
---|
|
- Add it to your recipe.
- Verify that the column's data type has been changed to
Integer
or Decimal
, depending on the values in it.
...
To normalize to English units, the first issue is easily corrected by multiplying the Weight values by 2.2, since 1 kg = 2.2 lb:
...
...
trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | Weight_kg |
---|
p02Name | Formula |
---|
p02Value | (Weight_kg * 2.2) |
---|
SearchTerm | Edit column with formula |
---|
|
If you want to round the value to the nearest integer, use the following:
...
...
trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | Weight_kg |
---|
p02Name | Formula |
---|
p02Value | ROUND((Weight_kg * 2.2)) |
---|
SearchTerm | Edit column with formula |
---|
|
After the above is added to the recipe, you should rename the column: Weight_lbs
.
...
- Select the first quote mark in one of the entries.
- In the suggestion cards, select the Replace card.
Select the variant that deletes all quotes in the column.
The full command should look like the following:
code |
replace col: Height_ft on: `"` with: '' global: true |
trans |
---|
p03Value | '' |
---|
Type | step |
---|
p01Name | Column |
---|
p01Value | Height_ft |
---|
p02Name | Find |
---|
p02Value | `"` |
---|
p03Name | Replace with |
---|
p04Value | true |
---|
p04Name | Match all occurrences |
---|
SearchTerm | Replace text or patterns |
---|
|
- Add it to your recipe.
- The remaining steps compute the number of inches. Multiply the feet by 12, and then add the number of inches, using new columns of data.
- Select the single quote mark, and choose the Split suggestion card. This transform transformation step should split the column into two columns:
Height_ft1
and Height_ft2
. Derive the value in inches:
code |
derive type:single value:You can drop trans |
---|
p03Value | Height_in |
---|
Type | step |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | ((Height_ft1 *12)+Height_ft2) |
---|
|
- Add it to your recipe.
- Rename the new column:
Height_in
. p03Name | New column name |
---|
SearchTerm | New formula |
---|
|
- You can delete the other, interim columns.
...
If you have inconsistent units within a column, it might be possible to correct these values by applying a multiple. For example, you might be able to determine that some values are in kilometers, instead of meters, based on their much smaller values. Multiplying the kilometer values by 1000 should standardize your units. The following multiplies all values in the column Distance
that are less than 1000 by 1000.
...
...
trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | Distance |
---|
p02Name | Formula |
---|
p02Value | IF((Distance < 1000 |
---|
|
...
...
, Distance) | SearchTerm | Edit column with formula |
---|
|
Note the implied assumption that there are no distances in kilometers that are over 1000.
...
You can use the following transforms transformations to perform rounding functions within these columns:
...
...
trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | Width_cm |
---|
p02Name | Formula |
---|
p02Value | NUMFORMAT(Width_cm '#.#') |
---|
SearchTerm | Edit column with formula |
---|
|
...
...
trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | Height_cm |
---|
p02Name | Formula |
---|
p02Value | NUMFORMAT(Height_cm '#.#') |
---|
SearchTerm | Edit column with formula |
---|
|
Info |
---|
NOTE: The above assumes that the number of significant digits remains fixed in the source data. If this varies over times or uses of the transform in your recipe, you might need to revisit these specific transform transformation steps. |
Info |
---|
NOTE: The above formatting option drops the zero for values like 4.0 . As an alternative, you can use a format of '#.0' , which always inserts a zero, even in cases where the zero is not present. |
...
For more information, see Pivot Transform.Data.
D s also |
---|
inCQL | true |
---|
label | (label = "cleanse_tasks") |
---|
|