Exported From Confluence, Fri 27 May 2022

# Normalize Numeric Values

This section describes techniques to normalize numeric values in your datasets. Ideally, your source systems are configured to capture and deliver data using a consistent set of units in a standardized structure and format. In practice, data from multiple systems can reveal differences in the level of precision used in numeric data or differences in text entries that reference the same thing. Within Trifacta® Self-Managed Enterprise Edition, you can use the following techniques to address some of the issues you might encounter in the standardization of units and values for numeric types.

## Numeric precision

In Trifacta Self-Managed Enterprise Edition, mathematical computations are performed using 64-bit floating-point operations to 15 decimal digits of precision. However, due to rounding, truncation, and other technical factors, small discrepancies in outputs can be expected. Example:


Suppose you apply the following transformation:

| Parameter | Value |
|---|---|
| Transformation Name | `New formula` |
| Formula type | `Single row formula` |
| Formula | `(-636074.22 + -2465086.34)` |
| New column name | `MySum` |

The expected output in the MySum column: -3101160.56

The actual output in the MySum column: -3101160.5599999996

NOTE: For 64-bit floating-point mathematical operations, deviations like the above are intrinsic to the Decimal data type and how the platform performs computations.

Depending on your precision requirements, you can manage precision across your columns using a transformation like the following, which rounds off MySum to three digits:

| Parameter | Value |
|---|---|
| Transformation Name | `Edit column with formula` |
| Columns | `MySum` |
| Formula | `ROUND($col, 3)` |
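The behavior above can be reproduced with ordinary IEEE-754 double-precision arithmetic, which is what any 64-bit floating-point engine uses. This is a minimal Python sketch, not the platform's own code:

```python
# Double-precision addition: the stored sum differs from the exact
# decimal result in the low-order bits.
total = -636074.22 + -2465086.34
print(total)  # close to, but not exactly, -3101160.56

# Analog of the ROUND($col, 3) transformation: rounding off to three
# digits masks the discrepancy.
print(round(total, 3))  # -3101160.56
```

Any language using 64-bit floats (Java, C, SQL engines) shows the same effect, so the fix belongs in the recipe, not in the source system.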

## Standardize decimal precision

If decimal values in a column are of varying levels of precision, you can standardize to a single level of precision.

Steps:

1. From the column menu, select Column Details.
2. In the Column Details panel, select the Patterns tab.
3. Among the patterns, select the following:

   ```
   {digit}.{digit}
   ```

4. In the Suggestions panel on the right, locate the Edit Column transformation suggestion that uses the `ROUND` function. Click Edit.
5. Change the second parameter of the `ROUND` function to match the desired number of digits of precision.

You can generalize this formatting across multiple columns by applying the `$col` reference in the transformation's function, as in the following:

| Parameter | Value |
|---|---|
| Transformation Name | `Edit column with formula` |
| Columns | `colA, colB, colC` |
| Formula | `IFVALID($col, ['Float'], ROUND($col, 2))` |
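The `IFVALID` + `ROUND` pattern above can be sketched in plain Python (a sketch of the idea, not the product's engine): round a cell only when it parses as a number, and pass anything else through unchanged.

```python
# Round a value only when it is a valid float; otherwise leave it as-is.
# Mirrors IFVALID($col, ['Float'], ROUND($col, 2)).
def round_if_float(value, digits=2):
    try:
        return round(float(value), digits)
    except (TypeError, ValueError):
        return value

# A hypothetical row with mixed valid and invalid cells.
row = {"colA": "3.14159", "colB": "n/a", "colC": 2.71828}
cleaned = {col: round_if_float(v) for col, v in row.items()}
print(cleaned)  # {'colA': 3.14, 'colB': 'n/a', 'colC': 2.72}
```

The guard matters: applying a bare round to a column with non-numeric entries would fail or silently null them out, which is exactly what the `IFVALID` wrapper prevents.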

## Standardize units

Tip: Each column that contains numeric values should have an identified unit of measurement. Ideally, this information is embedded in the name of the column. If the unit of measurement is not included, it can be difficult to properly interpret the data.

Trifacta Self-Managed Enterprise Edition does not impose any units on imported data. For example, a column of values in floating point format could represent centimeters, ounces, or any other unit of measurement. As long as the data conforms to the specified data type for the column, Trifacta Self-Managed Enterprise Edition can work with it.

However, this flexibility can present issues for users of the dataset. If data is not clearly labeled and converted to a standardized set of units, its users are forced to make assumptions about the data, which can lead to misuse.

Tip: The meaning of some units of measure can change over time. For example, a U.S. dollar in 2010 does not have the same value as a dollar in 2015. When you standardize shifting units of measure, you should account for any time-based differences, if possible.

### Example - Fixed conversion factors

In many cases, units can be converted to other units by applying a fixed conversion factor to a column of data. For example, your dataset has the following three columns of measured data:

| Person | Height_ft | Weight_kg | Arm_Length_in |
|---|---|---|---|
| Jack | `5'10"` | 92 kg | 32 |
| Jill | `5'2"` | 56 kg | 29 |
| Joe | `6'3"` | 101 kg | 35 |

The above data has the following issues:

1. The Weight and Height columns contain unit identifiers, which forces the values to be treated as strings.
2. Metric data (kg) is mixed with English unit data (ft and in).
3. The Height data is non-numeric.

#### Problem 1 - remove units

The `Weight_kg` column contains a unit identifier. On import, these values are treated as strings, which limits their use for analysis.

Steps:

1. In the data grid, select an instance of " kg". Note that the space should be selected, too.
2. Among the suggestion cards, select the Replace card.
3. The card should automatically choose to replace the selection with nothing, effectively deleting the content. To check, click Modify.
4. The transformation should look like the following:

   | Parameter | Value |
   |---|---|
   | Transformation Name | `Replace text or patterns` |
   | Column | `Weight_kg` |
   | Find | `' kg'` |
   | Replace with | `''` |
   | Match all occurrences | `true` |

5. Verify that the column's data type has been changed to Integer or Decimal, depending on the values in it.
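The strip-then-retype sequence above can be sketched in Python (an illustration of the idea, not the product's implementation): remove the trailing " kg" marker, then convert the remaining text to a number.

```python
# Sample values as they appear in the imported Weight_kg column.
weights = ["92 kg", "56 kg", "101 kg"]

# Analog of the Replace transformation plus the type change:
# strip the unit marker, then parse what remains as an integer.
numeric = [int(w.replace(" kg", "")) for w in weights]
print(numeric)  # [92, 56, 101]
```

Once the unit text is gone, the values type-check as numbers and become usable in arithmetic, which is the whole point of the Replace step.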
#### Problem 2 - convert metric to English units

To normalize to English units, the first issue is easily corrected by multiplying the Weight values by 2.2, since 1 kg ≈ 2.2 lb:

| Parameter | Value |
|---|---|
| Transformation Name | `Edit column with formula` |
| Columns | `Weight_kg` |
| Formula | `(Weight_kg * 2.2)` |

If you want to round the value to the nearest integer, use the following:

| Parameter | Value |
|---|---|
| Transformation Name | `Edit column with formula` |
| Columns | `Weight_kg` |
| Formula | `ROUND((Weight_kg * 2.2))` |

After the above is added to the recipe, you should rename the column: Weight_lbs.

#### Problem 3 - convert ft/in to in

The final issue involves converting the Height_ft values to a single value in inches, so that these values can be used consistently with the other columns in the dataset.

On import, your data for the column might actually look like the following:

| Height_ft |
|---|
| `"5'10""` |
| `"5'2""` |
| `"6'3""` |

Steps:

1. Select the first quote mark in one of the entries.
2. In the suggestion cards, select the Replace card.
3. Select the variant that deletes all quotes in the column.
4. The full command should look like the following:

   | Parameter | Value |
   |---|---|
   | Transformation Name | `Replace text or patterns` |
   | Column | `Height_ft` |
   | Find | `` `"` `` |
   | Replace with | `''` |
   | Match all occurrences | `true` |

5. Add it to your recipe.

The remaining steps compute the number of inches: multiply the feet by 12, and then add the number of inches, using new columns of data.

1. Select the single quote mark, and choose the Split suggestion card. This transformation step should split the column into two columns: Height_ft1 and Height_ft2.
2. Derive the value in inches:

   | Parameter | Value |
   |---|---|
   | Transformation Name | `New formula` |
   | Formula type | `Single row formula` |
   | Formula | `((Height_ft1 * 12) + Height_ft2)` |
   | New column name | `Height_in` |

3. You can delete the other, interim columns.

### Dynamic conversion factors

In some cases, the conversion rate between two different units of measure is dynamic. A common example involves mismatches between currencies. For example, one dataset can use U.S. dollars while another represents values in Euros.

### Within a column

If you have inconsistent units within a column, it might be possible to correct these values by applying a multiple. For example, you might be able to determine that some values are in kilometers, instead of meters, based on their much smaller values. Multiplying the kilometer values by 1000 should standardize your units. The following multiplies all values in the column Distance that are less than 1000 by 1000:

| Parameter | Value |
|---|---|
| Transformation Name | `Edit column with formula` |
| Columns | `Distance` |
| Formula | `IF((Distance < 1000), (Distance * 1000), Distance)` |

Note the implied assumption that there are no distances in kilometers that are over 1000.

NOTE: Inconsistency in units within a column indicates a problem in either the source data or in how the column data was modified after import. Where possible, you should try to fix these issues in the source data first, as they can introduce problems when the data is used.

### Adjust level of precision

For numeric values that are used for measurement, you can adjust the level of precision within and across columns of values.
For example, you have the following columns of data:

| Name | Width_cm | Height_cm |
|---|---|---|
| Object 1 | 23.3 | 55.5512 |
| Object 2 | 65.2 | 102.4024 |
| Object 3 | 54.2 | 12.22 |

In the above, you can see the following precision mismatches:

- The Height column contains one value with only two digits of arithmetic precision in measurement.
- The Width column uses one digit of arithmetic precision after the decimal point, while the Height column contains more digits of precision.

Where precision in measurement is important, you should consider rounding to the lowest level of precision. In this case, within the Height column, that level is two significant digits after the decimal point (e.g. 12.22). However, across all of the columns of the dataset, the level of precision is one significant digit after the decimal point, as the Width values are all restricted to this level of precision. While you could choose to round off to four digits across all columns, the extra values of 0 do not accurately reflect measurement and are therefore misleading.

You can use the following transformations to perform rounding within these columns:

| Parameter | Value |
|---|---|
| Transformation Name | `Edit column with formula` |
| Columns | `Width_cm` |
| Formula | `NUMFORMAT(Width_cm, '#.#')` |

| Parameter | Value |
|---|---|
| Transformation Name | `Edit column with formula` |
| Columns | `Height_cm` |
| Formula | `NUMFORMAT(Height_cm, '#.#')` |

NOTE: The above assumes that the number of significant digits remains fixed in the source data. If this varies over time or across uses in your recipe, you might need to revisit these specific transformation steps.

NOTE: The above formatting option drops the zero for values like 4.0. As an alternative, you can use a format of `'#.0'`, which always inserts a zero, even in cases where the zero is not present.

Results:

| Name | Width_cm | Height_cm |
|---|---|---|
| Object 1 | 23.3 | 55.5 |
| Object 2 | 65.2 | 102.4 |
| Object 3 | 54.2 | 12.2 |
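The trimming shown in the results above can be sketched in Python. Note that in the results table the extra digits are dropped rather than rounded (55.5512 becomes 55.5, not 55.6); a floor-style truncation helper (hypothetical, not `NUMFORMAT` itself, whose exact rounding behavior should be confirmed in the product documentation) reproduces that:

```python
import math

# Trim a measurement to a fixed number of decimal digits by dropping
# (not rounding) the extra digits, matching the results table above.
def trim(x, digits=1):
    factor = 10 ** digits
    return math.floor(x * factor) / factor

heights = [55.5512, 102.4024, 12.22]
print([trim(h) for h in heights])  # [55.5, 102.4, 12.2]
```

If you instead want round-to-nearest behavior, Python's `round(x, 1)` would give 55.6 for the first value, so be explicit about which semantics your recipe requires.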

### Adjust data granularity by aggregation

For data hierarchies, you can use aggregations to adjust the granularity of your data to the appropriate grouping level. For example, you want to join a dataset that is organized by individual products with a dataset that is organized by brand. In most cases, you should aggregate the product-level data in the first dataset to the brand level.

NOTE: When aggregation is applied, a new table of data is generated with the columns that you specifically select for inclusion.
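As an illustration, rolling product-level rows up to the brand level before a join can be sketched in Python; the brand, product, and sales names here are invented for the example:

```python
from collections import defaultdict

# Hypothetical product-level rows.
products = [
    {"brand": "Acme", "product": "Widget", "sales": 120},
    {"brand": "Acme", "product": "Gadget", "sales": 80},
    {"brand": "Globex", "product": "Gizmo", "sales": 200},
]

# Aggregate to the brand level: the output keeps only the grouping
# column and the aggregated value, mirroring how the generated table
# contains only the columns selected for inclusion.
brand_sales = defaultdict(int)
for row in products:
    brand_sales[row["brand"]] += row["sales"]

print(dict(brand_sales))  # {'Acme': 200, 'Globex': 200}
```

After this step, the brand-level result joins cleanly against a dataset that is already organized by brand, with one row per key on each side.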