Page tree

 

Support | BlogContact Us | 844.332.2821

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0411

D toc

This section describes techniques to normalize numeric and text values in your datasets. Ideally, your source systems are configured to capture and deliver data using a consistent set of units in a standardized structure and format. In practice, data from multiple systems can illuminate differences in the level of precision used in numeric data or differences in text entries that reference the same thing. Within the

d-s-
webapp
product
rtrue
, you can use the following techniques to address some of the issues you might encounter in the standardization of units and values and normalization of text values.

Trim whitespace

You can trim out whitespace from an individual column via transform. The TRIM function applied to string values removes the leading and trailing whitespace:

D code
showNotetrue

set col: myCol value: TRIM(myCol)

To apply this function across all columns in the dataset, you must use the replace transform instead:
D code
hideNotetrue

replace col:* on: `{start}{[ ]}+|{[ ]}+{end}` with:'' global:true

The above transform utilizes 
D s item
patterns
patterns
, which are a simpler, macro-based method of referencing regular expressions. See Text Matching.

...

However, this flexibility can present issues for users of the dataset. If data is not clearly labeled and converted to a standardized set of units, its users are forced to make assumptions about the data, which can lead to misuse of it. 

Tip

Tip: To assist users of the data, you should either build a legend of the columns of the data, including their units.

Info

Tip: The meaning of some units of measure can change over time. For example, a US Dollar in 2010 does not have the same value as a dollar in 2015. When you standardize shifting units of measure, you should account for any time-based differences, if possible.

Example - Fixed conversion factors

In many cases, units can be converted to other units by applying a fixed conversion factor to a column of data. For  For example, your dataset has the following three columns of measured data:

PersonHeight_ftWeight_kgArm_Length_in
Jack5'10"92 kg32
Jill5'2"56 kg29
Joe6'3"101 kg35

The above data has the following issues:

  1. The Weight and Height columns contain unit identifiers, which forces the
    D s webapp
     to treat the values as values to be treated as strings.
  2. Metric data (kg) is mixed with English unit data (ft and in).
  3. The Height data is non-numeric.

Problem 1 - remove units

The Weight_kg column contains a unit identifier. On import, these values are treated as strings, which limits their use for analysis.

...

  1. In the data grid, select an instance of " kg". Note that the space should be selected, too.
  2. Among the transform suggestion cards, select the the Replace  onecard
  3. It should automatically choose to replace with nothing, effectively deleting the content. To check, click Modify
  4. The transform should look like the following:

    D code
    hideNotetrue

    replace col: Weight_kg on: ` kg` with: '' global: true

  5. Add it to your recipe.
  6. Verify that the column's data type has been changed to Integer or Decimal, depending on the values in it.

Problem 2 - convert English to metric units

To normalize to English units, the first issue is easily corrected by multiplying the Weight values by 2.2, since 1 kg = 2.2 lb:

D code
hideNotetrue

set col:Weight_kg value:(Weight_kg * 2.2)

 

If you want to round the value to the nearest integer, use the following:

D code
hideNotetrue

set col:Weight_kg value:ROUND((Weight_kg * 2.2))

 

After the above is added to the recipe, you should rename the column: Weight_lbs.

...

  1. Select the first quote mark in one of the entries. 
  2. In the transform suggestion cards, click the select the Replace onecard.
  3. Select the variant that deletes all quotes in the column. 

  4. The full command should look like the following:

    D code
    hideNotetrue

    replace col: Height_ft on: `"` with: '' global: true

  5. Add it to your recipe.
  6. The remaining steps compute the number of inches. Multiply the feet by 12, and then add the number of inches, using new columns of data.
  7. Select the single quote mark, and choose the the Split transform suggestion card. This transform step should split the column into two columns: Height_ft1 and Height_ft2
  8. Derive the value in inches:

    D code
    hideNotetrue

    derive value:((Height_ft1 *12)+Height_ft2)

  9. Add it to your recipe.
  10. Rename the new column: Height_in.
  11. You can drop the other, interim columns.

...

In some cases, the conversion rate between two different units of measures is dynamic. A common example involves example involves mismatches between currency. For example, one dataset can be using U.S. dollars while another represents values in Euros. 

For more information, see 

D tri link
urlhttp://www.trifacta.com/support/articles/article/654724/
.

Within a column

If you have inconsistent units within a column, it might be possible to correct these values by applying a multiple. For example, you might be able to determine that some values are in kilometers, instead of meters, based on their much smaller values. Multiplying the kilometer values by 1000 should standardize your units. The following multiplies all values in the column Distance that are less than 1000 by 1000. 

D code
hideNotetrue

set col:Distance value:(Distance < 1000 ? (Distance * 1000) : Distance)

 

Note the implied assumption that there are no distances in meters kilometers that are over 1000. If there are such values, then you must identify the distances in meters based on data in other columns. 

Info

NOTE: Inconsistency in units within a column indicates a problem in either the source data or how the column data was modified after import within the

D s webapp
. Where possible, you should try to fix these issues in the source data first, as they can introduce problems when the data is used.

...

You can use the following transforms to perform rounding functions within these columns:

D code
hideNotetrue

set col:Width_cm value:NUMFORMAT(Width_cm '#.#')

D code
hideNotetrue

set col:Height_cm value:NUMFORMAT(Height_cm '#.#')

...

NameWidth_cmHeight_cm
Object 123.355.5
Object 265.2102.4
Object 354.212.2

Adjust data granularity by aggregation

For data hierarchies, you can use aggregations to adjust the granularity of your data to the appropriate grouping level. For example, you want to join a dataset that is organized by individual products with a dataset that is organized by brand. In most cases, you should aggregate the product-level data in the first dataset to the brand level. 

Info

NOTE: When aggregation is applied, a new table of data is generated with the columns that you specifically select for inclusion.

...