Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r089

D toc

Depending on your downstream system, you may need to convert your data into numeric values of the expected form or to standardize the distribution of numeric values. This section summarizes some common statistical transformations that can be applied to columnar data to prepare it for use in downstream analytic systems.

Scaling

You can scale the values within a column using either of the following techniques.

Scale to zero mean and unit variance

Zero mean and unit variance scaling renders the values in the set to fit a normal distribution with a mean of 0 and a variance of 1. This technique is a common standard for normalizing values into a normal distribution for statistical purposes.

In the following example, the values in the POS_Sales column have been normalized to average 0, variance 1

  • Remove mean: When selected, the existing mean (average) of the values is used as the center of the distribution curve.

    Info

    NOTE: Re-centering sparse data by removing the mean may remove sparseness.

  • Scale to unit variance: When selected, the range of values are scaled such that their variance is 1. When deselected, the existing variance is maintained.

    Info

    NOTE: Scaling to unit variance may not work well for managing outliers. Some additional techniques for managing outliers are outlined below.

D trans
p03Valuefalse
p06NameNew column name
p01NameColumn
p06Valuescale_POS_Sales
p03NameRemove mean
p04Valuetrue
SearchTermScale column
Typestep
p05NameOutput options
p01ValuePOS_Sales
p02NameScaling method
p02ValueScale to zero mean and unit variance
p05ValueCreate new column
p04NameScale to unit variance

Scale to min-max range

You can scale column values fitting between a specified minimum and maximum value. This technique is useful for distributions with very small standard deviation values and for preserving 0 values in sparse data.

The following example scales the TestScores column to a range of 0 and 1, inclusive.

D trans
p03Value0
Typestep
p05NameOutput options
p01NameColumn
p01ValueTestScores
p02NameScaling method
p02ValueScale to a given min-max range
p05ValueReplace current column
p03NameMinimum
p04Value1
p04NameMaximum
SearchTermScale column

Outliers

You can use several techniques for identifying statistical outliers in your dataset and managing them as needed.

Identify outliers

Suppose you need to remove the outliers from a column. Assuming a normal bell distribution of values, you can use the following formula to calculate the number of standard deviations a column value is from the column mean (average). In this case, the source column is POS_Sales.

D trans
p03Valuestdevs_POS_Sales
Typestep
p01NameFormula type
p01ValueMultiple row formula
p02NameFormula
p02Value(ABS(POS_Sales - AVERAGE(POS_Sales))) / STDEV(POS_Sales)
p03NameNew column name
SearchTermNew formula

Remove outliers

The new stdevs_POS_Sales column now contains the number of standard deviations from the mean for the corresponding value in POS_Sales. You can use the following transformation to remove the rows that contain outlier values for this column.

Tip

Tip: An easier way to select these outlier values is to select the range of values in the stdevs_POS_Sales column histogram. Then, select the suggestion to delete these rows. You may want to edit the actual formula before you add it to your recipe.

In the following transformation, all rows that contain a value in POS_Sales that is greater than four standard deviations from the mean are deleted:

D trans
p03Value4 <= stdevs_POS_Sales
Typestep
p01NameCondition
p01ValueCustom formula
p02NameType of formula
p02ValueCustom single
p03NameCondition
p04ValueDelete matching rows
p04NameAction
SearchTermFilter rows

Change outliers to mean values

You can also remove the effects of outliers be setting their value to the mean (average), which preserves the data in other columns in the row. 

D trans
Typestep
p01NameColumns
p01ValuePOS_Sales
p02NameFormula
p02ValueIF(stdevs_POS_Sales > 4, AVERAGE(POS_Sales), POS_Sales)
SearchTermEdit with formula

Binning

You can modify your data to fit into bins of equal or custom size. For example, the lowest values in your range would be marked in the 0 bin, with larger values being marked with larger bin numbers.

Bins of equal size

You can bin numeric values into bins of equal size. Suppose your column contains numeric values 0-1000. You can bin values into equal ranges of 100 by creating 10 bins.

D trans
p03Value10
Typestep
p01NameColumn
p01ValueMilleBornes
p02NameSelect Option
p02ValueEqual Sized Bins
p03NameNumber of Bins
p04ValueMilleBornesRating
p04NameNew column name
SearchTermBin column

Bins of custom size

You can also create custom bins. In the following example, the TestScores column is binned into the following bins. In a later step, these bins are mapped to grades:

BinsBin RangeBin NumberGrade
590-590F
6960-691D
7970-792C
8980-893B
 90+4A
(no value)  I

First, you bin values into the bin numbers listed above: 

D trans
p03Value59,69,79,89
Typestep
p05NameNew column name
p01NameColumn
p01ValueTestScores
p02NameSelect option
p02ValueCustom bin size
p05ValueGrades
p03NameBins
p04NameDefault value
SearchTermBin column

You can then use the following transformation to assign letters in the Grades column:

D trans
p03Value'F'
p06NameCase - 3
p09ValueGrades_letters
p01NameCondition type
p06Value'B'
p03NameCase - 0
p07Value'A'
p04Value'D'
SearchTermConditions
p07NameCase - 4
p09NameNew column name
Typestep
p05NameCase - 2
p01ValueCase on single column
p02NameColumn to evaluate
p02ValueGrades
p05Value'C'
p04NameCase - 1
p08Value'I'
p08NameDefault value

One-Hot Encoding

One-hot encoding refers to distributing the listed values in a column into individual columns. Within each row of each individual column is a 0 or a 1, depending on whether the value represented by the column appears in the corresponding source column. The source column is untouched. This method of encoding allows for easier consumption of data in target systems.

Tip

Tip: This transformation is particularly useful for columns containing a limited set of enumerated values.

In the following example, the values in the BrandName column are distributed into separate columns of binary values, with a maximum limit of 50 new columns.

Info

NOTE: Be careful applying this to a column containing a wide variety of values, such as Decimal values. Your dataset can expand significantly in size. Use the max columns setting to constrain the upper limit on dataset expansion.

D trans
Typestep
p01NameColumn
p01ValueBrandName
p02NameMax number of columns to create
p02Value50
SearchTermOne-hot encoding of values to columns

Tip

Tip: If needed, you can prepend the names of the resulting columns with a reference to the source column. See Rename Columns.