Excerpt 

Depending on your downstream system, you may need to convert your data into numeric values of the expected form or to standardize the distribution of numeric values. This section summarizes some common statistical transformations that can be applied to columnar data to prepare it for use in downstream analytic systems. 
You can scale the values within a column using either of the following techniques.
Zero mean and unit variance scaling renders the values in the set to fit a normal distribution with a mean of 0
and a variance of 1
. This technique is a common standard for normalizing values into a normal distribution for statistical purposes.
In the following example, the values in the POS_Sales
column have been normalized to average 0
, variance 1
.
Remove mean: When selected, the existing mean (average) of the values is used as the center of the distribution curve.
Info 

NOTE: Recentering sparse data by removing the mean may remove sparseness. 
Scale to unit variance: When selected, the range of values are scaled such that their variance is 1
. When deselected, the existing variance is maintained.
Info 

NOTE: Scaling to unit variance may not work well for managing outliers. Some additional techniques for managing outliers are outlined below. 
D trans 

p03Value  false 

p06Name  New column name 

p01Name  Column 

p06Value  scale_POS_Sales 

p03Name  Remove mean 

p04Value  true 

SearchTerm  Scale column 

Type  step 

p05Name  Output options 

p01Value  POS_Sales 

p02Name  Scaling method 

p02Value  Scale to zero mean and unit variance 

p05Value  Create new column 

p04Name  Scale to unit variance 


You can scale column values fitting between a specified minimum and maximum value. This technique is useful for distributions with very small standard deviation values and for preserving 0 values in sparse data.
The following example scales the TestScores
column to a range of 0
and 1
, inclusive.
D trans 

p03Value  0 

Type  step 

p05Name  Output options 

p01Name  Column 

p01Value  TestScores 

p02Name  Scaling method 

p02Value  Scale to a given minmax range 

p05Value  Replace current column 

p03Name  Minimum 

p04Value  1 

p04Name  Maximum 

SearchTerm  Scale column 


You can use several techniques for identifying statistical outliers in your dataset and managing them as needed.
Suppose you need to remove the outliers from a column. Assuming a normal bell distribution of values, you can use the following formula to calculate the number of standard deviations a column value is from the column mean (average). In this case, the source column is POS_Sales
.
D trans 

p03Value  stdevs_POS_Sales 

Type  step 

p01Name  Formula type 

p01Value  Multiple row formula 

p02Name  Formula 

p02Value  (ABS(POS_Sales  AVERAGE(POS_Sales))) / STDEV(POS_Sales) 

p03Name  New column name 

SearchTerm  New formula 


The new stdevs_POS_Sales
column now contains the number of standard deviations from the mean for the corresponding value in POS_Sales
. You can use the following transformation to remove the rows that contain outlier values for this column.
Tip 

Tip: An easier way to select these outlier values is to select the range of values in the stdevs_POS_Sales column histogram. Then, select the suggestion to delete these rows. You may want to edit the actual formula before you add it to your recipe. 
In the following transformation, all rows that contain a value in POS_Sales
that is greater than four standard deviations from the mean are deleted:
D trans 

p03Value  4 <= stdevs_POS_Sales 

Type  step 

p01Name  Condition 

p01Value  Custom formula 

p02Name  Type of formula 

p02Value  Custom single 

p03Name  Condition 

p04Value  Delete matching rows 

p04Name  Action 

SearchTerm  Filter rows 


You can also remove the effects of outliers be setting their value to the mean (average), which preserves the data in other columns in the row.
D trans 

Type  step 

p01Name  Columns 

p01Value  POS_Sales 

p02Name  Formula 

p02Value  IF(stdevs_POS_Sales > 4, AVERAGE(POS_Sales), POS_Sales) 

SearchTerm  Edit with formula 


You can modify your data to fit into bins of equal or custom size. For example, the lowest values in your range would be marked in the 0
bin, with larger values being marked with larger bin numbers.
You can bin numeric values into bins of equal size. Suppose your column contains numeric values 01000
. You can bin values into equal ranges of 100
by creating 10
bins.
D trans 

p03Value  10 

Type  step 

p01Name  Column 

p01Value  MilleBornes 

p02Name  Select Option 

p02Value  Equal Sized Bins 

p03Name  Number of Bins 

p04Value  MilleBornesRating 

p04Name  New column name 

SearchTerm  Bin column 


You can also create custom bins. In the following example, the TestScores
column is binned into the following bins. In a later step, these bins are mapped to grades:
Bins  Bin Range  Bin Number  Grade 

59  059  0  F 
69  6069  1  D 
79  7079  2  C 
89  8089  3  B 
 90+  4  A 
(no value)    I 
First, you bin values into the bin numbers listed above:
D trans 

p03Value  59,69,79,89 

Type  step 

p05Name  New column name 

p01Name  Column 

p01Value  TestScores 

p02Name  Select option 

p02Value  Custom bin size 

p05Value  Grades 

p03Name  Bins 

p04Name  Default value 

SearchTerm  Bin column 


You can then use the following transformation to assign letters in the Grades
column:
D trans 

p03Value  'F' 

p06Name  Case  3 

p09Value  Grades_letters 

p01Name  Condition type 

p06Value  'B' 

p03Name  Case  0 

p07Value  'A' 

p04Value  'D' 

SearchTerm  Conditions 

p07Name  Case  4 

p09Name  New column name 

Type  step 

p05Name  Case  2 

p01Value  Case on single column 

p02Name  Column to evaluate 

p02Value  Grades 

p05Value  'C' 

p04Name  Case  1 

p08Value  'I' 

p08Name  Default value 


Onehot encoding refers to distributing the listed values in a column into individual columns. Within each row of each individual column is a 0
or a 1
, depending on whether the value represented by the column appears in the corresponding source column. The source column is untouched. This method of encoding allows for easier consumption of data in target systems.
Tip 

Tip: This transformation is particularly useful for columns containing a limited set of enumerated values. 
In the following example, the values in the BrandName
column are distributed into separate columns of binary values, with a maximum limit of 50
new columns.
Info 

NOTE: Be careful applying this to a column containing a wide variety of values, such as Decimal values. Your dataset can expand significantly in size. Use the max columns setting to constrain the upper limit on dataset expansion. 
D trans 

Type  step 

p01Name  Column 

p01Value  BrandName 

p02Name  Max number of columns to create 

p02Value  50 

SearchTerm  Onehot encoding of values to columns 


Tip 

Tip: If needed, you can rename the columns to prepend the names with a reference to the source column. 
D s also 

inCQL  true 

label  (label = "onehot") 

