Date: Mon, 24 Jan 2022 14:07:39 +0000 (GMT) Message-ID: <1447363830.107252.1643033259847@9c5033e110b2> Subject: Exported From Confluence MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_Part_107251_2108781856.1643033259847" ------=_Part_107251_2108781856.1643033259847 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Content-Location: file:///C:/exported.html Prepare Data for Machine Processing

Prepare Data for Machine Processing

Contents:

=20

=20
=20
=20

=20
=20

Depending on your downstream system, you may need to convert your data i= nto numeric values of the expected form or to standardize the distribution = of numeric values. This section summarizes some common statistical transfor= mations that can be applied to columnar data to prepare it for use in downs= tream analytic systems.

Scaling

You can scale the values within a column using either of the following t= echniques.

S= cale to zero mean and unit variance

Zero mean and unit variance scaling renders the values = in the set to fit a normal distribution with a mean of 0 = and a variance of 1. This technique is a common standard = for normalizing values into a normal distribution for statistical purposes.=

In the following example, the values in the POS_Sales = column have been normalized to average 0, variance <= code>1.

• Remove mean: When selected, the existing mean (aver= age) of the values is used as the center of the distribution curve.

=

NOTE: Re-centering sparse data by removing the mean may= remove sparseness.

• Scale to unit variance: When selected, the range of= values are scaled such that their variance is 1. When de= selected, the existing variance is maintained.

NOTE: Scaling to unit variance may not work well for ma= naging outliers. Some additional techniques for managing outliers are outli= ned below.

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name = Scale column POS_Sales Scale to zero mean and unit variance false true Create new column scale_POS_Sales
=20

Scale to min= -max range

You can scale column values fitting between a specified minimum and maxi= mum value. This technique is useful for distributions with very small stand= ard deviation values and for preserving 0 values in sparse data.

The following example scales the TestScores colum= n to a range of 0 and 1, inclusive= .

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 = =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name = Scale column TestScores Scale to a given min-max range 0 1 Replace current column
=20

Outliers

You can use several techniques for identifying statistical outliers in y= our dataset and managing them as needed.

Identify outlie= rs

Suppose you need to remove the outliers from a column. Assuming a normal= bell distribution of values, you can use the following formula to calculat= e the number of standard deviations a column value is from the column mean = (average). In this case, the source column is POS_Sales.<= /p>

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name <= code>New formula Multiple row formula (ABS(POS_Sales - AVERAGE(POS_Sales))) / ST= DEV(POS_Sales) stdevs_POS_Sales
=20

Chang= e outliers to mean values

You can also remove the effects of outliers be setting their value to th= e mean (average), which preserves the data in other columns in the row.&nbs= p;

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name Edit with formula POS_Sales IF(stdevs_POS_Sales > 4, AVERAGE(POS_Sa= les), POS_Sales)
=20

Binning

You can modify your data to fit into bins of equal or custom size. For e= xample, the lowest values in your range would be marked in the 0= bin, with larger values being marked with larger bin numbers.

Bins of equal si= ze

You can bin numeric values into bins of equal size. Suppose your column = contains numeric values 0-1000. You can bin values into e= qual ranges of 100 by creating 10 bins.=

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name Bin column MilleBornes Equal Sized Bins 10 MilleBornesRating
=20

Bins of custom = size

You can also create custom bins. In the following example, the TestScores column is binned into the following bins. In a later st= ep, these bins are mapped to grades:

Bins Bin Range Bin Number Grade
59 0-59 0 F
69 60-69 1 D
79 70-79 2 C
89 80-89 3 B
90+ 4 A
(no value)     I

First, you bin values into the bin numbers listed above:

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name Bin column TestScores Custom bin size 59,69,79,89 Grades
=20

You can then use the following transformation to assign letters in the <= code>Grades column:

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name Conditions Case on single column Grades 'F' 'D' 'C' 'B' 'A' 'I' Grades_letters
=20

One-Hot Encoding=

One-hot encoding refers to distributing the listed= values in a column into individual columns. Within each row of each indivi= dual column is a 0 or a 1, dependi= ng on whether the value represented by the column appears in the correspond= ing source column. The source column is untouched. This method of encoding = allows for easier consumption of data in target systems.

Tip: This transformation is particularly useful for col= umns containing a limited set of enumerated values.

In the following example, the values in the BrandName&= nbsp;column are distributed into separate columns of binary values, with a = maximum limit of 50 new columns.

NOTE: Be careful applying this to a column containing a= wide variety of values, such as Decimal values. Your dataset can expand si= gnificantly in size. Use the max columns setting to constrain the upper lim= it on dataset expansion.

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 = =20 =20 =20
Transformation Name One-hot encoding of values to columns= BrandName 50
=20

Tip: If needed, you can prepend the names of the result= ing columns with a reference to the source column. See Rename Columns.

------=_Part_107251_2108781856.1643033259847--