Date: Fri, 29 Sep 2023 00:33:04 +0000 (UTC) Message-ID: <1360124853.109143.1695947584508@3978a672e405> Subject: Exported From Confluence MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_Part_109142_1240016446.1695947584508" ------=_Part_109142_1240016446.1695947584508 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Content-Location: file:///C:/exported.html Prepare Data for Machine Processing

# Prepare Data for Machine Processing

=20

Contents:

=20

=20
=20
=20

=20
=20

Depending on your downstream system, you may need to convert your da= ta into numeric values of the expected form or to standardize the distribut= ion of numeric values. This section summarizes some common statistical tran= sformations that can be applied to columnar data to prepare it for use in d= ownstream analytic systems.

## Scaling

You can scale the values within a column using either of the following t= echniques.

### S= cale to zero mean and unit variance

Zero mean and unit variance scaling renders the values = in the set to fit a normal distribution with a mean of `0` = and a variance of `1`. This technique is a common standard = for normalizing values into a normal distribution for statistical purposes.=

In the following example, the values in the `POS_Sales` = column have been normalized to average `0`, variance <= code>1.

• Remove mean: When selected, the existing mean (aver= age) of the values is used as the center of the distribution curve.

=

NOTE: Re-centering sparse data by removing the mean may= remove sparseness.

• Scale to unit variance: When selected, the range of= values are scaled such that their variance is `1`. When de= selected, the existing variance is maintained.

NOTE: Scaling to unit variance may not work well for ma= naging outliers. Some additional techniques for managing outliers are outli= ned below.

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name = `Scale column` `POS_Sales` `Scale to zero mean and unit variance` `false` `true` `Create new column` `scale_POS_Sales`
=20

### Scale to min= -max range

You can scale column values fitting between a specified minimum and maxi= mum value. This technique is useful for distributions with very small stand= ard deviation values and for preserving 0 values in sparse data.

The following example scales the `TestScores` colum= n to a range of `0` and `1`, inclusive= .

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 = =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name = `Scale column` `TestScores` `Scale to a given min-max range` `0` `1` `Replace current column`
=20

## Outliers

You can use several techniques for identifying statistical outliers in y= our dataset and managing them as needed.

### Identify outlie= rs

Suppose you need to remove the outliers from a column. Assuming a normal= bell distribution of values, you can use the following formula to calculat= e the number of standard deviations a column value is from the column mean = (average). In this case, the source column is `POS_Sales`.<= /p>

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name <= code>New formula `Multiple row formula` ```(ABS(POS_Sales - AVERAGE(POS_Sales))) / ST= DEV(POS_Sales)``` `stdevs_POS_Sales`
=20

### Chang= e outliers to mean values

You can also remove the effects of outliers be setting their value to th= e mean (average), which preserves the data in other columns in the row.&nbs= p;

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name `Edit with formula` `POS_Sales` ```IF(stdevs_POS_Sales > 4, AVERAGE(POS_Sa= les), POS_Sales)```
=20

## Binning

You can modify your data to fit into bins of equal or custom size. For e= xample, the lowest values in your range would be marked in the ```0= ``` bin, with larger values being marked with larger bin numbers.

### Bins of equal si= ze

You can bin numeric values into bins of equal size. Suppose your column = contains numeric values `0-1000`. You can bin values into e= qual ranges of `100` by creating `10` bins.=

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name Bin column `MilleBornes` `Equal Sized Bins` `10` `MilleBornesRating`
=20

### Bins of custom = size

You can also create custom bins. In the following example, the TestScores column is binned into the following bins. In a later st= ep, these bins are mapped to grades:

Bins Bin Range Bin Number Grade
59 0-59 0 F
69 60-69 1 D
79 70-79 2 C
89 80-89 3 B
90+ 4 A
(no value)     I

First, you bin values into the bin numbers listed above:

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name Bin column `TestScores` `Custom bin size` `59,69,79,89` `Grades`
=20

You can then use the following transformation to assign letters in the <= code>Grades column:

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
Transformation Name Conditions `Case on single column` `Grades` `'F'` `'D'` `'C'` `'B'` `'A'` `'I'` `Grades_letters`
=20

## One-Hot Encoding=

One-hot encoding refers to distributing the listed= values in a column into individual columns. Within each row of each indivi= dual column is a `0` or a `1`, dependi= ng on whether the value represented by the column appears in the correspond= ing source column. The source column is untouched. This method of encoding = allows for easier consumption of data in target systems.

Tip: This transformation is particularly useful for col= umns containing a limited set of enumerated values.

In the following example, the values in the `BrandName`&= nbsp;column are distributed into separate columns of binary values, with a = maximum limit of `50` new columns.

NOTE: Be careful applying this to a column containing a= wide variety of values, such as Decimal values. Your dataset can expand si= gnificantly in size. Use the max columns setting to constrain the upper lim= it on dataset expansion.

=20
=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 = =20 =20 =20
Transformation Name `One-hot encoding of values to columns`= `BrandName` `50`
=20

Tip: If needed, you can rename the columns to prepend t= he names with a reference to the source column.

=20
=20

= =20

Content by label=20

There is no content with the specified labels

=20
=20

=20
=20
=20

------=_Part_109142_1240016446.1695947584508--