Page tree

 

Support | BlogContact Us | 844.332.2821

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

This section describes some techniques for performing analysis across data stored in multiple columns. For example, you may want to analyze combinations of height and weight. Some options:

  • Consolidate dimensions to a single metric. For example, height and weight can be combined using a BMI (body mass index) calculation. Then, use available outlier analysis capabilities in the

    D s webapp
    rtrue
    Below, you can review a method for bringing together similar data from multiple columns into a single column for easier analysis.

  • Flag outlier values of individual columns, perhaps giving each column a weighting factor (e.g. 0.5). Sum the outliers and their weights together.
  • Defer analysis until the data has arrived in the target system. 

If you have homogeneous data across multiple columns, such as multiple individual events recorded in a single row, you can use a different method to calculate metrics. See Calculate Metrics across Columns

In some cases, you may need to identify outliers across multiple columns of data. For example, you have a dataset containing scores from three separate tests taken by a set of individuals. Your columns may look like the following:

Most calculations, such as standard deviation, work for a single column of data. To perform analysis across all three columns, you must reshape the above dataset to look like the following:

  • LastName
  • FirstName
  • TestNumber
  • TestScore

This steps below outline the workflow for this example.

  • The full recipe is provided at the bottom of this section.

Steps:

  1. Load the TestScores dataset into the Transformer page. It should already be split out into five separate columns.
  2. The three columns listed side by side are data that has been organized in a pivot table. To break down this data, you must unpivot the data, which breaks down the data into a key column (containing TestScore1TestScore2TestScore1) and a value column, which contains individual test scores.

    D code

    unpivot col: TestScore1,TestScore2,TestScore3

  3. Rename the generated column of test scores to TestScore.
  4. The numeric information in the key column values can be extracted using the following: 

    D code

    extract col: key on: `{digit}`

  5. The key2 column contains just the numeric data now. Rename this column to TestNumber. You can drop the key column now.
  6. The dataset does not contain a primary key, which field containing a unique identifier for each row. The combination of last name, first name, and test number is a unique identifier for each row in the dataset:

    D code

    merge col: LastName,FirstName,TestNumber with: '-'

  7. Rename the new column to TestID. Typically, primary keys are listed as the first field in a dataset. You might want to move the column before the LastName column. 
  8. You may have noticed that the data is still organized by name (first and last) and test number, so that an individual's tests are scattered throughout the dataset. To reorganize the information, you can re-aggregate the data using the following:

    D code

    aggregate value: SUM(TestScore) group: LastName,FirstName,TestNumber,TestID

    Tip

    Tip: The above retains all instances of tests that have been taken. If you are only interested in the average test score, you can remove the TestNumber and TestID groupings and the change the SUM function to AVERAGE. In the results, you have one average for each test taker.

  9.  You may want to rename the aggregation column. Your final dataset should look like the following:



D caption
typefigure
Single column of test scores
Now that your columns of data have been consolidated to a single column, you can use the single-column transforms and functions to perform analysis.

  • For more information on identifying outliers in this data, see Locate Outliers.

D s lang
 recipe:

The full 

D s lang
 recipe for this workflow is the following:

D code

splitrows col: column1 on: '\r'

D code

split col: column1 on: ',' limit: 5

D code

header

D code

unpivot col: TestScore1,TestScore2,TestScore3

D code

rename col: value to: 'TestScore' 

D code

extract col: key on: `{digit}`

D code

rename col: key1 to: 'TestNumber' 

D code

drop col: key

D code

merge col: LastName,FirstName,TestNumber with: '-'

D code

rename col: column1 to: 'TestID'

D code

move col: TestID before: LastName 

D code

aggregate value: SUM(TestScore) group: LastName,FirstName,TestNumber,TestID

D code

rename col: sum_TestScore to: 'TestScore'