Page tree

 

Support | BlogContact Us | 844.332.2821

 

Contents:

The cloud-based version of Trifacta Wrangler is now available! Read all about it, and register for your free account.

You can use a variety of mathematical and statistical functions to calculate metrics within a column.

To calculate metrics across columns, you can use a generalized version of the following example.

Source:

Your dataset tracks swimmer performance across multiple heats in a race, and you would like to calculate best, worst, and average times in seconds across all three heats. Here's the data:

RacerHeat1Heat2Heat3
Racer X37.2238.2237.61
Racer Y41.33DQ38.04
Racer Z39.2739.0438.85


In the above data, Racer Y was disqualified (DQ) in Heat 2.

Transform:

To compute the metrics, you must bundle the data into an array, break out the array into separate rows, and then calculate your metrics by grouping. Here are the steps:

  1. When the data is imported, you may need to create a header for each row:

    Transformation Name Rename columns with a row
    Parameter: Option Use row as header
    Parameter: Row 1

  2. The columns containing heat time data may need to be retyped. From the drop-down next to each column name, select Decimal type.
  3. The DQ value in the Heat2 column is invalid data for Decimal type. You can use the following transform to turn it into a missing value. For purposes of calculating averages, you may or may not want to turn invalid data into zeroes or blanks. In this case, replacing the data as 0.00 causes improper calculations for the metrics.

    Transformation Name Replace text or patterns
    Parameter: Column Heat2
    Parameter: Find 'DQ'
    Parameter: Replace with ''

  4. Use the following to gather all of the heat data into two columns:

    Transformation Name Unpivot columns
    Parameter: Columns Heat1,Heat2,Heat3
    Parameter: Group size 1

  5. You can now rename the two columns. Rename key to HeatNum and value to HeatTime.

  6. You may want to delete the rows that have a missing value for HeatTime:

    Transformation Name Delete rows
    Parameter: Condition ISMISSING([value])

  7. You can now perform calculations on this column. The following transforms calculate minimum, average (mean), and maximum times for each racer:

    Transformation Name New formula
    Parameter: Formula type Multiple row formula
    Parameter: Formula MIN(HeatTime)
    Parameter: Group rows by Racer
    Parameter: New column name 'BestTime'


    Transformation Name New formula
    Parameter: Formula type Multiple row formula
    Parameter: Formula AVERAGE(HeatTime)
    Parameter: Group rows by Racer
    Parameter: New column name 'AvgTime'

    Transformation Name New formula
    Parameter: Formula type Multiple row formula
    Parameter: Formula MAX(HeatTime)
    Parameter: Group rows by Racer
    Parameter: New column name 'WorstTime'

     

  8. To make the data look better, you might want to reformat the values in the AvgTime column to two decimal points:

    Transformation Name Edit column with formula
    Parameter: Columns AvgTime
    Parameter: Formula NUMFORMAT(AvgTime, '##.00')

Results:

After you use the move transform to re-organize your columns, the dataset should look like the following:

RacerHeatNumHeatTimeBestTimeWorstTimeAvgTime
Racer XHeat137.2237.2238.2237.68
Racer XHeat238.2237.2238.2237.68
Racer XHeat337.6137.2238.2237.68
Racer YHeat141.3338.0441.3339.69
Racer YHeat338.0438.0441.3339.69
Racer ZHeat139.2738.8539.2739.05
Racer ZHeat239.0438.8539.2739.05
Racer ZHeat338.8538.8539.2739.05

Your Rating: Results: PatheticBadOKGoodOutstanding! 2 rates

This page has no comments.