Source:
You have the following data on student test scores. Scores on individual tests are stored in the Scores array, and you need to be able to track each test on a uniquely identifiable row. This example has two goals:
- One row for each student test
- Unique identifier for each student-score combination
LastName | FirstName | Scores |
---|
Adams | Allen | [81,87,83,79] |
Burns | Bonnie | [98,94,92,85] |
Cannon | Charles | [88,81,85,78] |
Transformation:
When the data is imported from CSV format, you must add a header transform and remove the quotes from the Scores column:
D trans |
---|
RawWrangle | true |
---|
p03Value | 1 |
---|
Type | step |
---|
WrangleText | rename type: header method: index sourcerownumber: 1 |
---|
p01Name | Option |
---|
p01Value | Use row(s) as column names |
---|
p02Name | Type |
---|
p02Value | Use a single row to name columns |
---|
p03Name | Row number |
---|
SearchTerm | Rename column with row(s) |
---|
|
D trans |
---|
RawWrangle | true |
---|
p03Value | '' |
---|
Type | step |
---|
WrangleText | replace col:Scores with:'' on:`"` global:true |
---|
p01Name | Column |
---|
p01Value | Scores |
---|
p02Name | Find |
---|
p02Value | '\"' |
---|
p03Name | Replace with |
---|
p04Value | true |
---|
p04Name | Match all occurrences |
---|
SearchTerm | Replace text or pattern |
---|
|
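For readers who think in general-purpose code, the two import steps above can be sketched in Python. This is only an illustration of the logic, not how the product implements it. The raw_rows list is a hypothetical stand-in for the imported CSV, with the quotes still attached to each Scores cell, and the function names are invented for this sketch.

```python
# Hypothetical stand-in for the imported CSV: row 1 holds the column names,
# and each Scores cell still carries its surrounding double quotes.
raw_rows = [
    ["LastName", "FirstName", "Scores"],
    ["Adams", "Allen", '"[81,87,83,79]"'],
    ["Burns", "Bonnie", '"[98,94,92,85]"'],
    ["Cannon", "Charles", '"[88,81,85,78]"'],
]


def promote_header(rows):
    """Use the first row as column names and return the remaining rows as dicts."""
    header, *data = rows
    return [dict(zip(header, values)) for values in data]


def strip_quotes(records, column):
    """Remove every double quote from one column, like a global replace."""
    return [{**rec, column: rec[column].replace('"', "")} for rec in records]


records = strip_quotes(promote_header(raw_rows), "Scores")
print(records[0])
```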
Validate test data: To begin, you might want to check whether you have the proper number of test scores for each student. You can use the following transform to calculate the difference between the expected number of elements in the Scores array (4) and the actual number:
D trans |
---|
RawWrangle | true |
---|
p03Value | 'numMissingTests' |
---|
Type | step |
---|
WrangleText | derive type:single value: (4 - arraylen(Scores)) as: 'numMissingTests' |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | (4 - arraylen(Scores)) |
---|
p03Name | New column name |
---|
SearchTerm | New formula |
---|
|
When the transform is previewed, you can see in the sample dataset that all tests are included, so numMissingTests is 0 in every row. You might still want to keep this column in the final dataset, since missing tests may only show up when the recipe is run at scale.
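The same check can be sketched in Python, assuming each Scores value has already been parsed into a list. The function name and the incomplete sample row are hypothetical and serve only to show a non-zero result.

```python
EXPECTED_TESTS = 4  # expected number of elements in each Scores array


def num_missing_tests(scores, expected=EXPECTED_TESTS):
    """Return how many tests are missing, mirroring (4 - arraylen(Scores))."""
    return expected - len(scores)


complete = [81, 87, 83, 79]   # a row from the sample data
incomplete = [88, 81]         # hypothetical row with two tests missing

print(num_missing_tests(complete))
print(num_missing_tests(incomplete))
```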
Unique row identifier: The Scores array must be broken out into individual rows for each test. However, there is no unique identifier for the row to track individual tests. In theory, you could use the combination of LastName-FirstName-Scores values to do so, but if a student recorded the same score twice, your dataset would contain duplicate rows. In the following transform, you create a parallel array called Tests, which contains an index array for the number of values in the Scores column. Index values start at 0:
D trans |
---|
RawWrangle | true |
---|
p03Value | 'Tests' |
---|
Type | step |
---|
WrangleText | derive type:single value:range(0,arraylen(Scores)) as:'Tests' |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | range(0,arraylen(Scores)) |
---|
p03Name | New column name |
---|
SearchTerm | New formula |
---|
|
You also need an identifier for the source row, which the sourcerownumber function provides. Because the header occupied row 1 of the source file, the data rows are numbered starting at 2:
D trans |
---|
RawWrangle | true |
---|
p03Value | 'orderIndex' |
---|
Type | step |
---|
WrangleText | derive type:single value:sourcerownumber() as:'orderIndex' |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | sourcerownumber() |
---|
p03Name | New column name |
---|
SearchTerm | New formula |
---|
|
One row for each student test: Your data should look like the following:
LastName | FirstName | Scores | Tests | orderIndex |
---|
Adams | Allen | [81,87,83,79] | [0,1,2,3] | 2 |
Burns | Bonnie | [98,94,92,85] | [0,1,2,3] | 3 |
Cannon | Charles | [88,81,85,78] | [0,1,2,3] | 4 |
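As a rough Python equivalent of the two derived columns, the sketch below builds a 0-based index list per row and a source row number. It assumes, as in this example, that the header was row 1 of the source, so the first data row is numbered 2. The helper name and the first_data_row parameter are hypothetical.

```python
def add_index_columns(rows, first_data_row=2):
    """Add a Tests index array and an orderIndex source row number to each row."""
    out = []
    for offset, row in enumerate(rows):
        new_row = dict(row)
        # Same idea as range(0, arraylen(Scores)): one index per score.
        new_row["Tests"] = list(range(0, len(row["Scores"])))
        # Stand-in for sourcerownumber(): position of the row in the source.
        new_row["orderIndex"] = first_data_row + offset
        out.append(new_row)
    return out


students = [
    {"LastName": "Adams", "FirstName": "Allen", "Scores": [81, 87, 83, 79]},
    {"LastName": "Burns", "FirstName": "Bonnie", "Scores": [98, 94, 92, 85]},
    {"LastName": "Cannon", "FirstName": "Charles", "Scores": [88, 81, 85, 78]},
]
indexed = add_index_columns(students)
print(indexed[0]["Tests"], indexed[0]["orderIndex"])
```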
Now, you want to bring together the Tests and Scores arrays into a single nested array using the arrayzip function:
D trans |
---|
RawWrangle | true |
---|
Type | step |
---|
WrangleText | derive type:single value:arrayzip([Tests,Scores]) |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | arrayzip([Tests,Scores]) |
---|
SearchTerm | New formula |
---|
|
Your dataset now contains a new column, column1, which holds the zipped pairs:
LastName | FirstName | Scores | Tests | orderIndex | column1 |
---|
Adams | Allen | [81,87,83,79] | [0,1,2,3] | 2 | [[0,81],[1,87],[2,83],[3,79]] |
Burns | Bonnie | [98,94,92,85] | [0,1,2,3] | 3 | [[0,98],[1,94],[2,92],[3,85]] |
Cannon | Charles | [88,81,85,78] | [0,1,2,3] | 4 | [[0,88],[1,81],[2,85],[3,78]] |
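The pairing that arrayzip performs here can be illustrated with Python's built-in zip. This sketch covers only the equal-length case shown in the example. Python's zip stops at the shortest input, which is not necessarily how arrayzip treats arrays of different lengths. The function name is hypothetical.

```python
def array_zip(*arrays):
    """Pair up elements by position, returning a list of lists."""
    return [list(group) for group in zip(*arrays)]


tests = [0, 1, 2, 3]
scores = [81, 87, 83, 79]
print(array_zip(tests, scores))
```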
Use the following to unpack the nested array:
D trans |
---|
RawWrangle | true |
---|
Type | step |
---|
WrangleText | flatten col: column1 |
---|
p01Name | Column |
---|
p01Value | column1 |
---|
SearchTerm | Expand arrays to rows |
---|
|
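Conceptually, expanding an array to rows repeats the other column values once per array element. A minimal Python sketch of that idea follows. The function name is hypothetical and the sample row is taken from the table above.

```python
def flatten(rows, col):
    """Emit one output row per element of the array stored in col."""
    out = []
    for row in rows:
        for element in row[col]:
            new_row = dict(row)      # copy the other columns unchanged
            new_row[col] = element   # replace the array with a single element
            out.append(new_row)
    return out


zipped = [
    {"LastName": "Adams", "FirstName": "Allen", "orderIndex": 2,
     "column1": [[0, 81], [1, 87], [2, 83], [3, 79]]},
]
flat = flatten(zipped, "column1")
print(len(flat), flat[1]["column1"])
```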
Each test-score combination is now broken out into a separate row, but each pair is still nested in a two-element array. Break the pair out into separate columns using the following:
D trans |
---|
RawWrangle | true |
---|
Type | step |
---|
WrangleText | unnest col:column1 keys:'[0]','[1]' |
---|
p01Name | Column |
---|
p01Value | column1 |
---|
p02Name | Paths to elements |
---|
p02Value | '[0]','[1]' |
---|
SearchTerm | Unnest Objects into columns |
---|
|
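In Python terms, this step pulls elements 0 and 1 of each pair into their own columns. The sketch below names the new columns column_0 and column_1 to match the generated names renamed in the following steps. The function itself is a hypothetical illustration.

```python
def unnest(rows, col, indexes):
    """Copy the listed positions of the array in col into new columns."""
    out = []
    for row in rows:
        new_row = dict(row)
        for i in indexes:
            new_row[f"column_{i}"] = row[col][i]
        out.append(new_row)
    return out


pairs = [
    {"LastName": "Adams", "column1": [0, 81]},
    {"LastName": "Adams", "column1": [1, 87]},
]
unnested = unnest(pairs, "column1", [0, 1])
print(unnested[1])
```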
After you delete column1, which is no longer needed, you should rename the two generated columns:
D trans |
---|
RawWrangle | true |
---|
p03Value | 'TestNum' |
---|
Type | step |
---|
WrangleText | rename mapping:[column_0,'TestNum'] |
---|
p01Name | Option |
---|
p01Value | Manual rename |
---|
p02Name | Column |
---|
p02Value | column_0 |
---|
p03Name | New column name |
---|
SearchTerm | Rename columns |
---|
|
D trans |
---|
RawWrangle | true |
---|
p03Value | 'TestScore' |
---|
Type | step |
---|
WrangleText | rename mapping:[column_1,'TestScore'] |
---|
p01Name | Option |
---|
p01Value | Manual rename |
---|
p02Name | Column |
---|
p02Value | column_1 |
---|
p03Name | New column name |
---|
SearchTerm | Rename columns |
---|
|
Unique row identifier: You can add one more step to create unique test identifiers, which identify the specific test for each student. The following uses the original row identifier orderIndex as an identifier for the student and the TestNum value to create the TestId column value:
D trans |
---|
RawWrangle | true |
---|
p03Value | 'TestId' |
---|
Type | step |
---|
WrangleText | derive type:single value: (orderIndex * 10) + TestNum as: 'TestId' |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | (orderIndex * 10) + TestNum |
---|
p03Name | New column name |
---|
SearchTerm | New formula |
---|
|
The resulting values are integers. To make your identifiers more readable, you might merge a string prefix with the TestId column:
D trans |
---|
RawWrangle | true |
---|
Type | step |
---|
WrangleText | merge col:'TestId00','TestId' |
---|
p01Name | Columns |
---|
p01Value | 'TestId00','TestId' |
---|
SearchTerm | Merge columns |
---|
|
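The identifier arithmetic and the prefix merge can be sketched together in Python. The function name is hypothetical. Note one assumption the formula relies on: multiplying orderIndex by 10 leaves room for only a single digit, so identifiers stay unique only while each student has fewer than 10 tests.

```python
def make_test_id(order_index, test_num, prefix="TestId00"):
    """Combine the student's source row number and the test index into one ID."""
    if not 0 <= test_num < 10:
        # With a multiplier of 10, larger test numbers would collide
        # with the IDs of the next source row.
        raise ValueError("test_num must be between 0 and 9")
    numeric_id = (order_index * 10) + test_num
    return f"{prefix}{numeric_id}"


# orderIndex 2 is the first data row in this example; TestNum runs 0-3.
ids = [make_test_id(2, n) for n in range(4)]
print(ids)
```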
Extending: You might want to generate some summary statistical information on this dataset. For example, you might be interested in calculating each student's average test score. This step requires figuring out how to properly group the test values. In this case, you cannot group by the LastName value alone, and when the recipe is run at scale, first names might also collide. So, you might need to create a kind of primary key using the following:
D trans |
---|
RawWrangle | true |
---|
p03Value | 'studentId' |
---|
Type | step |
---|
WrangleText | merge col:'LastName','FirstName' with:'-' as:'studentId' |
---|
p01Name | Columns |
---|
p01Value | 'LastName','FirstName' |
---|
p02Name | Separator |
---|
p02Value | '-' |
---|
p03Name | New column name |
---|
SearchTerm | Merge columns |
---|
|
You can now use this as a grouping parameter for your calculation:
D trans |
---|
RawWrangle | true |
---|
p03Value | studentId |
---|
Type | step |
---|
WrangleText | derive type:single value:average(TestScore) group:studentId as:'avg_TestScore' |
---|
p01Name | Formula type |
---|
p01Value | Single row formula |
---|
p02Name | Formula |
---|
p02Value | average(TestScore) |
---|
p03Name | Group rows by |
---|
p04Value | 'avg_TestScore' |
---|
p04Name | New column name |
---|
SearchTerm | New formula |
---|
|
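The key-building and grouped-average steps can be sketched in plain Python as follows. As in the transform, the average is written back onto every row of the group rather than collapsing the group to a single row. The helper names are hypothetical.

```python
from collections import defaultdict


def add_student_id(rows):
    """Build a LastName-FirstName key, like the merge step with a '-' separator."""
    return [{**r, "studentId": f"{r['LastName']}-{r['FirstName']}"} for r in rows]


def add_group_average(rows, value_col, group_col, new_col):
    """Attach the per-group average of value_col to every row in that group."""
    values = defaultdict(list)
    for r in rows:
        values[r[group_col]].append(r[value_col])
    averages = {key: sum(v) / len(v) for key, v in values.items()}
    return [{**r, new_col: averages[r[group_col]]} for r in rows]


rows = [
    {"LastName": "Adams", "FirstName": "Allen", "TestScore": s}
    for s in (81, 87, 83, 79)
] + [
    {"LastName": "Cannon", "FirstName": "Charles", "TestScore": s}
    for s in (88, 81, 85, 78)
]
result = add_group_average(add_student_id(rows), "TestScore", "studentId", "avg_TestScore")
print(result[0]["studentId"], result[0]["avg_TestScore"])
```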
Results:
After you delete unnecessary columns and move your columns around, the dataset should look like the following:
TestId | LastName | FirstName | TestNum | TestScore | studentId | avg_TestScore |
---|
TestId0020 | Adams | Allen | 0 | 81 | Adams-Allen | 82.5 |
TestId0021 | Adams | Allen | 1 | 87 | Adams-Allen | 82.5 |
TestId0022 | Adams | Allen | 2 | 83 | Adams-Allen | 82.5 |
TestId0023 | Adams | Allen | 3 | 79 | Adams-Allen | 82.5 |
TestId0030 | Burns | Bonnie | 0 | 98 | Burns-Bonnie | 92.25 |
TestId0031 | Burns | Bonnie | 1 | 94 | Burns-Bonnie | 92.25 |
TestId0032 | Burns | Bonnie | 2 | 92 | Burns-Bonnie | 92.25 |
TestId0033 | Burns | Bonnie | 3 | 85 | Burns-Bonnie | 92.25 |
TestId0040 | Cannon | Charles | 0 | 88 | Cannon-Charles | 83 |
TestId0041 | Cannon | Charles | 1 | 81 | Cannon-Charles | 83 |
TestId0042 | Cannon | Charles | 2 | 85 | Cannon-Charles | 83 |
TestId0043 | Cannon | Charles | 3 | 78 | Cannon-Charles | 83 |