Comment: Published by Scroll Versions from space DEV and version r088

Source:

You have the following data on student test scores. Scores on individual tests are stored in the Scores array, and you need to be able to track each test on a uniquely identifiable row. This example has two goals:

  1. One row for each student test
  2. Unique identifier for each student-score combination

 

LastName  FirstName  Scores
Adams     Allen      [81,87,83,79]
Burns     Bonnie     [98,94,92,85]
Cannon    Charles    [88,81,85,78]

Transformation:

When the data is imported from CSV format, you must add a header transform and remove the quotes from the Scores column:

Transformation Name: Rename column with row(s)
Parameter: Option: Use row(s) as column names
Parameter: Type: Use a single row to name columns
Parameter: Row number: 1
Wrangle text: rename type: header method: index sourcerownumber: 1

Transformation Name: Replace text or pattern
Parameter: Column: Scores
Parameter: Find: '\"'
Parameter: Replace with: ''
Parameter: Match all occurrences: true
Wrangle text: replace col:Scores with:'' on:`"` global:true
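
To make the effect of these two steps concrete, the following is a rough plain-Python analogue, not Wrangle. It assumes the imported rows look like the illustrative `imported` list below, with the header still stored as an ordinary row and the Scores values still wrapped in double quotes. All variable names are hypothetical.

```python
import json

# Illustrative rows as they might look right after CSV import (an assumption):
# the header is still an ordinary row and Scores is still wrapped in quotes.
imported = [
    ["LastName", "FirstName", "Scores"],
    ["Adams", "Allen", '"[81,87,83,79]"'],
    ["Burns", "Bonnie", '"[98,94,92,85]"'],
    ["Cannon", "Charles", '"[88,81,85,78]"'],
]

# Step 1: use the first row as column names.
header, *data_rows = imported
records = [dict(zip(header, row)) for row in data_rows]

# Step 2: strip every double quote from Scores, then parse the text as an array.
for record in records:
    record["Scores"] = json.loads(record["Scores"].replace('"', ""))
```

After both steps, each record carries Scores as a real array rather than a quoted string, which is what the later array functions need.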

Validate test data: To begin, you might want to check whether you have the proper number of test scores for each student. You can use the following transform to calculate the difference between the expected number of elements in the Scores array (4) and the actual number:

Transformation Name: New formula
Parameter: Formula type: Single row formula
Parameter: Formula: (4 - arraylen(Scores))
Parameter: New column name: 'numMissingTests'
Wrangle text: derive type:single value: (4 - arraylen(Scores)) as: 'numMissingTests'

When the transform is previewed, you can see in the sample dataset that all tests are included. You might or might not want to include this column in the final dataset, as you might identify missing tests when the recipe is run at scale.
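
The same check can be sketched in plain Python, assuming each row holds its Scores as a list. The helper name `num_missing_tests` and the constant `EXPECTED_TESTS` are illustrative and do not come from the product.

```python
EXPECTED_TESTS = 4  # expected number of elements in each Scores array

def num_missing_tests(scores, expected=EXPECTED_TESTS):
    """Mirror of (4 - arraylen(Scores)): how many expected scores are absent."""
    return expected - len(scores)

students = [
    {"LastName": "Adams", "FirstName": "Allen", "Scores": [81, 87, 83, 79]},
    {"LastName": "Burns", "FirstName": "Bonnie", "Scores": [98, 94, 92, 85]},
    {"LastName": "Cannon", "FirstName": "Charles", "Scores": [88, 81, 85, 78]},
]

# Add the derived column to every row.
for student in students:
    student["numMissingTests"] = num_missing_tests(student["Scores"])
```

A value of 0 means the row is complete. A positive value counts missing tests, and a negative value would indicate more scores than expected.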

Unique row identifier: The Scores array must be broken out into individual rows for each test. However, there is no unique identifier for the row to track individual tests. In theory, you could use the combination of LastName-FirstName-Scores values to do so, but if a student recorded the same score twice, your dataset would have duplicate rows. In the following transform, you create a parallel array called Tests, which contains an index value for each element in the Scores array. Index values start at 0:

Transformation Name: New formula
Parameter: Formula type: Single row formula
Parameter: Formula: range(0,arraylen(Scores))
Parameter: New column name: 'Tests'
Wrangle text: derive type:single value:range(0,arraylen(Scores)) as:'Tests'

You also want to create an identifier for the source row using the sourcerownumber function:

Transformation Name: New formula
Parameter: Formula type: Single row formula
Parameter: Formula: sourcerownumber()
Parameter: New column name: 'orderIndex'
Wrangle text: derive type:single value:sourcerownumber() as:'orderIndex'

One row for each student test: Your data should look like the following:

LastName  FirstName  Scores         Tests      orderIndex
Adams     Allen      [81,87,83,79]  [0,1,2,3]  2
Burns     Bonnie     [98,94,92,85]  [0,1,2,3]  3
Cannon    Charles    [88,81,85,78]  [0,1,2,3]  4
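
As a rough plain-Python analogue of the two derived columns (not Wrangle), the sketch below builds the Tests index array and the orderIndex value. It assumes the header occupied source row 1, so the first data row is numbered 2, which matches the orderIndex values shown above. Variable names are illustrative.

```python
students = [
    {"LastName": "Adams", "FirstName": "Allen", "Scores": [81, 87, 83, 79]},
    {"LastName": "Burns", "FirstName": "Bonnie", "Scores": [98, 94, 92, 85]},
    {"LastName": "Cannon", "FirstName": "Charles", "Scores": [88, 81, 85, 78]},
]

# start=2 is an assumption: the header row counted as source row 1.
for source_row, student in enumerate(students, start=2):
    # Parallel index array: one entry per score, starting at 0.
    student["Tests"] = list(range(0, len(student["Scores"])))
    # Identifier of the row in the original source.
    student["orderIndex"] = source_row
```

Because Tests is derived from the length of Scores, a row with fewer scores simply gets a shorter index array.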

Now, you want to bring together the Tests and Scores arrays into a single nested array using the arrayzip function:

Transformation Name: New formula
Parameter: Formula type: Single row formula
Parameter: Formula: arrayzip([Tests,Scores])
Wrangle text: derive type:single value:arrayzip([Tests,Scores])

Your dataset has been changed:

LastName  FirstName  Scores         Tests      orderIndex  column1
Adams     Allen      [81,87,83,79]  [0,1,2,3]  2           [[0,81],[1,87],[2,83],[3,79]]
Burns     Bonnie     [98,94,92,85]  [0,1,2,3]  3           [[0,98],[1,94],[2,92],[3,85]]
Cannon    Charles    [88,81,85,78]  [0,1,2,3]  4           [[0,88],[1,81],[2,85],[3,78]]
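
The pairing behavior shown in column1 corresponds to an element-wise zip. A minimal Python sketch for one row, with illustrative variable names:

```python
tests = [0, 1, 2, 3]
scores = [81, 87, 83, 79]

# Pair the i-th test index with the i-th score, producing a nested array.
column1 = [list(pair) for pair in zip(tests, scores)]
```

Each inner array keeps the test index next to its score, so the pairing survives when the outer array is later expanded into rows.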

Use the following to unpack the nested array:

Transformation Name: Expand arrays to rows
Parameter: Column: column1
Wrangle text: flatten col: column1

Each test-score combination is now broken out into a separate row. The nested Test-Score combinations must be broken out into separate columns using the following:

Transformation Name: Unnest Objects into columns
Parameter: Column: column1
Parameter: Paths to elements: '[0]','[1]'
Wrangle text: unnest col:column1 keys:'[0]','[1]'
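
The flatten and unnest steps can be approximated in plain Python as follows. This is only a sketch of the row and column reshaping, not of how the product implements it. The output column names column_0 and column_1 follow the names used in the rename steps of this example.

```python
rows = [
    {
        "LastName": "Adams",
        "FirstName": "Allen",
        "orderIndex": 2,
        "column1": [[0, 81], [1, 87], [2, 83], [3, 79]],
    },
]

# Flatten: emit one output row per element of the nested array,
# copying the other column values onto each new row.
flattened = []
for row in rows:
    for pair in row["column1"]:
        flattened.append({**row, "column1": pair})

# Unnest: pull elements [0] and [1] of each pair into their own columns.
unnested = [
    {**row, "column_0": row["column1"][0], "column_1": row["column1"][1]}
    for row in flattened
]
```

One input row with four pairs becomes four output rows, each with its own test index and score.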

After you delete column1, which is no longer needed, you should rename the two generated columns:

Transformation Name: Rename columns
Parameter: Option: Manual rename
Parameter: Column: column_0
Parameter: New column name: 'TestNum'
Wrangle text: rename mapping:[column_0,'TestNum']

Transformation Name: Rename columns
Parameter: Option: Manual rename
Parameter: Column: column_1
Parameter: New column name: 'TestScore'
Wrangle text: rename mapping:[column_1,'TestScore']

Unique row identifier: You can do one more step to create unique test identifiers, which identify the specific test for each student. The following uses the original row identifier orderIndex as an identifier for the student and the TestNum value to create the TestId column value:

Transformation Name: New formula
Parameter: Formula type: Single row formula
Parameter: Formula: (orderIndex * 10) + TestNum
Parameter: New column name: 'TestId'
Wrangle text: derive type:single value: (orderIndex * 10) + TestNum as: 'TestId'

The above are integer values. To make your identifiers more readable, you might prepend a text prefix with the following:

Transformation Name: Merge columns
Parameter: Columns: 'TestId00','TestId'
Wrangle text: merge col:'TestId00','TestId'
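
The identifier arithmetic and the prefix merge can be sketched in plain Python. The function name `make_test_id` is hypothetical. The formula and the 'TestId00' prefix are taken from the steps above.

```python
def make_test_id(order_index: int, test_num: int) -> str:
    """Combine the source row number and test index into one identifier."""
    # Same arithmetic as (orderIndex * 10) + TestNum.
    numeric_id = order_index * 10 + test_num
    # Same idea as merging the literal 'TestId00' with the numeric value.
    return "TestId00" + str(numeric_id)

# One identifier per (orderIndex, TestNum) combination in the sample data.
ids = [make_test_id(order, test) for order in (2, 3, 4) for test in range(4)]
```

Note that multiplying by 10 leaves room for only ten test indexes (0 through 9) per student. With ten or more tests per row, identifiers from adjacent source rows could collide, so a larger multiplier would be needed in that case.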

Extending: You might want to generate some summary statistical information on this dataset. For example, you might be interested in calculating each student's average test score. This step requires figuring out how to properly group the test values. In this case, you cannot group by the LastName value alone, since students can share a last name, and first names can also collide when this recipe is run at scale. So, you might need to create a kind of primary key using the following:

Transformation Name: Merge columns
Parameter: Columns: 'LastName','FirstName'
Parameter: Separator: '-'
Parameter: New column name: 'studentId'
Wrangle text: merge col:'LastName','FirstName' with:'-' as:'studentId'

You can now use this as a grouping parameter for your calculation:

Transformation Name: New formula
Parameter: Formula type: Single row formula
Parameter: Formula: average(TestScore)
Parameter: Group rows by: studentId
Parameter: New column name: 'avg_TestScore'
Wrangle text: derive type:single value:average(TestScore) group:studentId as:'avg_TestScore'
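
A plain-Python sketch of the key creation and grouped average follows. It uses the sample names and scores from the source data. Variable names such as `grouped` and `averages` are illustrative only.

```python
from collections import defaultdict

source = [
    ("Adams", "Allen", [81, 87, 83, 79]),
    ("Burns", "Bonnie", [98, 94, 92, 85]),
    ("Cannon", "Charles", [88, 81, 85, 78]),
]

# One row per test, with studentId built by joining the names with '-'.
rows = [
    {"studentId": "-".join((last, first)), "TestScore": score}
    for last, first, scores in source
    for score in scores
]

# Collect the scores that belong to each studentId.
grouped = defaultdict(list)
for row in rows:
    grouped[row["studentId"]].append(row["TestScore"])

# Compute the per-student average and write it back onto every row of the group.
averages = {sid: sum(vals) / len(vals) for sid, vals in grouped.items()}
for row in rows:
    row["avg_TestScore"] = averages[row["studentId"]]
```

As with the grouped formula, every row keeps its own TestScore and additionally receives the average for its group, so the row count does not change.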

Results:

After you delete unnecessary columns and move your columns around, the dataset should look like the following:

TestId      LastName  FirstName  TestNum  TestScore  studentId       avg_TestScore
TestId0020  Adams     Allen      0        81         Adams-Allen     82.5
TestId0021  Adams     Allen      1        87         Adams-Allen     82.5
TestId0022  Adams     Allen      2        83         Adams-Allen     82.5
TestId0023  Adams     Allen      3        79         Adams-Allen     82.5
TestId0030  Burns     Bonnie     0        98         Burns-Bonnie    92.25
TestId0031  Burns     Bonnie     1        94         Burns-Bonnie    92.25
TestId0032  Burns     Bonnie     2        92         Burns-Bonnie    92.25
TestId0033  Burns     Bonnie     3        85         Burns-Bonnie    92.25
TestId0040  Cannon    Charles    0        88         Cannon-Charles  83
TestId0041  Cannon    Charles    1        81         Cannon-Charles  83
TestId0042  Cannon    Charles    2        85         Cannon-Charles  83
TestId0043  Cannon    Charles    3        78         Cannon-Charles  83