NOTE: Transforms are a part of the underlying language that is not directly accessible to users. This content is maintained for reference purposes only.
This transform does not reference keys in the array. If your array data contains keys, use the unnest transform. See Unnest Transform.
Output: Generates a separate row for each value in the array. Values of other columns in generated rows are copied from the source.
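The output behavior described above can be illustrated outside the product with a minimal Python sketch. This is not the transform's implementation or syntax. The function name `flatten` and the sample rows are hypothetical, and the handling of empty arrays (no output row) is an assumption that may differ from the actual transform.

```python
from typing import Any


def flatten(rows: list[dict[str, Any]], col: str) -> list[dict[str, Any]]:
    """Emit one output row per element of the array in `col`.

    Values of all other columns are copied from the source row.
    Assumption: a row whose array is empty produces no output rows.
    """
    out = []
    for row in rows:
        for value in row[col]:
            new_row = dict(row)   # copy the other column values
            new_row[col] = value  # replace the array with a single element
            out.append(new_row)
    return out


# Hypothetical sample data, not taken from the documentation's dataset.
sample = [
    {"LastName": "Adams", "FirstName": "Allen", "Scores": [81, 87]},
    {"LastName": "Burns", "FirstName": "Bonnie", "Scores": [98]},
]
flat = flatten(sample, "Scores")
```

Each source row with N array elements becomes N rows, so the two sample rows above yield three rows in `flat`.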
|flatten||Y||transform||Name of the transform|
|col||Y||string||Source column name|
For more information on syntax standards, see Language Documentation Syntax Notes.
Identifies the column to which to apply the transform. You can specify only one column.
|Yes||String (column name)|
Example - Flatten an array
In this example, the source data includes an array of scores that need to be broken out into separate rows.
When the data is imported, you might have to re-type the Scores column as an array:
You can now flatten the Scores column data into separate rows:
This example is extended below.
Example - Flatten and unnest together
While the above example nicely flattens out your data, there are two potential problems with the results:
- There is no identifier for each test. For example, Allen Adams' score of 87 cannot be associated with the specific test on which he recorded the score.
- There is no unique identifier for each row.
The following example addresses both of these issues. It also demonstrates differences between the unnest and the flatten transforms, including how you use unnest to flatten array data based on specified keys.
- For more information, see Unnest Transform.
You have the following data on student test scores. Scores on individual tests are stored in the Scores array, and you need to be able to track each test on a uniquely identifiable row. This example has two goals:
- One row for each student test
- Unique identifier for each student-score combination
When the data is imported from CSV format, you must add a
header transform and remove the quotes from the
Validate test data: To begin, you might want to check whether you have the proper number of test scores for each student. You can use the following transform to calculate the difference between the expected number of elements in the Scores array (4) and the actual number:
When the transform is previewed, you can see in the sample dataset that all tests are included. You might or might not want to include this column in the final dataset, as you might identify missing tests when the recipe is run at scale.
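The validation check described above amounts to subtracting the actual array length from the expected count. A minimal Python sketch of that arithmetic follows. It is an illustration only, not the transform itself, and the names `EXPECTED_TESTS` and `missing_tests` are hypothetical.

```python
EXPECTED_TESTS = 4  # expected number of elements in the Scores array


def missing_tests(scores: list[int], expected: int = EXPECTED_TESTS) -> int:
    """Return the expected number of tests minus the actual number.

    0 means all tests are present; a positive value means tests are missing.
    """
    return expected - len(scores)


# Hypothetical score arrays for illustration.
complete = missing_tests([81, 87, 83, 79])
short = missing_tests([81, 87])
```

A nonzero result flags a student whose Scores array does not have the expected number of entries.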
Unique row identifier: The Scores array must be broken out into individual rows for each test. However, there is no unique identifier for the row to track individual tests. In theory, you could use the combination of LastName-FirstName-Scores values to do so, but if a student recorded the same score twice, your dataset would have duplicate rows. In the following transform, you create a parallel array called Tests, which contains an index array for the number of values in the Scores column.
Scores column. Index values start at
Also, we will want to create an identifier for the source row using the
One row for each student test: Your data should look like the following:
Now, you want to bring together the
Scores arrays into a single nested array using the
Your dataset has been changed:
flatten transform, you can unpack the nested array:
Each test-score combination is now broken out into a separate row. The nested Test-Score combinations must be broken out into separate columns using
After you delete column1, which is no longer needed, you should rename the two generated columns:
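As a language-neutral illustration of the sequence above (build a parallel index array, nest it with the scores, flatten the nested array, then split each pair into two columns), here is a minimal Python sketch. It is not the product's syntax. The function name `explode_tests`, the output column name `Score`, and the `start` parameter are assumptions made for this sketch; the starting index is passed in explicitly rather than assumed.

```python
from typing import Any


def explode_tests(row: dict[str, Any], start: int) -> list[dict[str, Any]]:
    """Turn one student row into one row per (test number, score) pair."""
    scores = row["Scores"]
    # Parallel index array with one entry per score.
    tests = list(range(start, start + len(scores)))
    # Nested array of [test, score] pairs.
    pairs = list(zip(tests, scores))
    out = []
    for test_number, score in pairs:  # flatten: one row per pair
        new_row = {k: v for k, v in row.items() if k != "Scores"}
        new_row["TestNumber"] = test_number  # split the pair into columns
        new_row["Score"] = score
        out.append(new_row)
    return out


# Hypothetical sample row; start=1 is an arbitrary choice for this sketch.
student = {"LastName": "Adams", "FirstName": "Allen", "Scores": [81, 87]}
test_rows = explode_tests(student, start=1)
```

Because each row carries its own test number, two identical scores for the same student no longer produce duplicate rows.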
Unique row identifier: You can do one more step to create unique test identifiers, which identify the specific test for each student. The following uses the original row identifier OrderIndex as an identifier for the student and the TestNumber value to create the TestId column value:
The above are integer values. To make your identifiers look prettier, you might add the following:
Extending: You might want to generate some summary statistical information on this dataset. For example, you might be interested in calculating each student's average test score. This step requires figuring out how to properly group the test values. In this case, you cannot group by the LastName value, and there might be collisions between first names when this recipe is run at scale. So, you might need to create a kind of primary key using the following:
You can now use this as a grouping parameter for your calculation:
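The grouping idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the recipe itself: the composite key format (`LastName-FirstName`), the column name `Score`, and the function name `average_by_student` are hypothetical choices made for this sketch.

```python
from collections import defaultdict
from typing import Any


def average_by_student(rows: list[dict[str, Any]]) -> dict[str, float]:
    """Average the Score column, grouped by a composite student key."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for row in rows:
        # Assumed key format: last and first name joined with a hyphen.
        key = f"{row['LastName']}-{row['FirstName']}"
        totals[key][0] += row["Score"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical flattened rows, one per student test.
flattened = [
    {"LastName": "Adams", "FirstName": "Allen", "Score": 80},
    {"LastName": "Adams", "FirstName": "Allen", "Score": 90},
    {"LastName": "Burns", "FirstName": "Bonnie", "Score": 100},
]
averages = average_by_student(flattened)
```

Grouping on a composite key rather than a single name column keeps students who share a first or last name in separate groups.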
After you delete unnecessary columns and move your columns around, the dataset should look like the following: