The unnest transform must include keys that you specify as part of the transform step. To unnest a column of array data that contains no keys, use the flatten transform. See Flatten Transform.
This transform might be automatically applied as one of the first steps of your recipe. See Initial Parsing Steps.
Basic Usage
unnest col: myObj keys:'sourceA','sourceB' pluck:true markLineage:true
Output:
- Extracts from the myObj column the corresponding values for the keys sourceA and sourceB into two new columns.
- Since markLineage is true, these new column names are prepended with the source name: sourceA_column1 and sourceB_column2.
- Any non-missing values from the source columns are added to the corresponding new columns and are removed from the source column, since pluck is true.
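As a rough mental model only, the behavior listed above can be sketched in Python. The unnest_row helper is hypothetical, and its lineage naming rule (source column name, underscore, key) is an assumption made for illustration, not the product's implementation.

```python
def unnest_row(row, col, keys, pluck=False, mark_lineage=False):
    """Return a copy of `row` with one new column per requested key."""
    out = dict(row)
    source = dict(row[col])          # copy so the input row is not mutated
    for key in keys:
        # Assumed naming rule: prefix with the source column when lineage is on.
        name = f"{col}_{key}" if mark_lineage else key
        out[name] = source.get(key)  # a missing key yields None
        if pluck:
            source.pop(key, None)    # remove the extracted value from the source
    out[col] = source
    return out


row = {"myObj": {"sourceA": 1, "sourceB": 2, "other": 3}}
result = unnest_row(row, "myObj", ["sourceA", "sourceB"],
                    pluck=True, mark_lineage=True)
print(result)
```

With pluck enabled, only the keys that were not requested remain in the source column, which is what makes the leftover data easy to inspect.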
Syntax and Parameters
unnest col:column_ref keys:'key1','key2' [pluck:true|false] [markLineage:true|false]
Token | Required? | Data Type | Description |
---|---|---|---|
unnest | Y | transform | Name of the transform |
col | Y | string | Source column name |
keys | Y | string | Comma-separated list of quoted key names. See below for examples. |
pluck | N | boolean | If true, any values unnested from the source are also removed from the source. Default is false. |
markLineage | N | boolean | If true, the names of new columns are prepended with the name of the source column. |
For more information on syntax standards, see Language Documentation Syntax Notes.
col
Identifies the column to which to apply the transform. You can specify only one column.
Usage Notes:
Required? | Data Type |
---|---|
Yes | String (column name) |
keys
Comma-separated list of keys to use to extract data from the specified source column.
- Key values must be quoted (e.g., 'key1','key2'). Any quoted value is considered the path to a single key.
- Key values are case-sensitive.
- Each key must be listed. A range of keys cannot be specified.
NOTE: Keys that contain non-alphanumeric values, such as spaces, must be enclosed in square brackets and quotes. Values with underscores do not require this bracketing.
The comma-separated list of keys determines the columns to generate from the source data. If you specify three values for keys, the three new columns contain the corresponding values from the source column.
This parameter uses different syntax for single-level and multi-level nested data. The syntax also varies between the Object and Array data types.
Usage Notes:
Required? | Data Type |
---|---|
Yes | Comma-separated String values. Syntax examples are provided below. |
Keys for Object data - single-level
NOTE: Key names are case-sensitive.
For a single, top-level key in an Object field, you can specify the key as a simple quoted string:
unnest col:myCol keys: 'myObjKey'
The above looks for the key myObjKey among the top-level keys in the Object and returns the corresponding value for the new column. You can also enclose this key in square brackets:
unnest col:myCol keys: '[myObjKey]'
To specify multiple first-level keys, use the following:
unnest col:myCol keys:'myObjKey','my2ndObjKey'
The above generates two new columns (myObjKey and my2ndObjKey) containing the corresponding values for the keys.
Keys for Object data - multi-level
You can also reference keys that are below the first level in the Object.
Example data:
{ "Key1" : { "Key1A" : { "Key1A1" : "Value1" } } } { "Key2" : { "Key2A" : { "Key2A1" : "Value2" } } } { "Key3" : { "Key3A" : { "Key3A1" : "Value3" } } }
To acquire the data for the Key1A key, use the following:
unnest col: myCol keys: 'Key1[Key1A]'
In the new column, the displayed value is the following:
{ "Key1A1" : "Value1" }
To unnest a third-layer value, use a transform similar to the following:
unnest col: myCol keys: 'Key2[Key2A][Key2A1]'
In the new column, this transform generates a value of Value2.
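The multi-level key syntax can be read as a path through nested maps. The following Python sketch shows one way to interpret such a path; parse_key_path and resolve are hypothetical helpers written for illustration, and the parsing rule (split on square brackets) is an assumption rather than the product's parser.

```python
import re


def parse_key_path(path):
    """Split a key such as 'Key1[Key1A]' into its segments: ['Key1', 'Key1A']."""
    return re.findall(r"[^\[\]]+", path)


def resolve(obj, path):
    """Walk nested dicts along the parsed segments; None if a key is absent."""
    current = obj
    for segment in parse_key_path(path):
        if not isinstance(current, dict) or segment not in current:
            return None
        current = current[segment]
    return current


record1 = {"Key1": {"Key1A": {"Key1A1": "Value1"}}}
record2 = {"Key2": {"Key2A": {"Key2A1": "Value2"}}}

print(resolve(record1, "Key1[Key1A]"))          # the nested map under Key1A
print(resolve(record2, "Key2[Key2A][Key2A1]"))  # the third-level value
```

A path that does not exist in a given record resolves to None, which corresponds to a missing value in the new column.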
Keys for Array data - single level
You can reference array elements using zero-based indexes.
NOTE: All references to Array keys must be bracketed. Array keys can be referenced by index number only.
Example array data:
["red","orange","yellow","green","blue","indigo","violet"]
unnest col: myCol keys:'[1]'
The above transform retrieves the value orange from the array.
unnest col: myCol keys:'[1]','[3]'
Returned values: orange and green.
Keys for Array data - multi-level
The following example nested Array data matches the structure of the Object data in the previous example:
[ [ "Item1", ["Item1A", ["Item1A1","Value1"] ] ], [ "Item2", ["Item2A", ["Item2A1","Value2"] ] ], [ "Item3", ["Item3A",["Item3A1","Value3"] ] ] ]
To unnest the value for Item2A:
unnest col:myCol keys:'[1][1][1]'
The value inserted into the new column is the following:
["Item2A1","Value2"]
To unnest from the third level:
unnest col:myCol keys:'[2][1][0]'
The inserted value is Item3A.
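To check an index path before using it, it can help to evaluate it by hand. The Python sketch below assumes that bracketed paths are applied left to right as ordinary zero-based list indexes; that evaluation rule and the resolve_index_path helper are assumptions for illustration only.

```python
import re


def resolve_index_path(array, path):
    """Apply each bracketed zero-based index in turn; None if out of range."""
    indexes = [int(i) for i in re.findall(r"\[(\d+)\]", path)]
    current = array
    for i in indexes:
        if not isinstance(current, list) or i >= len(current):
            return None
        current = current[i]
    return current


colors = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
nested = [
    ["Item1", ["Item1A", ["Item1A1", "Value1"]]],
    ["Item2", ["Item2A", ["Item2A1", "Value2"]]],
    ["Item3", ["Item3A", ["Item3A1", "Value3"]]],
]

print(resolve_index_path(colors, "[1]"))        # second element of the flat array
print(resolve_index_path(nested, "[1][1][1]"))  # pair stored alongside Item2A
print(resolve_index_path(nested, "[2][1][0]"))  # label at the third level
```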
pluck
- Set to true to remove values from source after they have been added to output columns.
- (Default) Set to false to leave source columns untouched.
Usage Notes:
Required? | Data Type |
---|---|
No | Boolean |
markLineage
If true, the names of new columns are prepended with the name of the source column. Example:
Source Column | Output Column |
---|---|
mySourceColumn | mySourceColumn_column1 |
Nested key references are appended to the column name:
Source Column | Key Value | Output Column |
---|---|---|
mySourceColumn | keys: '[Key1][Key2]' | mySourceColumn_Key1_Key2 |
NOTE: If your unnest transform does not change the number of rows, you can still access source row number information in the data grid, assuming it was still available when the transform was executed.
Usage Notes:
Required? | Data Type |
---|---|
No | Boolean |
Tip: For additional examples, see Common Tasks.
Examples
Example - Unnest an Object
You have the following dataset. The Sizes column contains Object data on available sizes.
Source:
ProdId | ProdName | Sizes |
---|---|---|
1001 | Hat | {'Small':'N','Medium':'Y','Large':'Y','Extra-Large':'Y'} |
1002 | Shirt | {'Small':'N','Medium':'Y','Large':'Y','Extra-Large':'N'} |
1003 | Pants | {'Small':'Y','Medium':'Y','Large':'Y','Extra-Large':'N'} |
Transform:
NOTE: Depending on the format of your source data, you might need to perform some replacements in the Sizes column so that its values are inferred as proper Object type values. The final format should look like the above.
If it is not inferred already, set the type of the Sizes column to Object:
settype col: Sizes type: 'Object'
Unnest the data into separate columns. The following prepends Sizes_ to the newly generated column name.
unnest col:Sizes keys:'Small','Medium','Large','Extra-Large' markLineage:true
You might find it useful to add pluck:true to the above transform. When added, values that are unnested are removed from the source, leaving only the values that weren't processed:
unnest col:Sizes keys:'Small','Medium','Large','Extra-Large' markLineage:true pluck:true
If all values have been processed, the Sizes column now contains a set of empty maps. You can use the following to determine if the length of the remaining data is longer than two characters. This transform is a good one to just preview:
derive type:single value:(LEN(Sizes) > 2) as:'len_Sizes'
If you sort the values in the generated column, you can review the true values to see if you need to modify your preceding unnest transform.
You can drop the source column:
drop col:Sizes
Results:
When you are finished, the dataset should look like the following:
ProdId | ProdName | Sizes_Small | Sizes_Medium | Sizes_Large | Sizes_Extra-Large |
---|---|---|---|---|---|
1001 | Hat | N | Y | Y | Y |
1002 | Shirt | N | Y | Y | N |
1003 | Pants | Y | Y | Y | N |
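For readers who want to sanity-check the recipe outside the application, the steps can be mirrored in plain Python. This is only a sketch: the len_Sizes check assumes that an empty map serializes to the two characters {}, which is why the recipe compares the length against 2.

```python
import json

rows = [
    {"ProdId": 1001, "ProdName": "Hat",
     "Sizes": {"Small": "N", "Medium": "Y", "Large": "Y", "Extra-Large": "Y"}},
    {"ProdId": 1002, "ProdName": "Shirt",
     "Sizes": {"Small": "N", "Medium": "Y", "Large": "Y", "Extra-Large": "N"}},
    {"ProdId": 1003, "ProdName": "Pants",
     "Sizes": {"Small": "Y", "Medium": "Y", "Large": "Y", "Extra-Large": "N"}},
]
keys = ["Small", "Medium", "Large", "Extra-Large"]

result = []
for row in rows:
    remaining = dict(row["Sizes"])                      # work on a copy
    out = {"ProdId": row["ProdId"], "ProdName": row["ProdName"]}
    for key in keys:
        # markLineage analogue: prefix with "Sizes_"; pluck analogue: pop the key.
        out[f"Sizes_{key}"] = remaining.pop(key, None)
    # Analogue of LEN(Sizes) > 2: True means unprocessed keys are left over.
    out["len_Sizes"] = len(json.dumps(remaining)) > 2
    result.append(out)

for out in result:
    print(out)
```

If any row reports True for len_Sizes, a key is missing from the keys list of the unnest step.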
Example - Unnest an array
The following example demonstrates differences between the unnest and the flatten transform, including how you use unnest to flatten array data based on specified keys.
- For more information, see Flatten Transform.
You have the following data on student test scores. Scores on individual tests are stored in the Scores array, and you need to be able to track each test on a uniquely identifiable row. This example has two goals:
- One row for each student test
- Unique identifier for each student-score combination
LastName | FirstName | Scores |
---|---|---|
Adams | Allen | [81,87,83,79] |
Burns | Bonnie | [98,94,92,85] |
Cannon | Charles | [88,81,85,78] |
Transform:
When the data is imported from CSV format, you must add a header transform and remove the quotes from the Scores column:
header
replace col:Scores with:'' on:`"` global:true
The following calculates the difference between the expected number of elements in the Scores array (4) and the actual number:
derive type:single value: (4 - ARRAYLEN(Scores)) as: 'numMissingTests'
Unique row identifier: The Scores array must be broken out into individual rows for each test. However, there is no unique identifier for the row to track individual tests. In theory, you could use the combination of LastName-FirstName-Scores values to do so, but if a student recorded the same score twice, your dataset would have duplicate rows. In the following transform, you create a parallel array called Tests, which contains an index array for the number of values in the Scores column. Index values start at 0:
derive type:single value:RANGE(0,ARRAYLEN(Scores)) as:'Tests'
Capture the original row number of each record using the SOURCEROWNUMBER function:
derive type:single value:SOURCEROWNUMBER() as:'orderIndex'
LastName | FirstName | Scores | Tests | orderIndex |
---|---|---|---|---|
Adams | Allen | [81,87,83,79] | [0,1,2,3] | 2 |
Burns | Bonnie | [98,94,92,85] | [0,1,2,3] | 3 |
Cannon | Charles | [88,81,85,78] | [0,1,2,3] | 4 |
Now, you want to bring together the Tests and Scores arrays into a single nested array using the ARRAYZIP function:
derive type:single value:ARRAYZIP([Tests,Scores])
LastName | FirstName | Scores | Tests | orderIndex | column1 |
---|---|---|---|---|---|
Adams | Allen | [81,87,83,79] | [0,1,2,3] | 2 | [[0,81],[1,87],[2,83],[3,79]] |
Burns | Bonnie | [98,94,92,85] | [0,1,2,3] | 3 | [[0,98],[1,94],[2,92],[3,85]] |
Cannon | Charles | [88,81,85,78] | [0,1,2,3] | 4 | [[0,88],[1,81],[2,85],[3,78]] |
With the flatten transform, you can unpack the nested array:
flatten col: column1
Each row now contains a single test-score pair. Break each pair out into separate columns using unnest:
unnest col:column1 keys:'[0]','[1]'
After you drop column1, which is no longer needed, you should rename the two generated columns:
rename mapping:[column_0,'TestNum']
rename mapping:[column_1,'TestScore']
The following uses orderIndex as an identifier for the student and the TestNum value to create the TestId column value:
derive type:single value: (orderIndex * 10) + TestNum as: 'TestId'
merge col:'TestId00','TestId'
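The array steps above (RANGE, ARRAYZIP, flatten, unnest, and the TestId derivation) can be approximated in Python to see what each one contributes. This is an illustrative sketch under the assumption that orderIndex holds the source row numbers 2, 3, and 4; it is not how the application executes the recipe.

```python
students = [
    {"LastName": "Adams", "FirstName": "Allen",
     "Scores": [81, 87, 83, 79], "orderIndex": 2},
    {"LastName": "Burns", "FirstName": "Bonnie",
     "Scores": [98, 94, 92, 85], "orderIndex": 3},
    {"LastName": "Cannon", "FirstName": "Charles",
     "Scores": [88, 81, 85, 78], "orderIndex": 4},
]

rows = []
for student in students:
    # RANGE(0, ARRAYLEN(Scores)) analogue: one index per score.
    tests = list(range(len(student["Scores"])))
    # ARRAYZIP analogue: pair each index with its score.
    pairs = [list(pair) for pair in zip(tests, student["Scores"])]
    # flatten analogue: one output row per pair.
    for pair in pairs:
        # unnest keys '[0]','[1]' analogue: split the pair into two columns.
        test_num, test_score = pair[0], pair[1]
        test_id = student["orderIndex"] * 10 + test_num
        rows.append({
            "TestId": f"TestId00{test_id}",   # merge with the 'TestId00' prefix
            "LastName": student["LastName"],
            "FirstName": student["FirstName"],
            "TestNum": test_num,
            "TestScore": test_score,
        })

for row in rows:
    print(row)
```

Because orderIndex is multiplied by 10, the identifier stays unique only while each student has at most ten tests.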
Extending: You might want to generate some summary statistical information on this dataset. For example, you might be interested in calculating each student's average test score. This step requires figuring out how to properly group the test values. In this case, you cannot group by the LastName value alone, and when the recipe is run at scale, there might be collisions between first names. So, you might need to create a kind of primary key using the following:
merge col:'LastName','FirstName' with:'-' as:'studentId'
derive type:single value:AVERAGE(TestScore) group:studentId as:'avg_TestScore'
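The grouping logic can be checked by hand with a short Python sketch. The studentId format (LastName, hyphen, FirstName) follows the merge step above; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (LastName, FirstName, TestScore) for each flattened test row.
test_rows = [
    ("Adams", "Allen", 81), ("Adams", "Allen", 87),
    ("Adams", "Allen", 83), ("Adams", "Allen", 79),
    ("Burns", "Bonnie", 98), ("Burns", "Bonnie", 94),
    ("Burns", "Bonnie", 92), ("Burns", "Bonnie", 85),
    ("Cannon", "Charles", 88), ("Cannon", "Charles", 81),
    ("Cannon", "Charles", 85), ("Cannon", "Charles", 78),
]

scores_by_student = defaultdict(list)
for last_name, first_name, score in test_rows:
    student_id = f"{last_name}-{first_name}"   # merge ... with:'-' analogue
    scores_by_student[student_id].append(score)

# AVERAGE(TestScore) group:studentId analogue.
averages = {sid: mean(scores) for sid, scores in scores_by_student.items()}
print(averages)
```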
Results:
After you drop unnecessary columns and move your columns around, the dataset should look like the following:
TestId | LastName | FirstName | TestNum | TestScore | studentId | avg_TestScore |
---|---|---|---|---|---|---|
TestId0020 | Adams | Allen | 0 | 81 | Adams-Allen | 82.5 |
TestId0021 | Adams | Allen | 1 | 87 | Adams-Allen | 82.5 |
TestId0022 | Adams | Allen | 2 | 83 | Adams-Allen | 82.5 |
TestId0023 | Adams | Allen | 3 | 79 | Adams-Allen | 82.5 |
TestId0030 | Burns | Bonnie | 0 | 98 | Burns-Bonnie | 92.25 |
TestId0031 | Burns | Bonnie | 1 | 94 | Burns-Bonnie | 92.25 |
TestId0032 | Burns | Bonnie | 2 | 92 | Burns-Bonnie | 92.25 |
TestId0033 | Burns | Bonnie | 3 | 85 | Burns-Bonnie | 92.25 |
TestId0040 | Cannon | Charles | 0 | 88 | Cannon-Charles | 83 |
TestId0041 | Cannon | Charles | 1 | 81 | Cannon-Charles | 83 |
TestId0042 | Cannon | Charles | 2 | 85 | Cannon-Charles | 83 |
TestId0043 | Cannon | Charles | 3 | 78 | Cannon-Charles | 83 |
Example - extracting key values from car data and then unnesting into separate columns
- extractkv - Extracts key-value pairs from a source string. See Extract Transform.
- unnest - Unpacks nested data in separate rows and columns. See Unnest Transform.
Source:
You have the following information on used cars. The VIN column contains vehicle identifiers, and the Properties column contains key-value pairs describing characteristics of each vehicle. You want to unpack this data into separate columns.
VIN | Properties |
---|---|
XX3 JT4522 | year=2004,make=Subaru,model=Impreza,color=green,mileage=125422,cost=3199 |
HT4 UJ9122 | year=2006,make=VW,model=Passat,color=silver,mileage=102941,cost=4599 |
KC2 WZ9231 | year=2009,make=GMC,model=Yukon,color=black,mileage=68213,cost=12899 |
LL8 UH4921 | year=2011,make=BMW,model=328i,color=brown,mileage=57212,cost=16999 |
Transform:
Add the following transform, which identifies all of the key values in the column as beginning with alphabetical characters.
- The valueafter string identifies where the corresponding value begins after the key.
- The delimiter string indicates the end of each key-value pair.
extractkv col:Properties key:`{alpha}+` valueafter:`=` delimiter:`,`
Use the unnest transform to unpack this mapped data. In the following, each key is specified, which results in separate columns headed by the named key:
unnest col:extractkv_Properties keys:'year','make','model','color','mileage','cost'
When you drop the unnecessary Properties columns, the dataset now looks like the following:
VIN | year | make | model | color | mileage | cost |
---|---|---|---|---|---|---|
XX3 JT4522 | 2004 | Subaru | Impreza | green | 125422 | 3199 |
HT4 UJ9122 | 2006 | VW | Passat | silver | 102941 | 4599 |
KC2 WZ9231 | 2009 | GMC | Yukon | black | 68213 | 12899 |
LL8 UH4921 | 2011 | BMW | 328i | brown | 57212 | 16999 |
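The two-step pattern in this example (build a map from the string, then pull named keys into columns) can be sketched in Python. The extract_kv helper is hypothetical; it approximates the key pattern by accepting only keys made up entirely of alphabetical characters, and it keeps all values as strings.

```python
def extract_kv(text, value_after="=", delimiter=","):
    """Build a map from a delimited string of key=value pairs."""
    pairs = {}
    for item in text.split(delimiter):
        key, separator, value = item.partition(value_after)
        # Keep only pairs that have a separator and an alphabetical key.
        if separator and key.isalpha():
            pairs[key] = value
    return pairs


properties = "year=2004,make=Subaru,model=Impreza,color=green,mileage=125422,cost=3199"
keys = ["year", "make", "model", "color", "mileage", "cost"]

mapped = extract_kv(properties)                        # extractkv analogue
row = {"VIN": "XX3 JT4522",
       **{key: mapped.get(key) for key in keys}}       # unnest analogue
print(row)
```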