Excerpt |
---|
Extracts a set of values based on a specified pattern from a source column of any data type. The generated column contains an array of occurrences of the specified pattern. |
While the new column contains array data, the data type of the new column is sometimes inferred as String.
Your source column (myWidgetInventory
) is formatted in the following manner:
Code Block |
---|
{ "red":"100","white":"1300","blue":"315","purple":"55"} |
The following step extracts the raw inventory contents of each color:
D code |
---|
extractlist col: myWidgetInventory on:`{digit}+` |
Output: The generated column contains data that looks like the following array:
Code Block |
---|
["100","1300","315","55"] |
D code |
---|
extractlist: col:column_ref on:string_literal_pattern delimiter:string_literal_pattern [quote:'quoted_string'] [as:'new_column_name'] |
Token | Required? | Data Type | Description |
---|
extractlist | Y | transform | Name of the transform |
col | Y | string | Source column name |
on | Y | string | String literal or pattern that identifies the values to extract from the source column |
delimiter | Y | string | String literal or pattern that identifies the separator between the values to extract |
quote | N | string | Specifies a quoted object that is omitted from matching delimiters |
as | N | string | Name of the newly generated column |
Identifies the column to which to apply the transform. You can specify only one column.
Required? | Data Type |
---|
Yes | String (column name) |
Tip |
---|
Tip: You can insert the Unicode equivalent character for this parameter value using a regular expression of the form /\uHHHH/ . For example, /\u0013/ represents Unicode character 0013 (carriage return). For more information, see Supported Special Regular Expression Characters. |
For the extractlist
tranform, all instances that match this pattern in the source column are extracted into the array list in the new column. Each occurrence in the generated array corresponds to an individual instance in the source; the new column can contain duplicate values.
To create array elements based only on the delimiter
parameter, set the following regular expression:
Required? | Data Type |
---|
Yes | String (literal, regular expression, or ) |
Include Page |
---|
| delimiter Parameter |
---|
| delimiter Parameter |
---|
|
For this transform, this parameter defines the pattern that separates the values that you want to extract into the array.
Tip |
---|
Tip: You can insert the Unicode equivalent character for this parameter value using a regular expression of the form /\uHHHH/ . For example, /\u0013/ represents Unicode character 0013 (carriage return). For more information, see Supported Special Regular Expression Characters. |
Required? | Data Type |
---|
Yes | String (literal, regular expression, or ) |
D code |
---|
extractlist col: MySourceValues on:`{alpha}+` delimiter:';' quote:'\"' |
Output: Extracts from the MySourceValues column each instance of a string value that occurs before the delimiter. Values between double-quotes are considered string literals and are not processed according to the delimiters defined in the transform.
Include Page |
---|
| quote Parameter |
---|
| quote Parameter |
---|
|
Required? | Data Type |
---|
No | String |
Required? | Data Type |
---|
No | String (column name) |
Include Page |
---|
| EXAMPLE - Extract Values |
---|
| EXAMPLE - Extract Values |
---|
|
Source:
In this example, a list of URLs identifies the items in the shopping carts of visitors to your web site. You want to extract the shopping cart information embedded in the query parameters of the URL.
Username | cartURL |
---|
joe.robinson | http://example123.com/cart.asp?prodid=1001&qty=2 |
steph.schmidt | http://example123.com/cart.asp?prodid=1005&qty=4 |
jack.holmes | http://example123.com/cart.asp?prodid=2102&qty=1 |
tina.jones | http://example123.com/cart.asp?prodid=10412&qty=2 |
Transformation:
The following transformation extracts the list of query values from the URL. Note that the equals sign is included in the matching pattern so that you don't accidentally pick up numeric values from the non-parameter part of the URL:
D trans |
---|
RawWrangle | true |
---|
Type | step |
---|
WrangleText | extractlist col:cartURL on:`={digit}+` |
---|
p01Name | Column |
---|
p01Value | cartURL |
---|
p02Name | Pattern matching elements in list |
---|
p02Value | `=[digit]+` |
---|
SearchTerm | Extract matches into Array |
---|
|
The two query parameter values have been extracted into an array of values, including the equals sign, which must be removed:
D trans |
---|
RawWrangle | true |
---|
p03Value | '' |
---|
Type | step |
---|
WrangleText | replace col:cartURL with: '' on: `=` global: true |
---|
p01Name | Column |
---|
p01Value | cartURL |
---|
p02Name | Find |
---|
p02Value | '=' |
---|
p03Name | Replace with |
---|
p04Value | true |
---|
p04Name | Match all occurrences |
---|
SearchTerm | Replace text or pattern |
---|
|
You can now unnest these values into separate columns:
D trans |
---|
RawWrangle | true |
---|
Type | step |
---|
WrangleText | unnest col:extractlist_cartURL keys:'[0]','[1]' |
---|
p01Name | Column |
---|
p01Value | extractlist_cartURL |
---|
p02Name | Paths to elements |
---|
p02Value | '[0]','[1]' |
---|
SearchTerm | Unnest Objects into columns |
---|
|
After you rename the two columns to prodId
and Qty
, you can delete the column generated by the first transformation.
Results:
Username | cartURL | prodId | Qty |
---|
joe.robinson | http://example123.com/cart.asp?prodid=1001&qty=2 | 1001 | 2 |
steph.schmidt | http://example123.com/cart.asp?prodid=1005&qty=4 | 1005 | 4 |
jack.holmes | http://example123.com/cart.asp?prodid=2102&qty=1 | 2102 | 1 |
tina.jones | http://example123.com/cart.asp?prodid=10412&qty=2 | 10412 | 2 |
Include Page |
---|
| EXAMPLE - Extractlist Transform |
---|
| EXAMPLE - Extractlist Transform |
---|
|
D s also |
---|
label | wrangle_transform_extractlist |
---|
|