D toc |
---|
Excerpt |
---|
Extracting one or more values from within a column of values can turn data into meaningful and discrete information. This section describes how to extract column data, the methods for which may vary depending on the data type. |
...
Extract vs. Split
Extract and split transformations do not do the same thing:
...
Method | Description |
---|---|
By selection | Select part of a value in the data grid to prompt a series of suggestions on what to do with the data. Typically, extract options are near the top of the suggestions when you select part of a value. |
By column menu | From the menu to the right of the column, select Extract and a sub-menu item to begin configuring a transformation. See Column Menus. |
By Transformer toolbar | At the top of the data grid, click the Extract icon in the Transformer toolbar to begin configuring extract transformations. See Transformer Toolbar. |
By Search panel | In the Search panel, enter extract to build a transformation from scratch. See Search Panel. |
Extract text or patterns
A primary use of extraction is to remove literal or patterned values of text from a column of values. Suppose your dataset included a column of LinkedIn updates. You can use one of the following methods to extract keywords from these values.
...
The following functions can be used to extract values from a Datetime column, as long as the values are present in the formatted date:
Function Name | Description | ||||||
---|---|---|---|---|---|---|---|
DAY Function |
| ||||||
MONTH Function |
...
|
...
|
...
You can also reformat the whole Datetime column using the DATEFORMAT
function. The following reformats the column to show only the two-digit year:
D trans | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Extract numeric values
You can extract numerical data from text values. In the following example, the first number is extracted from the address
column, which would correspond to extracting the street number for the address:
...
|
...
YEAR Function |
| ||||||
HOUR Function |
| ||||||
MINUTE Function |
| ||||||
SECOND Function |
|
You can also reformat the whole Datetime column using the DATEFORMAT
function. The following reformats the column to show only the two-digit year:
D trans | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Empty values in this new column might indicate a formatting problem with the address.
Tip |
---|
Tip: If you set the number of patterns to extract to |
Extract components of a URL
URL components
Using functions, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL
column:
D trans | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.
The following functions can be used to extract values from a set of URLs:
Query parameters
You can extract query parameter values from an URL. The following example extracts the store_id
value from the storeURL
field value:
D trans | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Extract object values
If your data includes sets of arrays, you can extract array elements into columns for each key, with the values written to each key column.
Suppose your restaurant dataset includes a set of characteristics in the restFeatures
column in the following JSON format:
Code Block |
---|
{
"Credit": "Y",
"Accessible": "Y",
"Restrooms": "Y",
"EatIn": "Y",
"ToGo": "N",
"AlcoholBeer": "Y",
"AlcoholHard": "N",
"TotalTables": "10",
"TotalTableSeats": "36",
"Counter": "Y",
"CounterSeats": "8"
} |
You can use the following transformation to extract the values from TotalTableSeats
and CounterSeats
into separate columns:
D trans | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
After the above is executed, you can perform a simple sum of the TotalTableSeats
and CounterSeats
columns to determine the total number of seats in the restaurant.
Extract array values
In some cases, your data may contain arrays of repeated key-value pairs, where each pair would exist on a separate line. Suppose you have a column called, Events
, which contains date and time information about the musician described in the same row of data. The Events
column might look like the following:
Code Block |
---|
[{"Date":"2018-06-15","Time":"19:00"},{"Date":"2018-06-17","Time":"19:00"},{"Date":"2018-06-19","Time":"20:00"},{"Date":"2018-06-20","Time":"20:00"}] |
The following transformation creates a separate row for each entry in the Events
column, populating the other fields in the new rows with the data from the original row:
Info |
---|
NOTE: This type of transformation can significantly increase the size of your dataset. |
D trans | ||||||||
---|---|---|---|---|---|---|---|---|
|
Extract Values into a List
You can also extract sets of values into an array list of values.
Tip |
---|
Tip: This transformation is useful for extracting types or patterns of information from a single column. |
Extract matches into array
Using
D s item | ||
---|---|---|
|
Transformation:
D trans | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Results:
...
Extract hashtags
Suppose you need to extract the hashtags from customer tweets to another column. In such cases, you can use the {hashtag}
D s item | ||
---|---|---|
|
Source:
The following dataset contains a customer tweets across different locations.
...
Excited to announce that we’ve transitioned Wrangler from a hybrid desktop application to a completely cloud-based service! #dataprep #businessintelligence #CommitToCleanData # London
...
Learnt more about the importance of identifying issues in your data—early and often #CommitToCleanData #predictivetransformations #realbusinessintelligence
...
Clean data is the foundation of your analysis. Learn more about what we consider the five tenets of sound #dataprep, starting with #1a prioritizing and setting targets. #startwiththeuser #realbusinessintelligence #Paris
...
Learn how #NewYorklife
onboarded as part of their #bigdata #dataprep initiative to unlock hidden insights and make them accessible across departments.
...
How can you quickly determine the number of times a user ID appears in your data?#dataprep #pivot #aggregation#machinelearning initiatives #SFO
Transformation:
The following transformation extracts the hashtag messages from customer tweets.
...
|
Extract numeric values
You can extract numerical data from text values. In the following example, the first number is extracted from the address
column, which would correspond to extracting the street number for the address:
D trans | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Empty values in this new column might indicate a formatting problem with the address.
Tip |
---|
Tip: If you set the number of patterns to extract to |
Extract components of a URL
URL components
Using functions, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL
column:
D trans | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.
The following functions can be used to extract values from a set of URLs:
Function Name | Description | ||||||
---|---|---|---|---|---|---|---|
HOST Function |
| ||||||
DOMAIN Function |
| ||||||
SUBDOMAIN Function |
| ||||||
| |||||||
URLPARAMS Function |
|
Query parameters
You can extract query parameter values from an URL. The following example extracts the store_id
value from the storeURL
field value:
D trans | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Extract object values
If your data includes sets of arrays, you can extract array elements into columns for each key, with the values written to each key column.
Suppose your restaurant dataset includes a set of characteristics in the restFeatures
column in the following JSON format:
Code Block |
---|
{
"Credit": "Y",
"Accessible": "Y",
"Restrooms": "Y",
"EatIn": "Y",
"ToGo": "N",
"AlcoholBeer": "Y",
"AlcoholHard": "N",
"TotalTables": "10",
"TotalTableSeats": "36",
"Counter": "Y",
"CounterSeats": "8"
} |
You can use the following transformation to extract the values from TotalTableSeats
and CounterSeats
into separate columns:
D trans | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
After the above is executed, you can perform a simple sum of the TotalTableSeats
and CounterSeats
columns to determine the total number of seats in the restaurant.
Extract array values
In some cases, your data may contain arrays of repeated key-value pairs, where each pair would exist on a separate line. Suppose you have a column called, Events
, which contains date and time information about the musician described in the same row of data. The Events
column might look like the following:
Code Block |
---|
[{"Date":"2018-06-15","Time":"19:00"},{"Date":"2018-06-17","Time":"19:00"},{"Date":"2018-06-19","Time":"20:00"},{"Date":"2018-06-20","Time":"20:00"}] |
The following transformation creates a separate row for each entry in the Events
column, populating the other fields in the new rows with the data from the original row:
Info |
---|
NOTE: This type of transformation can significantly increase the size of your dataset. |
D trans | ||||||||
---|---|---|---|---|---|---|---|---|
|
Extract Values into a List
You can also extract sets of values into an array list of values.
Tip |
---|
Tip: This transformation is useful for extracting types or patterns of information from a single column. |
Extract matches into array
Using
D s item | ||
---|---|---|
|
Transformation:
D trans | |||||||
---|---|---|---|---|---|---|---|
|
...
|
...
Results:
...
["#dataprep", "#businessintelligence", "#CommitToCleanData", " # London"]
...
["#CommitToCleanData", "#predictivetransformations", "#realbusinessintelligence", "0"]
...
["#dataprep", "#startwiththeuser","#realbusinessintelligence", "# Paris"]
...
["#NewYorklife", "dataprep", "bigdata", "0"]
...
|
...
|
Results:
Before | After |
---|---|
socks, socks, socks | ["socks", "socks", "socks"] |
pants, pants | ["pants", "pants"] |
Extract hashtags
Include Page | ||||
---|---|---|---|---|
|
D s also | ||||
---|---|---|---|---|
|