Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r095

D toc

Excerpt

Extracting one or more values from within a column of values can turn data into meaningful and discrete information. This section describes how to extract column data, the methods for which may vary depending on the data type.

...

Extract vs. Split

Extract and split transformations do not do the same thing:

...

MethodDescription
By selectionSelect part of a value in the data grid to prompt a series of suggestions on what to do with the data. Typically, extract options are near the top of the suggestions when you select part of a value.
By column menuFrom the menu to the right of the column, select Extract and a sub-menu item to begin configuring a transformation. See Column Menus.
By Transformer toolbarAt the top of the data grid, click the Extract icon in the Transformer toolbar to begin configuring extract transformations. See Transformer Toolbar.
By Search panelIn the Search panel, enter extract to build a transformation from scratch. See Search Panel.

Extract text or patterns

A primary use of extraction is to remove literal or patterned values of text from a column of values. Suppose your dataset included a column of LinkedIn updates. You can use one of the following methods to extract keywords from these values. 

...

The following functions can be used to extract values from a Datetime column, as long as the values are present in the formatted date:

Function NameDescription
DAY Function

D excerpt include
pageDAY Function
nopaneltrue

MONTH Function

...

D excerpt include
pageMONTH Function

...

nopanel

...

You can also reformat the whole Datetime column using the DATEFORMAT function. The following reformats the column to show only the two-digit year:

D trans
Typestep
p01NameColumns
p01ValuemyDate
p02NameFormula
p02ValueDATEFORMAT(myDate, "yy")
SearchTermEdit column with formula

Extract numeric values

You can extract numerical data from text values. In the following example, the first number is extracted from the address column, which would correspond to extracting the street number for the address:

...

true

...

YEAR Function

D excerpt include
pageYEAR Function
nopaneltrue

HOUR Function

D excerpt include
pageHOUR Function
nopaneltrue

MINUTE Function

D excerpt include
pageMINUTE Function
nopaneltrue

SECOND Function

D excerpt include
pageSECOND Function
nopaneltrue

You can also reformat the whole Datetime column using the DATEFORMAT function. The following reformats the column to show only the two-digit year:

D trans
Typestep
p01NameColumn to extract fromColumns
p01ValueaddressmyDate
p02NameOptionFormula
p02ValueNumbers
p03NameNumber of matches to extract
SearchTermExtract patterns

Empty values in this new column might indicate a formatting problem with the address.

Tip

Tip: If you set the number of patterns to extract to 2 for the address column, you might extract apartment or suite information.

Extract components of a URL

URL components

Using functions, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL column:

D trans
p03ValuemyDomain
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueDOMAIN(myURL)
p03NameNew column name
SearchTermNew formula

In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.

The following functions can be used to extract values from a set of URLs:

Query parameters

You can extract query parameter values from an URL. The following example extracts the store_id value from the storeURL field value:

D trans
p03Valuestore_id
Typestep
p01NameColumn to extract from
p01ValuestoreURL
p02NameOption
p02ValueHTTP Query strings
p03NameFields to extract
SearchTermExtract patterns

Extract object values

If your data includes sets of arrays, you can extract array elements into columns for each key, with the values written to each key column.

Suppose your restaurant dataset includes a set of characteristics in the restFeatures column in the following JSON format:

Code Block
{
  "Credit": "Y",
  "Accessible": "Y",
  "Restrooms": "Y",
  "EatIn": "Y",
  "ToGo": "N",
  "AlcoholBeer": "Y",
  "AlcoholHard": "N",
  "TotalTables": "10",
  "TotalTableSeats": "36",
  "Counter": "Y",
  "CounterSeats": "8"
}

You can use the following transformation to extract the values from TotalTableSeats and CounterSeats into separate columns:

D trans
p03ValueCounterSeats
Typestep
p01NameColumn
p01ValuerestFeatures
p02NamePaths to elements - 1
p02ValueTotalTableSeats
p03NamePaths to elements - 2
p04ValueSelected
p04NameInclude original column name
SearchTermUnnest Objects into columns

After the above is executed, you can perform a simple sum of the TotalTableSeats and CounterSeats columns to determine the total number of seats in the restaurant.

Extract array values

In some cases, your data may contain arrays of repeated key-value pairs, where each pair would exist on a separate line. Suppose you have a column called, Events, which contains date and time information about the musician described in the same row of data. The Events column might look like the following:

Code Block
[{"Date":"2018-06-15","Time":"19:00"},{"Date":"2018-06-17","Time":"19:00"},{"Date":"2018-06-19","Time":"20:00"},{"Date":"2018-06-20","Time":"20:00"}]

The following transformation creates a separate row for each entry in the Events column, populating the other fields in the new rows with the data from the original row:

 

Info

NOTE: This type of transformation can significantly increase the size of your dataset.

D trans
Typestep
p01NameColumn
p01ValueEvents
SearchTermExpand arrays into rows

Extract Values into a List

You can also extract sets of values into an array list of values. 

Tip

Tip: This transformation is useful for extracting types or patterns of information from a single column.

Extract matches into array

Using 

D s item
itempatterns
, you can extract the values of the column to form a new column of arrays. The following example shows the usage of {any} pattern to extract the cell values and form a new array column.

Transformation:

D trans
p03Value`,`
Typestep
p01NameColumn
p01Valueproduct
p02NamePattern matching elements in the list
p02Value`{any}`
p06Value1
p03NameDelimiter separating each element
SearchTermExtract matches into Array

Results:

...

Extract hashtags

Suppose you need to extract the hashtags from customer tweets to another column. In such cases, you can use the {hashtag} 

D s item
itempattern
 to extract all hashtag values from a customer's tweets into a new column.

Source:

The following dataset contains a customer tweets across different locations.  

...

Excited to announce that we’ve transitioned Wrangler from a hybrid desktop application to a completely cloud-based service! #dataprep #businessintelligence #CommitToCleanData # London

...

Learnt more about the importance of identifying issues in your data—early and often #CommitToCleanData #predictivetransformations #realbusinessintelligence

...

Clean data is the foundation of your analysis. Learn more about what we consider the five tenets of sound #dataprep, starting with #1a prioritizing and setting targets.  #startwiththeuser #realbusinessintelligence #Paris

...

Learn how #NewYorklife

onboarded as part of their #bigdata  #dataprep initiative to unlock hidden insights and make them accessible across departments. 

...

How can you quickly determine the number of times a user ID appears in your data?#dataprep #pivot #aggregation#machinelearning initiatives #SFO

Transformation:

The following transformation extracts the hashtag messages from customer tweets.

...

DATEFORMAT(myDate, "yy")
SearchTermEdit column with formula

Extract numeric values

You can extract numerical data from text values. In the following example, the first number is extracted from the address column, which would correspond to extracting the street number for the address:

D trans
p03Value1
Typestep
p01NameColumn to extract from
p01Valueaddress
p02NameOption
p02ValueNumbers
p03NameNumber of matches to extract
SearchTermExtract patterns

Empty values in this new column might indicate a formatting problem with the address.

Tip

Tip: If you set the number of patterns to extract to 2 for the address column, you might extract apartment or suite information.

Extract components of a URL

URL components

Using functions, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL column:

D trans
p03ValuemyDomain
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueDOMAIN(myURL)
p03NameNew column name
SearchTermNew formula

In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.

The following functions can be used to extract values from a set of URLs:

Function NameDescription
HOST Function

D excerpt include
pageHOST Function
nopaneltrue

DOMAIN Function

D excerpt include
pageDOMAIN Function
nopaneltrue

SUBDOMAIN Function

D excerpt include
pageSUBDOMAIN Function
nopaneltrue

SUFFIX Function

D excerpt include
pageSUFFIX Function
nopaneltrue

URLPARAMS Function

D excerpt include
pageURLPARAMS Function
nopaneltrue

Query parameters

You can extract query parameter values from an URL. The following example extracts the store_id value from the storeURL field value:

D trans
p03Valuestore_id
Typestep
p01NameColumn to extract from
p01ValuestoreURL
p02NameOption
p02ValueHTTP Query strings
p03NameFields to extract
SearchTermExtract patterns

Extract object values

If your data includes sets of arrays, you can extract array elements into columns for each key, with the values written to each key column.

Suppose your restaurant dataset includes a set of characteristics in the restFeatures column in the following JSON format:

Code Block
{
  "Credit": "Y",
  "Accessible": "Y",
  "Restrooms": "Y",
  "EatIn": "Y",
  "ToGo": "N",
  "AlcoholBeer": "Y",
  "AlcoholHard": "N",
  "TotalTables": "10",
  "TotalTableSeats": "36",
  "Counter": "Y",
  "CounterSeats": "8"
}

You can use the following transformation to extract the values from TotalTableSeats and CounterSeats into separate columns:

D trans
p03ValueCounterSeats
Typestep
p01NameColumn
p01ValuerestFeatures
p02NamePaths to elements - 1
p02ValueTotalTableSeats
p03NamePaths to elements - 2
p04ValueSelected
p04NameInclude original column name
SearchTermUnnest Objects into columns

After the above is executed, you can perform a simple sum of the TotalTableSeats and CounterSeats columns to determine the total number of seats in the restaurant.

Extract array values

In some cases, your data may contain arrays of repeated key-value pairs, where each pair would exist on a separate line. Suppose you have a column called, Events, which contains date and time information about the musician described in the same row of data. The Events column might look like the following:

Code Block
[{"Date":"2018-06-15","Time":"19:00"},{"Date":"2018-06-17","Time":"19:00"},{"Date":"2018-06-19","Time":"20:00"},{"Date":"2018-06-20","Time":"20:00"}]

The following transformation creates a separate row for each entry in the Events column, populating the other fields in the new rows with the data from the original row:

 

Info

NOTE: This type of transformation can significantly increase the size of your dataset.

D trans
Typestep
p01NameColumn
p01ValueEvents
SearchTermExpand arrays into rows

Extract Values into a List

You can also extract sets of values into an array list of values. 

Tip

Tip: This transformation is useful for extracting types or patterns of information from a single column.

Extract matches into array

Using 

D s item
itempatterns
, you can extract the values of the column to form a new column of arrays. The following example shows the usage of {any} pattern to extract the cell values and form a new array column.

Transformation:

D trans
p03Value`,`
Typestep
p01NameColumn
p01Value

...

product
p02NamePattern matching elements in the list
p02Value`{

...

Results:

...

["#dataprep", "#businessintelligence", "#CommitToCleanData", " # London"]

...

["#CommitToCleanData",  "#predictivetransformations", "#realbusinessintelligence", "0"]

...

["#dataprep", "#startwiththeuser","#realbusinessintelligence", "# Paris"]

...

["#NewYorklife", "dataprep", "bigdata", "0"]

...

any}`
p06Value1
p03Name

...

Delimiter separating each element
SearchTermExtract matches into Array

Results:

BeforeAfter
socks, socks, socks["socks", "socks", "socks"]
pants, pants["pants", "pants"]

Extract hashtags

Include Page
EXAMPLE - Extract Values
EXAMPLE - Extract Values

D s also
inCQLtrue
label(label = "cleanse")