Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next


D toc

Extracting one or more values from within a column of values can turn data into meaningful and discrete information. This section describes how to extract column data, the methods for which may vary depending on the data type. 

Extract vs. Split

Extract and split transformations do not do the same thing:

...

The following example extracts all words that begin with # in the msg_LinkedIn column:

d-

...

trans
RawWrangletrue
p03Value`\#{alphanum-underscore}+`
Typestep
WrangleTextextractpatterns type: custom col: msg_LinkedIn on: `\#{alphanum-underscore}+` limit: 50
D trans
p03Value`\#{alphanum-underscore}+`
Typestep
p01NameColumn to extract from
p01Valuemsg_LinkedIn
p02NameOption
p02ValueCustom text or pattern
p03NameText to extract
p04Value50
p04NameNumber of matches to extract
SearchTermExtract text or pattern

...

Tip

Tip: If you set the number of patterns to extract to 2 for the address column, you might extract apartment or suite information.

Extract components of a URL

URL components

Using functions, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL column:

D trans
p03ValuemyDomain
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueDOMAIN(myURL)
p03NameNew column name
SearchTermNew formula

In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.

The following functions can be used to extract values from a set of URLs:

Query parameters

You can extract query parameter values from an URL. The following example extracts the store_id value from the storeURL field value:

D trans
p03Valuestore_id
Typestep
p01NameColumn to extract from
p01ValuestoreURL
p02NameOption
p02ValueHTTP Query strings
p03NameFields to extract
SearchTermExtract patterns

Extract object values

If your data includes sets of arrays, you can extract array elements into columns for each key, with the values written to each key column.

...

D trans
Typestep
p01NameColumn
p01ValueEvents
SearchTermExpand arrays into rows

Extract

...

URL components

...

Values into a List

You can also extract sets of values into an array list of values. 

Tip

Tip: This transformation is useful for extracting types or patterns of information from a single column.

Extract matches into array

Using 

D s item
itempatterns
, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL columnthe values of the column to form a new column of arrays. The following example shows the usage of {any} pattern to extract the cell values and form a new array column.

Transformation:

You can extract query parameter values from an URL. The following example extracts the store_id value from the storeURL field value:

D trans
DOMAIN(myURL)
p03ValuemyDomain`,`
Typestep
p01NameFormula typeColumn
p01ValueSingle row formulaproduct
p02NameFormulaPattern matching elements in the list
p02Value`{any}`
p06Value1
p03NameNew column name
SearchTermNew formula

In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.

The following functions can be used to extract values from a set of URLs:

Query parameters

Delimiter separating each element
SearchTermExtract matches into Array

Results:

BeforeAfter
socks, socks, socks["socks", "socks", "socks"]
pants, pants["pants", "pants"]
Excerpt

Extract hashtags

Suppose you need to extract the hashtags from customer tweets to another column. In such cases, you can use the {hashtag} 

D s item
itempattern
 to extract all hashtag values from a customer's tweets into a new column.

Source:

The following dataset contains a customer tweets across different locations.  

User NameLocationCustomer tweets
JamesU.K

Excited to announce that we’ve transitioned Wrangler from a hybrid desktop application to a completely cloud-based service! #dataprep #businessintelligence #CommitToCleanData # London

MarkBerlin

Learnt more about the importance of identifying issues in your data—early and often #CommitToCleanData #predictivetransformations #realbusinessintelligence

CatherineParis

Clean data is the foundation of your analysis. Learn more about what we consider the five tenets of sound #dataprep, starting with #1a prioritizing and setting targets.  #startwiththeuser #realbusinessintelligence #Paris

DaveNew York

Learn how #NewYorklife

onboarded as part of their #bigdata  #dataprep initiative to unlock hidden insights and make them accessible across departments. 

ChristySan Francisco

How can you quickly determine the number of times a user ID appears in your data?#dataprep #pivot #aggregation#machinelearning initiatives #SFO


Transformation:

The following transformation extracts the hashtag messages from customer tweets.

D trans
p03Value

...

Hashtag tweets
Typestep
p01NameColumn

...

p01Value

...

customer_tweets
p02Name

...

Pattern matching elements in the list
p02Value

...

`{hashtag}`
p03Name

...

New column name
SearchTermExtract

...

 

...

matches into Array

Results:

User NameLocationHashtag tweets
JamesU.K

["#dataprep", "#businessintelligence", "#CommitToCleanData", " # London"]

MarkBerlin

["#CommitToCleanData",  "#predictivetransformations", "#realbusinessintelligence", "0"]

CatherineParis

["#dataprep", "#startwiththeuser","#realbusinessintelligence", "# Paris"]

DaveNew York

["#NewYorklife", "dataprep", "bigdata", "0"]

ChristySanFrancisco[ "dataprep", "#pivot", "#aggregation", "#machinelearning"]