Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0712

D toc

Extracting one or more values from within a column of values can turn data into meaningful and discrete information. This section describes how to extract column data, the methods for which may vary depending on the data type. 

Extract vs. Split

Extract and split transformations do not do the same thing:

  • split transformation separates a single column into one or more separate columns based on one or more values in the source column that identify where the data should be split. These delimiters can be determined by the application or specified by the user when defining the transformation.
  • An extract transformation matches literal or pattern values from a source column and stores it in a separate column. 

    Info

    NOTE: The source column is untouched by extract transformations.

Extract methods

In the Transformer page, you can use the following methods to extract values: 

MethodDescription
By selectionSelect part of a value in the data grid to prompt a series of suggestions on what to do with the data. Typically, extract options are near the top of the suggestions when you select part of a value.
By column menuFrom the menu to the right of the column, select Extract and a sub-menu item to begin configuring a transformation. See Column Menus.
By Transformer toolbarAt the top of the data grid, click the Extract icon in the Transformer toolbar to begin configuring extract transformations. See Transformer Toolbar.
By Search panelIn the Search panel, enter extract to build a transformation from scratch. See Search Panel.

Extract text or patterns

A primary use of extraction is to remove literal or patterned values of text from a column of values. Suppose your dataset included a column of LinkedIn updates. You can use one of the following methods to extract keywords from these values. 

Extract single values

The following example transformation extracts the word #bigdata from the column msg_LinkedIn:

D trans
RawWrangletrue
p03Value'#bigdata'
Typestep
WrangleTextextractpatterns type: custom col: msg_LinkedIn on: '#bigdata' limit: 1
p01NameColumn to extract from
p01Valuemsg_LinkedIn
p02NameOption
p02ValueCustom text or pattern
p03NameText to extract
p04Value1
p04NameNumber of matches to extract
SearchTermExtract text or pattern

Notes:

  • The option parameter identifies that the pattern to match is a custom one specified by the user.
  • The Number of matches to extract parameter defaults to 1, meaning that the transformation extracts a maximum of one value from each cell. This value can be set from 1-50. 

Extract values by example

You can generate a new column of values extracted from a source column by entering example values to match with source values. Values with similar patterns may also be matched based on your entered example value.

Tip

Tip: This method provides an easy way to build pattern-based matching for values in a source column.

For more information on transformation by example, see Overview of TBE.

Constrain matching

Within the extract transformation, you can specify literals or patterns before or after which the match is found. This method can be used to remove parts of each cell value from erroneously matching on the literal or pattern that is desired.

The following example extracts the second three-digit element of a phone number, skipping the area code:

D trans
p03Value`{digit}`
Typestep
p05NameIgnore matches between
p01NameColumn to extract from
p01Valuephone_num
p02NameOption
p02ValueCustom text or pattern
p05Value`{start}{digit}{3}\-`
p03NameText to extract
p04Value1
p04NameNumber of matches to extract
SearchTermExtract text or pattern

Extract single patterns

You can also do pattern-based extractions using 

D s item
patterns
patterns
 or regular expressions.

  • Regular expressions are a standards-based method of describing patterns of characters for matching purposes. Regular expressions are very powerful but can be difficult to use.  
  • D s item
    itempattern
     
    is a proprietary method of describing patterns, which is much simpler to use than regular expressions.
  • For more information on both types of patterns, see Text Matching.

The following example extracts all words that begin with # in the msg_LinkedIn column:

D trans
RawWrangletrue
p03Value`\#{alphanum-underscore}+`
Typestep
WrangleTextextractpatterns type: custom col: msg_LinkedIn on: `\#{alphanum-underscore}+` limit: 50
p01NameColumn to extract from
p01Valuemsg_LinkedIn
p02NameOption
p02ValueCustom text or pattern
p03NameText to extract
p04Value50
p04NameNumber of matches to extract
SearchTermExtract text or pattern

Notes:

  • The Text to extract parameter has changed:

    ElementDescription

    Two back-ticks (`)

    Indicate that the expression between them represents a 

    D s item
    itempattern
    .

    \#The slash indicates that the character right after it should be interpreted as a character only; it should not be interpreted as any special character in the pattern.
    {alphanum-underscore}

    This

    D s item
    itempattern
    element is used to indicate a single alphanumeric or underscore character.

    +Adding the plus sign after the above character signifies that the pattern can match on a sequence of alphanumeric or underscore characters of one or more length.
  • The Number of matches to extract parameter has been increased to grab up to 50 hashtags.

Advanced options

OptionDescription
Number of patterns to extract

Set this value to the total number of patterns you wish to extract.

Info

NOTE: This value determines the number of columns that are generated by the extraction. If no value is available, an empty value is written into the corresponding column.

The default is 1.

Ignore caseBy default, pattern matching is case-sensitive. Select this checkbox to ignore case when matching.
Ignore matches betweenYou can enter a pattern here to describe any patterns that should not be part of any match. This option is useful if you have multiple instances of text but want to ignore the first one, for example.

Extract multiple values

In your pattern expressions, you can use the vertical pipe character (|) to define multiple patterns to find. The following example extracts any value from the myDate column that ends in 7 pr in 8:

D trans
p03Value`{end}`
Typestep
p01NameColumn to extract from
p01ValuemyDate
p02NameText to extract
p02Value`{any}+7|{any}+8`
p03NameEnd extracting before
SearchTermExtract text or pattern

You can use the vertical pipe in both 

D s item
itempatterns
 and regular expressions.

Extract first or last characters

You can extract the first or last set of characters from a column into a new column. In the following example, the first five characters from the ProductName column are extracted into a new product identifier column:

D trans
p03Value5
Typestep
p01NameColumn to extract from
p01ValueProductName
p02NameOption
p02ValueFirst characters
p03NameNumber of characters to extract
SearchTermExtract by positions

You can change the Option value to Last characters to extract from the right side of the column value.

Extract and remove

If you need to remove the characters that you extracted, you can use the following transformation. In this case, the first five characters, which were extracted in the previous transformation, are removed:

D trans
Typestep
p01NameColumns
p01ValueProductName
p02NameFormula
p02ValueRIGHT(ProductName, LEN(ProductName)-5)
SearchTermEdit column with formula

Extract by positions

You can extract values between specified index positions within a set of column values. In the following example, the text between the fifth and tenth characters in a column are extracted to a new column.

Tip

Tip: This extraction method is useful if the content before and after the match area is inconsistent and cannot be described using patterns. If it is consistent, you should use the Extract text or pattern transformation.

D trans
p03Value5
Typestep
p01NameColumn to extract from
p01ValueProductName
p02NameOption
p02ValueBetween two positions
p03NameStarting position
p04Value10
p04NameEnding position
SearchTermExtract by positions

Extract by Data Type

You can perform extractions that are specific to a data type or based on failures of the data to match a specified data type.

Extract date values

You can use functions to extract values from Datetime columns.  The example below extracts the year value from the myDate column:

D trans
p03ValuemyYear
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueYEAR(myDate)
p03NameNew column name
SearchTermNew formula

The following functions can be used to extract values from a Datetime column, as long as the values are present in the formatted date:

You can also reformat the whole Datetime column using the DATEFORMAT function. The following reformats the column to show only the two-digit year:

D trans
Typestep
p01NameColumns
p01ValuemyDate
p02NameFormula
p02ValueDATEFORMAT(myDate, "yy")
SearchTermEdit column with formula

Extract numeric values

You can extract numerical data from text values. In the following example, the first number is extracted from the address column, which would correspond to extracting the street number for the address:

D trans
p03Value1
Typestep
p01NameColumn to extract from
p01Valueaddress
p02NameOption
p02ValueNumbers
p03NameNumber of matches to extract
SearchTermExtract patterns

Empty values in this new column might indicate a formatting problem with the address.

Tip

Tip: If you set the number of patterns to extract to 2 for the address column, you might extract apartment or suite information.

Extract object values

If your data includes sets of arrays, you can extract array elements into columns for each key, with the values written to each key column.

Suppose your restaurant dataset includes a set of characteristics in the restFeatures column in the following JSON format:

Code Block
{
  "Credit": "Y",
  "Accessible": "Y",
  "Restrooms": "Y",
  "EatIn": "Y",
  "ToGo": "N",
  "AlcoholBeer": "Y",
  "AlcoholHard": "N",
  "TotalTables": "10",
  "TotalTableSeats": "36",
  "Counter": "Y",
  "CounterSeats": "8"
}

You can use the following transformation to extract the values from TotalTableSeats and CounterSeats into separate columns:

D trans
p03ValueCounterSeats
Typestep
p01NameColumn
p01ValuerestFeatures
p02NamePaths to elements - 1
p02ValueTotalTableSeats
p03NamePaths to elements - 2
p04ValueSelected
p04NameInclude original column name
SearchTermUnnest Objects into columns

After the above is executed, you can perform a simple sum of the TotalTableSeats and CounterSeats columns to determine the total number of seats in the restaurant.

Extract array values

In some cases, your data may contain arrays of repeated key-value pairs, where each pair would exist on a separate line. Suppose you have a column called, Events, which contains date and time information about the musician described in the same row of data. The Events column might look like the following:

Code Block
[{"Date":"2018-06-15","Time":"19:00"},{"Date":"2018-06-17","Time":"19:00"},{"Date":"2018-06-19","Time":"20:00"},{"Date":"2018-06-20","Time":"20:00"}]

The following transformation creates a separate row for each entry in the Events column, populating the other fields in the new rows with the data from the original row:

 

Info

NOTE: This type of transformation can significantly increase the size of your dataset.

D trans
Typestep
p01NameColumn
p01ValueEvents
SearchTermExpand arrays into rows

Extract components of a URL

URL components

Using functions, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL column:

D trans
p03ValuemyDomain
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueDOMAIN(myURL)
p03NameNew column name
SearchTermNew formula

In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.

The following functions can be used to extract values from a set of URLs:

Query parameters

You can extract query parameter values from an URL. The following example extracts the store_id value from the storeURL field value:

D trans
p03Valuestore_id
Typestep
p01NameColumn to extract from
p01ValuestoreURL
p02NameOption
p02ValueHTTP Query strings
p03NameFields to extract
SearchTermExtract patterns