The platform utilizes columnar pattern matching to identify data patterns of interest to you and to surface them in the interface for use in building your recipes. Additionally, in your recipe steps, you can apply regular expressions or platform patterns to locate patterns and transform the matching data in your datasets.

Overview

A pattern is a combination of abstracted character sets and literal characters that can summarize data patterns in a column. Patterns can be applied through one of two methods:

  • Regular expressions are a standardized method of matching data. Their syntax is powerful but can be difficult to read and write.
  • Platform patterns are pattern-matching widgets that provide a layer of abstraction on top of regular expressions. Instead of having to specify the sometimes complex underlying regular expression, you can specify a simple token to represent the underlying expression.

    Tip: While regular expressions are a widely used standard, platform patterns are powerful simplifications that can limit the sometimes "greedy" matching issues in regular expressions.

  • For more information on the supported patterns, see Text Matching.
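The "greedy" matching issue mentioned in the tip above can be reproduced in any regular expression engine. The following minimal sketch uses Python's re module purely for illustration; it is not the platform's engine, and the sample string is made up.

```python
import re

text = "(555) 123-4567 (home)"  # made-up sample value

# Greedy: .* consumes as much as possible, so the match runs to the LAST ")".
greedy = re.search(r"\(.*\)", text).group()

# Lazy: .*? stops at the first ")" that completes a match.
lazy = re.search(r"\(.*?\)", text).group()

print(greedy)  # (555) 123-4567 (home)
print(lazy)    # (555)
```

A token that stands for a bounded set of characters avoids this kind of over-matching because it cannot run past the characters it describes.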

This section provides an overview of the pattern matching features of the platform.

Example Patterns

Within a row, multiple patterns may be applied at different levels of abstraction to describe the data in all fields (columns) of the row. Suppose you have two records like the following:

Code Block
[cz.laping@gmail.com,3987,1446319063821]
[ajuneauk@gmail.com,5289,1447275151508]

The above records can be described by any of the following patterns:

Code Block
[{alpha-numeric}+,{4-digits},{13-digits}]
[{email},{4-digits},{13-digits}]
[{alpha-numeric}+@gmail.com,{4-digits},{13-digits}]

NOTE: The above patterns utilize the syntax of platform patterns. Regular expressions can be used to describe them as well.

In the above case, all three pattern sets capture the data completely. However, please note the differences between the patterns for column 1:

Pattern | Description
{alpha-numeric}+ | This pattern captures alpha-numeric values of one or more characters. So, entries that match on this pattern do not need to be valid email addresses.
{email} | This pattern ensures matching only on valid email addresses. So, values that do not match this pattern are likely to be flagged as mismatched within the platform.
{alpha-numeric}+@gmail.com | This partial pattern ensures that the only matches are from gmail.com.

Depending on the specific meaning of the data for your use, any of the above may apply.
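To make the difference between the three column 1 patterns concrete, here is a minimal Python sketch. The regular expressions are rough stand-ins chosen for illustration and are assumptions, not the platform's definitions of {alpha-numeric} or {email}. The function name matching_patterns and the extra sample values are hypothetical. Matching is unanchored, so a pattern matches if it is found anywhere in the value.

```python
import re

# Rough regex stand-ins for the three column-1 patterns (assumptions only).
PATTERNS = [
    ("{alpha-numeric}+", re.compile(r"[A-Za-z0-9]+")),
    ("{email}", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("{alpha-numeric}+@gmail.com", re.compile(r"[A-Za-z0-9]+@gmail\.com")),
]

def matching_patterns(value):
    """Return the names of the patterns found somewhere in the value."""
    return [name for name, rx in PATTERNS if rx.search(value)]

print(matching_patterns("cz.laping@gmail.com"))
# ['{alpha-numeric}+', '{email}', '{alpha-numeric}+@gmail.com']
print(matching_patterns("someone@example.org"))
# ['{alpha-numeric}+', '{email}']
print(matching_patterns("user42"))
# ['{alpha-numeric}+']
```

The more abstract pattern accepts every value, while the more specific patterns progressively narrow the set of matches.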

Patterns in the Platform

Column Profiling

Pattern matching applied to a column lets you see the most common patterns and the anomalous patterns of data in that column across the entire sample. Because the surfaced patterns cover every value in the sample, you can gather detailed information about how consistent the data in the column is.

Tip: Column pattern profiling is especially useful after you have addressed the mismatched values in the column.

Based on the patterns surfaced for the column, you can take any of the following actions:

  • Filter a subset of records. For example, you can review patterns for a column of addresses and, based on the patterns you select, filter the rows of data where no street number is provided.
  • Standardize values. For example, you can select from the different patterns found for phone numbers and standardize the values to a single pattern. See Pattern Matching by Data Type below.
  • Extract values. You can break apart column values based on mismatches in structure. For example, apartment numbers from an address field can be extracted into a new column.
  • Choose a level of abstraction. As demonstrated in the previous example, you may be able to select from multiple matching patterns to determine which one is the best fit for the row values of interest.
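The general idea behind column pattern profiling can be sketched in a few lines of Python. This is an illustrative approximation, not the platform's profiling algorithm: the functions shape and profile are hypothetical, the {alpha} token name is assumed, and the phone numbers are made up. Each value is reduced to a coarse pattern, and the patterns are then counted.

```python
import re
from collections import Counter

def shape(value):
    """Summarize a value by replacing runs of digits or letters with tokens."""
    def token(match):
        run = match.group()
        kind = "digit" if run[0].isdigit() else "alpha"
        return "{%s}{%d}" % (kind, len(run))
    # Punctuation and spaces are kept as literal characters.
    return re.sub(r"[0-9]+|[A-Za-z]+", token, value)

def profile(values):
    """Return (pattern, count) pairs, most common pattern first."""
    return Counter(shape(v) for v in values).most_common()

phones = ["(555) 123-4567", "(415) 555-0100", "555-123-4567"]
for pattern, count in profile(phones):
    print(count, pattern)
# 2 ({digit}{3}) {digit}{3}-{digit}{4}
# 1 {digit}{3}-{digit}{3}-{digit}{4}
```

The pattern that appears only once is the kind of anomaly that profiling is meant to surface for filtering or standardization.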

Machine Learning

Additionally, the platform collects aggregated information about patterns applied by all users. These patterns are given weight in the set of suggested patterns presented to each user.

Pattern Matching by Data Type

As part of pattern matching, the platform evaluates the data against the specified data type for the column. Type-specific pattern matching applies to the following data types:

  • Datetime
  • Phone

See Standardize Using Patterns.

Using Patterns

In the application, patterns can be used as the starting point in building your next recipe step, and you can modify or iterate on a pattern definition to preview the results of the specified transformation. Patterns are used in the following actions:

  • Select text to trigger a pattern-based suggestion or suggestions
  • Select patterns of varying levels of abstraction to modify column data

Selecting Data

When you select a value in the data grid, your options include pattern-based suggestions. In this manner, you indicate something of interest and enable the platform to interpret your specific interest or broader goal for the selected data. These broader changes are surfaced as pattern-based suggestions in the context panel. 



Browse Pattern History

In fields in the Transform Builder that accept patterns, you can choose to review and select patterns from your recent history:

Figure: Browse Pattern History to review and select recently used patterns

A recently used pattern can be selected and added to the configured recipe step. See Pattern History Panel.

Patterns in Column Details

In the Column Details panel, you can review sets of patterns that describe subsets of the values in the column. When you select one of the patterns, you are prompted with a set of suggested transform steps to apply to the data. See Column Details Panel.

Advanced Uses

In addition to the above basic uses, patterns can be used as the basis for the following advanced uses and more.

Use | Description
Standardize records | Match values based on a pattern and then change values to fit this pattern. See Standardize Using Patterns.
Filter records | Keep or delete records based on patterns of values found in row data. See Filter Data.
Extract values | Extract values matching a pattern from one column and insert them into a new column of data. See Extract Values.
Generate function outputs | Use patterns to generate function outputs in new columns.
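As a rough illustration of the filter and extract uses above, the following Python sketch applies patterns to a few made-up address values. The regular expressions, the function names has_street_number and extract_apartment, and the data are all hypothetical; in the platform, these operations are performed through recipe steps rather than code like this.

```python
import re

# Hypothetical pattern: "Apt" or "Unit", optional period and space, then the number.
APARTMENT = re.compile(r"\b(?:Apt|Unit)\.? ?(\w+)", re.IGNORECASE)

def has_street_number(address):
    """Filter test: the address starts with digits followed by whitespace."""
    return re.match(r"\d+\s", address) is not None

def extract_apartment(address):
    """Extract the apartment number, or None when the pattern is absent."""
    found = APARTMENT.search(address)
    return found.group(1) if found else None

rows = ["12 Main St Apt 4B", "Main St", "900 Oak Ave"]

kept = [r for r in rows if has_street_number(r)]   # filter records
apartments = [extract_apartment(r) for r in kept]  # extract into a new column

print(kept)        # ['12 Main St Apt 4B', '900 Oak Ave']
print(apartments)  # ['4B', None]
```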

User-Defined Patterns

In your recipe steps, you can specify patterns using either of the following methods.

Regular expressions

Regular expressions (regexes) are sequences of characters that define a pattern. In transformations that support regex, you can use such a pattern to identify data of interest to you. Example:

Code Block
replace col: myCol with:/$1/ on:/^\((\d\d\d)\)/ global: false

In the above step, the matching pattern expressed in the on clause evaluates in the following manner:

  • The forward slashes around the pattern indicate that it is a regular expression.
  • ^ indicates the start of the value in the myCol column. So, a match is made only at the beginning of the value.
  • \( and \) are the regular expression representations of literal parentheses. So, matches are made on those specific characters.
  • The interior set of parentheses defines a capture group. The captured value, which corresponds to three digits, is inserted as the replacement.

So, the net effect is to search the beginning of a field for values like (555) and replace them with just the digits: 555. This replacement removes the parentheses from the area code part of a phone number.
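The same replacement can be tried outside the platform. The sketch below uses Python's re module with the identical regular expression; the function name strip_area_code_parens is hypothetical, and count=1 is assumed to correspond to global: false.

```python
import re

def strip_area_code_parens(value):
    # ^ anchors to the start of the value; \( and \) match literal parentheses;
    # (\d\d\d) is capture group 1, referenced as \1 in the replacement.
    # count=1 replaces only the first match (assumed analogue of global: false).
    return re.sub(r"^\((\d\d\d)\)", r"\1", value, count=1)

print(strip_area_code_parens("(555) 123-4567"))       # 555 123-4567
print(strip_area_code_parens("call (555) 123-4567"))  # call (555) 123-4567
```

The second value is unchanged because the parenthesized digits do not appear at the beginning of the value.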

 

NOTE: Regular expressions are very powerful tools for matching patterns. They can also cause unexpected results. Use of regular expressions is considered a developer-level skill. You should use the platform patterns described below instead.

The platform implements a version of regular expressions based on RE2 and PCRE regular expressions.

Platform patterns

Use platform patterns to quickly assemble sophisticated patterns to match in your data. The following example uses a platform pattern that is equivalent to the previous regular expression:

Code Block
replace col: myCol with:`$1` on:`^\(({digit}{3})\)` global: false

  • The back-ticks around the pattern indicate that it is a platform pattern.
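One way to think about the relationship between the two examples is that a token such as {digit} stands in for a piece of regular expression. The Python sketch below expands only the {digit} token. The TOKENS table and the to_regex function are hypothetical and heavily simplified; the supported tokens and their actual behavior are described in Text Matching.

```python
import re

# Hypothetical, partial token table: only {digit} is handled here.
TOKENS = {"{digit}": r"\d"}

def to_regex(pattern):
    """Expand known tokens into regex fragments; everything else is left as-is."""
    for token, fragment in TOKENS.items():
        pattern = pattern.replace(token, fragment)
    return pattern

regex = to_regex(r"^\(({digit}{3})\)")
print(regex)  # ^\((\d{3})\)

# The expanded expression behaves like the earlier regular expression example.
print(re.sub(regex, r"\1", "(555) 123-4567", count=1))  # 555 123-4567
```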

For more information, see Text Matching.