Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

D toc

Excerpt

This section enables you to work with text matching in

D s product
rtrue
application.

Match Types

D s product
rtrue
 supports the following types of text matching clauses:

  • String literals match specified strings exactly. Written using single quotes ('...') or double quotes ("...").
  • Regular expressions enable pattern-based matching. Regular expressions are written using forward slashes (/.../). The syntax is based on RE2 and PCRE regular expressions.

    Info

    NOTE: Regular expressions are considered a developer-level capability and can have significant consequences if they are improperly specified. Unless you are comfortable with regular expressions, you should use

    D s lang
    itempatterns
    instead.

  • D s lang
    itempatterns
      are custom selectors for patterns in your data and provide a simpler and more readable alternative to regular expressions. They are written using backticks (`...`).

  • Column names are simple text strings in 
    D s lang
    . If the column name contains a space, it must be bracketed in curly braces: {my Column Name}

D s item
itemPatterns
 Syntax

The following tables contain syntax information about 

D s lang
itempatterns
:

Tip

Tip: After using

D s lang
itempatterns
, regular expressions, or string literals in a recipe step, you can reuse them in your transformations where applicable.

Character patterns

These patterns apply to single characters and strings of characters

PatternDescription
%Match any character, exactly once
%?Match any character, zero or one times
%*Match any character, zero or more times
%+match any character, one or more times
%{3}Match any character, exactly three times
%{3,5}Match any character, 3, 4, or 5 times
#Digit character [0-9]
{any}Match any character, exactly once
{alpha}Alpha character [A-Za-z_]
{upper}Uppercase alpha character [A-Z_]
{lower}Lowercase alpha character [a-z_]
{digit}Digit character [0-9]
{delim}Single delimiter character e.g :, ,, |, /, -, ., \s
{delim-ws}Single delimiter and all the whitespace around it
{alpha-numeric}Match a single alphanumeric character
{alphanum-underscore}Match a single alphanumeric character or underscore character
{at-username}Match @username values
{hashtag}Match #hashtag values
{hex}Match hexadecimal number (e.g. 2FA3)

Position patterns

These patterns describe positions relative to the entire string.

PatternDescription
{start}Match the start of the line
{end}Match the end of the line

Type patterns

These patterns can be used to match strings that fit a particular data type, except for Datetime patterns.

PatternDescription
{phone}Match a valid U.S. phone number.
{email}Match a valid email address.
{url}Match a valid URL.
{ip-address}Match a valid IP address.
{hex-ip-address}Match a valid hexadecimal IP address (e.g. 0x0CA40012)
{bool}Match a valid Boolean value.
{street}Match a U.S.-formatted street address (e.g. 123 Main Street)
{occupancy}Match a valid U.S.-formatted occupancy address value (e.g. Apt 2D)
{city}Match a city name within U.S.-formatted address value
{state}Match a valid U.S. state value (e.g. California).
{state-abbrev}Match a valid two-letter U.S. state abbreviation value (e.g. CA)
{zip}Match a valid five-digit zip code

Datetime patterns

PatternDescription
{month}Match full name of month (e.g. January)
{month-abbrev}Match short name of month (e.g. Jan)
{time}Match time value in HOUR:MINUTE:SECOND format (e.g. 11:59:23)
{period}Match time period of the day: AM/PM
{dayofweek}Match long name for day of the week (e.g. Sunday).
{dayofweek-abbrev}Match short name for day of the week (e.g. Sun).
{utcoffset}Match a valid UTC offset value (e.g. -0500, +0400, Z)
Info

NOTE: You can use the Datetime data type formatting tokens as part of your

D s lang
itempatterns
to build a variety of matching patterns for date and time values.

Grouping patterns

PatternDescription
{[...]}character class matches characters in brackets
{![...]}negated class matches characters not in brackets
(...)grouping, including captures
#, %, ?, *, +, {, }, (, ), \, ’, \n, \tescaped characters or pattern modifiers Use a double backslash (\\) to denote an escaped string literal. For more information, see Escaping Strings in Transformations.
|logical OR
  • Logical AND is the implied operator when you concatenate text matching patterns.
  • Logical NOT is managed using negated classes.

See also Capture Group References.

D s lang
itempatterns
 Examples

Basic

Match first three characters:

Code Block
`{start}%{3}`

Match last four letters (numeric or other character types do not match):

Code Block
`{alpha}{4}{end}`

Match first word:

Code Block
`{start}{alpha}+`

Matches date values in general YYYY*MM*dd format:

Code Block
`{yyyy}{delim}{MM}{delim}{dd}`

Matches time values in 12-hour format:

Code Block
`{h}{delim}{mm}{delim}{s}`

In transformations

The following transformation masks credit card number patterns, except for the last four digits:

D trans
p03ValueXXXX-XXXX-XXXX-$1
Typeref
p01NameColumns
p01ValuemyCreditCardNumbers
p02NameFind
p02Value`{start}{digit}{4}{any}{digit}{4}{any}{digit}{4}{any}({digit}{4}){end}`
p03NameReplace with
SearchTermReplace text or patterns

Notes:

  • The inclusion of the {start} and {end} tokens assures that the matches are made only when the pattern is found across the entire value in a cell.
  • The parenthesis in the Find value identify the capture group, which is referenced in the Replace With value as $1. See Capture Group References.

The above transformation matches values based on the structure of the data, instead of the data type.

  • Some values that follow this pattern are not valid credit card numbers, so it's meaningful to check against the data type.
  • If for some reason, you have values that are not credit card numbers yet follow the credit card pattern, those values will be masked as well by this transformation.

So to be safe, you might try the following set of transformations to ensure that you are matching on credit card values.

Step 1: If the number in your source column is valid, write it to a new column.

D trans
RawWrangletrue
p03ValuemyCreditCardNumbersMasked
Typestep
WrangleTextderive type:single value:IFVALID(myCreditCardNumbers,'Creditcard'),$col,'') as: 'myCreditCardNumbersMasked'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueIFVALID(myCreditCardNumbers,'Creditcard'),$col,'')
p03NameNew column name
SearchTermNew formula

Notes:

  • The IFVALID function tests to see if a set of values is valid for a specified data type, 'Creditcard' in this case. For more information on the strings that you can use to test against data type, see Valid Data Type Strings.
  • The $col is a reference to the value in the column where the evaluation is being performed. For more information, see Source Metadata References.

Step 2: The myCreditCardNumbersMasked column now contains values that are valid credit card numbers from your source column. You can now apply the masking step.

D trans
p03ValueXXXX-XXXX-XXXX-$1
Typeref
p01NameColumns
p01ValuemyCreditCardNumbersMasked
p02NameFind
p02Value`{start}{digit}{4}{any}{digit}{4}{any}{digit}{4}{any}({digit}{4}){end}`
p03NameReplace with
SearchTermReplace text or patterns

Step 3: If needed, you can move the masked values back to the source column. 

D trans
Typestep
p01NameColumns
p01ValuemyCreditCardNumbers
p02NameFormula
p02ValueIF(myCreditCardNumbersMasked<>'',myCreditCardNumbersMasked,'')
SearchTermEdit column with formula

The myCreditCardNumbers column now contains only valid credit card numbers that have been asked. The application is likely to infer the data type of the column as String.

Delete the myCreditCardNumbersMasked column.

D s also
inCQLtrue
label((label = "patterns") OR (label = "pattern"))