Page tree

 

Support | BlogContact Us | 844.332.2821

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

D toc

Unlike other types of data, string text data has very few restrictions on the kinds of values that appear in a cell. In the application, this data is typically inferred as String data type. As a result, finding string values that mean the same thing can be a challenge, as minor differences in their content or structure can invalidate a match.

This section provides some methods for matching text values.

  • Some target systems may impose limits on the lengths of imported values. For more information on managing the lengths of your strings, see Manage String Lengths.

Example Data

In the following example, you can see that there are minor differences between the String values in each row of the dataset. These differences are captured in the Description column.

StringDescription
My StringBase string: 'My String'
My String extraBase string + ' extra'
 My StringA space in front of base string
My String A space after base string
MyStringNo space between the two words of base string
My  StringTwo spaces between the two words of base string
My StringBase string + a tab character
My String
 
Base string + a return character
My String
 
Base string + a newline character

When this data is imported into the

D s webapp
rtrue
, it looks like the following, after minor cleanup:

D caption
typefigure
Example data after import
Notes:

  • You can see that white space is demarcated in the imported data. In particular, the line item with two spaces between the words is accurately represented in the
    D s webapp
    .
  • Newlines, carriage returns, tabs, and other non-visible characters are represented with icons.

To normalize these text values, you can use some of the techniques listed on this page to match the problematic string values in this dataset and correct them, as needed.

Finer-Grained Controls

For closer control over string matching and cleanup, you can apply individual transforms to your column of string data. The sections below outline a number of techniques for identifying matches and cleaning up your data.

Trim strings

Info

NOTE: Before you begin matching data, you should perform a TRIM transform to remove whitespace at the beginning and end of the string, unless the whitespace is significant to the meaning and usage of the string data. 

When transforming strings, a key step is to trim off the whitespace at the beginning and ending of the string. For the above dataset, you can use the following command to remove these whitespaces:

D code

set col: String value: TRIM(String)

To remove all whitespace, including spaces in between, you can use the REMOVEWHITESPACE function. See REMOVEWHITESPACE Function.

The previewed data looks like the following, in which five strings are modified and now match the base string:

D caption
typefigure
Trim data to improve matches

Use missing or mismatched value presets

The platform language, 

D s lang
, provides presets to identify missing or mismatched values in a selection of data.

Tip

Tip: In a column's histogram, click the missing or mismatched categories to trigger a set of suggestions.

Missing values preset: The following transform replaces missing URL values with the text string http://www.example.com. The preset ISMISSING([Primary_WebSite_or_URL]) identifies the rows missing data in the specified column:

D code

set col: Primary_Website_or_URL value: 'http://www.example.com' row: ISMISSING([Primary_Website_or_URL])

For more information, see Find Missing Data.

Info

NOTE: If the data type for the column is URL, then the replacement text string must be a valid URL, or the new data is registered as mismatched with the data type.

Mismatched values preset: This transform converts to 00000 all values in the Zip column that are mismatched against the Zipcode data type. In this case, the preset ISMISMATCHED(Zip, ['Zipcode']) identifies the mismatched values in the column, as compared to the Zipcode data type:

D code

set col: Zip value: '00000' row: ISMISMATCHED(Zip, ['Zipcode'])

For more information, see Find Bad Data.

Remove a specific sub-string

An entry in the example data contains an additional word: My String extra. You can use a simple replace command to remove it: 

D code

replace col:String with:'' on:' extra' global:true

The global parameter causes the replacement to be applied to all instances found within a cell value. Otherwise, the replacement occurs only on the first instance.

Replace double spaces

There are multiple ways of removing double spaces, or any pattern, from text values. For best results, you should limit this change to individual columns.

Info

NOTE: For matching string patterns that are short in length, you should be careful to define the scope of match. For example, to remove double spaces from your dataset, you should limit the columns to just the ones containing string values. If you applied the change to all columns in the dataset, meaningful uses of double spacing could be corrupted, such as in JSON data fields.

D code

replace col: String on: '  ' with: ' ' global: true

Break out CamelCase

CamelCase refers to text in which multiple words are joined together by removing the spaces between them. In the example data, the entry MyString is an example of CamelCase. 

Info

NOTE: Regular expressions are very powerful pattern-matching tools. If they are poorly specified in a transform, they can have unexpected results. Please use them with caution.

You can use

D s item
itempatterns
 to break up CamelCase entries in a column of values. The following transforms use regular expressions to identify patterns in a set of values: 

D code

replace col:String with:'$1 $2' on:`({alpha})({upper})` global:true

The first transform locates all instances of uppercase letters followed by lower-case letters. Each instance is replaced by a space, followed by the found string ($1). For more information, see Text Matching.

Reduce strings by words

Remove last word:

For example, you need to remove the last word of a string and the space before it. You can use the following replace transform to do that:

D code

replace col:String with:'' on:` {alpha}+{end}`

When the above is previewed, however, you might notice that ending punctuation is not captured. For example, periods, exclamation points, and question marks at the end of your values are not captured in the
D s item
pattern
pattern
. To capture those values, the on parameter must be expanded: 
D code

replace col:String with:'' on:` {alpha}+({[?.!;\)]}|){end}`

In the second version, a capture group has been inserted in the middle of the on parameter value, as specified by the contents of the parentheses:

  • The bracket-colon notation denotes a set of possible individual characters that could appear at this point in the pattern.
    • Note the backward slash before the right parenthesis in the capture group. This value is used to escape a value, so that this parenthesis is interpreted as another character, instead of the end of the capture group.
  • The vertical pipe (|) denotes a logical OR, meaning that the specified individual characters could appear or the value after the vertical pipe. 
  • Since the value after the vertical pipe is missing, this capture group finds values with or without punctuation at the end of the line.

Reduce total number of words:

You need to cut each value in a column down to a maximum of two words. You can use the following to identify the first two words using capture groups in a 

D s item
pattern
pattern
 and then write that pattern back out, dropping the remainder of the column value:
D code

replace col:String with:'$1$2' on:`{start}({alpha}* )({alpha}*) ({any}*{end})`

For the on pattern:

  • The start pattern identifies the start of each value in the String column.
  • The two alpha capture groups identify the first two words in the string. Note that the space after the second capture group is specified outside of the capture group; if it was part of the capture group, a trailing space is written in the replacement value.
  • The final capture group identifies the remainder of the value in the cell.
    • any captures any single character.
    • The wildcard asterisk captures all values between the any character and the end of the value.