Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r079

...

This section describes techniques to standardize text values in your datasets. In

D s product
rtrue
, you You can use the following techniques to address some common issues you might encounter in the standardization of text and other non-numeric values.

...

  • Some characters, like tab, cannot be represented in this format. 
StringDescription
My StringBase string: 'My String'
My String extraBase string + ' extra'
 My StringA space in front of base string
My String A space after base string
MyStringNo space between the two words of base string
My  StringTwo spaces between the two words of base string
My StringBase string + a tab character
My String
 
Base string + a return character
My String
 
Base string + a newline character

...

D trans
Typestep
p01NameColumns
p01Value*All
p02NameFormula
p02ValueTRIM($col)
SearchTermEdit column with formula

...

Special ValueDescription
*

For the Columns textbox under Advanced, you can use this wildcard to reference all columns in the dataset.

Tip

Tip: You can also select All from the Columns drop-down.

$colWhen multiple columns are referenced in a transform, this special value allows you to reference the source column in a replacement value.

...

  • In the above, the Find term contains a string with two spaces in it.

    Tip

    Tip: If you wish to find two or more spaces, you can use the following

    d-s-itemlang
    itempattern
    in the Find parameter:

    Code Block
    `( )+`
  • The Replace term contains no spaces.

...

Info

NOTE: Regular expressions are very powerful pattern-matching tools. If they are poorly specified in a transform, they can have unexpected results. Please use them with caution.

You can use 

d-s-itemlang
itempatterns
 to break up CamelCase entries in a column of values. The following transforms use regular expressions to identify patterns in a set of values: 

...

When the above is previewed, however, you might notice that ending punctuation is not captured. For example, periods, exclamation points, and question marks at the end of your values are not captured in the 

d-s-lang
item
pattern
pattern
. To capture those values, the Find parameter must be expanded: 

...

You need to cut each value in a column down to a maximum of two words. You can use the following to identify the first two words using capture groups in a 

d-s-lang
itempatternpattern
 and then write that pattern back out, dropping the remainder of the column value:

...

D trans
Typestep
p01NameColumns
p01Value*All
p02NameFormula
p02ValueTRIM($col)
SearchTermEdit column with formula

Notes:

  • The Instead of All above, you can use the asterisk (*) is a  wildcard, which represents all possible value. In this case, it both values for Columns matches with all column names in the dataset.
  • You may need to move columns or use range values to apply this transformation to only non-numeric column types.
  • The $col entry denotes a reference to the current column. So for any column to which this transformation is applied, the source values are pulled from the column itself and then trimmed.

...

D trans
Typestep
p01NameColumns
p01Value*All
p02NameFormula
p02ValueREMOVEWHITESPACE($col)
SearchTermEdit column with formula

...

D trans
Typestep
p01NameColumns
p01Value*All
p02NameFormat
p02ValueRemove symbols
SearchTermRemove symbols

...

The following transformation converts all accented characters (e.g."ä") to unaccented characters (e.g "a").

 

D trans
Typestep
p01NameColumns
p01Value*All
p02NameFormat
p02ValueRemove accents
SearchTermRemove accents from text

...

D trans
Typestep
p01NameColumns
p01Value*All
p02NameFormat
p02ValueTrim leading and trailing quotes
SearchTermTrim quotes

...