...
This section describes techniques to standardize text values in your datasets. In
, you You can use the following techniques to address some common issues you might encounter in the standardization of text and other non-numeric values.
...
- Some characters, like tab, cannot be represented in this format.
String | Description |
---|
My String | Base string: 'My String' |
My String extra | Base string + ' extra' |
My String | A space in front of base string |
My String | A space after base string |
MyString | No space between the two words of base string |
My String | Two spaces between the two words of base string |
My String | Base string + a tab character |
My String | Base string + a return character |
My String | Base string + a newline character |
...
D trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | *All |
---|
p02Name | Formula |
---|
p02Value | TRIM($col) |
---|
SearchTerm | Edit column with formula |
---|
|
...
Special Value | Description |
---|
* | For the Columns textbox under Advanced, you can use this wildcard to reference all columns in the dataset. Tip |
---|
Tip: You can also select All from the Columns drop-down. |
|
$col | When multiple columns are referenced in a transform, this special value allows you to reference the source column in a replacement value. |
...
...
Info |
---|
NOTE: Regular expressions are very powerful pattern-matching tools. If they are poorly specified in a transform, they can have unexpected results. Please use them with caution. |
You can use
to break up CamelCase entries in a column of values. The following transforms use regular expressions to identify patterns in a set of values:
...
When the above is previewed, however, you might notice that ending punctuation is not captured. For example, periods, exclamation points, and question marks at the end of your values are not captured in the
. To capture those values, the Find parameter must be expanded:
...
You need to cut each value in a column down to a maximum of two words. You can use the following to identify the first two words using capture groups in a
and then write that pattern back out, dropping the remainder of the column value:
...
D trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | *All |
---|
p02Name | Formula |
---|
p02Value | TRIM($col) |
---|
SearchTerm | Edit column with formula |
---|
|
Notes:
- The Instead of
All
above, you can use the asterisk (*
) is a wildcard, which represents all possible value. In this case, it both values for Columns matches with all column names in the dataset. - You may need to move columns or use range values to apply this transformation to only non-numeric column types.
- The
$col
entry denotes a reference to the current column. So for any column to which this transformation is applied, the source values are pulled from the column itself and then trimmed.
...
D trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | *All |
---|
p02Name | Formula |
---|
p02Value | REMOVEWHITESPACE($col) |
---|
SearchTerm | Edit column with formula |
---|
|
...
D trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | *All |
---|
p02Name | Format |
---|
p02Value | Remove symbols |
---|
SearchTerm | Remove symbols |
---|
|
...
The following transformation converts all accented characters (e.g."ä") to unaccented characters (e.g "a").
D trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | *All |
---|
p02Name | Format |
---|
p02Value | Remove accents |
---|
SearchTerm | Remove accents from text |
---|
|
...
D trans |
---|
Type | step |
---|
p01Name | Columns |
---|
p01Value | *All |
---|
p02Name | Format |
---|
p02Value | Trim leading and trailing quotes |
---|
SearchTerm | Trim quotes |
---|
|
...