Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0810

This example demonstrates how the quote parameter can be used for more sophisticated splitting of columns of data using the split transform.

Source:

In this example, the following CSV data, which contains contact information, is imported into the application:

Code Block
LastName,FirstName,Role,Company,Address,Status
Wagner,Melody,VP of Engineering,Example.com,"123 Main Street, Oakland, CA 94601",Prospect
Gruber,Hans,"Director, IT",Example.com,"456 Broadway, Burlingame, CA, 94401",Customer
Franks,Mandy,"Sr. Manager, Analytics",Tricorp,"789 Market Street, San Francisco, CA, 94105",Customer

Transformationn:

When this data is pulled into the application, some initial parsing is performed for you:

column2column3column4column5column6column7
LastNameFirstNameRoleCompanyAddressStatus
WagnerMelodyVP of EngineeringExample.com"123 Main Street, Oakland, CA 94601"Prospect
GruberHans"Director, IT"Example.com"456 Broadway, Burlingame, CA, 94401"Customer
FranksMandy"Sr. Manager, Analytics"Tricorp"789 Market Street, San Francisco, CA, 94105"Customer

When you open the Recipe Panel, you should see the following transforms:

D trans
RawWrangletrue
p03Value\"
Typestep
WrangleTextsplitrows col: column1 on: '\n' quote: '\"' quoteEscapeChar: '\"'
p01NameColumn
p01Valuecolumn1
p02NameSplit on
p02Value\n
p03NameIgnore matches between
p04Value\"
p04NameQuote escape character
SearchTermSplit into rows

D trans
RawWrangletrue
p03Value','
WrangleTextsplit type: on col: column1 on: ',' limit: 5 quote: '\"'
p01NameColumn
p03NameMatch pattern
p04Value5
SearchTermSplit column
Typestep
p05NameIgnore matches between
p01Valuecolumn1
p02NameOption
p02ValueOn pattern
p05Value\"
p04NameNumber of Matches

The first transform splits the raw source data into separate rows in the carriage return character (\r), ignoring all values between the double-quote characters. Note that this value must be escaped. The double-quote character does not require escaping. While there are no carriage returns within the actual data, the application recognizes that these double-quotes are identifying single values and adds the quote value.

The second transform splits each row of data into separate columns. Since it is comma-separated data, the application recognizes that this value is the column delimiter, so the on value is set to the comma character (,). In this case, the quoting is necessary, as there are commas in the values in column4 and column6, which are easy to clean up.

To finish clean up of the dataset, you can promote the first row to be your column headers:

D trans
RawWrangletrue
p03Value1
Typestep
WrangleTextrename type: header method: index sourcerownumber: 1
p01NameOption
p01ValueUse row(s) as column names
p02NameType
p02ValueUse a single row to name columns
p03NameRow number
SearchTermRename column with row(s)

You can remove the quotes now. Note that the following applies to two columns:

D trans
RawWrangletrue
p03Value''
Typestep
WrangleTextreplacepatterns col: Address,Role with: '' on: '\"' global: true
p01NameColumn
p01ValueAddress,Role
p02NameFind
p02Value'\"'
p03NameReplace
p04Valuetrue
p04NameMatch all occurrences
SearchTermReplace text or patterns

Now, you can split up the Address column. You can highlight one of the commas and the space after it in the column, but make sure that your final statement looks like the following:

D trans
RawWrangletrue
p03Value','
Typestep
WrangleTextsplit type: on col: column1 on: ',' limit: 5 quote: '\"'
p01NameColumn
p01Valuecolumn1
p02NameOption
p02ValueOn pattern
p03NameMatch pattern
p04Value2
p04NameNumber of Matches
SearchTermSplit column

Notice that there is some dirtiness to the resulting Address3 column:

Address3
CA 94601
CA, 94401
CA, 94105

Use the following to remove the comma. In this case, it's important to leave the space between the two values in the column, so the on value should only be a comma. Below, the width value is two single quotes:

D trans
RawWrangletrue
p03Value''
Typestep
WrangleTextreplacepatterns col: Address3 with: '' on: ',' global: true
p01NameColumn
p01ValueAddress3
p02NameFind
p02Value','
p03NameReplace
p04Valuetrue
p04NameMatch all occurrences
SearchTermReplace text or patterns

You can now split the Address3 column on the space delimiter:

D trans
RawWrangletrue
p03Value' '
Typestep
WrangleTextsplitpatterns col: Address3 type: on on: ' ' limit: 2
p01NameColumn
p01ValueAddress3
p02NameOption
p02Valueby delimiter
p03NameDelimiter
p04Value2
p04NameNumber of columns to create
SearchTermSplit by delimiter

Results:

After you rename the columns, you should see the following:

LastNameFirstNameRoleCompanyAddressCityStateZipcodeStatus
WagnerMelodyVP of EngineeringExample.com123 Main StreetOaklandCA94601Prospect
GruberHansDirector, ITExample.com456 BroadwayBurlingameCA94401Customer
FranksMandySr. Manager, AnalyticsTricorp789 Market StreetSan FranciscoCA94105Customer