This example demonstrates how the quote
parameter can be used for more sophisticated splitting of columns of data using the split
transform.
Source:
In this example, the following CSV data, which contains contact information, is imported into the application:
Code Block |
---|
LastName,FirstName,Role,Company,Address,Status
Wagner,Melody,VP of Engineering,Example.com,"123 Main Street, Oakland, CA 94601",Prospect
Gruber,Hans,"Director, IT",Example.com,"456 Broadway, Burlingame, CA, 94401",Customer
Franks,Mandy,"Sr. Manager, Analytics",Tricorp,"789 Market Street, San Francisco, CA, 94105",Customer |
Transformationn:
When this data is pulled into the application, some initial parsing is performed for you:
column2 | column3 | column4 | column5 | column6 | column7 |
---|
LastName | FirstName | Role | Company | Address | Status |
Wagner | Melody | VP of Engineering | Example.com | "123 Main Street, Oakland, CA 94601" | Prospect |
Gruber | Hans | "Director, IT" | Example.com | "456 Broadway, Burlingame, CA, 94401" | Customer |
Franks | Mandy | "Sr. Manager, Analytics" | Tricorp | "789 Market Street, San Francisco, CA, 94105" | Customer |
When you open the Recipe Panel, you should see the following transforms:
D trans |
---|
RawWrangle | true |
---|
p03Value | \" |
---|
Type | step |
---|
WrangleText | splitrows col: column1 on: '\n' quote: '\"' quoteEscapeChar: '\"' |
---|
p01Name | Column |
---|
p01Value | column1 |
---|
p02Name | Split on |
---|
p02Value | \n |
---|
p03Name | Ignore matches between |
---|
p04Value | \" |
---|
p04Name | Quote escape character |
---|
SearchTerm | Split into rows |
---|
|
D trans |
---|
RawWrangle | true |
---|
p03Value | ',' |
---|
WrangleText | split type: on col: column1 on: ',' limit: 5 quote: '\"' |
---|
p01Name | Column |
---|
p03Name | Match pattern |
---|
p04Value | 5 |
---|
SearchTerm | Split column |
---|
Type | step |
---|
p05Name | Ignore matches between |
---|
p01Value | column1 |
---|
p02Name | Option |
---|
p02Value | On pattern |
---|
p05Value | \" |
---|
p04Name | Number of Matches |
---|
|
The first transform splits the raw source data into separate rows in the carriage return character (\r
), ignoring all values between the double-quote characters. Note that this value must be escaped. The double-quote character does not require escaping. While there are no carriage returns within the actual data, the application recognizes that these double-quotes are identifying single values and adds the quote value.
The second transform splits each row of data into separate columns. Since it is comma-separated data, the application recognizes that this value is the column delimiter, so the on
value is set to the comma character (,
). In this case, the quoting is necessary, as there are commas in the values in column4
and column6
, which are easy to clean up.
To finish clean up of the dataset, you can promote the first row to be your column headers:
D trans |
---|
RawWrangle | true |
---|
p03Value | 1 |
---|
Type | step |
---|
WrangleText | rename type: header method: index sourcerownumber: 1 |
---|
p01Name | Option |
---|
p01Value | Use row(s) as column names |
---|
p02Name | Type |
---|
p02Value | Use a single row to name columns |
---|
p03Name | Row number |
---|
SearchTerm | Rename column with row(s) |
---|
|
You can remove the quotes now. Note that the following applies to two columns:
D trans |
---|
RawWrangle | true |
---|
p03Value | '' |
---|
Type | step |
---|
WrangleText | replacepatterns col: Address,Role with: '' on: '\"' global: true |
---|
p01Name | Column |
---|
p01Value | Address,Role |
---|
p02Name | Find |
---|
p02Value | '\"' |
---|
p03Name | Replace |
---|
p04Value | true |
---|
p04Name | Match all occurrences |
---|
SearchTerm | Replace text or patterns |
---|
|
Now, you can split up the Address
column. You can highlight one of the commas and the space after it in the column, but make sure that your final statement looks like the following:
D trans |
---|
RawWrangle | true |
---|
p03Value | ',' |
---|
Type | step |
---|
WrangleText | split type: on col: column1 on: ',' limit: 5 quote: '\"' |
---|
p01Name | Column |
---|
p01Value | column1 |
---|
p02Name | Option |
---|
p02Value | On pattern |
---|
p03Name | Match pattern |
---|
p04Value | 2 |
---|
p04Name | Number of Matches |
---|
SearchTerm | Split column |
---|
|
Notice that there is some dirtiness to the resulting Address3
column:
Address3 |
---|
CA 94601 |
CA, 94401 |
CA, 94105 |
Use the following to remove the comma. In this case, it's important to leave the space between the two values in the column, so the on
value should only be a comma. Below, the width
value is two single quotes:
D trans |
---|
RawWrangle | true |
---|
p03Value | '' |
---|
Type | step |
---|
WrangleText | replacepatterns col: Address3 with: '' on: ',' global: true |
---|
p01Name | Column |
---|
p01Value | Address3 |
---|
p02Name | Find |
---|
p02Value | ',' |
---|
p03Name | Replace |
---|
p04Value | true |
---|
p04Name | Match all occurrences |
---|
SearchTerm | Replace text or patterns |
---|
|
You can now split the Address3
column on the space delimiter:
D trans |
---|
RawWrangle | true |
---|
p03Value | ' ' |
---|
Type | step |
---|
WrangleText | splitpatterns col: Address3 type: on on: ' ' limit: 2 |
---|
p01Name | Column |
---|
p01Value | Address3 |
---|
p02Name | Option |
---|
p02Value | by delimiter |
---|
p03Name | Delimiter |
---|
p04Value | 2 |
---|
p04Name | Number of columns to create |
---|
SearchTerm | Split by delimiter |
---|
|
Results:
After you rename the columns, you should see the following:
LastName | FirstName | Role | Company | Address | City | State | Zipcode | Status |
---|
Wagner | Melody | VP of Engineering | Example.com | 123 Main Street | Oakland | CA | 94601 | Prospect |
Gruber | Hans | Director, IT | Example.com | 456 Broadway | Burlingame | CA | 94401 | Customer |
Franks | Mandy | Sr. Manager, Analytics | Tricorp | 789 Market Street | San Francisco | CA | 94105 | Customer |