For many recipes, the first step is to split data from a single column into multiple columns. This section describes the various methods that can be used for splitting a single column into one or more columns, based on character- or pattern-matching or position within the column's values.

Split by Delimiter

When data is initially imported into , data in each row may be split on a single delimiter. In the following example, you can see that the tab key is a single clear delimiter:

<IMSI^MSIDN^IMEI>	DATETTIME/TIMEZONE OFFSET/DURATION	MSWCNT:BASCNT^BASTRA	CALL_TYPE/CORRESP_IDN/DISCONNECT REASON
<310170097665881^13011330554^011808005351311>	2014-12-12T00:06:13/-5/1.55	MSC001:BSC002^BTS783	MOT/00000000000:11
<310170097665881^13011330554^011808005351311>	2014-12-12T02:27:26/-5/0.00	MSC001:BSC002^BTS783	SMS/00000000000:
<310-170-097665881^13011330554^011808005351311>	2014-12-12T03:24:20/-5/0	MSC001:BSC001^BTS783	SMS/00000000000:

However, when this data is imported, it may be rendered in the data grid in the following structure:

column2column3column4column5
<IMSI^MSIDN^IMEI>
DATETTIME/TIMEZONE OFFSET/DURATION
MSWCNT:BASCNT^BASTRA
CALL_TYPE/CORRESP_IDN:DISCONNECT REASON
<310170097665881^13011330554^011808005351311>
2014-12-12T00:06:13/-5/1.55
MSC001:BSC002^BTS783
MOT/00000000000:11
<310170097665881^13011330554^011808005351311>
2014-12-12T02:27:26/-5/0.00
MSC001:BSC002^BTS783
SMS/00000000000:
<310-170-097665881^13011330554^011808005351311>
2014-12-12T03:24:20/-5/0
MSC001:BSC001^BTS783
SMS/00000000000:

Notes:

Split on single delimiter

For column2, you can split the column into separate columns based on the caret delimiter:

NOTE: The Number of columns to create value reflects the total number of new columns to generate.

 

Results:

Below is how the data in column2 is transformed:

column1column6column7
<IMSI
MSIDN
IMEI>
<310170097665881
13011330554
011808005351311>
<310170097665881
13011330554
011808005351311>
<310-170-097665881
13011330554
011808005351311>

For column3, suppose that you want to keep the DATETIME and TIMEZONE OFFSET values in the same column, preserving the forward slash to demarcate these two values. The DURATION values are to be split into a separate column:

Split column by multiple delimiters

After splitting column3, the data resembled the following:

column3
DATETTIME/TIMEZONE OFFSET
2014-12-12T00:06:13/-5
2014-12-12T02:27:26/-5
2014-12-12T03:24:20/-5

Suppose you want to break down the components of this date-time data into separate columns for year, month, day, hour, minute, second, and offset. The following could be use to do so:

Split column between delimiters

Suppose that for column4, you want to split the column such that the middle part section is removed. You could use the previous transformation and then delete the middle column. You can also use the following transformation, which identifies that starting and editing delimiters that demarcate the separator between fields, effectively removing the middle column:

Split by Position

You can also perform column splits based on numerical positions in column values. These splitting options are useful for highly regular data that is of consistent length.

Suppose you have the following coordination information in three dimensions (x, y, and z). Note that the data is very regular, with leading zeroes for values that are less than 1000.

column1
POSXPOSYPOSZ
000100040001
012405210555
100220046554
202056789011
379274329832

Split column by positions

The above data could be split based on positions within a column's value:

Results:

column2column3column4
POSXPOSYPOSZ
000100040001
012405210555
100220046554
202056789011
379274329832

Split columns between positions

Suppose that you wish to split the above source data such that the middle column is removed:

Results:

column2column3
POSXPOSZ
00010001
01240555
10026554
20209011
37929832

Split column at regular interval

The above transformation could be simplified even further, since the splits happen at regular intervals:

Results:

The results would be the same as the first example.

Encoding Issues

If you are attempting to split columns based on non-ASCII characters that appear in the dataset, your transformations may fail. 

In these cases, you should change the encoding that is applied to the dataset.

Steps:

  1. In the Import Data page, select the dataset to import. 
  2. When the dataset card appears in the right column, click the Edit Settings link.
  3. From the drop-down, select a more appropriate encoding to apply to the file. 
  4. Import the data and wrangle.
  5. Try your split transformation on the dataset.

NOTE: Administrators can change the encoding that is applied by default to all file-based imported to the application. See Configure Global File Encoding Type.

Splitting Rows

When a dataset is imported, the application attempts to split the data into individual rows, based on any available end of line delimiters. This transformation is performed automatically and is not included in your initial set of steps. 

If the data is not consistently formatted, the rows may not be properly split. If so, you can disable the automatic splitting of rows. 

Steps:

  1. In the Import Data page, select the dataset to import. 
  2. When the dataset card appears in the right column, click the Edit Settings link.
  3. Deselect the Detect Structure checkbox. 
  4. Import the data and wrangle.

The steps used to detect structure are listed as the first steps of your recipe, which allows you to modify them as needed. For more information, see Initial Parsing Steps.

See Import Data Page.