Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

Source:

The following dataset contains counts of support emails processed by each member of the support team for individual customers over a six-month period. In this case, you are interested in the total number of emails processed for each customer.

Unfortunately, the data is ragged, as there are no entries for a support team member if he or she has not answered an email for a customer. 

custIdstartDateendDatesupportEmailCount
C0017/15/201512/31/2015["Max":"2","Ted":"0","Sally":"12","Jack":"6","Sue":"4"]
C0027/15/201512/31/2015["Sally":"4","Sue":"3"]
C0037/15/201512/31/2015["Ted":"12","Sally":"2"]
C0047/15/201512/31/2015["Jack":"7","Sue":"4","Ted":"5"]

If the data is imported from a CSV file, you might need to make some simple Replace Text or Pattern transformations to clean up the data to look like the above example.

Transformation:

Use the following transformation to extract just the numeric values from the supportEmailCount array:

D trans
RawWrangletrue
Typestep
WrangleTextextractlist col:supportEmailCount on:`{digit}+`
p01NameColumn
p01ValuesupportEmailCount
p02NamePattern matching elements in list
p02Value`{digit}+`
SearchTermExtract matches into Array

You should now have a column extractlist_supportEmailCount containing a ragged array. You can use the following transformations to convert this data to a comma-separated list of values:

D trans
RawWrangletrue
p03Value''
Typestep
WrangleTextreplace col: extractlist_supportEmailCount on: `[` with: '' global: true
p01NameColumn
p01Valueextractlist_supportEmailCount
p02NameFind
p02Value`[`
p03NameReplace with
p04Valuetrue
p04NameMatch all occurrences
SearchTermReplace text or pattern

D trans
RawWrangletrue
p03Value''
Typestep
WrangleTextreplace col: extractlist_supportEmailCount on: `]` with: '' global: true
p01NameColumn
p01Valueextractlist_supportEmailCount
p02NameFind
p02Value`]`
p03NameReplace with
p04Valuetrue
p04NameMatch all occurrences
SearchTermReplace text or pattern

D trans
RawWrangletrue
p03Value''
Typestep
WrangleTextreplace col: extractlist_supportEmailCount on: `"` with: '' global: true
p01NameColumn
p01Valueextractlist_supportEmailCount
p02NameFind
p02Value`"`
p03NameReplace with
p04Valuetrue
p04NameMatch all occurrences
SearchTermReplace text or pattern

Convert the column to String data type.

You can now split out the column into separate columns containing individual values in the modified source. The limit parameter specifies the number of splits to create, resulting in 5 new columns, which is the maximum number of entries in the source arrays.

D trans
RawWrangletrue
p03Value','
Typestep
WrangleTextsplit col:extractlist_supportEmailCount on:',' limit:4
p01NameColumn
p01Valueextractlist_supportEmailCount
p02NameOption
p02ValueOn pattern
p03NameMatch pattern
p04Value4
p04NameNumber of columns to create
SearchTermSplit by delimiter

You might have to set the type for each generated column to Integer. If you try to use a New Formula transformation to calculate the sum of all of the generated columns, it only returns values for the first row because the missing rows are null values.

In the columns containing null values, select the missing value bar in the data histogram. Select the Replace suggestion card, and modify the transformation to write a 0 in place of the null value, as follows:

D trans
RawWrangletrue
p03Valueismissing([extractlist_supportEmailCount3])
Typestep
WrangleTextset col: extractlist_supportEmailCount3 value: '0' row: ismissing([extractlist_supportEmailCount3])
p01NameColumns
p01Valueextractlist_supportEmailCount3
p02NameFormula
p02Value'0'
p03NameGroup rows by
SearchTermEdit column with formula

Repeat this step for any other column containing null values.

You can now use the following to sum the values in the generated columns:

D trans
RawWrangletrue
Typestep
WrangleTextderive type:single value:(extractlist_supportEmailCount1 + extractlist_supportEmailCount2 + extractlist_supportEmailCount3 + extractlist_supportEmailCount4 + extractlist_supportEmailCount5)
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value(extractlist_supportEmailCount1 + extractlist_supportEmailCount2 + extractlist_supportEmailCount3 + extractlist_supportEmailCount4 + extractlist_supportEmailCount5)
SearchTermNew formula

Results:

After renaming the generated column to totalSupportEmails and dropping the columns used to create it, your dataset should look like the following:

custIdstartDateendDatesupportEmailCounttotalSupportEmails
C0017/15/201512/31/2015["Max":"2","Ted":"0","Sally":"12","Jack":"6","Sue":"4"]24
C0027/15/201512/31/2015["Sally":"4","Sue":"3"]7
C0037/15/201512/31/2015["Ted":"12","Sally":"2"]14
C0047/15/201512/31/2015["Jack":"7","Sue":"4","Ted":"5"]16