Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0821

This example illustrates the different uses of the following transformations to replace or extract cell data:

  • set - defines the values to use in a predefined column. See Set Transform.

    Tip

    Tip: Use the derive transform to generate a new column containing a defined set of values. See Derive Transform.

  • replace - replaces a string literal or pattern appearing in the values of a column with a specific string. See Replace Transform.
  • extract - extracts a pattern-based value from a column and stores it in a new column. See Extract Transform.

Source:

The following dataset contains contact information that has been gathered by your marketing platform from actions taken by visitors on your website. You must clean up this data and prepare it for use in an analytics platform.

LeadIdLastNameFirstNameTitlePhoneRequest
LE160301001JonesCharlesChief Technical Officer415-555-1212reg
LE160301002LyonsEdward 415-012-3456download whitepaper
LE160301003MartinMaryCEO510-555-5555delete account
LE160301004SmithTaliaEngineer510-123-4567free trial

 

Transformation:

Title column: For example, you first notice that some data is missing. Your analytics platform recognizes the string value, "#MISSING#" as an indicator of a missing value. So, you click the missing values bar in the Title column. Then, you select the Replace suggestion card. Note that the default replacement is a null value, so you click Edit and update it:

D trans
RawWrangletrue
Typestep
WrangleTextset col: Title value: if(ismissing([Title]),'#MISSING#',Title)
p01NameColumns
p01ValueTitle
p02NameFormula
p02Valueif(ismissing([Title]),'#MISSING#',Title)
SearchTermEdit column with formula

Request column: In the Request column, you notice that the reg entry should be cleaned up. Add the following transformation, which replaces that value:

D trans
RawWrangletrue
p03ValueRegistration
Typestep
WrangleTextreplace col:Request with:'Registration' on:`{start}reg{end}`
p01NameColumn
p01ValueRequest
p02NameFind
p02Value`{start}reg{end}`
p03NameReplace with
SearchTermReplace text or pattern

The above transformation uses a

D s lang
itempattern
rtrue
 as the expression of the on: parameter. This expression indicates to match from the start of the cell value, the string literal reg, and then the end of the cell value, which matches on complete cell values of reg only.

This transformation works great on the sample, but what happens if the value is Reg with a capital R? That value might not be replaced. To improve the transformation, you can modify the transformation with the following

D s lang
itempattern
 in the on parameter, which captures differences in capitalization:

D trans
RawWrangletrue
p03Value'Registration'
Typestep
WrangleTextreplace col:Request with:'Registration' on:`{start}{[R|r]}eg{end}`
p01NameColumn
p01ValueRequest
p02NameFind
p02Value`{start}{[R|r]}eg{end}`
p03NameReplace with
SearchTermReplace text or pattern

Add the above transformation to your recipe. Then, it occurs to you that all of the values in the Request column should be capitalized in title or proper case:

D trans
RawWrangletrue
Typestep
WrangleTextset col:Request value:proper(Request)
p01NameColumns
p01ValueRequest
p02NameFormula
p02Valueproper(Request)
SearchTermEdit column with formula

Now, all values are capitalized as titles.

Phone column: You might have noticed some issues with the values in the Phone column. In the United States, the prefix 555 is only used for gathering information; these are invalid phone numbers.

In the data grid, you select the first instance of 555 in the column. However, it selects all instances of that pattern, including ones that you don't want to modify. In this case, continue your selection by selecting the similar instance of 555 in the other row. In the suggestion cards, you click the Replace Text or Pattern transformation.

Notice, however, that the default Replace Text or Pattern transformation has also highlighted the second 555 pattern in one instance, which could be a problem in other phone numbers not displayed in the sample. You must modify the selection pattern for this transformation. In the on: parameter below, the

D s lang
itempattern
 has been modified to match only the instances of 555 that appear in the second segment in the phone number format:

D trans
RawWrangletrue
p03Value'#INVALID#'
Typestep
WrangleTextreplace col: Phone on: `{start}%{3}-555-%*{end}` with: '#INVALID#' global: true
p01NameColumn
p01ValuePhone
p02NameFind
p02Value`{start}%{3}-555-%*{end}`
p03NameReplace with
p04Valuetrue
p04NameMatch all occurrences
SearchTermReplace text or pattern

Note the wildcard construct has been added (%*). While it might be possible to add a pattern that matches on the last four characters exactly (%{4}), that matching pattern would not capture the possibility of a phone number having an extension at the end of it. The above expression does.

Info

NOTE: The above transformation creates values that are mismatched with the Phone Number data type. In this example, however, these mismatches are understood to be for the benefit of the system consuming your

D s item
output
output
.

LeadId column: You might have noticed that the lead identifier column (LeadId) contains some embedded information: a date value and an identifier for the instance within the day. The following steps can be used to break out this information. The first one creates a separate working column with this information, which allows us to preserve the original, unmodified column:

D trans
RawWrangletrue
p03Value'LeadIdworking'
Typestep
WrangleTextderive type:single value:LeadId as:'LeadIdworking'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueLeadId
p03NameNew column name
SearchTermNew formula

You can now work off of this column to create your new ones. First, you can use the following replace transformation to remove the leading two characters, which are not required for the new columns:

D trans
RawWrangletrue
p03Value''
Typestep
WrangleTextreplace col:LeadIdworking with:'' on:'LE'
p01NameColumn
p01ValueLeadIdworking
p02NameFind
p02Value'LE'
p03NameReplace with
SearchTermReplace text or pattern

Notice that the date information is now neatly contained in the first characters of the working column. Use the following to extract these values to a new column:

D trans
RawWrangletrue
p03Value`{start}%{6}`
Typestep
WrangleTextextract col: LeadIdworking on: `{start}%{6}`
p01NameColumn to extract from
p01ValueLeadIdworking
p02NameOption
p02ValueCustom text or pattern
p03NameText to extract
SearchTermExtract text or pattern

The new LeadIdworking2 column now contains only the date information. Cleaning up this column requires reformatting the data, retyping it as a Datetime type, and then applying the dateformat function to format it to your satisfaction. These steps are left as a separate exercise.

For now, let's just rename the column:

D trans
RawWrangletrue
p03Value'LeadIdDate'
Typestep
WrangleTextrename col:LeadIdworking1 to:'LeadIdDate'
p01NameOption
p01ValueManual rename
p02NameColumn
p02ValueLeadIdworking1
p03NameNew column name
SearchTermRename columns

In the first working column, you can now remove the date information using the following:

D trans
RawWrangletrue
p03Value''
Typestep
WrangleTextreplace col: LeadIdworking on: `{start}%{6}` with: ''
p01NameColumn
p01ValueLeadIdworking
p02NameFind
p02Value`{start}%{6}`
p03NameReplace with
SearchTermReplace text or pattern

You can rename this column to indicate it is a daily identifier:

D trans
RawWrangletrue
p03Value'LeadIdDaily'
Typestep
WrangleTextrename col:LeadIdworking to:'LeadIdDaily'
p01NameOption
p01ValueManual rename
p02NameColumn
p02ValueLeadIdworking
p03NameNew column name
SearchTermRename columns

Results:

LeadIdLeadIdDailyLeadIdDateLastNameFirstNameTitlePhoneRequest
LE160301001001160301JonesCharlesChief Technical Officer#INVALID#Registration
LE160301002002160301LyonsEdward#MISSING#415-012-3456Download Whitepaper
LE160301003003160301MartinMaryCEO#INVALID#Delete Account
LE160301004004160301SmithTaliaEngineer510-123-4567Free Trial