This example shows how to use parsing functions for evaluating input values against the function-specific data type. |
Functions:
Source:
The following table contains data on a series of races.
raceId | disqualified | date | racerId | time_sc |
---|---|---|---|---|
1 | FALSE | 2/1/20 | 1 | 24.22 |
2 | f | 2/8/20 | 1 | 25 |
3 | no | 2/8/20 | 1 | 24.11 |
4 | n | 1-Feb-20 | 2 | 26.1 |
5 | TRUE | 8-Feb-20 | 2.2 | -25.22 |
6 | t | 2/8/2020 10:16:00 AM | 2 | 25.44 |
7 | yes | 2/1/20 | 3 | 24 |
8 | y | 2/8/20 | 33 | 29.22 |
9 | 0 | 2/8/20 | 3 | 24.78 |
10 | 1 | 1-Feb-20 | 4 | 26.2.1 |
11 | FALSE | 8-Feb-20 | 28.22 sec | |
12 | FALSE | 2/8/2020 10:16:00 AM | 4 | 27.11 |
As you can see, this dataset has variation in values (FALSE, f, no, n) and problems with the data.
Transformation:
When the data is first imported, it may be properly typed for each column. To use the parsing functions, these columns should be converted to String data type:
Now, you can parse individual columns.
disqualified column:
racerId column:
time_sc column:
date column:
For the date column, the PARSEDATE function supports a default set of Datetime formats. Since some of the listed formats are different from these defaults, you must specify all of the formats. These formats are specified as an array of string values as the second argument of the function:
Tip: For the PARSEDATE function, it's useful to use the Preview to verify that all of the dates in the column are represented in the array of output formats. You can see the available output formats through the data type menu at the top of a column in the Transformer Page. |
After all of the date values have been standardized to the output format of the PARSEDATE function, you may choose to remove the time element of the values:
Results:
After executing the above steps, the data appears as follows. Notes on each column's output are below the table.
raceId | disqualified | date | racerId | time_sc |
---|---|---|---|---|
1 | false | 2020-02-01 | 1 | 24.22 |
2 | false | 2020-02-08 | 1 | 25 |
3 | false | 2020-02-08 | 1 | 24.11 |
4 | false | 2020-02-01 | 2 | 26.1 |
5 | true | 2020-02-08 | null | -25.22 |
6 | true | 2020-02-08 | 2 | 25.44 |
7 | true | 2020-02-01 | 3 | 24 |
8 | true | 2020-02-08 | 33 | 29.22 |
9 | false | 2020-02-08 | 3 | 24.78 |
10 | true | 2020-02-01 | 4 | null |
11 | false | 2020-02-08 | null | null |
12 | false | 2020-02-08 | 4 | 27.11 |
disqualified column:
false
or true
.racerId column:
33
remains, since it is a valid Integer. This value should be fixed manually.time_sc:
25.00
as 25
in output.-25.22
remains. However, since this is time-based data, it needs to be fixed.date column:
yyyy-MM-dd HH:mm:ss
. Time data has been stripped.