Generate a new column where the values are the output of the |
String literal example:
derive type: single value: 'passed' as:'status' |
Output: Generates a new column called status
, each row of which contains passed
for its value.
Column reference example:
derive type: single value:productName as:'orig_productName' |
Output: Generates a new column called orig_productName
, which contains all of the values in productName
, effectively serving as a backup of the source column.
Function reference example:
derive type: single value:SQRT(POW(a,2) + POW(b,2)) as:'c' |
Output: Generates a new column called c
, which is the calculation of the Pythagorean theorem for the values stored in a
and b
. For more information on this example, see POW Function.
Window function example:
You can use window functions in your derive
transforms:
derive type: multiple col: avgRolling value: ROLLINGAVERAGE(POS_Sales, 7, 0) group: saleDate order: saleDate |
Output: Calculate the value in the column of avgRolling
to be the rolling average of the POS_Sales
values for the preceding seven days, grouped and ordered by the saleDate
column. For more information, see Window Functions.
derive type: single|multiple value:(expression) [order: order_col] [group: group_col] [as:'new_column_name'] |
Token | Required? | Data Type | Description |
---|---|---|---|
derive | Y | transform | Name of the transform |
type | Y | string | Type of formula: single (single row) or multiple (multi-row) |
value | Y | string | Expression that generates the value to store in the new column |
order | N | string | Column or column names by which to sort the dataset before the value expression is applied |
group | N | string | If you are using aggregate or window functions, you can specify a group expression to identify the subset of records to apply the value expression. |
as | N | string | Name of the newly generated column |
Type of formula in the transformation:
Value | Description |
---|---|
single | Formula calculations are contained within a single row of values. |
multiple | Formula calculations involve multiple rows of inputs. |
Required? | Data Type |
---|---|
Yes | String (single or multiple ) |
Required? | Data Type |
---|---|
Yes | String (literal, column reference, function call, or combination) |
This parameter specifies the column on which to sort the dataset before applying the specified function. For combination sort keys, you can add multiple comma-separated columns.
NOTE: The order parameter must unambiguously specify an ordering for the data, or the generated results may vary between job executions. |
NOTE: If it is present, the dataset is first grouped by the |
NOTE: The |
Tip: To sort in reverse order, prepend the column name with a dash ( |
Required? | Data Type |
---|---|
No | String (column name) |
The ProdId
column contains three values: P001
, P002
, and P003
, and you add the following transformation:
derive type: single value:SUM(Sales) group:ProdId as:'SalesByProd' |
The above transform generates the SalesByProd
column, which contains the sum of the Sales
values, as grouped according to the three product identifiers.
If the value parameter contains aggregate or window functions, you can apply the group
parameter to specify subsets of records across which the value computation is applied.
You can specify one or more columns by which to group using comma-separated column references.
Required? | Data Type |
---|---|
No | String (column name) |
derive type: single value:(colX * colY) as:'areaXY' |
Output: Generates a new column containing the product of the values in columns colX
and colY
. New column is explicitly named, areaXY
.
Required? | Data Type |
---|---|
No | String (column name) |
The following dataset is used for performing some simple statistical analysis using the derive
transform.
Source:
StudentId | TestNumber | TestScore |
---|---|---|
S001 | 1 | 78 |
S001 | 2 | 85 |
S001 | 3 | 81 |
S002 | 1 | 84 |
S002 | 2 | 92 |
S002 | 3 | 77 |
S003 | 1 | 83 |
S003 | 2 | 88 |
S003 | 3 | 85 |
Transformation:
First, you can calculate the total average score across all tests:
In their unformatted form, the output values are lengthy. You can edit the above transform to nest the value
statement with proper formatting using the numformat
function:
You might also be interested to know how individual students fared and to identify which tests caused the greatest challenges for the students:
To calculate total scores for each student, add the following. Since each individual test score is a whole number, no rounding formatting is required.
Results:
StudentId | TestNumber | TestScore | avgScore | avgScorebyStudentId | ScoreByTest | totalScoreByStudentId |
---|---|---|---|---|---|---|
S001 | 1 | 78 | 83.67 | 81.33 | 81.67 | 244 |
S001 | 2 | 85 | 83.67 | 81.33 | 88.33 | 244 |
S001 | 3 | 81 | 83.67 | 81.33 | 81.00 | 244 |
S002 | 1 | 84 | 83.67 | 84.33 | 81.67 | 253 |
S002 | 2 | 92 | 83.67 | 84.33 | 88.33 | 253 |
S002 | 3 | 77 | 83.67 | 84.33 | 81.00 | 253 |
S003 | 1 | 83 | 83.67 | 85.33 | 81.67 | 256 |
S003 | 2 | 88 | 83.67 | 85.33 | 88.33 | 256 |
S003 | 3 | 85 | 83.67 | 85.33 | 81.00 | 256 |