Derive Transform

Note

Transforms are a part of the underlying language, which is not directly accessible to users. This content is maintained for reference purposes only. For more information on the user-accessible equivalent to transforms, see Transformation Reference.

Generate a new column where the values are the output of the value expression. Expression can be calculated based on values specified in the group parameter. Output column can be named as needed.

Basic Usage

String literal example:

derive type: single value: 'passed' as:'status'

Output: Generates a new column called status, each row of which contains passed for its value.

Column reference example:

derive type: single value:productName as:'orig_productName'

Output: Generates a new column called orig_productName, which contains all of the values in productName, effectively serving as a backup of the source column.

Function reference example:

derive type: single value:SQRT(POW(a,2) + POW(b,2)) as:'c'

Output: Generates a new column called c, which is the calculation of the Pythagorean theorem for the values stored in a and b. For more information on this example, see POW Function.

Window function example:

You can use window functions in your derive transforms:

derive type: multiple col: avgRolling value: ROLLINGAVERAGE(POS_Sales, 7, 0) group: saleDate order: saleDate

Output: Calculate the value in the column of avgRolling to be the rolling average of the POS_Sales values for the preceding seven days, grouped and ordered by the saleDate column. For more information, see Window Functions.

Syntax and Parameters

derive type: single|multiple value:(expression) [order: order_col] [group: group_col] [as:'new_column_name']

Token	Required?	Data Type	Description
derive	Y	transform	Name of the transform
type	Y	string	Type of formula: `single` (single row) or `multiple` (multi-row)
value	Y	string	Expression that generates the value to store in the new column
order	N	string	Column or column names by which to sort the dataset before the `value` expression is applied
group	N	string	If you are using aggregate or window functions, you can specify a `group` expression to identify the subset of records to apply the `value` expression.
as	N	string	Name of the newly generated column

For more information on syntax standards, see Language Documentation Syntax Notes.

type

Type of formula in the transformation:

Value	Description
single	Formula calculations are contained within a single row of values.
multiple	Formula calculations involve multiple rows of inputs.

Usage Notes:

Required?	Data Type
Yes	String (`single` or `multiple`)

value

Identifies the expression that is applied by the transform. The value parameter can be one of the following types:

test predicates that evaluate to Boolean values (value: myAge == '30' yields a true or false value), or
computational expressions ( value: abs(pow(myCol,3)) ).

The expected type of value expression is determined by the transform type. Each type of expression can contain combinations of the following:

literal values: value: 'Hello, world'
column references: value: amountOwed * 10
functions: value: left(myString, 4)
combinations: value: abs(pow(myCol,3))

The types of any generated values are re-inferred by the platform.

Usage Notes:

Required?	Data Type
Yes	String (literal, column reference, function call, or combination)

order

This parameter specifies the column on which to sort the dataset before applying the specified function. For combination sort keys, you can add multiple comma-separated columns.

Note

The order parameter must unambiguously specify an ordering for the data, or the generated results may vary between job executions.

Note

If it is present, the dataset is first grouped by the group value before it is ordered by the values in the order column.

Note

The order column does not need to be sorted before the transform is executed on it.

Tip

To sort in reverse order, prepend the column name with a dash (-MyDate).

Usage Notes:

Required?	Data Type
No	String (column name)

group

Identifies the column by which the dataset is grouped for purposes of applying the transform.

Note

Transforms that use the group parameter can result in non-deterministic re-ordering in the data grid. However, you should apply the group parameter, particularly on larger datasets, or your job may run out of memory and fail. To enforce row ordering, you can use the sort transform. For more information, see Sort Transform.

The ProdId column contains three values: P001, P002, and P003, and you add the following transformation:

derive type: single value:SUM(Sales) group:ProdId as:'SalesByProd'

The above transform generates the SalesByProd column, which contains the sum of the Sales values, as grouped according to the three product identifiers.

If the value parameter contains aggregate or window functions, you can apply the group parameter to specify subsets of records across which the value computation is applied.

Note

You can specify one or more columns by which to group using comma-separated column references.

Usage Notes:

Required?	Data Type
No	String (column name)

Name of the new column that is being generated. If the as parameter is not specified, a default name is used.

derive type: single value:(colX * colY) as:'areaXY'

Output: Generates a new column containing the product of the values in columns colX and colY. New column is explicitly named, areaXY.

Usage Notes:

Required?	Data Type
No	String (column name)

Examples

Tip

For additional examples, see Common Tasks.

Example - Basic Derive Examples

The following dataset is used for performing some simple statistical analysis using the derive transform.

Source:

StudentId	TestNumber	TestScore
S001	1	78
S001	2	85
S001	3	81
S002	1	84
S002	2	92
S002	3	77
S003	1	83
S003	2	88
S003	3	85

Transformation:

First, you can calculate the total average score across all tests:

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	average(Score)
Parameter: New column name	'avgScore'

In their unformatted form, the output values are lengthy. You can edit the above transform to nest the value statement with proper formatting using the numformat function:

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	numformat(average(Score),'##.00')
Parameter: New column name	'avgScore'

You might also be interested to know how individual students fared and to identify which tests caused the greatest challenges for the students:

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	numformat(average(Score),'##.00')
Parameter: Group rows by	StudentId
Parameter: New column name	'avgScorebyStudentId'

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	numformat(average(Score),'##.00')
Parameter: Group rows by	TestNumber
Parameter: New column name	'avgScoreByTest'

To calculate total scores for each student, add the following. Since each individual test score is a whole number, no rounding formatting is required.

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	sum(Score)
Parameter: Group rows by	StudentId
Parameter: New column name	'totalScoreByStudentId'

Results:

StudentId	TestNumber	TestScore	avgScore	avgScorebyStudentId	ScoreByTest	totalScoreByStudentId
S001	1	78	83.67	81.33	81.67	244
S001	2	85	83.67	81.33	88.33	244
S001	3	81	83.67	81.33	81.00	244
S002	1	84	83.67	84.33	81.67	253
S002	2	92	83.67	84.33	88.33	253
S002	3	77	83.67	84.33	81.00	253
S003	1	83	83.67	85.33	81.67	256
S003	2	88	83.67	85.33	88.33	256
S003	3	85	83.67	85.33	81.00	256

Example - Rounding Functions

This example demonstrates how the rounding functions work together.

Functions:

Item	Description
FLOOR Function	Computes the largest integer that is not more than the input value. Input can be an Integer, a Decimal, a column reference, or an expression.
CEILING Function	Computes the ceiling of a value, which is the smallest integer that is greater than the input value. Input can be an Integer, a Decimal, a column reference, or an expression.
ROUND Function	Rounds input value to the nearest integer. Input can be an Integer, a Decimal, a column reference, or an expression. Optional second argument can be used to specify the number of digits to which to round.
MOD Function	Returns the modulo value, which is the remainder of dividing the first argument by the second argument. Equivalent to the `%` operator.

Source:

rowNum	X
1	-2.5
2	-1.2
3	0
4	1
5	1.5
6	2.5
7	3.9
8	4
9	4.1
10	11

Transformation:

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	FLOOR(X)
Parameter: New column name	'floorX'

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	CEILING(X)
Parameter: New column name	'ceilingX'

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	ROUND (X)
Parameter: New column name	'roundX'

Transformation Name	`New formula`
Parameter: Formula type	Single row formula
Parameter: Formula	(X % 2)
Parameter: New column name	'modX'

Results:

rowNum	X	modX	roundX	ceilingX	floorX
1	-2.5		-2	-2	-3
2	-1.2		-1	-1	-2
3	0	0	0	0	0
4	1	1	1	1	1
5	1.5		2	2	1
6	2.5		3	3	2
7	3.9		4	4	3
8	4	0	4	4	4
9	4.1		4	5	4
10	11	1	11	11	11

In this section:

Derive Transform

Basic Usage

Syntax and Parameters

Examples

Search results