A window transformation performs calculations on a row based on row values that are related to it. Windowing functions can perform calculations based on time, relative row positions, and rolling windows.

For example, you might wish to calculate the average percentage of CPU usage over 24-hour intervals based on log entries. From the rows of data, you can create a window function that calculates the average value in the CPU usage column over the 24-hour period, as defined based on date values for each log entry.

Key distinction:

  • In a window function, the output of each row's calculation is specific to the row. 
  • In an aggregate function, the output for a row is the same value for all rows that are used in the calculation. 
  • For more information on aggregation, see Create Aggregations.
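The distinction above can be illustrated outside the product with a short Python sketch. The sample rows and both function names are hypothetical and are not part of the application; the running sum simply stands in for any window-style calculation.

```python
from collections import defaultdict

# Hypothetical sample rows: (region, sales)
rows = [("East", 800), ("East", 1250), ("West", 1500), ("West", 800)]

def aggregate_sum(rows):
    """Aggregate: one total per group, repeated for every row in that group."""
    totals = defaultdict(int)
    for region, sales in rows:
        totals[region] += sales
    return [totals[region] for region, _ in rows]

def running_sum(rows):
    """Window-style: each row's value depends on its own position in the group."""
    seen = defaultdict(int)
    out = []
    for region, sales in rows:
        seen[region] += sales
        out.append(seen[region])
    return out

print(aggregate_sum(rows))  # [2050, 2050, 2300, 2300]
print(running_sum(rows))    # [800, 2050, 1500, 2300]
```

In the aggregate output, every row of a region carries the same total. In the window-style output, each row carries a value specific to that row.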

Basic Structure

You can use windowing functions with the following transformation types:

  • window - creates a new column called window
  • New formula - creates a new column that you name
  • Edit with formula - modifies the values in a column based on a formula that you specify.

Group by parameter

You can use the Group by parameter to define the column of values by which rows of data are grouped for calculation purposes. For example, if your Group by column contains months, your calculations are computed for each month represented in the column values.

NOTE: Transforms that use the group parameter can result in non-deterministic re-ordering in the data grid. However, you should apply the group parameter, particularly on larger datasets, or your job may run out of memory and fail.  To enforce row ordering, you can use the sort transform. For more information, see Sort Transform.

Order by parameter

When using window functions, you can use the Order by parameter to specify the column or columns by which to sort the output. 

Source:

The following table contains a company's sales data for four regions over three months.

Month | Sales | Region
2021-01-01 | 800 | East
2021-01-01 | 1500 | West
2021-01-01 | 1000 | North
2021-01-01 | 2000 | South
2021-02-01 | 1250 | East
2021-02-01 | 800 | West
2021-02-01 | 1100 | North
2021-02-01 | 700 | South
2021-03-01 | 900 | East
2021-03-01 | 1000 | West
2021-03-01 | 1400 | North
2021-03-01 | 800 | South

Transformation:

The following transformation calculates the rolling average of sales. You apply the ROLLINGAVERAGE function and specify that the results are to be ordered by the Sales column. With the arguments 0 and 1, each average covers the current row and the one row that follows it.

Transformation Name Window
Parameter: Formulas ROLLINGAVERAGE (Sales, 0,1)
Parameter: Order by Sales

Results:

The following dataset shows the rolling average, with rows ordered by the Sales column.

Month | Sales | Region | RollingAverage
2021-02-01 | 700 | South | 750
2021-01-01 | 800 | East | 800
2021-02-01 | 800 | West | 800
2021-03-01 | 800 | South | 850
2021-03-01 | 900 | East | 950
2021-01-01 | 1000 | North | 1000
2021-03-01 | 1000 | West | 1050
2021-02-01 | 1100 | North | 1175
2021-02-01 | 1250 | East | 1325
2021-03-01 | 1400 | North | 1450
2021-01-01 | 1500 | West | 1750
2021-01-01 | 2000 | South | 2000
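To see why the Order by parameter matters, the following Python sketch recomputes the rolling average outside the product. The function name rolling_average is hypothetical, and the sketch assumes that the second and third arguments count the rows before and after the current row, which is consistent with the results above.

```python
def rolling_average(values, before, after):
    """Average each value with up to `before` preceding and `after` following values."""
    result = []
    for i in range(len(values)):
        window = values[max(0, i - before): i + after + 1]
        result.append(sum(window) / len(window))
    return result

# Sales column in its original row order.
sales = [800, 1500, 1000, 2000, 1250, 800, 1100, 700, 900, 1000, 1400, 800]

unordered = rolling_average(sales, 0, 1)        # window follows the source order
ordered = rolling_average(sorted(sales), 0, 1)  # window follows Order by Sales

print(unordered[:3])  # [1150.0, 1250.0, 1500.0]
print(ordered[:3])    # [750.0, 800.0, 800.0]
```

The same function over the same values yields different averages, because the ordering decides which row counts as the following row.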

Compute over Time Windows

You may need to create windows of time within your data that are not cleanly segmented by basic units of time measurement. For example, you may need to create a custom time period, called a session, based on timestamps recorded in event-based data. 

A session is usually defined as a group of events that occur within a given time frame. For example, you may need to perform calculations based on five-minute intervals within your logging data. If a user opens your shopping website, logs in, searches items, and then logs out within a five-minute interval, those events can be grouped under a single session. However, if the user's interaction lasted six minutes, the logged events may span multiple windowed sessions in the data.

You can use the SESSION function to create time boxes based on a time period that you specify. When the function is applied to your column of timestamp values, the application assigns an ID to events that belong to the same session.

From the following example, you can create a Session ID. After you create the session ID, you can find the volume of data consumed by the individual user.

Source:

User Name | TimeStamp | Activity | Volume (in Kb)
Bob | 02/11/21 08:01:13 | Read | 1024
William | 02/11/21 08:01:00 | Read | 1024
John | 02/11/21 08:01:17 | Read | 1024
Christy | 02/11/21 08:01:17 | Read | 1024
William | 02/11/21 08:03:33 | Read | 520
Christy | 02/11/21 08:02:01 | Password change | 1024
Bob | 02/11/21 08:07:23 | Adding items to cart | 2048
William | 02/11/21 08:05:45 | Read | 520
William | 02/11/21 08:11:56 | Account settings | 2048
John | 02/11/21 08:15:11 | Password change | 2048
Bob | 02/11/21 08:34:00 | Proceeding to payment | 2048
Bob | 02/11/21 08:43:03 | logout | 2048
Christy | 02/11/21 09:03:43 | Read | 1024
Christy | 02/11/21 09:10:00 | logout | 1024

Transformation:

Transformation Name Window
Parameter: Formulas SESSION (TimeStamp, 5, minute)
Parameter: Group by User Name
Parameter: Order by TimeStamp

Since the new column is named window, you should rename it:

Transformation Name Rename columns
Parameter: Option Manual rename
Parameter: Column window
Parameter: New column name SessionID

With this session ID, you can calculate the maximum volume of data consumed in each session for each user.

Transformation Name New formula
Parameter: Formula type Multiple row formula
Parameter: Formula MAX(Volume (in Kb))
Parameter: Sort rows by SessionID
Parameter: Group rows by User Name, SessionID
Parameter: New column name Volume_Consumed (in Kb)

Results:

User Name | TimeStamp | Activity | Volume (in Kb) | SessionID | Volume_Consumed (in Kb)
William | 02/11/21 08:01:00 | Read | 1024 | 1 | 1024
William | 02/11/21 08:03:33 | Read | 520 | 1 | 1024
William | 02/11/21 08:05:45 | Read | 520 | 1 | 1024
William | 02/11/21 08:11:56 | Account settings | 2048 | 2 | 2048
Bob | 02/11/21 08:01:13 | Read | 1024 | 1 | 1024
Bob | 02/11/21 08:07:23 | Adding items to cart | 2048 | 2 | 2048
Bob | 02/11/21 08:34:00 | Proceeding to payment | 2048 | 3 | 2048
Bob | 02/11/21 08:43:03 | logout | 2048 | 4 | 2048
Christy | 02/11/21 08:01:17 | Read | 1024 | 1 | 1024
Christy | 02/11/21 08:02:01 | Password change | 1024 | 1 | 1024
Christy | 02/11/21 09:03:43 | Read | 1024 | 2 | 1024
Christy | 02/11/21 09:10:00 | logout | 1024 | 3 | 1024
John | 02/11/21 08:01:17 | Read | 1024 | 1 | 1024
John | 02/11/21 08:15:11 | Password change | 2048 | 2 | 2048
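The session IDs above can be reproduced with a short Python sketch. It is not the product's implementation: it assumes that a new session starts whenever the gap since the same user's previous event exceeds five minutes, which is consistent with the results shown, and it assumes the timestamps are in month/day/year order. The names assign_sessions and max_volume_per_session are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (User Name, TimeStamp, Volume in Kb) from the source table above.
events = [
    ("Bob", "02/11/21 08:01:13", 1024), ("William", "02/11/21 08:01:00", 1024),
    ("John", "02/11/21 08:01:17", 1024), ("Christy", "02/11/21 08:01:17", 1024),
    ("William", "02/11/21 08:03:33", 520), ("Christy", "02/11/21 08:02:01", 1024),
    ("Bob", "02/11/21 08:07:23", 2048), ("William", "02/11/21 08:05:45", 520),
    ("William", "02/11/21 08:11:56", 2048), ("John", "02/11/21 08:15:11", 2048),
    ("Bob", "02/11/21 08:34:00", 2048), ("Bob", "02/11/21 08:43:03", 2048),
    ("Christy", "02/11/21 09:03:43", 1024), ("Christy", "02/11/21 09:10:00", 1024),
]

def assign_sessions(events, gap=timedelta(minutes=5)):
    """Group by user, order by timestamp, and number sessions by time gap."""
    by_user = defaultdict(list)
    for user, stamp, volume in events:
        # Assumption: timestamps are month/day/year.
        when = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
        by_user[user].append((when, volume))
    result = []
    for user, items in by_user.items():
        session_id, previous = 0, None
        for when, volume in sorted(items):
            if previous is None or when - previous > gap:
                session_id += 1  # gap too large: start a new session
            previous = when
            result.append((user, when, volume, session_id))
    return result

def max_volume_per_session(sessions):
    """Maximum volume for each (user, session ID) pair."""
    peaks = defaultdict(int)
    for user, _, volume, session_id in sessions:
        peaks[(user, session_id)] = max(peaks[(user, session_id)], volume)
    return dict(peaks)

sessions = assign_sessions(events)
peaks = max_volume_per_session(sessions)
print(peaks[("William", 1)], peaks[("William", 2)])  # 1024 2048
```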

Calculate over preceding and following rows

The PREV and NEXT functions enable you to fetch data from a previous row or a subsequent row, which is helpful for identifying relative changes or trends in your data.  

Source:

The following dataset contains orders for different product types over a given time period. You can apply the PREV and NEXT functions to calculate the previous orders and the next orders to analyze the trend of orders and derive the average of orders for a product group.

Product_Type | Order_date | Order
Laptop | 2021-01-05 | 300
Laptop | 2021-01-26 | 1780
Laptop | 2021-01-09 | 500
Laptop | 2021-01-31 | 1200
SmartPhone | 2021-01-24 | 1400
SmartPhone | 2021-01-26 | 2200
SmartPhone | 2021-01-07 | 700
Tablet | 2021-01-21 | 600
Tablet | 2021-01-23 | 900

Transformation:

You can calculate the percentage change between consecutive orders. The following transformation groups the rows by product type and sorts them by the Order column. It then subtracts the previous order from the current order and divides the difference by the previous order to calculate the percent change between the rows:

Transformation Name Window
Parameter: Formulas (Order - PREV(Order, 1)) / PREV(Order, 1) * 100
Parameter: Group by Product_Type
Parameter: Order by Order

After you rename the column to ChangeinOrder, you can apply the NUMFORMAT function to clean up and format the ChangeinOrder values. The following transformation reformats the ChangeinOrder column to display two decimal places.

Transformation Name Edit with formula
Parameter: Column ChangeinOrder
Parameter: Formula NUMFORMAT(ChangeinOrder, '##.##')

Similarly, you can apply the NEXT function to retrieve the order value from the following row. In the results below, this value appears in the NEXTOrder column.

Results:

Product_Type | Order_date | Order | NEXTOrder | ChangeinOrder
Laptop | 2021-01-05 | 300 | 500 |
Laptop | 2021-01-09 | 500 | 1200 | 66.67
Laptop | 2021-01-31 | 1200 | 1780 | 140
Laptop | 2021-01-26 | 1780 | | 48.33
SmartPhone | 2021-01-07 | 700 | 1400 |
SmartPhone | 2021-01-24 | 1400 | 2200 | 100
SmartPhone | 2021-01-26 | 2200 | | 57.14
Tablet | 2021-01-21 | 600 | 900 |
Tablet | 2021-01-23 | 900 | | 50
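As a cross-check of the results, the following Python sketch mirrors the PREV and NEXT logic for the same data. It is an illustration only: the function name next_and_change is hypothetical, and the sketch assumes rows are grouped by Product_Type and sorted by Order, as in the transformation above.

```python
from itertools import groupby

# (Product_Type, Order_date, Order) rows from the source table above.
orders = [
    ("Laptop", "2021-01-05", 300),
    ("Laptop", "2021-01-26", 1780),
    ("Laptop", "2021-01-09", 500),
    ("Laptop", "2021-01-31", 1200),
    ("SmartPhone", "2021-01-24", 1400),
    ("SmartPhone", "2021-01-26", 2200),
    ("SmartPhone", "2021-01-07", 700),
    ("Tablet", "2021-01-21", 600),
    ("Tablet", "2021-01-23", 900),
]

def next_and_change(rows):
    """Return (type, date, order, next order, percent change from previous order)."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # group by type, order by Order
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        members = list(group)
        for i, (product, order_date, order) in enumerate(members):
            prev_order = members[i - 1][2] if i > 0 else None
            next_order = members[i + 1][2] if i + 1 < len(members) else None
            change = None  # first row in a group has no previous row
            if prev_order is not None:
                change = round((order - prev_order) / prev_order * 100, 2)
            result.append((product, order_date, order, next_order, change))
    return result

results = next_and_change(orders)
for row in results:
    print(row)
```

The first row of each group has no previous row and the last row has no next row, so those values are left empty (None in this sketch).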


Fill Empty Values

You can use the FILL function to fill empty or null values in your data with the last non-empty value in the group.

Source:

For example, the following dataset contains the daily orders received. Note the missing values for the weekend. You can assume that no orders were received on Saturday and Sunday.

Date | DayOfWeek | OrdersDay | OrdersTotal
2021-03-10 | Wednesday | 100 | 100
2021-03-11 | Thursday | 112 | 212
2021-03-12 | Friday | 320 | 532
2021-03-13 | Saturday | |
2021-03-14 | Sunday | |
2021-03-15 | Monday | 300 | 832

Transformation:

First, clean up the data by filling the empty values in the OrdersDay column. The following formula tests whether a value in the OrdersDay column is empty or null. If so, the value '0' is written to the column; otherwise, the existing value of the column ($col) is kept.

Transformation Name Edit with formula
Parameter: Column OrdersDay
Parameter: Formula IF(OrdersDay == '' || ISNULL(OrdersDay), '0', $col)

Next, use the FILL function to fill the empty values in the OrdersTotal column. For Saturday and Sunday, where OrdersDay is now 0, the most recent non-empty total (Friday's value) is carried forward.

Transformation Name Edit with formula
Parameter: Column OrdersTotal
Parameter: Formula IF (OrdersDay == '0', FILL (OrdersTotal, -1,0),$col)
Parameter: Order by Date

Results:

Date | DayOfWeek | OrdersDay | OrdersTotal
2021-03-10 | Wednesday | 100 | 100
2021-03-11 | Thursday | 112 | 212
2021-03-12 | Friday | 320 | 532
2021-03-13 | Saturday | 0 | 532
2021-03-14 | Sunday | 0 | 532
2021-03-15 | Monday | 300 | 832
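The two fill steps can be sketched in Python as follows. This is a minimal illustration, not the product's FILL implementation: empty cells are represented as None, and the helper name fill_forward is hypothetical.

```python
def fill_forward(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Columns from the source table above, ordered by Date; None marks an empty cell.
orders_day = [100, 112, 320, None, None, 300]
orders_total = [100, 212, 532, None, None, 832]

orders_day_clean = [0 if v is None else v for v in orders_day]  # empty -> 0
orders_total_clean = fill_forward(orders_total)                 # carry Friday forward

print(orders_day_clean)    # [100, 112, 320, 0, 0, 300]
print(orders_total_clean)  # [100, 212, 532, 532, 532, 832]
```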


Calculate Rank

The RANK function enables you to create rankings in your data by returning a rank for each row within the specified group of values. Rows with tied values receive the same rank, and the ranks that the tied rows would otherwise occupy are skipped. For example, if three rows tie at rank 3, all three rows receive rank 3 and the next row receives rank 6.

The DENSERANK function also generates a ranked order of values within a group, but it does not skip ranks after tied values. For example, if two rows are listed as rank 2, then the next row receives rank 3.

Source:

The following dataset contains total sales by quarter. You can use the RANK and DENSERANK functions to rank the quarters by sales.

Year | Quarter | Sales
2018 | 1 | 1000
2018 | 2 | 2000
2018 | 3 | 3000
2018 | 4 | 2000
2019 | 1 | 1000
2019 | 2 | 500
2019 | 3 | 9000
2019 | 4 | 3000
2020 | 1 | 500
2020 | 2 | 500
2020 | 3 | 200
2020 | 4 | 400

Transformation:

RANK:

Transformation Name New formula
Parameter: Formula type Multiple row formula
Parameter: Formula RANK()
Parameter: Sort rows by Sales
Parameter: New column name SalesRank

DENSERANK:

Transformation Name New formula
Parameter: Formula type Multiple row formula
Parameter: Formula DENSERANK()
Parameter: Sort rows by Sales
Parameter: New column name SalesDenseRank

Results:

For the RANK function, when multiple rows share the same rank, the next rank is not consecutive, whereas for the DENSERANK function, the next rank is consecutive.

Year | Quarter | Sales | SalesDenseRank | SalesRank
2020 | 3 | 200 | 1 | 1
2020 | 4 | 400 | 2 | 2
2020 | 2 | 500 | 3 | 3
2020 | 1 | 500 | 3 | 3
2019 | 2 | 500 | 3 | 3
2019 | 1 | 1000 | 4 | 6
2018 | 1 | 1000 | 4 | 6
2018 | 4 | 2000 | 5 | 8
2018 | 2 | 2000 | 5 | 8
2019 | 4 | 3000 | 6 | 10
2018 | 3 | 3000 | 6 | 10
2019 | 3 | 9000 | 7 | 12
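The difference between the two rankings can be sketched in a few lines of Python. The function names rank and dense_rank are hypothetical helpers for illustration and are not the product's functions. Both sort the values in ascending order, as in the results above.

```python
def rank(values):
    """Ties share the position of their first occurrence; later ranks skip ahead."""
    ordered = sorted(values)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    return [first_position[v] for v in ordered]

def dense_rank(values):
    """Ties share a rank, and the next distinct value gets the next integer."""
    ordered = sorted(values)
    distinct = {v: i for i, v in enumerate(sorted(set(ordered)), start=1)}
    return [distinct[v] for v in ordered]

# Sales column from the source table above.
sales = [1000, 2000, 3000, 2000, 1000, 500, 9000, 3000, 500, 500, 200, 400]

print(rank(sales))        # [1, 2, 3, 3, 3, 6, 6, 8, 8, 10, 10, 12]
print(dense_rank(sales))  # [1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7]
```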


Calculate Rolling Functions

Rolling calculations enable you to compute a function over a changing set of rows. Rolling calculations are useful for computing the current state of a measure within your data. 

For example, using the quarterly sales data from the previous section, you can compute the rolling sum and rolling average of sales.

Source:

From the following dataset, you can compute rolling calculations such as ROLLINGSUM, ROLLINGAVERAGE, ROLLINGMAX, and ROLLINGMIN.

Year | Quarter | Sales
2018 | 1 | 1000
2018 | 2 | 2000
2018 | 3 | 3000
2018 | 4 | 2000
2019 | 1 | 1000
2019 | 2 | 500
2019 | 3 | 9000
2019 | 4 | 3000
2020 | 1 | 500
2020 | 2 | 500
2020 | 3 | 200
2020 | 4 | 400

Transformation:

Transformation Name Window
Parameter: Formulas ROLLINGSUM (Sales, 0,1)
Parameter: Formulas ROLLINGAVERAGE (Sales, 0,1)
Parameter: Formulas ROLLINGMAX (Sales, 0, 1)
Parameter: Formulas ROLLINGMIN (Sales, 0,1)
Parameter: Order by Sales

You can rename the required columns accordingly. 

Results:

Year | Quarter | Sales | RollingSumSales | RollingAverageSales | RollingMinSales | RollingMaxSales
2020 | 3 | 200 | 600 | 300 | 200 | 400
2020 | 4 | 400 | 900 | 450 | 400 | 500
2020 | 2 | 500 | 1000 | 500 | 500 | 500
2020 | 1 | 500 | 1000 | 500 | 500 | 500
2019 | 2 | 500 | 1500 | 750 | 500 | 1000
2019 | 1 | 1000 | 2000 | 1000 | 1000 | 1000
2018 | 1 | 1000 | 3000 | 1500 | 1000 | 2000
2018 | 4 | 2000 | 4000 | 2000 | 2000 | 2000
2018 | 2 | 2000 | 5000 | 2500 | 2000 | 3000
2019 | 4 | 3000 | 6000 | 3000 | 3000 | 3000
2018 | 3 | 3000 | 12000 | 6000 | 3000 | 9000
2019 | 3 | 9000 | 9000 | 9000 | 9000 | 9000
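All four rolling columns apply a different function to the same window of rows. The following Python sketch makes that explicit with a single hypothetical helper, rolling, that takes the function as a parameter. It assumes a window of the current row plus one following row, matching the (Sales, 0, 1) arguments in the transformation.

```python
def rolling(values, func, before, after):
    """Apply `func` to each row's window of neighbouring values."""
    return [
        func(values[max(0, i - before): i + after + 1])
        for i in range(len(values))
    ]

def mean(window):
    return sum(window) / len(window)

# Sales column from the source table, sorted to match Order by Sales.
sales = sorted([1000, 2000, 3000, 2000, 1000, 500, 9000, 3000, 500, 500, 200, 400])

rolling_sum = rolling(sales, sum, 0, 1)
rolling_avg = rolling(sales, mean, 0, 1)
rolling_min = rolling(sales, min, 0, 1)
rolling_max = rolling(sales, max, 0, 1)

print(rolling_sum)  # [600, 900, 1000, 1000, 1500, 2000, 3000, 4000, 5000, 6000, 12000, 9000]
print(rolling_max)  # [400, 500, 500, 500, 1000, 1000, 2000, 2000, 3000, 3000, 9000, 9000]
```

The last row has no following row, so its window contains only itself and all four results equal 9000.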


Rolling date functions

The rolling date functions enable you to perform date calculations over rows before or after the current row within the specified column. For example, when dealing with business calendars, you might want to know if the date falls on a holiday or weekend; based on that, you can roll the date forward or backward according to the business calendar.

Source:

The following example dataset shows the order date and order quantity for each product group. You are interested in finding the rolling minimum and maximum dates across the orders, as well as the rolling mode date. You can use the ROLLINGMINDATE, ROLLINGMAXDATE, and ROLLINGMODEDATE functions.

Order_date | Order_quantity | Product_Group
2021-04-14 | 750 | PG001
2021-07-13 | 1500 | PG001
2021-08-31 | 355 | PG002
2021-02-16 | 2000 | PG002
2021-05-13 | 867 | PG002
2021-06-18 | 1010 | PG002
2021-11-15 | 909 | PG003
2021-10-16 | 200 | PG003
2021-09-09 | 200 | PG004
2021-01-01 | 900 | PG004
2021-12-07 | 707 | PG004

Transformation:

Transformation Name Window
Parameter: Formulas ROLLINGMAXDATE (Order_date, 0,1)
Parameter: Formulas ROLLINGMINDATE (Order_date, 0, 1)
Parameter: Formulas ROLLINGMODEDATE (Order_date, 0,1)
Parameter: Order by Order_date

Results:

Order_date | Order_quantity | Product_Group | RollingMaxdate | RollingMindate | RollingModedate
2021-01-01 | 900 | PG004 | 2021-02-16 | 2021-01-01 | 2021-01-01
2021-02-16 | 2000 | PG002 | 2021-04-14 | 2021-02-16 | 2021-02-16
2021-04-14 | 750 | PG001 | 2021-05-13 | 2021-04-14 | 2021-04-14
2021-05-13 | 867 | PG002 | 2021-06-18 | 2021-05-13 | 2021-05-13
2021-06-18 | 1010 | PG002 | 2021-07-13 | 2021-06-18 | 2021-06-18
2021-07-13 | 1500 | PG001 | 2021-08-31 | 2021-07-13 | 2021-07-13
2021-08-31 | 355 | PG002 | 2021-09-09 | 2021-08-31 | 2021-08-31
2021-09-09 | 200 | PG004 | 2021-10-16 | 2021-09-09 | 2021-09-09
2021-10-16 | 200 | PG003 | 2021-11-15 | 2021-10-16 | 2021-10-16
2021-11-15 | 909 | PG003 | 2021-12-07 | 2021-11-15 | 2021-11-15
2021-12-07 | 707 | PG004 | 2021-12-07 | 2021-12-07 | 2021-12-07
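The rolling date results can be approximated with the following Python sketch. The function name rolling_dates is hypothetical. The tie-breaking rule for the mode (the earliest date wins when all dates in the window occur equally often) is an assumption that happens to match the results above; the product's actual rule is not stated here.

```python
from collections import Counter
from datetime import date

def rolling_dates(dates, before, after):
    """Return (current, rolling max, rolling min, rolling mode) per row, ordered by date."""
    ordered = sorted(dates)
    rows = []
    for i, current in enumerate(ordered):
        window = ordered[max(0, i - before): i + after + 1]
        counts = Counter(window)
        top = max(counts.values())
        # Assumption: when several dates are equally frequent, take the earliest.
        mode = min(d for d, c in counts.items() if c == top)
        rows.append((current, max(window), min(window), mode))
    return rows

# Order_date column from the source table above.
order_dates = [date.fromisoformat(s) for s in [
    "2021-04-14", "2021-07-13", "2021-08-31", "2021-02-16", "2021-05-13",
    "2021-06-18", "2021-11-15", "2021-10-16", "2021-09-09", "2021-01-01",
    "2021-12-07",
]]

rows = rolling_dates(order_dates, 0, 1)
for current, max_date, min_date, mode_date in rows:
    print(current, max_date, min_date, mode_date)
```

With a window of the current row plus one following row, the rolling maximum is always the next date and the rolling minimum is always the current date, except on the last row, where the window contains only that row.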
