supports computation of counts of rows, columns, and ad-hoc values within your data, so that you can make assessments of the quality, consistency, and statistical validity of your data.
Any computed counts that you see in the Transformer page are computed from the displayed sample. |
These computed counts reflect the entire dataset, only if the data grid is displaying the full dataset:
Data grid sample is the full dataset. |
When the job is executed, however, any computations of counts are applied across the entire dataset.
When you run a job, you can enable the profiling of the job results, which renders a visual profile and some statistics on the dataset. This profile is available for review through the application. For more information, see Overview of Visual Profiling.
In the status bar at the bottom of the data grid, you can review the current count of rows and columns in the displayed sample.
Tip: The row and column counts in the status bar may be useful for comparing the changes to these metrics between steps. For example, you can click step 2 in your recipe and then review these metrics. When you click step 3, these metrics may change. |
Row counts: Depending on your method of sampling, the row counts may change. For more information, see Overview of Sampling.
Column counts: By default, all columns in the panel are displayed. Column counts should change only if you delete or hide them. For more information on toggling display of columns, see Visible Columns Panel.
For more information, see Data Grid Panel.
You can use the following functions to identify and compute the row counts in your dataset.
Function Name | Description | ||
---|---|---|---|
COUNT Function |
| ||
ROWNUMBER Function | |||
SOURCEROWNUMBER Function |
|
These transformations allow you to compute counts of literals or patterns in a cell's values. Then, you can perform calculations on this new column of values to compute metrics across the dataset.
The following example computes the number of references in the tweet
column for My Company
:
Suppose, however, that the company has multiple ways in which it is reference. It could be:
My Company
My Co
My Company, Inc.
You can modify the above transformation to use a to capture these variations:
If needed, you can use the following to add up all of the counts in tweetCompanyReferences
to determine the total number.
NOTE: Keep in mind that this sum reflects only the sum of values in the sample in the data grid. When you run a job containing this calculation, it is applied across all rows in the dataset. |
You can also collect counts of values between two patterns within a cell's value. In this manner, you can analyze a more constrained substring of the cell value.
The following transformation calculates the URLs in each row of the msgText
column, assuming that the URL begins with http:// or https:// and ends with .com or .net:
You can perform calculations based on groups that you define as part of the calculation. These groupings, called aggregations, are powerful tools for delivering insightful analysis on your data.
In the following example, several aggregated computations, including the COUNT
function are performed on transactional data, which is grouped by region (regionId
) and product (prodId
):
NOTE: The above calculation inserts two new columns into the dataset. Alternatively, you can choose to do a full replacement of the dataset with these aggregated counts. For more information, see Pivot Data. |
You can use a set of functions that count occurrences, based on conditions. In the following list of functions:
COUNTA
counts values that are non-null. COUNTIF
tabulates counts provided a specified condition is met.Function Name | Description |
---|---|
COUNTIF Function | |
COUNTA Function | |
COUNTAIF Function | |
COUNTDISTINCT Function | |
COUNTDISTINCTIF Function |
The following transformation counts the rows where the length of msgText
is longer than 140 characters, grouped by userId
: