Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

D toc

Excerpt

D s product
rtrue
 supports computation of counts of rows, columns, and ad-hoc values within your data, so that you can make assessments of the quality, consistency, and statistical validity of your data.

Important Note on Counts

Warning

Any computed counts that you see in the Transformer page are computed from the displayed sample.

These computed counts reflect the entire dataset, only if the data grid is displaying the full dataset:

D caption
typefigure
Data grid sample is the full dataset. 

When the job is executed, however, any computations of counts are applied across the entire dataset.

Visual Profiling

When you run a job, you can enable the profiling of the job results, which renders a visual profile and some statistics on the dataset. This profile is available for review through the application. For more information, see Overview of Visual Profiling.

Row and Column Counts

In the status bar at the bottom of the data grid, you can review the current count of rows and columns in the displayed sample. 

Tip

Tip: The row and column counts in the status bar may be useful for comparing the changes to these metrics between steps. For example, you can click step 2 in your recipe and then review these metrics. When you click step 3, these metrics may change.

Row counts: Depending on your method of sampling, the row counts may change. For more information, see Overview of Sampling.

Column counts: By default, all columns in the panel are displayed. Column counts should change only if you delete or hide them. 

Computed row counts

You can use the following functions to identify and compute the row counts in your dataset. 

Function NameDescription
COUNT Function

D excerpt include
pageCOUNT Function
nopaneltrue

Tip

Tip Typically, this function is used as part of an aggregation, in which rows are grouped according to shared values in other columns. This function can also be applied without grouping, which is called a flat aggregate. More information on how to apply aggregated counts is below.


ROWNUMBER Function

D excerpt include
pageROWNUMBER Function
nopaneltrue

SOURCEROWNUMBER Function 

D excerpt include
pageSOURCEROWNUMBER Function
nopaneltrue

Info

NOTE: This function may fail to return results if the original source row information is not available. For example, if you have performed a join between multiple datasets, the source row number information cannot be computed. Similarly, if you compute this function and then perform a join, the results may not make sense.


Tip

Tip: You can pair this function later with the MIN or MAX functions to compute the highest and lowest row number information.


Count by Pattern

These transformations allow you to compute counts of literals or patterns in a cell's values. Then, you can perform calculations on this new column of values to compute metrics across the dataset.

Count pattern or text

The following example computes the number of references in the tweet column for My Company:

D trans
p03ValuetweetCompanyReferences
Typestep
p01NameOption
p01ValueText or pattern
p02NameText or pattern to count
p02Value'My Company'
p03NameNew column name
SearchTermCount matches

Suppose, however, that the company has multiple ways in which it is reference. It could be:

  • My Company
  • My Co
  • My Company, Inc.

You can modify the above transformation to use a 

D s lang
itempattern
 to capture these variations:

D trans
p03ValuetweetCompanyReferences
Typestep
p01NameOption
p01ValueText or pattern
p02NameText or pattern to count
p02Value`(My Company|My Co|My Company, Inc.)`
p03NameNew column name
SearchTermCount matches

If needed, you can use the following to add up all of the counts in tweetCompanyReferences to determine the total number. 

Info

NOTE: Keep in mind that this sum reflects only the sum of values in the sample in the data grid. When you run a job containing this calculation, it is applied across all rows in the dataset.

D trans
p03Valuesum_tweetCompanyReferences
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueSUM(tweetCompanyReferences)
p03NameNew column name
SearchTermNew formula

Count between patterns

You can also collect counts of values between two patterns within a cell's value. In this manner, you can analyze a more constrained substring of the cell value.

The following transformation calculates the URLs in each row of the msgText column, assuming that the URL begins with http:// or https:// and ends with .com or .net:

D trans
p03ValueSelected
p06NameIgnore case
p01NameOption
p06ValueSelected
p03NameInclude as part of the match
p07ValuecountURLs
p04Value`(\.com|\.net)`
SearchTermCount matches
p07NameNew column name
Typestep
p05NameIncludes as part of the match
p01ValueBetween two delimiters
p02NameStarting pattern
p02Value`(http\:\/\/|https\:\/\/)`
p05ValueSelected
p04NameEnding pattern

Count Functions

Aggregated counts

You can perform calculations based on groups that you define as part of the calculation. These groupings, called aggregations, are powerful tools for delivering insightful analysis on your data. 

In the following example, several aggregated computations, including the COUNT function are performed on transactional data, which is grouped by region (regionId) and product (prodId):

D trans
p03ValueSUM(sales)
Typestep
p05NameType
p01NameGroup by 1
p01ValueregionId
p02NameGroup by 2
p02ValueprodId
p05ValueGroup by as new column(s)
p03NameValues 1
p04ValueCOUNT()
p04NameValues 2
SearchTermGroup by

Info

NOTE: The above calculation inserts two new columns into the dataset. Alternatively, you can choose to do a full replacement of the dataset with these aggregated counts. For more information, see Pivot Data.

Conditional count functions

You can use a set of functions that count occurrences, based on conditions. In the following list of functions:

  • Some of the conditions are implicit in the function itself. For example, COUNTA counts values that are non-null. 
  • Some conditions are specified as part of the function. For example, COUNTIF tabulates counts provided a specified condition is met.
Function NameDescription
COUNTIF Function

D excerpt include
pageCOUNTIF Function
nopaneltrue

COUNTA Function

D excerpt include
pageCOUNTA Function
nopaneltrue

COUNTAIF Function

D excerpt include
pageCOUNTAIF Function
nopaneltrue

COUNTDISTINCT Function

D excerpt include
pageCOUNTDISTINCT Function
nopaneltrue

COUNTDISTINCTIF Function

D excerpt include
pageCOUNTDISTINCTIF Function
nopaneltrue

The following transformation counts the rows where the length of msgText is longer than 140 characters, grouped by userId:

D trans
p03ValueGroup by as new column(s)
Typestep
p01NameGroup by 1
p01ValueuserId
p02NameValues 1
p02ValueCOUNTIF(LEN(msgText)>140)
p03NameType
SearchTermGroup by

D s also
inCQLtrue
label((label = "discovery_tasks") OR (label = "transformation_ui") OR (label = "count"))