Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r092

D toc

Excerpt

When you edit your dataset's recipe, the Transformer page is opened, where you begin your wrangling tasks on a sample of the dataset. Through this interface, you build your transformation recipe and see the results in real-time as applied to the sample. When you are satisfied with what you see, you can execute a job against the entire dataset.

Goal

Your data transformation is complete when you have done the following:

  • Cleansed your data of invalid, missing, or inaccurate values
  • Enhanced your dataset as needed with data from other datasets
  • Modified your dataset to constrain its values to meet the target schema
  • Executed job against the entire dataset
  • Exported the results from your dataset and recipe for use in downstream systems
Tip

Tip: Before you begin transforming, you should know the target schema that your transformed data must match. A schema is the set of columns and their data types, which define the constraints of your dataset. You can import this target schema as a dataset and use it during recipe development to serve as a mapping for your transformations.

Recommended Methods for Building Recipes

D s product
rtrue
 supports the following methods for building recipes in the Transformer page. These methods are listed in order of ease of use:

  1. Select something. When you select elements of data in the Transformer page, you are prompted with a set of suggestions for steps that you can take on the selection or patterns matching the selection. You can select columns or one or more values within columns.

    Tip

    Tip: The easiest method for building recipes is to select items in the application. Over time, the application learns from your selections and prompts you with suggestions based on your previous use.

  2. Toolbar and column menus: In the Transformer page, you can access pre-configured transformations through the Transformer toolbar or through the column context menus. 

    Tip

    Tip: Use the toolbar for global transformations across your dataset and the column menu for transformations on one or more selected columns.

    1. When a Transformer toolbar item is selected, the Transform Builder is pre-populated with settings and values to get you started. As needed you can modify the step to meet your needs. 
    2. The column menus contain the most common transformations for individual or multiple columns. Often, no additional configuration is required. 
    3. Select multiple columns. Continue selecting columns to be prompted with a different set of suggestions applicable to all of them.
  3. Search and browse for transformations. Using the Search panel and the Transform Builder, you can rapidly assemble recipe steps through a simple, menu-driven interface. When you choose to add a step, you search for your preferred transformation in the Search panel. When one is selected, the Transform Builder is pre-populated from your selection in the Search panel.

    Tip

    Tip: Use the Transform Builder for performing modifications to the transformation you selected from the Search panel or a suggestion card.

Sample

Loading very large datasets in

D s product
 can overload your browser or otherwise impact performance, so the application is designed to work on a sample of data. After you have finished your recipe working on a sample, you execute the recipe across the entire dataset.

The default sample is the first set of rows of source data in the dataset, the number of which is determined by the platform. For smaller datasets, the entire dataset can be used as your sample. In the Transformer page, it's listed as Initial Data in the upper-left corner.

In some cases, the default sample might be inadequate or of the wrong type. To generate a new sample, click the name of the sample in the upper-left corner.

Info

NOTE: Collecting new samples requires system resources and storage. In some environments, collecting samples incurs monetary cost.

Tip

Tip: You should consider collecting a new sample if you have included a step to change the number of rows in your dataset or have otherwise permanently modified data (keep, delete, lookup, join, or pivot operations). If you subsequently remove the step that made the modification, the generated sample is no longer valid and is removed. This process limits unnecessary growth in data samples.

On the right side of the Transformer page, you can launch a new sampling job on your dataset from the Samples panel. You may have to open it first.

Cleanse

Data cleansing tasks address issues in data quality, which can be broadly categorized as follows:

  • Consistency. Values that describe the same thing should agree with each other. For example, numeric values should have the same precision. String values should be consistently structured to mean the same thing. 
  • Validity. Values should be constrained to the requirements of each field's data type. For example, a DateOfSale field should be a valid date. 
  • Reliability. Values in the same field in different records should mean the same thing. For example, the value 15 in the Temperature field of two different records should not mean Centigrade in one record and Fahrenheit in the other record. 

When data is initially imported, it can contain multiple columns, rows, or specific values that you don't need for your final output. Specifically, this phase can involve the following basic activities:

  • Remove unused columns
  • Address missing and mismatched data
  • Change data types
  • Improve consistency, validity, and reliability of the data

D s minrows

First recipe steps:

When a dataset sample is first loaded into the Transformer page,

D s product
 attempts to split out the unstructured data to form regular, tabular data. If your data appears to contain a header row, it can be used for the titles of the columns.

D caption
typefigure
Transformer page

In the above image, some initial parsing steps have been applied to structure the data into tabular format, but these steps are not added as formal parts of the recipe. They are hidden from view in the recipe. By default, these steps are automatically added to the recipe when you permit the application to detect the structure of the imported data.

The data resulting from these initial transforms is displayed in the data grid

  • Your recipe is displayed in the Recipe panel on the right side. You might have to open this panel to see it. 
  • When you select items in the data grid, suggestion cards are displayed for you to begin building transform steps. 
  • These suggestions can be modified to build more complex or subtle commands in the Transform Builder. 
  • Don't forget to use the Transformer toolbar, which pre-configures the Transform Builder with the configuration required for a useful transformation. 
  • You can use the column menu to apply changes to an individual column. 

Use a row to create headers:

In most cases, the names of your columns are inferred from the first row of the data in the dataset. If you need to specify a different row, please complete the following:

  1. Click the Search icon in the menu bar.

  2. In the Search panel textbox, type: header

  3. The transformation is displayed in the Transform Builder. Specify the following properties:
    D trans
    Typestep
    p01NameOption
    p01ValueUse row as header
    p02NameRow
    p02Value1
    SearchTermRename columns
  4. If you need to specify a different row to use, you can specify a specific row number to use in the Row textbox.
  5. To add this or any transform in development to your recipe, click Add. This button is disabled if the step is invalid. 

Generate metadata:

On the left side of the data grid, you might notice a set of black dots. If you hover over one of these, the original row number from the source data is listed. Since the data transformation process can change the number of rows or their order, you might want to retain the original order of the rows.

Tip

Tip: Some operations, such as unions and joins, can invalidate source row number information. To capture this data into your dataset, it's best to add this transformation early in your recipe.


To retain the original row numbers in a column called, rowId, please complete the following:

D trans
p03ValuerowId
Typeref
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value$sourcerownumber
p03NameNew column name
SearchTermNew formula

You can use a similar transformation to generate the full path and filename to file-based sources:

 

D trans
p03Valuefilepath
Typeref
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value$filepath
p03NameNew column name
SearchTermNew formula

Delete unused columns:

Your data might contain columns that are not of use to you, so it's in your interest to remove them to simplify the dataset. To delete a column, click the caret next to the column's title and select Delete.

Tip

Tip: If you are unsure of whether to delete the column, you can use the same caret menu to hide the column for now. Hidden columns do appear in the output.

Tip

Tip: You can also delete multiple columns, including ranges of columns.

Check column data types:

When a dataset is imported, 

D s product
 attempts to identify the data type of the column from the first set of rows in the column. At times, however, type inference can be incorrect. 

Tip

Tip: Before you start performing transformations on your data based on mismatched values, you should check the data type for these columns to ensure that they are correct.

Display only columns of interest:

You can choose which columns you want to display in the data grid, which can be useful to narrow your focus to problematic areas.

In the Status bar at the bottom of the screen, click the Eye icon. 

Review data quality:

After you have removed unused data, you can examine the quality of data within each column just below the column title. 

D caption
typefigure
Column header with data quality bar

The horizontal bar, known, as the data quality bar, identifies the quality of the data in the column by the following colors:

ColorDescription
greenThese values are valid for the column data type.
redThese values do not match those of the column type.
grayThere are no values for the column in these rows.
Tip

Tip: When you select values in the data quality bar, those values are highlighted in the sample rows, and suggestions are displayed at the bottom of the screen in the suggestion cards to address the selected rows.

Suggestion Cards:

Based on your selections and its knowledge of common data patterns, 

D s product
 prompts you with suggested transformations. You can then select pre-configured transformations in the right panel of the Transformer page to quickly add steps.

Tip

Tip: Where possible, you should try to create your transforms by selecting data and then selecting the appropriate suggestion card. In some cases, you might need to modify the details of the recipe.

In the following example, the missing values in the SUBSCRIBER_AGE column have been selected, and a set of suggestion cards is displayed. 

D caption
typefigure
Selecting missing values
Tip

Tip: When previewing a recipe step, you can use the checkboxes in the status bar to display only affected rows, columns, or both, which helps you to assess the effects of your step.

Depending on the nature of the data, you might want to keep, delete, or modify the values. Since the data is missing, the Delete card has been selected.

  • To accept this suggest, click Add.
  • You can modify the step if needed. An example is provided later.

For more background information, see Overview of Predictive Transformation.

Change data types:

If a column contains a high concentration of mismatched data (red), the column might have been identified as the wrong data type. For example, your dataset includes internal identifiers that are primarily numeric data (e.g. 10000022) but have occasional alphabetical characters in some values (e.g.  1000002A). The column for this data might be typed for integer values, when it should be treated as string values. 

Tip

Tip:

D s product
maintains statistical information and enable some transformation steps based upon data type.

  1. To change a column's data type, click the icon to the left of the column title. 
  2. Select the new data type. 
  3. Review the mismatched values for the column to verify that their count has dropped.

Explore column details:

As needed, you can explore details about the column's data, including statistical information such as outliers. From the caret drop-down next to a column name, select Column Details

Review histograms:

Include Page
Column Histogram Basics
Column Histogram Basics

Modify

After you have performed initial cleansing of your data, you might need to perform modifications to the data to properly format it for the target system, specify the appropriate level of aggregation, or perform some other modification. 

In the following example, the improperly capitalized word BALTIMORE has been selected, so that you can change it to its propercase spelling (Baltimore). Those rows are highlighted in the row data, and a set of suggestions for how to fix has been provided in the Selection Details panel.

D caption
typefigure
Selecting values to modify

Depending on the nature of your data, you might want to keep or change the values, or you can remove the problematic rows altogether. 

Tip

Tip: When you select one of the suggestion cards, the implied changes are previewed in the Transformer page, so you can see the effects of the change. This previewing capability enables you to review and tweak your changes before they are formally applied. You can always remove a transform step if it is incorrect or even re-run the recipe to generate a corrected set of results, since source data is unchanged.

In this case, select the Replace transformation. However, there are a couple of minor issues with the provided suggestion.

  • Since the platform has no idea about the meaning of the selection, it might initially suggest removing the text altogether. In this case, you want to change the spelling.
  • In the transformation, the Find parameter value contains the pattern used to identify the selection. In this case, it is selecting all values that are capitalized. For now, you only want to fix BALTIMORE.

So, some aspects of this transform must be changed. Click Edit.

Transform Builder:

When you modify a transform step, you can make changes in the Transform Builder, which is a simple, menu-driven interface for modifying your transformations:

D caption
typefigure
Modifying steps in the Transform Builder

In the Transform Builder, you can replace the pattern with the specific string to locate: BALTIMORE. The new value, which is currently blank, can be populated with the replacement value: BaltimoreClick Add.

The step is added to the recipe and automatically applied to the data sample displayed in the Transformer page. You can continue to add new steps through the Transform Builder.

Enrichment

Before you deliver your data to the target system, you might need to enhance or augment the dataset with new columns or values from other datasets.  

Union datasets:

You can append a dataset of identical structure to your currently loaded one to expand the data volume. For example, you can string together daily log data to build weeks of log information using the Union page. 

Join datasets:

You can also join together two or more datasets based on a common set of values. For example, you are using raw sales data to build a sales commission dataset:

  • Your sales transaction dataset contains a column for the salesman's identifier, which indicates the employee who should receive the commission. 
  • You might want to join your sales transaction dataset to the employee dataset, which provides information on the employee's name and commission rate by the internal identifier. 
  • If there is no corresponding record in the employee dataset, a commission is not rewarded, and the sales transaction record should not be present in the commission dataset.

This commission dataset is created by performing an inner join between the sales transaction dataset and the employee dataset. In the Search panel, enter join to join data.

Lookup values:

In some cases, you might need to include or replace values in your dataset with other columns from another dataset. For example, transactional data can reference product and customer by internal identifiers. You can create lookups into your master data set to retrieve user-friendly versions of customer and product IDs.

Info

NOTE: The reference data that you are using for lookups must be loaded as a dataset into

D s product
first.

To perform a lookup for a column of values, click the caret drop-down next to the column title and select Lookup...

Sampling

The data that you see in the Transformer page is a sample of your entire dataset. 

  • If your dataset is small enough, the sample is the entire dataset. 
  • For larger datasets, 
    D s product
     auto-generates an initial data sample from the first rows of your dataset.

For larger datasets, you must learn how to generate new samples, which can provide different perspectives on your data and, in complex flows, enhance performance.

Tip

Tip: Sampling is an important concept in

D s product
.

Profile

As part of the transformation process, you can generate and review visual profiles of individual columns and your entire dataset. These interactive profiles can be very helpful in identifying anomalies, outliers, and other issues with your data. 

D s also
inCQLtrue
label((label = "transform_ui") OR (label = "transform_tasks"))