Overview

In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime. In the parameterization enables you to manage executions of the same recipe steps across serialized datasets the paths to which can be managed via variable. 

Example

Suppose you have imported data from a file system source, which has the following source path to weekly transaction logs:

<file_system>:///source/transactions/2018/01/29/transactions.csv

In the above, you can infer a date pattern in the form of 2018/01/29, which suggests that there may be a pattern of paths to transaction files.  Based on the pattern, it'd be useful to be able to do the following:

 

In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:

 

<file_system>:///source/transactions/YYYY/MM/DD/transactions.csv

The above example implements a Datetime parameter on the path values, creating a dataset with parameters

Parameter Types

You can use the following types of parameters to create datasets with parameters:

For more information, see Create Dataset with Parameters.

Mismatched Schemas

 expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

If schemas do not match:

Tips:

Limitations

Creating Dataset with Parameters

From file system

When browsing for data on your default storage layer, you can choose to parameterize elements of the path. Through the Import Data page, you can select elements of the path, apply one of the supported parameter types and then create the dataset with parameters.

NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.

For more information, see Create Dataset with Parameters.

From relational source

If you are creating a dataset from a relational source, you can apply parameters to the custom SQL that pulls the data from the source. 

NOTE: Within a SELECT statement, you cannot add parameters between the SELECT and FROM keywords.

For more information, see Create Dataset with SQL.

Matching parameters

When a dataset with parameters is imported for use, all matching source files or tables are automatically unioned together. 

NOTE: Sources for a dataset with parameters should have matching schemas.

The initial sample that is loaded in the Transformer page is drawn from the first matching source file or table. If the initial sample is larger than the first file, rows may be pulled from other source objects.

Managing Datasets with Parameters

Datasets with parameters in your flows

After you have imported a dataset with parameters into your flow:

For more information, see Flow View Page.

Tip: You can review details on the parameters applied to your dataset. See Dataset Details Page.

Sampling from datasets with parameters

When a dataset with parameters is first loaded into the Transformer page, the initial sample is loaded from the first found match in the range of matching datasets. 

NOTE: To work with data that appears in files other than the first match in the dataset, you must create a new sample in the Transformer page. Any sampling operations performed within the Transformer page sample across all matching sources of the dataset.

For more information, see Overview of Sampling.

Scheduling for datasets with parameters

Schedules can been applied to a dataset with parameters. 

NOTE: When resolving date range rules for scheduling a dataset with parameters, the time of actual execution is used. So, if the schedule is late to execute for some reason, the actual start time, instead of the scheduled time, is used.

For more information, see Flow View Page.

Sharing for datasets with parameters

NOTE: When a flow containing parameters is copied, any changes to parameter values in the copied flow also affect parameters in the original flow.

For more information, see Overview of Sharing.

Housekeeping

Since the  never touches the source data, after a source that is matched for a dataset with parameters has been executed, you should consider removing it from the source system or adjusting any applicable ranges on the matching parameters. Otherwise, outdated data may continue to factor into operations on the dataset with parameters.

NOTE: Housekeeping of data on a source file system or relational source is outside the scope of the . Please contact your IT staff to assist as needed.

Runtime Parameter Overrides

When you choose to run a job on a dataset with parameters from the user interface, any variables are specified using their default values. 

Through the Run Job page, you can specify different values to apply to variables for the job. 

NOTE: Values applied through the Run Job page to variables override the default values for the current execution of the job. Default values for the next job are not modified.

For more information, see Run Job Page.

In the Job Results page, click View Parameters to view the parameter names and values that were used as part of the job, including the list of matching datasets. See Job Results Page.

Operationalization with Parameters

CLI

Through the CLI, you can execute jobs against datasets with parameters. When you download the CLI bundle from the application, the path in the sources file references the first matching dataset among all of the matches. You cannot apply any variable overrides at job runtime.

Tip: External to the platform, you can create a script to manage parameterization. For an example of how to do it, please see CLI Example - Parameterize Job Runs.

For more information, see CLI for Jobs.

Configuration

Disable

By default, parameterization is enabled. For more information on disabling this feature, see Miscellaneous Configuration.