Parameterization enables you to apply dynamic values to the data that you import and to the data that you generate as part of job execution.
These parameters can be defined by timestamp, patterns, wildcards, or variable values that you specify at runtime.
Project owners or workspace administrators can define parameters that apply across the project or workspace environment. These parameters can be referenced by any user in the environment, but only a user with admin access can define, modify, or delete these parameters.
Tip: Environment parameters are a useful means of ensuring that all users of the project or workspace share common reference values to buckets, output locations, and more. Environment parameter definitions can be exported and then imported into other projects or workspaces to ensure commonality across the enterprise. The values assigned to environment parameters can be modified after they have been imported into a new project or workspace.
NOTE: You must have admin access to the project or workspace to define environment parameters.
In this example, you have three environments, each of which has a different set of resources, although the only difference between them is the name of the S3 bucket in which they are stored:
|Environment Name||S3 Bucket Name|
In your Dev workspace, you can create an environment parameter called the following:
The default value for this parameter is set to:
When creating imported datasets in this workspace, you insert the environment parameter for the source bucket for each one.
For your Test and Prod environments:
When you later export your flows from Dev and move them to Test and Prod, the imported flows automatically connect to the correct bucket for the target environment, since the bucket name is referenced by an environment parameter.
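As a rough illustration of the behavior described above, the following Python sketch resolves one dataset path against per-environment parameter values. The parameter name `source_bucket`, the bucket names, the file path, and the `$name` placeholder syntax (Python's `string.Template`) are all hypothetical; they are not the product's own names or syntax.

```python
from string import Template

# Hypothetical environment parameter values, one set per environment.
ENV_PARAMS = {
    "Dev": {"source_bucket": "example-dev"},
    "Test": {"source_bucket": "example-test"},
    "Prod": {"source_bucket": "example-prod"},
}

# Hypothetical dataset path that references the environment parameter.
DATASET_PATH = Template("s3://$source_bucket/transactions/orders.csv")


def resolve_path(environment: str) -> str:
    """Return the dataset path with the environment's bucket name applied."""
    return DATASET_PATH.substitute(ENV_PARAMS[environment])


for name in ENV_PARAMS:
    print(name, "->", resolve_path(name))
```

Because the path itself never changes, moving the flow between environments only requires that each environment define its own value for the parameter.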
You can export environment parameters from one environment and import them to another. For example, you may be building your flows in a Dev workspace before they are exported and imported into a Prod workspace. If your flows make use of environment parameters from the Dev space, you may want to export the parameters and their values from the Dev workspace for migration to the Prod workspace.
NOTE: As part of the import process, you must reconcile name conflicts between imported environment parameters and the parameters that already exist in the workspace.
For more information, see Manage Environment Parameters.
In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.
Suppose you have imported data from a file system source, which has the following source path to weekly transactions:
In the above, you can infer a date pattern in the form of 2018/01/29, which suggests that there may be a pattern of paths to transaction files. Based on the pattern, it would be useful to be able to do the following:
In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:
The above example implements a Datetime parameter on the path values, creating a dataset with parameters.
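To make the idea of a Datetime parameter concrete, here is a minimal Python sketch that expands a date pattern into one path per weekly refresh. Only the yyyy/MM/dd date form comes from the example above; the path template, the file name, and the function name `weekly_paths` are assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical path template; only the yyyy/MM/dd form comes from the example.
PATH_TEMPLATE = "/source/transactions/{stamp}/transactions.csv"


def weekly_paths(start: date, weeks: int) -> list:
    """Build one path per weekly refresh, starting at `start`."""
    paths = []
    for i in range(weeks):
        day = start + timedelta(weeks=i)
        paths.append(PATH_TEMPLATE.format(stamp=day.strftime("%Y/%m/%d")))
    return paths


print(weekly_paths(date(2018, 1, 29), 3))
```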
You can use the following types of parameters to create datasets with parameters:
For more information, see Create Dataset with Parameters.
The source files or tables for a dataset with parameters should have consistent structures. Since the sources are parsed with the same recipe or recipes, variations in schema could cause breakages in the recipe or initial parsing steps, which are applied based on the schema of the first matching source.
NOTE: All datasets imported through a single parameter are expected to have exactly matching schemas. For more information on variations, see Mismatched Schemas below.
Tip: If there have been changes to the schema of the sources of your dataset with parameters, you can edit the dataset and update the parameters. See Library Page.
Parameters in paths for imported datasets are rendered as regular expressions. Depending on the number of parameters and the comparative depth of them in a parameterized dataset, the process of performing all pattern checks can grow large, impacting import performance.
Tip: When specifying an imported dataset with parameters, you should attempt to be as specific as possible in your parameter definitions.
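The effect of specificity can be sketched with ordinary regular expressions in Python. The file listing and both patterns below are hypothetical, and the product's internal pattern syntax may differ; the point is only that a constrained pattern matches fewer unintended paths than a broad wildcard.

```python
import re

# Hypothetical file listing.
FILES = [
    "/data/2018/01/29/transactions.csv",
    "/data/2018/02/05/transactions.csv",
    "/data/archive/old/transactions.csv",
    "/data/2018/02/05/transactions.csv.bak",
]

# Broad: a wildcard stands in for everything between the root and the file name.
broad = re.compile(r"^/data/.*/transactions\.csv$")

# Specific: each date component is constrained to digits of a fixed width.
specific = re.compile(r"^/data/\d{4}/\d{2}/\d{2}/transactions\.csv$")

broad_matches = [f for f in FILES if broad.match(f)]
specific_matches = [f for f in FILES if specific.match(f)]

print(len(broad_matches), len(specific_matches))
```

Here the broad pattern also picks up the archived file, while the specific pattern matches only the dated directories.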
NOTE: When importing one or more Excel files as a parameterized dataset, you select worksheets to include from the first file. If there are worksheets in other Excel files that match the names of the worksheets that you selected, those worksheets are also imported. All worksheets are unioned together into a single imported dataset with parameters. Pattern-based parameters are not supported for import of Excel worksheets.
All datasets imported using a single parameter are expected to have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.
If schemas do not match:
Backreferences: The following example matches on cxc yet generates an error:
Lookahead assertions: The following example matches on a, but only when it is part of an ab pattern. It generates an error:
When browsing for data on your default storage layer, you can choose to parameterize elements of the path. Through the Import Data page, you can select elements of the path, apply one of the supported parameter types and then create the dataset with parameters.
NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.
You can choose to search hidden folders.
NOTE: Including hidden folders must be enabled by an administrator. For more information, see Dataprep Project Settings Page.
Tip: If your imported dataset is stored in a bucket, you can parameterize the bucket name, which can be useful if you are migrating flows between environments or must change the bucket at some point in the future.
For more information, see Create Dataset with Parameters.
If you are creating a dataset from a relational source, you can apply parameters to the custom SQL that pulls the data from the source.
NOTE: Avoid using parameters in places in the SQL statement that change the structure of the data. For example, within a SELECT statement, you should not add parameters between the SELECT and FROM keywords.
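One way to guard against the situation described in the note is a simple pre-check on the SQL text. The sketch below is a naive illustration only: the `${name}` placeholder syntax, the table and column names, and the function `has_parameter_in_select_list` are hypothetical, and the check does not handle subqueries or comments.

```python
import re

PLACEHOLDER = re.compile(r"\$\{\w+\}")  # hypothetical ${name} syntax


def has_parameter_in_select_list(sql: str) -> bool:
    """Return True if a placeholder appears between SELECT and FROM."""
    match = re.search(r"\bSELECT\b(.*?)\bFROM\b", sql, re.IGNORECASE | re.DOTALL)
    return bool(match and PLACEHOLDER.search(match.group(1)))


# Parameter in the WHERE clause: the shape of the result is unaffected.
safe = "SELECT id, amount FROM sales WHERE region = '${region}'"

# Parameter in the column list: the shape of the result can change.
risky = "SELECT ${columns} FROM sales"

print(has_parameter_in_select_list(safe), has_parameter_in_select_list(risky))
```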
For more information, see Create Dataset with SQL.
When a dataset with parameters is imported for use, all matching source files or tables are automatically unioned together.
NOTE: Sources for a dataset with parameters should have matching schemas.
The initial sample that is loaded in the Transformer page is drawn from the first matching source file or table. If the initial sample is larger than the first file, rows may be pulled from other source objects.
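The sampling behavior described above can be approximated in a few lines of Python. This is a sketch under stated assumptions: the source names, the rows, and the function `initial_sample` are hypothetical, and sources are assumed to be read in match order.

```python
from itertools import chain, islice

# Hypothetical sources, in match order; every row shares the same schema.
SOURCES = {
    "week1.csv": [{"id": 1}, {"id": 2}],
    "week2.csv": [{"id": 3}, {"id": 4}, {"id": 5}],
}


def initial_sample(sources: dict, size: int) -> list:
    """Take the first `size` rows of the unioned sources, in match order."""
    unioned = chain.from_iterable(sources.values())
    return list(islice(unioned, size))


# Fits within the first source: only rows from week1.csv are used.
print(initial_sample(SOURCES, 2))

# Larger than the first source: rows spill over into week2.csv.
print(initial_sample(SOURCES, 3))
```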
After you have imported a dataset with parameters into your flow:
For more information, see Flow View Page.
Tip: You can review details on the parameters applied to your dataset. See Dataset Details Page.
When a dataset with parameters is first loaded into the Transformer page, the initial sample is loaded from the first found match in the range of matching datasets. If this match is a multi-sheet Excel file, the sample is taken from the first sheet in the file.
To work with data that appears in files other than the first match in the dataset, you must create a new sample in the Transformer page. Any sampling operations performed within the Transformer page sample across all matching sources of the dataset.
If you have created a variable with your dataset, you can apply a variable value to override the default at sampling time. In this manner, you can specify sampling to occur from specific source files from your dataset with parameters.
For more information, see Overview of Sampling.
Schedules can be applied to a dataset with parameters. When resolving date range rules for scheduling a dataset with parameters, the schedule time is used.
For more information, see Add Schedule Dialog.
By default, when a flow containing parameters is copied, any changes to parameter values in the copied flow also affect parameters in the original flow. To separate these parameters, you have the following options:
NOTE: For copying flows using parameterized datasets, you should duplicate the datasets, which creates separate copies of parameters and their values in the new flow. If datasets are not copied, then parameter changes in the copied flow modify the values in the source flow.
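The difference between the two copy behaviors is loosely analogous to shallow and deep copies in Python. The sketch below is only an analogy: the dictionary layout and the parameter name `region` are hypothetical and say nothing about how flows are actually stored.

```python
import copy

# Hypothetical flow: a dataset object that owns its parameter values.
original = {"name": "sales", "dataset": {"params": {"region": "01"}}}

# Copying the flow without duplicating the dataset: both flows share it.
shared = copy.copy(original)
shared["dataset"]["params"]["region"] = "02"
region_after_shared_copy = original["dataset"]["params"]["region"]

# Reset, then duplicate the dataset as well: parameters are now independent.
original["dataset"]["params"]["region"] = "01"
separate = copy.deepcopy(original)
separate["dataset"]["params"]["region"] = "03"
region_after_separate_copy = original["dataset"]["params"]["region"]

print(region_after_shared_copy, region_after_separate_copy)
```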
For more information, see Overview of Sharing.
Because the source data is never modified, after a job has run on a source matched by a dataset with parameters, you should consider removing that source from the source system or adjusting any applicable ranges on the matching parameters. Otherwise, outdated data may continue to factor into operations on the dataset with parameters.
NOTE: Housekeeping of source data is outside the scope of this product. Please contact your IT staff to assist as needed.
You can specify flow parameters and their default values, which can be invoked in the recipe steps of your flow. Wherever the flow parameter is invoked, it is replaced by the value you set for the parameter. Uses:
Flow parameter types:
Literal values: These values are always of String data type.
Tip: You can wrap flow parameter references in your transformations with one of the
NOTE: Wildcards are not supported.
Suppose you need to process your flow across several regions of your country. These regions are identified using a region ID value:
From the Flow View context menu, you select Manage parameters. In the Parameters tab, you specify the parameter name:
You must specify a default value. To verify that this critical parameter is properly specified before job execution, you set the default value to:
The above setting implies two things:
After the flow parameter has been created, you can invoke it in a transformation step using the following syntax.
Where the parameter is referenced, the default or applicable override value is applied. For more examples, see Create Flow Parameter.
If your flow references a recipe or dataset that is sourced from an upstream flow, the flow parameters from that flow are available in your current flow. The value of the parameter at the time of execution is passed to the current flow.
NOTE: Downstream values and overrides of parameters that share the same name take precedence. When you execute the downstream flow, the parameter value is applied to the current flow and to all upstream objects. For more information, see "Order of Evaluation" below.
Flow parameters are created at the flow level from the context menu in Flow View. See Manage Parameters Dialog.
Flow parameters can be edited, deleted, and overridden through the Flow View context menu. See Manage Parameters Dialog.
You can also apply overrides to your flow parameters as part of your plan definition. For more information, see Plan View Page.
You can specify variable and timestamp parameters to apply to the file or table paths of your outputs.
NOTE: Output parameters are independent of dataset parameters.
You can create the following types of output parameters:
Tip: These types of parameters can be applied to file or table paths. An output path can contain multiple parameters.
Suppose you are generating a JSON file as the results of job execution.
Since this job is scheduled and will be executed at a regular interval, you want to insert a timestamp as part of the output, so that your output filenames are unique and timestamped:
In this case, you would create an output parameter of timestamp type as part of the write settings for the job you are scheduling.
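A timestamp output parameter behaves roughly like the following Python sketch, which builds a unique filename from the run time. The base name `results`, the timestamp format, and the function `timestamped_output` are assumptions chosen for illustration; the product's own format options may differ.

```python
from datetime import datetime


def timestamped_output(base: str, run_time: datetime, ext: str = "json") -> str:
    """Return an output filename that embeds the run time."""
    stamp = run_time.strftime("%Y%m%d-%H%M%S")
    return f"{base}_{stamp}.{ext}"


# Two runs at different times yield two distinct filenames.
print(timestamped_output("results", datetime(2018, 1, 29, 8, 0, 0)))
print(timestamped_output("results", datetime(2018, 2, 5, 8, 0, 0)))
```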
When you are creating or editing a publishing action in the Run Jobs page, you can click the Parameterize destination link that appears in the right panel.
Tip: For outputs that are stored in buckets, you can parameterize the name of the bucket.
For more information, see Create Outputs.
Whenever you execute a job using the specified publishing action, the output parameters are applied.
After specifying variable parameters, you can insert new values for them at the time of job execution in the Run Job page.
For more information, see Run Job Page.
In addition to parameterizing the paths to imported datasets or outputs, you can also apply parameters to the buckets where these assets are stored. For example, if you are developing flows in one workspace and deploying them into a production workspace, it may be useful to create a parameter for the name of the bucket where outputs are written for the workspace.
Bucket names can be parameterized for the buckets in the following datastores:
For more information:
For each of the following types of parameter, you can apply override values as needed.
|dataset parameters||When you run a job, you can apply override values to variables for your imported datasets. See Run Job Page.|
|flow parameters||At the flow level, you can apply override values to flow parameters. These values are passed into the recipe and the rest of the flow for evaluation during recipe development and job execution.|
|output parameters||When you define your output objects in Flow View, you can apply override values to the parameterized output paths on an as-needed basis when you specify your job settings. See Run Job Page.|
Wherever a parameter value or override is specified in the following list, the value is applied to all matching parameters within the execution tree. Suppose you have created a parameter called varRegion, which is referenced in your imported dataset, recipe, and output object. If you specify an override value for varRegion in the Run Job page, that value is applied to the data you import (dataset parameter), the recipe during execution (flow parameter), and the path of the output that you generate (output parameter). Name matches are case-sensitive.
NOTE: Override values are applied to upstream flows, as well. Any overrides specified in the current flow are passed to downstream flows, where they can be overridden as needed.
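The varRegion example can be sketched in Python as a single override applied to every object that references the name. The paths, the recipe expression, the default value, and the `${name}` placeholder syntax (Python's `string.Template`) are hypothetical; only the parameter name varRegion and the case-sensitive matching come from the text above.

```python
from string import Template

# Hypothetical objects in one execution tree, all referencing varRegion.
TREE = {
    "dataset": Template("/input/${varRegion}/sales.csv"),
    "recipe": Template("region == '${varRegion}'"),
    "output": Template("/output/${varRegion}/results.json"),
}
DEFAULTS = {"varRegion": "01"}  # hypothetical default value


def resolve(overrides: dict) -> dict:
    """Apply overrides on top of the defaults to every object in the tree."""
    values = {**DEFAULTS, **overrides}
    return {name: t.substitute(values) for name, t in TREE.items()}


# A matching name overrides the value everywhere it is referenced.
print(resolve({"varRegion": "02"}))

# A name that differs only by case does not match; the default still applies.
print(resolve({"varregion": "02"}))
```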
Parameter values are evaluated based on the following order of precedence (highest to lowest):
NOTE: The following does not apply to environment parameters, which cannot be overridden.
Run-time overrides: Parameter values specified at run-time for jobs.
NOTE: The override value is applied to all subsequent operations in the platform. When a job is submitted to the job queue, any overrides are applied at that time. Changes to override values do not affect jobs that are already in flight.
NOTE: You can specify run-time override values when executing jobs through the APIs. See API Workflow - Run Job.
See Run Job Page.
When running a job based on datasets with parameters, results are written into separate folders for each parameterized path.
NOTE: During job execution, a canary file is written for each set of results to validate the path. For datasets with parameters, if the path includes folder-level parameterization, a separate folder is created for each parameterized path. During cleanup, only the canary files and the original folder path are removed. The parameterized folders are not removed. This is a known issue.
NOTE: Due to a known limitation, when you run a job on a parameterized dataset containing more than 100 files, the input path data must be compressed, which results in non-readable location values in the console. Jobs on datasets sourced from more than 6000 files may fail.
When you choose to run a job on a dataset with parameters from the user interface, any variables are specified using their default values.
Through the Run Job page, you can specify different values to apply to variables for the job.
NOTE: Values applied through the Run Job page to variables override the default values for the current execution of the job. Default values for the next job are not modified.
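The note above amounts to merging overrides into a copy of the defaults for each job, rather than changing the defaults themselves. A minimal Python sketch, with a hypothetical variable name `region` and hypothetical values:

```python
DEFAULTS = {"region": "01"}  # hypothetical variable and default value


def values_for_job(overrides=None) -> dict:
    """Merge overrides into a copy of the defaults for a single job."""
    return {**DEFAULTS, **(overrides or {})}


first_job = values_for_job({"region": "02"})  # override for this run only
next_job = values_for_job()                   # falls back to the default

print(first_job, next_job, DEFAULTS)
```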
NOTE: When you edit an imported dataset, if a variable is renamed, a new variable is created using the new name. Any override values assigned under the old variable name for the dataset must be re-applied. Instances of the variable and override values used in other imported datasets remain unchanged.
For more information, see Run Job Page.
You can schedule jobs for datasets with parameters.
NOTE: When a job is executed, the expected time of execution is used during execution. For scheduled jobs, this value is the scheduled time. For example, if a job scheduled for 08:00 begins execution at 08:05, any parameters that reference "now" time use 08:00 during the job run.
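The 08:00 versus 08:05 example can be sketched as follows. The function `effective_now` is hypothetical, and the fallback to the actual start time for jobs without a schedule is an assumption, since the note only states the behavior for scheduled jobs.

```python
from datetime import datetime
from typing import Optional


def effective_now(actual_start: datetime, scheduled: Optional[datetime] = None) -> datetime:
    """Return the time that "now"-based parameters would use."""
    # Assumption: jobs without a schedule fall back to the actual start time.
    return scheduled if scheduled is not None else actual_start


scheduled = datetime(2018, 1, 29, 8, 0)
started = datetime(2018, 1, 29, 8, 5)

# The scheduled time wins, even though execution began five minutes later.
stamp = effective_now(started, scheduled).strftime("%H:%M")
print(stamp)
```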
For a scheduled job:
See Schedule a Job.
In the Job Details page:
See Job Details Page.