In Dataprep by Trifacta®, parameterization enables you to apply dynamic values to the data that you import and that you generate as part of job execution.Parameter types:
- Environment Parameters: A workspace administrator or project owner can specify parameters that are available across the environment, including default values for them.
- Dataset Parameters: You can parameterize the paths to inputs for your imported datasets, creating datasets with parameters. For file-based imported datasets, you can parameterize the bucket where the source is stored.
- Flow Parameters: You can create parameters at the flow level, which can be referenced in any recipe in the flow.
- Output Parameters: When you run a job, you can create parameters for the output paths for file- or table-based outputs.
These parameters can be defined by timestamp, patterns, wildcards, or variable values that you specify at runtime.
Project owners or workspace administrators can define parameters that apply across the project or workspace environment. These parameters can be referenced by any user in the environment, but only a user with admin access can define, modify, or delete these parameters.
Tip: Environment parameters are a useful means of ensuring that all users of the project or workspace share common reference values to buckets, output locations, and more. Environment parameter definitions can be exported and then imported into other projects or workspaces to ensure commonality across the enterprise. The values assigned to environment parameters can be modified after they have been imported into a new project or workspace.
NOTE: You must have admin access to the project or workspace to define environment parameters.
- Names of environment parameters must begin with
- You cannot use environment parameters in recipes.
- You cannot use environment parameters in plans.
- Environment parameter names are unique within the environment.
Example - parameterized bucket names
In this example, you have three Trifacta workspaces, each of which has a different set of resources, although the only difference between them is the name of the S3 bucket in which they are stored:
|Environment Name||S3 Bucket Name|
In your Dev workspace, you can create an environment parameter called the following:
The default value for this parameter is set to:
When creating imported datasets in this workspace, you insert the environment parameter for the source bucket for each one.
For your Test and Prod environments:
- Export your environment parameters from Dev.
- Import them into Test and Prod. During import, the importing user can map the imported parameters to existing parameters in the environment.
- In the imported environments, an administrator can manage the imported parameters and values as needed.
When you later export your flows from Dev and move them to Test and Prod, the imported flows automatically connect to the correct bucket for the target environment, since the bucket name is referenced by an environment parameter.
Export and Import
You can export environment parameters from one environment and import them to another. For example, you may be building your flows in a Dev workspace before they are exported and imported into a Prod workspace. If your flows make use of environment parameters from the Dev space, you may want to export the parameters and their values from the Dev workspace for migration to the Prod workspace.
NOTE: As part of the import process, you must reconcile name conflicts between imported environment parameters and the parameters that already exist in the workspace.
For more information, see Manage Environment Parameters.
Datasets with Parameters
In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.
Suppose you have imported data from a file system source, which has the following source path to weekly transactions:
In the above, you can infer a date pattern in the form of
2018/01/29, which suggests that there may be a pattern of paths to transaction files. Based on the pattern, it'd be useful to be able to do the following:
- Import data from parallel paths for other weeks' data.
- Sample across all of the available datasets.
- Execute jobs based on runtime variables that you set for other transaction sets fitting the pattern.
In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:
The above example implements a Datetime parameter on the path values, creating a dataset with parameters.
You can use the following types of parameters to create datasets with parameters:
- Datetime parameters: Apply parameters to date and time values appearing in source paths.
- When specifying a Datetime parameter, you must also specify a range, which limits the range of the Datetime values.
- Variables: Define variable names and default values for a dataset with parameters.
- Variable parameters can be applied to elements of the source path or to the bucket name, if applicable.
- Modify these values at runtime to parameterize execution.
- Pattern parameters:
- Wildcards: Apply wildcards to replace path values.
- Regular Expressions: You can apply regular expressions to specify your dataset matches. Please see the limitations section below for more information.
: The platform supports a simplified means of expressing patterns.
- For more information on Patterns , see Text Matching.
For more information, see Create Dataset with Parameters.
Guidelines for Sources
The source files or tables for a dataset with parameters should have consistent structures. Since the sources are parsed with the same recipe or recipes, variations in schema could cause breakages in the recipe or initial parsing steps, which are applied based on the schema of the first matching source.
NOTE: All datasets imported through a single parameter are expected to have exactly matching schemas. For more information on variations, see Mismatched Schemas below.
Tip: If there have been changes to the schema of the sources of your dataset with parameters, you can edit the dataset and update the parameters. See Library Page.
Parameters in paths for imported datasets are rendered as regular expressions. Depending on the number of parameters and the comparative depth of them in a parameterized dataset, the process of performing all pattern checks can grow large, impacting import performance.
Tip: When specifying an imported dataset with parameters, you should attempt to be as specific as possible in your parameter definitions.
NOTE: When importing one or more Excel files as a parameterized dataset, you select worksheets to include from the first file. If there are worksheets in other Excel files that match the names of the worksheets that you selected, those worksheets are also imported. All worksheets are unioned together into a single imported dataset with parameters. Pattern-based parameters are not supported for import of Excel worksheets.
Dataprep by Trifacta expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.
If schemas do not match:
- When the first dataset contains extra columns at the end, the subsequent datasets that match should import without issues.
- If the subsequent datasets contain extra columns at the end, the datasets may import. Depending on the situation, there may be issues.
- If the subsequent datasets have additional or missing columns in the middle of the dataset, results of the import are unpredictable.
- If there are extra columns in the middle of the dataset, you may see extra data in the final column, in which the spill-over data has not been split.
- Ideally, you should fix these issues in the source of the data. But if you cannot, you can try the following:
- After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:
- Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.
- Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas.
- Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.
- Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
- You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.
- If your sources have variation in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Initial Parsing Steps.
- You cannot create datasets with parameters from uploaded data.
- You cannot create dataset with parameters from multiple file types.
- File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
- You cannot create parameters across text and binary file types.
- For parameterized datasets sourced from Cloud Storage, only the first 100,000 files are read.
- For datasources that require conversion, such as Excel, PDF, or JSON files, you can create a dataset with parameters from a maximum of 500 converted files.
- Parameter and variable names can be up to 255 characters in length.
- For regular expressions, the following reference types are not supported due to the length of time to evaluate:
Backreferences. The following example matches on
cxcyet generates an error:
Lookahead assertions: The following example matches on
a, but only when it is part of an
abpattern. It generates an error:
- For some source file types, such as Parquet, the schemas between source files must match exactly.
Creating Dataset with Parameters
From file system
When browsing for data on your default storage layer, you can choose to parameterize elements of the path. Through the Import Data page, you can select elements of the path, apply one of the supported parameter types and then create the dataset with parameters.
NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.
- You can choose to search nested folders for files that match your specified pattern.
You can choose to search hidden folders.
NOTE: Including hidden folders must be enabled by an administrator. For more information, see Dataprep Project Settings Page.
Tip: If your imported dataset is stored in a bucket, you can parameterize the bucket name, which can be useful if you are migrating flows between environments or must change the bucket at some point in the future.
For more information, see Create Dataset with Parameters.
From relational source
If you are creating a dataset from a relational source, you can apply parameters to the custom SQL that pulls the data from the source.
NOTE: Avoid using parameters in places in the SQL statement that change the structure of the data. For example, within a SELECT statement, you should not add parameters between the SELECT and FROM keywords.
For more information, see Create Dataset with SQL.
When a dataset with parameters is imported for use, all matching source files or tables are automatically unioned together.
NOTE: Sources for a dataset with parameters should have matching schemas.
The initial sample that is loaded in the Transformer page is drawn from the first matching source file or table. If the initial sample is larger than the first file, rows may be pulled from other source objects.
Managing Datasets with Parameters
Datasets with parameters in your flows
After you have imported a dataset with parameters into your flow:
- You can review any parameters that have been applied to the dataset through the Parameterization in Flow view.
- When the dataset with parameters is selected, you can use the right panel to review and edit the parameters that are applied to it.
- You can override the default value applied to the parameter through Flow View. See Manage Parameters Dialog.
For more information, see Flow View Page.
Tip: You can review details on the parameters applied to your dataset. See Dataset Details Page.
Sampling from datasets with parameters
When a dataset with parameters is first loaded into the Transformer page, the initial sample is loaded from the first found match in the range of matching datasets. If this match is a multi-sheet Excel file, the sample is taken from the first sheet in the file.
To work with data that appears in files other than the first match in the dataset, you must create a new sample in the Transformer page. Any sampling operations performed within the Transformer page sample across all matching sources of the dataset.
If you have created a variable with your dataset, you can apply a variable value to override the default at sampling time. In this manner, you can specify sampling to occur from specific source files from your dataset with parameters.
For more information, see Overview of Sampling.
Scheduling for datasets with parameters
Schedules can been applied to a dataset with parameters. When resolving date range rules for scheduling a dataset with parameters, the schedule time is used.
For more information, see Add Schedule Dialog.
Sharing for datasets with parameters
By default, when a flow containing parameters is copied, any changes to parameter values in the copied flow also affect parameters in the original flow. To separate these parameters, you have the following options:
- Optionally, when the flow is copied, you can copy the underlying datasets.
- As a workaround, you can export and import the flow into the same system and replace the datasets in the imported flow.
NOTE: For copying flows using parameterized datasets, you should duplicate the datasets, which creates separate copies of parameters and their values in the new flow. If datasets are not copied, then parameter changes in the copied flow modify the values in the source flow.
For more information, see Overview of Sharing.
Since Dataprep by Trifacta never touches the source data, after a source that is matched for a dataset with parameters has been executed, you should consider removing it from the source system or adjusting any applicable ranges on the matching parameters. Otherwise, outdated data may continue to factor into operations on the dataset with parameters.
NOTE: Housekeeping of source data is outside the scope of Dataprep by Trifacta. Please contact your IT staff to assist as needed.
You can specify flow parameters and their default values, which can be invoked in the recipe steps of your flow. Wherever the flow parameter is invoked, it is replaced by the value you set for the parameter. Uses:
- Dynamically affect recipe steps
- Improve flow usability; build fewer flows and recipes to maintain
- Parameters are evaluated at design time in the Transformer page and at runtime during job execution
- All parameter values can be overridden, as needed.
Flow parameter types:
Literal values: These values are always of String data type.
Tip: You can wrap flow parameter references in your transformations with one of the
NOTE: Wildcards are not supported.
- Patterns . For more information, see Text Matching.
- Regular expressions.
- Flow parameters are converted to constants in macros. Use of the macro in other recipes results in the constant value being applied.
- A flow parameter cannot be used in some transformation steps or fields.
Suppose you need to process your flow across several regions of your country. These regions are identified using a region ID value:
From the Flow View context menu, you select Manage parameters. In the Parameters tab, you specify the parameter name:
You must specify a default value. To verify that this critical parameter is properly specified before job execution, you set the default value to:
The above setting implies two things:
- If the above value appears in the output, then an override value for the parameter was not specified when the job was executed, which prevents the default value being used erroneously.
- Before the job is executed, you must specify an override value. You can specify an override:
- At the flow level to assist in recipe development.
- At run time to insert the proper region value for the job run.
After the flow parameter has been created, you can invoke it in a transformation step using the following syntax.
Where the parameter is referenced, the default or applicable override value is applied. For more examples, see Create Flow Parameter.
Upstream flow parameters
If your flow references a recipe or dataset that is sourced from an upstream flow, the flow parameters from that flow are available in your current flow. That value of the parameter at time of execution is passed to the current flow.
NOTE: Downstream values and overrides of parameters that share the same name take precedence. When you execute the downstream flow, the parameter value is applied to the current flow and to all upstream objects. For more information, see "Order of Evaluation" below.
Creating flow parameters
Flow parameters are created at the flow level from the context menu in Flow View. See Manage Parameters Dialog.
Managing flow parameters
Flow parameters can be edited, deleted, and overridden through the Flow View context menu. See Manage Parameters Dialog.
Flow parameters in plans
You can also apply overrides to your flow parameters as part of your plan definition. For more information, see Plan View Page.
You can specify variable and timestamp parameters to apply to the file or table paths of your outputs.
NOTE: Output parameters are independent of dataset parameters.
You can create the following types of output parameters:
- Datetime parameters: Insert date and time values in output paths based on the job's start time.
- Variables: Define variable names and default values for an output parameter. Modify these values at runtime to parameterize execution.
Tip: These types of parameters can be applied to file or table paths. An output path can contain multiple parameters.
Suppose you are generating a JSON file as the results of job execution.
Since this job is scheduled and will be executed on a regular interval, you want to insert a timestamp as part of the output, so that your output filenames are unique and timestamped:
In this case, you would create an output parameter of timestamp type as part of the write settings for the job you are scheduling.
Creating output parameters
When you are creating or editing a publishing action in the Run Jobs page, you can click the Parameterize destination link that appears in the right panel.
Tip: For outputs that are stored in buckets, you can parameterize the name of the bucket.
For more information, see Create Outputs.
Using output parameters
Whenever you execute a job using the specified publishing action, the output parameters are applied.
After specifying variable parameters, you can insert new values for them at the time of job execution in the Run Job page.
For more information, see Run Job Page.
Bucket Name Parameters
In addition to parameterizing the paths to imported datasets or outputs, you can also apply parameters to the buckets where these assets are stored. For example, if you are developing flows in one workspace and deploying them into a production workspace, it may be useful to create a parameter for the name of the bucket where outputs are written for the workspace.
Bucket names can be parameterized for the buckets in the following datastores:
- Cloud Storage
For more information:
For each of the following types of parameter, you can apply override values as needed.
|dataset parameters||When you run a job, you can apply override values to variables for your imported datasets. See Run Job Page.|
At the flow level, you can apply override values to flow parameters. These values are passed into the recipe and the rest of the flow for evaluation during recipe development and job execution.
NOTE: Overrides applied at the flow level are passed into all recipes and other objects in the flow. Wherever there is case-sensitive match between the name of the overridden parameter and a parameter name in the flow, the override value is applied. These values can be overridden by ad-hoc values. See "Order of Precedence" below.
|output parameters||When you define your output objects in Flow View, you can apply override values to the parameterized output paths on an as-needed basis when you specify your job settings. See Run Job Page.|
Order of Parameter Evaluation
Wherever a parameter value or override is specified in the following list, the value is applied to all matching parameters within the execution tree. Suppose you have created a parameter called
varRegion, which is referenced in your imported dataset, recipe, and output object. If you specify an override value for
varRegion in the Run Job page, that value is applied to the data you import (dataset parameter), the recipe during execution (flow parameter), and the path of the output that you generate (output parameter). Name matches are case-sensitive.
NOTE: Override values are applied to upstream flows, as well. Any overrides specified in the current flow are passed to downstream flows, where they can be overridden as needed.
Parameter values are evaluated based on the following order of precedence (highest to lowest):
NOTE: The following does not apply to environment parameters, which cannot be overridden.
Run-time overrides: Parameter values specified at run-time for jobs.
NOTE: The override value is applied to all subsequent operations in the platform. When a job is submitted to the job queue, any overrides are applied at that time. Changes to override values do not affect jobs that are already in flight.
NOTE: You can specify run-time override values when executing jobs through the APIs. See API Workflow - Run Job.
See Run Job Page.
- Flow level overrides: At the flow level, you can specify override values, which are passed into the flow's objects. These values can be overridden by overrides set in the above locations. See Manage Parameters Dialog.
- Default values: If no overrides are specified, the default values are applied:
- Inherited (upstream) values: Any parameter values that are passed into a flow can be overridden by any matching override specified within the downstream flow.
Run Jobs with Parameters
When running a job based on datasets with parameters, results are written into separate folders for each parameterized path.
NOTE: During job execution, a canary file is written for each set of results to validate the path. For datasets with parameters, if the path includes folder-level parameterization, a separate folder is created for each parameterized path. During cleanup, only the the canary files and the original folder path are removed. The parameterized folders are not removed. This is a known issue.
NOTE: Due to a limitation in Dataflow, when you run a job on a parameterized dataset containing more than 100 files, the input paths data must be compressed, which results in non-readable location values in the Dataflow console. Running jobs on datasets sourced from more than 6000 files may fail.
Runtime Parameter Overrides
When you choose to run a job on a dataset with parameters from the user interface, any variables are specified using their default values.
Through the Run Job page, you can specify different values to apply to variables for the job.
NOTE: Override values applied to a job are not validated. Invalid overrides may cause your job to fail.
NOTE: Values applied through the Run Job page to variables override the default values for the current execution of the job. Default values for the next job are not modified.
NOTE: When you edit an imported dataset, if a variable is renamed, a new variable is created using the new name. Any override values assigned under the old variable name for the dataset must be re-applied. Instances of the variable and override values used in other imported datasets remain unchanged.
For more information, see Run Job Page.
Scheduling jobs on datasets with parameters
You can schedule jobs for datasets with parameters.
NOTE: When a job is executed, the expected time of execution is used during execution. For scheduled jobs, this value is the scheduled time. For example, if a job scheduled for 08:00 begins execution at 08:05, any parameters that reference "now" time use 08:00 during the job run.
For a scheduled job:
- Parameter values are evaluated based on the scheduled time of execution. Relative times are evaluated based on the scheduled time of execution.
- If there are interruptions in service due to maintenance windows or other reasons, scheduled jobs are queued for execution on restart. These queued jobs are attempted only once.
See Schedule a Job.
Parameters in Job Details
In the Job Details page:
- Data sources tab: For file-based parameterized datasets, you can review the files that were matched at runtime for the specified parameters.
- Parameters tab: View the parameter names and values that were used as part of the job, including the list of matching datasets.
See Job Details Page.
This page has no comments.