Page tree

 

Contents:


 In Trifacta® Wrangler Enterpriseparameterization enables you to apply dynamic values to the data that you import and that you generate as part of job execution. Parameter types:

  • Dataset Parameters: You can parameterize the paths to inputs for your imported datasets, creating datasets with parameters.
  • Output Parameters: When you run a job, you can create parameters for the output paths for file- or table-based outputs.

These parameters can be defined by timestamp, patterns, wildcards, or variable values that you specify at runtime.

Datasets with Parameters

In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.

Example

Suppose you have imported data from a file system source, which has the following source path to weekly transaction logs:

<file_system>:///source/transactions/2018/01/29/transactions.csv

In the above, you can infer a date pattern in the form of 2018/01/29, which suggests that there may be a pattern of paths to transaction files.  Based on the pattern, it'd be useful to be able to do the following:

  • Import data from parallel paths for other weeks' data.
  • Sample across all of the available datasets.
  • Execute jobs based on runtime variables that you set for other transaction sets fitting the pattern.


  • Pass in parameterized values through API to operationalize the execution of jobs across weeks of transaction data.

In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:



<file_system>:///source/transactions/YYYY/MM/DD/transactions.csv

The above example implements a Datetime parameter on the path values, creating a dataset with parameters

Parameter Types

You can use the following types of parameters to create datasets with parameters:

  • Datetime parameters: Apply parameters to date and time values appearing in source paths.
    • When specifying a Datetime parameter, you must also specify a range, which limits the range of the Datetime values.
  • Variables: Define variable names and default values for a dataset with parameters. Modify these values at runtime to parameterize execution.
  • Pattern parameters: 
    • Wildcards: Apply wildcards to replace path values.
    • Regular Expressions: You can apply regular expressions to specify your dataset matches. Please see the limitations section below for more information.
    • Trifacta patterns: The platform supports a simplified means of expressing patterns. 

For more information, see Create Dataset with Parameters.

Mismatched Schemas

Trifacta Wrangler Enterprise expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

If schemas do not match:

  • When the first dataset contains extra columns at the end, the subsequent datasets that match should import without issues.
  • If the subsequent datasets contain extra columns at the end, the datasets may import. Depending on the situation, there may be issues.
  • If the subsequent datasets have additional or missing columns in the middle of the dataset, results of the import are unpredictable.
    • If there are extra columns in the middle of the dataset, you may see extra data in the final column, in which the spill-over data has not been split.
  • Ideally, you should fix these issues in the source of the data. But if you cannot, you can try the following:

Tips:

  • After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:
    • Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.
    • Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas. 
  • Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.
    • If your sources have variation in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Overview of Parameterization.

Limitations

  • You cannot create datasets with parameters from uploaded data.
  • You cannot create dataset with parameters from multiple file types.
    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
    • You cannot create parameters across text and binary file types.
  • You cannot apply parameters to write or publishing operations.
  • For regular expression patterns, the following reference types are not supported due to the length of time to evaluate:
    • Backreferences. The following example matches on axa, bxb, and cxc yet generates an error:

      ([a-c])x\1
    • Lookahead assertions: The following example matches on a, but only when it is part of an ab pattern. It generates an error:

      a(?=b)
  • For some source file types, such as Parquet, the schemas between source files must match exactly.

Creating Dataset with Parameters

From file system

When browsing for data on your default storage layer, you can choose to parameterize elements of the path. Through the Import Data page, you can select elements of the path, apply one of the supported parameter types and then create the dataset with parameters.

NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.

For more information, see Create Dataset with Parameters.

From relational source

If you are creating a dataset from a relational source, you can apply parameters to the custom SQL that pulls the data from the source. 

NOTE: Avoid   using parameters in places in the SQL statement that change the structure of the data. For example, within a SELECT statement, you should not add parameters between the SELECT and FROM keywords.

For more information, see Create Dataset with SQL.


Matching parameters

When a dataset with parameters is imported for use, all matching source files or tables are automatically unioned together. 

NOTE: Sources for a dataset with parameters should have matching schemas.

The initial sample that is loaded in the Transformer page is drawn from the first matching source file or table. If the initial sample is larger than the first file, rows may be pulled from other source objects. 

Managing Datasets with Parameters

Datasets with parameters in your flows

After you have imported a dataset with parameters into your flow:

  • You can review any parameters that have been applied to the dataset through the Parameterization in Flow view. 
  • When the dataset with parameters is selected, you can use the right panel to review and edit the parameters that are applied to it. 
  • You can change the default value applied to the parameter through the Parameters panel in Flow View.

For more information, see Flow View Page.

Tip: You can review details on the parameters applied to your dataset. See Dataset Details Page.

Sampling from datasets with parameters

When a dataset with parameters is first loaded into the Transformer page, the initial sample is loaded from the first found match in the range of matching datasets. If this match is a multi-sheet Excel file, the sample is taken from the first sheet in the file.

With parameters:

To work with data that appears in files other than the first match in the dataset, you must create a new sample in the Transformer page.   Any sampling operations performed within the Transformer page sample across all matching sources of the dataset.

With variables:

If you have created a variable with your dataset, you can apply a variable value to override the default at sampling time. In this manner, you can specify sampling to occur from specific source files from your dataset with parameters.

For more information, see Overview of Sampling.

Scheduling for datasets with parameters

Schedules can been applied to a dataset with parameters. When resolving date range rules for scheduling a dataset with parameters, the schedule time is used.

For more information, see Add Schedule Dialog.

Sharing for datasets with parameters

NOTE: When a flow containing parameters is copied, any changes to parameter values in the copied flow also affect parameters in the original flow. As a workaround, you can export and import the flow into the same system and replace the datasets in the imported flow. This is a known issue.

For more information, see Overview of Sharing.

Housekeeping

Since Trifacta Wrangler Enterprise never touches the source data, after a source that is matched for a dataset with parameters has been executed, you should consider removing it from the source system or adjusting any applicable ranges on the matching parameters. Otherwise, outdated data may continue to factor into operations on the dataset with parameters.

NOTE: Housekeeping of source data is outside the scope of Trifacta Wrangler Enterprise. Please contact your IT staff to assist as needed.

Output Parameters

You can specify variable and timestamp parameters to apply to the file or table paths of your outputs.

NOTE: Output parameters are independent of dataset parameters.

Parameter Types

You can create the following types of output parameters:

  • Datetime parameters: Insert date and time values in output paths based on the job's start time.
  • Variables: Define variable names and default values for an output parameter. Modify these values at runtime to parameterize execution.

Tip: These types of parameters can be applied to file or table paths. An output path can contain multiple parameters.

Example

Suppose you are generating a JSON file as the results of job execution. 

/outputs/myFlow/myOutput.json

Since this job is scheduled and will be executed on a regular interval, you want to insert a timestamp as part of the output, so that your output filenames are unique and timestamped:

/outputs/myFlow/myOutput_<timestamp>.json

In this case, you would create an output parameter of timestamp type as part of the write settings for the job you are scheduling. 

Creating output parameters

When you are creating or editing a publishing action in the Run Jobs page, you can click the Parameterize destination link that appears in the right panel. See Run Job Page.

Using output parameters

Whenever you execute a job using the specified publishing action, the output parameters are applied. 

After specifying variable parameters, you can insert new values for them at the time of job execution in the Run Job page.

For more information, see Run Job Page

Run Jobs with Parameters


Runtime Parameter Overrides

When you choose to run a job on a dataset with parameters from the user interface, any variables are specified using their default values. 

Through the Run Job page, you can specify different values to apply to variables for the job. 

NOTE: Override values applied to a job are not validated. Invalid overrides may cause your job to fail.

NOTE: Values applied through the Run Job page to variables override the default values for the current execution of the job. Default values for the next job are not modified.

NOTE: When you edit an imported dataset, if a variable is renamed, a new variable is created using the new name. Any override values assigned under the old variable name for the dataset must be re-applied. Instances of the variable and override values used in other imported datasets remain unchanged.

For more information, see Run Job Page.

In the Job Details page, click the Parameters tab to view the parameter names and values that were used as part of the job, including the list of matching datasets. See Job Details Page.

Scheduling Jobs

You can schedule jobs for datasets with parameters. See Schedule a Job.

Operationalization with Parameters

APIs

Through the API, you can apply runtime parameters to datasets with parameters during job execution. For more information, see API JobGroups Create v4.

For more information on working with parameters and the APIs, see API Workflow - Run Job on Dataset with Parameters.

Use of parameters to create imported datasets through the API is not supported.

Configuration

Disable

By default, parameterization is enabled. Parameters:

  • Parameterization
  • Output parameterization

    NOTE: Output parameterization requires that basic Parameterization be enabled, too.

For more information on disabling, see Workspace Admin Page.


This page has no comments.