Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r096

D toc

d-excerpt

In 

D s product
rtrue
parameterization enables you to apply dynamic values to the data that you import and that you generate as part of job execution.

Parameter types:

  • Environment Parameters: A workspace administrator or project owner can specify parameters that are available across the environment, including default values for them.
  • Dataset Parameters: You can parameterize the paths to inputs for your imported datasets, creating datasets with parameters. For file-based imported datasets, you can parameterize the bucket where the source is stored.
  • Flow Parameters: You can create parameters of multiple types at the flow level, which can be referenced in any recipe in the flow.
  • Output Parameters: When you run a job, you can create parameters for the output paths for file- or table-based outputs.

These parameters can be defined by timestamp, patterns, wildcards, or variable values that you specify at runtime.

Environment Parameters

Project owners or workspace administrators can define parameters that apply across the project or workspace environment. These parameters can be referenced by any user in the environment, but only a user with admin access can define, modify, or delete these parameters.

...

  • Names of environment parameters must begin with env..

Limitations

  • Environment parameter names are unique within the environment.
  • You cannot use environment parameters in recipes.
  • You cannot use environment parameters in plans.Environment parameter
  • names are unique within the environmentYou cannot use environment parameters in schedule triggers.

Example - parameterized bucket names

...

When you later export your flows from Dev and move them to Test and Prod, the imported flows automatically connect to the correct bucket for the target environment, since the bucket name is referenced by an environment parameter.

Export and Import

You can export environment parameters from one environment and import them to another. For example, you may be building your flows in a Dev workspace before they are exported and imported into a Prod workspace. If your flows make use of environment parameters from the Dev space, you may want to export the parameters and their values from the Dev workspace for migration to the Prod workspace. 

Info

NOTE: As part of the import process, you must reconcile name conflicts between imported environment parameters and the parameters that already exist in the workspace.

For more information, see Manage Environment Parameters.

Datasets with Parameters

In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.

Example

Suppose you have imported data from a file system source, which has the following source path to weekly transactions:

Code Block
<file_system>:///source/transactions/2018/01/29/transactions.csv

In the above, you can infer a date pattern in the form of 2018/01/29, which suggests that there may be a pattern of paths to transaction files.  Based on the pattern, it'd be useful to be able to do the following:

  • Import data from parallel paths for other weeks' data.
  • Sample across all of the available datasets.
  • Execute jobs based on runtime variables that you set for other transaction sets fitting the pattern.

In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:

...

Example - parameterized database names

In the previous example, bucket names differed between the development and production workspaces. Similarly, there may be different values for the database to use between development and production. 

Tip

Tip: If you are creating datasets using custom SQL, this method can be used for migrating flows between workspaces. If the flows reference imported datasets that use custom SQL, you only need to update the value of the environment parameter in the target workspace and reset the connection identifier in order to reconnect the imported flow to its data sources. 


Environment NameDatabase Name
Devmyco-DB-dev
Testmyco-DB-test
Prodmyco-DB-prod

In your Dev workspace, you can create the environment parameter as follows:

Code Block
env.db-source

Export and Import

You can export environment parameters from one environment and import them to another. For example, you may be building your flows in a Dev workspace before they are exported and imported into a Prod workspace. If your flows make use of environment parameters from the Dev space, you may want to export the parameters and their values from the Dev workspace for migration to the Prod workspace. 

Info

NOTE: As part of the import process, you must reconcile name conflicts between imported environment parameters and the parameters that already exist in the workspace.

For more information, see Manage Environment Parameters.

Datasets with Parameters

In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.

Example

Suppose you have imported data from a file system source, which has the following source path to weekly transactions:

Code Block
<file_system>:///source/transactions/YYYY2018/MM01/DD29/transactions.csv

The above example implements a Datetime parameter on the path values, creating a dataset with parameters

Parameter Types

You can use the following types of parameters to create datasets with parameters:

  • Datetime parameters: Apply parameters to date and time values appearing in source paths.
    • When specifying a Datetime parameter, you must also specify a range, which limits the range of the Datetime values.
  • Variables: Define variable names and default values for a dataset with parameters. 
    • Variable parameters can be applied to elements of the source path or to the bucket name, if applicable.
    • Modify these values at runtime to parameterize execution.
  • Pattern parameters: 
    • Wildcards: Apply wildcards to replace path values.
    • Regular Expressions: You can apply regular expressions to specify your dataset matches. Please see the limitations section below for more information.
    • D s lang
      itempatterns
      : The platform supports a simplified means of expressing patterns. 
      • For more information on 
        D s lang
        itempatterns
        , see Text Matching.

For more information, see Create Dataset with Parameters.

Guidelines for Sources

The source files or tables for a dataset with parameters should have consistent structures. Since the sources are parsed with the same recipe or recipes, variations in schema could cause breakages in the recipe or initial parsing steps, which are applied based on the schema of the first matching source. 

Info

NOTE: All datasets imported through a single parameter are expected to have exactly matching schemas. For more information on variations, see Mismatched Schemas below.

Tip

Tip: If there have been changes to the schema of the sources of your dataset with parameters, you can edit the dataset and update the parameters. See Library Page.

Parameters in paths for imported datasets are rendered as regular expressions. Depending on the number of parameters and the comparative depth of them in a parameterized dataset, the process of performing all pattern checks can grow large, impacting import performance.

Tip

Tip: When specifying an imported dataset with parameters, you should attempt to be as specific as possible in your parameter definitions.

Mismatched Schemas

D s product
 expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

If schemas do not match:

  • When the first dataset contains extra columns at the end, the subsequent datasets that match should import without issues.
  • If the subsequent datasets contain extra columns at the end, the datasets may import. Depending on the situation, there may be issues.
  • If the subsequent datasets have additional or missing columns in the middle of the dataset, results of the import are unpredictable.
    • If there are extra columns in the middle of the dataset, you may see extra data in the final column, in which the spill-over data has not been split.
  • Ideally, you should fix these issues in the source of the data. But if you cannot, you can try the following:

Tips:

  • After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:
    • Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.
    • Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas. 
  • Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.
    • If your sources have variation in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Initial Parsing Steps.

Limitations

  • You cannot create datasets with parameters from uploaded data.
  • You cannot create dataset with parameters from multiple file types.
    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
    • You cannot create parameters across text and binary file types.

...

In the above, you can infer a date pattern in the form of 2018/01/29, which suggests that there may be a pattern of paths to transaction files.  Based on the pattern, it'd be useful to be able to do the following:

  • Import data from parallel paths for other weeks' data.
  • Sample across all of the available datasets.
  • Execute jobs based on runtime variables that you set for other transaction sets fitting the pattern.

In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:

Code Block
<file_system>:///source/transactions/YYYY/MM/DD/transactions.csv

The above example implements a Datetime parameter on the path values, creating a dataset with parameters

Parameter Types

You can use the following types of parameters to create datasets with parameters:

  • Datetime parameters: Apply parameters to date and time values appearing in source paths.
    • When specifying a Datetime parameter, you must also specify a range, which limits the range of the Datetime values.
  • Variables: Define variable names and default values for a dataset with parameters. 
    • Variable parameters can be applied to elements of the source path or to the bucket name, if applicable.
    • Modify these values at runtime to parameterize execution.
  • Pattern parameters: 
    • Wildcards: Apply wildcards to replace path values.
    • Regular Expressions: You can apply regular expressions to specify your dataset matches. Please see the limitations section below for more information.
    • D s lang
      itempatterns
      : The platform supports a simplified means of expressing patterns. 
      • For more information on 
        D s lang
        itempatterns
        , see Text Matching.

For more information, see Create Dataset with Parameters.

Guidelines for Sources

The source files or tables for a dataset with parameters should have consistent structures. Since the sources are parsed with the same recipe or recipes, variations in schema could cause breakages in the recipe or initial parsing steps, which are applied based on the schema of the first matching source. 

Info

NOTE: All datasets imported through a single parameter are expected to have exactly matching schemas. For more information on variations, see Mismatched Schemas below.

Tip

Tip: If there have been changes to the schema of the sources of your dataset with parameters, you can edit the dataset and update the parameters. See Library Page.

Parameters in paths for imported datasets are rendered as regular expressions. Depending on the number of parameters and the comparative depth of them in a parameterized dataset, the process of performing all pattern checks can grow large, impacting import performance.

Tip

Tip: When specifying an imported dataset with parameters, you should attempt to be as specific as possible in your parameter definitions.

Info

NOTE: When importing one or more Excel files as a parameterized dataset, you select worksheets to include from the first file. If there are worksheets in other Excel files that match the names of the worksheets that you selected, those worksheets are also imported. All worksheets are unioned together into a single imported dataset with parameters. Pattern-based parameters are not supported for import of Excel worksheets.

Mismatched Schemas

D s product
 expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

If schemas do not match:

  • When the first dataset contains extra columns at the end, the subsequent datasets that match should import without issues.
  • If the subsequent datasets contain extra columns at the end, the datasets may import. Depending on the situation, there may be issues.
  • If the subsequent datasets have additional or missing columns in the middle of the dataset, results of the import are unpredictable.
    • If there are extra columns in the middle of the dataset, you may see extra data in the final column, in which the spill-over data has not been split.
  • Ideally, you should fix these issues in the source of the data. But if you cannot, you can try the following:

Tips:

  • After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:
    • Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.
    • Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas. 
  • Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.
    • If your sources have variation in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Initial Parsing Steps.

Limitations

  • You cannot create datasets with parameters from uploaded data.
  • You cannot create dataset with parameters from multiple file types.
    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
    • You cannot create parameters across text and binary file types.

  • For datasources that require conversion, such as Excel, PDF, or JSON files, you can create a dataset with parameters from a maximum of 500 converted files.
  • Parameter and variable names can be up to 255 characters in length.
  • For regular expressions, the following reference types are not supported due to the length of time to evaluate:
    • Backreferences. The following example matches on axabxb, and cxc yet generates an error:

      Code Block
      ([a-c])x\1
    • Lookahead assertions: The following example matches on a, but only when it is part of an ab pattern. It generates an error:

      Code Block
      a(?=b)
  • For some source file types, such as Parquet, the schemas between source files must match exactly.

...

Info

NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patternsdirectories and be as specific as possible with your wildcards and patterns.

Tip

Tip: For best results when parameterizing directories in your file path, include the trailing slash (/ ) as part of your parameterized value.

Options:

  • You can choose to search nested folders for files that match your specified pattern.

Tip

Tip: If your imported dataset is stored in a bucket, you can parameterize the bucket name, which can be useful if you are migrating flows between environments or must change the bucket at some point in the future.

For more information, see Create Dataset with Parameters.

...

You can specify flow parameters and their default permitted values, which can be invoked in the recipe steps of your flow. Wherever the flow parameter is invoked, it is replaced by the value you set for the parameter.

Uses:

  • Dynamically affect recipe steps
  • Improve flow usability; build fewer flows and recipes to maintain
  • Parameters are evaluated at design time in the Transformer page and at runtime during job execution
  • All parameter values can be overridden, as needed.

Flow parameter types:

  • Variable type flow parameters:
    • Literal values: These values are always of String data type.

      Tip

      Tip: You can wrap flow parameter references in your transformations with one of the PARSE functions

    . For more information, see Create Flow Parameter
    • .

      Info

      NOTE: Wildcards are not supported.

    • D s lang
      itempatterns
      . For more information, see Text Matching.
    • Regular expressions.
  • Selector type flow parameters:
    • These parameters are specified as a list of permitted values. 
    • The first value is the default value.
    • The parameter values are treated as String values. 

Limitations

  • Flow parameters are converted to constants in macros. Use of the macro in other recipes results in the constant value being applied.
  • A flow parameter cannot be used in some transformation steps or fields. 

Example - variable type

Suppose you need to provide variable date ranges to process your flow across several regions of your country. These regions are identified using a region ID value: pacificmountaincentral, or eastern

From the Flow View context menu, you select Manage parameters. In the Parameters tab, you specify the parameter name:

Code Block
paramRegion

You must specify a default value. To verify that this critical parameter is properly specified before job execution, you set the default value to:

Code Block
##UNSPECIFIED##

The above setting implies two things:

...

your transformations. For example, you may need to be able to transform the data for the preceding 30, 60, 90, or custom days. 

You can create a Variable parameter called:

Code Block
$paramDaysBack

To verify that this critical parameter is properly specified before job execution, you set the default value to:

Code Block
##UNSPECIFIED##

The above setting implies two things:

  • If the above value appears in the output, then an override value for the parameter was not specified when the job was executed, which prevents the default value being used erroneously.
  • Before the job is executed, you must specify an override value. You can specify an override:
    • At the flow level to assist in recipe development.
    • At run time to insert the proper DaysBack value for the job run. 
  • Since all values are treated as String values, you must interpret the numeric values within your transformation steps as Integer data types.

For more examples, see Create Flow Parameter.

Example - selector type

Suppose you need to process your flow across several regions of your country. These regions are identified using a region ID value. 

From the Flow View context menu, you select Manage parameters. In the Parameters tab, you specify Selector for the Type.

Specify the parameter name:

Code Block
paramRegion

You can then add Choice values for the permitted regions:

Code Block
WEST
MOUNTAIN
CENTRAL
EAST
OTHER

After the flow parameter has been created, you can invoke it in a transformation step using the following syntax.

...

Where the parameter is referenced, the default or applicable override value is applied. 

You can test for the parameter by using the following conditional in your transformation step and selecting from one of the listed values:

Code Block
$paramRegion == <selected_value>

For more examples, see Create Flow Parameter.

...

Tip

Tip: These types of parameters can be applied to file or table paths. An output path can contain multiple parameters.

Example

Suppose you are generating a JSON file as the results of job execution. 

...

When you are creating or editing a publishing action in the Run Jobs Job page, you can click the Parameterize destination link that appears in the right panel.

...

Bucket names can be parameterized for the buckets in the following datastores:

  • S3
Info

NOTE: Bucket names for

D s tfs
cannot be parameterized.

...

For more information:

...

Override TypeDescription
dataset parametersWhen you run a job, you can apply override values to variables for your imported datasets. See Run Job Page.
flow parameters

At the flow level, you can apply override values to flow parameters. These values are passed into the recipe and the rest of the flow for evaluation during recipe development and job execution.

Info

NOTE: Overrides applied at the flow level are passed into all recipes and other objects in the flow. Wherever there is case-sensitive match between the name of the overridden parameter and a parameter name in the flow, the override value is applied. These values can be overridden by ad-hoc values. See "Order of Precedence" below.

output parametersWhen you define your output objects in Flow View, you can apply override values to the parameterized output paths on an as-needed basis when you specify your job settings. See Run Job Page

below.

When flows are referenced in a plan task, their parameters and default values are available for override.

output parametersWhen you define your output objects in Flow View, you can apply override values to the parameterized output paths on an as-needed basis when you specify your job settings. See Run Job Page.
environment parameters

Environment parameters cannot be overridden.

Triggers

In flow and plan triggers, you can apply overrides to parameter values that are inherited from the object being triggered. 

Info

NOTE: In a trigger, displayed parameter values may be inherited from the plan or flow that is being triggered. To ensure that the proper value is used, you should set a specific value for the override in the trigger. This is a known issue.

Order of Parameter Evaluation

...

Info

NOTE: The following does not apply to environment parameters, which cannot be overridden.

  1. RunRuntime or trigger-time based overrides:  Parameter values specified at run-time for runtime for ad-hoc or scheduled jobs. 

    Info

    NOTE: The override value is applied to all subsequent operations in the platform. When a job is submitted to the job queue, any overrides are applied at that time. Changes to override values do not affect jobs that are already in flight.

    Info

    NOTE: You can specify run-time override values when executing jobs through the APIs. See API Workflow - Run Job.

    See Run Job Page.

    Flow level overrides: At the flow

    Task - Run Job.

    1. For more information on runtime overrides, see Run Job Page.

    2. For more information for trigger-based overrides for flows, see Add Schedule Dialog.
    3. For more information on trigger-based overrides for plans, see Plan View for Triggers.
  2. Flow- or plan-level overrides: At the flow or plan level, you can specify override values, which are passed into the flow's objects. These values can be overridden by overrides set in the above locations. See See Manage Parameters Dialog.
  3. Default values: If no overrides are specified, the default values are applied:
    1. Imported datasets: See Create Dataset with Parameters.
    2. Flow parameters: See Manage Parameters Dialog.
    3. Output parameters: See Run Job Page.
  4. Inherited (upstream) values: Any parameter values that are passed into a flow can be overridden by any matching override specified within the downstream flow.

...

Info

NOTE: When you edit an imported dataset, if a variable is renamed, a new variable is created using the new name. Any override values assigned under the old variable name for the dataset must be re-applied. Instances of the variable and override values used in other imported datasets remain unchanged.

For more information, see Run Job Page.

...

.

For more information, see Run Job Page.

Scheduling jobs on datasets with parameters

...

Info

NOTE: When a job is executed, the expected time of execution is used during execution. For scheduled jobs, this value is the scheduled time. For example, if a job scheduled for 08:00 begins execution at 08:05, any parameters that reference "now" time use 08:00 during the job run.

For a scheduled job:

  • Parameter values are evaluated based on the scheduled time of execution. Relative times are evaluated based on the scheduled time of execution.
  • If there are interruptions in service due to maintenance windows or other reasons, scheduled jobs are queued for execution on restart. These queued jobs are attempted only once.

...

" time use 08:00 during the job run.

For a scheduled job:

  • Parameter values are evaluated based on the scheduled time of execution. Relative times are evaluated based on the scheduled time of execution.
  • If there are interruptions in service due to maintenance windows or other reasons, scheduled jobs are queued for execution on restart. These queued jobs are attempted only once.

See Schedule a Job.

Parameters in Job Details

In the Job Details page:

  • Data sources tab: For file-based parameterized datasets, you can review the files that were matched at runtime for the specified parameters.
  • Parameters tab: View the parameter names and values that were used as part of the job, including the list of matching datasets. 

See Job Details Page.

D s also
labelparameterization