
Parameterization enables you to apply dynamic values to the data that you import and to the data that you generate as part of job execution.

Parameter types:

  • Environment Parameters: A workspace administrator or project owner can specify parameters that are available across the environment, including default values for them.
  • Dataset Parameters: You can parameterize the paths to inputs for your imported datasets, creating datasets with parameters. For file-based imported datasets, you can parameterize the bucket where the source is stored.
  • Flow Parameters: You can create parameters at the flow level, which can be referenced in any recipe in the flow.
  • Output Parameters: When you run a job, you can create parameters for the output paths for file- or table-based outputs.

These parameters can be defined by timestamp, patterns, wildcards, or variable values that you specify at runtime.

Datasets with Parameters

...

Environment Parameters

Project owners or workspace administrators can define parameters that apply across the project or workspace environment. These parameters can be referenced by any user in the environment, but only a user with admin access can define, modify, or delete these parameters.

Tip

Tip: Environment parameters are a useful means of ensuring that all users of the project or workspace share common reference values to buckets, output locations, and more. Environment parameter definitions can be exported and then imported into other projects or workspaces to ensure commonality across the enterprise. The values assigned to environment parameters can be modified after they have been imported into a new project or workspace.


Info

NOTE: You must have admin access to the project or workspace to define environment parameters.

  • Names of environment parameters must begin with the env. prefix (for example, env.bucket-source).

Limitations

  • You cannot use environment parameters in recipes.

  • You cannot use environment parameters in plans.

  • Environment parameter names are unique within the environment.

Example - parameterized bucket names

In this example, you have three workspaces, each of which has a different set of resources, although the only difference between them is the name of the S3 bucket in which they are stored:

Environment Name    S3 Bucket Name
Dev                 myco-s3-dev
Test                myco-s3-test
Prod                myco-s3-prod

In your Dev workspace, you can create an environment parameter called the following: 

Code Block
env.bucket-source

The default value for this parameter is set to:

Code Block
myco-s3-dev

When creating imported datasets in this workspace, you insert the environment parameter for the source bucket for each one.
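
For illustration, a parameterized source path for one of these datasets might look like the following. The notation is a sketch only; the bucket reference is inserted through the application when you create the imported dataset, and the object names are hypothetical:

Code Block
s3://<env.bucket-source>/sales/orders.csv

In the Dev workspace, where env.bucket-source defaults to myco-s3-dev, this path resolves to s3://myco-s3-dev/sales/orders.csv.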

For your Test and Prod environments:

  1. Export your environment parameters from Dev. 
  2. Import them into Test and Prod. During import, the importing user can map the imported parameters to existing parameters in the environment. 
  3. In the imported environments, an administrator can manage the imported parameters and values as needed.

When you later export your flows from Dev and move them to Test and Prod, the imported flows automatically connect to the correct bucket for the target environment, since the bucket name is referenced by an environment parameter.
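
For example, the same parameterized source path resolves to a different bucket in each environment, assuming env.bucket-source has been assigned the per-environment values listed above (the object path is hypothetical):

Code Block
Dev:  s3://myco-s3-dev/sales/orders.csv
Test: s3://myco-s3-test/sales/orders.csv
Prod: s3://myco-s3-prod/sales/orders.csv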

Export and Import

You can export environment parameters from one environment and import them to another. For example, you may be building your flows in a Dev workspace before they are exported and imported into a Prod workspace. If your flows make use of environment parameters from the Dev space, you may want to export the parameters and their values from the Dev workspace for migration to the Prod workspace. 

Info

NOTE: As part of the import process, you must reconcile name conflicts between imported environment parameters and the parameters that already exist in the workspace.

For more information, see Manage Environment Parameters.

Datasets with Parameters

In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.
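
For example, suppose each weekly refresh lands in a sibling directory whose name contains the refresh date (the directory names below are hypothetical):

Code Block
/imports/weekly/2024-01-01/orders.csv
/imports/weekly/2024-01-08/orders.csv
/imports/weekly/2024-01-15/orders.csv

You could replace the changing date segment of the path with a variable or Datetime parameter, so that a single dataset with parameters covers every refresh, and then supply or override the value at job runtime.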

...

  • Datetime parameters: Apply parameters to date and time values appearing in source paths.
    • When specifying a Datetime parameter, you must also specify a range, which limits the Datetime values that are matched.
  • Variables: Define variable names and default values for a dataset with parameters.
    • Variable parameters can be applied to elements of the source path or to the bucket name, if applicable.
    • Modify these values at runtime to parameterize execution.
  • Pattern parameters (example paths for each parameter type are sketched after this list):
    • Wildcards: Apply wildcards to replace path values.
    • Regular expressions: You can apply regular expressions to specify your dataset matches. See the Limitations section below for more information.
    • Patterns: The platform supports a simplified means of expressing patterns. For more information, see Text Matching.
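
As referenced above, the following sketch shows which part of a source path each parameter type might replace. The markup is illustrative only; in the application, you select the path segment and choose the parameter type rather than typing this syntax, and the paths are hypothetical:

Code Block
Datetime parameter:  /imports/weekly/{yyyy-MM-dd}/orders.csv
Variable:            /imports/weekly/<region>/orders.csv
Wildcard:            /imports/weekly/2024-01-08/orders-*.csv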

...

  • After import of a dataset with parameters, generate a new random sample using a full scan. When the new sample is selected:
    • Check the last column of your imported dataset to see if it contains multiple columns of data. If so, see if you can split the column yourself.
    • Scan the column histograms to see if there are columns where the number of mismatched, anomalous, or outlier values has suddenly increased. This could be a sign of mismatches between the schemas. 
  • Edit the dataset with parameters and review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, consider modifying the sources so that they match.
    • If your sources vary in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Initial Parsing Steps.

Limitations

  • You cannot create datasets with parameters from uploaded data.
  • You cannot create a dataset with parameters from multiple file types.
    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
    • You cannot create parameters across text and binary file types.

...

Info

NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match files, and avoid scanning directories that contain a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.
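
For example, assuming a deep directory tree with hypothetical paths, the first pattern below forces a scan across every top-level directory, while the second narrows the scan to a single branch and matches more quickly:

Code Block
Slower:  /*/*/orders-*.csv
Faster:  /imports/weekly/2024-*/orders-*.csv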


Tip

Tip: If your imported dataset is stored in a bucket, you can parameterize the bucket name, which can be useful if you are migrating flows between environments or must change the bucket at some point in the future.

For more information, see Create Dataset with Parameters.

...

  • Literal values: These values are always of String data type.

    Tip

    Tip: You can wrap flow parameter references in your transformations with one of the PARSE functions; a sketch follows this list. For more information, see Create Flow Parameter.

    Info

    NOTE: Wildcards are not supported.

  • Patterns. For more information, see Text Matching.
  • Regular expressions.
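
To illustrate the note above about literal values: because a flow parameter reference is treated as a String, you may need to convert it before using it in a numeric comparison. The snippet below is a sketch only; the parameter name is hypothetical, and the exact reference syntax and the available PARSE functions are described in Create Flow Parameter:

Code Block
PARSEINT($paramThreshold) > 100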

Limitations

  • Flow parameters are converted to constants in macros. Use of the macro in other recipes results in the constant value being applied.
  • A flow parameter cannot be used in some transformation steps or fields. 

...

In this case, you would create an output parameter of timestamp type as part of the write settings for the job you are scheduling. 
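
For example, the publishing destination for the scheduled job might embed the timestamp parameter in the output filename. The path and format token below are illustrative only; the actual token is selected when you define the parameter:

Code Block
/outputs/sales/orders-<yyyyMMdd>.csv

Each scheduled run then writes to a file named for the date on which the job executes, such as orders-20240108.csv.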


Creating output parameters

When you are creating or editing a publishing action in the Run Jobs page, you can click the Parameterize destination link that appears in the right panel. See Run Job Page.

Tip

Tip: For outputs that are stored in buckets, you can parameterize the name of the bucket.

For more information, see Create Outputs.

Using output parameters

Whenever you execute a job using the specified publishing action, the output parameters are applied. 

After specifying variable parameters, you can insert new values for them at the time of job execution in the Run Job page.

For more information, see Run Job Page.

Bucket Name Parameters

In addition to parameterizing the paths to imported datasets or outputs, you can also apply parameters to the buckets where these assets are stored. For example, if you are developing flows in one workspace and deploying them into a production workspace, it may be useful to create a parameter for the name of the bucket where outputs are written for the workspace.

Bucket names can be parameterized for the buckets in the following datastores:

  • S3
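
For example, an output destination that uses a parameterized bucket might resolve as follows in each workspace, assuming an environment parameter such as env.bucket-output has been assigned per-workspace values (the parameter name and paths are hypothetical):

Code Block
Dev:  s3://myco-s3-dev/outputs/orders/
Prod: s3://myco-s3-prod/outputs/orders/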

...

Info

NOTE: Bucket names for D s tfs cannot be parameterized.

...

  • For examples of parameterized bucket names, see "Environment Parameters" above.


Parameter Overrides

For each of the following types of parameter, you can apply override values as needed.

...

Wherever a parameter value or override is specified in the following list, the value is applied to all matching parameters within the execution tree. Suppose you have created a parameter called varRegion, which is referenced in your imported dataset, recipe, and output object. If you specify an override value for varRegion in the Run Job page, that value is applied to the data you import (dataset parameter), the recipe during execution (flow parameter), and the path of the output that you generate (output parameter). Name matches are case-sensitive.
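
For example, a single run-time override of varRegion might be applied as follows (the paths are hypothetical):

Code Block
Override:           varRegion = europe
Dataset parameter:  s3://myco-s3-dev/imports/europe/orders.csv
Flow parameter:     recipe steps that reference varRegion evaluate to "europe"
Output parameter:   /outputs/europe/orders.csv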

Info

NOTE: Override values are applied to upstream flows, as well. Any overrides specified in the current flow are passed to downstream flows, where they can be overridden as needed.

Parameter values are evaluated based on the following order of precedence (highest to lowest):

Info

NOTE: The following does not apply to environment parameters, which cannot be overridden.

  1. Run-time overrides: Parameter values specified at run-time for jobs. 

    Info

    NOTE: The override value is applied to all subsequent operations in the platform. When a job is submitted to the job queue, any overrides are applied at that time. Changes to override values do not affect jobs that are already in flight.

    Info

    NOTE: You can specify run-time override values when executing jobs through the APIs. See API Workflow - Run Job.

    See Run Job Page.

  2. Flow level overrides: At the flow level, you can specify override values, which are passed into the flow's objects. These values can be overridden by overrides set in the above locations. See Manage Parameters Dialog.
  3. Default values: If no overrides are specified, the default values are applied:
    1. Imported datasets: See Create Dataset with Parameters.
    2. Flow parameters: See Manage Parameters Dialog.
    3. Output parameters: See Run Job Page.
  4. Inherited (upstream) values: Any parameter values that are passed into a flow can be overridden by any matching override specified within the downstream flow.
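
As a worked example of the precedence order above, suppose a parameter named varRegion has a default value, a flow level override, and a run-time override (the values are hypothetical):

Code Block
Default value:        varRegion = us-west
Flow level override:  varRegion = us-east
Run-time override:    varRegion = europe

The job executes with varRegion set to europe, because the run-time override has the highest precedence. If no run-time override were supplied, the flow level override (us-east) would apply; if neither were supplied, the default value (us-west) would apply.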

...