
Trifacta Dataprep




If you licensed Dataprep by Trifacta before October 14, 2020, you are using the Dataprep by Trifacta Legacy product edition. On October 14, 2022, this product edition will be decommissioned by Google and will no longer be available for use. Current customers of this product edition are encouraged to transition to one of the product editions hosted by Trifacta. See Product Editions.

   



The following settings can be customized for the user experience in your Dataprep by Trifacta project. When you modify a setting, the change is immediately applied to the project. To access the page, select User menu > Admin console > Project settings.


NOTE: Users may not see a changed setting until they refresh the application page or log out and in again.

Enablement Options:

NOTE: Any values specified on this page apply only to this project and override any system-level defaults.

Option      Description

Default     The default value is applied. This value may be inherited from higher-level configuration.

            Tip: You can review the default value as part of the help text.

Enabled     The setting is enabled.

            NOTE: If the setting applies to a feature, the feature is enabled. Additional configuration may be required. See below.

Disabled    The setting is disabled.

Edit        Click Edit to enter a specific value for the setting.

Disable Dataprep

To disable Dataprep by Trifacta for this project, click the link.

NOTE: To remove a user and his or her assets from a project, please contact Trifacta Support.

For more information, see Enable or Disable Dataprep.

General

Locale

Set the locale to use for inferring or validating data in the application, such as numeric values or dates. The default is United States.

NOTE: After saving changes to your locale, refresh your page. Subsequent executions of the data inference service use the new locale settings.

For more information, see Locale Settings.

Session duration

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium
  • Dataprep by Trifacta Standard

Specify the length of time in minutes before a session expires. Default is 10080 (one week).

API

Allow users to generate access tokens

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium

When enabled, individual users can generate their own personal access tokens, which enable access to REST APIs. For more information, see Manage API Access Tokens.
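
Once a token has been generated, it is typically passed as a bearer token in the Authorization header of each API request. The following sketch is illustrative only; the base URL and endpoint are placeholders, so confirm both against the API documentation:

    # Illustrative sketch: call a REST API endpoint with a personal access token.
    # The base URL and endpoint are placeholders, not documented values.
    import requests

    BASE_URL = "https://example.cloud.trifacta.com"   # placeholder
    ACCESS_TOKEN = "<your-personal-access-token>"

    response = requests.get(
        f"{BASE_URL}/v4/flows",                       # example endpoint; verify in the API docs
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    response.raise_for_status()
    print(response.json())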

Maximum lifetime for user generated access tokens (days)

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium

Defines the maximum number of days for which a user-generated access token remains valid in the product.

Tip: To permit generation of access tokens that never expire, set this value to -1.

For more information, see Manage API Access Tokens.

Connectivity

Custom SQL query

When enabled, users can create custom SQL queries to import datasets from relational tables. For more information, see Create Dataset with SQL.

Enable conversion of standard JSON files via conversion service

When enabled, the Trifacta application utilizes the conversion service to ingest JSON files and convert them to a tabular format that is easier to import into the application. For more information, see Working with JSON v2.

NOTE: This feature is enabled by default but can be disabled as needed. The conversion process performs cleanup and re-organization of the ingested data for display in tabular format.

When disabled, the Trifacta application uses the old version of JSON import, which does not restructure the data and may require additional recipe steps to manually structure it into tabular format.

NOTE: Although imported datasets and recipes created under v1 of the JSON importer continue to work without interruption, the v1 version is likely to be deprecated in a future release. You should switch your older imported datasets and recipes to the new version. Instructions to migrate are provided at the link below.

NOTE: The legacy version of JSON import is required if you are working with compressed JSON files or only Newline JSON files.

For more information, see Working with JSON v1.
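
For reference, the difference between a standard JSON file (for example, a single top-level array of records) and a Newline JSON file can be illustrated with a small conversion script. This sketch is not part of the product; it only demonstrates the two formats:

    # Illustrative only: convert a standard JSON file containing a top-level
    # array of records into newline-delimited JSON (one record per line).
    import json

    with open("records.json") as src:          # e.g. [{"id": 1}, {"id": 2}]
        records = json.load(src)

    with open("records.ndjson", "w") as dst:   # e.g. {"id": 1} and {"id": 2} on separate lines
        for record in records:
            dst.write(json.dumps(record) + "\n")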

Manage access to data using user IAM permissions

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Premium

When enabled, user access to data services in Google Cloud Platform, such as Cloud Storage and BigQuery, is determined by the permissions defined in the user's assigned IAM role.


NOTE: When this feature is enabled, all Dataprep by Trifacta Premium users that belong to the project are automatically logged out of all Trifacta application sessions across all projects. For example, if a Dataprep by Trifacta Premium user is logged in to the product through another project, the user is logged out of their Trifacta application session when this feature is enabled. When each user logs in to the Trifacta application again, any changes to the user's permissions are applied. Because each API request requires authentication in the header, API users are not automatically logged out.

For more information on IAM-based permissions, see Required Dataprep User Permissions.

Max endpoints per JDBC REST connection

For a REST API connection to a JDBC source, this parameter defines the maximum number of endpoints that can be defined per connection.

Avoid modifying this value unless you are experiencing timeouts or failures to connect.

For more information, see REST API Connections.

Flows, recipes, and plans

Column from examples

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium
  • Dataprep by Trifacta Standard

When enabled, users can access a tool through the column menus that enables creation of new columns based on example mappings from the selected column. For more information, see Overview of TBE.

Editor Scheduling

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Standard

When enabled, flow editors are also permitted to create and edit schedules. For more information, see Flow View Page.

NOTE: The Scheduling feature may need to be enabled in your environment. When enabled, flow owners can always create and edit schedules.

When this feature is enabled, plan collaborators are also permitted to create and edit schedules. For more information, see Plan View Page.

Export

When enabled, users are permitted to export their flows and plans. Exported flows can be imported into other work areas or product editions. 

NOTE: If plans have been enabled in your project settings, enabling this flag applies to flows and plans.

Import

When enabled, users are permitted to import exported flows and plans.

NOTE: If plans have been enabled in your project settings, enabling this flag applies to flows and plans.

Maximum number of files to read in a directory for the initial sample

When the Trifacta application is generating an initial sample of data for your dataset from a set of source files, you can define the maximum number of files in a directory from which the sample is generated. This limit is applied to reduce the overhead of reading in a new file, which improves performance in the Transformer page.

Tip: The initial sample for file-based sources is generated by reading files one after another from the source. If the source is multiple files or a directory, this limit caps the number of files that can be scanned for sampling purposes.

NOTE: If the files in the directory are small, the initial sample may reach the maximum number of files while remaining below the maximum size permitted for a sample. You may see fewer rows than expected.

If the generated sample is unsatisfactory, you can generate a new sample using a different method. In that case, this limit no longer applies. For more information, see Overview of Sampling.
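
Conceptually, the initial sample is built by reading files until either the file-count limit or the sample-size limit is reached, whichever comes first. The sketch below illustrates that interaction; the limit values and the function itself are hypothetical, not product internals:

    # Hypothetical illustration of how a file-count cap and a sample-size cap interact.
    MAX_FILES = 5                    # this setting (example value)
    MAX_SAMPLE_BYTES = 10 * 2**20    # example sample-size limit (10 MB)

    def build_initial_sample(paths):
        sample, total_bytes = [], 0
        for path in paths[:MAX_FILES]:                 # stop after the file-count cap
            with open(path, "rb") as f:
                data = f.read()
            if total_bytes + len(data) > MAX_SAMPLE_BYTES:
                sample.append(data[:MAX_SAMPLE_BYTES - total_bytes])
                break                                  # stop at the sample-size cap
            sample.append(data)
            total_bytes += len(data)
        return b"".join(sample)                        # small files may exhaust the file cap first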


Plan feature

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium

When enabled, users can create plans to execute sequences of recipes across one or more flows. For more information, see Plans Page.

For more information on plans and orchestration, see Overview of Operationalization.

Schematized output

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium

When enabled, all output columns are typecast to their annotated types. This feature is enabled by default.

Webhooks

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium

When enabled, webhook notification tasks can be configured on a per-flow basis in Flow View page. Webhook notifications allow you to deliver messages to third-party applications based on the success or failure of your job executions. For more information, see Create Flow Webhook Task.
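
On the receiving end, a webhook task simply issues an HTTP request to a URL that you control. The following minimal receiver is an illustrative sketch; the payload field names (such as jobStatus) and the helper function are assumptions, so inspect the actual request body that your webhook task sends:

    # Minimal illustrative webhook receiver (Flask). The payload structure is an
    # assumption; log the real request body to see what your webhook task sends.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/trifacta-webhook", methods=["POST"])
    def handle_webhook():
        payload = request.get_json(force=True, silent=True) or {}
        print("Webhook received:", payload)
        if payload.get("jobStatus") == "Failed":       # assumed field name
            notify_on_call_team(payload)               # hypothetical helper
        return "", 204

    def notify_on_call_team(payload):
        pass  # placeholder for your alerting or ticketing integration

    if __name__ == "__main__":
        app.run(port=8080)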


Job execution

BigQuery execution

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium
  • Dataprep by Trifacta Standard

When enabled, the Trifacta application can execute transformation jobs inside BigQuery when all data sources and outputs for the job are located in BigQuery.

NOTE: Logical and physical optimization of jobs must also be enabled.

To enable BigQuery execution on your flow jobs, you must enable all general and BigQuery optimizations within the flow. For more information, see Flow Optimization Settings Dialog.

For more information on BigQuery as a running environment, see Overview of Job Execution.

BigQuery query temp dataset

By default, Dataprep by Trifacta assumes that the service account used to run Dataflow jobs has been granted the bigquery.datasets.create permission. Dataflow job execution on BigQuery data sources creates a temporary BigQuery dataset, in which temporary tables are created to store intermediate query results from the BigQuery sources. For this reason, the ability to create BigQuery datasets is required by default for any Dataflow job that contains one or more BigQuery data sources.

In some environments, this permission cannot be granted, which prevents job execution from creating the required temporary dataset and causes the job to fail.

As an alternative, you can use this setting to specify a pre-existing BigQuery dataset within which Dataprep by Trifacta can create the temporary tables. When this BigQuery dataset is provided, the job execution process writes intermediate query results into temporary tables within the dataset, and the bigquery.datasets.create permission is no longer required.

Requirements:

  • The dataset must be a pre-existing BigQuery dataset that is created outside of Dataprep by Trifacta. Please have your BigQuery administrator create the dataset first.
  • The temporary dataset must be located in the same region as the BigQuery source tables. Otherwise, the Dataflow job fails.
  • This BigQuery dataset is used only for Dataflow job execution on BigQuery datasources. Other sources and running environments are not affected.
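
For example, to satisfy the requirements above, your BigQuery administrator could create a dedicated temporary dataset in the same region as your source tables using the google-cloud-bigquery client library. This is a sketch; the project ID, dataset name, and region are placeholders:

    # Sketch: create a dedicated dataset for temporary tables in a specific region.
    # Project ID, dataset name, and region are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")

    dataset = bigquery.Dataset("my-gcp-project.dataprep_temp")
    dataset.location = "US"                                     # must match the region of the source tables
    dataset.default_table_expiration_ms = 24 * 60 * 60 * 1000   # optional: auto-expire temporary tables

    client.create_dataset(dataset, exists_ok=True)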

Logical and physical optimization of jobs

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Starter Edition
  • Dataprep by Trifacta Premium
  • Dataprep by Trifacta Standard

When enabled, the Trifacta application attempts to optimize job execution through logical optimizations of your recipe and physical optimizations of your recipe's interactions with data.

This workspace setting can be overridden for individual flows.

Tip: You should keep this feature enabled at the project level and disable it only as needed for individual flows.

For more information, see Flow Optimization Settings Dialog.

Require a companion service account for running jobs

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Premium

By default, Dataprep by Trifacta utilizes a default compute service account for running jobs on Dataflow. Optionally, you can enable this feature, which requires each user in the project to provide their own companion service account to run jobs. This feature is disabled by default.

Prerequisites:

  • Service accounts must be created in Google Cloud Platform.
  • Companion service accounts must have a minimum set of permissions.
  • For more information, see Google Service Account Management.

When this feature is enabled:

  • Project administrators can review and specify companion service accounts for individual users of the project. For more information, see Service Accounts Page.
  • Individual users can specify their own companion service account. For more information, see User Profile Page.
  • At runtime, an override service account can be applied if needed. See Run Job Page.

When this feature is disabled:

  • By default, all users of the project use the Compute Engine service account specified for the project.
  • If the companion service account requirement was previously enabled and is later disabled, the default service account for the project is used.
  • For more information, see Google Service Account Management.

SQL Scripts

When enabled, users may define SQL scripts to execute as part of a job run. Scripts can be executed before data ingestion, after output publication, or both, through any write-supported relational connection to which the user has access.

For more information, see Create Output SQL Scripts.

Schema validation

When enabled, the structure and ordering of columns in your imported datasets are checked by default for changes before data is ingested for job execution.

Tip: This setting can be overridden for individual jobs, even if it is disabled. For more information, see Run Job Page.

Errors are immediately reported in the Job Details page. See Job Details Page.

For more information on schema validation, see Overview of Schema Management.
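
Conceptually, schema validation compares the column names and their ordering in the current source against what was recorded when the dataset was imported. The simplified sketch below shows the kind of comparison involved; it is not the product's implementation:

    # Simplified illustration of a schema check: compare column names and ordering.
    def schema_changes(expected_columns, current_columns):
        """Return a list of human-readable differences; empty if the schemas match."""
        differences = []
        if expected_columns != current_columns:
            missing = [c for c in expected_columns if c not in current_columns]
            added = [c for c in current_columns if c not in expected_columns]
            if missing:
                differences.append(f"missing columns: {missing}")
            if added:
                differences.append(f"new columns: {added}")
            if not missing and not added:
                differences.append("column order changed")
        return differences

    # Example: a reordered source is reported as a schema change.
    print(schema_changes(["id", "name", "total"], ["name", "id", "total"]))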

Schema validation: stop job if schema changes are found

When schema validation is enabled, this setting specifies the default behavior when schema changes are found. 

  • When enabled, jobs fail when schema changes are found, and error messages are surfaced in the Trifacta application.
  • When disabled, jobs are permitted to continue. 

    • Jobs may ultimately fail due to schema changes. 
    • Jobs may result in bad data being written in outputs.
    • Job failures may be more challenging to debug.

      Tip: Setting this value to Disabled matches the behavior of the Trifacta application before schema validation was available.

Tip: This setting can be overridden for individual jobs, even if it is disabled. For more information, see Run Job Page.

Errors are immediately reported in the Job Details page. See Job Details Page.

For more information on schema validation, see Overview of Schema Management.

Skip write settings validation

When enabled, write settings objects are not validated as part of job execution. Write settings are used to define the outputs for file-based results. Default is enabled.

NOTE: When this feature is enabled, no validation is performed on write settings objects for scheduled and API-based jobs. Issues with these objects may cause failures during the transformation and publishing stages of job execution.

Tip: Before running a job via schedule or API that produces file-based outputs, you should run the job manually once to verify the outputs.

Trifacta Photon execution

Feature Availability: This feature is not available in Dataprep by Trifacta Legacy.

When enabled, users can choose to execute their jobs on Trifacta Photon, a proprietary running environment built for execution of small- to medium-sized jobs in memory on the Trifacta node.

NOTE: Jobs executed in Trifacta Photon are executed within the Trifacta VPC. Data is temporarily streamed to the Trifacta VPC during job execution and is not persisted.

NOTE: Jobs that are executed on Trifacta Photon may be limited to run for a maximum of 10 minutes, after which they fail with a timeout error. If your job fails due to this limit, please switch to running the job on Dataflow.

Tip: When enabled, you can choose to run jobs on Trifacta Photon through the Run Job page. The default running environment is the one that is best suited to the size of your job.

When Trifacta Photon is disabled:

  • You cannot run jobs on the local running environment. All jobs must be executed on a clustered running environment.
  • Quick Scan sampling jobs normally run on Trifacta Photon. If Trifacta Photon is disabled, the Trifacta application attempts to run the Quick Scan job on another available running environment. If that job fails or no suitable running environment is available, the Quick Scan sampling job fails.

For more information, see Run Job Page.

Data encryption

Use a customer-managed encryption key with Dataflow

Private Preview: This feature is disabled by default. For more information on enabling this feature in your project, please contact Trifacta Support.

Feature Availability: This feature is available in
Dataprep by Trifacta Enterprise Edition only.

Optionally, you can specify a customer-managed encryption key (CMEK) from your Google Cloud Platform project for use by the Trifacta application. Specify the resource name in this field. This key is used when running jobs on Dataflow.
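
Cloud KMS key resource names generally take the following form, where each segment is a placeholder for your own project, location, key ring, and key:

    projects/<project-id>/locations/<location>/keyRings/<key-ring>/cryptoKeys/<key-name>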

For more information on how to use CMEKs, see Overview of CMEK.

Validate that a customer-managed encryption key is used

Private Preview: This feature is disabled by default. For more information on enabling this feature in your project, please contact Trifacta Support.

Feature Availability: This feature is available in
Dataprep by Trifacta Enterprise Edition only.

When this feature is enabled, the Trifacta application checks Cloud Storage and BigQuery to verify that a customer-managed encryption key is in use by the datastore before writing any results to it. If a CMEK is not present, the Dataflow and publishing jobs fail.

NOTE: If CMEK validation checks are enabled, pushdown of job execution to BigQuery is disabled.

Tip: This feature checks only for the presence of a CMEK. It does not check whether the CMEK specified above is the one in use.

For more information on how to use CMEKs, see Overview of CMEK.
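
For reference, you can check whether a Cloud Storage bucket or a BigQuery dataset has a default customer-managed key configured by using the Google Cloud client libraries. This sketch mirrors what the validation looks for, not how the product performs it; the bucket and dataset names are placeholders:

    # Sketch: check whether a GCS bucket and a BigQuery dataset have a default
    # customer-managed encryption key (CMEK) configured. Names are placeholders.
    from google.cloud import bigquery, storage

    storage_client = storage.Client()
    bucket = storage_client.get_bucket("my-output-bucket")
    print("Bucket default KMS key:", bucket.default_kms_key_name)    # None if no CMEK is set

    bq_client = bigquery.Client()
    dataset = bq_client.get_dataset("my-gcp-project.my_output_dataset")
    config = dataset.default_encryption_configuration
    print("Dataset default KMS key:", config.kms_key_name if config else None)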

Scheduling and parameterization

Include Hidden Files in Parameterization

Feature Availability: This feature is not available in Dataprep by Trifacta Legacy.

When enabled, hidden files and hidden directories can be searched for matches for wildcard- or pattern-based parameters when importing datasets. 

Tip: This can be useful for importing data from generated profiles, which are stored in the .profiler folder in a job output directory.

NOTE: Scanning hidden folders may impact performance. For existing imported datasets with parameters, you should enable the inclusion of hidden folders on individual datasets and run a test job to evaluate impact.


For more information, see Parameterize Files for Import.
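
As an illustration, many pattern-matching conventions skip files and folders whose names begin with a dot unless the dot is matched explicitly, which is why hidden outputs such as the .profiler folder require this setting. The sketch below uses the Python glob module and placeholder local paths; it only demonstrates the convention, not the product's matching logic:

    # Illustrative only: the standard glob module skips dot-files and dot-folders
    # unless the pattern matches the leading dot explicitly.
    import glob

    job_output = "job-output"                             # placeholder local path

    visible = glob.glob(f"{job_output}/*/*")              # skips .profiler and other dot-folders
    profiler = glob.glob(f"{job_output}/.profiler/*")     # matches the hidden profiler folder explicitly

    print(len(visible), "visible files;", len(profiler), "profiler files")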

Scheduling feature

Feature Availability: This feature is available in the following editions:

  • Dataprep by Trifacta Enterprise Edition
  • Dataprep by Trifacta Professional Edition
  • Dataprep by Trifacta Premium
  • Dataprep by Trifacta Standard

When enabled, project users can schedule the execution of flows. See Add Schedule Dialog.

Publishing


Notifications

Email notifications: on plan/flow share

When email notifications are enabled, users automatically receive notifications whenever an owner shares a plan or flow with them.

Individual users can opt out of receiving notifications. For more information, see Preferences Page.

Experimental features

These experimental features are not supported. 

Experimental features are in active development. Their functionality may change from release to release, and they may be removed from the product at any time. Do not use experimental features in a production environment.

These settings may or may not change application behavior.

Cache data in the Transformer intelligently

This feature is currently disabled due to technical issues. When this message is removed, the feature is available in the product again.
For updates on system status, please visit https://status.trifacta.com/.

NOTE: This feature is in Beta release.

When enabled, this feature allows the Trifacta application to cache data from the Transformer page periodically based on Trifacta Photon execution time. This feature enables users to move faster between recipe steps.

Dataset configuration

NOTE: This experimental feature is intended for demonstration purposes only. It may be modified or removed from Google Cloud without warning in a future release. It should not be deployed in a production environment.

When enabled, users can select and rename columns and update column data types for an imported dataset object. 

By default, this feature is disabled.

For more information, see Dataset Configuration Settings.

Default language

Select the default language to use in the Trifacta application.

Enable creation of custom JavaScript UDFs for use in recipes

NOTE: This feature is in Beta release.

Feature Availability: This feature is not available in Dataprep by Trifacta Legacy.

When enabled, users can create and upload JavaScript-based user-defined functions (UDFs), which can be referenced in the recipes created in the project. For more information, see JavaScript UDFs.

Execution time threshold (in milliseconds) to control caching in the Transformer

This feature is currently disabled due to technical issues. When this message is removed, the feature is available in the product again.
For updates on system status, please visit https://status.trifacta.com/.

NOTE: This feature is in Beta release.

When intelligent caching in the Transformer is enabled, you can set the threshold time in milliseconds for when Trifacta Photon updates the cache. At each threshold of execution time in Trifacta Photon, the outputs of the intermediate recipe (CDF) steps are cached in memory, which speeds up movement between recipe steps in the Trifacta application.

Language localization

When enabled, the Trifacta application is permitted to display text in the selected language. 

Show user language preference

When enabled, users are permitted to select a preferred language in their preferences. See Preferences Page.
