Trifacta Dataprep




The following settings can be customized for the user experience in your Dataprep by Trifacta project. When you modify a setting, the change is immediately applied to the project. To access the page, select User menu > Admin console > Project settings.


NOTE: Users may not experience the changed environment until each user refreshes the application page or logs out and in again.

Enablement Options:

NOTE: Any values specified on this page apply exclusively to the specific project and override any system-level defaults.

Option | Description
Default | The default value is applied. This value may be inherited from higher-level configuration. Tip: You can review the default value as part of the help text.
Enabled | The setting is enabled. NOTE: If the setting applies to a feature, the feature is enabled. Additional configuration may be required. See below.
Disabled | The setting is disabled.
Edit | Click Edit to enter a specific value for the setting.

Disable Dataprep

To disable Dataprep by Trifacta for this project, click the link.

NOTE: To remove a user and his or her assets from a project, please contact Alteryx Support.

For more information, see Enable or Disable Dataprep.

General

Locale

Set the locale to use for inferring or validating data in the application, such as numeric values or dates. The default is United States.

NOTE: After saving changes to your locale, refresh your page. Subsequent executions of the data inference service use the new locale settings.

For more information, see Locale Settings.

Session duration

Feature Availability: This feature may not be available in all product editions.

Specify the length of time in minutes before a session expires. Default is 10080 (one week).
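As a quick check of the default value, one week expressed in minutes works out to 10080. A minimal sketch using only the Python standard library (the constant name is illustrative and not part of the product):

```python
from datetime import timedelta

# One week expressed in whole minutes, matching the documented default.
DEFAULT_SESSION_MINUTES = int(timedelta(weeks=1).total_seconds() // 60)

print(DEFAULT_SESSION_MINUTES)  # 10080
```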

API

Allow users to generate access tokens

Feature Availability: This feature may not be available in all product editions.

When enabled, individual users can generate their own personal access tokens, which enable access to REST APIs. For more information, see Manage API Access Tokens.
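The sketch below shows one way a client might attach a personal access token to a REST request. It rests on assumptions: the Bearer scheme in the Authorization header, the placeholder URL, and the helper name build_request are not confirmed by this page. See Manage API Access Tokens for the authoritative request format. The request is only constructed here, not sent.

```python
import urllib.request


def build_request(url: str, token: str) -> urllib.request.Request:
    """Build a request that carries the token in the Authorization header.

    Assumption: the API accepts "Authorization: Bearer <token>".
    """
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder endpoint and token, for illustration only.
request = build_request("https://example.com/v4/flows", "my-access-token")
print(request.get_header("Authorization"))
```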

Maximum lifetime for user generated access tokens (days)

Feature Availability: This feature may not be available in all product editions.

Defines the maximum number of days that a user-generated access token is permitted for use in the product.

Tip: To permit generation of access tokens that never expire, set this value to -1.

For more information, see Manage API Access Tokens.
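To illustrate how a maximum lifetime and the special value -1 could interact, here is a minimal sketch. The function token_expiry and its validation rules are hypothetical. The product's actual enforcement logic is not described on this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

NEVER_EXPIRES = -1  # documented value that permits non-expiring tokens


def token_expiry(created: datetime, requested_days: int, max_days: int) -> Optional[datetime]:
    """Return the expiry time, or None for a token that never expires.

    Hypothetical helper: assumes a maximum of -1 lifts the cap entirely.
    """
    if requested_days == NEVER_EXPIRES:
        if max_days != NEVER_EXPIRES:
            raise ValueError("non-expiring tokens are not permitted")
        return None
    if requested_days <= 0:
        raise ValueError("lifetime must be a positive number of days")
    if max_days != NEVER_EXPIRES and requested_days > max_days:
        raise ValueError(f"lifetime exceeds the maximum of {max_days} days")
    return created + timedelta(days=requested_days)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(token_expiry(created, 30, 90))   # expires 30 days after creation
print(token_expiry(created, -1, -1))   # None: the token never expires
```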

Connectivity

Custom SQL query

When enabled, users can create custom SQL queries to import datasets from relational tables. For more information, see Create Dataset with SQL.

Enable conversion of standard JSON files via conversion service

When enabled, the Trifacta application utilizes the conversion service to ingest JSON files and convert them to a tabular format that is easier to import into the application. For more information, see Working with JSON v2.

NOTE: This feature is enabled by default but can be disabled as needed. The conversion process performs cleanup and re-organization of the ingested data for display in tabular format.

When disabled, the Trifacta application uses the old version of JSON import, which does not restructure the data and may require additional recipe steps to manually structure it into tabular format.

NOTE: Although imported datasets and recipes created under v1 of the JSON importer continue to work without interruption, the v1 version is likely to be deprecated in a future release. You should switch your old imported datasets and recipes to using the new version. Instructions to migrate are provided at the link below.

NOTE: The legacy version of JSON import is required if you are working with compressed JSON files or only Newline JSON files.

For more information, see Working with JSON v1.

Enables long loading from bigquery

When enabled, large datasets or custom SQL requests from BigQuery are ingested in the background, allowing users to continue to use the Trifacta application while the ingest completes. 

Tip: You can monitor the ingest process through Flow View or the Dataset Details page.


Manage access to data using user IAM permissions

Feature Availability: This feature may not be available in all product editions.

When enabled, user access to data services in Google Cloud Platform, such as Cloud Storage and BigQuery, is determined by the permissions defined in a user's assigned IAM role.


NOTE: When this feature is enabled, all Dataprep by Trifacta Premium users that belong to the project are automatically logged out of all Trifacta application sessions across all projects. For example, if a Dataprep by Trifacta Premium user is logged into the product through another project, the user is logged out of their Trifacta application session when this feature is enabled. When each user logs in to the Trifacta application again, any changes to the user's permissions are applied. Since each API request requires authentication in the header, API users are not automatically logged out.

For more information on IAM-based permissions, see Required Dataprep User Permissions.

Max endpoints per JDBC REST connection

For a REST API connection to a JDBC source, this parameter defines the maximum number of endpoints that can be defined per connection.

Avoid modifying this value unless you are experiencing timeouts or failures to connect.

For more information, see REST API Connections.

Flows, recipes, and plans

Column from examples

Feature Availability: This feature may not be available in all product editions.

When enabled, users can access a tool through the column menus that enables creation of new columns based on example mappings from the selected column. For more information, see Overview of TBE.

Editor Scheduling

Feature Availability: This feature may not be available in all product editions.

When enabled, flow editors are also permitted to create and edit schedules. For more information, see Flow View Page.

NOTE: The Scheduling feature may need to be enabled in your environment. When enabled, flow owners can always create and edit schedules.

When this feature is enabled, plan collaborators are also permitted to create and edit schedules. For more information, see Plan View Page.

Enable creation of custom JavaScript UDFs for use in recipes

Feature Availability: This feature may not be available in all product editions.

When enabled, users can create and upload JavaScript-based user-defined functions (UDFs), which can be referenced in the recipes created in the project. For more information, see JavaScript UDFs.

NOTE: User-defined functions can be pushed down to BigQuery during job execution. This optimization must be enabled for each flow. For more information, see Flow Optimization Settings Dialog.

Export

When enabled, users are permitted to export their flows and plans. Exported flows can be imported into other work areas or product editions. 

NOTE: If plans have been enabled in your project settings, enabling this flag applies to flows and plans.

Import

When enabled, users are permitted to import exported flows and plans.

NOTE: If plans have been enabled in your project settings, enabling this flag applies to flows and plans.

Maximum number of files to read in a directory for the initial sample

When the Trifacta application is generating an initial sample of data for your dataset from a set of source files, you can define the maximum number of files in a directory from which the sample is generated. This limit is applied to reduce the overhead of reading in a new file, which improves performance in the Transformer page.

Tip: The initial sample type for files is generated by reading one file after another from the source. If the source is multiple files or a directory, this limit caps the maximum number of files that can be scanned for sampling purposes.

NOTE: If the files in the directory are small, the initial sample may contain the maximum number of files and still be smaller than the maximum size permitted for a sample. You may see fewer rows than expected.

If the generated sample is unsatisfactory, you can generate a new sample using a different method. In that case, this limit no longer applies. For more information, see Overview of Sampling.
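The interaction between the file limit and the sample size limit can be sketched as follows. This is an illustration only: the function collect_initial_sample, its parameters, and the byte-based size cap are assumptions, not the application's actual sampling code.

```python
from typing import Iterable, List, Tuple


def collect_initial_sample(
    files: Iterable[Tuple[str, bytes]],
    max_files: int,
    max_sample_bytes: int,
) -> Tuple[List[str], bytes]:
    """Read files one after another until either cap is reached.

    `files` yields (name, content) pairs; both caps are hypothetical parameters.
    """
    used: List[str] = []
    sample = bytearray()
    for name, content in files:
        if len(used) >= max_files or len(sample) >= max_sample_bytes:
            break
        remaining = max_sample_bytes - len(sample)
        sample.extend(content[:remaining])  # never exceed the size cap
        used.append(name)
    return used, bytes(sample)


# Three small files, a cap of two files, and a generous size cap.
small_files = [("a.csv", b"x" * 10), ("b.csv", b"y" * 10), ("c.csv", b"z" * 10)]
used, sample = collect_initial_sample(small_files, max_files=2, max_sample_bytes=100)
print(used, len(sample))
```

With small files, the file cap is reached first, so the sample stays well below the size cap. This is the situation described in the NOTE above.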


Plan feature

Feature Availability: This feature may not be available in all product editions.

When enabled, users can create plans to execute sequences of recipes across one or more flows. For more information, see Plans Page.

For more information on plans and orchestration, see Overview of Operationalization.

Schematized output

Feature Availability: This feature may not be available in all product editions.

When enabled, all output columns are typecast to their annotated types. This feature is enabled by default.

UI for range join

When enabled, workspace users can specify join key matching across a range of values. For more information, see Configure Range Join.

Webhooks

Feature Availability: This feature may not be available in all product editions.

When enabled, webhook notification tasks can be configured on a per-flow basis in the Flow View page. Webhook notifications allow you to deliver messages to third-party applications based on the success or failure of your job executions. For more information, see Create Flow Webhook Task.


Job execution

BigQuery execution

Feature Availability: This feature may not be available in all product editions.

When enabled, the Trifacta application can execute transformation jobs inside BigQuery when all data sources and outputs for the job are located in BigQuery.

NOTE: Logical and physical optimization of jobs must also be enabled.

To enable BigQuery execution on your flow jobs, you must enable all general and BigQuery optimizations within the flow. For more information, see Flow Optimization Settings Dialog.

For more information on BigQuery as a running environment, see Overview of Job Execution.
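The conditions described in this section can be summarized in a short sketch. All names here (JobContext, its fields, and eligible_for_bigquery) are hypothetical and chosen for illustration. The application's real eligibility checks may include conditions that are not listed on this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JobContext:
    # All field names are hypothetical, for illustration only.
    bigquery_execution_enabled: bool   # project setting "BigQuery execution"
    job_optimization_enabled: bool     # project setting for logical and physical optimization
    flow_optimizations_enabled: bool   # general and BigQuery optimizations within the flow
    source_systems: Sequence[str]      # where each data source is located
    output_systems: Sequence[str]      # where each output is written


def eligible_for_bigquery(ctx: JobContext) -> bool:
    """True only when every documented condition holds."""
    settings_ok = (
        ctx.bigquery_execution_enabled
        and ctx.job_optimization_enabled
        and ctx.flow_optimizations_enabled
    )
    locations = list(ctx.source_systems) + list(ctx.output_systems)
    all_in_bigquery = bool(locations) and all(s == "bigquery" for s in locations)
    return settings_ok and all_in_bigquery


ctx = JobContext(True, True, True, ["bigquery"], ["bigquery"])
print(eligible_for_bigquery(ctx))  # True
```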

BigQuery query temp dataset

By default, Dataprep by Trifacta assumes that the service account used to run Dataflow jobs has been granted the bigquery.datasets.create permission. For jobs that contain a BigQuery data source, this permission is required by default. Dataflow job execution on BigQuery data sources creates a temporary BigQuery dataset, in which temporary tables are created to store intermediate query results from the BigQuery sources. The ability to create BigQuery datasets is required to run Dataflow jobs that contain one or more BigQuery data sources.

In some environments, this permission cannot be granted, which prevents job execution from creating the required temporary dataset and causes the job to fail.

As an alternative, you can use this setting to specify a pre-existing BigQuery dataset within which Dataprep by Trifacta can create the temporary tables. When this BigQuery dataset is provided, the job execution process writes intermediate query results into temporary tables within the dataset, and the bigquery.datasets.create permission is no longer required.

Requirements:

  • The dataset must be a pre-existing BigQuery dataset that is created outside of Dataprep by Trifacta. Please have your BigQuery administrator create the dataset first.
  • The temporary dataset must be located in the same region as the BigQuery source tables. Otherwise, the Dataflow job fails.
  • This BigQuery dataset is used only for Dataflow job execution on BigQuery datasources. Other sources and running environments are not affected.
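The requirements above amount to a small decision, sketched here. The function plan_temp_dataset, its parameters, and its return values are hypothetical. Region names are compared as plain strings, which is a simplification.

```python
from typing import Optional


def plan_temp_dataset(
    configured_dataset_region: Optional[str],
    source_region: str,
    can_create_datasets: bool,
) -> str:
    """Decide where intermediate query results would be written.

    configured_dataset_region is None when this setting is left empty.
    can_create_datasets reflects the bigquery.datasets.create permission.
    """
    if configured_dataset_region is not None:
        if configured_dataset_region != source_region:
            # Documented behavior: a region mismatch causes the Dataflow job to fail.
            raise ValueError("temp dataset must be in the same region as the source tables")
        return "use-configured-dataset"
    if not can_create_datasets:
        raise PermissionError("bigquery.datasets.create is required when no dataset is configured")
    return "create-temporary-dataset"


print(plan_temp_dataset("EU", "EU", can_create_datasets=False))  # use-configured-dataset
```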

In-VPC execution

Feature Availability: This feature may not be available in all product editions.

When enabled, transformation jobs that are not on Dataflow can be executed within your enterprise's virtual private cloud (VPC). 

NOTE: Additional configuration through the Google Cloud command line is required. For more information, see Run Dataprep in Your VPC.

When in-VPC execution has been enabled and configured, you can configure some aspects of runtime execution through the Trifacta application. For more information, see VPC Runtime Settings Page.

Logical and physical optimization of jobs

Feature Availability: This feature may not be available in all product editions.

When enabled, the Trifacta application attempts to optimize job execution through logical optimizations of your recipe and physical optimizations of your recipe's interactions with data.

This workspace setting can be overridden for individual flows.

Tip: You should keep this feature enabled. Please enable it at the project level and disable it only if needed at the flow level.

For more information, see Flow Optimization Settings Dialog.

Require a companion service account for running jobs

Feature Availability: This feature may not be available in all product editions.

By default, Dataprep by Trifacta utilizes a default compute service account for running jobs on Dataflow. Optionally, you can enable this feature, which requires each user in the project to provide their own companion service account to run jobs. This feature is disabled by default.

Prerequisites:

  • Service accounts must be created in the Google Cloud platform.
  • Companion service accounts must have a minimum set of permissions.
  • For more information, see Google Service Account Management.

When this feature is enabled:

  • Project administrators can review and specify companion service accounts for individual users of the project. For more information, see Service Accounts Page.
  • Individual users can specify their companion service account. For more information, see User Profile Page.
  • At runtime, an override service account can be applied if needed. See Run Job Page.

When this feature is disabled:

  • By default, all users of the project use the Compute Engine service account specified for the project.
  • If the companion service account requirement was previously enabled and is then disabled, the default service account for the project is used.
  • For more information, see Google Service Account Management.
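The enabled and disabled behaviors can be sketched as a simple resolution order. The function resolve_service_account and the precedence of a runtime override over a user's companion account are assumptions made for illustration. This page does not state the exact precedence.

```python
from typing import Optional


def resolve_service_account(
    companion_required: bool,
    project_default: str,
    user_companion: Optional[str] = None,
    runtime_override: Optional[str] = None,
) -> str:
    """Pick the service account used to run a job (hypothetical logic)."""
    if not companion_required:
        # Feature disabled: every user runs with the project's default account.
        return project_default
    if runtime_override:
        # Assumed precedence: an override given at run time wins.
        return runtime_override
    if user_companion:
        return user_companion
    raise ValueError("a companion service account is required for this user")


print(resolve_service_account(False, "default-sa", user_companion="alice-sa"))  # default-sa
print(resolve_service_account(True, "default-sa", user_companion="alice-sa"))   # alice-sa
```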

SQL Scripts

When enabled, users may define SQL scripts to execute as part of a job's run. Scripts can be executed before data ingestion, after output publication, or both through any write-supported relational connection to which the user has access.

For more information, see Create Output SQL Scripts.

Schema validation feature

When enabled, by default the structure and ordering of columns in your import datasets are checked for changes before data is ingested for job execution. 

Tip: Schema validation can be overridden for individual jobs when the schema validation option is enabled in the job settings. See below.

Errors are immediately reported in the Job Details page. See Job Details Page.

For more information on schema validation, see Overview of Schema Management.
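As an illustration of what a check on the structure and ordering of columns might look like, here is a minimal sketch. The function schema_changes is hypothetical and compares column names only. The application's actual validation may be more detailed.

```python
from typing import List, Sequence


def schema_changes(expected: Sequence[str], actual: Sequence[str]) -> List[str]:
    """Describe differences between the expected and actual column lists."""
    changes: List[str] = []
    missing = [c for c in expected if c not in actual]
    added = [c for c in actual if c not in expected]
    if missing:
        changes.append(f"missing columns: {missing}")
    if added:
        changes.append(f"new columns: {added}")
    # Same set of columns, but in a different order.
    if not missing and not added and list(expected) != list(actual):
        changes.append("column order changed")
    return changes


print(schema_changes(["id", "name"], ["name", "id"]))  # ['column order changed']
```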

Schema validation option in job settings

When the schema validation feature and this setting are enabled, users can make choices on how individual jobs are managed when schema changes are detected. This setting is enabled by default.   

For more information, see Run Job Page.

For more information on schema validation, see Overview of Schema Management.

Schema validation option to fail job

When schema validation is enabled, this setting specifies the default behavior when schema changes are found. 

  • When enabled, jobs are failed when schema changes are found, and error messages are surfaced in the Trifacta application.
  • When disabled, jobs are permitted to continue. 

    • Jobs may ultimately fail due to schema changes. 
    • Jobs may result in bad data being written in outputs.
    • Job failures may be more challenging to debug.

      Tip: Setting this value to Disabled matches the behavior of the Trifacta application from before schema validation was possible.

Tip: This setting can be overridden for individual jobs, even if it is disabled. For more information, see Run Job Page.

Errors are immediately reported in the Job Details page. See Job Details Page.

For more information on schema validation, see Overview of Schema Management.
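The interplay between the project-level default and a per-job override can be sketched as below. The function should_fail_job and its parameters are hypothetical names used for illustration only.

```python
from typing import Optional


def should_fail_job(
    validation_enabled: bool,
    schema_changed: bool,
    project_fail_default: bool,
    job_override: Optional[bool] = None,
) -> bool:
    """Return True when the job should be failed because of schema changes.

    job_override is None when the individual job does not override the default.
    """
    if not validation_enabled or not schema_changed:
        return False
    if job_override is not None:
        return job_override
    return project_fail_default


print(should_fail_job(True, True, project_fail_default=False, job_override=True))  # True
```

When this returns False for a changed schema, the job continues. As noted above, it may still fail later or write bad data to its outputs.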

Skip write settings validation

When enabled, write settings objects are not validated as part of job execution. Write settings are used to define the outputs for file-based results. Default is enabled.

NOTE: When this feature is enabled, no validations are performed of any writesettings objects for scheduled and API-based jobs. Issues with these objects may cause failures during the transformation and publishing stages of job execution.

Tip: Before running a job via schedule or API that produces file-based outputs, you should do a test manual execution of the job to verify the outputs.

Trifacta Photon execution

Feature Availability: This feature may not be available in all product editions.

When enabled, users can choose to execute their jobs on Trifacta Photon, a proprietary running environment built for execution of small- to medium-sized jobs in memory on the Trifacta node.

NOTE: Jobs executed in Trifacta Photon are executed within the Trifacta VPC. Data is temporarily streamed to the Trifacta VPC during job execution and is not persisted.

NOTE: Jobs that are executed on Trifacta Photon may be limited to run for a maximum of 10 minutes, after which they fail with a timeout error. If your job fails due to this limit, please switch to running the job on Dataflow.

Tip: When enabled, you can select to run jobs on Trifacta Photon through the Run Job page. The default running environment is the one that is best for the size of your job.

When Trifacta Photon is disabled:

  • You cannot run jobs on the local running environment. All jobs must be executed on a clustered running environment.
  • Trifacta Photon is used for Quick Scan sampling jobs. If Trifacta Photon is disabled, the Trifacta application attempts to run the Quick Scan job on another available running environment. If that job fails or no suitable running environment is available, the Quick Scan sampling job fails.

For more information, see Run Job Page.

Data encryption

Use a customer-managed encryption key with Dataflow

Private Preview: This feature is disabled by default. For more information on enabling this feature in your project, please contact Alteryx Support.

Feature Availability: This feature may not be available in all product editions.

Optionally, you can specify a customer-managed encryption key (CMEK) from your Google Cloud Platform project for use by the Trifacta application. Specify the resource name in this field. This key is used when running jobs on Dataflow.

For more information on how to use CMEKs, see Overview of CMEK.

Validate that a customer-managed encryption key is used

Private Preview: This feature is disabled by default. For more information on enabling this feature in your project, please contact Alteryx Support.

Feature Availability: This feature may not be available in all product editions.

When this feature is enabled, the Trifacta application checks Cloud Storage and BigQuery to verify that a customer-managed encryption key is in use by the datastore before writing any results to it. If a CMEK is not present, the Dataflow and publishing jobs fail.

NOTE: If CMEK validation checks are enabled, pushdown of job execution to BigQuery is disabled.

Tip: This feature checks only for the presence of a CMEK. It does not check to see if the CMEK specified above is in use.

For more information on how to use CMEKs, see Overview of CMEK.

Scheduling and parameterization

Include Hidden Files in Parameterization

Feature Availability: This feature may not be available in all product editions.

When enabled, hidden files and hidden directories can be searched for matches for wildcard- or pattern-based parameters when importing datasets. 

Tip: This can be useful for importing data from generated profiles, which are stored in the .profiler folder in a job output directory.

NOTE: Scanning hidden folders may impact performance. For existing imported datasets with parameters, you should enable the inclusion of hidden folders on individual datasets and run a test job to evaluate impact.


For more information, see Parameterize Files for Import.
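The effect of including hidden files in wildcard matching can be sketched with Python's fnmatch module. The helper names, the sample paths, and the rule that a path is hidden when any component starts with a dot are assumptions for illustration. Note that the * wildcard in fnmatch also matches path separators, which may differ from the application's own pattern syntax.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List


def is_hidden(path: str) -> bool:
    """Treat a path as hidden if any file or directory name starts with a dot."""
    return any(part.startswith(".") for part in PurePosixPath(path).parts)


def match_paths(paths: Iterable[str], pattern: str, include_hidden: bool = False) -> List[str]:
    """Return the paths that match the wildcard pattern, optionally including hidden ones."""
    return [
        p for p in paths
        if fnmatchcase(p, pattern) and (include_hidden or not is_hidden(p))
    ]


paths = [
    "out/job1/data.csv",
    "out/job1/.profiler/profile.json",
    "out/job1/.tmp.csv",
]
print(match_paths(paths, "out/*.json"))                       # []
print(match_paths(paths, "out/*.json", include_hidden=True))  # ['out/job1/.profiler/profile.json']
```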

Scheduling feature

Feature Availability: This feature may not be available in all product editions.

When enabled, project users can schedule the execution of flows. See Add Schedule Dialog.

Publishing

Notifications

Email notification feature

When enabled, Dataprep by Trifacta can send email notifications to users based on the success or failure of jobs. By default, this feature is Enabled.

Email notification trigger when flow jobs fail

When email notifications are enabled, you can configure the default setting for the types of failed jobs that generate an email to interested stakeholders. The value set here is the default value for each flow in the workspace.

Settings:

Setting | Description
Default (any jobs) | By default, email notifications are sent on failure of any job.
Never send | Email notifications are never sent for job failures.
Scheduled jobs | Notifications are sent only when scheduled jobs fail.
Manual jobs | Notifications are sent only when ad-hoc (manually executed) jobs fail. Tip: Jobs executed via API are Manual jobs.
Any | Notifications are sent for all job failures.

Individual users can opt out of receiving notifications or configure a different email address. See Email Notifications Page.

Emailed stakeholders are configured by individual flow. For more information, see Manage Flow Notifications Dialog.
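The trigger settings listed above can be read as a simple decision, sketched here. The function should_notify_on_failure and the job type labels "manual", "scheduled", and "api" are hypothetical. Only the setting names come from this page.

```python
def should_notify_on_failure(setting: str, job_type: str) -> bool:
    """Decide whether a failed job generates an email, per the trigger setting."""
    if job_type == "api":
        # Documented tip: jobs executed via API are treated as Manual jobs.
        job_type = "manual"
    if job_type not in ("manual", "scheduled"):
        raise ValueError(f"unknown job type: {job_type}")
    if setting in ("Default (any jobs)", "Any"):
        return True
    if setting == "Never send":
        return False
    if setting == "Scheduled jobs":
        return job_type == "scheduled"
    if setting == "Manual jobs":
        return job_type == "manual"
    raise ValueError(f"unknown setting: {setting}")


print(should_notify_on_failure("Manual jobs", "api"))        # True
print(should_notify_on_failure("Scheduled jobs", "manual"))  # False
```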

Email notification trigger when flow jobs succeed

When email notifications are enabled, you can configure the default setting for the types of successful jobs that generate an email to interested stakeholders. The value set here is the default value for each flow in the workspace.

For more information on the settings, see the previous section. Default setting is Default (any jobs).

Individual users can opt out of receiving notifications or configure a different email address. See Email Notifications Page.

Emailed stakeholders are configured by individual flow. For more information, see Manage Flow Notifications Dialog.

Email notification trigger when plans run

You can configure the default trigger for email notifications when a plan runs. Default setting is Default (all runs).

Setting | Description
Default (all runs) | By default, email notifications are sent to users for all plan runs.
All runs | Emails are sent for all runs.
Failed runs | Emails are sent for failed runs only.
Success runs | Emails are sent for successful runs only.

Sharing email notifications

When email notifications are enabled, users automatically receive notifications whenever an owner shares the plan or flow with the user.

Individual users can opt out of receiving notifications. For more information, see Preferences Page.

Experimental features

These experimental features are not supported. 

Experimental features are in active development. Their functionality may change from release to release, and they may be removed from the product at any time. Do not use experimental features in a production environment.

These settings may or may not change application behavior.

Cache data in the Transformer intelligently

This feature is currently disabled due to technical issues. When this message is removed, the feature is available in the product again.
For updates on system status, please visit https://status.trifacta.com/.

NOTE: This feature is in Beta release.

When enabled, this feature allows the Trifacta application to cache data from the Transformer page periodically based on Trifacta Photon execution time. This feature enables users to move faster between recipe steps.

Default language

Select the default language to use in the Trifacta application.

Execution time threshold (in milliseconds) to control caching in the Transformer

This feature is currently disabled due to technical issues. When this message is removed, the feature is available in the product again.
For updates on system status, please visit https://status.trifacta.com/.

NOTE: This feature is in Beta release.

When intelligent caching in the Transformer is enabled, you can set the threshold time in milliseconds for when Trifacta Photon updates the cache. At each threshold of execution time in Trifacta Photon, the outputs of the intermediate recipe (CDF) steps are cached in memory, which speeds up movement between recipe steps in the Trifacta application.

Language localization

When enabled, the Trifacta application is permitted to display text in the selected language. 

Show user language preference

When enabled, users are permitted to select a preferred language in their preferences. See Preferences Page.
