Comment: Published by Scroll Versions from space DEV and version r0642


The platform can be configured to integrate with a variety of environments for processing transformation jobs. When you run a job through the application, you can select the running environment in which to run it.

Tip: In general, you should accept the default environment that is presented for job execution. The application attempts to match the scope of your job to the most appropriate running environment.

This section applies to execution of transform jobs. For more information on options for profiling jobs, see Profiling Options.

Available Environments

Photon Running Environment

This running environment is available through the platform node. When enabled, select Photon.

NOTE: This running environment is enabled by default.

Suitable for small to medium jobs.

Required Installation: None.

Required Configuration: See Configure Photon Running Environment.

Supported Output Formats: CSV, JSON, Avro, Parquet

Notes and Limitations:

NOTE: When a recipe containing a user-defined function is applied to text data, any null characters cause records to be truncated during Photon job execution. In these cases, execute the job on Spark.
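One way to act on this limitation is to scan a sample of the text data for null characters before choosing an environment. The following is a minimal sketch and not part of the product; the helper name `environment_for_udf_job` and the in-memory list of sample values are assumptions made for illustration.

```python
def environment_for_udf_job(sample_values):
    """Suggest a running environment for a recipe that uses a user-defined function.

    Hypothetical helper: returns "Spark" if any sampled text value contains
    a null character (U+0000), otherwise "Photon".
    """
    for value in sample_values:
        if "\x00" in value:
            # A null character would truncate the record on Photon.
            return "Spark"
    return "Photon"


print(environment_for_udf_job(["alpha", "beta"]))      # Photon
print(environment_for_udf_job(["alpha", "be\x00ta"]))  # Spark
```

A sample can miss null characters that occur later in the data, so a check like this only reduces the risk and does not remove it.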

Spark Running Environment

This running environment is the new default running environment. The Spark running environment deploys Spark libraries from the platform node to the nodes of the integrated cluster. Spark uses in-memory processing for jobs, which reduces read/write operations on each node's disk storage and thereby shortens job execution time.

Suitable for jobs of all sizes.

Required Installation: None.

Required Configuration: See Configure Spark Running Environment.

Supported Output Formats: CSV, JSON, Avro, Parquet

Notes and Limitations:

NOTE: When executing a job on the Spark running environment using a relational source, the job fails if one or more columns have been dropped from the underlying source table. As a workaround, the recipe panel may show steps referencing the missing columns, which can be used to fix either the recipe or the source data.
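The failure described in this note can be anticipated by comparing the columns that a recipe references against the columns currently present in the source table. The sketch below illustrates that comparison only; `missing_columns` is a hypothetical helper, and how the two column lists are obtained is outside its scope.

```python
def missing_columns(recipe_columns, source_columns):
    """Return recipe-referenced columns that no longer exist in the source table.

    Hypothetical helper: both arguments are plain lists of column names.
    Order of the result follows the order of recipe_columns.
    """
    available = set(source_columns)
    return [name for name in recipe_columns if name not in available]


dropped = missing_columns(
    recipe_columns=["id", "region", "amount"],
    source_columns=["id", "amount"],
)
print(dropped)  # ['region']
```

An empty result means that every referenced column is still present, so this particular failure should not occur.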

NOTE: The Spark running environment does not support use of multi-character delimiters for CSV outputs. You can switch your job to a different running environment or use single-character delimiters. For more information on this issue, see https://issues.apache.org/jira/browse/SPARK-24540.
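As a rough illustration of the delimiter restriction, a pre-flight check can verify that a CSV delimiter is a single character before a job is sent to Spark. The function name below is an assumption made for this sketch and does not correspond to any product API.

```python
def csv_delimiter_supported_on_spark(delimiter):
    """Return True if the CSV output delimiter is exactly one character.

    Hypothetical check mirroring the note above: multi-character
    delimiters are not supported for CSV outputs on Spark.
    """
    return len(delimiter) == 1


for candidate in [",", "\t", "||"]:
    if csv_delimiter_supported_on_spark(candidate):
        print(repr(candidate), "can be used on Spark")
    else:
        print(repr(candidate), "requires another running environment or a single-character delimiter")
```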


EMR Running Environment

If you have deployed the platform to integrate with an Amazon EMR cluster, you can run Spark-based jobs on the cluster. This environment is similar to the Spark running environment.

Required Installation: None.

Required Configuration: See Configure for EMR.

Supported Output Formats: CSV, JSON, Avro, Parquet

Notes and Limitations:

NOTE: Job cancellation is not supported on an EMR cluster.

Azure Databricks Running Environment

The Azure Databricks running environment is an Apache Spark implementation that has been tuned specifically for deployment on Microsoft Azure. 

This running environment deploys Spark libraries from the platform node to the nodes of the Azure Databricks cluster. Spark uses in-memory processing for jobs, which reduces read/write operations on each node's disk storage and thereby shortens job execution time.

Suitable for jobs of all sizes.

Required Installation: None.

Required Configuration: See below.

Supported Output Formats: CSV, JSON, Avro, Parquet

Notes and Limitations:

NOTE: Use of Azure Databricks is not supported on Marketplace installs.

NOTE: When executing a job on the Azure Databricks running environment using a relational source, the job fails if one or more columns have been dropped from the underlying source table. As a workaround, the recipe panel may show steps referencing the missing columns, which can be used to fix either the recipe or the source data.

Configure

Available Running Environments

The following parameters define the available running environments:

Code Block
"webapp.runInTrifactaServer": true,
"webapp.runInHadoop": true,
"webapp.runinEMR": false,
"webapp.runInDataflow": false,
"photon.enabled": true,

For more information on configuring the running environment for EMR, see Configure for EMR.

The configuration settings required to enable each running environment are described below.

  • The Spark running environment requires a Hadoop cluster as the backend job execution environment.
    • In the Run Job page, select Spark.
  • The Photon running environment executes on the platform node and provides processing to the front-end client and at execution time.

| Type | Running Environment | Configuration Parameters | Notes |
|---|---|---|---|
| Hadoop Backend | Spark | webapp.runInHadoop = true | The Spark running environment is the default configuration. |
| Client Front-end and non-Hadoop Backend | Photon | webapp.runInTrifactaServer = true; photon.enabled = true | Photon is the default running environment for the front-end of the application. It is enabled by default. |

NOTE: Do not modify the runInDataflow setting.
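To make the relationship between the flags and the running environments concrete, the sketch below derives the list of enabled environments from a settings dictionary. It is illustrative only: the flag names and values are copied from the configuration listing above, the function `available_environments` is hypothetical, and the mapping of `webapp.runinEMR` to the EMR environment is an assumption inferred from the parameter name.

```python
# Flag values copied from the configuration listing in this section.
settings = {
    "webapp.runInTrifactaServer": True,
    "webapp.runInHadoop": True,
    "webapp.runinEMR": False,
    "webapp.runInDataflow": False,  # left untouched, per the note above
    "photon.enabled": True,
}


def available_environments(cfg):
    """Return the running environments enabled by the given flags (hypothetical helper)."""
    environments = []
    # Photon requires both flags to be true.
    if cfg.get("webapp.runInTrifactaServer") and cfg.get("photon.enabled"):
        environments.append("Photon")
    # Spark runs on the Hadoop backend.
    if cfg.get("webapp.runInHadoop"):
        environments.append("Spark")
    # Assumption: this flag enables the EMR running environment.
    if cfg.get("webapp.runinEMR"):
        environments.append("EMR")
    return environments


print(available_environments(settings))  # ['Photon', 'Spark']
```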

Configure Default Running Environment

When you specify a job, the default running environment is pre-configured for you, based on the following parameter:

NOTE: If your environment has no running environment such as Spark for running large-scale jobs, this parameter is not used. All jobs are run on the platform node.

Code Block
"webapp.client.maxExecutionBytes.photon": 1000000000,

The default environment presented to you is based on the size of the primary datasource. For the above setting of 1 GB:

| Running Environment | Default Condition |
|---|---|
| Photon | Size of primary datasource is less than or equal to the above value in bytes. |
| Spark | Size of primary datasource is greater than the above value in bytes. |

NOTE: This setting defines only the environment that is presented to you as the predefined selection. If a second running environment is available, you can select it, although choosing an environment other than the default is not recommended. See Run Job Page.

Warning: Setting this value too high forces more jobs onto the Photon running environment, which may cause slow performance and can potentially overwhelm the server.

Tip: To force the default setting to always be a Hadoop or bulk running environment, set this value to 0. The bulk option is then recommended to all users instead of the Photon running environment. However, smaller jobs may take longer than expected to execute.
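The selection rule described in this section can be summarized in a few lines of code. This is a minimal sketch of the documented behavior, not the application's implementation; the function name, its parameters, and the `spark_available` switch are assumptions introduced for illustration.

```python
def recommended_environment(datasource_bytes,
                            max_photon_bytes=1_000_000_000,
                            spark_available=True):
    """Return the running environment pre-selected for a job.

    max_photon_bytes stands in for "webapp.client.maxExecutionBytes.photon".
    """
    if not spark_available:
        # With no large-scale environment, the threshold is not used.
        return "Photon"
    if datasource_bytes <= max_photon_bytes:
        return "Photon"
    return "Spark"


print(recommended_environment(250_000_000))                           # Photon
print(recommended_environment(1_000_000_000))                         # Photon
print(recommended_environment(1_000_000_001))                         # Spark
print(recommended_environment(1, max_photon_bytes=0))                 # Spark
print(recommended_environment(5_000_000_000, spark_available=False))  # Photon
```

With the threshold set to 0, any non-empty datasource exceeds it, which matches the tip about forcing the bulk environment as the default.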