The platform can be configured to integrate with a variety of environments for processing transformation jobs. When you run a job through the application, you can select the running environment on which to execute it.
Tip: In general, you should accept the default environment that is presented for job execution. The application attempts to match the scope of your job to the most appropriate running environment.
This section applies to execution of transform jobs. For more information on options for profiling jobs, see Profiling Options.
The Photon running environment is integrated with the application. When enabled, you can select it in the Run Job page.
NOTE: This running environment is enabled by default.
Suitable for small to medium jobs.
Required Installation: None.
Required Configuration: See Configure Photon Running Environment.
Supported Output Formats: CSV, JSON, Avro, Parquet
Notes and Limitations:
NOTE: If the Photon running environment is disabled, please set webapp.client.maxExecutionBytes.photon to 0, which forces all jobs to run on the cluster.
NOTE: When a recipe containing a user-defined function is applied to text data, any null characters cause records to be truncated by the running environment during job execution.
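To illustrate, here is a minimal sketch of the settings that govern this environment, using the parameter names from the configuration listing later in this section; the pairing of these two flags is an inference from that listing, not a confirmed requirement:

"webapp.runInTrifactaServer": true,
"photon.enabled": true,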
The Spark running environment is now the default. It deploys Spark libraries from the platform to the nodes of the Hadoop cluster. Spark uses in-memory processing for jobs, which reduces read/write operations against each node's disk storage and thereby shortens job execution time.
Suitable for jobs of all sizes.
Required Installation: None.
Required Configuration: See Configure Spark Running Environment.
Supported Output Formats: CSV, JSON, Avro, Parquet
Notes and Limitations:
NOTE: When executing a job on the Spark running environment using a relational source, the job fails if one or more columns have been dropped from the underlying source table. As a workaround, you can review the recipe panel for steps that reference the missing columns and then fix either the recipe or the source data.
NOTE: The Spark running environment does not support use of multi-character delimiters for CSV outputs. You can switch your job to a different running environment or use single-character delimiters. For more information on this issue, see https://issues.apache.org/jira/browse/SPARK-24540.
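For illustration, a minimal sketch of enabling this environment, using the parameter name from the configuration listing later in this section (whether other flags must also change is not covered here):

"webapp.runInHadoop": true,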
NOTE: Although you can enable it, this environment is no longer supported. You should enable the Photon running environment in your deployment.
This legacy JavaScript-based running environment is no longer enabled by default. When enabled, you can select it in the Run Job page.
Suitable for small jobs.
Required Installation: None.
Required Configuration: For more information on re-enabling this running environment, see Configure Photon Running Environment.
Supported Output Formats: CSV, JSON, Avro
Notes and Limitations:
NOTE: Parquet format cannot be generated in a job executed in this running environment.
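If you must re-enable this legacy environment, the following is a sketch only, based on the inference (from the parameter listing below) that the JavaScript environment runs in place of Photon when Photon is disabled:

"webapp.runInTrifactaServer": true,
"photon.enabled": false,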
If you have deployed the platform to integrate with an Amazon EMR cluster, you can run Spark-based jobs on the cluster. This running environment is similar to the Hadoop-based Spark running environment.
Required Installation: None.
Required Configuration: See Configure for EMR.
Supported Output Formats: CSV, JSON, Avro, Parquet
Notes and Limitations:
NOTE: Job cancellation is not supported on an EMR cluster.
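As a sketch, the flag that appears to govern this environment in the configuration listing below is shown here set to true; any additional required settings are described in Configure for EMR:

"webapp.runinEMR": true,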
The following parameters define the available running environments:
"webapp.runInTrifactaServer": true, "webapp.runInHadoop": true, "webapp.runinEMR": false, "webapp.runInDataflow": false, "photon.enabled": true, |
For more information on configuring the running environment for EMR, see Configure for EMR.
The following table summarizes the configuration settings required to enable each running environment.
In the Run Job page, you can then select the running environment on which to run your job.
NOTE: To disable execution in a given running environment, set its corresponding configuration parameter to false.
Type | Running Environment | Configuration Parameters | Notes
---|---|---|---
Hadoop Backend | Spark | "webapp.runInHadoop": true | The Spark running environment is the default configuration.
Client Front-end and non-Hadoop Backend | Photon | "webapp.runInTrifactaServer": true, "photon.enabled": true | Photon is the default running environment for the front-end of the application. It is enabled by default.
Client Front-end and non-Hadoop Backend | JavaScript | "webapp.runInTrifactaServer": true, "photon.enabled": false | The JavaScript running environment should not be enabled unless necessary.
NOTE: Do not modify the webapp.runInDataflow setting.
When you specify a job, the default running environment is pre-configured for you, based on the following parameter:
NOTE: If your environment has no cluster-based running environment, such as Hadoop, for running large-scale jobs, this parameter is not used. All jobs are run on the Photon running environment.
"webapp.client.maxExecutionBytes.photon": 1000000000, |
The default environment presented to you is based on the size of the primary datasource. For the above setting of 1 GB:
Running Environment | Default Condition |
---|---
Photon running environment | Size of primary datasource is less than or equal to the above value in bytes.
Cluster-based running environment | Size of primary datasource is greater than the above value in bytes. |
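For example, with the setting of 1,000,000,000 bytes (1 GB) shown above:
- A job whose primary datasource is 250,000,000 bytes (250 MB) defaults to the Photon running environment.
- A job whose primary datasource is 4,000,000,000 bytes (4 GB) defaults to the cluster-based running environment.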
NOTE: This setting defines only the environment that is recommended to you as a predefined selection. If a second running environment is available, you can select it instead, although choosing an environment other than the default is not recommended. See Run Job Page.
Setting this value too high forces more jobs onto the Photon running environment, which is suitable only for small to medium jobs.
Tip: To force the default selection to always be the Hadoop or another cluster-based running environment, set this value to 0.
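As a sketch, forcing all jobs to default to the cluster-based environment would then look like the following, using the same parameter as above:

"webapp.client.maxExecutionBytes.photon": 0,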