Hadoop Spark Running Environment

When Designer Cloud Powered by Trifacta Enterprise Edition is installed on a supported version of Cloudera, the Trifacta Application can be configured to execute larger jobs on the cluster instance of Spark. Spark leverages in-memory capabilities on individual nodes for faster processing of distributed analytics tasks, with spillover to disk as needed.
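Spark's balance between in-memory processing and disk spillover is governed by standard Spark memory properties. The following is a minimal PySpark sketch, assuming direct access to a Spark session outside the Trifacta Application; the property values are illustrative only, and in a Trifacta deployment these settings are managed as described in Configure for Spark.

    from pyspark.sql import SparkSession

    # Illustrative values only; not Trifacta-specific settings.
    # spark.memory.fraction controls how much of the executor heap is
    # available for execution and cached data before Spark spills to disk.
    spark = (
        SparkSession.builder
        .appName("memory-tuning-sketch")
        .config("spark.executor.memory", "8g")
        .config("spark.memory.fraction", "0.6")
        .config("spark.memory.storageFraction", "0.5")
        .getOrCreate()
    )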

Tip

When the Trifacta Application has been integrated with this running environment, you can select Spark in the Run Job page to run your job on it.

Spark requires a backend distributed storage layer:

  • On AWS-based deployments, this storage layer is S3.

  • On Hadoop-based deployments, this storage layer is HDFS.

Additional configuration is required.
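
To illustrate how the storage layer surfaces in Spark itself, the sketch below reads and writes data through the corresponding filesystem scheme (hdfs:// for HDFS, s3a:// for S3). This is a minimal PySpark example with hypothetical paths and hostnames; jobs launched from the Trifacta Application resolve these locations automatically.

    # Minimal sketch, assuming a SparkSession named `spark` already exists.
    # Paths, bucket names, and the namenode host are hypothetical.
    df = spark.read.parquet("hdfs://namenode:8020/data/input")    # Hadoop-based: HDFS
    # df = spark.read.parquet("s3a://example-bucket/data/input")  # AWS-based: S3
    df.write.mode("overwrite").parquet("hdfs://namenode:8020/data/output")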

Note

When executing a job on the Spark running environment using a relational source, the job fails if one or more columns have been dropped from the underlying source table. As a workaround, the recipe panel may show steps referencing the missing columns, which you can use to either fix the recipe or the source data.

Note

The Spark running environment does not support use of multi-character delimiters for CSV outputs. You can switch your job to a different running environment or use single-character delimiters. This issue is fixed in Spark 3.0 and later. For more information on this issue, see https://issues.apache.org/jira/browse/SPARK-24540.
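
For reference, the limitation can be reproduced directly with Spark's CSV writer. The following is a minimal PySpark sketch, assuming a DataFrame named df and a hypothetical output path; on Spark versions before 3.0, only a single-character value is accepted for the sep option.

    # A single-character delimiter works on all supported Spark versions.
    (df.write
       .option("sep", "|")
       .mode("overwrite")
       .csv("hdfs://namenode:8020/data/output_csv"))

    # A multi-character delimiter such as "||" is rejected before Spark 3.0
    # (see SPARK-24540); use a single character or a different running environment.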

For more information, see Configure for Spark.