When Designer Cloud powered by Trifacta Enterprise Edition is installed on a supported version of Cloudera, the Designer Cloud application can be configured to execute larger jobs on the cluster instance of Spark. Spark leverages in-memory capabilities on individual nodes for faster processing of distributed analytics tasks, with spillover to disk as needed.
Tip: When the Designer Cloud application has been integrated with Spark, select Spark in the Run Job page to run the job on this running environment.
Spark requires a backend distributed storage layer:
- On AWS-based deployments, this storage layer is S3.
- On Hadoop-based deployments, this storage layer is HDFS.
Additional configuration is required.
NOTE: When executing a job on the Spark running environment using a relational source, the job fails if one or more columns have been dropped from the underlying source table. As a workaround, the recipe panel may show steps referencing the missing columns, which you can use to fix either the recipe or the source data.
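One way to catch this condition before running a job is to compare the columns that a recipe references against the columns currently present in the source table. The following is a minimal sketch and is not part of the Designer Cloud application: the function name `find_dropped_columns` and both column lists are hypothetical, and you would need to gather the column names by your own means. It assumes column names are compared case-sensitively.

```python
def find_dropped_columns(recipe_columns, source_columns):
    """Return recipe-referenced columns that are missing from the source table.

    Hypothetical helper: both arguments are plain iterables of column names.
    Comparison is case-sensitive (an assumption, adjust for your database).
    """
    current = set(source_columns)
    missing = []
    for name in recipe_columns:
        # Keep the recipe's order and report each missing column once.
        if name not in current and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Illustrative data only; these column names are made up.
    recipe = ["order_id", "customer", "discount"]
    source = ["order_id", "customer", "total"]
    print(find_dropped_columns(recipe, source))
```

An empty result suggests that no referenced column has been dropped. Any names returned are candidates for fixing in the recipe or restoring in the source table.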
NOTE: The Spark running environment does not support use of multi-character delimiters for CSV outputs. You can switch your job to a different running environment or use single-character delimiters. This issue is fixed in Spark 3.0 and later. For more information on this issue, see https://issues.apache.org/jira/browse/SPARK-24540.
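The delimiter rule in the preceding note can be expressed as a small pre-flight check. This is a sketch under stated assumptions, not an API of the Designer Cloud application or of Spark: `csv_delimiter_supported` is a hypothetical name, and the Spark version is assumed to be available to you as a string such as "2.4.8".

```python
def csv_delimiter_supported(delimiter, spark_version):
    """Return True if the CSV delimiter is expected to work on this Spark version.

    Hypothetical helper that encodes only the rule stated in the note:
    single-character delimiters are accepted, and multi-character
    delimiters require Spark 3.0 or later.
    """
    # Assumption: the version string starts with the major version, e.g. "3.1.2".
    major = int(spark_version.split(".")[0])
    if len(delimiter) == 1:
        return True
    # An empty delimiter is rejected; longer ones need Spark 3.0 or later.
    return len(delimiter) > 1 and major >= 3


if __name__ == "__main__":
    for sep, version in [(",", "2.4.8"), ("||", "2.4.8"), ("||", "3.0.0")]:
        print(repr(sep), version, csv_delimiter_supported(sep, version))
```

If the check returns False, the options described in the note apply: switch the job to a different running environment or use a single-character delimiter.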
For more information, see Configure for Spark.