The Designer Cloud Powered by Trifacta® platform can be configured to use Spark to execute jobs, to generate a visual profile of transformation job results, or both.
- A visual profile is a visual summary of a dataset. It visually identifies areas of interest, including valid, missing, or mismatched values, as well as useful statistics on column data.
- Visual profiles can be optionally generated using Spark.
- In the Designer Cloud application, visual profiles appear in the Job Results page when a job has successfully executed and a profile has been requested for it. See Job Results Page.
- Apache Spark provides in-memory processing capabilities for a Hadoop cluster. In Spark, the large volume of computations required to generate this information is performed in memory, which reduces disk access and significantly improves overall performance. For more information, see https://spark.apache.org/.
The Spark Job Service is a Scala-based capability for executing jobs and profiling your job results as an extension of job execution. This feature leverages the computing power of your existing Hadoop cluster to increase job execution and profiling performance. Features:
- Requires no additional installation on the Alteryx node.
- Support for yarn-cluster mode ensures that all Spark processing is handled on the Hadoop cluster.
- Exact bin counts appear for profile results, except for Top-N counts.
Pre-requisites
NOTE: Spark History Server is not supported. It should be used only for short-term debugging tasks, as it requires considerable resources.
Before you begin, please verify the following:
- For additional pre-requisites for a kerberized environment, see Set up for a Kerberos-enabled Hadoop cluster.
- Additional configuration is required for secure impersonation. See Configure for secure impersonation.
Configure the Designer Cloud Powered by Trifacta platform
Configure Spark Job Service
The Spark Job Service must be enabled for both execution and profiling jobs to work in Spark.
NOTE: Beginning in Release 4.0, the Spark Job Service and running environment are enabled by default. If you are upgrading from an earlier release, you may be required to enable the service through the following configuration changes.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Below is a sample configuration and a description of each property:
"spark-job-service" : {
  "systemProperties" : {
    "java.net.preferIPv4Stack": "true",
    "SPARK_YARN_MODE": "true"
  },
  "sparkImpersonationOn": false,
  "optimizeLocalization": true,
  "mainClass": "com.trifacta.jobserver.SparkJobServer",
  "jvmOptions": [
    "-Xmx128m"
  ],
  "hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-5.4/build/libs",
  "env": {
    "SPARK_JOB_SERVICE_PORT": "4007",
    "SPARK_DIST_CLASSPATH": "",
    "MAPR_TICKETFILE_LOCATION": "<MAPR_TICKETFILE_LOCATION>",
    "MAPR_IMPERSONATION_ENABLED": "0",
    "HADOOP_USER_NAME": "trifacta",
    "HADOOP_CONF_DIR": "%(topOfTree)s/conf/hadoop-site/"
  },
  "enabled": true,
  "enableHiveSupport": true,
  "enableHistoryServer": false,
  "classpath": "%(topOfTree)s/services/spark-job-server/server/build/libs/spark-job-server-bundle.jar:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:%(topOfTree)s/%(hadoopBundleJar)s",
  "autoRestart": false,
},
The following properties can be modified based on your needs.
NOTE: Unless explicitly told to do so, do not modify any of the above properties that are not listed below.
Property | Description |
---|---|
sparkImpersonationOn | Set this value to true if secure impersonation is enabled on your cluster. See Configure for secure impersonation. |
jvmOptions | This array of values can be used to pass parameters to the JVM that manages the Spark Job Service. |
hiveDependenciesLocation | If Spark is integrated with a Hive instance, set this value to the path where the Hive dependencies are installed on the Alteryx node. For more information, see Configure for Hive. |
env.SPARK_JOB_SERVICE_PORT | Set this value to the listening port number on the cluster for Spark. Default value is 4007. For more information, see System Ports. |
env.HADOOP_USER_NAME | The username of the Hadoop principal used by the platform. By default, this value is trifacta. |
env.HADOOP_CONF_DIR | The directory on the Alteryx node where the Hadoop cluster configuration files are stored. Do not modify unless necessary. |
enabled | Set this value to true to enable the Spark Job Service. |
enableHiveSupport | See "Configure service for Hive" below. |
After making any changes, save the file and restart the platform. See Start and Stop the Platform.
Configure service for Hive
Depending on the environment, apply the following configuration changes to manage Spark interactions with Hive:
Environment | spark.enableHiveSupport |
---|---|
Hive is not present | false |
Hive is present but not enabled | false |
Hive is present and enabled | true |
If Hive is present on the cluster, whether enabled or disabled, the hive-site.xml file must be copied to the correct directory:
cp /etc/hive/conf/hive-site.xml /opt/trifacta/conf/hadoop-site/hive-site.xml
At this point, the platform only expects that a hive-site.xml file has been installed on the Alteryx node. A valid connection is not required. For more information, see Configure for Hive.
Configure Spark
After the Spark Job Service has been enabled, complete the following sections to configure Spark for the Designer Cloud Powered by Trifacta platform.
Yarn cluster mode
All jobs submitted to the Spark Job Service are executed in YARN cluster mode. No other cluster mode is supported for the Spark Job Service.
Configure access for secure impersonation
The Spark Job Service can run under secure impersonation. For more information, see Configure for secure impersonation.
When running under secure impersonation, the Spark Job Service requires access to the following folders. Read, write, and execute access must be provided to the Alteryx user and the impersonated user.
Folder Name | Default Value | Description |
---|---|---|
"hdfs.pathsConfig.libraries" | /trifacta/libraries | Maintains JAR files and other libraries required by Spark. No sensitive information is written to this location. |
"hdfs.pathsConfig.tempFiles" | /trifacta/tempfiles | Holds temporary progress information files for YARN applications. Each file contains a number indicating the progress percentage. No sensitive information is written to this location. |
"hdfs.pathsConfig.dictionaries" | /trifacta/dictionaries | Contains definitions of dictionaries created for the platform. |
Identify Hadoop libraries on the cluster
The Spark Job Service does not require additional installation on the Alteryx node or on the Hadoop cluster. Instead, it references the spark-assembly JAR that is provided with the Alteryx distribution. This JAR file does not include the Hadoop client libraries, so you must point the Designer Cloud Powered by Trifacta platform to the appropriate libraries.
Steps:
- In the spark-job-service configuration block, set the following property:
"spark-job-service.env.HADOOP_CONF_DIR": "<path_to_Hadoop_conf_dir_on_Hadoop_cluster>",
Property | Description |
---|---|
spark-job-service.env.HADOOP_CONF_DIR | Path to the Hadoop configuration directory on the Hadoop cluster. |
NOTE: For Hortonworks 2.x, the SPARK_DIST_CLASSPATH property must be set depending on your Hadoop distribution. This property configuration is covered later in this section.
Locate Hive dependencies location
If the Designer Cloud Powered by Trifacta platform is also connected to a Hive instance, verify the location of the Hive dependencies on the Alteryx node. The following example is from Cloudera 5.10:
NOTE: This parameter value is distribution-specific. Please update it based on your Hadoop distribution.
"spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-5.10/build/libs",
Specify YARN queue for Spark jobs
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Designer Cloud Powered by Trifacta platform are submitted to this queue.
Steps:
- In platform configuration, locate the following:
"spark.props.spark.yarn.queue": "default",
- Replace default with the name of the queue.
- Save your changes.
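For example, if your Spark jobs should be submitted to a hypothetical YARN queue named spark_wrangling, the entry would read:
"spark.props.spark.yarn.queue": "spark_wrangling",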
Spark tuning properties
The following properties in platform configuration can be modified for Spark.
"spark": { ... "props": { "spark.driver.maxResultSize": "0", "spark.executor.memory": "6GB", "spark.executor.cores": "2", "spark.driver.memory": "2GB" ... } },
Setting Spark properties: You can pass additional properties to Spark such as number of cores, executors, and more.
NOTE: The above values are default values. If you are experiencing performance issues, you can modify the values. If you require further assistance, please contact Alteryx Support.
If you have sufficient cluster resources, you should pass the following values:
"spark": { ... "props": { "spark.driver.maxResultSize": "0", "spark.executor.memory": "16GB", "spark.executor.cores": "5", "spark.driver.memory": "16GB" ... } },
Notes:
The above values must be below the per-container thresholds set by YARN. Verify your settings against the following parameters in yarn-site.xml (a sketch follows these notes):
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.resource.cpu-vcores
- If you are using YARN queues, please verify that these values are set below max queue thresholds.
- For more information on these properties, see https://spark.apache.org/docs/2.2.0/configuration.html.
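The following yarn-site.xml sketch shows illustrative per-container limits that would accommodate the 16 GB / 5-core executor settings above; the actual limits on your cluster are set by your Hadoop administrator and may differ. Keep in mind that each YARN container must also hold Spark's memory overhead (in Spark 2.x, roughly 10% of executor memory, with a 384 MB floor), so the limits need headroom beyond spark.executor.memory itself.
<!-- Illustrative values only; confirm against your cluster's actual yarn-site.xml -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>20480</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>
</property>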
Save your changes.
Configure Batch Job Runner for Spark service
You can modify the following Batch Job Runner configuration settings for the Spark service.
NOTE: Avoid modifying these settings unless you are experiencing issues with the user interface reporting jobs as having failed while the Spark job continues to execute on YARN.
Setting | Description | Default |
---|---|---|
batchserver.spark.requestTimeoutMillis | Specifies the number of milliseconds that the Batch Job Runner service should wait for a response from Spark. If this timeout is exceeded, the UI changes the job status to failed. The YARN job may continue. | 20000 (20 seconds) |
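If long-running Spark submissions are being reported as failed in the UI while they continue on YARN, the timeout can be raised. The following sketch uses the flat property notation shown elsewhere on this page; the value of 60000 (60 seconds) is illustrative, not a recommendation:
"batchserver.spark.requestTimeoutMillis": 60000,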
Configure Spark for Hortonworks
If you are using Hortonworks, additional configuration is required to enable integration with Spark.
NOTE: Before you begin, you must acquire the specific version number of your Hortonworks distribution. This version number should look similar to the following: 2.4.2.0-258. Below, this value is referenced as <HDP_version_number>.
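If the hdp-select utility is installed on a cluster node, it can list the full HDP version string; this is offered as a convenience and assumes a standard Hortonworks installation:
hdp-select versions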
Steps:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the spark-job-service configuration area. For the jvmOptions property, specify the following values. Do not remove any other values that may be specified:
"spark-job-service" : {
  ...
  "jvmOptions" : [
    "-Xmx128m",
    "-Dhdp.version=<HDP_version_number>"
  ]
  ...
}
- In the spark-job-service area, locate the SPARK_DIST_CLASSPATH configuration. Add a reference to your specific version of the local Hadoop client. The following applies to Hortonworks:
"spark-job-service.env.SPARK_DIST_CLASSPATH" : "/usr/hdp/<HDP_version_number>/hadoop/client/*",
- In the spark.props configuration area, add the following values. Do not remove any other values that may be set for these options:
"spark": {
  ...
  "props": {
    ...
    "spark.yarn.am.extraJavaOptions": "-Dhdp.version=<HDP_version_number>",
    "spark.driver.extraJavaOptions": "-Dhdp.version=<HDP_version_number>"
  }
},
- Save your changes.
Additional Configuration
Modify Spark version and Java JDK version
The Designer Cloud Powered by Trifacta platform defaults to using Spark 2.1.0. Depending on the version of your Hadoop distribution, you may need to modify the version of Spark that is used by the platform.
In the following table, you can review the requirements for the Alteryx node hosting Designer Cloud Enterprise Edition and (if applicable) the requirements for the EMR cluster.
NOTE: If you are using Spark 2.2, you must upgrade both the Alteryx node and the EMR cluster to use Java JDK 1.8.
- For platform instances connected to an EMR cluster, you must set the local version of Spark (the spark.version property) to match the version of Spark that is used on the EMR cluster.
- For other platform instances that connect to non-EMR clusters, you must also specify the path on the Alteryx node to the Spark bundle JAR.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Spark Version | Required Java JDK Version | Spark for Designer Cloud Enterprise Edition without EMR | Supported version(s) of EMR | Spark for Designer Cloud Enterprise Edition with EMR |
---|---|---|---|---|
Spark 2.1.0 | Java JDK 1.7 or Java JDK 1.8 | Default settings: "spark.version": "2.1.0", "sparkBundleJar": "services/spark-job-server/build/bundle/spark21/*", | EMR 5.6 and EMR 5.7 | "spark.version": "2.1.0", |
Spark 2.2.0 | Java JDK 1.8 | "spark.version": "2.2.0", "sparkBundleJar": "services/spark-job-server/build/bundle/spark22/*", | EMR 5.8, EMR 5.9, and EMR 5.10 | Spark version must match the Spark version on the EMR cluster: "spark.version": "2.2.0", The sparkBundleJar property is not used. |
Spark 2.2.1 | Java JDK 1.8 | Not supported. | EMR 5.11 and EMR 5.12 | Spark version must match the Spark version on the EMR cluster: "spark.version": "2.2.1", The sparkBundleJar property is not used. |
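For example, to run Spark 2.2.0 on a platform instance that is not connected to EMR, the two properties from the table above would be set together (Java JDK 1.8 must already be installed on the Alteryx node, per the note earlier in this section):
"spark.version": "2.2.0",
"sparkBundleJar": "services/spark-job-server/build/bundle/spark22/*",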
Other Spark configuration topics
- For more information on executing jobs on Spark, see Configure Spark Running Environment.
- For more information on visual profiling using Spark, see Configure Spark Profiler.
Restart Platform
You can restart the platform now. See Start and Stop the Platform.
Verify Operations
At this point, you should be able to run a job in the platform, which launches a Spark execution job and a Spark profiling job. Results appear as expected in the Designer Cloud application.
Steps:
To verify that the Spark running environment is working:
- After you have applied changes to your configuration, you must restart services. See Start and Stop the Platform.
- Run a simple job, including visual profiling, through the Designer Cloud application.
- The job should appear as normal in the Job Status page.
- To verify that it ran on Spark, open the following file: /opt/trifacta/logs/batch-job-runner.log
- Search the log file for a SPARK JOB INFO block with a timestamp corresponding to your job execution. A quick way to do this is shown after this list.
- See the Logs section below for information on how to check the job-specific logs.
- Review any errors.
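A minimal command-line check on the Alteryx node (the exact wording of the log entry may vary by release):
grep "SPARK JOB INFO" /opt/trifacta/logs/batch-job-runner.log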
For more information, see Verify Operations.
Logs
Service logs:
Logs for the Spark Job Service are located in the following location:
/opt/trifacta/logs/spark-job-service.log
Additional log information on the launching of profile jobs is located here:
/opt/trifacta/logs/batch-job-runner.log
Job logs:
When profiling jobs fail, additional log information is written to the following:
/opt/trifacta/logs/jobs/<job-id>/spark-job.log
Troubleshooting
Below is a list of common errors in the log files and their likely causes.
Problem - Spark jobs fail with "Unknown message type: -22" error
Spark jobs may fail with the following error in the YARN application logs:
ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalArgumentException: Unknown message type: -22 at org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:67)
Solution
This problem may occur if Spark authentication is disabled on the Hadoop cluster but enabled in the Designer Cloud Powered by Trifacta platform. Spark authentication must match on the cluster and the platform.
Steps:
- You can apply this change through the Admin Settings Page (recommended) or
trifacta-conf.json
. For more information, see Platform Configuration Methods. - Locate the
spark.props
entry. Insert the following property and value:
"spark.authenticate": "false"
- Save your changes and restart the platform.
Problem - Spark jobs fail when Spark Authentication is enabled on the Hadoop cluster
When Spark authentication is enabled on the Hadoop cluster, Spark jobs can fail. The YARN log file message looks something like the following:
17/09/22 16:55:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, example.com, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?) at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
Solution
When Spark authentication is enabled on the Hadoop cluster, the Designer Cloud Powered by Trifacta platform must also be configured with Spark authentication enabled.
- You can apply this change through the Admin Settings Page (recommended) or
trifacta-conf.json
. For more information, see Platform Configuration Methods. Inside the
spark.props
entry, insert the following property value:"spark.authenticate": "true"
- Save your changes and restart the platform.
Problem - Job fails with "Required executor memory MB is above the max threshold MB of this cluster" error
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log:
Required executor memory (6144+614 MB) is above the max threshold (1615 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
Solution
The per-container memory allocation in Spark (spark.executor.memory and spark.driver.memory) must not exceed the YARN thresholds. See Spark tuning properties above.
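As a point of reference, the figures in the error message are the executor memory request plus Spark's YARN memory overhead, which in Spark 2.x defaults to roughly 10% of executor memory with a 384 MB floor: 6144 MB + 614 MB = 6758 MB, which exceeds the 1615 MB container maximum reported for this cluster.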
Problem - Job fails with ask timed out error
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:
Job submission failed akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://SparkJobServer/user/ProfileLauncher#1213485950]] after [20000 ms]
There is a 20-second timeout on the attempt to submit a profiling job to YARN. If the initial upload of the Spark libraries to the cluster takes longer than 20 seconds, the Spark Job Service times out and returns an error to the UI. However, the libraries do finish uploading successfully to the cluster.
The library upload is a one-time operation for each install/upgrade. Despite the error, the libraries are uploaded successfully the first time. This error does not affect subsequent profiler job runs.
Solution:
Try running the job again.
Problem - Spark fails with ClassNotFound error in Spark job service log
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:
java.lang.ClassNotFoundException: com.trifacta.jobserver.profiler.Profiler
By default, the Spark job service attempts to optimize the distribution of the Spark JAR files across the cluster. This optimization involves a one-time upload of the spark-assembly and profiler-bundle JAR files to HDFS. Then, YARN distributes these JARs to the worker nodes of the cluster, where they are cached for future use.
In some cases, the localized JAR files can get corrupted on the worker nodes, causing this ClassNotFound error to occur.
Solution:
The solution is to disable this optimization through platform configuration.
Steps:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the spark-job-service configuration node. Set the following property to false:
"spark-job-service.optimizeLocalization" : false
- Save your changes and restart the platform.
Problem - Spark fails with PermGen OutOfMemory error in the Spark job service log
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:
Exception in thread "LeaseRenewer:trifacta@nameservice1" java.lang.OutOfMemoryError: PermGen space
Solution:
The solution is to configure the PermGen space for the Spark driver:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the spark configuration node. Set the following property to the given value:
"spark.props.spark.driver.extraJavaOptions" : "-XX:MaxPermSize=1024m -XX:PermSize=256m",
- Save your changes and restart the platform.
Problem - Spark fails with "token (HDFS_DELEGATION_TOKEN token) can't be found in cache" error in the Spark job service log on a Kerberized cluster when Impersonation is enabled
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token x for trifacta) can't be found in cache
Solution:
The solution is to set Spark impersonation to true:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- Locate the spark-job-service configuration node. Set the following property to the given value:
"spark-job-service.sparkImpersonationOn" : true,
- Save your changes and restart the platform.
Problem - Spark fails with "job aborted due to stage failure" error
Issue:
Spark fails with an error similar to the following in the spark-job-service.log:
"Job aborted due to stage failure: Total size of serialized results of 208 tasks (1025.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"
Explanation:
The spark.driver.maxResultSize parameter determines the limit of the total size of serialized results of all partitions for each Spark action (e.g. collect). If the total size of the serialized results exceeds this limit, the job is aborted.
To enable serialized results of unlimited size, set this parameter to zero (0).
Solution:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- In the spark.props section of the file, remove the size limit by setting this value to zero:
"spark.driver.maxResultSize": "0"
- Save your changes and restart the platform.
Problem - Spark job fails with "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'" error
Issue:
Spark job fails with an error similar to the following in either the spark-job.log or the yarn-app.log file:
"java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'"
Explanation:
By default, the Spark running environment attempts to connect to Hive when it creates the Spark Context. This connection attempt may fail if Hive connectivity (in conf/hadoop-site/hive-site.xml) is not configured correctly on the Alteryx node.
Solution:
This issue can be fixed by configuring Hive connectivity on the edge node.
If Hive connectivity is not required, the Spark running environment's default behavior can be changed as follows:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- In the spark-job-service section of the file, disable Hive connectivity by setting this value to false:
"spark-job-service.enableHiveSupport": false
- Save your changes and restart the platform.
Problem - Spark job fails with "No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension?" error
Issue:
Spark job fails with an error similar to the following in either the spark-job.log or the yarn-app.log file:
java.io.FileNotFoundException: No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension?
Explanation:
By default, Spark-Avro requires all Avro files to have the .avro extension, which includes all part files in a source directory. Spark-Avro ignores any files that do not have the .avro extension.
If a directory contains part files without an extension (e.g. part-00001, part-00002), Spark-Avro ignores these files and throws the "No Avro files found" error.
Solution:
This issue can be fixed by setting the spark.hadoop.avro.mapred.ignore.inputs.without.extension property to false:
- You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
- In the spark.props section of the file, add the following setting if it does not already exist. Set its value to false:
"spark.hadoop.avro.mapred.ignore.inputs.without.extension": "false"
- Save your changes and restart the platform.