...
Spark 3.0.1
Info NOTE: Spark 3.0.1 is supported only on specific deployments and versions of the following environments:
- Azure Databricks 8.x
- Azure Databricks 8.3 (Recommended)
- AWS Databricks 7.x
- AWS Databricks 7.3 (Recommended)
- EMR 6.2
Spark 2.4.6 (Recommended)
- Spark 2.4.4
- Spark 2.3.x
Prerequisites
Info NOTE: Spark History Server is not supported. It should be used only for short-term debugging tasks, as it requires considerable resources.
...
Info NOTE: Before you begin, you must acquire the specific version number of your Hortonworks distribution. This version number should look similar to the following:
...
Steps:
D s config Locate the spark-job-service configuration area. For the jvmOptions property, specify the following values. Do not remove any other values that may be specified:
Code Block
"spark-job-service" : {
  ...
  "jvmOptions" : [
    "-Xmx128m",
    "-Dhdp.version=<HDP_version_number>"
  ]
  ...
}
In the spark-job-service area, locate the SPARK_DIST_CLASSPATH configuration. Add a reference to your specific version of the local Hadoop client. The following applies to Hortonworks:
Code Block
"spark-job-service.env.SPARK_DIST_CLASSPATH" : "/usr/hdp/<HDP_version_number>/hadoop/client/*",
In the spark.props configuration area, add the following values. Do not remove any other values that may be set for these options:
Code Block
"spark": {
  ...
  "props": {
    ...
    "spark.yarn.am.extraJavaOptions": "-Dhdp.version=<HDP_version_number>",
    "spark.driver.extraJavaOptions": "-Dhdp.version=<HDP_version_number>"
  },
- Save your changes.
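For illustration only, the following sketch shows how the settings from the three steps above might look after the <HDP_version_number> placeholder has been replaced. The version number 3.1.0.0-78 is purely hypothetical; use the actual version number acquired from your cluster:
Code Block
"spark-job-service" : {
  ...
  "jvmOptions" : [
    "-Xmx128m",
    "-Dhdp.version=3.1.0.0-78"
  ]
  ...
},
"spark-job-service.env.SPARK_DIST_CLASSPATH" : "/usr/hdp/3.1.0.0-78/hadoop/client/*",
"spark": {
  ...
  "props": {
    ...
    "spark.yarn.am.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "spark.driver.extraJavaOptions": "-Dhdp.version=3.1.0.0-78"
  }
}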
...
D s config Verify that the Spark master property is set as follows:
Code Block "spark.master": "yarn",
Review and set the following parameter based on your Hadoop distribution:
Info NOTE: This setting is ignored for EMR, Azure Databricks, and AWS Databricks, which always use the vendor libraries.
Hadoop Distribution | Parameter Value | Value is required?
Cloudera Data Platform 7.1 | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is required.
CDH 6.x | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is in the next section.
HDP 3.x | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is in the next section.
Locate the following setting:
Code Block "spark.version"
Set this value based on the Hadoop distribution in use:
Hadoop Distribution | spark.version | Notes
| 3.0.1 | Info NOTE: This version of Spark is available for selection through the D s webapp. It is supported for a limited number of running environments. Additional information is provided later.
Cloudera Data Platform 7.1 | 2.4.cdh6.3.3.plus | Info NOTE: Please set the Spark version to the value indicated. This special value accounts for unexpected changes to filenames in the CDH packages.
CDH 6.3.3 | 2.4.cdh6.3.3.plus | Info NOTE: Please set the Spark version to the value indicated. This special value accounts for unexpected changes to filenames in the CDH packages.
CDH 6.2 - CDH 6.3 | 2.4.4 | Platform must use native Hadoop libraries from the cluster. Additional configuration is required. See below.
HDP 3.x | 2.3.0 | Platform must use native Hadoop libraries from the cluster. Additional configuration is required. See below.
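For example, on an HDP 3.x cluster the two parameters described above would combine the values from the tables as follows. This is a minimal sketch; verify both values against your own distribution:
Code Block
"spark.useVendorSparkLibraries": true,
"spark.version": "2.3.0",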
...
Set the Hadoop bundle JAR to point to the one provided with your distribution. The example below points to HDP 3.1:
Code Block "hadoopBundleJar": "hadoop-deps/hdp-3.1/build/libs/hdp-3.1-bundle.jar"
Enable use of native libraries:
Code Block "spark.useVendorSparkLibraries": true,
Set the path to the Spark bundle JAR:
Code Block "sparkBundleJar": "/usr/hdp/current/spark2-client/jars/*"
Add the Spark bundle JAR to the Spark Job Service classpath (spark-job-service.classpath). Example:
Code Block "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s"
The following property needs to be added or updated in spark.props. If there are other values in this property, the following value must appear first in the list (see the sketch after these steps):
Code Block
"spark.executor.extraClassPath": "/usr/hdp/current/spark2-client/jars/guava-14.0.1.jar"
The following property also needs to be added or updated in spark.props. It does not need to appear in a specific order:
Code Block
"spark.yarn.jars": "local:/usr/hdp/current/spark2-client/jars/*"
- Save your changes and restart the platform.
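For reference, a minimal sketch of how the two spark.props entries above might sit together. The second classpath entry (/opt/custom/libs/*) is purely hypothetical and is shown only to illustrate that the guava JAR must remain first when other entries are present:
Code Block
"spark": {
  ...
  "props": {
    ...
    "spark.executor.extraClassPath": "/usr/hdp/current/spark2-client/jars/guava-14.0.1.jar:/opt/custom/libs/*",
    "spark.yarn.jars": "local:/usr/hdp/current/spark2-client/jars/*"
  }
}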
...
The per-container memory allocation in Spark (spark.executor.memory and spark.driver.memory) must not exceed the YARN thresholds. See Spark tuning properties above.
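For illustration only: if the YARN maximum container allocation (yarn.scheduler.maximum-allocation-mb) on a hypothetical cluster were 8192 MB, then each of the Spark settings below would need to stay below that limit, leaving headroom for Spark's memory overhead. The values shown are hypothetical:
Code Block
"spark": {
  ...
  "props": {
    ...
    "spark.driver.memory": "6g",
    "spark.executor.memory": "6g"
  }
}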
Problem - Job fails with ask timed out error
...
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:
Code Block
Exception in thread "LeaseRenewer:trifacta@nameservice1" java.lang.OutOfMemoryError: PermGen space
...
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:
Code Block
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token x for trifacta) can't be found in cache
...
Solution:
The solution is to set Spark impersonation to true:
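A minimal sketch of what this might look like, assuming the impersonation flag is named spark-job-service.sparkImpersonationOn; confirm the exact property name for your release before applying it:
Code Block
"spark-job-service.sparkImpersonationOn": true,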
...