...

  • Spark 3.0.1

    Info

    NOTE: Spark 3.0.1 is supported only on specific deployments and versions of the following environments:

    Azure Databricks 8.x

    Azure Databricks 8.3 (Recommended)

    AWS Databricks 7.x

    AWS Databricks 7.3 (Recommended)

    EMR 6.2


  • Spark 2.4.6 (Recommended)

  • Spark 2.4.4

  • Spark 2.3.x

Pre-requisites

Info

NOTE: The Spark History Server is not supported and should be used only for short-term debugging tasks, as it requires considerable resources.

...

Info

NOTE: Before you begin, you must acquire the specific version number of your Hortonworks distribution. This version number should look similar to the following: 2.4.2.0-258. Below, this value is referenced as <HDP_version_number>.

...


Steps:

  1. D s config
  2. Locate the spark-job-service configuration area.
  3. For the jvmOptions properties, specify the following values. Do not remove any other values that may be specified:

    Code Block
    "spark-job-service" : {
      ...
      "jvmOptions" : [
        "-Xmx128m",
        "-Dhdp.version=<HDP_version_number>"
      ]
      ...
    }


  4. In the spark-job-service area, locate the SPARK_DIST_CLASSPATH configuration. Add a reference to your specific version of the local Hadoop client. The following applies to Hortonworks:

    Code Block
    "spark-job-service.env.SPARK_DIST_CLASSPATH" : "/usr/hdp/<HDP_version_number>/hadoop/client/*",


  5. In the spark.props configuration area, add the following values. Do not remove any other values that may be set for these options:

    Code Block
    "spark": {
      ...
      "props": {
        ...
        "spark.yarn.am.extraJavaOptions": "-Dhdp.version=<HDP_version_number>",
        "spark.driver.extraJavaOptions": "-Dhdp.version=<HDP_version_number>"
      },
      ...
    }


  6. Save your changes.
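
Taken together, the edits above touch the spark-job-service area and spark.props. The following is a minimal sketch of the combined result, using the example version number 2.4.2.0-258 from the note above in place of <HDP_version_number>; substitute your own version number and keep any other values already present in these areas.

    Code Block
    "spark-job-service" : {
      ...
      "jvmOptions" : [
        "-Xmx128m",
        "-Dhdp.version=2.4.2.0-258"
      ],
      ...
    },
    "spark-job-service.env.SPARK_DIST_CLASSPATH" : "/usr/hdp/2.4.2.0-258/hadoop/client/*",
    "spark": {
      ...
      "props": {
        ...
        "spark.yarn.am.extraJavaOptions": "-Dhdp.version=2.4.2.0-258",
        "spark.driver.extraJavaOptions": "-Dhdp.version=2.4.2.0-258"
      },
      ...
    }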

...

  1. D s config

  2. Verify that the Spark master property is set accordingly:

    Code Block
    "spark.master": "yarn",


  3. Review and set the following parameter based on your Hadoop distribution:

    Info

    NOTE: This setting is ignored for EMR, Azure Databricks and AWS Databricks, which always use the vendor libraries.


    Hadoop Distribution | Parameter Value | Value is required?
    Cloudera Data Platform 7.1 | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is required.
    CDH 6.x | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is in the next section.
    HDP 3.x | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is in the next section.


  4. Locate the following setting: 


    Code Block
    "spark.version"


  5. Set the value above based on the Hadoop distribution in use. A combined example follows the table below:

    Hadoop Distribution | spark.version | Notes

    (see note) | 3.0.1 | NOTE: This version of Spark is available for selection through the D s webapp. It is supported for a limited number of running environments. Additional information is provided later.

    Cloudera Data Platform 7.1 | 2.4.cdh6.3.3.plus | NOTE: Please set the Spark version to the value indicated. This special value accounts for unexpected changes to filenames in the CDH packages.

    CDH 6.3.3 | 2.4.cdh6.3.3.plus | NOTE: Please set the Spark version to the value indicated. This special value accounts for unexpected changes to filenames in the CDH packages.

    CDH 6.2 - CDH 6.3 | 2.4.4 | Platform must use native Hadoop libraries from the cluster. Additional configuration is required. See below.

    HDP 3.x | 2.3.0 | Platform must use native Hadoop libraries from the cluster. Additional configuration is required. See below.
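
    As a concrete illustration, combining the Spark master setting from step 2 with the values from the tables above, the entries for a CDH 6.3.3 cluster might look like the following sketch; other distributions substitute their own values from the tables.

    Code Block
    "spark.master": "yarn",
    "spark.useVendorSparkLibraries": true,
    "spark.version": "2.4.cdh6.3.3.plus",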


...

  1. Set the Hadoop bundle JAR to point to the one provided with your distribution. The example below points to HDP 3.1:

    Code Block
     "hadoopBundleJar": "hadoop-deps/hdp-3.1/build/libs/hdp-3.1-bundle.jar"


  2. Enable use of native libraries:

     


    Code Block
    "spark.useVendorSparkLibraries": true,


  3. Set the path to the Spark bundle JAR:

    Code Block
    "sparkBundleJar": "/usr/hdp/current/spark2-client/jars/*"


  4. Add the Spark bundle JAR to the Spark Job Service classpath (spark-job-service.classpath). Example:

    Code Block
    "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s"


  5. The following property needs to be added or updated in spark.props. If there are other values in this property, the following value must appear first in the list:

    Code Block
    "spark.executor.extraClassPath": "/usr/hdp/current/spark2-client/jars/guava-14.0.1.jar"


  6. The following property needs to be added or updated in spark.props. It does not need to appear in a specific position in the list:

    Code Block
    "spark.yarn.jars": "local:/usr/hdp/current/spark2-client/jars/*"


  7. Save your changes and restart the platform.
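
After completing these steps for an HDP cluster, the relevant entries, taken together, should resemble the following sketch. The paths are the HDP 3.1 examples used in the steps above; keep any other values already present in these areas.

    Code Block
    "hadoopBundleJar": "hadoop-deps/hdp-3.1/build/libs/hdp-3.1-bundle.jar",
    "spark.useVendorSparkLibraries": true,
    "sparkBundleJar": "/usr/hdp/current/spark2-client/jars/*",
    "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s",
    "spark": {
      ...
      "props": {
        ...
        "spark.executor.extraClassPath": "/usr/hdp/current/spark2-client/jars/guava-14.0.1.jar",
        "spark.yarn.jars": "local:/usr/hdp/current/spark2-client/jars/*",
        ...
      },
      ...
    }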

...

The per-container memory allocation in Spark (spark.executor.memory and spark.driver.memory) must not exceed the YARN thresholds. See Spark tuning properties above.
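
For example, on a cluster where YARN's maximum container allocation (yarn.scheduler.maximum-allocation-mb) is 16384 MB, entries such as the following in spark.props would stay under the threshold while leaving headroom for executor and driver memory overhead. The 12g figure is purely illustrative, not a recommendation.

Code Block
"spark.driver.memory": "12g",
"spark.executor.memory": "12g",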

Problem - Job fails with ask timed out error

...


When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

 


Code Block
Exception in thread "LeaseRenewer:trifacta@nameservice1" java.lang.OutOfMemoryError: PermGen space

...


When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file: 


Code Block
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token x for trifacta) can't be found in cache

...


Solution:

Set Spark impersonation to true:
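
In platform configuration this is a single boolean setting. The property name shown below is an assumption based on the spark-job-service naming used elsewhere on this page and may differ in your release; verify it against your platform configuration before applying it.

Code Block
"spark-job-service.sparkImpersonationOn": true,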

...