
If you are using Hortonworks, additional configuration is required to enable integration with Spark. 

Info

NOTE: Before you begin, you must acquire the specific version number of your Hortonworks distribution. This version number should look similar to the following: 2.4.2.0-258. Below, this value is referenced as <HDP_version_number>.
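As an illustration only (this helper is not part of the product), the HDP version number follows a four-part release number plus a build number, so a quick sanity check can confirm a value looks right before you paste it into the configuration:

```python
import re

# Hypothetical helper for illustration: checks that a string looks like an
# HDP version number of the form <a>.<b>.<c>.<d>-<build>, e.g. 2.4.2.0-258.
HDP_VERSION_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.\d+-\d+$")

def looks_like_hdp_version(value: str) -> bool:
    return bool(HDP_VERSION_PATTERN.match(value))

print(looks_like_hdp_version("2.4.2.0-258"))  # True
print(looks_like_hdp_version("2.4.2"))        # False
```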

 

Steps:

  1. D s config
  2. Locate the spark-job-service configuration area.
  3. For the jvmOptions properties, specify the following values. Do not remove any other values that may be specified:

    Code Block
    "spark-job-service" : {
      ...
      "jvmOptions" : [
        "-Xmx128m",
        "-Dhdp.version=<HDP_version_number>"
      ]
      ...
    }


  4. In the spark-job-service area, locate the SPARK_DIST_CLASSPATH configuration. Add a reference to your specific version of the local Hadoop client. The following applies to Hortonworks:

    Code Block
    "spark-job-service.env.SPARK_DIST_CLASSPATH" : "/usr/hdp/<HDP_version_number>/hadoop/client/*",


  5. In the spark.props configuration area, add the following values. Do not remove any other values that may be set for these options:

    Code Block
     "spark": {
      ...
      "props": {
        ...
       "spark.yarn.am.extraJavaOptions": "-Dhdp.version=<HDP_version_number>",
       "spark.driver.extraJavaOptions": "-Dhdp.version=<HDP_version_number>"
      },


  6. Save your changes.
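The intent of steps 3 through 5 can be sketched in Python against a plain dict shaped like the platform configuration. This is an illustration only; the real file is edited by hand, the version value is the example from above, and only the key names shown in the snippets are assumed:

```python
# Illustrative sketch: models the three Hortonworks edits above on a plain
# dict shaped like the platform configuration. Not a product API.
hdp_version = "2.4.2.0-258"  # substitute your <HDP_version_number>

config = {
    "spark-job-service": {
        "jvmOptions": ["-Xmx128m"],
        "env": {},
    },
    "spark": {"props": {}},
}

svc = config["spark-job-service"]

# Step 3: add -Dhdp.version without removing existing jvmOptions values.
svc["jvmOptions"].append("-Dhdp.version=%s" % hdp_version)

# Step 4: point SPARK_DIST_CLASSPATH at the local Hadoop client JARs.
svc["env"]["SPARK_DIST_CLASSPATH"] = "/usr/hdp/%s/hadoop/client/*" % hdp_version

# Step 5: pass hdp.version to the YARN application master and the driver.
config["spark"]["props"].update({
    "spark.yarn.am.extraJavaOptions": "-Dhdp.version=%s" % hdp_version,
    "spark.driver.extraJavaOptions": "-Dhdp.version=%s" % hdp_version,
})
```

Note that the existing "-Xmx128m" entry is preserved; the steps require that you only add values, not replace them.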

Additional Configuration

Modify Spark version and Java JDK version

The

D s platform
 defaults to using Spark 2.1.0. Depending on the version of your Hadoop distribution, you may need to modify the version of Spark that is used by the platform. 

In the following table, you can review the requirements for the 

D s item
itemnode
 hosting 
D s product
productee
 and (if applicable) the requirements for the EMR cluster.

Info

NOTE: If you are using Spark 2.2, you must upgrade both the

D s item
itemnode
and the EMR cluster to use Java JDK 1.8.

  • For platform instances connected to an EMR cluster, you must set the local version of Spark (spark.version property) to match the version of Spark that is used on the EMR cluster.
  • For other platform instances that connect to non-EMR clusters, you must also specify the path on the 
    D s item
    itemnode
     to the Spark bundle JAR.

D s config

Spark Version: Spark 2.1.0
Required Java JDK Version: Java JDK 1.7 or Java JDK 1.8

Spark for 
D s product
productee
 without EMR - default settings:

Code Block
"spark.version": "2.1.0",
"sparkBundleJar": "services/spark-job-server/build/bundle/spark21/*",

Supported version(s) of EMR: EMR 5.6 and EMR 5.7

Spark for 
D s product
productee
 with EMR:

Code Block
"spark.version": "2.1.0",

sparkBundleJar property is not used.

Spark Version: Spark 2.2.0
Required Java JDK Version: Java JDK 1.8

Spark for 
D s product
productee
 without EMR:

Code Block
"spark.version": "2.2.0",
"sparkBundleJar": "services/spark-job-server/build/bundle/spark22/*",

Supported version(s) of EMR: EMR 5.8, EMR 5.9, and EMR 5.10

Spark for 
D s product
productee
 with EMR - the Spark version must match the Spark version on the EMR cluster:

Code Block
"spark.version": "2.2.0",

sparkBundleJar property is not used.

Spark Version: Spark 2.2.1
Required Java JDK Version: Java JDK 1.8

Spark for 
D s product
productee
 without EMR: Not supported.

Supported version(s) of EMR: EMR 5.11 and EMR 5.12

Spark for 
D s product
productee
 with EMR - the Spark version must match the Spark version on the EMR cluster:

Code Block
"spark.version": "2.2.1",

sparkBundleJar property is not used.
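The EMR pairings above reduce to a simple lookup from EMR release to the required local spark.version value. The following sketch restates that mapping for illustration; it is not a product API:

```python
# Sketch derived from the table above: the local spark.version that matches
# each supported EMR release. Illustration only.
EMR_SPARK_VERSIONS = {
    "5.6": "2.1.0", "5.7": "2.1.0",
    "5.8": "2.2.0", "5.9": "2.2.0", "5.10": "2.2.0",
    "5.11": "2.2.1", "5.12": "2.2.1",
}

def required_spark_version(emr_release: str) -> str:
    """Return the spark.version setting required for a given EMR release."""
    return EMR_SPARK_VERSIONS[emr_release]

print(required_spark_version("5.9"))  # 2.2.0
```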

Other Spark configuration topics

Restart Platform

You can restart the platform now. See Start and Stop the Platform.

Verify Operations

At this point, you should be able to run a job in the platform, which launches a Spark execution job and a profiling job. Results appear as normal in the 

D s webapp
.

Steps:

To verify that the Spark running environment is working:

  1. After you have applied changes to your configuration, you must restart services. See Start and Stop the Platform.
  2. Run a simple job, including visual profiling, through the 
    D s webapp
  3. The job should appear as normal in the Job Status page.
  4. To verify that it ran on Spark, open the following file:
    /opt/trifacta/logs/batch-job-runner.log
  5. Search the log file for a SPARK JOB INFO block with a timestamp corresponding to your job execution. 
  6. See below for information on how to check the job-specific logs. 
  7. Review any errors.

For more information, see Verify Operations.

Logs

Service logs:

Logs for the Spark Job Service are located in the following location:

/opt/trifacta/logs/spark-job-service.log

Additional log information on the launching of profile jobs is located here: 

/opt/trifacta/logs/batch-job-runner.log

Job logs:

When profiling jobs fail, additional log information is written to the following:

/opt/trifacta/logs/jobs/<job-id>/spark-job.log

Troubleshooting

Below is a list of common errors in the log files and their likely causes.

Problem - Spark jobs fail with "Unknown message type: -22" error when authentication for the Spark shuffle service is enabled

Spark jobs may fail with the following error in the YARN application logs:

Code Block
ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalArgumentException: Unknown message type: -22
at org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:67)

Solution

This problem may occur if Spark authentication is disabled on the Hadoop cluster but enabled in the 

D s platform
. Spark authentication must match on the cluster and the platform. 

Steps:

  1. D s config
  2. Locate the spark.props entry. 
  3. Insert the following property and value:

    Code Block
    "spark.authenticate": "false"


  4. Save your changes and restart the platform.
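The rule behind this fix is simply that the platform's spark.authenticate flag must equal the cluster's setting, whichever that is. A small hypothetical check (not part of the product) makes the rule concrete:

```python
# Illustration of the underlying rule: the platform's spark.authenticate
# setting must match the Hadoop cluster's Spark authentication setting.
# Hypothetical helper, not a product API.
def spark_auth_consistent(cluster_auth_enabled: bool, platform_props: dict) -> bool:
    # The platform stores the flag as the string "true" or "false".
    platform_auth = platform_props.get("spark.authenticate", "false") == "true"
    return platform_auth == cluster_auth_enabled

# Cluster has authentication disabled, so the platform must say "false" too:
print(spark_auth_consistent(False, {"spark.authenticate": "false"}))  # True
print(spark_auth_consistent(False, {"spark.authenticate": "true"}))   # False
```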

Problem - Spark jobs fail when Spark Authentication is enabled on the Hadoop cluster

When Spark authentication is enabled on the Hadoop cluster, Spark jobs can fail. The YARN log file message looks similar to the following:

Code Block
17/09/22 16:55:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, example.com, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)

Solution

When Spark authentication is enabled for the shuffle service on the Hadoop cluster, the 

D s platform
 must also be configured with Spark authentication enabled. 

  1. D s config
  2. Inside the spark.props entry, insert the following property value:

    Code Block
    "spark.authenticate": "true"


  3. Save your changes and restart the platform.

Problem - Job fails with "Required executor memory MB is above the max threshold MB of this cluster" error

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log:

Code Block
Required executor memory (6144+614 MB) is above the max threshold (1615 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Solution

The per-container memory allocation in Spark (spark.executor.memory and spark.driver.memory) must not exceed the YARN thresholds. See Spark tuning properties above.
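The numbers in the example message come from the executor memory plus the YARN memory overhead, which in this generation of Spark defaults to the larger of 384 MB or 10% of executor memory (an assumption stated here for illustration):

```python
# Sketch of the check behind this error, assuming Spark's default overhead
# of max(384 MB, 10% of executor memory). Values from the example message.
def fits_in_yarn(executor_memory_mb: int, yarn_max_allocation_mb: int) -> bool:
    overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb <= yarn_max_allocation_mb

# 6144 MB executor + 614 MB overhead = 6758 MB, far above the 1615 MB limit:
print(fits_in_yarn(6144, 1615))  # False
print(fits_in_yarn(1024, 1615))  # True (1024 + 384 = 1408 MB)
```

Either lower spark.executor.memory / spark.driver.memory or raise the YARN thresholds until the check passes.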

Problem - Job fails with ask timed out error

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

Code Block
Job submission failed
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://SparkJobServer/user/ProfileLauncher#1213485950]] after [20000 ms]

There is a 20-second timeout on the attempt to submit a profiling job to YARN. If the initial upload of the Spark libraries to the cluster takes longer than 20 seconds, the Spark Job Service times out and returns an error to the UI. However, the libraries do finish uploading successfully to the cluster.

The library upload is a one-time operation for each install/upgrade. Despite the error, the libraries are uploaded successfully the first time. This error does not affect subsequent profiler job runs.

Solution:

Try running the job again.

Problem - Spark fails with ClassNotFound error in Spark job service log

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

Code Block
java.lang.ClassNotFoundException: com.trifacta.jobserver.profiler.Profiler

By default, the Spark job service attempts to optimize the distribution of the Spark JAR files across the cluster. This optimization involves a one-time upload of the spark-assembly and profiler-bundle JAR files to HDFS. Then, YARN distributes these JARs to the worker nodes of the cluster, where they are cached for future use.

In some cases, the localized JAR files can get corrupted on the worker nodes, causing this ClassNotFound error to occur.

Solution:

The solution is to disable this optimization through platform configuration. 

Steps:

  1. D s config
  2. Locate the spark-job-service configuration node.
  3. Set the following property to false:

    Code Block
    "spark-job-service.optimizeLocalization" : false

     

  4. Save your changes and restart the platform.

Problem - Spark fails with PermGen OutOfMemory error in the Spark job service log

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

 

Code Block
Exception in thread "LeaseRenewer:trifacta@nameservice1" java.lang.OutOfMemoryError: PermGen space

Solution:

The solution is to configure the PermGen space for the Spark driver: 

  1. D s config
  2. Locate the spark configuration node.
  3. Set the following property to the given value:

    Code Block
    "spark.props.spark.driver.extraJavaOptions" : "-XX:MaxPermSize=1024m -XX:PermSize=256m",


  4. Save your changes and restart the platform. 

Problem - Spark fails with "token (HDFS_DELEGATION_TOKEN token) can't be found in cache" error in the Spark job service log on a Kerberized cluster when Impersonation is enabled

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

 

Code Block
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token x for trifacta) can't be found in cache

 

Solution:

The solution is to set Spark impersonation to true:

  1. D s config
  2. Locate the spark-job-service configuration node.
  3. Set the following property to the given value:

    Code Block
    "spark-job-service.sparkImpersonationOn" : true,


  4. Save your changes and restart the platform.

Problem - Spark fails with "job aborted due to stage failure" error

Issue:

Spark fails with an error similar to the following in the spark-job-service.log: 

Code Block
"Job aborted due to stage failure: Total size of serialized results of 208 tasks (1025.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"

Explanation:

The spark.driver.maxResultSize parameter determines the limit of the total size of serialized results of all partitions for each Spark action (e.g. collect). If the total size of the serialized results exceeds this limit, the job is aborted.

To enable serialized results of unlimited size, set this parameter to zero (0). 
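The behavior described above can be sketched as a simple rule: an action is aborted when the total serialized result size exceeds the limit, and a limit of zero disables the check. This is an illustration of the rule, not Spark's implementation:

```python
# Sketch of the spark.driver.maxResultSize rule: the total serialized result
# size across all tasks must stay within the limit; 0 means unlimited.
def results_allowed(task_result_sizes_mb, max_result_size_mb):
    if max_result_size_mb == 0:  # 0 disables the limit
        return True
    return sum(task_result_sizes_mb) <= max_result_size_mb

# 208 tasks of ~4.93 MB each total ~1025.4 MB, above a 1024 MB limit:
print(results_allowed([4.93] * 208, 1024))  # False -> job aborted
print(results_allowed([4.93] * 208, 0))     # True  -> limit disabled
```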

Solution:


  1. D s config
  2. In the spark.props section of the file, remove the size limit by setting this value to zero:

    Code Block
    "spark.driver.maxResultSize": "0"


  3. Save your changes and restart the platform.

Problem - Spark job fails with "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'" error

Issue:

Spark job fails with an error similar to the following in either the spark-job.log or the yarn-app.log file: 

Code Block
"java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'"

Explanation:

By default, the Spark running environment attempts to connect to Hive when it creates the Spark Context. This connection attempt may fail if Hive connectivity (in conf/hadoop-site/hive-site.xml) is not configured correctly on the

D s item
itemnode
.  

Solution:

This issue can be fixed by configuring Hive connectivity on the edge node.

If Hive connectivity is not required, the Spark running environment's default behavior can be changed as follows:

  1. D s config
  2. In the spark-job-service section of the file, disable Hive connectivity by setting this value to false:

    Code Block
    "spark-job-service.enableHiveSupport": false


  3. Save your changes and restart the platform.

Problem - Spark job fails with "No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension?" error

Issue:

Spark job fails with an error similar to the following in either the spark-job.log or the yarn-app.log file: 

Code Block
java.io.FileNotFoundException: No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension?

Explanation:

By default, Spark-Avro requires all Avro files to have the .avro extension, which includes all part files in a source directory. Spark-Avro ignores any files that do not have the .avro extension.

If a directory contains part files without an extension (e.g. part-00001, part-00002), Spark-Avro ignores these files and throws the "No Avro files found" error.
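The filtering behavior described above can be illustrated with a short sketch (not the Spark-Avro implementation): when extensionless files are ignored, a directory of part files yields no readable inputs at all.

```python
# Illustrates the Spark-Avro behavior described above: with the ignore flag
# on, files lacking the .avro extension are skipped entirely.
def readable_avro_files(filenames, ignore_without_extension=True):
    if not ignore_without_extension:
        return list(filenames)
    return [f for f in filenames if f.endswith(".avro")]

part_files = ["part-00001", "part-00002"]
print(readable_avro_files(part_files))         # [] -> "No Avro files found"
print(readable_avro_files(part_files, False))  # both part files are read
```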

Solution:

This issue can be fixed by setting the spark.hadoop.avro.mapred.ignore.inputs.without.extension property to false:

  1. D s config
  2. To the spark.props section of the file, add the following setting if it does not already exist. Set its value to false:

    Code Block
    "spark.hadoop.avro.mapred.ignore.inputs.without.extension": "false"


  3. Save your changes and restart the platform.