
D toc

The

D s platform
can be configured to use Spark to execute jobs, to generate a visual profile of transform job results, or both.

  • A visual profile is a visual summary of a dataset. It visually identifies areas of interest, including valid, missing, or mismatched values, as well as useful statistics on column data.
    • Visual profiles can be optionally generated using Spark.
    • In the
      D s webapp
      , visual profiles appear in the Job Details page when a job has successfully executed and a profile has been requested for it. See Job Details Page.
  • Apache Spark provides in-memory processing capabilities for a Hadoop cluster. In Spark, the processing of the large volume of computations to generate this information is performed in-memory. This method reduces disk access and significantly improves overall performance. For more information, see https://spark.apache.org/.

The Spark Job Service is a Scala-based capability for executing jobs and profiling your job results as an extension of job execution. This feature leverages the computing power of your existing Hadoop cluster to increase job execution and profiling performance. Features:

  • Requires no additional installation on the
    D s node
    .
  • Support for yarn-cluster mode ensures that all Spark processing is handled on the Hadoop cluster.
  • Exact bin counts appear for profile results, except for Top-N counts.

Supported Versions

The following versions of Spark are supported: 

Info

NOTE: Depending on the version of Spark and your Hadoop distribution, additional configuration may be required. See Configure Spark Version below.

  • Spark 2.4.x (Recommended)

  • Spark 2.3.x

Pre-requisites

Info

NOTE: The Spark History Server is not supported for regular use. It should be enabled only for short-term debugging tasks, as it requires considerable resources.


Before you begin, please verify the following:

  1. For a kerberized environment, additional pre-requisites apply. See Configure for Kerberos Integration.

  2. For secure impersonation, additional configuration is required. See Configure for Secure Impersonation.

Configure the
D s platform

Excerpt

Configure Spark Job Service

The Spark Job Service must be enabled for both execution and profiling jobs to work in Spark.

Below is a sample configuration and description of each property.

D s config
.

Code Block
"spark-job-service" : {
  "systemProperties" : {
    "java.net.preferIPv4Stack": "true",
    "SPARK_YARN_MODE": "true"
  },
  "sparkImpersonationOn": false,
  "optimizeLocalization": true,
  "mainClass": "com.trifacta.jobserver.SparkJobServer",
  "jvmOptions": [
    "-Xmx128m"
  ],
  "hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-5.4/build/libs",
  "env": {
    "SPARK_JOB_SERVICE_PORT": "4007",
    "SPARK_DIST_CLASSPATH": "",
    "MAPR_TICKETFILE_LOCATION": "<MAPR_TICKETFILE_LOCATION>",
    "MAPR_IMPERSONATION_ENABLED": "0",
    "HADOOP_USER_NAME": "trifacta",
    "HADOOP_CONF_DIR": "%(topOfTree)s/conf/hadoop-site/"
  },
  "enabled": true,
  "enableHiveSupport": true,
  "enableHistoryServer": false,
  "classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:%(topOfTree)s/%(hadoopBundleJar)s",
  "autoRestart": false,
},

The following properties can be modified based on your needs:

Info

NOTE: Do not modify any of the above properties that are not listed below unless explicitly instructed to do so.

Property | Description

sparkImpersonationOn
Set this value to true if secure impersonation is enabled on your cluster. See Configure for Secure Impersonation.

jvmOptions
This array of values can be used to pass parameters to the JVM that manages the Spark Job Service.

hiveDependenciesLocation
If Spark is integrated with a Hive instance, set this value to the path to the location where Hive dependencies are installed on the
D s node
. For more information, see Configure for Hive.

env.SPARK_JOB_SERVICE_PORT
Set this value to the listening port number on the cluster for Spark. Default value is 4007. For more information, see System Ports.

env.HADOOP_USER_NAME
The username of the Hadoop principal used by the platform. By default, this value is
D s defaultuser
Typehadoop
Valuetrue
.

env.HADOOP_CONF_DIR
The directory on the
D s node
where the Hadoop cluster configuration files are stored. Do not modify unless necessary.

enabled
Set this value to true to enable the Spark Job Service.

enableHiveSupport
See "Configure service for Hive" below.

After making any changes, save the file and restart the platform. See Start and Stop the Platform.

Configure service for Hive

Depending on the environment, please apply the following configuration changes to manage Spark interactions with Hive:

Environment | spark.enableHiveSupport
Hive is not present | false
Hive is present but not enabled | false
Hive is present and enabled | true
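For example, if Hive is present and enabled on the cluster, the flattened form of this setting is:

```
"spark-job-service.enableHiveSupport": true,
```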

If Hive is present on the cluster, whether enabled or disabled, the hive-site.xml file must be copied to the correct directory:

Code Block
cp /etc/hive/conf/hive-site.xml /opt/trifacta/conf/hadoop-site/hive-site.xml

At this point, the platform only expects that a hive-site.xml file has been installed on the

D s node
. A valid connection is not required. For more information, see Configure for Hive.

Configure Spark

After the Spark Job Service has been enabled, please complete the following sections to configure it for the

D s platform
.

Yarn cluster mode

All jobs submitted to the Spark Job Service are executed in YARN cluster mode. No other cluster mode is supported for the Spark Job Service.

Configure access for secure impersonation

The Spark Job Service can run under secure impersonation. For more information, see Configure for Secure Impersonation.

When running under secure impersonation, the Spark Job Service requires access to the following folders. Read, write, and execute access must be provided to the

D s item
user
user
and the impersonated user.

Folder Name | Platform Configuration Property | Default Value | Description

D s item
Libraries folder
Libraries folder
Property: "hdfs.pathsConfig.libraries"
Default Value: /trifacta/libraries
Description: Maintains JAR files and other libraries required by Spark. No sensitive information is written to this location.

D s item
Temp files folder
Temp files folder
Property: "hdfs.pathsConfig.tempFiles"
Default Value: /trifacta/tempfiles
Description: Holds temporary progress information files for YARN applications. Each file contains a number indicating the progress percentage. No sensitive information is written to this location.

D s item
Dictionaries folder
Dictionaries folder
Property: "hdfs.pathsConfig.dictionaries"
Default Value: /trifacta/dictionaries
Description: Contains definitions of dictionaries created for the platform.
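As an illustrative sketch only: from a cluster node, the required access could be granted with the HDFS CLI. The folder paths are the defaults from the table above; the group name trifacta-users is hypothetical and should be replaced with a group that contains both the platform user and the impersonated users:

```
hdfs dfs -mkdir -p /trifacta/libraries /trifacta/tempfiles /trifacta/dictionaries
hdfs dfs -chgrp -R trifacta-users /trifacta/libraries /trifacta/tempfiles /trifacta/dictionaries
hdfs dfs -chmod -R 770 /trifacta/libraries /trifacta/tempfiles /trifacta/dictionaries
```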

Identify Hadoop libraries on the cluster

The Spark Job Service does not require additional installation on the

D s node
or on the Hadoop cluster. Instead, it references the spark-assembly JAR that is provided with the
D s item
distribution
distribution
.

This JAR file does not include the Hadoop client libraries. You must point the

D s platform
to the appropriate libraries.

Steps:

  1. In platform configuration, locate the spark-job-service configuration block.
  2. Set the following property:

    Code Block
    "spark-job-service.env.HADOOP_CONF_DIR": "<path_to_Hadoop_conf_dir_on_Hadoop_cluster>",
    PropertyDescription
    spark-job-service.env.HADOOP_CONF_DIRPath to the Hadoop configuration directory on the Hadoop cluster.
  3. In the same block, the SPARK_DIST_CLASSPATH property must be set depending on your Hadoop distribution.
    1. For Cloudera 5.x: This property can be left blank.
    2. For Hortonworks 2.x: This property configuration is covered later in this section.

  4. Save your changes.
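For example, on a cluster that distributes its configuration files to /etc/hadoop/conf (a common default; verify the correct path for your distribution), the settings might look like:

```
"spark-job-service.env.HADOOP_CONF_DIR": "/etc/hadoop/conf",
"spark-job-service.env.SPARK_DIST_CLASSPATH": "",
```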

Locate Hive dependencies location

If the

D s platform
is also connected to a Hive instance, please verify the location of the Hive dependencies on the
D s node
. The following example is from Cloudera 5.10:

Info

NOTE: This parameter value is distribution-specific. Please update based on your Hadoop distribution.

Code Block
"spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-5.10/build/libs",

Specify YARN queue for Spark jobs

Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the

D s platform
are submitted to this queue.

Steps:

  1. In platform configuration, locate the following:

    Code Block
    "spark.props.spark.yarn.queue": "default",
  2. Replace default with the name of the queue.
  3. Save your changes.
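For example, to submit all Spark jobs to a hypothetical queue named prod_etl:

```
"spark.props.spark.yarn.queue": "prod_etl",
```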

Spark tuning properties

The following properties in platform configuration can be modified for Spark.

Code Block
"spark": {
    ...
    "props": {
      "spark.driver.maxResultSize": "0",
      "spark.executor.memory": "6GB",
      "spark.executor.cores": "2",
      "spark.driver.memory": "2GB"
      ...
    }
},

Setting Spark properties: You can pass additional properties to Spark such as number of cores, executors, and more.

Info

NOTE: The above values are default values. If you are experiencing performance issues, you can modify the values. If you require further assistance, please contact

D s support
.

If you have sufficient cluster resources, you should pass the following values:

Code Block
"spark": {
    ...
    "props": {
      "spark.driver.maxResultSize": "0",
      "spark.executor.memory": "16GB",
      "spark.executor.cores": "5",
      "spark.driver.memory": "16GB"
      ...
    }
},

Notes:

  • The above values must be below the per-container thresholds set by YARN. Please verify your settings against the following parameters in yarn-site.xml:

    Code Block
    yarn.scheduler.maximum-allocation-mb
    yarn.scheduler.maximum-allocation-vcores
    yarn.nodemanager.resource.memory-mb
    yarn.nodemanager.resource.cpu-vcores
  • If you are using YARN queues, please verify that these values are set below max queue thresholds.
  • For more information on these properties, see https://spark.apache.org/docs/2.2.0/configuration.html.

Save your changes.
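As a sanity check before saving, the per-container arithmetic from the notes above can be sketched in a few lines. This is illustrative only, not platform code; the 10% overhead with a 384 MB floor mirrors Spark's default executor memory overhead calculation:

```python
# Illustrative check (not platform code): verify that a requested Spark
# executor fits under the YARN per-container memory limit
# (yarn.scheduler.maximum-allocation-mb). The 10% overhead with a
# 384 MB floor mirrors Spark's default executor memory overhead.
def fits_in_yarn(executor_memory_mb: int, yarn_max_allocation_mb: int) -> bool:
    overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb <= yarn_max_allocation_mb

# A 6 GB executor needs about 6144 + 614 = 6758 MB of container memory.
print(fits_in_yarn(6144, 1615))  # rejected on a small cluster
print(fits_in_yarn(6144, 8192))  # fits
```

The same arithmetic explains the "Required executor memory (6144+614 MB) is above the max threshold" error described in the Troubleshooting section.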

Configure Batch Job Runner for Spark service

You can modify the following Batch Job Runner configuration settings for the Spark service.

Info

NOTE: Avoid modifying these settings unless you are experiencing issues with the user interface reporting jobs as having failed while the Spark job continues to execute on YARN.

Setting: batchserver.spark.requestTimeoutMillis
Description: Specifies the number of milliseconds that the Batch Job Runner service should wait for a response from Spark. If this timeout is exceeded, the UI changes the job status to failed. The YARN job may continue.
Default: 600000 (600 seconds)
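For example, to double the wait to 20 minutes (the value 1200000 is illustrative, not a recommendation):

```
"batchserver.spark.requestTimeoutMillis": 1200000,
```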

Configure Spark for Hortonworks

If you are using Hortonworks, additional configuration is required to enable integration with Spark.

Info

NOTE: Before you begin, you must acquire the specific version number of your Hortonworks version. This version number should look similar to the following: 2.4.2.0-258. Below, this value is referenced as <HDP_version_number>.

 

Steps:

  1. D s config
  2. Locate the spark-job-service configuration area.
  3. For the jvmOptions properties, specify the following values. Do not remove any other values that may be specified:

    Code Block
    "spark-job-service" : {
      ...
      "jvmOptions" : [
        "-Xmx128m",
        "-Dhdp.version=<HDP_version_number>"
      ]
      ...
    }
  4. In the spark-job-service area, locate the SPARK_DIST_CLASSPATH configuration. Add a reference to your specific version of the local Hadoop client. The following applies to Hortonworks:

    Code Block
    "spark-job-service.env.SPARK_DIST_CLASSPATH" : "/usr/hdp/<HDP_version_number>/hadoop/client/*",
  5. In the spark.props configuration area, add the following values. Do not remove any other values that may be set for these options:

    Code Block
    "spark": {
      ...
      "props": {
        ...
        "spark.yarn.am.extraJavaOptions": "-Dhdp.version=<HDP_version_number>",
        "spark.driver.extraJavaOptions": "-Dhdp.version=<HDP_version_number>"
      },
      ...
    }
  6. Save your changes.
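Taken together, and using a hypothetical version number of 2.6.5.0-292 for illustration (substitute your own <HDP_version_number>), the resulting settings look like the following:

```
"spark-job-service.jvmOptions": [
  "-Xmx128m",
  "-Dhdp.version=2.6.5.0-292"
],
"spark-job-service.env.SPARK_DIST_CLASSPATH": "/usr/hdp/2.6.5.0-292/hadoop/client/*",
"spark.props.spark.yarn.am.extraJavaOptions": "-Dhdp.version=2.6.5.0-292",
"spark.props.spark.driver.extraJavaOptions": "-Dhdp.version=2.6.5.0-292"
```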

Configure Spark Version

Set Spark version

Review and set the following parameters.

Steps:

  1. D s config

  2. Verify that the Spark master property is set accordingly:

    Code Block
    "spark.master": "yarn",
  3. Review and set the following parameter based on your Hadoop distribution:

    Hadoop Distribution | Parameter Value | Value is required?
    CDH 6.x | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is in the next section.
    HDP 3.x | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is in the next section.
    CDH 5.x and earlier | "spark.useVendorSparkLibraries": false, | No
    HDP 2.x and earlier | "spark.useVendorSparkLibraries": false, | No
  4. Locate the following setting:

     

    Code Block
    "spark.version"
  5. Set the above value based on your Hadoop distribution in use:

    Hadoop Distribution | spark.version | Notes
    CDH 6.3.3 | 2.4.cdh6.3.3.plus | See note below.
    Info

    NOTE: For CDH 6.3.3, please set the Spark version to the value indicated. This special value accounts for unexpected changes to filenames in the CDH packages.

    CDH 6.1 - CDH 6.3 | 2.4.4 | Platform must use native Hadoop libraries from the cluster. Additional configuration is required. See below.
    CDH 5.16.x | A supported version of Spark |
    HDP 3.1 | 2.3.0 | Platform must use native Hadoop libraries from the cluster. Additional configuration is required. See below.
    HDP 3.0 | 2.3.0 | Platform must use native Hadoop libraries from the cluster. Additional configuration is required. See below.
    HDP 2.6.x | A supported version of Spark |
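For example, for CDH 6.1 - CDH 6.3, the settings from steps 3 and 5 combine as follows:

```
"spark.useVendorSparkLibraries": true,
"spark.version": "2.4.4",
```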

Acquire native libraries from the cluster

Info

NOTE: If the

D s node
is installed on an edge node of the cluster, you may skip this section.

You must acquire native Hadoop libraries from the cluster if you are using any of the following versions:

Hadoop version | Library location on cluster |
D s node
location

Cloudera 6.0 or later | /opt/cloudera/parcels/CDH | See section below.
Hortonworks 3.0 or later | /usr/hdp/current | See section below.

Info

NOTE: Whenever the Hadoop distribution is upgraded on the cluster, the new versions of these libraries must be recopied to the following locations on the

D s node
. This maintenance task is not required if the
D s node
is an edge node of the cluster.

For more information on acquiring these libraries, please see the documentation provided with your Hadoop distribution.

Use native libraries on Cloudera 6.0 and later

To integrate with CDH 6.x, the platform must use the native Spark libraries. Please add the following properties to your configuration. 

Steps:

  1. D s config
  2. Set sparkBundleJar to the following:

    Code Block
    "sparkBundleJar":"/opt/cloudera/parcels/CDH/lib/spark/jars/*:/opt/cloudera/parcels/CDH/lib/spark/hive/*"
  3. For the Spark Job Service, the Spark bundle JAR must be added to the classpath:

    Info

    NOTE: The key modification is to remove the topOfTree element from the sparkBundleJar entry.

    Code Block
    "spark-job-service": {
        "classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:/usr/lib/hdinsight-datalake/*:%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s"
      },
  4. In the spark.props section, add the following property:

    Code Block
    "spark.yarn.jars":"local:/opt/cloudera/parcels/CDH/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH/lib/spark/hive/*",
  5. Save your changes.

Use native libraries on Hortonworks 3.0 and later

To integrate with HDP 3.x, the platform must use the native Spark libraries. Please add the following properties to your configuration. 

Steps:

  1. Set the Hadoop bundle JAR to point to the one provided with your distribution. The example below points to HDP 3.0:

    Code Block
     "hadoopBundleJar": "hadoop-deps/hdp-3.0/build/libs/hdp-3.0-bundle.jar"
  2. Enable use of native libraries:

     

    Code Block
    "spark.useVendorSparkLibraries": true,
  3. Set the path to the Spark bundle JAR:

    Code Block
    "sparkBundleJar": "/usr/hdp/current/spark2-client/jars/*"
  4. Add the Spark bundle JAR to the Spark Job Service classpath (spark-job-service.classpath). Example:

    Code Block
    "spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s"
  5. The following property needs to be added or updated in spark.props. If there are other values in this property, the following value must appear first in the list:

    Code Block
    "spark.executor.extraClassPath": "/usr/hdp/current/spark2-client/jars/guava-14.0.1.jar"
  6. The following property needs to be added or updated in spark.props. It does not need to appear in a specific order:

    Code Block
    "spark.yarn.jars": "local:/usr/hdp/current/spark2-client/jars/*"
  7. Save your changes and restart the platform.

Modify Spark version and Java JDK version

The

D s platform
defaults to using Spark 2.3.0. Depending on the version of your Hadoop distribution, you may need to modify the version of Spark that is used by the platform.

In the following table, you can review the Spark/Java version requirements for the

D s node
 hosting
D s product
.

To change the version of Spark in use by the

D s platform
, you change the value of the spark.version property, as listed below. No additional installation is required.
Info

NOTE: If you are integrating with an EMR cluster, the version of Spark to configure for use depends on the version of EMR. Additional configuration is required. See Configure for EMR.

Additional requirements:

  • The supported cluster must use Java JDK 1.8.

  • If the platform is connected to an EMR cluster, you must set the local version of Spark (spark.version property) to match the version of Spark that is used on the EMR cluster. 
D s config

Spark 2.3.0

Tip

Tip: This version of Spark is required for HDP 3.x.


Required Java JDK Version: Java JDK 1.8

Spark for

D s product

Code Block
"spark.version": "2.3.0",

Spark 2.4.4

Info

NOTE: For Azure Databricks, you must provide the specific version of Spark through a different configuration parameter. See Configure for Azure Databricks.


Required Java JDK Version: Java JDK 1.8

Spark for

D s product

Code Block
"spark.version": "2.4.4",
Info

NOTE: For Spark 2.4.0 and later, please verify that the following is set:

Code Block
"spark.useVendorSparkLibraries": true,

This configuration is not required for integrating with EMR.


Additional Configuration

Restart Platform

You can restart the platform now. See Start and Stop the Platform.

Verify Operations

At this point, you should be able to run a job in the platform, which launches a Spark execution job and a profiling job. Results appear normally in the

D s webapp
.

Steps:

To verify that the Spark running environment is working:

  1. After you have applied changes to your configuration, you must restart services. See Start and Stop the Platform.
  2. Through the application, run a simple job, including visual profiling. Be sure to select Spark as the running environment.
  3. The job should appear as normal in the Job Status page.
  4. To verify that it ran on Spark, open the following file:
    /opt/trifacta/logs/batch-job-runner.log
  5. Search the log file for a SPARK JOB INFO block with a timestamp corresponding to your job execution.
  6. See below for information on how to check the job-specific logs.
  7. Review any errors.
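Steps 4 and 5 can be performed from the command line. The log path assumes a default /opt/trifacta installation:

```
grep -n "SPARK JOB INFO" /opt/trifacta/logs/batch-job-runner.log | tail -n 5
```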

For more information, see Verify Operations.

Logs

Service logs:

Logs for the Spark Job Service are located in the following location:

/opt/trifacta/logs/spark-job-service.log

Additional log information on the launching of profile jobs is located here:

/opt/trifacta/logs/batch-job-runner.log

Job logs:

When profiling jobs fail, additional log information is written to the following:

/opt/trifacta/logs/jobs/<job-id>/spark-job.log

Troubleshooting

Below is a list of common errors in the log files and their likely causes.

Problem - Spark job succeeds on the cluster but is reported as failed in the application. Spark Job Service repeatedly crashes.

Whenever a Spark job is executed, it is reported back as having failed. On the cluster, the job appears to have succeeded. However, in the Spark Job Service logs, the Spark Job Service cannot find any of the applications that it has submitted to resource manager.

In this case, the root problem is that Spark is unable to delete temporary files after the job has completed execution. During job execution, a set of ephemeral files may be written to the designated temporary directory on the cluster, which is typically /trifacta/tempfiles. In most cases, these files are removed transparently to the user.

  • This location is defined in the hdfs.pathsConfig.tempFiles parameter in 
    D s triconf
    .

In some cases, those files may be left behind. To account for this accumulation in the directory, the 

D s platform
 performs a periodic cleanup operation to remove temp files that are over a specified age. 

  • The age in days is defined in the job.tempfiles.cleanup.age parameter in 
    D s triconf
    .

This cleanup operation can fail if HDFS is configured to send Trash to an encrypted zone. The HDFS API does not support the skipTrash option, which is available through the HDFS CLI. In this scenario, the temp files are not successfully removed, and the files continue to accumulate without limit in the temporary directory. Eventually, this accumulation of files can cause the Spark Job Service to crash with Out of Memory errors.

Solution

The following are possible solutions:

  1. Solution 1: Configure HDFS to use an unencrypted zone for Trash files.
  2. Solution 2: 
    1. Disable temp file cleanup in 

      D s triconf
      :

      Code Block
      "job.tempfiles.cleanup.age": 0,


    2. Clean up the tempfiles directory using an external process.

Problem - Spark Job Service fails to start with a "Jackson version is too old 2.5.3" error.

The Spark Job Service fails to start with an error similar to the following in the spark-job-service.log file:

Code Block
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.5.3

Solution

Some versions of the hadoopBundleJar contain older versions of the Jackson dependencies, which break the spark-job-service.

To ensure that the spark-job-service is provided the correct Jackson dependency versions, the sparkBundleJar must be listed before the hadoopBundleJar in the spark-job-service.classpath, which is inserted as a parameter in

D s triconf
. Example:

Code Block
"spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s"

Problem - Spark jobs fail with "Unknown message type: -22" error

Spark jobs may fail with the following error in the YARN application logs:

Code Block
ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalArgumentException: Unknown message type: -22
at org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:67)

Solution

This problem may occur if Spark authentication is disabled on the Hadoop cluster but enabled in the 

D s platform
. Spark authentication must match on the cluster and the platform. 

Steps:

  1. D s config
  2. Locate the spark.props entry. 
  3. Insert the following property and value:

    Code Block
    "spark.authenticate": "false"
  4. Save your changes and restart the platform.

Problem - Spark jobs fail when Spark Authentication is enabled on the Hadoop cluster

When Spark authentication is enabled on the Hadoop cluster, Spark jobs can fail. The YARN log file message looks something like the following:

Code Block
17/09/22 16:55:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, example.com, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)

Solution

When Spark authentication is enabled on the Hadoop cluster, the 

D s platform
 must also be configured with Spark authentication enabled.

  1. D s config
  2. Inside the spark.props entry, insert the following property value:

    Code Block
    "spark.authenticate": "true"
  3. Save your changes and restart the platform.

Problem - Job fails with "Required executor memory MB is above the max threshold MB of this cluster" error

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log:

Code Block
Required executor memory (6144+614 MB) is above the max threshold (1615 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Solution

The per-container memory allocation in Spark (spark.executor.memory and spark.driver.memory) must not exceed the YARN thresholds. See Spark tuning properties above.

Problem - Job fails with ask timed out error

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

Code Block
Job submission failed
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://SparkJobServer/user/ProfileLauncher#1213485950]] after [20000 ms]

There is a 20-second timeout on the attempt to submit a profiling job to YARN. If the initial upload of the Spark libraries to the cluster takes longer than 20 seconds, the Spark Job Service times out and returns an error to the UI. However, the libraries do finish uploading successfully to the cluster.

The library upload is a one-time operation for each install/upgrade. Despite the error, the libraries are uploaded successfully the first time. This error does not affect subsequent profiler job runs.

Solution:

Try running the job again.

Problem - Spark fails with ClassNotFound error in Spark job service log

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

Code Block
java.lang.ClassNotFoundException: com.trifacta.jobserver.profiler.Profiler

By default, the Spark job service attempts to optimize the distribution of the Spark JAR files across the cluster. This optimization involves a one-time upload of the spark-assembly and profiler-bundle JAR files to HDFS. Then, YARN distributes these JARs to the worker nodes of the cluster, where they are cached for future use.

In some cases, the localized JAR files can get corrupted on the worker nodes, causing this ClassNotFound error to occur.

Solution:

The solution is to disable this optimization through platform configuration.

Steps:

  1. D s config
  2. Locate the spark-job-service configuration node.
  3. Set the following property to false:

    Code Block
    "spark-job-service.optimizeLocalization" : false
  4. Save your changes and restart the platform.

Problem - Spark fails with PermGen OutOfMemory error in the Spark job service log

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

 

Code Block
Exception in thread "LeaseRenewer:trifacta@nameservice1" java.lang.OutOfMemoryError: PermGen space

Solution:

The solution is to configure the PermGen space for the Spark driver:

  1. D s config
  2. Locate the spark configuration node.
  3. Set the following property to the given value:

    Code Block
    "spark.props.spark.driver.extraJavaOptions" : "-XX:MaxPermSize=1024m -XX:PermSize=256m",
  4. Save your changes and restart the platform.

Problem - Spark fails with "token (HDFS_DELEGATION_TOKEN token) can't be found in cache" error in the Spark job service log on a Kerberized cluster when Impersonation is enabled

When executing a job through Spark, the job may fail with the following error in the spark-job-service.log file:

 

Code Block
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token x for trifacta) can't be found in cache

 

Solution:

The solution is to set Spark impersonation to true:

  1. D s config
  2. Locate the spark-job-service configuration node.
  3. Set the following property to the given value:

    Code Block
    "spark-job-service.sparkImpersonationOn" : true,
  4. Save your changes and restart the platform.

Problem - Spark fails with "job aborted due to stage failure" error

Issue:

Spark fails with an error similar to the following in the spark-job-service.log:

Code Block
"Job aborted due to stage failure: Total size of serialized results of 208 tasks (1025.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"

Explanation:

The spark.driver.maxResultSize parameter determines the limit of the total size of serialized results of all partitions for each Spark action (e.g. collect). If the total size of the serialized results exceeds this limit, the job is aborted.

To enable serialized results of unlimited size, set this parameter to zero (0).

Solution:


  1. D s config
  2. In the spark.props section of the file, remove the size limit by setting this value to zero:

    Code Block
    "spark.driver.maxResultSize": "0"
  3. Save your changes and restart the platform.

Problem - Spark job fails with "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'" error

Issue:

Spark job fails with an error similar to the following in either the spark-job.log or the yarn-app.log file:

Code Block
"java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'"

Explanation:

By default, the Spark running environment attempts to connect to Hive when it creates the Spark Context. This connection attempt may fail if Hive connectivity (in conf/hadoop-site/hive-site.xml) is not configured correctly on the

D s node
.

Solution:

This issue can be fixed by configuring Hive connectivity on the edge node.

If Hive connectivity is not required, the Spark running environment's default behavior can be changed as follows:

  1. D s config
  2. In the spark-job-service section of the file, disable Hive connectivity by setting this value to false:

    Code Block
    "spark-job-service.enableHiveSupport": false
  3. Save your changes and restart the platform.

Problem - Spark job fails with "No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension?" error

Issue:

Spark job fails with an error similar to the following in either the spark-job.log or the yarn-app.log file:

Code Block
java.io.FileNotFoundException: No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension?

Explanation:

By default, Spark-Avro requires all Avro files to have the .avro extension, which includes all part files in a source directory. Spark-Avro ignores any files that do not have the .avro extension.

If a directory contains part files without an extension (e.g. part-00001, part-00002), Spark-Avro ignores these files and throws the "No Avro files found" error.

Solution:

This issue can be fixed by setting the spark.hadoop.avro.mapred.ignore.inputs.without.extension property to false:

  1. D s config
  2. To the spark.props section of the file, add the following setting if it does not already exist. Set its value to false:

    Code Block
    "spark.hadoop.avro.mapred.ignore.inputs.without.extension": "false"
  3. Save your changes and restart the platform.

Problem - Spark job fails in the platform but successfully executes on Spark

Issue:

After you have submitted a job to be executed on the Spark cluster, the job may fail in the

D s platform
after 30 minutes. However, on a busy cluster, the job remains enqueued and is eventually collected and executed. Since the job was canceled in the platform, results are not returned.

Explanation:

This issue is caused by a timeout setting for Batch Job Runner, which cancels management of jobs after a predefined number of seconds. Since these jobs are already queued on the cluster, they may be executed independent of the platform.

Solution:

This issue can be fixed by increasing the Batch Job Runner Spark timeout setting:

  1. D s config
  2. Locate the following property. By default, it is set to 172800, which is 48 hours:

    Code Block
    "batchserver.spark.progressTimeoutSeconds": 172800,
  3. If your value is lower than the default, you can increase this value so that it is high enough for your job to succeed.

  4. Save your changes and restart the platform.
  5. Re-run the job.