By default, the application applies its Spark configuration at the global level. All jobs submitted to the connected instance of Spark use the same set of Spark properties and settings. As needed, administrators can modify this set of properties through the Admin console.

Optionally, flow owners can configure overrides to the default Spark properties at the output object level.

User-specific Spark overrides: If you have enabled user-specific overrides for Spark jobs, those settings take precedence over the settings that are applied through this feature. For more information, see Configure User-Specific Props for Cluster Jobs.
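The precedence described above can be thought of as a layered merge: global defaults apply first, then any output-object overrides, then any user-specific overrides. The following minimal Python sketch is illustrative only; the property values shown are hypothetical, not actual platform settings.

```python
# Illustrative only: resolve effective Spark properties by precedence.
# Later layers win: global defaults < output-object overrides < user-specific overrides.
global_defaults = {"spark.driver.memory": "2g", "spark.executor.memory": "2g"}
output_overrides = {"spark.executor.memory": "4g"}   # set on the output object
user_overrides = {"spark.driver.memory": "6g"}       # user-specific overrides, if enabled

effective = {**global_defaults, **output_overrides, **user_overrides}
print(effective)
# {'spark.driver.memory': '6g', 'spark.executor.memory': '4g'}
```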

Limitations

This feature allows administrators to enable the passthrough of properties to Spark; users can then submit any value for an enabled property. Choose the properties that you enable for user override with care.

Property validation: No validation of property values is performed against possible values or the connected running environment.

Default Overrides

When this feature is enabled, the following properties are available for users to override at job execution time with their preferred values.

NOTE: These properties are always available for override when the feature is enabled.


| Spark parameter | Description |
|---|---|
| spark.driver.memory | Amount of RAM in GB on each Spark node that is made available to the Spark driver. |
| spark.executor.memory | Amount of RAM in GB on each Spark node that is made available to the Spark executors. |
| spark.executor.cores | Number of cores on each Spark executor that is made available to Spark. |
| transformer.dataframe.checkpoint.threshold | When checkpointing is enabled, the Spark DAG is checkpointed after approximately this number of expressions has been added to the DAG. |

Checkpointing assists in managing the volume of work that is processed through Spark at one time; by checkpointing after a set of steps, the application can reduce the chances of execution errors for your jobs.

By raising this number:

  • You increase the upper limit of steps between checkpoints.
  • You may reduce processing time.
  • You may increase the number of job failures.
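To make the memory and core properties in the table above concrete, the following rough sizing sketch shows how executor memory and cores constrain how many executors fit on a worker node. The node capacities shown are hypothetical examples, not recommendations, and the calculation ignores Spark memory overhead.

```python
# Rough, illustrative sizing check for the default override properties.
# Node capacity values below are hypothetical examples.
node_memory_gb = 64           # RAM available to Spark on each worker node
node_cores = 16               # cores available to Spark on each worker node

spark_executor_memory_gb = 8  # spark.executor.memory (in GB)
spark_executor_cores = 4      # spark.executor.cores

# Executors per node are limited by whichever resource runs out first.
by_memory = node_memory_gb // spark_executor_memory_gb
by_cores = node_cores // spark_executor_cores
executors_per_node = min(by_memory, by_cores)

print(f"Executors per node: {executors_per_node} "
      f"(memory allows {by_memory}, cores allow {by_cores})")
```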


Spark jobs on Azure Databricks:

For Spark jobs executed on Azure Databricks, only the following default override parameters are supported:

| Spark parameter | Description |
|---|---|
| transformer.dataframe.checkpoint.threshold | See above. |

During Spark job execution on Azure Databricks:

Overrides for an Azure Databricks cluster must be applied at the time of cluster creation. As a result, a new Azure Databricks cluster is spun up for the job execution, which may cause the following:

  • Delay in job execution while the cluster is spun up.
  • Increased usage and costs.
  • After a new Azure Databricks cluster has been created with the updated Spark properties for the job, any existing clusters complete their in-progress jobs and then gracefully terminate, based on the idle timeout setting for Azure Databricks clusters.

For more details on setting these parameters, see Tune Cluster Performance.

Enable

Workspace administrators can enable Spark job overrides.

Steps:

  1. Log in to the application as a workspace administrator.
  2. Locate the following parameter:

    Enable Custom Spark Options Feature
  3. Set this parameter to Enabled.

Configure Available Parameters to Override

After enabling the feature, workspace administrators can define the Spark properties that are available for override. 

Steps:

  1. Log in to the application as a workspace administrator.
  2. Locate the following parameter:

    Spark Whitelist Properties
  3. Enter a comma-separated list of Spark properties (a helper sketch for assembling this value follows these steps). For example, the entry to add the following two properties looks like this:

    spark.driver.extraJavaOptions,spark.executor.extraJavaOptions
  4. Save your changes.
  5. After saving, reload the page so that your changes take effect.
  6. When users configure ad-hoc or scheduled Spark jobs, the properties above are available for override, in addition to the default override properties.
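If you maintain the whitelist outside of the application, a small helper can assemble the comma-separated value and catch common formatting mistakes such as stray whitespace or duplicate entries. This is an optional convenience sketch; the property names shown are examples only.

```python
# Build a clean comma-separated value for the "Spark Whitelist Properties" setting.
# Illustrative helper only.
def build_whitelist(properties):
    cleaned = []
    for prop in properties:
        prop = prop.strip()
        if prop and prop not in cleaned:   # drop blanks and duplicates
            cleaned.append(prop)
    return ",".join(cleaned)

print(build_whitelist([
    "spark.driver.extraJavaOptions",
    " spark.executor.extraJavaOptions ",   # stray whitespace is trimmed
]))
# spark.driver.extraJavaOptions,spark.executor.extraJavaOptions
```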

Apply Overrides

Overrides are applied to output objects associated with a flow.

Run Job page

When you configure an on-demand job to run:

  1. Select the output object in the flow. Then, click Run.
  2. In the Run Job page, select Spark for the Running Environment.
  3. Click the Advanced environment options caret.
  4. The Spark Execution Properties are available for override. 

    NOTE: No validation of the property values is performed against possible values or the connected running environment.

For more information, see Spark Execution Properties Settings.

For scheduled jobs

When you are scheduling a job:

  1. Select the output object in the flow. In the context panel, select the Destinations tab.
  2. Under Scheduled destinations, click Add to add a new destination or Edit to modify an existing one.
  3. In the Scheduled Publishing settings page, select Spark for the Running Environment.
  4. Click the Advanced environment options caret.
  5. The Spark Execution Properties are available for override. 

    NOTE: No validation of the property values is performed against possible values or the connected running environment.

For more information, see Spark Execution Properties Settings.

Via API

You can submit Spark property overrides as part of the request body for an output object. See API Workflow - Manage Outputs.
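As an illustration only, the following Python sketch shows how Spark property overrides might be included in the request body when updating an output object. The endpoint path, payload field names (for example, sparkOptions), and authentication shown here are assumptions, not the documented API schema; consult API Workflow - Manage Outputs for the actual request format.

```python
# Illustrative sketch only: the endpoint path and payload field names below are
# assumptions, not the documented API schema. See API Workflow - Manage Outputs.
import requests

BASE_URL = "https://example.com/v4"   # hypothetical base URL
OUTPUT_OBJECT_ID = 12345              # hypothetical output object ID
TOKEN = "<access-token>"

payload = {
    # Hypothetical field carrying Spark property overrides for this output object.
    "sparkOptions": [
        {"key": "spark.executor.memory", "value": "4g"},
        {"key": "transformer.dataframe.checkpoint.threshold", "value": "100"},
    ]
}

response = requests.patch(
    f"{BASE_URL}/outputObjects/{OUTPUT_OBJECT_ID}",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print(response.status_code)
```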