You can create a connection to Databricks Tables from the platform. This section describes how to create connections of this type.
Azure: The platform must be installed on Azure and integrated with an Azure Databricks cluster.
See Configure for Azure Databricks.
NOTE: For job execution on Spark, the connection must use the Spark instance on the Azure Databricks cluster. No other Spark instance is supported. You can run jobs from this connection through the Photon running environment. For more information, see Running Environment Options.
AWS: The platform must be installed on AWS and integrated with an AWS Databricks cluster.
See Configure for AWS Databricks.
NOTE: For job execution on Spark, the connection must use the Spark instance on the AWS Databricks cluster. No other Spark instance is supported. You can run jobs from this connection through the Photon running environment. For more information, see Running Environment Options.
This connection interacts with Databricks Tables through the Hive metastore that has been installed in the Databricks cluster.
NOTE: External Hive metastores are not supported.
Each user must insert a Databricks Personal Access Token into the user profile. For more information, see Databricks Settings Page.
To enable Databricks Tables connections, please complete the following:
NOTE: Typically, you need only one connection to Databricks Tables, although you can create multiple connections.
NOTE: This connection is created with SSL automatically enabled.
Steps:
Locate the following parameter and set it to true:
"feature.databricks.connection.enabled": true,
To allow for direct publishing of job results to Databricks tables from the Run Job page, you must enable the following parameters. For more information on these settings, see Databricks Tables Table Settings.
Parameter | Description |
---|---|
feature.databricks.enableDeltaTableWrites | Set this value to true to enable users to choose to write generated results to Databricks delta tables from the Run Job page. |
feature.databricks.enableExternalTableWrites | Set this value to true to enable users to choose to write generated results to Databricks external tables from the Run Job page. |
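For example, with the connection enabled and both publishing options turned on, the relevant entries in the platform configuration might look like the following sketch. Only the three feature.databricks.* parameter names come from this page; any surrounding configuration is omitted:
"feature.databricks.connection.enabled": true,
"feature.databricks.enableDeltaTableWrites": true,
"feature.databricks.enableExternalTableWrites": true,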
Save your changes and restart the platform.
This connection can also be created via API. For details on values to use when creating via API, see Connection Types.
Please create a Databricks connection and then specify the following properties with the listed values:
NOTE: Host and port number connection information is taken from Databricks and does not need to be re-entered here.
Property | Description |
---|---|
Connect String options | Please insert any connection string options that you need. Connect String options are not required for this connection. |
Test Connection | Click this button to test the specified connection. |
Default Column Data Type Inference | Set to |
The properties that you provide are inserted into the following URL, which is used to connect to the database:
jdbc:spark://<host>:<port>/<database><connect-string-options>
Most of the connection URL is built automatically from the platform's cluster configuration.
The connect string options are optional. If you are passing additional properties and values to complete the connection, the connect string options must be structured in the following manner:
;<prop1>=<val1>;<prop2>=<val2>...
where:
<prop> : the name of the property
<val> : the value for the property
delimiters:
= : property names and values must be separated with an equal sign (=).
To enable the use of the HTTP protocol, specify the following in the connect string options:
;transportMode=http;
To enable the use of SSL for the connection, specify the following in the connect string options:
;ssl=1;
When HTTP is enabled, you can specify the path as a connect string option:
;httpPath=sql/protocolv1/o/0/xxxx-xxxxxx-xxxxxxxx;
You can specify a Databricks personal access token to use when authenticating to the database using the following connect string options:
;AuthMech=3;UID=token;PWD=<Databricks-personal-access-token>
where:
<Databricks-personal-access-token> : the personal access token of the user who is connecting to the database.
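For example, a complete connection URL that combines the options described above might look like the following sketch. The host, port, database, and httpPath values are placeholders, and the ordering of the options is illustrative only:
jdbc:spark://<host>:<port>/<database>;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0/xxxx-xxxxxx-xxxxxxxx;AuthMech=3;UID=token;PWD=<Databricks-personal-access-token>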
This connection uses the following driver:
Driver name: com.simba.spark.jdbc41.Driver
com.simba.jdbc:SparkJDBC41:2.6.11.1014
For more information, see Using Databricks Tables.
For more information on how values are converted during input and output with this database, see Databricks Tables Data Type Conversions.
For more information on error messages for this connection type, see https://kb.databricks.com/bi/jdbc-odbc-troubleshooting.html.
If you are attempting to import a table containing a large number of columns (>200), you may encounter an error message similar to the following:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 408.0 failed 4 times, most recent failure: Lost task 0.3 in stage 408.0 (TID 1342, 10.139.64.11, executor 11): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1426050. To avoid this, increase spark.kryoserializer.buffer.max value.
The problem is that the serializer ran out of memory.
Solution:
To address this issue, you can increase the Kryo serializer buffer size.
Locate the spark.props section and add the following setting. Adjust the value of 2000 (2 GB) as needed until your import succeeds:
"spark.kryoserializer.buffer.max.mb": "2000"
For more information on passing property values into Spark, see Configure for Spark.