...
NOTE: Column names in custom SQL statements are case-sensitive. Case mismatches between the SQL statement and your datasource can cause jobs to fail.
- SQL statements are stored as part of the query instance for the object. If the same query is used by multiple users over private connections, the SQL must be shared with and entered by each individual user.
- SQL statements must be valid for the syntax of the target relational system. Syntax examples are provided below.
- If you modify the custom SQL statement when reading from a source, all samples generated based on the previous SQL are invalidated.
- Declared variables are not supported.
- For each SQL statement, all columns must have an explicit name. Example:
Function references such as:
Code Block
UPPER(col)
must be specified as:
Code Block
UPPER(col) as col_name
When using custom SQL to read from a Hive view, the results of a nested function are saved to a temporary name, unless explicitly aliased.
- If aliases are not used, the temporary column names can cause jobs to fail, particularly on Spark. See the example after this list.
- For more information, see Using Hive.
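As an illustration, the following sketch shows explicit aliasing of a nested function reference when reading from a Hive view; the view and column names (my_hive_view, name, id) are hypothetical:
Code Block
SELECT
  UPPER(TRIM(name)) AS name_clean,
  id AS id
FROM my_hive_view
With explicit aliases, the resulting column names remain stable instead of falling back to generated temporary names.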
...
The following limitations apply to creating datasets from a single statement.
- All single-statement SQL queries must begin with a SELECT keyword.
- Selecting columns with the same name, even with "*", is not supported and generates an ambiguous column name error.
Tip: You should use fully qualified column names or proper aliasing. See Column Aliasing below.
- Users are encouraged to provide the fully qualified path to the table being used. Example:
Code Block
SELECT "id", "value" FROM "public"."my_table"
- You should use proper escaping in SQL.
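As a sketch of proper escaping, the following hypothetical Postgres-style statement double-quotes identifiers that contain spaces and doubles the single quote inside a string literal; the schema, table, and column names are placeholders:
Code Block
SELECT "id", "customer name" AS "customer_name"
FROM "public"."order history"
WHERE "note" = 'customer''s request'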
...
Locate the following setting:
Code Block
Enable custom SQL Query
SQL pushdown feature
Setting | Description
---|---
enabled | Set to true to enable the ability to create datasets using customized SQL statements. By default, this feature is enabled.
Use
To use this feature, complete the following steps.
...
- In the Library page, click Import Data.
- In the Import Data page, select a connection.
- Within your source, locate the table from which you wish to import. Do not select the table.
- Click the Preview icon to review the columns in the dataset.
Tip: You may wish to copy the database, table name, and column names to a text editor to facilitate generating your SQL statement.
- Click Create Dataset with SQL. Enter or paste your SQL statement. For an example statement, see the sketch after these steps.
Warning: Through the custom SQL interface, it is possible to enter SQL statements that can delete data, change table schemas, or otherwise corrupt the targeted database. Use this feature with caution.
Figure: Create Dataset with SQL dialog
- The customized source is added to the right panel. To re-edit the SQL, click Custom SQL.
- Complete the remaining steps to define your imported dataset.
- When the data is imported, it is altered or filtered based on your SQL statement.
- After dataset creation, you can modify the SQL, if needed. See Dataset Details Page.
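For example, a statement entered in the Create Dataset with SQL dialog might limit the imported data to specific columns and rows. The schema, table, and column names below are placeholders:
Code Block
SELECT "id", "status", "created_at"
FROM "public"."orders"
WHERE "status" = 'open'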
...
AWS Glue follows Hive syntax. See the Hive syntax notes above.
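For illustration, Hive-style syntax escapes identifiers with backticks; the database and table names below are placeholders:
Code Block
SELECT `id`, `value` FROM `default`.`my_table`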
Column Aliasing
If your SELECT statement results in multiple columns with the same name, such as when selecting all columns in a JOIN, the query fails to validate or fails on execution. In these cases, columns must be properly aliased.
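As a sketch, consider a hypothetical join in which both tables have an id column; aliasing each reference produces unambiguous output column names (the schema, table, and column names are placeholders):
Code Block
SELECT
  o."id" AS order_id,
  c."id" AS customer_id,
  c."name" AS customer_name
FROM "public"."orders" o
JOIN "public"."customers" c ON o."customer_id" = c."id"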
...
Code Block
INSERT INTO "SALES"."TABLE_INVENTORY" ("ID", "AVAILABILITY") VALUES (1, 10);
SELECT * FROM "SALES"."TABLE_INVENTORY"
Troubleshooting
Snowflake
Selecting time zone data returns null values in profiling and fails in publishing
When you import a column from Snowflake that contains time zone information, you may see the following behavior:
- Sampled data appears to import correctly into the Transformer page for the TIMESTAMP-based column.
- When a job is run, the visual profile for the output column based on this data indicates null values.
- When the data is published back to Snowflake, the publishing job fails.
The above issue is caused by the following:
- When data is imported into the Transformer page, it is automatically converted to the UTC time zone during the JDBC ingestion step used to display the sample in the application. This ingestion process is invoked by the application, but the conversion itself is outside of the application's control.
- During this ingestion process, some Datetime values in the sample are auto-recognized and converted to UTC for display.
- Example: You design a recipe step to parse the following Datetime format, which has been auto-converted to UTC: 2020-10-11 12:13:14
- When a job is run:
- The application instructs Snowflake to unload the entire dataset from Snowflake and write it to the target location, bypassing this automatic conversion process.
- The recipe that was created to handle the data in the sample does not properly handle the data that is directly unloaded from Snowflake.
- In the previous example: The Datetime parsing in your recipe may receive an input that looks very different from what you parsed in the displayed sample: 2020-10-11 14:13:14 CEST
Solution:
For a timestamp with a time zone, you must wrap your reference to it as follows:
Code Block
TO_TIMESTAMP(CONVERT_TIMEZONE('UTC', <timestamp_column_or_function>))
Suppose your query was the following:
Code Block
SELECT *, CURRENT_TIMESTAMP() AS current_time FROM MY_TABLE;
To address this issue, the query needs to be rewritten as follows:
Code Block
SELECT *, TO_TIMESTAMP(CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP())) AS current_time FROM MY_TABLE;
When the above wrapper function is applied, the data is imported normally and is validated and published as expected.
Running job on an empty Snowflake dataset fails
If you run a job on a 0-row dataset that is sourced from Snowflake, the job execution fails.
- The underlying issue is that when the 0-row dataset is unloaded from Snowflake to S3, no file is created. Therefore, there is no dataset object to wrangle.
- For more information, see https://community.snowflake.com/s/question/0D50Z00009JKB9gSAH/wanted-unload-empty-file-with-header
Solution:
The solution is to union the empty dataset with a row of empty values. Example:
Code Block
SELECT col1, col2 FROM empty_table
UNION ALL
-- The second SELECT has no FROM clause, so it always contributes one placeholder row; use literals that match the column types of empty_table.
SELECT '' AS col1, '' AS col2;
The inserted row of empty values ensures that at least one row is unloaded, which prevents the job from failing.