
...

BigQuery is a scalable cloud data warehouse integrated with the Google Cloud Platform for storage of a wide range of datasets. In some use cases, your transformation jobs can be executed completely in BigQuery. If all of your source datasets and outputs are in BigQuery locations, then transferring the execution steps from the D s node to BigQuery yields the following benefits:

...

  • This feature must be enabled by the project owner. See Configure Running Environments.
  • The permission to execute jobs in BigQuery must be enabled. In most environments, it is enabled by default. For more information, see Required Dataprep User Permissions.
  • In your flow, you must enable all general and BigQuery-specific flow optimizations. When all of these optimizations are enabled, the job can be pushed down to BigQuery for execution. For more information, see Flow Optimization Settings Dialog.

If the requirements and limitations are met, the D s webapp automatically executes the job in BigQuery.

Limitations

BigQuery as a running environment requires that pushdowns be enabled for the project and for the specific flow for which the job is executed. If the flow and the project are properly configured, the job is automatically executed in BigQuery.

...

  • All datasources and all outputs specified in a job must be located within BigQuery.
  • D s dataflow must be selected as the running environment.
  • Custom SQL datasets are not supported.
  • All recipe steps, including all D s lang functions in the recipe, must be translatable to SQL.

    Info

    NOTE: When attempting to execute a job in BigQuery, the D s webapp executes each recipe in BigQuery until it reaches a step that cannot be executed there. At that point, data is transferred to D s dataflow, where the remainder of the job is executed.

  • Some transformations and functions are not currently supported for execution in BigQuery. See below.
  • Upserts, merges, and deletes are not supported for full execution in BigQuery.
  • Sampling jobs are not supported for execution in BigQuery.
  • If your recipe includes data quality rules, the job cannot be fully executed in BigQuery.
Info

NOTE: In the new BigQuery console, you may not be able to see the SQL that was used to execute your job in BigQuery. This is a known issue with the new BigQuery console.

  • In BigQuery, escaped whitespace characters (\s) match a broader set of Unicode space characters than in D s dataflow, due to differences in the implementation of regular expressions between the two running environments. Depending on your dataset, this difference may result in mismatches between rows in your results when running the same job across different running environments. See the illustration after this list.
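
The following is a minimal illustration of this class of difference, using Python's re module purely as a stand-in; it makes no claim about the exact character sets matched by either running environment. If consistent matching across running environments matters for your data, an explicit ASCII character class such as [ \t\r\n] is interpreted the same way everywhere.

    import re

    # A value containing a regular space and a non-breaking space (U+00A0).
    value = "a b\u00A0c"

    # A Unicode-aware \s (Python's default) also matches the non-breaking space.
    print(re.findall(r"\s", value))          # [' ', '\xa0']

    # An explicit ASCII class matches only the characters listed, so its
    # behavior does not depend on the engine's Unicode handling.
    print(re.findall(r"[ \t\r\n]", value))   # [' ']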

D s storage File Support

D s ed editions: gdpent, gdppro, gdppr

In addition to BigQuery sources, you can execute jobs in BigQuery on source files from D s storage.

Tip

Tip: The BigQuery running environment also supports hybrid sources, so you can use D s storage files and BigQuery tables as sources in the same flow.

Requirements:

  • Publishing actions must be defined with a target in BigQuery.
  • External views must be enabled in BigQuery. External views are used to query GCS files; a sketch of the underlying mechanism appears after this list. For more information, see Required Dataprep User Permissions.
  • In the D s webapp, the following flow optimization settings must be enabled at the flow level:
    • BigQuery optimization
    • Full execution for GCS file
  • In the Run Job page, the D s dataflow + BigQuery running environment must be selected. For more information, see Run Job Page.
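
For background on the external-view requirement above, here is a minimal sketch of the underlying mechanism: defining a BigQuery external table over a CSV file in GCS with the google-cloud-bigquery Python client. The D s webapp creates and manages the external views for you when this feature is enabled, so this is illustration only; the project, dataset, bucket, and table names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # hypothetical project

    # Describe the external data source: CSV files in Cloud Storage.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/sales/*.csv"]   # hypothetical bucket/path
    external_config.autodetect = True                              # infer the schema
    external_config.options.skip_leading_rows = 1                  # skip the header row

    # Register it as an external table; queries read the GCS files directly.
    table = bigquery.Table("my-project.my_dataset.sales_external")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # The external table can now be queried like any other BigQuery table.
    rows = client.query(
        "SELECT COUNT(*) AS n FROM `my-project.my_dataset.sales_external`"
    ).result()
    print(list(rows))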

Supported file formats from D s storage:

  • CSV. CSV files that fail to meet the following requirements may cause job failures when executed in BigQuery, even though they can be imported into D s product. Requirements (a minimal example of producing such a file appears after this list of formats):
    • For job execution of CSV files in BigQuery, source CSV files must be well-formatted.
    • Newlines must be inserted.
    • Fields must be demarcated with quotes and commas.
      • Quotes in field values must be escaped with quotes when needed ("").
    • Each row must have the same number of columns.
    • For more information, see https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv.

  • TSV
  • JSON (newline-delimited)
  • TXT
  • LOG
  • Compressed Files (gz and bz)
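
A minimal sketch of producing a CSV file that satisfies the CSV requirements above, using Python's csv module. The settings shown (quoted, comma-separated fields, doubled quotes, a newline per record, and a consistent column count) correspond to the listed requirements; the file name and columns are hypothetical.

    import csv

    # Hypothetical rows; every row has the same columns.
    rows = [
        {"id": "1", "name": 'Acme "West"', "amount": "19.99"},
        {"id": "2", "name": "Bravo", "amount": "5.00"},
    ]

    # Hypothetical output file. Fields are quoted and comma-separated,
    # embedded quotes are doubled (""), and each record ends with a newline.
    with open("sales.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["id", "name", "amount"],
            quoting=csv.QUOTE_ALL,    # demarcate fields with quotes and commas
            doublequote=True,         # escape " inside a field as ""
            lineterminator="\n",      # newline between records
        )
        writer.writeheader()
        writer.writerows(rows)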

Supported file encodings:

  • UTF-8
  • ISO-8859-1

Supported delimiters:

  • Comma
  • Tab
  • Pipe

Supported quote characters:

  • No quotes
  • Double quotes

Unsupported D s lang for BigQuery Execution

...

Info

NOTE: If your recipe contains any of the following transformations or functions, full job execution in BigQuery is not possible at this time. These transformations and functions are expected to be supported in future releases, at which point they will be removed from this list.

General limitations

  • Regex patterns used must be valid RE2. Operations on non-RE2 regex patterns are not pushed down. A quick way to check a pattern is sketched after this list.
  • For more information on limitations on specific push-downs, see Flow Optimization Settings Dialog.

Unsupported data types

...

Unsupported transformations

The following transformations are not supported for execution in BigQuery.

Legend:

  • Search term: the value you enter in the Transform Builder
  • Transform: name of the underlying transform

...

For more information, see Transformation Reference.

Unsupported functions

The following D s lang functions are not currently supported for execution in BigQuery.

...

KTHLARGEST
KTHLARGESTIF
KTHLARGESTUNIQUE
KTHLARGESTUNIQUEIF
LIST
LISTIF
MODE
MODEIF
UNIQUE
QUARTILE
APPROXIMATEMEDIAN
APPROXIMATEPERCENTILE
APPROXIMATEQUARTILE

For more information, see Aggregate Functions.


Math functions

LCM
NUMVALUE

Partially supported:

NUMFORMAT: Only supported when used for rounding.

For more information, see Math Functions.

Date functions

...

NETWORKDAYS
NETWORKDAYSINTL
MODEDATE
WORKDAY
WORKDAYINTL
CONVERTFROMUTC
CONVERTTOUTC
CONVERTTIMEZONE
MODEDATEIF
KTHLARGESTDATE
KTHLARGESTUNIQUEDATE
KTHLARGESTUNIQUEDATEIF
KTHLARGESTDATEIF
EOMONTH
SERIALNUMBER

Partially supported:

DATEDIF: Only day, hour, minute, second and millisecond are supported as units.

For more information, see Date Functions.

String functions

...

PROPER
REMOVESYMBOLS
RIGHTFIND
EXACT
STRINGGREATERTHAN
STRINGGREATERTHANEQUAL
STRINGLESSTHAN
STRINGLESSTHANEQUAL
DOUBLEMETAPHONE
DOUBLEMETAPHONEEQUALS
TRANSLITERATE

For more information, see String Functions.

Nested functions

ARRAYCONCAT
ARRAYCROSS
ARRAYINTERSECT
ARRAYLEN
ARRAYSTOMAP
ARRAYUNIQUE
ARRAYZIP
FILTEROBJECT
KEYS
ARRAYELEMENTAT
LISTAVERAGE
LISTMAX
LISTMIN
LISTMODE
LISTSTDEV
LISTSUM
LISTVAR
ARRAYSORT
ARRAYINDEXOF
ARRAYMERGEELEMENTS
ARRAYRIGHTINDEXOF
ARRAYSLICE

For more information, see Nested Functions.

Window functions

SESSION

For more information, see Window Functions.

Other functions

IPTOINT
IPFROMINT

For more information, see Other Functions.

Verify Execution

To verify execution in BigQuery, please do the following:

Steps:

  1. In the left nav bar, click the Jobs link. 
  2. In the Jobs page, select the job that you executed. 
  3. In the Overview tab, the value for Environment under the Execution summary should be: BigQuery.

For more information, see Job Details Page.
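
As a supplementary check on the BigQuery side (useful because, as noted above, the new BigQuery console may not show the SQL that was executed), you can list recent BigQuery jobs in the project. The sketch below uses the google-cloud-bigquery Python client, assumes you have permission to list jobs, and uses a hypothetical project name.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # hypothetical project

    # List the most recent BigQuery jobs in the project; query jobs created
    # around the time of your run correspond to the pushed-down execution.
    for job in client.list_jobs(max_results=10):
        print(job.job_id, job.job_type, job.state, job.created)
        if job.job_type == "query":
            # The SQL text that BigQuery actually executed.
            print(job.query[:200])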