Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r100

...

D s ed
editionsgdpent,gdppro,gdpsta,gdppr,gdpst

Excerpt

BigQuery is a scalable cloud data warehouse integrated with the Google Cloud Platform for storage of a wide range of datasets. In some use cases, your transformation jobs can be executed completely in BigQuery. If all of your source datasets and outputs are in BigQuery locations, then transferring the execution steps from the 

D s node
 to BigQuery yields the following benefits:


  • A minimum of data (recipe steps and associated metadata) is transferred between systems. Everything else remains in BigQuery.
  • Recipe steps are converted into SQL that is understandable and native to BigQuery. Execution times are much faster.
  • Depending on your environment, total cost of executing the job may be lower in BigQuery.

...

Info

NOTE: BigQuery is not a running environment that you explicitly select or specify as part of a job. If all of the requirements are met, then the job is executed in BigQuery when you select

D s dataflow
.

Info

NOTE: Datasources that require conversion are not supported for execution inn BigQuery.


  • For any job to be executed in BigQuery, all :
    • All datasources must be located in BigQuery or
     
    • D s storage
      .
    • If the output is a file published to
      D s storage
      ,
    and all outputs must be located within BigQuery.
    • then inputs must be in BigQuery.
  • BigQuery does not publish CSV files with quotes. For example, when a column value is empty in transformer, BigQuery publishes it as an empty string.
  • If a column string value has quotes or a delimiter, BigQuery encloses that string value with double quotes for CSV and JSON files.
  • BigQuery does not write column values that are empty or null while publishing to JSON format.
  • D s dataflow
     must be selected as running environment.
  • All recipe steps, including all 

    D s lang
     functions in the recipe, must be translatable to SQL. 

    Info

    NOTE: When attempting to execute a job in BigQuery,

    D s webapp
    executes each recipe in BigQuery, until it reaches a step that cannot be executed there. At that point, data is transferred to
    D s dataflow
    , where the remainder of the job is executed.

  • BigQuery imposes a limit of 1 MB for all submitted SQL queries. If this limit is exceeded during job execution, 

    D s product
     falls back to submitting the job through 
    D s dataflow
    .

  • If the schemas have changed for your datasets, pushdown execution on BigQuery is not supported. 

    D s product
     falls back to submitting the job through 
    D s dataflow
    .

  • Some transformations and functions are not currently supported for execution in BigQuery. See below.Upserts, merges, and deletes are not supported for full execution in BigQuery.
  • Sampling jobs are not supported for execution in BigQuery.
  • If your recipe includes data quality rules, the job cannot be fully executed in BigQuery.
  • BigQuery does not permit partitioned tables to be replaced. As a result, the Drop and Load publishing action is not supported when writing to a partitioned table during BigQuery execution. For more information, see https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement. 
  • In BigQuery, escaped whitespace characters (\s) match a broader set of Unicode space characters than 
    D s dataflow
    , due to differences in implementation of regular expressions between the two running environments. Depending on your dataset, this difference may result in mismatches between rows in your results when running the same job across different running environments.
  • Some uncommon date formats are not supported for pushdown.
  • Publication of complex arrays to BigQuery is not supported for jobs executed in the BigQuery running environment. To publish these arrays as non-String values, you must disable all flow optimizations and run the job in 
    D s dataflow
    .

D s storage
 File Support

D s ed
editionsgdpent,gdppro,gdppr

...

  • CSV. CSV files that fail to meet the following requirements may cause job failures when executed in BigQuery, even though they can be imported into

    D s product
    . Requirements:

    • For job execution of CSV files in BigQuery, source CSV files must be well-formatted.

    • Newlines must be inserted. 

    • Fields must be demarcated with quotes and commas.

      • Quotes in field value must be escaped with quotes when needed ("").

    • Each row must have the same number of columns.
    • For more information, see  https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv .

  • TSV
  • JSON (newline-delimited)
  • TXT
  • LOG
  • Compressed Files (gz and bz)

    Info

    NOTE:  Snappy and bz2 file formats are not supported for pushdown execution in BigQuery. When these file formats are encountered as datasources, the job automatically reverts to run on 

    D s dataflow
    .


Supported file encodings:

...

  • Regex patterns used must be valid RE2. Operations on non-RE2 regex patterns are not pushed down.
  • Source metadata references such as $rownumber  and $filepath  are not supported for pushdown.
  • For more information on limitations on specific push-downs, see Flow Optimization Settings Dialog.

...

NETWORKDAYS
NETWORKDAYSINTL
MODEDATE
WORKDAY
WORKDAYINTL
MODEDATEIF
KTHLARGESTDATE
KTHLARGESTUNIQUEDATE
KTHLARGESTUNIQUEDATEIF
KTHLARGESTDATEIF
EOMONTH
SERIALNUMBER

Partially supported:

DATEFORMAT: Some uncommon formatting options are not supported for pushdown.

For more information, see Date Functions.

String functions

RIGHTFIND
EXACT
STRINGGREATERTHAN
STRINGGREATERTHANEQUAL
STRINGLESSTHAN
STRINGLESSTHANEQUAL
DOUBLEMETAPHONEEQUALS
TRANSLITERATE

...

  1. In the left nav bar, click the Jobs link. 
  2. In the Jobs Job History page, select the job that you executed. 
  3. In the Overview tab, the value for Environment under the Execution summary should be: BigQuery.

For more information, see Job Details Page.

D s also
labelrunning_environment