Google Cloud Storage Access

Cloud Storage is a web service for storing and accessing files on Google Cloud Platform. The service combines the performance and scalability of the Google cloud with advanced security.

For more information on Cloud Storage, see https://cloud.google.com/storage.

Enable

A project owner does not need to enable access to Cloud Storage. Access to Cloud Storage is governed by permissions.

IAM role

The product uses an IAM role to access Google Cloud resources. The default IAM role that is assigned to each user in the project grants access to read and write data in Cloud Storage.

If you are using a custom IAM role to access Google Cloud resources, you must ensure that the role contains the appropriate permissions to read and write data in Cloud Storage.
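
As a rough illustration of checking a custom role, the sketch below compares a role's permission list against a small set of Cloud Storage object permissions. The permission names are real Cloud Storage IAM permissions, but treating this exact set as what the product requires is an assumption made for illustration; the authoritative list is in Required Dataprep User Permissions. The function name and sample role are hypothetical.

```python
# Assumption: this set stands in for "read and write data in Cloud Storage".
# Consult Required Dataprep User Permissions for the authoritative list.
ASSUMED_REQUIRED = frozenset({
    "storage.objects.get",     # read object data
    "storage.objects.list",    # list objects in a bucket
    "storage.objects.create",  # write new objects
    "storage.objects.delete",  # remove or overwrite objects
})


def missing_storage_permissions(role_permissions):
    """Return the assumed-required permissions absent from a role, sorted."""
    return sorted(ASSUMED_REQUIRED - set(role_permissions))


# Hypothetical custom role that can read but not write.
custom_role = ["storage.objects.get", "storage.objects.list"]
print(missing_storage_permissions(custom_role))
# ['storage.objects.create', 'storage.objects.delete']
```

An empty list means the role covers the assumed set. It does not confirm that the role satisfies every product requirement.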

For more information, see Required Dataprep User Permissions.

Service account

A service account is used by the product to run jobs in Dataflow. The default service account that is assigned to each user in the project is granted access to Cloud Storage.

If you are using a custom service account, you must ensure that the account contains the appropriate permissions to read and write data in Cloud Storage.

For more information, see Google Service Account Management.

Limitations

Note

The platform supports a single, global connection to Cloud Storage. All users must use this connection.

Create Cloud Storage Connection

You do not need to create a connection to Cloud Storage. It is accessible based on permissions. See above.

Create via API

You cannot create Cloud Storage connections through the APIs.

Using Cloud Storage Connections

Uses of Cloud Storage

Dataprep by Trifacta can use Cloud Storage for the following reading and writing tasks:

Upload through application: When a file is imported into Dataprep by Trifacta as a dataset, it is uploaded and stored in a location in Cloud Storage. For more information, see User Profile Page.
Creating Datasets from Cloud Storage Files: You can read source data stored in Cloud Storage. A source may be a single Cloud Storage file or a folder of identically structured files. See Reading from Sources below.
Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in Cloud Storage.
Writing Results: After a job has been executed, you can write the results back to Cloud Storage.

Note

When Dataprep by Trifacta executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.


Before you begin using Cloud Storage

Your administrator must configure read/write permissions to locations in Cloud Storage. Please see the Cloud Storage documentation.

Warning

Avoid reading and writing in the following locations:

The Scratch Area location is used by Dataprep by Trifacta for temporary storage.

The Upload location is used for storing data that has been uploaded from a local file.

For more information on these locations, see User Profile Page.
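
One way to guard against writing into these locations from a script is to test a target path against them before use. This is a minimal sketch: the bucket and folder names are hypothetical placeholders, and the actual Scratch Area and Upload locations for your account are shown on the User Profile Page.

```python
def is_under(path, location):
    """True if path equals location or lies beneath it (plain string comparison)."""
    location = location.rstrip("/")
    return path == location or path.startswith(location + "/")


def in_reserved_location(path, reserved_locations):
    """True if path falls inside any of the locations to avoid."""
    return any(is_under(path, loc) for loc in reserved_locations)


# Hypothetical values; substitute the locations from your user profile.
reserved = [
    "gs://example-bucket/user/uploads",  # assumed Upload location
    "gs://example-bucket/user/tmp",      # assumed Scratch Area location
]

print(in_reserved_location("gs://example-bucket/user/uploads/a.csv", reserved))   # True
print(in_reserved_location("gs://example-bucket/user/results/out.csv", reserved)) # False
```

The trailing slash in the comparison keeps a sibling folder such as uploads-archive from being mistaken for the Upload location.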

Limitations:

The Requestor Pays feature of Cloud Storage is not supported in Dataprep by Trifacta.

Storing Data in Cloud Storage

Your administrator should provide raw data or locations and access for storing raw data within Cloud Storage.

All Alteryx users should have a clear understanding of the folder structure within Cloud Storage, including the locations that each user can read from and write results to.
Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.

Note

Dataprep by Trifacta does not modify source data in Cloud Storage. Sources stored in Cloud Storage are read without modification from their source locations, and sources that are uploaded to the platform are stored in the designated Upload location for each user. See User Profile Page.

Show hidden:

When reading from or writing to buckets on Cloud Storage, you can optionally choose to show hidden files and folders. The names of hidden objects begin with a dot (.) or an underscore (_).

You can choose to use these files as:

Sources for imported datasets
Output targets for writing job results
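
The naming rule for hidden objects can be expressed as a short check. The sketch below treats an object as hidden if any segment of its path begins with a dot or an underscore. Applying the rule to every folder segment, and not just the file name, is an assumption. The function names and sample paths are hypothetical.

```python
def is_hidden(object_path):
    """True if any path segment starts with '.' or '_'."""
    segments = [s for s in object_path.split("/") if s]
    return any(s.startswith((".", "_")) for s in segments)


def visible_only(object_paths):
    """Filter out hidden objects, mimicking a listing with 'show hidden' off."""
    return [p for p in object_paths if not is_hidden(p)]


paths = [
    "data/sales_2024.csv",       # underscore inside the name: not hidden
    "data/_SUCCESS",             # hidden file
    "data/.profiles/run1.json",  # file inside a hidden folder
]
print(visible_only(paths))  # ['data/sales_2024.csv']
```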

Note

Hidden files and folders are typically hidden for a reason. For example, Dataprep by Trifacta may write temporary files to buckets and then delete them. File structures may change at any time and without notice.

Tip

When importing a file from Cloud Storage, you can optionally choose to show hidden files and folders. Hidden files may contain useful information, such as JSON representations of your visual profiles. File structures in hidden folders may change without notice at any time.

For more information on importing hidden files, see Import Data Page.
For more information on writing to storage, see Run Job Page.

Support for CMEK

Use of Customer Managed Encryption Keys (CMEK) is supported and is transparent to the user. For more information, see https://cloud.google.com/kms/docs/cmek.

Reading from sources

You can create a dataset from one or more files stored in Cloud Storage.

Wildcards:

You can parameterize your input paths to import source files as part of the same imported dataset. For more information, see Overview of Parameterization.

Folder selection:

When you select a folder in Cloud Storage to create your dataset, you select all files in the folder to be included.

This option selects all files in all sub-folders and bundles them into a single dataset. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.
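
Before selecting a folder, it can help to confirm that everything beneath it shares one format. The sketch below counts file extensions across a list of object names, including those in sub-folders. It checks extensions only, not whether the files have the same structure, and the object names are hypothetical.

```python
import posixpath
from collections import Counter


def formats_in_folder(object_names):
    """Count lowercase file extensions among the given object names."""
    return Counter(posixpath.splitext(name)[1].lower() for name in object_names)


def is_single_format(object_names):
    """True if all objects share one extension (or the list is empty)."""
    return len(formats_in_folder(object_names)) <= 1


# Hypothetical listing of a folder and its sub-folders.
listing = ["sales/2023/q4.csv", "sales/2024/q1.CSV", "sales/2024/notes.json"]
print(dict(formats_in_folder(listing)))  # {'.csv': 2, '.json': 1}
print(is_single_format(listing))         # False
```

If the check fails, narrowing the selection to a more specific sub-folder is usually the simplest remedy.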

Read file formats:

From Cloud Storage, Dataprep by Trifacta can read the following file formats:

CSV
JSON
AVRO
GZIP
BZIP2
TXT
XLS/XLSX
LOG
TSV

Creating datasets

When creating a dataset, you can choose to read data from a source stored in Cloud Storage or from a local file.

Cloud Storage sources are not moved or changed.
Local file sources are uploaded to the designated Upload location in Cloud Storage where they remain and are not changed. This location is specified in your user profile. See User Profile Page.

Data may be individual files or all of the files in a folder. For more information, see Reading from Sources above.

Full execution on BigQuery

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

For Cloud Storage data sources that are written to BigQuery, you may be able to execute the job in BigQuery.

You must enable the Full execution for the GCS file option for existing flows that use files from Cloud Storage. For more information, see Flow Optimization Settings Dialog.
Additional configuration and limitations may apply. For more information, see BigQuery Running Environment.

Writing results

When your results from a job are generated, they can be stored back in Cloud Storage. The Cloud Storage location is available through the Output Destinations tab in the Job Details page. See Job Details Page.

Warning

If your environment is using Cloud Storage, do not use the Upload location for storage. This directory is used for storing uploads, which may be used by multiple users. Manipulating files outside of the product can destroy other users' data. Please use the tools provided through the interface for managing uploads from Cloud Storage.

Note

During the publishing process, Dataprep by Trifacta may write temporary files to your storage bucket and then delete them. If you have enabled a storage retention policy on your bucket, that time period may interfere with the publishing process. For more information on storage retention policy, see https://cloud.google.com/storage/docs/bucket-lock#retention-policy.

Creating a new dataset from results

As part of writing results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.

Note

When you create a new dataset as part of your results, the file or files are written to the designated output location for your user account. Depending on how your permissions are configured, this location may not be accessible to other users.

Maintenance

Note

Files stored in /uploads should be deleted with caution. These files are the sources for imported datasets created by uploading files from the local desktop. Remove files from /uploads only if you are confident that they are no longer used as source data for any imported datasets. Otherwise, those datasets will break.

Files stored in /tmp are suitable for removal. Some details are below.

Dataflow job temp files

When your jobs are executed on Dataflow, the following temp files may be generated:

.pb and .json files are generated with each job run. After a job run has been completed, these files can be safely removed.
dataflow-bundle.jar contains code dependencies and may be reused in future job submissions. This JAR file can be more than 150 MB.
Tip
You can safely remove all but the latest version of dataflow-bundle.jar. Older versions are no longer used. You can also delete the latest version, which is automatically replaced during the next job execution. However, doing so requires the file to be transferred again.
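
The cleanup guidance above can be sketched as a function that picks removal candidates from a listing of the temp location. This is illustrative only and rests on several assumptions: every listed job run has completed, each version of dataflow-bundle.jar sits at a different path ending in that file name, and the newest version is the one most recently updated. The object names are hypothetical, and the sketch only selects names; it deletes nothing.

```python
from datetime import datetime


def removable_temp_files(listing):
    """Select removal candidates from (object_name, last_updated) pairs.

    Assumes all job runs in the listing have completed.
    """
    # .pb and .json files are generated per job run and can be removed afterward.
    removable = [name for name, _ in listing if name.endswith((".pb", ".json"))]
    # Sort bundle JARs oldest first and keep only the most recently updated one.
    jars = sorted(
        (updated, name)
        for name, updated in listing
        if name.endswith("dataflow-bundle.jar")
    )
    removable += [name for _, name in jars[:-1]]
    return removable


# Hypothetical listing of the temp location.
listing = [
    ("tmp/job1/graph.pb", datetime(2024, 1, 5)),
    ("tmp/job1/options.json", datetime(2024, 1, 5)),
    ("tmp/v1/dataflow-bundle.jar", datetime(2024, 1, 5)),
    ("tmp/v2/dataflow-bundle.jar", datetime(2024, 3, 9)),
]
print(removable_temp_files(listing))
# ['tmp/job1/graph.pb', 'tmp/job1/options.json', 'tmp/v1/dataflow-bundle.jar']
```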

Reference

Enable: Automatically enabled.

Create New Connection: n/a
