- For more information on Cloud Storage, see https://cloud.google.com/storage.
A project owner does not need to enable access to Cloud Storage. Access to Cloud Storage is governed by permissions.
The product uses an IAM role to access Google Cloud resources. The default IAM role assigned to each user in the project grants read and write access to data in Cloud Storage.
If you are using a custom IAM role to access Google Cloud resources, you must ensure that the role contains the appropriate permissions to read and write data in Cloud Storage.
For more information, see Required Dataprep User Permissions.
The product uses a service account to run jobs in Dataflow. The default service account assigned to each user in the project is granted access to Cloud Storage.
If you are using a custom service account, you must ensure that the account contains the appropriate permissions to read and write data in Cloud Storage.
For more information, see Google Service Account Management.
NOTE: The platform supports a single, global connection to Cloud Storage. All users must use this connection.
Create Cloud Storage Connection
You do not need to create a connection to Cloud Storage. It is accessible based on permissions. See above.
Create via API
You cannot create Cloud Storage connections through the APIs.
Using Cloud Storage Connections
Uses of Cloud Storage
Dataprep by Trifacta can use Cloud Storage for the following reading and writing tasks:
- Upload through application: When files are imported into Dataprep by Trifacta as datasets, they are uploaded and stored in a location in Cloud Storage. For more information, see User Profile Page.
- Creating Datasets from Cloud Storage Files: You can read from source data stored in Cloud Storage. A source may be a single Cloud Storage file or a folder of identically structured files. See Reading from Sources below.
- Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in Cloud Storage.
- Writing Results: After a job has been executed, you can write the results back to Cloud Storage.
NOTE: When Dataprep by Trifacta executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.
Before you begin using Cloud Storage
Your administrator must configure read/write permissions to locations in Cloud Storage. Please see the Cloud Storage documentation.
Avoid reading and writing in the following locations:
- The Scratch Area location is used by Dataprep by Trifacta for temporary storage.
- The Upload location is used for storing data that has been uploaded from local files.
For more information on these locations, see User Profile Page.
- The Requester Pays feature of Cloud Storage is not supported in Dataprep by Trifacta.
Storing Data in Cloud Storage
Your administrator should provide raw data or locations and access for storing raw data within Cloud Storage.
- All Dataprep by Trifacta users should have a clear understanding of the folder structure within Cloud Storage where each individual can read from and write results.
- Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.
NOTE: Dataprep by Trifacta does not modify source data in Cloud Storage. Sources stored in Cloud Storage are read without modification from their source locations, and sources that are uploaded to the platform are stored in the designated Upload location for each user. See User Profile Page.
When reading from or writing to buckets on Cloud Storage, you can optionally choose to show hidden files and folders. The names of hidden objects begin with a dot (.) or an underscore (_).
You can choose to use these files as:
- Sources for imported datasets
- Output targets for writing job results
NOTE: Hidden files and folders are typically hidden for a reason. For example, Dataprep by Trifacta may write temporary files to buckets and then delete them. File structures may change at any time and without notice.
Tip: When importing a file from Cloud Storage, you can optionally choose to show hidden files and folders. Hidden files may contain useful information, such as JSON representations of your visual profiles. File structures in hidden folders may change without notice at any time.
- For more information on importing hidden files, see Import Data Page.
- For more information when writing to storage, see Run Job Page.
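The hidden-object naming convention described above can be sketched as a simple filter. This is illustrative only; the object names and helper function are hypothetical, not part of the product:

```python
def is_hidden(object_name: str) -> bool:
    """Return True if any path segment starts with a dot or an underscore,
    following the hidden-object naming convention described above."""
    return any(part.startswith((".", "_")) for part in object_name.split("/") if part)

# Hypothetical object names in a bucket:
names = ["sales/2023/data.csv", "sales/.profile.json", "_tmp/part-0001", "report.csv"]
visible = [n for n in names if not is_hidden(n)]
# visible -> ["sales/2023/data.csv", "report.csv"]
```

Enabling the show-hidden option in the application is equivalent to skipping this filter and listing all objects.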
Support for CMEK
Use of Customer Managed Encryption Keys (CMEK) is supported and is transparent to the user. For more information, see https://cloud.google.com/kms/docs/cmek.
Reading from sources
You can create a dataset from one or more files stored in Cloud Storage.
You can parameterize your input paths to import source files as part of the same imported dataset. For more information, see Overview of Parameterization.
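As a rough illustration of how a parameterized input path selects multiple source files, the sketch below uses a plain regular expression over hypothetical object names. The product's own parameter syntax differs; see Overview of Parameterization for the actual feature:

```python
import re

# Illustrative only: match all monthly report files under one prefix
# with a date pattern, so they can be imported as one dataset.
pattern = re.compile(r"^sales/report-\d{4}-\d{2}\.csv$")

names = ["sales/report-2023-01.csv", "sales/report-2023-02.csv", "sales/summary.csv"]
matched = [n for n in names if pattern.match(n)]
# matched -> ["sales/report-2023-01.csv", "sales/report-2023-02.csv"]
```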
When you select a folder in Cloud Storage to create your dataset, you select all files in the folder to be included.
This option selects all files in all sub-folders and bundles them into a single dataset. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
- All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.
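Before pointing a dataset at a folder, it can help to confirm the files share one format. The check below is a hypothetical pre-flight sketch based only on file extensions (the product itself validates structure, not just extensions):

```python
from pathlib import PurePosixPath

def consistent_format(object_names):
    """Return True if all files share a single extension, approximating the
    same-format requirement for bundling a folder into one imported dataset."""
    extensions = {PurePosixPath(name).suffix.lower() for name in object_names}
    return len(extensions) == 1

consistent_format(["logs/a.csv", "logs/b.csv"])    # True
consistent_format(["logs/a.csv", "logs/b.json"])   # False: mixed formats cannot be bundled
```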
Read file formats:
From Cloud Storage, Dataprep by Trifacta can read the following file formats:
When creating a dataset, you can choose to read data from a source stored in Cloud Storage or from a local file.
- Cloud Storage sources are not moved or changed.
- Local file sources are uploaded to the designated Upload location in Cloud Storage where they remain and are not changed. This location is specified in your user profile. See User Profile Page.
Data may be individual files or all of the files in a folder. For more information, see Reading from Sources above.
Full execution on BigQuery
For Cloud Storage data sources that are written to BigQuery, you may be able to execute the job in BigQuery.
- You must enable the Full execution for the GCS file option for existing flows that use files from Cloud Storage. For more information, see Flow Optimization Settings Dialog.
- Additional configuration and limitations may apply. For more information, see BigQuery Running Environment.
When your results from a job are generated, they can be stored back in Cloud Storage. The Cloud Storage location is available through the Output Destinations tab in the Job Details page. See Job Details Page.
If your environment is using Cloud Storage, do not use the Upload location for storage. This directory is used for storing uploads, which may be used by multiple users. Manipulating files outside of the product can destroy other users' data. Please use the tools provided through the interface for managing uploads from Cloud Storage.
NOTE: During the publishing process, Dataprep by Trifacta may write temporary files to your storage bucket and then delete them. If you have enabled a storage retention policy on your bucket, that time period may interfere with the publishing process. For more information on storage retention policy, see https://cloud.google.com/storage/docs/bucket-lock#retention-policy.
Creating a new dataset from results
As part of writing results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.
NOTE: When you create a new dataset as part of your results, the file or files are written to the designated output location for your user account. Depending on how your permissions are configured, this location may not be accessible to other users.
NOTE: Files stored in /uploads should be deleted with caution. These files are the sources for imported datasets created by uploading files from the local desktop. Files in /uploads should only be removed if you are confident that they are no longer used as source data for any imported datasets. Otherwise, those datasets are broken.
Files stored in /tmp are suitable for removal. Some details are below.
Dataflow job temp files
When your jobs are executed on Dataflow, the following temp files may be generated:
- .json files are generated with each job run. After a job run has completed, these files can be safely removed.
- dataflow-bundle.jar contains code dependencies and may be reused in future job submissions. This JAR file can be more than 150 MB.
Tip: You can safely remove all but the latest version of dataflow-bundle.jar. Older versions are no longer used. You can also delete the latest version; it is automatically regenerated on the next job execution, although this requires re-transferring the file.
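The cleanup rule in the tip above can be sketched as follows. The object paths and timestamps are hypothetical; in practice you would obtain them from your bucket listing:

```python
def jars_to_delete(jar_objects):
    """Given (object_path, updated_timestamp) pairs for cached copies of
    dataflow-bundle.jar, return the paths of all but the most recently
    updated copy, which are safe to remove."""
    if not jar_objects:
        return []
    newest_path, _ = max(jar_objects, key=lambda obj: obj[1])
    return [path for path, _ in jar_objects if path != newest_path]

# Hypothetical listing: two cached copies with Unix update timestamps.
jars = [("tmp/a1/dataflow-bundle.jar", 1700000000),
        ("tmp/b2/dataflow-bundle.jar", 1710000000)]
# jars_to_delete(jars) -> ["tmp/a1/dataflow-bundle.jar"]
```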