
Cloud Storage is a web service for storing and accessing files on Google Cloud Platform. The service combines the performance and scalability of Google Cloud with advanced security.

Enable

A project owner does not need to enable access to Cloud Storage. Access to Cloud Storage is governed by permissions. 

IAM role

An IAM role is used by the product to enable its access to Google Cloud resources. The default IAM role that is assigned to each user in the project is granted access to read and write data in Cloud Storage.

If you are using a custom IAM role to access Google Cloud resources, you must ensure that the role contains the appropriate permissions to read and write data in Cloud Storage.

For more information, see Required Dataprep User Permissions.
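As a quick sanity check, the following sketch uses the google-cloud-storage Python client to ask Cloud Storage which object permissions the active credentials actually hold on a bucket. The bucket name is a placeholder; substitute one used by your project.

    from google.cloud import storage

    # Permissions needed to read source data and write results.
    required = [
        "storage.objects.list",
        "storage.objects.get",     # read source data
        "storage.objects.create",  # write results
    ]

    client = storage.Client()
    bucket = client.bucket("my-dataprep-bucket")  # placeholder bucket name
    granted = bucket.test_iam_permissions(required)
    missing = set(required) - set(granted)
    print("Missing permissions:", missing or "none")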

Service account

A service account is used by the product to run jobs in Dataflow. The default service account that is assigned to each user in the project is granted access to Cloud Storage.

If you are using a custom service account, you must ensure that the account contains the appropriate permissions to read and write data in Cloud Storage.

For more information, see Google Service Account Management.
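If you need to grant a custom service account read/write access yourself, one option is to add a bucket-level IAM binding, as in the hedged sketch below. The bucket, project, and service account names are placeholders; your administrator may prefer project-level roles instead.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-dataprep-bucket")  # placeholder bucket name

    # Add a bucket-level binding granting object read/write to the service account.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.objectAdmin",
        "members": ["serviceAccount:dataprep-jobs@my-project.iam.gserviceaccount.com"],
    })
    bucket.set_iam_policy(policy)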

Limitations

NOTE: The platform supports a single, global connection to Cloud Storage. All users must use this connection.

Create Cloud Storage Connection

You do not need to create a connection to Cloud Storage. It is accessible based on permissions. See above.

Create via API

You cannot create Cloud Storage connections through the APIs.

Using Cloud Storage Connections

Uses of Cloud Storage

Dataprep by Trifacta can use Cloud Storage for the following reading and writing tasks:

  1. Upload through application: When files are imported into Dataprep by Trifacta as datasets, they are uploaded and stored in a location in Cloud Storage. For more information, see User Profile Page.
  2. Creating Datasets from Cloud Storage Files: You can read in source data stored in Cloud Storage. A source may be a single Cloud Storage file or a folder of identically structured files. See Reading from Sources below.
  3. Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in Cloud Storage.
  4. Writing Results: After a job has been executed, you can write the results back to Cloud Storage.

NOTE: When Dataprep by Trifacta executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.

Open GCS

Before you begin using Cloud Storage

Your administrator must configure read/write permissions to locations in Cloud Storage. Please see the Cloud Storage documentation.

Avoid reading and writing in the following locations:

  • The Scratch Area location is used by Dataprep by Trifacta for temporary storage.
  • The Upload location is used for storing data that has been uploaded from a local file.

For more information on these locations, see User Profile Page.

Limitations:

  • The Requester Pays feature of Cloud Storage is not supported in Dataprep by Trifacta.

Storing Data in Cloud Storage

Your administrator should provide the raw data, or locations and access for storing raw data, within Cloud Storage.

  • All Dataprep by Trifacta users should have a clear understanding of the folder structure within Cloud Storage where each individual can read source data and write results.
  • Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.

NOTE: Dataprep by Trifacta does not modify source data in Cloud Storage. Sources stored in Cloud Storage are read without modification from their source locations, and sources that are uploaded to the platform are stored in the designated Upload location for each user. See User Profile Page.

Show hidden:

When reading from or writing to buckets on Cloud Storage, you can optionally choose to show hidden files and folders. The names of hidden objects begin with a dot (.) or an underscore (_).

You can choose to use these files as:

  • Sources for imported datasets
  • Output targets for writing job results

NOTE: Hidden files and folders are typically hidden for a reason. For example, Dataprep by Trifacta may write temporary files to buckets and then delete them. File structures may change at any time and without notice.

Tip: When importing a file from Cloud Storage, you can optionally choose to show hidden files and folders. Hidden files may contain useful information, such as JSON representations of your visual profiles. File structures in hidden folders may change without notice at any time.
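To make the dot/underscore convention concrete, here is a hedged sketch (Python, google-cloud-storage; the bucket name is a placeholder) that separates hidden objects from visible ones when listing a bucket:

    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs("my-dataprep-bucket"):  # placeholder bucket name
        leaf = blob.name.rsplit("/", 1)[-1]               # last path segment
        if leaf.startswith((".", "_")):
            print("hidden: ", blob.name)   # listed only when "show hidden" is enabled
        else:
            print("visible:", blob.name)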

Support for CMEK

Use of Customer Managed Encryption Keys (CMEK) is supported and is transparent to the user. For more information, see https://cloud.google.com/kms/docs/cmek.
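No product configuration is needed for CMEK. If your administrator wants a bucket to default to a customer-managed key, the hedged sketch below shows one way to set it; the bucket and key resource names are placeholders, and the Cloud Storage service agent must be able to use the key.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-dataprep-bucket")  # placeholder bucket name
    bucket.default_kms_key_name = (
        "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"
    )
    bucket.patch()  # new objects are now encrypted with this key by default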

Reading from sources

You can create a dataset from one or more files stored in Cloud Storage.

Wildcards:

You can parameterize your input paths to import source files as part of the same imported dataset. For more information, see Overview of Parameterization.
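As an illustration of how a wildcard path resolves to multiple objects (outside the product), the following hedged sketch lists every object in an assumed bucket that matches an assumed pattern; each matching file would become part of the same imported dataset.

    import fnmatch
    from google.cloud import storage

    pattern = "sales/2023-*.csv"        # assumed wildcard path within the bucket
    prefix = pattern.split("*", 1)[0]   # list_blobs accepts only a literal prefix

    client = storage.Client()
    blobs = client.list_blobs("my-dataprep-bucket", prefix=prefix)  # placeholder bucket
    matches = [b.name for b in blobs if fnmatch.fnmatch(b.name, pattern)]
    print(matches)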

Folder selection:

When you select a folder in Cloud Storage to create your dataset, you select all files in the folder to be included.

  • This option selects all files in all sub-folders and bundles them into a single dataset. If your sub-folders contain separate datasets, you should be more specific in your folder selection.

  • All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.

Read file formats:

From Cloud Storage, Dataprep by Trifacta can read the following file formats:

  • CSV
  • JSON
  • AVRO
  • GZIP
  • BZIP2
  • TXT
  • XLS/XLSX
  • LOG
  • TSV

Creating datasets

When creating a dataset, you can choose to read data from a source stored in Cloud Storage or from a local file.

  • Cloud Storage sources are not moved or changed.
  • Local file sources are uploaded to the designated Upload location in Cloud Storage, where they remain and are not changed. This location is specified in your user profile. See User Profile Page.

Data may be individual files or all of the files in a folder. For more information, see Reading from Sources above.

Full execution on BigQuery

Feature Availability: This feature may not be available in all product editions. For more information on available features, see Compare Editions.

For Cloud Storage data sources that are written to BigQuery, you may be able to execute the job in BigQuery. 

Writing results

When your results from a job are generated, they can be stored back in Cloud Storage. The Cloud Storage location is available through the Output Destinations tab in the Job Details page. See Job Details Page.

If your environment is using Cloud Storage, do not use the Upload location for storage. This directory is used for storing uploads, which may be used by multiple users. Manipulating files outside of the product can destroy other users' data. Please use the tools provided through the interface for managing uploads from Cloud Storage.

NOTE: During the publishing process, Dataprep by Trifacta may write temporary files to your storage bucket and then delete them. If you have enabled a storage retention policy on your bucket, that time period may interfere with the publishing process. For more information on storage retention policy, see https://cloud.google.com/storage/docs/bucket-lock#retention-policy.
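If publishing jobs fail or leave temporary files behind, one hedged check (Python, google-cloud-storage; the bucket name is a placeholder) is to see whether a retention policy is set on the output bucket:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-output-bucket")  # placeholder bucket name
    if bucket.retention_period:
        locked = " (locked)" if bucket.retention_policy_locked else ""
        print(f"Retention policy: {bucket.retention_period} seconds{locked}")
    else:
        print("No retention policy set.")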

Creating a new dataset from results

As part of writing results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.

NOTE: When you create a new dataset as part of your results, the file or files are written to the designated output location for your user account. Depending on how your permissions are configured, this location may not be accessible to other users.

Maintenance

NOTE: Files stored in /uploads should be deleted with caution. These files are the sources for imported datasets created by uploading files from the local desktop. Files in /uploads should only be removed if you are confident that they are no longer used as source data for any imported datasets. Otherwise, those datasets are broken.

Files stored in /tmp are suitable for removal. Some details are below.

Dataflow job temp files

When your jobs are executed on Dataflow, the following temp files may be generated:

  • .pb and .json files are generated with each job run. After a job run has been completed, these files can be safely removed.
  • dataflow-bundle.jar contains code dependencies and may be reused in future job submissions. This JAR file can be more than 150 MB.

    Tip: You can safely remove all but the latest version of dataflow-bundle.jar. Older versions are no longer used. You can also delete the latest version; it is automatically replaced in the next job execution, although this requires re-transferring the file. A cleanup sketch follows below.
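The following hedged sketch (Python, google-cloud-storage) removes per-job .pb/.json files and all but the most recently updated copy of dataflow-bundle.jar under an assumed temp prefix. The bucket name and prefix are placeholders, so verify the paths against your own environment before deleting anything.

    from google.cloud import storage

    client = storage.Client()
    bucket_name = "my-dataprep-bucket"  # placeholder bucket name
    temp_prefix = "tmp/"                # assumed temp-file prefix

    blobs = list(client.list_blobs(bucket_name, prefix=temp_prefix))

    # Per-job files: safe to delete once the corresponding job run has completed.
    for blob in blobs:
        if blob.name.endswith((".pb", ".json")):
            blob.delete()

    # Keep only the most recently updated dataflow-bundle.jar.
    jars = sorted(
        (b for b in blobs if b.name.endswith("dataflow-bundle.jar")),
        key=lambda b: b.updated,
    )
    for old in jars[:-1]:
        old.delete()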

Reference

Enable: Automatically enabled.

Create New Connection: n/a
