Trifacta SaaS
This section describes how you interact through the Trifacta® platform with your Trifacta File Storage environment.

  • Trifacta File Storage (TFS) is a data storage service provided by Trifacta for use with the Trifacta platform.
  • TFS provides S3-backed storage services for importing, storing, sampling, and generating results. 
  • TFS is enabled as the default storage environment for your product. 
    • Optionally, you can set S3 to be the default storage environment and enable TFS as a secondary storage environment.
    • For more information, see Configure Storage Environment.

Uses of TFS

The Trifacta platform can use TFS for the following tasks:

  1. Creating Datasets from Files: You can read in source data stored in TFS. An imported dataset may be a single TFS file or a folder of identically structured files. See Reading from Sources below.
  2. Reading Datasets: When creating a dataset, you can pull your data from a source in TFS. See Creating Datasets below.
  3. Writing Results: After a job has been executed, you can write the results back to TFS. See Writing Results below.

In the Trifacta application, TFS is accessed through the TFS browser. See TFS Browser.

NOTE: When the Trifacta platform executes a job on a dataset, the source data is untouched. Results are written to a new location, so no source data is disturbed by the process.

Before You Begin Using TFS

Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Trifacta application.

Your writeable home output directory is available through your user profile. See Storage Config Page.

Secure Access

Access to TFS is governed by IAM roles that are automatically assigned to workspace users of the product. No additional configuration is necessary to access TFS.

Storing Data in TFS

When you upload raw data to TFS, it is stored in your pre-configured uploads directory based on an internal upload identifier.
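As a rough sketch of this identifier-based layout (the actual identifier scheme is internal to Trifacta; the names and UUID choice below are hypothetical), an upload path might be built like this:

```python
import uuid
from pathlib import PurePosixPath

def upload_path(filename: str) -> PurePosixPath:
    """Sketch only: place an uploaded file under an identifier-based subfolder."""
    upload_id = uuid.uuid4().hex  # hypothetical stand-in for the internal upload identifier
    return PurePosixPath("/trifacta/uploads") / upload_id / filename

path = upload_path("sales.csv")
print(path)  # e.g. /trifacta/uploads/<upload-id>/sales.csv
```

The point of the identifier is that two uploads of a file with the same name never collide.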

All Trifacta users should have a clear understanding of the TFS folder structure: where each user can read source data and where each user can write results.

  • Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.
  • The Trifacta application stores the results of each job in a separate folder in your TFS location.

NOTE: The Trifacta platform does not modify source data in TFS. Source data stored in TFS is read without modification, and source data uploaded to the Trifacta platform is stored in /trifacta/uploads.

Reading from Sources

You can create an imported dataset from one or more files stored in TFS.

Wildcards:

You can parameterize your input paths to import multiple source files as part of the same imported dataset. For more information, see Overview of Parameterization.
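The idea can be sketched in Python (the file names and pattern below are hypothetical; in the product the paths would be TFS locations):

```python
from fnmatch import fnmatch

# Hypothetical TFS folder listing; a wildcard groups matching files into one dataset.
files = ["orders/2023-01.csv", "orders/2023-02.csv", "orders/readme.txt"]
pattern = "orders/2023-*.csv"

dataset_files = [f for f in files if fnmatch(f, pattern)]
print(dataset_files)  # ['orders/2023-01.csv', 'orders/2023-02.csv']
```

All files matched by the wildcard are read together as a single imported dataset, which is why they must share the same format and structure.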

Folder selection:

When you select a folder in TFS to create your dataset, you select all files in the folder to be included.

Notes:

  • This option selects all files in all sub-folders and bundles them into a single dataset. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
  • All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.

When a folder is selected from TFS, the following file types are ignored:

  • *_SUCCESS and *_FAILED files, which may be present if the folder has been populated by the running environment.
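This filtering rule can be sketched as follows (the folder listing is hypothetical, and the platform's actual selection logic is internal):

```python
from fnmatch import fnmatch

# Marker files written by the running environment are skipped during folder import.
IGNORED_PATTERNS = ("*_SUCCESS", "*_FAILED")

def dataset_files(folder_listing):
    """Return only the files that would be bundled into the dataset."""
    return [f for f in folder_listing
            if not any(fnmatch(f, p) for p in IGNORED_PATTERNS)]

listing = ["part-00000.csv", "part-00001.csv", "job_SUCCESS"]
print(dataset_files(listing))  # ['part-00000.csv', 'part-00001.csv']
```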

NOTE: If you have a folder and file with the same name in TFS, search only retrieves the file. You can still navigate to locate the folder. 

Creating Datasets

When creating a dataset, you can choose to read data from a source stored in TFS or from a local file.

  • TFS sources are not moved or changed.
  • Local file sources are uploaded to /trifacta/uploads where they remain and are not changed.

Data may be individual files or all of the files in a folder.

  • For more information, see Reading from Sources in TFS above.
  • In the Import Data page, click the Trifacta tab. See Import Data Page.

Writing Results

When you run a job, you can specify the file path where the generated results are written. By default, the output is generated in the default output home directory.

  • Each set of results must be stored in a separate folder within your TFS output home directory.
  • For more information on your output home directory, see Storage Config Page.
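A minimal sketch of the one-folder-per-result-set convention (the paths and naming scheme here are hypothetical, not the platform's actual layout):

```python
from pathlib import PurePosixPath

def results_folder(output_home: str, job_id: int) -> PurePosixPath:
    """Each job's results land in their own subfolder of the output home directory."""
    return PurePosixPath(output_home) / f"job_{job_id}"

print(results_folder("/home/user/outputs", 42))  # /home/user/outputs/job_42
```

Keeping each set of results in a separate folder ensures that re-running a job never overwrites results from an earlier run.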

Creating a new dataset from results

As part of writing results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.

NOTE: When you create a new dataset as part of your results, the file or files are written to the designated output location for your user account. Depending on how your permissions are configured, this location may not be accessible to other users.

Purging Files

Other than temporary files, the Trifacta platform does not remove any files that were generated or used by the platform, including:

  • Uploaded datasets
  • Generated samples
  • Generated results
