Overview of Storage

The Designer Cloud Powered by Trifacta platform supports multiple options for reading data from and writing data to your storage systems.

Base Storage Layer

The base storage layer is the datastore where the Designer Cloud Powered by Trifacta platform stores uploaded data, generated profiles, job results, and samples. By default, job results are written to the base storage layer. You can configure the base storage layer and other required settings.

Tip

The base storage layer must be a file-based system.

Note

The base storage layer must be enabled and configured during initial installation. After the base storage layer has been configured, it cannot be switched to another environment. For more information, see Set Base Storage Layer.

Uses of base storage layer

In general, all base storage layers provide similar capabilities for storing, creating, reading, and writing datasets.

The base storage layer enables you to perform the following functions:

  1. Storing datasets: You can upload or store datasets in directories on the base storage layer. See below.

  2. Creating datasets: You can create datasets from sources stored on the base storage layer. A source may be a single file or a folder of files.

  3. Storage of samples: Any samples that you generate are stored in the base storage layer.

  4. Ingested data: Some formats, such as Excel and PDF, are stored as binary (non-text) files. These files must be ingested and converted to CSV files, which are stored on the base storage layer.

  5. Cached data: You can enable a cache on the base storage layer, which allows ingested data to remain there for a period of time. The cache improves performance if you need to use the same data again later.

  6. Writing results: After you run a job, you can write the results to the base storage layer.

Base storage layer directories

The following directories and their sub-directories are created and maintained on the base storage layer:

/trifacta/uploads

Storage of datasets uploaded through the Trifacta Application. Directories beneath this one are listed by the internal identifier for each user of the product who has uploaded at least one file.

Warning

Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Trifacta Application.

/trifacta/queryResults

Default storage of results generated by job executions. Directories beneath this one are listed by the internal identifier for each user of the product who has run at least one job.

For each user, these sub-directories are the default storage location for job results. These locations can be modified. See Preferences Page. Within the queryResults directory, you may find sub-directories labeled datasourceCache. When data source caching is enabled, data read into the product may be temporarily stored in this directory. For more information, see Configure Data Source Caching.

/trifacta/dictionaries

Storage of custom dictionary files uploaded by users.

Note

This feature applies to Designer Cloud Powered by Trifacta Enterprise Edition only. It is not often used.

/trifacta/tempfiles

Temporary storage location for files required for use of the product.

Note

The tempfiles directory is reserved for use by the platform. It is the only one of these directories that the platform actively cleans.

User-specific directories

The following directories are created by default on the base storage layer for the Designer Cloud Powered by Trifacta platform to store user data.

By default, these directories are created under the following path:

<bucket_name>/<userId>

where:

  • <bucket_name> is the name of the bucket where user data is stored.

  • <userId> is the username used to log in to the product.

jobrun

Storage of generated samples.

temp

Temporary storage.

upload

Depending on your configuration, uploaded files may be stored in this per-user directory.

These directories may be modified by individual users. For more information, see Storage Config Page.
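
For illustration only, the following minimal sketch assembles the default per-user paths described above; the bucket name and user ID shown are placeholder values, and the actual values depend on your deployment and login.

# Illustrative sketch only: the default per-user storage paths described above.
# "example-bucket" and "jdoe" are placeholder values for the bucket name and
# the username used to log in to the product.
bucket_name = "example-bucket"
user_id = "jdoe"

user_root = f"{bucket_name}/{user_id}"
default_dirs = {
    "jobrun": f"{user_root}/jobrun",   # generated samples
    "temp": f"{user_root}/temp",       # temporary storage
    "upload": f"{user_root}/upload",   # uploaded files (configuration-dependent)
}

for name, path in default_dirs.items():
    print(f"{name}: {path}")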

Minimum Permissions

Designer Cloud Powered by Trifacta Enterprise Edition requires the following operating system level permissions on the listed directories and sub-directories:

Directory                 Owner Min Permissions    Group Min Permissions    World Min Permissions
/trifacta/uploads         read+write+execute       none                     none
/trifacta/queryResults    read+write+execute       none                     none
/trifacta/dictionaries    read+write+execute       none                     none
/trifacta/tempfiles       read+write+execute       none                     none
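
As a minimal sketch only, assuming these directories reside on a POSIX filesystem local to the Alteryx node, the owner-only mode from the table above (the equivalent of 700) could be applied as follows. If your base storage layer is HDFS or an object store, use that system's own permission tooling (for example, hdfs dfs -chmod) instead.

# Minimal sketch: apply read+write+execute for the owner and no permissions
# for group and world to the platform directories listed above.
# Assumes the directories exist on a local POSIX filesystem and that this
# runs as the directory owner (or root).
import os
import stat

PLATFORM_DIRS = [
    "/trifacta/uploads",
    "/trifacta/queryResults",
    "/trifacta/dictionaries",
    "/trifacta/tempfiles",
]

for directory in PLATFORM_DIRS:
    # stat.S_IRWXU == 0o700: rwx for the owner, nothing for group or world.
    os.chmod(directory, stat.S_IRWXU)
    # Apply the same mode to sub-directories, which are also covered above.
    for root, dirs, _files in os.walk(directory):
        for d in dirs:
            os.chmod(os.path.join(root, d), stat.S_IRWXU)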

Available base storage layers

The Designer Cloud Powered by Trifacta platform supports the following base storage layers.

Note

In some deployments, the base storage layer is pre-configured for you and cannot be modified. After the base storage layer has been defined, you cannot change it.

Note

For all storage layers, the source data is never modified. Whenever a job is executed on a source dataset, results are written to a separate output location.

For more information, see Storage Deployment Options.

S3

Simple Storage Service (S3) is an online data storage service from Amazon that provides low-latency access through web services. For more information, see https://aws.amazon.com/s3/.

For more information, see External S3 Connections.
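
For orientation only, the following sketch reads a single object from S3 using the AWS boto3 SDK; the bucket and key names are placeholders, and this is independent of how the platform itself accesses S3.

# Illustrative only: read one object from S3 with boto3.
# "example-bucket" and the key are placeholders; credentials are resolved
# through the standard AWS credential chain (environment, profile, or role).
import boto3

s3 = boto3.client("s3")
response = s3.get_object(Bucket="example-bucket", Key="datasets/orders.csv")
data = response["Body"].read()
print(f"Read {len(data)} bytes")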

HDFS

HDFS is a scalable file storage system for use across all of the nodes (servers) of a Hadoop cluster. Many interactions with HDFS are similar to desktop interactions with files and folders. However, what looks like a "file" or "folder" in HDFS may be spread across multiple nodes in the cluster. For more information, see https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS.

Note

If you are using impersonated access on the base storage layer, the group minimum permissions must be read+write+execute on all of the above directories and sub-directories.

For more information, see Using HDFS.
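
As a rough sketch, assuming a reachable HDFS NameNode and the pyarrow library, the platform directories could be listed as shown below; the NameNode host and port are placeholders for your cluster.

# Illustrative only: list the top-level /trifacta directories on HDFS with pyarrow.
# Requires a local Hadoop client (libhdfs) and a configured CLASSPATH.
# "namenode.example.com" and port 8020 are placeholders for your cluster.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
for info in hdfs.get_file_info(fs.FileSelector("/trifacta", recursive=False)):
    print(info.path, info.type)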

ADLS Gen 1

ADLS is a scalable file storage system for use across all of the nodes (servers) of a cluster. Many interactions with ADLS are similar to desktop interactions with files and folders. However, what looks like a "file" or "folder" in ADLS may be spread across multiple nodes in the cluster. For more information, see https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview.

ADLS Gen 2

Microsoft Azure deployments can integrate with the next generation of Azure Data Lake Store (ADLS Gen2). Microsoft Azure Data Lake Store Gen2 (ADLS Gen2) combines the power of a high-performance file system with massive scale and economy. Azure Data Lake Storage Gen2 extends Azure Blob Storage capabilities and is optimized for analytics workloads. For more information, see https://azure.microsoft.com/en-us/services/storage/data-lake-storage/.

WASB

WASB is a scalable file storage system for use across all of the nodes (servers) of a cluster. As with HDFS, many interactions with WASB are similar to desktop interactions with files and folders. However, what looks like a "file" or "folder" in WASB may be spread across multiple nodes in the cluster.

Encryption on base storage layer

For data that is transferred to and from the base storage layer:

  • Data in transit is encrypted using HTTPS.

  • Data at rest is unencrypted by default.

    Note

    Server-side encryption can be applied when the product is writing results to an S3 bucket. For more information, see AWS Settings Page.
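
The note above refers to the product's own setting. Purely as a general illustration of S3 server-side encryption, and not of the platform's write path, an upload request can ask S3 to encrypt the object at rest:

# General illustration of S3 server-side encryption with boto3 (SSE-S3).
# This is not the platform's own write mechanism; bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-bucket",
    Key="results/output.csv",
    Body=b"id,value\n1,42\n",
    ServerSideEncryption="AES256",  # ask S3 to encrypt the object at rest
)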

Management of base storage layer

Maintenance of the base storage layer must be in accordance with your enterprise policies.

Warning

Unless the base storage layer is managed by Alteryx, it is the responsibility of the customer to maintain access and perform any required backups of data stored in the base storage layer.

Note

Except for temporary files, the Designer Cloud Powered by Trifacta platform does not perform any cleanup of the base storage layer.

External Storage

You can create connections to external storage systems, integrating the Designer Cloud Powered by Trifacta platform with external datastores. Depending on the type of connection and your permissions, the connection can be:

  • read-only

  • write-only

  • read-write

You can create and edit connections between the Designer Cloud Powered by Trifacta platform and external data stores. You can create either file-based or table-based connections to individual storage units, such as databases or buckets.

Note

In your environment, creation of connections may be limited to administrators only. For more information, contact your workspace administrator.

Tip

Administrators can edit any public connection.

Note

After you create a connection, you cannot change its connection type. You must delete the connection and start again.

File-based systems

In addition to the base storage layer, you may be able to connect to other file-based systems. For example, if your base storage layer is HDFS, you can also connect to S3.

Note

If HDFS is specified as your base storage layer, you cannot publish to Redshift.

For more information, see Connection Types.

Cloud data warehouses

The Trifacta Application can be used to load and transform data in cloud data warehouses. These integrations offer high-performance access for reading datasets from these and other sources, performing transformations, and writing results back to the data warehouse as needed.

Relational systems

When you are working with relational data, you can configure the database connections after you have completed the platform configuration and have validated that it is working for locally uploaded files.

Note

Database connections cannot be deleted if their databases host imported datasets that are in use by the Designer Cloud Powered by Trifacta platform. Remove these imported datasets before deleting the connection.

For more information, see Using Databases.

Management of external storage

To integrate with an external system, the Trifacta Application requires:

  • Basic ability to connect to the hosting node of the external system through your network or cloud-based infrastructure (see the connectivity sketch after this list)

  • Requisite permissions to support browsing, reading, and/or writing data on the storage system

  • A defined connection between the application and the storage system.
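
The following is a minimal, product-agnostic sketch for checking the first requirement, basic network reachability of the hosting node; the hostname and port are placeholders for your datastore.

# Minimal, product-agnostic reachability check for an external datastore host.
# "db.example.com" and port 5432 are placeholders; adjust for your system.
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach("db.example.com", 5432))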

Except for cleanup of temporary files, the Trifacta Application does not maintain external storage systems.