This section describes how you interact with your HDFS environment through the platform.

Uses of HDFS

The platform can use HDFS for the following reading and writing tasks:

  1. Creating Datasets from HDFS Files: You can read from a data source stored in HDFS. A source may be a single HDFS file or a folder of identically structured files. See Reading from Sources in HDFS below.
  2. Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in HDFS. See Creating Datasets below.
  3. Writing Job Results: After a job has been executed, you can write the results back to HDFS. See Writing Job Results below.

In the platform, HDFS is accessed through the HDFS browser. See HDFS Browser.

NOTE: When the platform executes a job on a dataset, the source data is untouched. Results are written to a new location, so no data is disturbed by the process.

Before You Begin Using HDFS

Secure Access

Depending on the security features you've enabled, the technical methods by which the platform accesses HDFS may vary. For more information, see Configure Hadoop Authentication.

Storing Data in HDFS

Your Hadoop administrator should provide raw data, or locations and access for storing raw data, within HDFS. All users should have a clear understanding of the folder structure within HDFS, including where each individual can read source data and write job results.

NOTE: The platform does not modify source data in HDFS. Sources stored in HDFS are read without modification from their source locations, and sources that are uploaded to the platform are stored in /trifacta/uploads.
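For illustration only, the following sketch shows how a Hadoop administrator might provision per-user read and write locations using the Hadoop FileSystem API. The paths and permissions are hypothetical; use the layout defined for your deployment.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class ProvisionUserDirs {
        public static void main(String[] args) throws Exception {
            // Reads HDFS settings from core-site.xml / hdfs-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical per-user layout: one folder to read sources from,
            // one folder to write job results to.
            Path sources = new Path("/data/users/jdoe/sources");
            Path results = new Path("/data/users/jdoe/results");

            fs.mkdirs(sources, new FsPermission("750"));
            fs.mkdirs(results, new FsPermission("750"));
            fs.close();
        }
    }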

Reading from Sources in HDFS

You can create a dataset from one or more files stored in HDFS.

Wildcards:

You can apply parameters and wildcards to your input paths, so that all matching source files are imported as part of the same dataset. For more information, see Overview of Parameterization.
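To make the wildcard behavior concrete, here is a minimal sketch using the Hadoop FileSystem glob API; the pattern shown is a hypothetical example, not a path from your environment.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListMatchingSources {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical parameterized input path: every monthly extract for 2023.
            FileStatus[] matches = fs.globStatus(new Path("/data/sales/2023-*.csv"));

            // Each matched file is imported as part of the same dataset.
            if (matches != null) {
                for (FileStatus status : matches) {
                    System.out.println(status.getPath());
                }
            }
            fs.close();
        }
    }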

Folder selection:

When you select a folder in HDFS to create your dataset, all files in that folder are included in the dataset.
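As a rough sketch of what a folder selection covers, the following uses the Hadoop API to enumerate the files in a folder; the folder path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListFolderContents {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical folder of identically structured files.
            for (FileStatus status : fs.listStatus(new Path("/data/sales/2023"))) {
                // Every file in the folder becomes part of the dataset.
                if (status.isFile()) {
                    System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
                }
            }
            fs.close();
        }
    }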

Creating Datasets

When creating a dataset, you can choose to read data from a source stored in HDFS or from a local file.

Data may be individual files or all of the files in a folder. For more information, see Reading from Sources in HDFS.

Writing Job Results

When your job results are generated, they can be stored back in HDFS at the location defined for your user account.

If your deployment is using HDFS, do not use the /trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the platform can destroy other users' data. Please use the tools provided through the interface for managing uploads from HDFS.

NOTE: Users can specify a default output home directory and, during job execution, an output directory for the current job. In an encrypted HDFS environment, these two locations must be in the same encryption zone. Otherwise, writing the job results fails with a Publish Job Failed error.
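If you need to verify the zones ahead of time, the following sketch checks whether two paths fall in the same encryption zone using the Hadoop HdfsAdmin client API. The two paths are hypothetical stand-ins for your output home directory and the current job's output directory, and the calls may require appropriate HDFS permissions.

    import java.util.Objects;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;
    import org.apache.hadoop.hdfs.protocol.EncryptionZone;

    public class CheckEncryptionZones {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            HdfsAdmin admin = new HdfsAdmin(FileSystem.getDefaultUri(conf), conf);

            // Hypothetical locations: the default output home directory and the
            // output directory specified for the current job.
            EncryptionZone homeZone = admin.getEncryptionZoneForPath(new Path("/data/users/jdoe/results"));
            EncryptionZone jobZone = admin.getEncryptionZoneForPath(new Path("/data/jobs/run-42/output"));

            // getEncryptionZoneForPath returns null for a path outside any zone.
            String home = (homeZone == null) ? null : homeZone.getPath();
            String job = (jobZone == null) ? null : jobZone.getPath();
            System.out.println(Objects.equals(home, job)
                    ? "Same encryption zone: results can be written"
                    : "Different encryption zones: expect a Publish Job Failed error");
        }
    }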


Access to results:

Depending on how the platform is integrated with HDFS, other users may or may not be able to access your job results.

Creating a new dataset from results

As part of writing job results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.

NOTE: When you create a new dataset as part of your job results, the file or files are written to the designated output location for your user account. Depending on how your Hadoop permissions are configured, this location may not be accessible to other users.