This section describes how you interact with your HDFS environment through the platform.
The platform can use HDFS for the reading and writing tasks described in this section.
In the application, HDFS is accessed through the HDFS browser. See HDFS Browser.
NOTE: When the platform executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.
Read/Write Access: Your Hadoop administrator must configure read/write permissions to locations in HDFS. Please see the HDFS documentation provided with your Hadoop distribution.
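Whether a user can write to an HDFS location is governed by POSIX-style permission bits on the directory (along with ACLs and superuser status, which the NameNode evaluates). As a rough illustration only, not part of the platform, with a hypothetical helper and made-up users, the owner/group/other bits can be interpreted like this:

```python
# Hypothetical sketch: interpret an HDFS-style permission listing to decide
# whether a user may write to a directory. Real access checks are performed
# by the NameNode and also involve ACLs and superuser status.
def can_write(perms: str, owner: str, group: str, user: str, user_groups: set) -> bool:
    """perms is a listing string such as 'drwxr-xr-x'."""
    bits = perms[1:]  # strip the leading file-type character
    if user == owner:
        return bits[1] == "w"  # owner write bit
    if group in user_groups:
        return bits[4] == "w"  # group write bit
    return bits[7] == "w"      # other write bit

# Example: only the owning user may write to this home output directory.
print(can_write("drwxr-xr-x", "alice", "analysts", "alice", {"analysts"}))  # True
print(can_write("drwxr-xr-x", "alice", "analysts", "bob", {"analysts"}))    # False
```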
Your Hadoop administrator should provide a place or mechanism for raw data to be uploaded to your Hadoop datastore.
Your Hadoop administrator should provide a writeable home output directory for you, which you can review. See Storage Config Page.
Depending on the security features you've enabled, the technical methods by which the platform accesses HDFS may vary. For more information, see Configure Hadoop Authentication.
Your Hadoop administrator should provide raw data, or locations and access for storing raw data, within HDFS. All users should have a clear understanding of the folder structure within HDFS: where each user can read from and where each user can write job results.
NOTE: The platform does not modify source data in HDFS. Sources stored in HDFS are read without modification from their source locations, and sources that are uploaded to the platform are stored in /trifacta/uploads, where they remain unchanged.
If JDBC ingest caching has been enabled, users may see a dataSourceCache folder in their browser. This folder is used to store per-user caches of JDBC-based data that has been ingested into the platform from its source.
NOTE: The dataSourceCache folder should not be used for reading and writing of datasets, metadata, or results.
For more information, see Configure JDBC Ingestion.
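The cache is per-user, so each user's ingested data is isolated under that user's own subfolder. A minimal sketch of how such a per-user cache path might be composed (the exact layout the platform uses is not specified here; the root path, user name, and source name below are hypothetical):

```python
import posixpath

# Sketch: compose a per-user cache location under a dataSourceCache folder.
# Illustrates the per-user isolation described above; the platform's actual
# cache layout may differ.
def cache_path(cache_root: str, user: str, source_id: str) -> str:
    return posixpath.join(cache_root, user, source_id)

print(cache_path("/trifacta/dataSourceCache", "alice", "orders_db"))
# -> /trifacta/dataSourceCache/alice/orders_db
```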
You can create a dataset from one or more files stored in HDFS.
Wildcards: You can parameterize your input paths to import multiple source files as part of the same imported dataset. For more information, see Overview of Parameterization.
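For example, a wildcard in an input path selects every matching file into a single imported dataset. A rough sketch of the matching behavior using shell-style patterns (the file names here are made up; actual path parameterization also supports variables and patterns beyond plain wildcards):

```python
import fnmatch

# Sketch: expand a wildcard input path against a file listing so that all
# matches are imported as one dataset.
files = [
    "/data/sales/2023-01.csv",
    "/data/sales/2023-02.csv",
    "/data/sales/readme.txt",
]
matches = [f for f in files if fnmatch.fnmatch(f, "/data/sales/*.csv")]
print(matches)
# -> ['/data/sales/2023-01.csv', '/data/sales/2023-02.csv']
```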
When you select a folder in HDFS to create your dataset, you select all files in the folder to be included. Notes:
- The selection does not include *_FAILED files, which may be present if the folder has been populated by Hadoop.
- If file names begin with an underscore (_), these files cannot be read during batch transformation and are ignored. Rename these files through HDFS so that they do not begin with an underscore.
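The exclusions above can be sketched as a simple filter over a folder listing (an illustration only; the platform's actual selection logic is internal, and the file names below are invented):

```python
# Sketch: which files in a selected HDFS folder would be usable in a dataset,
# per the notes above: *_FAILED marker files are excluded, and files whose
# names begin with an underscore cannot be read and are ignored.
def included(names):
    return [
        n for n in names
        if not n.startswith("_") and not n.endswith("_FAILED")
    ]

listing = ["part-00000", "part-00001", "_SUCCESS", "job_FAILED", "_temp"]
print(included(listing))
# -> ['part-00000', 'part-00001']
```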
When creating a dataset, you can choose to read data in from a source stored in HDFS or from a local file. Local files are uploaded to /trifacta/uploads, where they remain and are not changed.
Data may be individual files or all of the files in a folder. For more information, see Reading from Sources in HDFS.
When your job results are generated, they can be stored back in HDFS for you at the location defined for your user account.
If your deployment is using HDFS, do not use the /trifacta/uploads directory as an output location; it is reserved for uploaded sources.
NOTE: Users can specify a default output home directory and, during job execution, an output directory for the current job. In an encrypted HDFS environment, these two locations must be in the same encryption zone. Otherwise, writing the job results fails with an error.
Access to results:
Depending on how the platform is integrated with HDFS, other users may or may not be able to access your job results.
If user impersonation is enabled, results are written to HDFS through the HDFS account configured for your use. Depending on the permissions of your HDFS account, you may be the only person who can access these results.
As part of writing job results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.
NOTE: When you create a new dataset as part of your job results, the file or files are written to the designated output location for your user account. Depending on how your Hadoop permissions are configured, this location may not be accessible to other users.
Other than temporary files, the platform does not remove any files that it generated or used.
If you are concerned about data accumulation, please contact your HDFS administrator.