Through the Import Data page, you can upload datasets or select datasets from connected datastores. To open the page, click Import Data on the Library page.




Figure: Import Data page

General Limitations

NOTE: For file-based sources, Trifacta® Wrangler Pro expects that each row of data in the import file is terminated with a consistent newline character, including the last one in the file.

  • For single files lacking this final newline character, the final record may be dropped.

  • For multi-file imports lacking a newline in the final record of a file, this final record may be merged with the first one in the next file and then dropped in the Trifacta Photon running environment.
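
The newline requirement above can be checked before import. The following is a minimal sketch, not part of the product: the helper names `ends_with_newline`, `detect_newline`, and `ensure_trailing_newline` are hypothetical, and the sketch assumes the file uses either LF or CRLF line endings.

```python
import os


def ends_with_newline(path: str) -> bool:
    """Return True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return True  # an empty file has no record to terminate
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"


def detect_newline(path: str, sample_size: int = 65536) -> bytes:
    """Guess the newline style (CRLF or LF) from the start of the file."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return b"\r\n" if b"\r\n" in sample else b"\n"


def ensure_trailing_newline(path: str) -> bool:
    """Append a newline matching the file's style if the last record lacks one.

    Returns True if the file was modified.
    """
    if ends_with_newline(path):
        return False
    newline = detect_newline(path)
    with open(path, "ab") as f:
        f.write(newline)
    return True
```

Running `ensure_trailing_newline` on each file of a multi-file import before upload is one way to avoid the merged-record case described above.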



NOTE: An imported dataset requires about 15 rows to properly infer column data types and the row, if any, to use for column headers.

File and path limitations:

  • The colon character (:) cannot appear in a filename or a file path.
  • Filenames cannot begin with special characters like dot (.) or underscore (_).
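
As an illustration of the two rules above, the following sketch flags paths that would violate them. The function name `path_problems` is hypothetical. The sketch checks only the two leading characters named above (dot and underscore), although the wording "like" suggests the actual set may be larger, and it expects the path portion only, without a URI scheme such as `s3://`.

```python
import posixpath
from typing import List


def path_problems(path: str) -> List[str]:
    """Return a list of rule violations for a source path (empty if none)."""
    problems = []
    # Rule 1: no colon anywhere in the filename or path.
    if ":" in path:
        problems.append("colon in path")
    # Rule 2: the filename must not begin with a dot or underscore.
    # Assumption: only these two characters are checked here.
    name = posixpath.basename(path.rstrip("/"))
    if name.startswith((".", "_")):
        problems.append("filename starts with '.' or '_'")
    return problems
```

For example, `path_problems("data/sales.csv")` returns an empty list, while `path_problems("data/_tmp.csv")` reports the leading underscore.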

Basic Workflow

1. Connect to sources

NOTE: Compressed files are recognized and can be imported based on their file extensions.

Upload: Trifacta® Wrangler Pro can also load files from your local file system.

Tip: You can drag and drop files from your desktop to upload them.


NOTE: You can upload a file up to 1 GB in size.




NOTE: When you upload an updated version of a previously uploaded file, the new file is stored as a separate upload altogether. In your flow, you must swap out the old dataset to point to the new one.




S3: If connected to an S3 instance, you can browse your S3 buckets to select source files.

Tip: For HDFS and S3, you can select folders, which selects each file within the directory as a separate dataset.

See S3 Browser.

Redshift: If connected to an Amazon Redshift data warehouse, you can import sources from the connected database. See Redshift Browser.




Databases: If connected to a relational datastore, you can load tables or views from your database. See Database Browser.

NOTE: For long-loading relational sources, you can monitor progress through each stage of ingestion. After these sources are ingested, subsequent steps to import and wrangle the data may be faster.

For more information, see Overview of Job Monitoring.


For more information on the supported input formats, see Supported File Formats.

New/Edit: Click to create or edit a connection. By default, the displayed connections support import.

Search: Enter a search term to locate a specific connection.

See Create Connection Window.

2. Add datasets

When you have found your source directory or file:

  • You can hover over the name of a file to preview its contents.

    NOTE: Preview may not be available for some sources, such as Parquet.

  • Click the Plus icon next to the directory or filename to add it as a dataset.

    Tip: You can import multiple datasets at the same time. See below.

  • Excel files: Click the Plus icon next to the parent workbook to add all of the worksheets as a single dataset, or you can add individual sheets as individual datasets. See Import Excel Data.

  • If custom SQL queries are enabled, you can click Create Dataset with SQL to enter a customized SQL statement that pre-filters the table within the database to include only the rows and columns of interest.

    For more information, see Create Dataset with SQL.
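
To illustrate what pre-filtering with a custom SQL statement means, the following sketch runs such a statement against an in-memory SQLite database, which stands in for the connected relational datastore. The table `orders`, its columns, and the sample rows are hypothetical, and the SQL dialect accepted by your datastore may differ.

```python
import sqlite3

# An in-memory SQLite database stands in for the connected datastore.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "EMEA", 120.0, "a"),
        (2, "APAC", 75.5, "b"),
        (3, "EMEA", 310.25, "c"),
    ],
)

# The custom statement keeps two of the four columns and only EMEA rows,
# so the imported dataset contains just the data of interest.
CUSTOM_SQL = "SELECT id, amount FROM orders WHERE region = 'EMEA' ORDER BY id"
rows = conn.execute(CUSTOM_SQL).fetchall()
print(rows)
```

Filtering at the source in this way reduces the volume of data that must be ingested before wrangling begins.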


If parameterization has been enabled, you can apply parameters to the source paths of your datasets to capture a wider set of sources. Click Create Dataset with Parameters. See Create Dataset with Parameters.
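
The idea of capturing a wider set of sources with a parameterized path can be sketched as follows. The parameter syntax used by the product is not described in this section, so shell-style wildcards from Python's `fnmatch` module are used here purely as a stand-in. The function name `matching_sources` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def matching_sources(paths: Iterable[str], pattern: str) -> List[str]:
    """Return the paths matched by a wildcard pattern, in sorted order."""
    return sorted(p for p in paths if fnmatchcase(p, pattern))


available = [
    "sales/2023-01.csv",
    "sales/2023-02.csv",
    "sales/2022-12.csv",
    "sales/readme.txt",
]

# One pattern covers every 2023 monthly file instead of a single path.
matched = matching_sources(available, "sales/2023-*.csv")
print(matched)
```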

3. Configure selections

When a dataset has been selected, the following fields appear on the right side of the screen. Modify as needed:

Dataset Name: This name appears in the interface.

Dataset Description: You may add an optional description that provides additional detail about the dataset. This information is visible in some areas of the interface.

Tip: Click the Eye icon to inspect the contents of the dataset prior to importing.

You can select a single dataset or multiple datasets for import.

Edit settings

You can modify settings used during import for individual files. In the card for an individual dataset, click Edit Settings.

NOTE: In some cases, there may be discrepancies between row counts in the previewed data versus the data grid after the dataset has been imported, due to rounding in row counts performed in the preview.

Per-file encoding: By default, Trifacta Wrangler Pro applies a default encoding to the imported file. In some cases, the data preview panel may contain garbled data due to a mismatch in encodings. In the Data Preview dialog, you can select a different encoding for the file. When the correct encoding is selected, the preview displays the data as expected.

NOTE: Inferring a file's encoding by parsing its contents is not a reliable method. Instead, Trifacta Wrangler Pro assumes that the file is encoded in the default encoding. If it is not, you should change the encoding type for the file.


NOTE: In some cases, imported files are not properly parsed due to issues with encryption types or encryption keys in the source datastore. For more information, please contact your datastore administrator.

For a list of supported encoding types, see Supported File Encoding Types.
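
The effect of an encoding mismatch on a preview can be reproduced outside the product. In this minimal sketch, the helper `preview_with_encoding` is hypothetical, and UTF-8 and Latin-1 are chosen only for illustration. The sketch decodes the same bytes with two encodings so you can compare the results by eye. It does not attempt to detect the encoding automatically.

```python
def preview_with_encoding(raw: bytes, encoding: str, max_chars: int = 200) -> str:
    """Decode a byte sample for preview, marking undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")[:max_chars]


# Sample bytes written by a tool that used Latin-1.
raw = "caf\u00e9,na\u00efve\n".encode("latin-1")

garbled = preview_with_encoding(raw, "utf-8")    # wrong encoding: replacement characters appear
correct = preview_with_encoding(raw, "latin-1")  # matching encoding: text reads as expected
print(repr(garbled))
print(repr(correct))
```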

Detect structure: By default, Trifacta Wrangler Pro attempts to interpret the structure of your data during import. This structuring attempts to apply an initial tabular structure to the dataset.

  • Unless you have specific problems with the initial structure, you should leave the Detect structure setting enabled. Recipes created from these imported datasets automatically include the structuring as the first, hidden steps. These steps are not available for editing, although you can remove them through the Recipe panel. See Recipe Panel.
  • When Detect structure is disabled, imported datasets whose schema has not been detected are labeled as unstructured datasets. When recipes are created for these unstructured datasets, the structuring steps are added into the recipe and can be edited as needed.
  • For more information, see Initial Parsing Steps.

Remove special characters from column names: When selected, characters that are not alphanumeric or underscores are stripped, and space characters are converted to underscores.

For more information, see Sanitize Column Names.
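
The rule described above can be sketched in a few lines. The function name `sanitize_column_name` is hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits, which this section does not specify.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Convert spaces to underscores, then strip other special characters."""
    with_underscores = name.replace(" ", "_")
    # Assumption: only ASCII letters, digits, and underscores are kept.
    return re.sub(r"[^0-9A-Za-z_]", "", with_underscores)


print(sanitize_column_name("Order ID (USD)"))
print(sanitize_column_name("unit-price%"))
```

Spaces are converted before stripping so that they survive as underscores instead of being removed with the other special characters.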

Infer column data types: (table only) You can choose whether or not to apply Trifacta Wrangler Pro type inference to table data imported from a database.

  • In the preview panel, you can see the data type that is to be applied after the dataset is imported. This data type may change depending on whether column data type inference is enabled or disabled for the dataset.
  • To enable Trifacta Wrangler Pro type inference, select the Infer column data types checkbox.

    Tip: To see the effects of Trifacta Wrangler Pro type inference, you can toggle the checkbox and review the data type listed at the top of individual columns. To override an individual column's data type, click the data type name and select a new value.

You can configure the default use of type inference at the individual connection level. For more information, see Create Connection Window.


Selecting Column Headers: (file only) You can apply the column headers to your datasets during import. Select the required option from the drop-down list:

  • Infer Header: (default) When selected, the Trifacta application infers the header based on the data in the import. 
  • Use first row as header: When selected, the first row is used as the column headers.

  • No Header: When selected, header inference is skipped and columns are assigned generic names.
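
The difference between the Use first row as header and No Header options can be sketched as follows. The function `apply_header_option`, the option strings, and the generic name format `column1`, `column2`, and so on are hypothetical; the actual generic names used by the product are not stated in this section. The Infer Header option is omitted because its inference logic is not described here.

```python
from typing import List, Tuple

Row = List[str]


def apply_header_option(rows: List[Row], option: str) -> Tuple[List[str], List[Row]]:
    """Return (headers, data_rows) for the chosen header option."""
    if option == "first_row":
        if not rows:
            raise ValueError("no rows available to use as a header")
        # The first row becomes the header and is removed from the data.
        return list(rows[0]), rows[1:]
    if option == "none":
        # Every row is data; columns receive generic names (format assumed).
        width = max((len(r) for r in rows), default=0)
        return [f"column{i}" for i in range(1, width + 1)], rows
    raise ValueError(f"unknown option: {option}")


rows = [["id", "name"], ["1", "Ada"], ["2", "Grace"]]
print(apply_header_option(rows, "first_row"))
print(apply_header_option(rows, "none"))
```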

If replacing a file: 

  • If you replace a dataset in a flow and select the Use first row as header option, then the existing header row labels are updated with the new headers.
  • Subsequent steps in a pre-existing recipe may be broken if the headers are changed by a replaced file.

Tip: After the dataset is imported, you can rename columns manually or using any row in the dataset. For more information, see Rename Columns.

4. Import selections

Single dataset

If you have selected a single dataset for import:

Tip: If present, you can click the Add to new flow checkbox, which adds the imported datasets to an Untitled flow. For more information, see Flow View Page.

  • Click Continue. The dataset is imported. 
  • A recipe is created for it, added to a new flow, and loaded in the Transformer page for wrangling. See Transformer Page.

Multiple datasets

You can import multiple datasets from multiple sources at the same time. In the Import Data page, continue selecting sources, and additional dataset cards are added to the right panel.

NOTE: If you are importing from multiple files at the same time, the files are not necessarily read in a regular or predictable order.


NOTE: When you import a dataset with parameters from multiple files, only the first matching file is displayed in the right panel.

In the right panel, you can see a preview of each dataset and make changes as needed.

Figure: Import Multiple Datasets


If you have selected multiple datasets for import:

Tip: If present, you can click the Add to new flow checkbox, which adds the imported datasets to an Untitled flow. For more information, see Flow View Page.

  • To import the selected datasets, click Continue.

    • To begin transforming one of these datasets in Flow View, select it. From its context menu, select Add new recipe. Select the recipe. In the context panel on the right, select Edit Recipe. See Transformer Page.

  • To remove a dataset from import, click the X in the dataset card.
