You can directly import Adobe® Acrobat® PDF files containing one or more tables. The tables in a PDF can be imported as:
A dataset with parameters
NOTE: When importing as a parameterized dataset, all selected tables are imported into a single dataset.
PDF files can be uploaded from your local system. If the platform is connected to a backend file storage system, you can also import PDF files stored in readable directories.
PDF ingest is limited to 100 MB per file.
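The size limit can be checked locally before you attempt an upload. The following is a minimal sketch, not part of the product: the function name `is_within_ingest_limit` is hypothetical, and it assumes that "100 MB" means 100,000,000 bytes (the more conservative reading, since the documentation does not say whether the limit is decimal or binary).

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1000 * 1000 bytes. The source
# does not state whether the limit is decimal or binary megabytes, so the
# smaller (decimal) value is used to stay on the safe side.
MAX_PDF_BYTES = 100 * 1000 * 1000


def is_within_ingest_limit(path: str, limit_bytes: int = MAX_PDF_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `limit_bytes`."""
    return os.path.getsize(path) <= limit_bytes
```

For example, calling `is_within_ingest_limit("report.pdf")` on a local file returns `False` when the file is too large to ingest under this assumed reading of the limit.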
Filepath and source row number information is not available from original PDF files. These references return values from the CSV files that have been converted on the backend. For more information, see Source Metadata References.
The PDF file format is a publishing format designed around visual layout of information, some of which may include tabular data. Table data in PDF files must be detected and converted into CSV data for proper ingestion in the platform. This ingest process occurs on the backend datastore.
To facilitate ingestion, the following requirements must be met for tables in your source PDF files:
When a table spans multiple pages, it is ingested as two separate CSV files, which can be combined later.
If a file contains multiple tables, each table is converted as a separate dataset.
Tip: After import, separate datasets can be unioned together or integrated as a dataset with parameters.
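To illustrate what combining the separately converted tables amounts to, here is a minimal sketch outside the product. The function `union_csv_texts` is hypothetical, and it assumes that every converted CSV begins with the same header row; if a continuation page does not repeat the header, its first data row would be dropped by this logic.

```python
import csv
import io
from typing import Iterable, List


def union_csv_texts(csv_texts: Iterable[str]) -> List[List[str]]:
    """Concatenate rows from several CSV texts that share one header row.

    The header is kept from the first non-empty text. The first row of every
    later text is assumed to be a repeated header and is skipped.
    """
    combined: List[List[str]] = []
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # nothing to add from an empty CSV
        if not combined:
            combined.extend(rows)      # keep header + data from the first table
        else:
            combined.extend(rows[1:])  # drop the repeated header
    return combined
```

Within the platform itself, the equivalent result is achieved with a union or a dataset with parameters, as described in the tip above.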
In the Library page, click Import Data. Select the connection to use. See Import Data Page.
Import PDF file containing multiple pages
By default, all pages in the PDF are imported as individual datasets. To change how the data is imported, click Edit in the right panel.
Import settings for PDF datasets
Selected tables into 1 dataset: All selected tables in the PDF are combined and imported as a single dataset.
NOTE: The schema of each dataset must match. Columns must be listed in the same order in each dataset. The column headers are taken from the first selected dataset.
All and future tables into 1 dataset: Select this option if the PDF is updated periodically with new tables that you want to include. After you make the initial selection of tables, any pages added to the PDF file later are automatically included in the imported dataset.
NOTE: This option is available only if you are connected to a backend file storage system.
NOTE: When an imported dataset based on this option is first loaded into the Transformer page, the data grid displays an initial sample taken from rows in the first table only. When you take another sample from the Samples panel, data is collected from other tables. For more information, see Samples Panel.
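The schema requirement for combining tables into one dataset (matching columns, in the same order, with headers taken from the first selected table) can be expressed as a simple pre-check. This is a sketch under stated assumptions, not the platform's own validation: the function `find_header_mismatches` and its input format (a mapping from a table label to its list of column headers) are hypothetical.

```python
from typing import Dict, List


def find_header_mismatches(
    headers_by_table: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return the tables whose header differs from the first table's header.

    The comparison is order-sensitive: the same column names in a different
    order count as a mismatch. The first entry in the mapping is treated as
    the reference, mirroring "headers are taken from the first selected
    dataset".
    """
    tables = list(headers_by_table.items())  # dicts preserve insertion order
    if not tables:
        return {}
    _, reference = tables[0]
    return {name: header for name, header in tables[1:] if header != reference}
```

An empty result means every table matches the first one; any entry returned identifies a table that would need its columns renamed or reordered before being combined.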
You can select the tables to import. A table can be a single page, or a single table among multiple on a page.
NOTE: If you are importing a folder of PDF files, data preview and initial sampling are executed against the first file found in the folder.
To preview the data of an individual table, mouse over a dataset and click Jump to.
See Import Data Page.