Comment: Published by Scroll Versions from space DEV and version r092



This section provides an overview of data import and how different types of import are handled in D s product.

How to Import

You import data for use in the D s webapp through a reference object called an imported dataset. An imported dataset is a reference to the source of the data.


NOTE: The source data is never modified. In some cases, the source data may be copied to the base storage layer. For example, data that is uploaded from your local desktop must be copied to the base storage layer so that it is accessible to you and potentially other users of the D s webapp.


  1. In the D s webapp, click the Library icon in the left navigation bar.
  2. In the Library, click Import Data.
  3. The Import Data page is displayed.
    1. Select the connection in the left navigation bar through which you can access the data.
    2. For more information, see Import Data Page.

After the data has been imported, you can reference it within the application as an imported dataset. For more information, see Import Basics.

Types of Import

You can import datasets or select datasets from sources that are stored on file-based storage, connected datastores, or your desktop. Following are the different types of import that you can perform in the Import Data page.


You can upload a variety of flat file formats from your local desktop. You can upload a file up to 1 GB in size.

  • You can drag and drop files from your desktop to upload them.
  • You can select multiple files in the same directory for uploading at the same time. 
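
As a rough illustration of the size limit above, the following sketch checks a batch of local files before an upload is attempted. The function name and the interpretation of 1 GB as 1024^3 bytes are assumptions made for this example; the product's own upload check is not documented here.

```python
import os

# Assumption: "1 GB" is read as 1024**3 bytes for this sketch.
MAX_UPLOAD_BYTES = 1024 ** 3


def split_by_upload_limit(paths, max_bytes=MAX_UPLOAD_BYTES):
    """Return (ok, too_large): local paths grouped by whether they fit the limit."""
    ok, too_large = [], []
    for path in paths:
        size = os.path.getsize(path)  # size of the local file in bytes
        if size <= max_bytes:
            ok.append(path)
        else:
            too_large.append(path)
    return ok, too_large
```

Running the check on every selected file before the upload starts lets you report all oversized files at once instead of failing on the first one.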

Import of files

D s product supports multiple storage environments. You can import one or more files from any backend data storage system. Each workspace has a default backend storage environment. Each user should be able to import files that are stored in accessible locations in this backend storage area.


NOTE: You must have read permissions for these storage environments to import the file. These permissions should be set up during initial configuration of the product. For more information, please contact your administrator.


NOTE: During import, the D s product identifies file formats based on the extension of the filename.

Import of tables

You can import one or more tables from relational datastores. Through the Import Data page, you can select or create the appropriate connection to the datastore, navigate to the required database, and select the tables to be imported.


NOTE: You must have read permissions for any database from which you want to import. These permissions must be enabled by a database administrator outside of the product. For more information, see  Using Databases.

Imported Datasets

When you import a file or a table, the data that is imported to the platform is referenced as an imported dataset. An imported dataset is simply a reference to the original data. An imported dataset can be a reference to a file, multiple files, database table, or other type of data. 



D s product does not modify the source data. It is only referenced as an imported dataset.


NOTE: The imported dataset may be broken if the path or the permissions change for the underlying dataset.
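
To make the idea of a reference concrete, here is a minimal sketch that models an imported dataset as a pointer to a single local file and reports it as broken when the path no longer exists or is no longer readable. The class name, its fields, and the restriction to local files are all assumptions for illustration; the product's imported datasets can also reference multiple files or database tables, which this sketch does not cover.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportedDatasetRef:
    """Hypothetical reference to source data; holds no copy of the data itself."""
    name: str
    source_path: str

    def is_broken(self):
        # The reference is broken if the file moved or read permission was lost.
        exists = os.path.isfile(self.source_path)
        readable = exists and os.access(self.source_path, os.R_OK)
        return not readable
```

Because the object stores only a path, nothing in it can alter the source, which mirrors the statement that the source data is never modified.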

Persisted Data

In general, the D s webapp does not retain data longer than it is explicitly needed. For example, when jobs are executed on D s photon, the source data is streamed to the D s node and transformed, after which results are written. The transformed data is not maintained in the D s node.


NOTE: Data is not persisted on the D s node.

More information on persisted data is available below.


Samples can be generated within the product through the Samples panel. When a sample is created, it is stored within your storage directory on the backend datastore. You can create a new sample at any time. 

  • If the source data is larger than 10 MB in size, an initial sample is automatically generated when the recipe is first loaded in the Transformer page. This sample contains the first set of rows in the dataset up to 10 MB in size. 
  • If the source data contains multiple files, the initial sample for the dataset is generated from the first set of rows in the first file listed in the directory.
  • If the source data is a multi-sheet Excel file, the sample is taken from the first sheet in the file.
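
The "first set of rows up to 10 MB" rule can be sketched as follows for a single text file. This is only an approximation under stated assumptions: the function name is hypothetical, rows are taken to be newline-delimited, and 10 MB is read as 10 * 1024 * 1024 bytes. How the product actually computes the initial sample is not specified here.

```python
# Assumption: "10 MB" is read as 10 * 1024 * 1024 bytes for this sketch.
SAMPLE_LIMIT_BYTES = 10 * 1024 * 1024


def initial_sample(path, limit=SAMPLE_LIMIT_BYTES):
    """Return the leading rows of a file whose combined size stays within limit."""
    rows, used = [], 0
    with open(path, "rb") as handle:
        for line in handle:
            # Stop before the row that would push the sample over the limit.
            if used + len(line) > limit:
                break
            rows.append(line)
            used += len(line)
    return rows
```

Reading line by line means the whole source never has to be loaded into memory to produce the sample.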

For more information on creating samples, see  Overview of Sampling.


For some file types, the D s webapp must convert the source data into a format that is natively supported by the product. This process happens as part of the importing of data for use in the D s webapp and is managed by the conversion service in the platform. In such scenarios, the data is read from the source and passed through the conversion service, which understands how to read the source format and can write it to a supported text format. This text version of the source data is written to the base storage layer.

For example, when a transformation job is executed, the original source data is passed through the conversion service, and the converted data is used for job execution. When the job results are written, the conversion service removes the converted data.

During import, the D s webapp identifies file formats based on the extension of the filename. The conversion process applies to the following types of files:

  • XLS and PDF: These file types are stored in a proprietary binary format. The conversion service uses a set of libraries to convert files of these types to tabular CSV data and stores the converted files in the base storage layer.
  • JSON: Passing JSON files through the conversion service provides considerable improvements in quality and performance during ingestion of JSON data.
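
Extension-based detection of the file types listed above might look like the sketch below. The set of extensions and the function name are assumptions drawn only from the three types named in this section; the full list of formats that the conversion service handles is in Supported File Formats.

```python
from pathlib import PurePath

# Assumption: only the three types named in this section, as lowercase extensions.
CONVERTED_EXTENSIONS = {".xls", ".pdf", ".json"}


def needs_conversion(filename):
    """Return True if the filename's extension marks it for the conversion step."""
    extension = PurePath(filename).suffix.lower()  # "" when there is no extension
    return extension in CONVERTED_EXTENSIONS
```

Because the decision rests on the extension alone, a file with a missing or misleading extension would be classified incorrectly, which is why consistent file naming matters at import time.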

For more information, see Supported File Formats.


Caching refers to the process of ingesting and storing data sources in a temporary backend location for a specific period of time in order to perform any additional operations in a faster way. 

Instead of reloading the source each time that an object is referenced, the D s webapp checks whether a cached version exists and whether that version is still valid. If both checks pass, the D s webapp pulls data from the local cache instead of the source.
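
The check-then-reuse behavior described above resembles a time-limited cache. The sketch below is a generic illustration, not the product's implementation: the class name, the use of a fixed time-to-live as the validity rule, and the in-memory dictionary are all assumptions.

```python
import time


class SourceCache:
    """Hypothetical cache that reuses loaded data until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, data)

    def get(self, key, load_source):
        entry = self._entries.get(key)
        now = self._clock()
        # Reuse the cached copy only if one exists and it has not expired.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        data = load_source()  # fall back to reloading the source
        self._entries[key] = (now, data)
        return data
```

The `clock` parameter exists only so that expiry can be exercised without waiting; in ordinary use the default monotonic clock is sufficient.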


Tip: Cached objects can be referenced later for faster performance on tasks such as sampling and job execution.

For more information, see Configure Data Source Caching.

Sharing Imported Data

You cannot share an imported dataset as an object; however, you can share connections. If a user has permissions on a dataset that is available through a shared connection, then the imported dataset is accessible to that user.


NOTE: The shared user should have the connection credentials to access the imported dataset.

For more information, see Overview of Sharing.

For more information, see Share a Connection.
