Info

NOTE: For file-based sources, the product expects that each row of data in the import file is terminated with a consistent newline character, including the last row in the file.

  • For single files lacking this final newline character, the final record may be dropped.
  • For multi-file imports lacking a newline in the final record of a file, this final record may be merged with the first one in the next file and then dropped depending on your running environment.
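The trailing-newline requirement above can be checked before import. The following is a minimal sketch, not part of the product: the helper name `ensure_trailing_newline` is hypothetical, and it assumes that the file uses `\n` or `\r\n` line endings, so that a correctly terminated file always ends with the byte `\n`.

```python
import os


def ensure_trailing_newline(path, newline=b"\n"):
    """Append `newline` if the file does not already end with one.

    Hypothetical helper for illustration. Returns True if the file
    was modified, False if it was empty or already terminated.
    """
    if os.path.getsize(path) == 0:
        return False  # nothing to terminate in an empty file
    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)        # inspect only the last byte
        if f.read(1) == b"\n":         # covers both \n and \r\n endings
            return False
        f.seek(0, os.SEEK_END)         # reposition before switching to write
        f.write(newline)
        return True
```

For multi-file imports, you could run this helper over each file in the source directory before importing. If your files use `\r\n` line endings, pass `newline=b"\r\n"` so that the appended terminator stays consistent with the rest of the file.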


  1. Connect to the source of your data:

    Info

    NOTE: Compressed files are recognized and can be imported based on their file extensions.

     

    1.  

      Upload: The product can also load files from your local file system.

      Tip

      Tip: You can drag and drop files from your desktop to upload them.


    2.  

      HDFS: If connected to a Hadoop instance, you can select file(s) or folders to import. See HDFS Browser.

      S3: If connected to an S3 instance, you can browse your S3 buckets to select source files. See S3 Browser.

      Tip

      Tip: For HDFS and S3, you can select folders, which selects each file within the directory as a separate dataset.

      Redshift: If connected to a Redshift data warehouse, you can import source data from the connected database. See Redshift Browser.

      Hive: If connected to a Hive instance, you can load datasets from individual tables within the set of Hive databases. See Hive Browser.

      Alation: If connected to Alation, you can search for and import Hive tables as imported datasets. For more information, see Using Alation.

      Waterline: If connected to Waterline, you can search for and import datasets through the data catalog. For more information, see Using Waterline.

      Databases: If connected to a relational datastore, you can load tables or views from your database. See Database Browser.

      WASB: If enabled, you can import data into your Azure deployment from WASB. For more information, see WASB Browser.

      ADL: If enabled, you can import data into your Azure deployment from ADLS. The ADLS browser is very similar to the one for HDFS. See HDFS Browser.

       

    3. For more information on the supported input formats, see Supported File Formats.

  2.  

    New/Edit: Click to create or edit a connection.

    Search: Enter a search term to locate a specific connection.

    Info

    NOTE: This feature may be disabled in your environment. For more information, contact your administrator.

    See Create Connection Window.

     

  3. Add datasets:
    1.  

      When you have found your source directory or file, click the Plus icon next to its name to add it as a dataset. 

      Tip

      Tip: You can import multiple datasets at the same time. See below.


    2. Excel files: Click the Plus icon next to the parent workbook to add all of the worksheets as a single dataset, or you can add individual sheets as individual datasets.

      Tip

      Tip: If you experience issues uploading XLS/XLSX files that are larger than 35MB, you can convert the files to CSV files and then upload them.

      See Import Excel Data.

    3.  

      If custom SQL query is enabled, you can click Create Dataset with SQL to enter a customized SQL statement to pre-filter the relational or Hive table within the database to include only the rows and columns of interest.

      Warning

      Through this interface, it is possible to enter SQL statements that can delete data, change table schemas, or otherwise corrupt the targeted database. Please use this feature with caution.

      For more information, see Create Dataset with SQL.

      This feature must be enabled. See Enable Custom SQL Query.

       

    4.  

      If parameterization has been enabled, you can apply parameters to the source paths of your datasets to capture a wider set of sources. Click Create Dataset with Parameters.

      See Create Dataset with Parameters.

      This feature must be enabled. For more information, see Overview of Parameterization.

       

  4. When a dataset has been selected, the following fields appear on the right side of the screen. Modify as needed:
    1. Dataset Name: This name appears in the interface. 
    2. Dataset Description: You may add an optional description that provides additional detail about the dataset. This information is visible in some areas of the interface.

      Tip

      Tip: Click the Eye icon to inspect the contents of the dataset prior to importing.


  5. You can select a single dataset or multiple datasets for import. 

  6. You can modify settings used during import for individual files. In the card for an individual dataset, click Edit Settings.

    Info

    NOTE: In some cases, there may be discrepancies between row counts in the previewed data versus the data grid after the dataset has been imported, due to rounding in row counts performed in the preview.

    1. Per-file encoding: By default, the product attempts to interpret the encoding used in the file. In some cases, the data preview panel may contain garbled data due to a mismatch in encodings. In the Data Preview dialog, you can select a different encoding for the file. When the correct encoding is selected, the preview displays the data as expected. For more information on supported encodings, see Configure Global File Encoding Type.

    2. Detect structure: By default, the product attempts to interpret the structure of your data during import. This structuring attempts to apply an initial tabular structure to the dataset.
      1. Unless you have specific problems with the initial structure, you should leave the Detect structure setting enabled. Recipes created from these imported datasets automatically include the structuring as the first, hidden steps. These steps are not available for editing, although you can remove them through the Recipe panel. See Recipe Panel.
      2. When Detect structure is disabled, imported datasets whose schema has not been detected are labeled as raw datasets. When recipes are created for these raw datasets, the structuring steps are added into the recipe and can be edited as needed.
      3. For more information, see Initial Parsing Steps.
    3.  

      Column data type inference: You can choose whether or not to apply type inference to your individual dataset.

      1. In the preview panel, you can see the data type that is to be applied after the dataset is imported. This data type may change depending on whether column data type inference is enabled or disabled for the dataset.

      2. To enable type inference, select the Column Data Type Inference checkbox.

        Tip

        Tip: To see the effects of type inference, you can toggle the checkbox and review the data type listed at the top of individual columns. To override an individual column's data type, click the data type name and select a new value.


      3. You can configure the default use of type inference at the individual connection level. For more information, see Create Connection Window.

        For schematized sources that do not require connections, such as uploaded Avro files, the default setting is determined by the global setting for initial type inference. For more information, see Configure Type Inference.

       

  7. If you have selected a single dataset for import:

    1. To immediately wrangle it, click Import & Wrangle. The dataset is imported. A recipe is created for it, added to a flow, and loaded in the Transformer page for wrangling. See Transformer Page.
    2. To import the dataset, click Import. The imported dataset is created. You can add it to a flow and create a recipe for it later. See Library Page.
  8. If you have selected multiple datasets for import:
    1. To import the selected datasets, click Import Datasets. The imported datasets are created. You can begin working with these imported datasets now or at a later time. 
    2. To import the selected datasets and add them to a flow:
      1. Click the Add Dataset to a Flow checkbox. 
      2. Click the textbox to see the available flows, or start typing a new name. 
      3. Click Import & Add to Flow.
      4. The datasets are imported, and the associated recipes are created. These datasets and recipes are added to the selected flow. 
      5. For any dataset that has been added to a flow, you can review and perform actions on it. See Flow View Page.
  9. If you are not wrangling the datasets immediately, the datasets you just imported are listed at the top of the Library page. See Library Page.
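If the data preview shows garbled text and you are unsure which per-file encoding to select in step 6, a quick check outside the product can narrow the options. The sketch below is illustrative only and does not reflect how the product detects encodings. The function name `first_matching_encoding` and the candidate list are assumptions, and the names are Python codec names, which may differ from the labels shown in the encoding drop-down.

```python
def first_matching_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate codec that decodes the file without error.

    Hypothetical helper for illustration. The whole file is read into
    memory, so very large files would need a sampling strategy instead.
    Order matters: latin-1 accepts any byte sequence, so it is listed last.
    """
    with open(path, "rb") as f:
        data = f.read()
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding; try the next
        return encoding
    return None
```

A successful decode only shows that the bytes are valid in that encoding, not that the text is correct, so confirm the result in the data preview after selecting the encoding.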
