...

  • Pass in parameterized values through the API to operationalize the execution of jobs across weeks of transaction data.

In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:



Code Block
<file_system>:///source/transactions/YYYY/MM/DD/transactions.csv
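To make the date parameters concrete, the following minimal sketch expands the YYYY/MM/DD portion of the path above into one path per day. It only illustrates what the dynamic path resolves to and is not the product's own parameter engine. The names `TEMPLATE` and `daily_paths` are hypothetical, and `<file_system>` is kept as a placeholder.

```python
from datetime import date, timedelta

# Hypothetical template mirroring the dynamic path above.
# "<file_system>" is left as a literal placeholder on purpose.
TEMPLATE = "<file_system>:///source/transactions/{d:%Y/%m/%d}/transactions.csv"


def daily_paths(start: date, days: int) -> list[str]:
    """Return one concrete path per day, beginning at `start`."""
    return [TEMPLATE.format(d=start + timedelta(days=i)) for i in range(days)]


if __name__ == "__main__":
    # Print the paths for four consecutive days of transaction data.
    for path in daily_paths(date(2024, 2, 27), 4):
        print(path)
```

Zero-padded month and day values keep every generated path the same length, which matches the two-digit MM and DD positions in the pattern.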

...

For more information, see Create Dataset with Parameters.

Mismatched Schemas

The product expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

If schemas do not match:

  • When the first dataset contains extra columns at the end, the subsequent datasets that match should import without issues.
  • If the subsequent datasets contain extra columns at the end, the datasets may import. Depending on the situation, there may be issues.
  • If the subsequent datasets have additional or missing columns in the middle of the dataset, results of the import are unpredictable.
    • If there are extra columns in the middle of the dataset, you may see extra data in the final column, in which the spill-over data has not been split.
  • Ideally, you should fix these issues in the source of the data. But if you cannot, you can try the following:

Tips:

  • After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:
    • Check the last column of your imported dataset to see if it contains multiple columns of data. If so, see if you can split the columns yourself.
    • Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas. 
  • Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.
    • If your sources have variation in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Overview of Parameterization.
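Before import, it can also help to compare the header rows of the source files outside the product. The following is a minimal sketch, assuming CSV sources with a header row. It classifies each file's header against the first one, mirroring the cases listed above. The function names and the sample data are hypothetical.

```python
import csv
import io


def read_header(text: str) -> list[str]:
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(text)))


def compare_to_first(headers: dict[str, list[str]]) -> dict[str, str]:
    """Classify every header against the first one (the reference schema)."""
    names = list(headers)
    base = headers[names[0]]
    report = {}
    for name in names[1:]:
        cols = headers[name]
        if cols == base:
            report[name] = "match"
        elif cols[:len(base)] == base:
            report[name] = "extra columns at end"
        elif base[:len(cols)] == cols:
            report[name] = "missing columns at end"
        else:
            report[name] = "differs in the middle"
    return report


# Hypothetical in-memory samples standing in for the matched source files.
sources = {
    "day1": "id,amount,store\n1,9.50,A\n",
    "day2": "id,amount,store,notes\n2,3.25,B,refund\n",
    "day3": "id,region,amount,store\n3,west,7.00,C\n",
}
report = compare_to_first({k: read_header(v) for k, v in sources.items()})
for name, status in report.items():
    print(name, status)
```

Files reported as differing in the middle are the ones most likely to produce unpredictable results, so they are the first candidates for fixing at the source.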

Limitations

  • You cannot create datasets with parameters from uploaded data.
  • You cannot create datasets with parameters from multiple file types.
    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
    • You cannot create parameters across text and binary file types.
  • You cannot apply parameters to write or publishing operations.
  • For regular expression patterns, the following reference types are not supported because they take too long to evaluate:
    • Backreferences. The following example matches on axa, bxb, and cxc yet generates an error:

      Code Block
      ([a-c])x\1


    • Lookahead assertions. The following example matches on a, but only when it is immediately followed by b. It generates an error:

      Code Block
      a(?=b)


  • For some source file types, such as Parquet, the schemas between source files must match exactly.
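To show what the two unsupported regular expression constructs in the list above mean, the following sketch runs the same patterns in Python's `re` module, which does support them. It only demonstrates the matching semantics. In dataset parameters, both patterns generate an error as described above. The helper name `full_matches` is hypothetical.

```python
import re

# Backreference: \1 must repeat whatever the first group captured.
BACKREF = re.compile(r"([a-c])x\1")

# Lookahead: match "a" only when "b" follows, without consuming the "b".
LOOKAHEAD = re.compile(r"a(?=b)")


def full_matches(pattern: re.Pattern, candidates: list[str]) -> list[str]:
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if pattern.fullmatch(s)]


# Only strings whose first and last letters are identical pass the backreference.
print(full_matches(BACKREF, ["axa", "bxb", "cxc", "axb", "dxd"]))

# Positions of each "a" that is directly followed by "b".
print([m.start() for m in LOOKAHEAD.finditer("ab ac ab")])
```

When a backreference is not available, one possible workaround is to enumerate the alternatives explicitly, for example `axa|bxb|cxc`, although this is practical only for small character sets.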

...