Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0682

...

Mismatched Schemas

D s product
rtrue
 expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

...

  • After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:
    • Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.
    • Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas. 
  • Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.

Limitations

  • You cannot create datasets with parameters from uploaded data.
  • You cannot create dataset with parameters from multiple file types.
    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
    • You cannot create parameters across text and binary file types.
  • You cannot apply parameters to write or publishing operations.
  • Parameter and variable names can be up to 255 characters in length.
  • For regular expression patterns, the following reference types are not supported due to the length of time to evaluate:
    • Backreferences. The following example matches on axa, bxb, and cxc yet generates an error:

      Code Block
      ([a-c])x\1
    • Lookahead assertions: The following example matches on a, but only when it is part of an ab pattern. It generates an error:

      Code Block
      a(?=b)
  • For some source file types, such as Parquet, the schemas between source files must match exactly.

...