Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

...

Tip

Tip: You can use the $filepath and $sourcerownumber to create a primary key to identify source information for any row in your file-based datasets.

$filepath

This reference retrieves the fully qualified path for a row of data sourced from a file. As you are working with a dataset in the application, it can be helpful to know where the file from which each row of data originated. Using the $filepath function, you can generate columns of data early in your recipe to retain this useful information.

The following transforms might make file path information invalid or otherwise unavailable:

  • pivot
  • join
  • unnest
Info

NOTE: This reference returns the file path. It does not include the scheme or authority information from the URI. So, protocol identifiers such as http:// are not available in the output.

Info

NOTE: For Avro sources, if you imported a folder containing multiple Avro files, the folder path is returned. The full path is returned for single Avro files and for other multi-file datasets.

Supported File Formats

Base file formats:

...

File format

...

Supported?

...

Notes

...

Full path to the converted CSV file on backend datastore is returned, which is a temporary location. Original XLSX file path is not available.

...

Limitations

  • The FILEPATH function produces a null value for any samples collected before the feature was enabled, since the information was not available. To see that lineage information, you must switch to the initial sample or collect a new sample.
  • After this function is enabled, non-initial samples collected in the future are slightly smaller in size, due to the space consumed by the filepath lineage information that is tracked as part of the sample. You may see a change in the number of rows in your sample.

Example:

The following example generates a column containing the filepath information for each row in the dataset:

D code

derive type: single value: $filepath as: 'src_filepath'

You can use the following additional steps to extract the filename from the above src_filepath column:

D code

derive type: single value: RIGHTFIND(src_filepath, '\/', false, 0) as: 'rightfind__src_filepath'

D code

derive type: single value: SUBSTRING(src_filepath, rightfind_src_filepath + 1, LEN(src_filepath)) as: 'filename'

D code

drop col: rightfind_src_filepath action: Drop

$sourcerownumber

...

$sourcerownumber

The $sourcerownumber variable is a reference to the row number in which the current row originally appeared in the source of the data.

Tip

Tip: If the source row information is still available, you can hover over the left side of a row in the data grid to see the source row number in the original source data.


Limitations:

  • The following transforms might make original row information invalid or otherwise unavailable. In these cases, the reference returns null values:
    • pivot
    • flatten
    • join
    • lookup
    • union
    • unnest
    • unpivot
Info

NOTE: If the dataset is sourced from multiple files, a predictable original source row number cannot be guaranteed, and null values are returned.

...



Example:

The following example generates a new column containing the source row number for each row in the dataset, if available: 

d-

...

trans
RawWrangletrue
p03Value'src_rownumber'
Typestep
WrangleTextderive type: single value: $sourcerownumber as: 'src_rownumber'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value$sourcerownumber
p03NameNew column name
SearchTermNew formula

If you have already used the $filepath reference, as in the previous example, you can combine these two columns to create a unique key to the source of each row:

d-

...

trans
RawWrangletrue
p03Value'src_key'
Typestep
WrangleTextderive type: single value: MERGE([src_filename,src_rownumber],'-') as: 'src_key'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueMERGE([src_filename,src_rownumber],'-')
p03NameNew column name
SearchTermNew formula

$col

The $col variable is a reference to the column that is currently being evaluated. This variable references the state of the current dataset, instead of the original source.

Info

NOTE: This reference works only for the edit with formula transformation (set transform).

In the following example, all columns in the dataset that are of String type are converted to uppercase:

d-

...

trans
RawWrangletrue
Typestep
WrangleTextset col: * value: IF(ISMISMATCHED($col, ['String']), $col, UPPER($col))
p01NameColumns
p01ValueAll
p02NameFormula
p02ValueIF(ISMISMATCHED($col, ['String']), $col, UPPER($col))
SearchTermEdit column with formula

In the above, the wildcard applies the edit to each column. Each column is tested to see if it is mismatched with the String data type. If mismatched, the value in the column ($col) is written. Otherwise, the value in the column is converted to uppercase (UPPER($col)).

...