Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0682

D toc

D s lang
 supports a set of variables, which can be used to programmatically reference aspects of the dataset or its source data. These metadata references allow you to create individual transformations of much greater scope and flexibility.

Tip

Tip: Some transformation steps make access to metadata about the original data source impossible to retain. It's best to use these references, where possible, early in your recipe. Additional information is available below.

Tip

Tip: You can use the $filepath and $sourcerownumber to create a primary key to identify source information for any row in your file-based datasets.

$filepath

This reference retrieves the fully qualified path for a row of data sourced from a file. As you are working with a dataset in the application, it can be helpful to know where the file from which each row of data originated. Using the $filepath function, you can generate columns of data early in your recipe to retain this useful information.

The following transforms might make file path information invalid or otherwise unavailable:

  • pivot
  • join
  • unnest
  • deduplicate
Info

NOTE: This reference returns null values for values from relational database sources.

Info

NOTE: This reference returns the file path. It does not include the scheme or authority information from the URI. So, protocol identifiers such as http:// are not available in the output.

Supported File Formats

Base file formats:

File format
Supported?
Notes
CSVYes
JSONYes 
ExcelLimited

Full path to the source location of the Excel file.

  • For uploaded files, this path is to the location in the base storage layer.
  • If the file contains multiple worksheets, this value includes sheet names. Example: /path/to/my/Excel/file.xlsx/Sheet1
Compressed files (Gzip, Bzip2, etc)LimitedSupport for single-file archives only. Full path is returned only if the archive contains a single file.
foldersYesFull path to the file is returned. You can modify the output column to return the folder path only.

Additional file formats:

File format

Supported?

Notes

AvroYes 
ParquetYes 

Other source types:

Format
Supported?
Notes
Relational tablesNo 

Limitations

  • The FILEPATH function produces a null value for any samples collected before the feature was enabled, since the information was not available. To see that lineage information, you must switch to the initial sample or collect a new sample.
  • After this function is enabled, non-initial samples collected in the future are slightly smaller in size, due to the space consumed by the filepath lineage information that is tracked as part of the sample. You may see a change in the number of rows in your sample.

Example 1 - Generate filename column

The following example generates a column containing the filepath information for each row in the dataset:

D trans
RawWrangletrue
p03Value'src_filepath'
Typestep
WrangleTextderive type: single value: $filepath as: 'src_filepath'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value$filepath
p03NameNew column name
SearchTermNew formula

You can use the following additional steps to extract the filename from the above src_filepath column:

D trans
RawWrangletrue
p03Valuerightfind__src_filepath
Typestep
WrangleTextderive type: single value: RIGHTFIND(src_filepath, '\/', false, 0) as: 'rightfind__src_filepath'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueRIGHTFIND(src_filepath, '\/', false, 0)
p03NameNew column name
SearchTermNew formula

D trans
RawWrangletrue
p03Valuefilename
Typestep
WrangleTextderive type: single value: SUBSTRING(src_filepath, rightfind_src_filepath + 1, LEN(src_filepath)) as: 'filename'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueSUBSTRING(src_filepath, rightfind_src_filepath + 1, LEN(src_filepath))
p03NameNew column name
SearchTermNew formula

D trans
RawWrangletrue
Typestep
WrangleTextdrop col: rightfind_src_filepath action: Drop
p01NameColumns
p01Valuerightfind_src_filepath
p02NameAction
p02ValueDelete selected columns
SearchTermDelete columns

Example 2 - Source row number across dataset with parameters

When you import a dataset with parameters, the $sourcerownumber value returns a continuously incrementing row number across all files in the dataset, effectively creating a primary key. Using the following example, you can create a new column to capture the source row number within individual files.

Source:

Here is some example data spread across three files after import using a single dataset with parameters.

column2column3
line1-col1line1-col2
line2-col1line2-col2
line3-col1line3-col2
line1-col1line1-col2
line2-col1line2-col2
line3-col1line3-col2
line1-col1line1-col2
line2-col1line2-col2
line3-col1line3-col2

As you can see, lineage is hard to determine across the files.

Transformation:

Gather the filepath and source row number information into two new columns:

D trans
RawWrangletrue
p03Value'filepath'
Typestep
WrangleTextderive type: single value: $filepath as: 'filepath'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value$filepath
p03NameNew column name
SearchTermNew formula

D trans
RawWrangletrue
p03Value'source_row_number'
Typestep
WrangleTextderive type: single value: $sourcerownumber as: 'source_row_number'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value$sourcerownumber
p03NameNew column name
SearchTermNew formula

Create a new column called start_of_file_offset which contains the offset value of the row from the first row in the file. In the first statement, mark the value of $sourcerownumber for the first row of the file, leaving the other rows for the file empty:

D trans
RawWrangletrue
p03Value$sourcerownumber
Typestep
WrangleTextderive type: multiple value: IF(PREV($filepath, 1) == $filepath, NULL(), $sourcerownumber) order: $sourcerownumber as: 'start_of_file_offset'
p01NameFormula type
p01ValueMultiple row formula
p02NameFormula
p02ValueIF(PREV($filepath, 1) == $filepath, NULL(), $sourcerownumber)
p03NameSort rows by
p04Value'start_of_file_offset'
p04NameNew column name
SearchTermNew formula

Create a new column that contains the values from the previous column, with the empty rows filled in with the last previous value, which is the $sourcerownumber for the first row of the current file:

D trans
RawWrangletrue
p03Value$sourcerownumber
Typestep
WrangleTextderive type: multiple value: FILL(start_of_file_offset, -1, 0) order: $sourcerownumber as: 'filled_start_file_offset'
p01NameFormula type
p01ValueMultiple row formula
p02NameFormula
p02ValueFILL(start_of_file_offset, -1, 0)
p03NameSort rows by
p04Value'filled_start_file_offset'
p04NameNew column name
SearchTermNew formula

Create a new column computing the file-based source row number as the difference between the raw source row number and the start of file offset value computed in the previous step. This generated value is the source row number for a row within its own file:

D trans
RawWrangletrue
p03Value'source_row_number_per_file'
Typestep
WrangleTextderive type: single value: ($sourcerownumber - filled_start_file_offset) + 1 as: 'source_row_number_per_file'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value($sourcerownumber - filled_start_file_offset) + 1
p03NameNew column name
SearchTermNew formula

Delete the columns used for the intermediate calculations:

D trans
RawWrangletrue
Typestep
WrangleTextdrop col: filled_start_file_offset,start_of_file_offset action: Drop
p01NameColumns
p01Valuefilled_start_file_offset,start_of_file_offset
p02NameAction
p02ValueDelete selected columns
SearchTermDelete columns

Results:

column2column3filepathsource_row_numbersource_row_number_per_file
line1-col1line1-col2/myPath/file001.txt11
line2-col1line2-col2/myPath/file001.txt22
line3-col1line3-col2/myPath/file001.txt33
line1-col1line1-col2/myPath/file002.txt41
line2-col1line2-col2/myPath/file002.txt52
line3-col1line3-col2/myPath/file002.txt63
line1-col1line1-col2/myPath/file003.txt71
line2-col1line2-col2/myPath/file003.txt82
line3-col1line3-col2/myPath/file003.txt93

$sourcerownumber

The $sourcerownumber variable is a reference to the row number in which the current row originally appeared in the source of the data.

Tip

Tip: If the source row information is still available, you can hover over the left side of a row in the data grid to see the source row number in the original source data.


Limitations:

  • The following transforms might make original row information invalid or otherwise unavailable. In these cases, the reference returns null values:
    • pivot
    • flatten
    • join
    • lookup
    • union
    • unnest
    • unpivot

  • This reference does not apply to relational database sources.

  • For files converted on import in the backend datastore, such as Microsoft Excel, this reference returns the source row value for the converted file on the backend infrastructure. If the conversion results in multiple files, the row numbers are continued across files.

Example:

The following example generates a new column containing the source row number for each row in the dataset, if available: 

D trans
RawWrangletrue
p03Value'src_rownumber'
Typestep
WrangleTextderive type: single value: $sourcerownumber as: 'src_rownumber'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02Value$sourcerownumber
p03NameNew column name
SearchTermNew formula

If you have already used the $filepath reference, as in the previous example, you can combine these two columns to create a unique key to the source of each row:

D trans
RawWrangletrue
p03Value'src_key'
Typestep
WrangleTextderive type: single value: MERGE([src_filename,src_rownumber],'-') as: 'src_key'
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueMERGE([src_filename,src_rownumber],'-')
p03NameNew column name
SearchTermNew formula

$col

The $col variable is a reference to the column that is currently being evaluated. This variable references the state of the current dataset, instead of the original source.

Info

NOTE: This reference works only for the edit with formula transformation (set transform).

In the following example, all columns in the dataset that are of String type are converted to uppercase:

D trans
RawWrangletrue
Typestep
WrangleTextset col: * value: IF(ISMISMATCHED($col, ['String']), $col, UPPER($col))
p01NameColumns
p01Value*
p02NameFormula
p02ValueIF(ISMISMATCHED($col, ['String']), $col, UPPER($col))
SearchTermEdit column with formula

In the above, the wildcard applies the edit to each column. Each column is tested to see if it is mismatched with the String data type. If mismatched, the value in the column ($col) is written. Otherwise, the value in the column is converted to uppercase (UPPER($col)).

Tip

Tip: $col is useful for multi-column transformations.