Metadata is data about your data. For example, you might decide that one or more of the following types of information about your dataset should be tracked:
This section describes methods for inserting metadata into your dataset.
The following example describes how to insert a single column of metadata. In this case, the full path to the source is inserted as a new column in the dataset.
In the Dataset Details page, select the entire value for the Location, which is the storage location of the source.
Tip: If the full path of the dataset is too long for screen display, be sure to include the ellipsis (...) at the end of the Location value.
Copy the value. Paste the value into a text editor. You should see the full path, like the following:
<root_dir>/uploads/1/2580298d-3477-4907-bfa7-f71978eace04/SF Restaurants - businesses.csv
Add a step similar to the following to the recipe, replacing the path below with the value you want to insert:
derive value:'<root_dir>/uploads/1/2580298d-3477-4907-bfa7-f71978eace04/SF Restaurants - businesses.csv' as:'source_path'
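For readers who want to see the effect of this step outside the application, the following is a minimal Python sketch of the same idea: every row receives one new column holding the same constant value. The sample rows, the SAMPLE and SOURCE_PATH names, and the add_constant_column helper are hypothetical and exist only for illustration. The <root_dir> placeholder is left unresolved, as in the path above.

```python
import csv
import io

# Hypothetical sample rows standing in for the imported dataset.
SAMPLE = "business_id,name\n1,Cafe A\n2,Cafe B\n"

# Path copied from the Location value; <root_dir> is an unresolved placeholder.
SOURCE_PATH = (
    "<root_dir>/uploads/1/2580298d-3477-4907-bfa7-f71978eace04/"
    "SF Restaurants - businesses.csv"
)


def add_constant_column(rows, name, value):
    """Return copies of the rows with the same value stored under `name`."""
    return [{**row, name: value} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
tagged = add_constant_column(rows, "source_path", SOURCE_PATH)
```

The original rows are left unchanged, so the step can be repeated for additional metadata columns.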
You might need to track more fields of dataset information. While you might be able to perform these kinds of individual inserts, it might be easier to build this information from a separate file.
NOTE: This method uses the
Tip: You can perform a similar merging of datasets using the Join tool. See Join Page.
For example, suppose you want to track the following fields as metadata: source_system, source_author, source_date_create, and source_path.
You could create a CSV file that looks like the following:
source_system,source_author,source_date_create,source_path
Excel,Joe Guy,12/9/15,<root_dir>/uploads/1/2580298d-3477-4907-bfa7-f71978eace04/SF Restaurants - businesses.csv
In this case, the column headers are in the first line, and the values for each column are in the second line.
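As a quick check of that layout, the following Python sketch reads a two-line metadata file into a single record keyed by column header. It uses only the standard csv module. The read_metadata helper and the METADATA_CSV name are hypothetical; the values are the ones from the example file.

```python
import csv
import io

# Header line followed by exactly one line of values, as described above.
METADATA_CSV = (
    "source_system,source_author,source_date_create,source_path\n"
    "Excel,Joe Guy,12/9/15,<root_dir>/uploads/1/"
    "2580298d-3477-4907-bfa7-f71978eace04/SF Restaurants - businesses.csv\n"
)


def read_metadata(text):
    """Parse a header line plus one value line into a single dict."""
    records = list(csv.DictReader(io.StringIO(text)))
    if len(records) != 1:
        raise ValueError("expected exactly one row of metadata values")
    return records[0]


metadata = read_metadata(METADATA_CSV)
```

Rejecting files with more or fewer than one value row catches a common mistake, such as a stray blank header or a pasted second record, before the metadata is merged.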
Add the header transform to your recipe.
In the recipe panel of the Transformer page, add a new step. In the Transformation textbox, enter
Sort your data by a key value (e.g.
Determine an appropriate grouping parameter. This step is necessary to simplify the filling process when the job runs at scale. Ideally, you should choose a grouping column that contains relatively few distinct values (e.g.
Fill values in the data rows with metadata column values. For each metadata column, add the following command, done here for the source_system column of metadata.
window value: FILL(source_system, 1) order: business_id group:region
Repeat the above step for each metadata column you want to insert.
Rename the generated window columns to use a more appropriate name.
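The sort, group, and fill steps above can be sketched in Python as a grouped fill-down. This is an illustration only, not the product's implementation of FILL: it assumes that filling means replacing each empty value with the most recent non-empty value in the same group, in sort order. The fill_down helper and the sample rows, including the region names and the "Access" value, are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def fill_down(rows, column, order_key, group_key):
    """Fill empty `column` values with the last non-empty value in each group.

    Rows are sorted by group and then by `order_key`. A group whose first
    rows are empty keeps them empty until a value is seen.
    """
    ordered = sorted(rows, key=itemgetter(group_key, order_key))
    filled = []
    for _, group in groupby(ordered, key=itemgetter(group_key)):
        last_seen = ""
        for row in group:
            value = row.get(column) or last_seen
            last_seen = value
            filled.append({**row, column: value})
    return filled


# Hypothetical rows, deliberately out of order.
rows = [
    {"business_id": 3, "region": "north", "source_system": ""},
    {"business_id": 5, "region": "south", "source_system": ""},
    {"business_id": 1, "region": "north", "source_system": "Excel"},
    {"business_id": 4, "region": "south", "source_system": "Access"},
    {"business_id": 2, "region": "north", "source_system": ""},
]

filled = fill_down(rows, "source_system", "business_id", "region")
```

Because each group is processed independently, a grouping column with few distinct values keeps the number of independent fill passes small, which is the reason given above for choosing such a column.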