
Problem Setup for Modeling

During the Problem Setup stage, upload your data, view and edit your dataset, choose your target variable, and select your machine learning problem type.

To begin, Upload Data from your computer or select the dropdown to import a file from Alteryx Analytics Cloud. Your data must be a CSV file.

Important

We treat rows that start with a number sign (#) as comments. If the first row in your CSV file is a comment, then the second row becomes the column header.

We also remove empty rows after the header row in your CSV file. If rows contain some empty cells, we convert these empty cells to null values instead. Note that it's good practice to upload files without empty rows.
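To check a file before you upload it, you can approximate these rules locally. The sketch below is an illustration only, not the product's parser: the function name `load_csv_rows` and the sample data are hypothetical, and it assumes that no quoted field spans multiple lines. Unlike the rule described above, it also drops empty rows that appear before the header.

```python
import csv
import io


def load_csv_rows(text):
    """Approximate the documented upload rules on CSV text.

    Returns (header, rows). Empty cells are returned as None.
    """
    # Rows that start with a number sign (#) are treated as comments.
    lines = [line for line in text.splitlines() if not line.startswith("#")]
    reader = csv.reader(io.StringIO("\n".join(lines)))

    header = None
    rows = []
    for record in reader:
        # Rows with no values at all are dropped.
        if not record or all(cell.strip() == "" for cell in record):
            continue
        # The first remaining row becomes the column header.
        if header is None:
            header = record
            continue
        # Empty cells in a partly filled row become null values.
        rows.append([cell if cell.strip() != "" else None for cell in record])
    return header, rows


# Hypothetical sample: a comment row, a header, one blank row,
# one row with an empty cell, and one row of only empty cells.
sample = "# sample export\nid,city,sales\n1,Austin,10\n\n2,,7\n,,\n"
header, rows = load_csv_rows(sample)
print(header)
print(rows)
```

In this sample, the comment row is skipped, so the second row becomes the header, and the two empty rows are dropped.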

During the model setup process, Machine Learning displays warnings or errors that occur while running automatic data checks. Keep in mind that some errors won’t allow you to continue until you resolve them.

To learn more about the data checks that Machine Learning runs, select See Details:

  • On the Jobs tab, you can see what jobs have run, their statuses, how long the jobs took to run, and their progress.

  • On the Data Checks tab, you can see what data checks have run, their statuses, any details, and recommended actions.

Select the X icon or Hide Details to exit the data-checks window.

Upload Data

To change your dataset, select Change Data to upload a file from your computer. You can also select the dropdown to import a file from Alteryx Analytics Cloud.

To change the data type of your columns, use the dropdown next to each column. You can also drop a column or set it as an ID column with the 3-dot menu.

ML Integration with Plans

To retrain your model with an updated dataset, you can use Plans to add new data from a workflow to an existing Machine Learning project. Learn more about ML integration with Plans.

Data Profiling

Use Data Profiling to view a graphical summary of the data in each column. To enable Data Profiling, select the vertical bar graph icon in the upper-right corner of the data table.

Outliers

Use Outliers to find data points that fall outside the expected distribution of your data. To open Outliers, select the scatterplot icon in the upper-right corner of the data table. The box plots give you a high-level understanding of the distribution of data for each feature. Select a box plot to see a detailed distribution. We've highlighted outliers in red. Refer to the Outliers info page to learn more about how outliers affect machine learning models.

If a feature contains rows that are outliers, you can remove those rows from the dataset. Check the box associated with the rows you want to delete, then select Delete Outliers.
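This page doesn't state which rule Machine Learning uses to flag outliers. As a rough illustration of the idea behind box plots, the sketch below applies the common 1.5 × IQR convention, which is an assumption here. The function name `iqr_outlier_indices` and the sample values are hypothetical.

```python
import statistics


def iqr_outlier_indices(values, k=1.5):
    """Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR].

    The 1.5 * IQR fence is a common box-plot convention; it is assumed
    here and may differ from the rule the product applies.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical feature column with one extreme value.
values = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outlier_indices(values)

# Deleting outliers removes the whole row at each flagged position.
cleaned = [v for i, v in enumerate(values) if i not in flagged]
print(flagged)
print(cleaned)
```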

Warning

If you delete outliers, you won't be able to select Time Series for your Problem Type. Deleting rows introduces gaps in your time series data. These gaps make it difficult for the model to identify trends over time.
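To see why deleted rows matter for time series data, consider a daily series with one row removed. The sketch below is a minimal illustration with hypothetical dates and a hypothetical helper, `find_gaps`. It assumes a regular daily interval.

```python
from datetime import date, timedelta


def find_gaps(dates, step=timedelta(days=1)):
    """Return (previous, next) pairs where consecutive dates are more
    than one step apart."""
    ordered = sorted(dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > step]


# Five consecutive days, then the same series with one row deleted.
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(5)]
after_delete = [d for d in days if d != date(2024, 1, 3)]

print(find_gaps(days))
print(find_gaps(after_delete))
```

The complete series has no gaps. After the deletion, the series jumps from January 2 to January 4, so the time index is no longer evenly spaced.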

Data Insights

Use Data Insights to find relationships between your features. To open Data Insights, select the horizontal bar graph and magnifying glass icon in the upper-right corner of the data table. The Correlation Matrix and Chord Diagram give you two ways to see the strength of correlations between features in your dataset. To fine-tune your view, adjust the Correlation Threshold to filter out weak correlations in the Chord Diagram. Refer to the Data Insights info page to learn more about how to interpret the Correlation Matrix and Chord Diagram.
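The effect of a correlation threshold can be sketched outside the product. This example assumes Pearson correlation, which this page doesn't confirm, and keeps only feature pairs whose absolute correlation meets the threshold. The function names and the sample columns are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # A constant column has no defined correlation.
    return cov / (spread_x * spread_y)


def strong_pairs(columns, threshold):
    """Keep feature pairs whose absolute correlation is >= threshold."""
    kept = {}
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            kept[(a, b)] = round(r, 3)
    return kept


# Hypothetical numeric features.
columns = {
    "price": [1, 2, 3, 4, 5],
    "tax": [2, 4, 6, 8, 10],
    "rating": [3, 1, 4, 1, 5],
}
result = strong_pairs(columns, threshold=0.8)
print(result)
```

A higher threshold leaves fewer pairs. In this sample, only the price and tax pair remains at a threshold of 0.8.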

Manage Columns

Use Manage Columns to easily select the columns you want to include in the modeling process. To open Manage Columns, select the pen-and-paper icon in the upper-right corner of the data table. If you have many columns, use Search to find a specific column by name.

To change the data type of a column, use the dropdown under Type. After you make changes to your dataset, select Apply Changes.

If you select a data type that isn’t valid for a column, you must Revert to the previous data type or Cancel all changes you’ve made.

Choose a Target Column

Choose the Target Column you want to predict. Machine Learning uses the rest of your data to predict values for this target column.

Select Problem Type

Machine Learning automatically selects a Machine Learning Method for you based on your target column and data. You can still select another method if you choose. Refer to these methods to make your selection:

Classification

Classification is a technique aimed at predicting a categorical value.

Use when you want to identify a grouping or label, such as:

  • The type of fruit you have in a grocery store.

  • Credit card fraud detection.

  • Likelihood of a loan defaulting.

Regression

Regression is a technique aimed at predicting a continuous quantity by investigating the relationships between various columns. The model then uses these relationships to predict values for the target column.

Use when your target is a continuous numerical value such as:

  • The number of birds you might observe at a nature preserve.

  • The cause-and-effect relationship between variables.

Time Series

Time Series is a technique that forecasts future numeric values based on the history of target values. It does so by analyzing a sequence of data points collected over an interval of time.

Use when you want to predict future values, such as:

  • Quarterly electricity demand for a particular city.

  • Demand forecasts for procurement or pricing.

Time Series Parameters

If you selected Time Series, select Set Up Time Series in the upper-left corner to set the initial parameters for your model. You must do this before you continue to the Data Insights stage.

  • DateTime Column: This column indicates your time index. The column must be a DateTime data type.

  • Forecast Horizon: Select how far into the future your model should predict.

Rename Your Project

At any time during the modeling stages, you can rename your project. To rename your project, select the project name at the top of the screen and then enter a new name. Press Enter to confirm or click outside the name field to cancel. Some project name restrictions apply:

  • Can't exceed 100 characters.

  • Must be unique.

  • Must contain at least 1 character.
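If you generate project names in a script, you can check them against these restrictions first. The sketch below is hypothetical: the function name `project_name_errors` is invented for illustration, and it assumes that uniqueness means an exact, case-sensitive match, which this page doesn't specify.

```python
def project_name_errors(name, existing_names):
    """Return a list of the documented restrictions that a name breaks."""
    errors = []
    if len(name) < 1:
        errors.append("Must contain at least 1 character.")
    if len(name) > 100:
        errors.append("Can't exceed 100 characters.")
    # Assumption: names are compared exactly, including letter case.
    if name in existing_names:
        errors.append("Must be unique.")
    return errors


# Hypothetical existing projects.
existing = {"Churn Model", "Sales Forecast"}
print(project_name_errors("Churn Model", existing))
print(project_name_errors("Churn Model v2", existing))
```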