Automated Data Insights

The Correlation Matrix and Chord Diagram sections give you different views of correlations between features. Use these charts to gain insights about your data.

Correlation Matrix

Each cell in the Correlation Matrix shows the closeness of the relationship between the pair of features in its row and column. Machine Learning uses Mutual Information to quantify this closeness. Mutual Information allows for the comparison of categorical (text) columns and numeric columns. It also captures non-linear relationships between numeric columns.
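
To make the idea concrete, here is a minimal sketch of Mutual Information for two columns, using only the Python standard library. It is not the product's implementation: the function names `mutual_information` and `bin_numeric`, the equal-width binning, and the choice of 4 bins are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete columns."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and the same length")
    px = Counter(xs)            # marginal counts for the first column
    py = Counter(ys)            # marginal counts for the second column
    pxy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def bin_numeric(values, bins=4):
    """Equal-width binning (an assumption) so numeric columns become discrete."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid dividing by zero for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]


# A non-linear relationship: y = x^2 has zero linear correlation with x
# over a symmetric range, but the mutual information is clearly positive.
x = list(range(-10, 11))
y = [v * v for v in x]
nonlinear_mi = mutual_information(bin_numeric(x), bin_numeric(y))

# A categorical (text) column compared with a numeric column.
size = ["small"] * 5 + ["large"] * 5
weight = [1, 2, 1, 2, 1, 9, 8, 9, 8, 9]
mixed_mi = mutual_information(size, bin_numeric(weight))
```

Because the score is computed from value counts rather than from a fitted line, the same function handles text columns and curved relationships between numeric columns.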

When you inspect the Correlation Matrix, there are 3 cases to consider: features that highly correlate with your target, features that have very low correlation with your target, and features that highly correlate with other features.

  1. Features that highly correlate with your target can indicate target leakage. Target leakage means a feature contains information that won’t be available at prediction time. Check how each such feature is produced. If it exhibits target leakage, remove it from your dataset.

  2. Features that have very low correlation with the target are unlikely to be helpful. Unless they interact with another feature in some way, they won't be predictive. Including such features adds noise to the data, which makes it harder to build an accurate and stable model. Remove these features unless you have reason to expect such an interaction.

  3. Groups of 2 or more features that highly correlate with each other can make it difficult to interpret your models. Consider a dataset about people that contains the features height_in_cm and height_in_inches. These features contain the same information represented in 2 different ways. A model might use one, the other, or a blend of both with no loss in accuracy. When the model blends both, their importance might not be obvious when you look at Feature Importance or Partial Dependence plots, because the contribution is split between the two features. As another example, features like height and weight aren't duplicates of each other, but they aren't independent of each other either. It’s important to be aware of these types of features when looking at their Partial Dependence plots. You can't consider the effect on the target of either one of them without the other.
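
The three cases above can be sketched as a simple triage over a matrix of pairwise scores. This is an illustrative sketch, not a product feature: the helper names `build_matrix` and `triage_features`, the thresholds `high=0.8` and `low=0.05`, and the example features `leaky_flag` and `row_id` are all hypothetical. The flagged features are candidates for review, not automatic removals.

```python
from itertools import combinations


def build_matrix(names, pair_scores):
    """Build a symmetric nested dict of scores; unlisted pairs default to 0.0."""
    scores = {a: {b: 0.0 for b in names} for a in names}
    for (a, b), s in pair_scores.items():
        scores[a][b] = s
        scores[b][a] = s
    return scores


def triage_features(scores, target, high=0.8, low=0.05):
    """Sort features into the three cases; thresholds are assumed values."""
    features = sorted(f for f in scores if f != target)
    return {
        # Case 1: very high score with the target, worth checking for leakage.
        "possible_leakage": [f for f in features if scores[f][target] >= high],
        # Case 2: almost no relationship with the target.
        "low_signal": [f for f in features if scores[f][target] <= low],
        # Case 3: pairs of non-target features that mirror each other.
        "redundant_pairs": [
            (a, b) for a, b in combinations(features, 2) if scores[a][b] >= high
        ],
    }


names = ["height_in_cm", "height_in_inches", "leaky_flag", "row_id", "target"]
scores = build_matrix(names, {
    ("height_in_cm", "height_in_inches"): 0.99,
    ("height_in_cm", "target"): 0.30,
    ("height_in_inches", "target"): 0.30,
    ("leaky_flag", "target"): 0.95,
    ("row_id", "target"): 0.01,
})
report = triage_features(scores, "target")
```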

2-Variable Plots

Select a cell in the Correlation Matrix to see a plot of 2 features. Machine Learning automatically chooses the plot type based on the types of the 2 columns you choose to compare. Looking at features with high correlations can provide useful information about the data. The plot provides an intuitive sense of relationships between features.
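
This page doesn't list which plot type goes with which pair of column types, so the following is only a sketch of how such a choice could work. The mapping (scatter, box, heatmap) and the function names `is_numeric` and `choose_plot_type` are assumptions, not the product's documented behavior.

```python
def is_numeric(column):
    """Treat a column as numeric if every value is an int or float (not bool)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
    )


def choose_plot_type(col_a, col_b):
    """Pick a plot type from the two column types (assumed mapping)."""
    kinds = (is_numeric(col_a), is_numeric(col_b))
    if kinds == (True, True):
        return "scatter"   # two numeric columns
    if kinds == (False, False):
        return "heatmap"   # two categorical columns: counts per pair of values
    return "box"           # one numeric, one categorical


plot_for_numeric_pair = choose_plot_type([150, 172.5, 181], [52, 70, 85])
plot_for_mixed_pair = choose_plot_type(["a", "b", "a"], [1.0, 2.0, 3.0])
```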

Chord Diagram

Use the Chord Diagram for a different view of the same data you see in the Correlation Matrix. The Chord Diagram arranges your features around the outside of the circle and connects correlated features by shaded arcs. Hover over a feature to highlight all the relationships for that feature. Use the Correlation Threshold slider to filter out less important relationships. This helps you focus on the most important correlations.
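
The effect of the Correlation Threshold slider and of hovering over a feature can be sketched as two filters over a list of pairwise scores. The function names, the example features, the scores, and the choice to keep pairs at or above the threshold are assumptions for illustration.

```python
def chords_above_threshold(pair_scores, threshold):
    """Keep pairs whose score is at least `threshold`, strongest first."""
    kept = [(a, b, s) for (a, b), s in pair_scores.items() if s >= threshold]
    return sorted(kept, key=lambda chord: chord[2], reverse=True)


def relationships_for(feature, chords):
    """Chords touching one feature, like hovering over it in the diagram."""
    return [c for c in chords if feature in (c[0], c[1])]


# Hypothetical pairwise scores between three features.
pair_scores = {
    ("height", "weight"): 0.7,
    ("height", "age"): 0.4,
    ("weight", "age"): 0.1,
}
visible = chords_above_threshold(pair_scores, threshold=0.3)
age_chords = relationships_for("age", visible)
```

Raising the threshold shortens `visible`, which is why the slider helps you focus on the strongest relationships.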

Considerations

If features correlate, that doesn't always mean they have a causal relationship. For example, a third feature might explain their relationship, or they might relate by chance. Be mindful that correlations don't reveal which feature causes the other to occur. They only tell us that features have a relationship that you can investigate further.
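
The case where a third feature explains the relationship can be illustrated with a small simulation. This sketch uses Pearson correlation for simplicity, not the Mutual Information score described earlier, and the data, the noise level, and the function names are all made up for the example. Here `z` drives both `x` and `y`. Neither `x` nor `y` influences the other.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def residuals(a, z):
    """What remains of `a` after removing its best linear fit on `z`."""
    n = len(a)
    ma, mz = sum(a) / n, sum(z) / n
    slope = sum((u - ma) * (w - mz) for u, w in zip(a, z)) / sum(
        (w - mz) ** 2 for w in z
    )
    return [(u - ma) - slope * (w - mz) for u, w in zip(a, z)]


rng = random.Random(0)  # fixed seed so the run is repeatable
z = [rng.gauss(0, 1) for _ in range(2000)]   # the hidden third feature
x = [w + rng.gauss(0, 0.3) for w in z]       # x depends only on z
y = [w + rng.gauss(0, 0.3) for w in z]       # y depends only on z

raw = pearson(x, y)                                    # strong
adjusted = pearson(residuals(x, z), residuals(y, z))   # close to zero
```

The raw correlation between `x` and `y` is strong, yet it almost disappears once the influence of `z` is removed from both.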

Remember that a strong correlation between features isn’t the same as a meaningful one. Consider the possibility that the correlation is coincidental.