Oversample Column Tool

Data used to develop a binary classification predictive model often has a target variable with a much higher proportion of negative (no) responses than positive (yes) responses. For example, in untargeted direct mail campaigns, it is not uncommon to find that 2% of potential prospects respond favorably to an appeal, while 98% do not. In this case, predictive models have a difficult time distinguishing the signal from the noise, since classifying all potential prospects in the "no" category will nearly always be correct.

To avoid this problem, it is common to create a new sample for analysis that has an elevated percentage of positive responses (often a 50-50 split of positive and negative responses is used). This is typically accomplished by including all of the positive responses and taking a random sample of the negative responses. The size of the sample of negative responses is determined by the percentage of positive responses desired in the new sample. This is the approach used in this tool.
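The sampling approach described above can be sketched in a few lines of Python. This is an illustration of the general technique only, not the tool's actual implementation: the function name `oversample`, its parameters, and the `seed` argument are hypothetical. It keeps P rows with the chosen value and samples P × (100 − p) / p of the remaining rows, where p is the desired percentage.

```python
import random

def oversample(rows, column, value, desired_pct, seed=0):
    """Keep every row where row[column] == value and a random sample of the rest.

    Hypothetical helper for illustration; not the tool's own code.
    """
    positives = [r for r in rows if r[column] == value]
    negatives = [r for r in rows if r[column] != value]
    # Number of other rows needed so that the kept rows with `value`
    # make up desired_pct percent of the new sample.
    n_neg = round(len(positives) * (100 - desired_pct) / desired_pct)
    # Cannot sample more rows than are available.
    n_neg = min(n_neg, len(negatives))
    rng = random.Random(seed)
    return positives + rng.sample(negatives, n_neg)

# Example: 2% "yes" responses in 1,000 rows, resampled to a 50-50 split.
rows = [{"response": "yes"}] * 20 + [{"response": "no"}] * 980
sample = oversample(rows, "response", "yes", 50)
print(len(sample))
```

With 20 positive rows and a desired percentage of 50, the sketch keeps all 20 positives and draws 20 of the 980 negatives.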

Configure the Tool

  1. Column to Oversample: The column that contains the value to be oversampled, typically the target variable column in a binary classification predictive model.

  2. Column Value to Oversample: The level that is to be oversampled, typically the positive ("yes") response in a binary classification predictive model.

  3. Desired Percentage of Rows with Value: An integer value between 1 and 100. This value should not be less than the percentage of rows in the original data that have the selected value. For example, if 30% of the rows in the original data have the selected value, this parameter should not be set below 30.
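As a rough way to check the third setting before running the tool, the original share of the selected value can be computed and compared with the desired percentage. The helper below is a hypothetical sketch (the name `check_desired_pct` and its arguments are assumptions), and it does not describe how the tool itself responds to a value that is too low.

```python
def check_desired_pct(rows, column, value, desired_pct):
    """Return the original share of `value` in percent if desired_pct is usable.

    Hypothetical helper for illustration; not part of the tool.
    """
    if not isinstance(desired_pct, int) or not 1 <= desired_pct <= 100:
        raise ValueError("desired_pct must be an integer between 1 and 100")
    matches = sum(1 for r in rows if r[column] == value)
    original_pct = 100 * matches / len(rows)
    # Dropping rows without the value can only raise its share, never lower it.
    if desired_pct < original_pct:
        raise ValueError(
            f"desired_pct ({desired_pct}) is below the original share "
            f"({original_pct:.1f}%)"
        )
    return original_pct

# Example matching the text: 30% of the original rows have the selected value.
rows = [{"response": "yes"}] * 30 + [{"response": "no"}] * 70
print(check_desired_pct(rows, "response", "yes", 50))
```

In this example, a desired percentage of 30 or more passes the check, while a value such as 25 raises a `ValueError`.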