The classic approach to including a categorical variable in the modeling process involves dummy variable coding. A categorical variable with k classes of qualitative (non-numerical) information is replaced by a set of k-1 quantitative dummy variables. Each dummy variable is defined by the presence (a value of 1) or absence (a value of 0) of its class. The class left out is called the reference class, to which the other classes are compared when interpreting the effects of the dummy variables on the response. The classic approach instructs that the complete set of k-1 dummy variables be included in the model, regardless of how many of them are declared non-significant. This approach is problematic when the number of classes is large, which is typically the case in big data applications. By chance alone, as the number of classes increases, the probability of one or more dummy variables being declared non-significant increases. Putting all the dummy variables in the model effectively adds “noise” or unreliability to the model, as non-significant variables are known to be “noisy.” Intuitively, a large set of inseparable dummy variables poses a difficulty in model building in that they quickly “fill up” the model, leaving no room for other variables.
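To make the k-1 coding concrete, here is a minimal sketch in plain Python. The function name `dummy_code`, the `is_` column prefix, and the toy `regions` data are illustrative assumptions, not part of any particular modeling package.

```python
def dummy_code(values, reference):
    """Replace a categorical column with k-1 dummy (0/1) variables.

    `reference` is the class left out; it is represented by all dummies being 0.
    """
    classes = sorted(set(values))
    if reference not in classes:
        raise ValueError(f"reference class {reference!r} not found in data")
    kept = [c for c in classes if c != reference]  # the k-1 coded classes
    # One row per observation: 1 if the class is present, 0 if absent.
    rows = [{f"is_{c}": int(v == c) for c in kept} for v in values]
    return kept, rows


# Toy data (hypothetical): a 4-class variable yields 3 dummy variables.
regions = ["east", "north", "south", "west", "north", "east"]
kept, rows = dummy_code(regions, reference="east")
for region, row in zip(regions, rows):
    print(region, row)
```

With k = 4 classes this produces 3 dummy variables; a variable with hundreds of classes would contribute hundreds of columns, which is the "fill up" problem described above.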
The purpose of this article is to present the GenIQ Model©’s approach to treating a categorical variable for model inclusion, as it explicitly addresses the problems associated with a large set of dummy variables. It reduces the number of classes by merging (smoothing or averaging) the classes with comparable values of the target variable under study, which for the application of response modeling is the response rate. The smoothed categorical variable, now with fewer classes, is less likely to add noise to the model and allows more room for other variables to enter the model.
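The general idea of merging classes with comparable response rates can be sketched as follows. This is not the GenIQ Model's actual procedure, which is not specified here. It is a simple greedy stand-in that sorts classes by response rate and merges neighbors whose rates differ by no more than an assumed `tolerance`. The function name, the tolerance value, and the toy data are all hypothetical.

```python
from collections import defaultdict


def merge_similar_classes(classes, responses, tolerance=0.05):
    """Greedily merge classes whose response rates are within `tolerance`."""
    # Tally [responders, total] for each class.
    tally = defaultdict(lambda: [0, 0])
    for c, r in zip(classes, responses):
        tally[c][0] += r
        tally[c][1] += 1

    # Order classes by response rate (class name breaks ties deterministically).
    ordered = sorted(tally, key=lambda c: (tally[c][0] / tally[c][1], c))

    groups = []
    for c in ordered:
        resp, n = tally[c]
        last = groups[-1] if groups else None
        # Compare against the pooled rate of the most recent merged group.
        if last and abs(resp / n - last["responders"] / last["total"]) <= tolerance:
            last["members"].append(c)
            last["responders"] += resp
            last["total"] += n
        else:
            groups.append({"members": [c], "responders": resp, "total": n})

    mapping = {c: i for i, g in enumerate(groups) for c in g["members"]}
    return mapping, groups


# Toy data (hypothetical): class -> (responders, total observations).
toy = {"A": (5, 50), "B": (6, 50), "C": (15, 50), "D": (16, 50)}
classes, responses = [], []
for c, (resp, n) in toy.items():
    classes += [c] * n
    responses += [1] * resp + [0] * (n - resp)

mapping, groups = merge_similar_classes(classes, responses, tolerance=0.05)
for g in groups:
    print(g["members"], round(g["responders"] / g["total"], 3))
```

In this toy case the four original classes collapse into two smoothed classes, so the variable would need one dummy variable instead of three. The choice of tolerance, and whether to merge by a fixed threshold at all, is a design decision this sketch leaves open.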