Data defines the model by dint of genetic programming, producing the best decile table.


Handling Qualitative Attributes: Upgrading Discrete Heritable Information
Bruce Ratner, Ph.D.

The classic approach to including a qualitative attribute, that is, a nominal-level categorical variable, in the modeling process involves dummy variable coding. A categorical variable with k classes of qualitative (non-numerical) information is replaced by a set of k-1 quantitative dummy variables. Each dummy variable is defined by the presence or absence of a class value. The class left out is called the reference class, to which the other classes are compared when interpreting the effects of the dummy variables on the response. The classic approach instructs that the complete set of k-1 dummy variables be included in the model regardless of how many of them are declared non-significant. This approach is problematic when the number of classes is large, which is typically the case in big data applications. By chance alone, as the number of classes increases, the probability that one or more dummy variables are declared non-significant increases. Putting all the dummy variables in the model effectively adds "noise," or unreliability, to the model, as non-significant variables are known to be "noisy." Intuitively, a large set of inseparable dummy variables poses a difficulty in model building, in that they quickly "fill up" the model, leaving no room for other variables. The purpose of this article is to present a new method that upgrades the complete set of nominal-level dummy variables into a smaller set of smooth (reliable) interval-level quantitative variables that retains a large percentage of the original information. Thus, the new variables not only offer greater reliability in the database model but also make room for other variables.
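The classic dummy variable coding described above can be illustrated with a short sketch. This is only a minimal example of the conventional k-1 coding, not the upgrading method the article proposes. The function name dummy_code, the is_ prefix, and the sample region data are hypothetical and chosen purely for illustration.

```python
def dummy_code(values, reference=None):
    """Replace a nominal variable with k-1 presence/absence (0/1) dummy variables.

    Hypothetical helper for illustration only. The class named by
    `reference` is left out and serves as the reference class.
    """
    classes = sorted(set(values))
    if reference is None:
        # Assumption: default to the first class in sorted order.
        reference = classes[0]
    if reference not in classes:
        raise ValueError(f"reference class {reference!r} not found in the data")

    # The k-1 classes that each receive a dummy variable.
    kept = [c for c in classes if c != reference]

    # One row per observation: 1 if the class is present, 0 if absent.
    rows = [{f"is_{c}": int(v == c) for c in kept} for v in values]
    return kept, rows


# Illustrative data: a nominal variable with k = 4 classes.
region = ["east", "west", "north", "south", "west", "east"]
kept, rows = dummy_code(region, reference="east")

for value, row in zip(region, rows):
    print(value, row)
```

An observation in the reference class ("east" here) has all of its dummy variables equal to 0, which is why the effects of the remaining classes are interpreted relative to it. With k classes the model always receives k-1 columns, so a variable with many classes quickly crowds the model.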


Post Script
Recently, I got a question, via the stat-chat button on my website, about modeling an 8-level income dependent variable. The inquirer wanted my comments about using the multinomial logistic regression model (MLRM). My general remark was that, in theory, one could use an MLRM, but in practice the data will never yield a usable model; a null and void model will in all likelihood result. My experience is that one can build an MLRM for, at best, a 4-level dependent variable whose proportions across the four levels do not vary greatly.

My advice was to upgrade the 8-level income dependent variable into an interval-scale variable using the technique described in the article abstract above, and then build the Income Model using OLS regression. I think this method should be in every statistical modeler's toolkit. If you have any questions, feel free to call or email me.

Thanks for your attention this far.

Statistically yours,
Bruce Ratner

For more information about this article, call Bruce Ratner at 516.791.3544, or e-mail me at br@dmstat1.com.