
Data defines the model by dint of genetic programming, producing the best decile table.


Handling Qualitative Attributes: Upgrading Discrete Heritable Information
Bruce Ratner, Ph.D.

The classic approach to including a qualitative attribute, namely, a nominal-level categorical variable, in the modeling process involves dummy variable coding. A categorical variable with k classes of qualitative (nonnumerical) information is replaced by a set of k-1 quantitative dummy variables. The dummy variables are defined by the presence or absence of the class values. The class left out is called the reference class, to which the other classes are compared when interpreting the effects of the dummy variables on the response. The classic approach instructs that the complete set of k-1 dummy variables be included in the model regardless of the number of dummy variables that are declared nonsignificant. This approach is problematic when the number of classes is large, which is typically the case in big data applications. By chance alone, as the number of classes increases, the probability of one or more dummy variables being declared nonsignificant increases. Putting all the dummy variables in the model effectively adds "noise" or unreliability to the model, as nonsignificant variables are known to be "noisy." Intuitively, a large set of inseparable dummy variables poses a difficulty in model building, in that they quickly "fill up" the model, leaving no room for other variables. The purpose of this article is to present a new method that upgrades the complete set of nominal-level dummy variables into a smaller set of smooth (reliable) interval-level quantitative variables, which retains a large percentage of the original information. Thus, the new variables not only offer greater reliability in the database model, but also make room for other variables.
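To make the classic coding concrete, here is a minimal Python sketch of replacing a k-class categorical variable with k-1 dummy variables and a reference class. It illustrates only the classic approach described above, not the new upgrading method. The function names, the `is_` prefix, and the region data are hypothetical. The second function rests on a simplifying assumption that is not in the article: each dummy variable is declared nonsignificant independently, with the same probability q.

```python
def dummy_code(values, reference=None):
    """Replace a categorical variable with k classes by k-1 dummy variables.

    Returns the k-1 retained class names and one dict of 0/1 indicators per
    observation. The reference class gets no indicator of its own; it is the
    row in which every dummy equals 0.
    """
    classes = sorted(set(values))
    if reference is None:
        reference = classes[0]  # assumption: default to the first class alphabetically
    if reference not in classes:
        raise ValueError(f"reference class {reference!r} not found in the data")
    kept = [c for c in classes if c != reference]
    rows = [{f"is_{c}": int(v == c) for c in kept} for v in values]
    return kept, rows


def prob_any_nonsignificant(q, n_dummies):
    """P(at least one of n_dummies is declared nonsignificant).

    Illustrative assumption only: each dummy is declared nonsignificant
    independently with the same probability q.
    """
    return 1.0 - (1.0 - q) ** n_dummies


# Hypothetical data: a nominal variable with k = 4 classes.
regions = ["east", "west", "north", "south", "west", "east"]
kept, rows = dummy_code(regions, reference="east")

print(kept)  # the k-1 = 3 classes that receive a dummy variable
for region, row in zip(regions, rows):
    print(region, row)

# Under the independence assumption, the chance of at least one
# nonsignificant dummy grows with the number of classes k.
for k in (4, 10, 50):
    print(k, round(prob_any_nonsignificant(0.2, k - 1), 3))
```

An observation in the reference class ("east" here) has all three dummies equal to 0, which is why the effects of the other classes are read relative to it.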
Post Script
Recently, I got a question, via the statchat button on my website, about modeling an 8-level income dependent variable. The inquirer wanted my comments about using the multinomial logistic regression model (MLRM). My general remark was that, in theory, one could use MLRM, but in practice the data will never yield a usable model; a null and void model will in all likelihood result. My experience is that one can build an MLRM for, at best, a 4-level dependent variable with proportions across the four levels that are not greatly varied.
My advice was to upgrade the 8-level income dependent variable into an interval-scale variable using a technique as described in the article abstract above. Then, build the Income Model using OLS regression. I think this method should be in every statistical modeler's toolkit. If you have any questions, feel free to call or email me.
Thanks for your attention this far.
Statistically yours,

For more information about this article, call Bruce Ratner at 516.791.3544, or email me at br@dmstat1.com.


