


Statistics versus Machine Learning: A Significant Difference for Database Response Modeling
Bruce Ratner, Ph.D.

The regnant statistical paradigm for database response modeling is: the data analyst fits the data to the presumed true logistic regression model (LRM), whose form (equation) is that the log of the odds of response equals the sum of weighted predictor variables. The predictor variables are determined by a mixture of well-established variable selection methods and the will of the data analyst to re-express the original variables and construct new variables (data mining). The weights, better known as the regression coefficients, are determined by the preprogrammed machine-crunching method of calculus. The purpose of this article is to show a significant difference for database response modeling when implementing the antithetical machine learning paradigm: the data suggest the “true” model form, as the machine learning process acquires knowledge of the form without being explicitly programmed.
I use the machine learning GenIQ Model© and the LRM to build a database response model, which predicts the rank-order likelihood of response, to illustrate the advantages and the singular weakness of the machine learning paradigm. Specifically, the GenIQ Model shows the superiority of the machine learning paradigm over the statistical paradigm, as it not only specifies the true model form (a computer program), but also simultaneously performs variable selection and data mining. The singular weakness is interpretability: the difficulty in interpreting the computer program often accounts for the limited use of the machine learning paradigm.
Outline of Article
I. Situation
When my daughter Amanda was in grade school, she could not understand the decision-making process of her principal, Dr. Katz. On some rainy days, Dr. Katz would permit the class to go outside for recess to play. On other days, when it was sunny, Dr. Katz would say, “no play.” As a statistician’s daughter, Amanda collected some weather information and asked me to build a model to predict what Dr. Katz would do in the days to come. Amanda created a “Let’s Play” database, in Table 1 (also in Quinlan’s C4.5, page 18!), which included the weather conditions for two weeks:
- Outlook (sunny, rainy, overcast)
- Temperature
- Humidity
- Windy (yes, no), and of course
- Play (yes, no).
I built the easy-to-interpret LRM and the not-so-easy-to-interpret GenIQ Model for the target variable Play (yes). This creates a counterpoint in which the data analyst can now choose between a good interpretable model and a potentially better unexplainable model.
II. LRM Output
The LRM output (Analysis of Maximum Likelihood Estimates) and arguably the best Play-LRM equation are below.
(Log of odds of) Play (yes) = 11.7403 - 2.2682*Outlook(sunny) - 0.1124*Humidity - 2.0470*Windy(yes)
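As an illustration of how such an equation is scored, the following minimal Python sketch evaluates the Play-LRM for one day and converts the log of the odds to a probability. It assumes that the three slope coefficients enter with minus signs and that Outlook(sunny) and Windy(yes) are 0/1 indicators; the function names and the example day are hypothetical and are not taken from Table 1.

```python
import math

# Coefficients copied from the Play-LRM equation.
# ASSUMPTION: the three slope terms are negative.
INTERCEPT = 11.7403
B_OUTLOOK_SUNNY = -2.2682
B_HUMIDITY = -0.1124
B_WINDY_YES = -2.0470


def play_lrm_logit(outlook, humidity, windy):
    """Log of the odds of Play (yes) for one day."""
    sunny = 1 if outlook == "sunny" else 0      # indicator for Outlook(sunny)
    windy_yes = 1 if windy == "yes" else 0      # indicator for Windy(yes)
    return (INTERCEPT
            + B_OUTLOOK_SUNNY * sunny
            + B_HUMIDITY * humidity
            + B_WINDY_YES * windy_yes)


def play_lrm_probability(outlook, humidity, windy):
    """Convert the log of the odds to a probability with the logistic function."""
    logit = play_lrm_logit(outlook, humidity, windy)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical day (not a row of Table 1): sunny, humidity 80, windy.
example_logit = play_lrm_logit("sunny", 80, "yes")
example_probability = play_lrm_probability("sunny", 80, "yes")
```

Because the logistic function is monotone, ranking days by the log of the odds or by the probability gives the same rank-order.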
III. Play-LRM Results
The results of the Play-LRM are in Table 2. The rank-order prediction of Play is not perfect for days 6, 1, 12 and 11.
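One way to make “perfect rank-order prediction” concrete is sketched below in Python. It assumes the usual reading of the phrase: when days are sorted by model score, every Play (yes) day sits above every Play (no) day. The function names and the five example days are hypothetical and are not the values in Table 2.

```python
def is_perfect_rank_order(scores, responses):
    """True when every 'yes' day scores strictly higher than every 'no' day."""
    yes_scores = [s for s, r in zip(scores, responses) if r == "yes"]
    no_scores = [s for s, r in zip(scores, responses) if r == "no"]
    if not yes_scores or not no_scores:
        return True  # nothing to mis-rank when only one class is present
    return min(yes_scores) > max(no_scores)


def misranked_days(days, scores, responses):
    """Days that take part in at least one wrongly ordered yes/no pair."""
    rows = list(zip(days, scores, responses))
    yes = [(d, s) for d, s, r in rows if r == "yes"]
    no = [(d, s) for d, s, r in rows if r == "no"]
    if not yes or not no:
        return []
    lowest_yes = min(s for _, s in yes)
    highest_no = max(s for _, s in no)
    flagged = [d for d, s in yes if s <= highest_no]   # 'yes' days ranked too low
    flagged += [d for d, s in no if s >= lowest_yes]   # 'no' days ranked too high
    return flagged


# Hypothetical scores for five days (illustrative only).
days = ["A", "B", "C", "D", "E"]
scores = [0.9, 0.8, 0.6, 0.5, 0.2]
responses = ["yes", "yes", "no", "yes", "no"]

perfect = is_perfect_rank_order(scores, responses)
flagged = misranked_days(days, scores, responses)
```

In this made-up example, day C (no) outranks day D (yes), so the rank-order is not perfect and both days are flagged.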
IV. GenIQ Model Output
The Play-GenIQ Model tree display and its form (a computer program) are below.
The GenIQ Model (Tree Display)
The GenIQ Model (Code)
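/* Indicator terms built from Outlook and Windy */
If outlook = "overcast" Then x1 = 1; Else x1 = 0;
If windy = "no" Then x2 = 1; Else x2 = 0;
If outlook = "rainy" Then x3 = 1; Else x3 = 0;
x2 = x2 * x3;
x1 = x1 + x2;
/* Humidity plus the rainy indicator */
If outlook = "rainy" Then x2 = 1; Else x2 = 0;
x3 = humidity;
x2 = x2 + x3;
/* Temperature / Humidity, set to 1 when the divisor is 0 */
x3 = humidity;
x4 = temperature;
If x3 NE 0 Then x3 = x4 / x3; Else x3 = 1;
/* Ratio of the two terms above, set to 1 when the divisor is 0 */
If x2 NE 0 Then x2 = x3 / x2; Else x2 = 1;
/* Final GenIQ score */
x1 = x1 + x2;
GenIQvar = x1;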
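For readers who want to run the program, here is a minimal Python sketch of the same calculation. It is a hand translation made for illustration, not GenIQ output; the function name geniq_play_score and the example inputs are hypothetical. It assumes outlook takes the values "sunny", "rainy" or "overcast" and windy takes "yes" or "no".

```python
def geniq_play_score(outlook, temperature, humidity, windy):
    """Hand translation of the GenIQ program for Play; returns GenIQvar."""
    overcast = 1.0 if outlook == "overcast" else 0.0
    rainy = 1.0 if outlook == "rainy" else 0.0
    windy_no = 1.0 if windy == "no" else 0.0

    # x1 = Outlook(overcast) + Windy(no) * Outlook(rainy)
    x1 = overcast + windy_no * rainy

    # x2 = Outlook(rainy) + Humidity
    x2 = rainy + humidity

    # x3 = Temperature / Humidity, set to 1 when Humidity is 0
    x3 = temperature / humidity if humidity != 0 else 1.0

    # x2 = x3 / x2, set to 1 when the divisor is 0
    x2 = x3 / x2 if x2 != 0 else 1.0

    # GenIQvar is the sum of the two branches
    return x1 + x2


# Hypothetical inputs (illustrative only, not rows of Table 1).
score_overcast = geniq_play_score("overcast", 80, 80, "no")
score_sunny = geniq_play_score("sunny", 80, 80, "yes")
```

The score is used only to rank days: a higher GenIQvar means a higher rank-order likelihood of Play (yes). It is not a probability.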
V. GenIQ Variable Selection
GenIQ variable selection provides a rank-ordering of variable importance for a predictor variable with respect to the other predictor variables considered jointly. This is in stark contrast to the well-known, always-used statistical correlation coefficient, which provides only a simple correlation between a predictor variable and the target variable, independent of the other predictor variables under consideration.
Variable Importance (with respect to other variables considered jointly)
- Outlook (overcast)
- Outlook (rainy)
- Windy (no)
- Humidity
- Outlook (sunny)
- Windy (yes)
- Temperature
VI. GenIQ Data Mining
GenIQ data mining is directly apparent from the GenIQ tree itself: each branch is a newly constructed variable, which has the power to improve the rank-order predictions.
- Var1 = Temperature / Humidity
- Var2 = Humidity + Outlook (rainy)
- Var3 = Var1 / Var2
- Var4 = Outlook (rainy) * Windy (no)
- Var5 = Var4 + Outlook (overcast)
- GenIQ Model = Var3 + Var5
VII. Play-GenIQ Model Results
The results of the Play-GenIQ Model are in Table 3. There is a perfect rank-order prediction of Play.
VIII. Summary
The machine learning paradigm (MLP), “let the data suggest the model,” is a practical alternative to the statistical paradigm, “fit the data to the LRM equation,” which has its roots in a time when data were only “small.” It was, and still is, reasonable to fit small data to a rigid, parametric, assumption-filled model. However, the current information (big data) in, say, cyberspace requires a paradigm shift. MLP is a utile approach for database response modeling when dealing with big data, as big data can be difficult to fit to a specified model. MLP can also function alongside the regnant statistical approach whenever the data, big or small, simply do not “fit.” As demonstrated with the “Let’s Play” data, MLP works well within small data settings.

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT1; or email at br@dmstat1.com. 

