
Data defines the model by dint of genetic programming, producing the best decile table.


Genetic Data Mining: The Correlation Coefficient
Bruce Ratner, Ph.D.

Assessing the relationship between a predictor variable and a target variable is an essential task in statistical linear regression model building. If the relationship is straight-line (linear), then no extra work of straightening the relationship is needed: simply test the predictor variable's statistical importance to stay in the model. If the relationship is not linear, then one of the two variables is reexpressed (although sometimes both variables are reexpressed) so that the "reexpressed" relationship is as linear as the data permit. Then, the reexpressed variable is tested for inclusion in the model. Most methods of assessing relationships among variables are based on the well-known correlation coefficient, which is often misused because its linearity assumption (i.e., that the true underlying relationship is straight-line) is not tested with a scatterplot. The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model, that is one of the better "data-straightener" methods available. I use a small dataset to make the "GenIQ data-straightener" method tractable and attractive enough for the everyday model builder to make it part of the modeler's toolkit. I present a succinct discussion of the genetic-based method, along with a basic statement of the GenIQ Model and what is "good-to-know" about the GenIQ Model output, to ease the understanding of genetic data mining for the correlation coefficient.
0. Good-to-Know about the GenIQ Model Output
The GenIQ Model© is a machine-learning alternative to the statistical ordinary least squares and logistic regression models. GenIQ lets the data define the model: it automatically data mines for new variables, performs variable selection, and then specifies the model equation so as to "optimize the decile table," that is, to fill the upper deciles with as much profit or as many responses as possible. GenIQ requires no programming, produces models that outdo statistical models, and is a different kind of model: unsuspected equation, ungainly interpretation, and easy implementation.
An optimal decile table in a "correlation" application, like the example to follow, is tantamount to obtaining the best possible ranking of the target variable scores based on the GenIQ Model scores. Note that when assessing a correlation there is no assignment of target and predictor variables. However, because this article is about assessing the relationship between variables in the context of linear regression modeling, I arbitrarily select y as the target variable. Ergo, x is the predictor variable.
The GenIQ Model output consists of two parts: the GenIQ Model tree (aka a parse tree) and the GenIQ Model "equation," which is actually computer code. Unfortunately, the model equation/code is ungraspable in virtually all situations. However, the tree serves as a tool for modelers who think visually: it assists in understanding the code, which in turn helps in interpreting the model.
Using the third (x3, y3) values, which I must rename (x, y) for the sake of GenIQ [1], from the well-known Anscombe Data [2], in Table I, above, I build a GenIQ Model with y against x. I discuss the GenIQ Model output and the GenIQ Model results in sections II and III, respectively, below.
III. GenIQ Model Results
The results of the GenIQ data-straightener lie within Table 2, below. There is a perfect prediction of the ranking of y, based on the GenIQ Model score GenIQvar, or interchangeably GenIQvar(y).
The construction of Table 2 is as follows:
1) The GenIQvar(y) values are ranked in descending order, from the first-position value of 20.4910 to the last-position value of 5.9607.
2) The y values are ranked based on their corresponding ranked GenIQvar(y) values.
3) The x values, being paired with their y values, take the positions of their corresponding y values.
In sum, the (x, y) values are ranked based on the y values that correspond to the ranked GenIQvar(y) values.
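The three construction steps amount to sorting the (x, y) pairs by their model score and then checking whether y comes out in descending order. A minimal Python sketch follows. The four records and their `score` values are made-up placeholders, not the Table 2 values, and `rank_by_score` and `ranking_is_perfect` are hypothetical helper names introduced only for this illustration.

```python
# Placeholder records: these are NOT the Table 2 values, only an illustration.
records = [
    {"id": "a", "x": 4.0, "y": 5.0, "score": 1.2},
    {"id": "b", "x": 9.0, "y": 7.0, "score": 3.4},
    {"id": "c", "x": 13.0, "y": 12.0, "score": 9.9},
    {"id": "d", "x": 6.0, "y": 6.0, "score": 2.0},
]


def rank_by_score(rows):
    """Sort rows by model score, descending; x and y travel with their row."""
    return sorted(rows, key=lambda row: row["score"], reverse=True)


def ranking_is_perfect(ranked_rows):
    """True when y never increases going down the score-ranked table."""
    ys = [row["y"] for row in ranked_rows]
    return all(upper >= lower for upper, lower in zip(ys, ys[1:]))


ranked = rank_by_score(records)
for row in ranked:
    print(row["id"], row["x"], row["y"], row["score"])
print("perfect ranking of y:", ranking_is_perfect(ranked))
```

A "perfect prediction of the ranking of y" in the sense used here corresponds to `ranking_is_perfect` returning `True` for the score-ranked table.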
The best way of illustrating the GenIQ Model's data mining feature as a data-straightener method is by examining the relationships displayed in the two scatterplots, Plot y*x and Plot GenIQvar*x, below, using the data in Table 2. The red line is the presumed true underlying straight-line relationship between the two variables at hand. It is clear from the first plot that the underlying relationship cannot possibly be linear because of the "far-out" point ID #1, which lies far above the red line. However, on the "other side of the red line" in the second plot, it is clearer that the underlying relationship is "right-on" linear, as the tiny scatter above and below the red line is random.
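The far-out point can also be located numerically. The sketch below fits an ordinary least squares line to the third Anscombe pair and reports the observation with the largest residual. It assumes the standard published Anscombe values, and it uses the least squares line as a stand-in for the red line, which may have been drawn differently in the plots; `fit_line` is a hypothetical helper name.

```python
# Third Anscombe pair (x3, y3), in the dataset's standard published order.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(x, y)

# Vertical distance of each observed y from the fitted line.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# Index of the observation that sits farthest from the line.
worst = max(range(len(x)), key=lambda i: abs(residuals[i]))

print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
print("farthest point:", (x[worst], y[worst]), "residual:", round(residuals[worst], 2))
```

Under these assumptions, a single observation stands well above the fitted line while the remaining residuals are comparatively small, which is the pattern the first scatterplot is described as showing.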
IV. Summary
GenIQ's data mining feature as a data-straightener is richly illustrated. Plot GenIQvar*x clearly indicates that GenIQ virtually straightens the data to a straight line by bringing individual ID #3 "in line." (It is worth noting that GenIQ's positioning of outlier ID #3 in line suggests that the GenIQ Model itself, as a model building technique, is resistant to outliers. In contrast, statistical regression modeling is well known for being quite sensitive to outliers, which most of the time results in discarding the outliers and consequently introducing potential bias into the regression model.) The reexpression of y is the computer code, which has as its visual complement the Picasso-like GenIQ tree displayed above.
The effectiveness of the GenIQ data-straightener method as seen in this "small" exercise is typical for big data (i.e., many variables and many observations) as well as small data. (Also worth noting: for big data, the GenIQ data-straightener simultaneously, in a multivariate sense, straightens all the predictor variables with the target variable.) Oh! Two things: the correlation coefficient value for x and y is 0.81629, a dubious measure of the linear relationship between the two variables; the correlation coefficient value for x and GenIQvar(y) is 0.9895, an undoubtedly reliable measure of the linear relationship between the two variables.
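The first of these two values can be checked directly. The Python sketch below computes the Pearson correlation coefficient for the third Anscombe pair, assuming the standard published values. The GenIQvar(y) scores are not reproduced in this text, so the 0.9895 figure is not recomputed. As an added illustration only, and not part of the GenIQ method, the sketch also recomputes the coefficient with the single far-out observation set aside, to show how strongly one point drives the value. `pearson_r` is a hypothetical helper name.

```python
import math

# Third Anscombe pair (x3, y3), in the dataset's standard published order.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


r_all = pearson_r(x, y)

# Illustration only: set aside the far-out observation (x = 13, y = 12.74).
kept = [(a, b) for a, b in zip(x, y) if a != 13]
r_without = pearson_r([a for a, _ in kept], [b for _, b in kept])

print("r, all 11 points:", round(r_all, 5))
print("r, far-out point set aside:", round(r_without, 5))
```

The first printed value should agree with the 0.81629 quoted above. The second is much closer to 1, which is why the full-data coefficient is a doubtful summary of how linear the relationship is.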
References
1. GenIQ uses x1, x2, ..., xn as intermediate variables in generating its computer code.
2. Anscombe, F. J., Graphs in statistical analysis. American Statistician, 1973.

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT1; or email at br@dmstat1.com. 

