Data defines the model by dint of genetic programming, producing the best decile table.
Genetic Data Mining: The Correlation Coefficient
Bruce Ratner, Ph.D.
Assessing the relationship between a predictor variable and a target variable is an essential task in building a statistical linear regression model. If the relationship is straight-line (linear), then no extra work of straightening the relationship is needed: simply test the predictor variable's statistical importance to stay in the model. If the relationship is not linear, then one of the two variables is re-expressed (although sometimes both are) so that the "re-expressed" relationship is as linear as the data permit. Then, the re-expressed variable is tested for inclusion in the model. Most methods of assessing relationships among variables are based on the well-known correlation coefficient, which is often misused because its linearity assumption (i.e., the true underlying relationship is straight-line) is not tested with a scatterplot. The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model, that is one of the better "data-straightener" methods available. I use a small dataset to make the "GenIQ data-straightener" method tractable and attractive enough for the everyday model builder to make it part of the modeler's toolkit. I present a succinct discussion of the genetic-based method, along with a basic statement of the GenIQ Model and what is "good-to-know" about the GenIQ Model output, to ease the understanding of genetic data mining for the correlation coefficient.
0. Good-to-Know about the GenIQ Model Output
The GenIQ Model© is a machine learning alternative to the statistical ordinary least squares and logistic regression models. GenIQ lets the data define the model: it automatically data mines for new variables, performs variable selection, and then specifies the model equation so as to "optimize the decile table," that is, to fill the upper deciles with as much profit or as many responses as possible. GenIQ requires no programming, produces models that outdo statistical models, and is a different kind of model: unsuspected equation, ungainly interpretation, and easy implementation.
An optimal decile table in a "correlation" application, like the example to follow, is tantamount to obtaining the best possible ranking of the target variable values based on the GenIQ Model scores. Note that when assessing a correlation there is no assignment of target and predictor variables. However, because this article is about assessing the relationship between variables in the context of linear regression modeling, I arbitrarily select y as the target variable. Ergo, x is the predictor variable.
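For readers who have not worked with a decile table, the following is a minimal Python sketch of how one is typically assembled from model scores and target values. It is not GenIQ code: the function name `decile_table`, the toy scores, and the toy responses are all illustrative assumptions.

```python
def decile_table(scores, targets, n_groups=10):
    """Rank records by score (highest first) and summarize the target per group."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total = sum(t for _, t in ranked)
    rows, cumulative = [], 0.0
    for g in range(n_groups):
        # Integer arithmetic splits the ranked list into near-equal groups.
        start = g * n // n_groups
        stop = (g + 1) * n // n_groups
        group_sum = sum(t for _, t in ranked[start:stop])
        cumulative += group_sum
        rows.append({
            "decile": g + 1,
            "count": stop - start,
            "target_sum": group_sum,
            "cum_share": cumulative / total if total else 0.0,
        })
    return rows


# Hypothetical data: 20 records, where the 5 responders have the 5 highest scores.
toy_scores = [i / 20 for i in range(20)]
toy_targets = [1 if i >= 15 else 0 for i in range(20)]

for row in decile_table(toy_scores, toy_targets):
    print(row)
```

In this toy case all responders land in the top three deciles, which is what "filling the upper deciles" means: the better the scores rank the target, the faster the cumulative share reaches 100%.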
The GenIQ Model output consists of two parts: the GenIQ Model tree (aka a parse tree) and the GenIQ Model "equation," which is actually computer code. Unfortunately, the model equation/code is ungraspable in virtually all situations. However, the tree serves as a tool for modelers who think visually: it assists in understanding the code, which in turn helps in interpreting the model. Using the third (x3, y3) values, which I must rename (x, y) for the sake of GenIQ (see note 1), from the well-known Anscombe Data (see note 2) in Table 1, above, I build a GenIQ Model with y against x. I discuss the GenIQ Model output and the GenIQ Model results in sections II and III, respectively, below.
III. GenIQ Model Results
The results of the GenIQ data-straightener lie within Table 2, below. There is a perfect prediction of the ranking of y, based on the GenIQ Model score GenIQvar, or interchangeably GenIQvar(y). The construction of Table 2 is as follows:
1) The GenIQvar(y) values are ranked in descending order, from the first position value of 20.4910 to the last position value of 5.9607.
2) The y values are ranked based on their corresponding ranked GenIQvar(y) values.
3) The x values, being paired with immutable y values, take the positions of their corresponding y values.
In sum, the (x, y) values are ranked based on the y values that correspond to the ranked GenIQvar(y) values.
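The three construction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `rank_by_score` and `is_perfect_ranking` are hypothetical, and the records are made-up numbers, not the values in Table 2.

```python
def rank_by_score(records):
    """Sort (x, y, score) records by score, highest first.

    Because each record is kept whole, x and y travel with their score,
    which is what steps 1) through 3) describe.
    """
    return sorted(records, key=lambda record: record[2], reverse=True)


def is_perfect_ranking(ranked):
    """Return True when y never increases going down the score-ranked list."""
    ys = [y for _, y, _ in ranked]
    return all(a >= b for a, b in zip(ys, ys[1:]))


# Hypothetical (x, y, score) records, NOT taken from Table 2.
toy_records = [(1, 2.0, 0.3), (2, 9.5, 4.1), (3, 4.2, 1.7), (4, 6.8, 2.9)]

ranked = rank_by_score(toy_records)
print(ranked)
print(is_perfect_ranking(ranked))
```

A "perfect prediction of the ranking of y" in the sense used above corresponds to `is_perfect_ranking` returning `True`: sorting by the model score also sorts the y values.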
The best way to illustrate the GenIQ Model's data mining feature as a data-straightener method is to examine the relationships displayed in the two scatterplots, Plot y*x and Plot GenIQvar*x, below, using the data in Table 2. The red line is the presumed true underlying straight-line relationship between the two variables at hand. It is clear from the first plot that the underlying relationship cannot possibly be linear because of the "far-out" point ID #1, which lies far above the red line. However, on the "other side of the red line," in the second plot, it is clear that the underlying relationship is "right-on" linear, as the slight scatter above and below the red line is random.
GenIQ's data mining feature as a data-straightener is richly illustrated. Plot GenIQvar*x clearly indicates that GenIQ virtually straightens the data to a straight line by bringing individual ID #3 "in-line." (It is worth noting that GenIQ's positioning of outlier ID #3 in-line suggests that the GenIQ Model itself, as a model building technique, is resistant to outliers. In contrast, statistical regression modeling is well known for being quite sensitive to outliers, which most of the time results in discarding the outliers and consequently introducing potential bias into the regression model.) The re-expression of y is the computer code, which has as its visual complement the Picasso-like GenIQ tree displayed above.
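The sensitivity of ordinary least squares to a single outlier can be checked directly on the third Anscombe dataset. The sketch below is plain least squares, not GenIQ. The data values are copied from the standard published quartet (Anscombe, 1973) on the assumption that they match the article's table, and the helper name `ols_line` is hypothetical.

```python
# Third Anscombe dataset (x3, y3), as published in Anscombe (1973).
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Fit once on all 11 points, then again with the outlier (13, 12.74) dropped.
intercept_all, slope_all = ols_line(X, Y)
kept = [(x, y) for x, y in zip(X, Y) if (x, y) != (13, 12.74)]
intercept_trim, slope_trim = ols_line([x for x, _ in kept], [y for _, y in kept])

print(f"all points:      y = {intercept_all:.3f} + {slope_all:.3f} x")
print(f"outlier removed: y = {intercept_trim:.3f} + {slope_trim:.3f} x")
```

One point out of eleven moves the fitted slope from roughly 0.35 to roughly 0.50, which is the kind of pull that tempts modelers to discard outliers.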
The effectiveness of the GenIQ data-straightener method as seen in this "small" exercise is typical for big data (i.e., many variables and many observations) as well as small data. (Also worth noting: for big data, the GenIQ data-straightener simultaneously, in a multivariate sense, straightens all the predictor variables with the target variable.) Oh! Two things: the correlation coefficient value for x and y is 0.81629, a dubious measure of the linear relationship between the two variables; the correlation coefficient value for x and GenIQvar(y) is 0.9895, an undoubtedly reliable measure of the linear relationship between the two variables.
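The first of these two values can be reproduced with a short Python sketch of the Pearson correlation coefficient. The data are the third Anscombe dataset as published (Anscombe, 1973), assumed here to match the article's table, and `pearson_r` is a hypothetical helper name. The GenIQvar(y) scores are not listed in this article, so the 0.9895 value is not recomputed.

```python
import math

# Third Anscombe dataset (x3, y3), as published in Anscombe (1973).
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(X, Y)
print(f"r(x, y) = {r:.5f}")
```

The coefficient alone says nothing about whether the straight-line assumption holds; that check still requires looking at the scatterplot.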
1. GenIQ uses x1, x2, ..., xn as intermediate variables in generating its computer code.
2. Anscombe, F. J., Graphs in statistical analysis. American Statistician, 1973.
For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at firstname.lastname@example.org.