|
Assessing the relationship between a predictor variable and a target variable is an essential task in the model building process. If the relationship is identified and tractable, the predictor variable is re-expressed to reflect the uncovered relationship and then tested for inclusion in the model. Most methods of variable assessment are based on the well-known correlation coefficient, which is often misused because its linearity assumption goes untested. The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best “data-straightener” available today. As the example, I use the third pair of x and y values from the well-known Anscombe data.
OUTLINE
I. Anscombe Data
ID   x    y
 1  10   7.46
 2   8   6.77
 3  13  12.74
 4   9   7.11
 5  11   7.81
 6  14   8.84
 7   6   6.08
 8   4   5.39
 9  12   8.15
10   7   6.42
11   5   5.73
II. GenIQ Model (Tree Display)
The GenIQ Model (Code)
x1 = .6550772;
x2 = x;
If x1 NE 0 Then x1 = x2 / x1;
Else x1 = 1;
x2 = x;
x3 = x;
x2 = x2 + x3;
x2 = Cos(x2);
x1 = x1 + x2;
GenIQvar(y) = x1;
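The assignment sequence in the listing can be traced line by line in Python. This is only a sketch, not output of the GenIQ software: the function names `geniq_var` and `geniq_closed_form` are hypothetical labels chosen here, the constant is copied from the listing, and the cosine argument is assumed to be in radians. Following the steps suggests the score reduces to x / 0.6550772 + cos(2x).

```python
import math


def geniq_var(x: float) -> float:
    """Step-by-step translation of the listing (function name is hypothetical)."""
    x1 = 0.6550772
    x2 = x
    # Guarded division from the listing; x1 is a nonzero constant here,
    # so the Else branch is never taken.
    x1 = x2 / x1 if x1 != 0 else 1.0
    x2 = x
    x3 = x
    x2 = x2 + x3          # 2x
    x2 = math.cos(x2)     # cos(2x), assuming radians
    x1 = x1 + x2
    return x1


def geniq_closed_form(x: float) -> float:
    """The same score collapsed into a single expression."""
    return x / 0.6550772 + math.cos(2 * x)


if __name__ == "__main__":
    for x in (4, 13, 14):
        print(x, round(geniq_var(x), 4))
```

Under the radians assumption, the scores for x = 13 and x = 4 agree with the GenIQvar(y) values reported in Table 2 to about four decimals.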
III. GenIQ Model Results
The results of the GenIQ Model as a data-straightener are in Table 2. There is a perfect rank-order prediction based on the descending GenIQ model score GenIQvar(y), which is used to order the table.
Table 2. GenIQ Model Rank-order Prediction
ID   x    y      GenIQvar(y)
 3  13  12.74    20.4919
 6  14   8.84    20.4089
 9  12   8.15    18.7426
 5  11   7.81    15.7920
 1  10   7.46    15.6735
 4   9   7.11    14.3992
 2   8   6.77    11.2546
10   7   6.42    10.8225
 7   6   6.08    10.0031
11   5   5.73     6.7936
 8   4   5.39     5.9607
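The rank-order claim for Table 2 can be checked with a few lines of Python. The sketch below assumes the score is x / 0.6550772 + cos(2x) with the cosine in radians, which is one reading of the GenIQ code. The names `DATA`, `score`, and `is_strictly_descending` are hypothetical. It sorts the observations by descending score and tests whether y is then strictly descending; for contrast, it runs the same test after sorting by x alone.

```python
import math

# Third Anscombe pair, copied from the data table: (ID, x, y)
DATA = [(1, 10, 7.46), (2, 8, 6.77), (3, 13, 12.74), (4, 9, 7.11),
        (5, 11, 7.81), (6, 14, 8.84), (7, 6, 6.08), (8, 4, 5.39),
        (9, 12, 8.15), (10, 7, 6.42), (11, 5, 5.73)]


def score(x: float) -> float:
    """Assumed closed form of the GenIQ score (cosine in radians)."""
    return x / 0.6550772 + math.cos(2 * x)


def is_strictly_descending(values) -> bool:
    return all(a > b for a, b in zip(values, values[1:]))


# Order by descending score, as Table 2 does.
by_score = sorted(DATA, key=lambda row: score(row[1]), reverse=True)
ids_by_score = [row[0] for row in by_score]
score_is_perfect = is_strictly_descending([row[2] for row in by_score])

# Contrast: order by descending raw x.
by_x = sorted(DATA, key=lambda row: row[1], reverse=True)
x_is_perfect = is_strictly_descending([row[2] for row in by_x])

if __name__ == "__main__":
    print("IDs by score:", ids_by_score)
    print("perfect rank order by score:", score_is_perfect)
    print("perfect rank order by x:", x_is_perfect)
```

Ordering by raw x fails the test because the observation with x = 14 (y = 8.84) is placed ahead of the one with x = 13 (y = 12.74); the score reverses those two.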
Perhaps the best way to illustrate the GenIQ Model as both a data-straightener and a data mining tool is the pair of plots below: Plot y*x and Plot GenIQ*x.
IV. Summary
Perhaps the GenIQ Model is an excellent data-straightener and data mining tool, all in one? What do you think? Oh, one more thing: the correlation coefficients between y and x, and between GenIQvar(y) and x, are 0.81629 and 0.9895, respectively.
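The two correlation coefficients quoted in the summary can be recomputed from the data. The sketch below uses a hand-written Pearson formula and assumes the GenIQ score is x / 0.6550772 + cos(2x) with the cosine in radians. The names `X`, `Y`, `G`, and `pearson` are hypothetical.

```python
import math

# Third Anscombe pair, in ID order 1..11
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def pearson(a, b) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Assumed closed form of the GenIQ score for each x.
G = [x / 0.6550772 + math.cos(2 * x) for x in X]

r_yx = pearson(Y, X)   # correlation between y and x
r_gx = pearson(G, X)   # correlation between the GenIQ score and x

if __name__ == "__main__":
    print(f"r(y, x)     = {r_yx:.5f}")
    print(f"r(GenIQ, x) = {r_gx:.4f}")
```

Under these assumptions both values should agree with the quoted figures to about three decimals.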
|