Data defines the model by dint of genetic programming, producing the best decile table.


Genetic Data Mining Method for the Proper Use of the Correlation Coefficient
Bruce Ratner, Ph.D.

Assessing the relationship between a predictor variable and a target variable is an essential task in the model-building process. If a relationship is identified and is tractable, the predictor variable is re-expressed to reflect the uncovered relationship and then tested for inclusion in the model. Most methods of variable assessment are based on the well-known correlation coefficient, which is often misused because its linearity assumption is not tested. The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best “data-straightener” available today. I use the third pair of x and y values from the well-known Anscombe data.


OUTLINE



I. Anscombe Data

ID     x       y

 1    10     7.46
 2     8     6.77
 3    13    12.74
 4     9     7.11
 5    11     7.81
 6    14     8.84
 7     6     6.08
 8     4     5.39
 9    12     8.15
10     7     6.42
11     5     5.73



II. GenIQ Model (Tree Display)

[Figure: GenIQ Model tree display]

The GenIQ Model (Code)

x1 = .6550772; 
          x2 = x; 
     If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1; 
          x2 = x; 
               x3 = x; 
          x2 = x2 + x3; 
          x2 = Cos(x2); 
     x1 = x1 + x2;
GenIQvar(y) = x1;
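
Read step by step, the listing above reduces to a single expression: x divided by the constant .6550772, plus the cosine of 2x. The following is a minimal Python sketch of that reading, not the GenIQ software itself. The function name `geniq_var` and the constant name `DIVISOR` are illustrative, and the cosine is assumed to take its argument in radians, an assumption that reproduces the GenIQvar(y) values reported in this article.

```python
import math

DIVISOR = 0.6550772  # constant taken from the code listing


def geniq_var(x: float) -> float:
    """Illustrative translation of the listing: x / .6550772 + cos(2x)."""
    # Mirrors "If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1;"
    x1 = x / DIVISOR if DIVISOR != 0 else 1.0
    # Mirrors "x2 = x2 + x3; x2 = Cos(x2);" with x2 = x3 = x (radians assumed)
    x2 = math.cos(x + x)
    return x1 + x2


# Scores for the largest and smallest x values in the Anscombe data
for x in (14, 13, 4):
    print(x, round(geniq_var(x), 4))
```

The rounded scores can be compared with the GenIQvar(y) column for x = 14, 13, and 4.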



III. GenIQ Model Results
The results of the GenIQ Model as a data-straightener are in Table 2. The table is ordered by descending GenIQ Model score, GenIQvar(y), and the y values fall in descending order as well: a perfect rank-order prediction.

Table 2. GenIQ Model Rank-order Prediction

ID     x       y     GenIQvar(y)

 3    13    12.74      20.4919
 6    14     8.84      20.4089
 9    12     8.15      18.7426
 5    11     7.81      15.7920
 1    10     7.46      15.6735
 4     9     7.11      14.3992
 2     8     6.77      11.2546
10     7     6.42      10.8225
 7     6     6.08      10.0031
11     5     5.73       6.7936
 8     4     5.39       5.9607
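
The rank-order claim for Table 2 can be checked directly: sort the IDs once by score and once by y, then compare the two orderings. This is a hedged Python sketch; the score is recomputed as x / .6550772 + cos(2x) with the cosine in radians, which is an assumed reading of the GenIQ code listing, and the variable names are illustrative.

```python
import math

# Third Anscombe pair, listed in ID order 1..11
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Assumed closed form of the GenIQ score (cosine in radians)
score = [xi / 0.6550772 + math.cos(2 * xi) for xi in x]

ids = range(1, len(x) + 1)
by_score = sorted(ids, key=lambda i: score[i - 1], reverse=True)
by_y = sorted(ids, key=lambda i: y[i - 1], reverse=True)

# True when ranking by score reproduces the ranking by y exactly
perfect_rank_order = by_score == by_y
print(by_score)
print(perfect_rank_order)
```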



Perhaps the best way to illustrate the GenIQ Model as both a data-straightener and a data mining tool is with two plots: Plot y*x and Plot GenIQvar(y)*x.

[Figure: Plot y*x and Plot GenIQvar(y)*x]


IV. Summary
Perhaps the GenIQ Model is an excellent data-straightener and data mining tool all in one? What do you think?
Oh, two things: the correlation coefficient between y and x is 0.81629, and the correlation coefficient between GenIQvar(y) and x is 0.9895.
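
For readers who want to recompute the two correlation coefficients, here is a minimal Python sketch using the textbook Pearson formula. The helper `pearson` is illustrative, and GenIQvar(y) is again taken to be x / .6550772 + cos(2x) with the cosine in radians, an assumed reading of the code listing rather than output from the GenIQ software.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Third Anscombe pair, listed in ID order 1..11
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Assumed closed form of the GenIQ score (cosine in radians)
geniq = [xi / 0.6550772 + math.cos(2 * xi) for xi in x]

r_yx = pearson(y, x)      # correlation between y and x
r_gx = pearson(geniq, x)  # correlation between GenIQvar(y) and x
print(round(r_yx, 5), round(r_gx, 4))
```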


For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at br@dmstat1.com.