
Different Data, Identical Regression Models: Which Model is Better?
Bruce Ratner, Ph.D.

As a data miner, I can say with 100% confidence that I have discovered two patterns: 1) when I submit a Regression Model Proposal to a prospective client, the client always looks at the last page of the proposal for the cost of the project; and 2) if I get the project, when I hand out the final presentation deck, the client always looks for the page with the R-squared of the regression model built.

As a consulting statistician presenting the realized regression model, I have the effortful job of explaining (rather, re-teaching) some basic concepts to otherwise bright clients, who inform me of the statistics courses under their ("black") belts but never fully understood those concepts at the outset. At this point, I know that building the model is not the hard part of the project; presenting the results is. Invariably, the first obstacle is explaining why their “pet” variables are not in the model. The second hurdle is absorbing their shock when they discover their misunderstanding of R-squared.

Accordingly, I always scrounge for statistical tidbits to help me explain (by show-and-tell) statistical concepts that clients have unwittingly misunderstood for too long. My latest statistical tidbit: I build two OLS regression models, regressing Y1 on X1 and regressing Y2 on X2, using the data in Table 1, below. Both fitted OLS regression models are identical, with the same intercept and the same slope! Which model is better? Which pair of variables is better explained by this one model? Which model produces the more accurate predictions?

Y = 134.94743 + 0.10005*X
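
For readers who want to experiment with the question, here is a minimal sketch in Python. It does not use the Table 1 data. It assumes two synthetic (X, Y) pairs, built so that each OLS fit reproduces the equation above while the scatter around the line differs. The function names fit_ols and make_y, the sample size, the X range, and the noise scales are all hypothetical choices made for illustration.

```python
import random

def fit_ols(x, y):
    """Simple OLS of y on x; returns (intercept, slope, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, 1.0 - ss_res / syy

def make_y(x, intercept, slope, noise_scale, rng):
    """Synthetic y whose OLS line on x is intercept + slope*x, up to rounding."""
    raw = [rng.gauss(0.0, 1.0) for _ in x]
    # Strip from the noise anything a straight line in x could explain,
    # so adding it changes the scatter but not the fitted coefficients.
    a0, b0, _ = fit_ols(x, raw)
    resid = [r - a0 - b0 * xi for xi, r in zip(x, raw)]
    return [intercept + slope * xi + noise_scale * e for xi, e in zip(x, resid)]

rng = random.Random(0)
A, B = 134.94743, 0.10005  # coefficients quoted in the article

# Assumed, synthetic data: NOT the article's Table 1.
x1 = [rng.uniform(0, 1000) for _ in range(50)]
x2 = [rng.uniform(0, 1000) for _ in range(50)]
y1 = make_y(x1, A, B, 5.0, rng)   # tight scatter around the line
y2 = make_y(x2, A, B, 40.0, rng)  # wide scatter around the same line

fit1 = fit_ols(x1, y1)
fit2 = fit_ols(x2, y2)
print("Model 1: Y = %.5f + %.5f*X, R-squared = %.3f" % fit1)
print("Model 2: Y = %.5f + %.5f*X, R-squared = %.3f" % fit2)
```

Under these assumptions, both fits print the same intercept and slope, yet their R-squared values differ, because the equation alone says nothing about how closely the data hug the line.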

If you like, please email me your answer, or email me for the answer.


To download The Data in "txt" format, click here.

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at