Data defines the model by dint of genetic programming, producing the best decile table.


Overfitting: Old Problem, New Solution
Bruce Ratner, Ph.D.

Overfitting, a problem akin to model inaccuracy, is as old as model building itself because it is part and parcel of the modeling process. An overfitted model is one that comes close to reproducing the training data on which it is built by “capitalizing on the idiosyncrasies” of that data. The model takes on the complexity of the idiosyncrasies by including unnecessary variables, interactions, and constructed variables, none of which are part of the sought-after predominant pattern in the data. As a result, a major characteristic of an overfitted model is that it has too many variables. The overfitted model can therefore be thought of as a “too perfect” picture of the predominant pattern: it essentially memorizes the training data instead of capturing the desired pattern. Consequently, the individuals in a holdout dataset drawn from the same population as the training data, strangers unacquainted with the training data, cannot be expected to “fit into” the model’s perfect picture of the predominant pattern and produce good predictions. When a model’s accuracy on the holdout data is “out of the neighborhood” of its accuracy on the training data, the problem is one of overfitting, and the model is said to be overfitted.

It is “fitting” to digress here for guidelines on model building. A well-fitted model is one that faithfully represents the sought-after predominant pattern in the data while ignoring the idiosyncrasies of the training data. A well-fitted model is typically defined by a handful of variables because it does not include “idiosyncrasy” variables. The individuals in a holdout dataset, the everyman and everywoman unacquainted with the training data, can be expected to fit into the model’s faithfully rendered picture of the predominant pattern and produce good predictions. The accuracy of the well-fitted model on the holdout data will be “within the neighborhood” of its accuracy on the training data.
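The contrast between an overfitted and a well-fitted model can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the synthetic data (one variable, a cutoff at 0.5, 20% of labels flipped to stand in for idiosyncrasies), the two toy models, and the `TOLERANCE` of 0.10 used as the “neighborhood”. It is not taken from the article or from GenIQ.

```python
import random

TOLERANCE = 0.10  # assumed size of the "neighborhood"; not a value from the article


def make_data(n, rng, noise=0.2):
    """Hypothetical data: y = 1 when x > 0.5, with a share of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y  # an idiosyncratic record
        data.append((x, y))
    return data


def accuracy(predict, data):
    """Share of records whose label the predict function reproduces."""
    return sum(predict(x) == y for x, y in data) / len(data)


def memorizer(train):
    """Overfitted model: answer with the label of the closest training record."""
    return lambda x: min(train, key=lambda r: abs(r[0] - x))[1]


def cutoff_rule(train):
    """Simple model: one cutoff, chosen from a coarse grid on the training data."""
    def rule(c):
        return lambda x: 1 if x > c else 0
    best = max((i / 20 for i in range(21)), key=lambda c: accuracy(rule(c), train))
    return rule(best)


rng = random.Random(42)
train = make_data(1000, rng)
holdout = make_data(1000, rng)

memo = memorizer(train)
memo_train, memo_holdout = accuracy(memo, train), accuracy(memo, holdout)

simple = cutoff_rule(train)
rule_train, rule_holdout = accuracy(simple, train), accuracy(simple, holdout)

for name, tr, ho in [("memorizer", memo_train, memo_holdout),
                     ("cutoff rule", rule_train, rule_holdout)]:
    verdict = "within" if abs(tr - ho) <= TOLERANCE else "out of"
    print(f"{name}: training {tr:.3f}, holdout {ho:.3f} ({verdict} the neighborhood)")
```

The memorizer reproduces its own training labels exactly, flipped labels included, so its training accuracy overstates what it can deliver on new records. The cutoff rule ignores the flipped labels and should score about the same on both datasets.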

The problem of overfitting is well studied: the literature is replete with ways to prevent and fix overfitted models. The purpose of this article is to present a new preventive solution that uses the GenIQ Model© for 1) identifying the complexity (the variables and their structure) of the idiosyncrasies, and 2) deleting from the training data (and optionally from the holdout data) the records that define that complexity. The model then built on the “cleaned” training data reliably represents the predominant pattern, yielding a well-fitted model whose accuracy on the holdout data is within the neighborhood of its accuracy on the training data.
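The general idea of flagging idiosyncratic records, deleting them, and rebuilding the model can be sketched as follows. This is not GenIQ, whose method is not described here. As a stand-in, the sketch uses nearest-neighbor editing: a training record is dropped when its label disagrees with the majority of its `K` nearest neighbors. The synthetic data, the value of `K`, and all function names are assumptions made for illustration.

```python
import heapq
import random

K = 7  # assumed number of neighbors consulted when flagging a record


def make_data(n, rng, noise=0.2):
    """Hypothetical data: y = 1 when x > 0.5, with a share of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y  # an idiosyncratic record
        data.append((x, y))
    return data


def knn_vote(x, records, k):
    """Majority label among the k records nearest to x."""
    nearest = heapq.nsmallest(k, records, key=lambda r: abs(r[0] - x))
    return 1 if 2 * sum(y for _, y in nearest) > k else 0


def holdout_accuracy(train, data):
    """Accuracy on data of a 1-nearest-neighbor model built on train."""
    return sum(knn_vote(x, train, 1) == y for x, y in data) / len(data)


rng = random.Random(7)
train = make_data(800, rng)
holdout = make_data(800, rng)

# Steps 1 and 2 (stand-in): flag and delete records that disagree with
# the majority of their K nearest neighbors, excluding the record itself.
cleaned = [
    (x, y)
    for i, (x, y) in enumerate(train)
    if knn_vote(x, train[:i] + train[i + 1:], K) == y
]

# Rebuild the same kind of model on the cleaned data and compare.
acc_before = holdout_accuracy(train, holdout)
acc_after = holdout_accuracy(cleaned, holdout)

print(f"records kept: {len(cleaned)} of {len(train)}")
print(f"holdout accuracy before cleaning: {acc_before:.3f}")
print(f"holdout accuracy after cleaning:  {acc_after:.3f}")
```

The holdout data is left untouched here, so the comparison shows only the effect of cleaning the training data.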

Do not be diffident: request by email a PowerPoint presentation of GenIQ as a new, unique solution to the overfitting problem. When you do, you will be different, having the know-how to eliminate “noise” from your data for efficient, effective modeling.


Hope to hear from you!


For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at br@dmstat1.com.