The classical, everyday technique for predicting a continuous target variable is the statistical ordinary least squares (OLS) regression model. Specifically, it estimates the mean of the target variable from the values of the predictor variables that define the model. The OLS regression model is assumption-full; the best known of its eight assumptions is that the target variable is normally distributed. When this assumption is not met, the mean predictions are questionable. Profit distributions found in database marketing are typically not normal (bell-shaped); thus OLS prediction of Profit is doubtful.
The remedy for predicting a skewed target variable is median regression, which Koenker and Bassett (1978) introduced under the more general setting of the quantile regression model (QRM). The pth quantile is the value of the target variable distribution below which a proportion p of the population falls. For example, the median is the 0.5th quantile: the value of a distribution below which 50% of the observations fall. Other common quantiles are:
- the first quartile is the 0.25th quantile – the value of a distribution below which 25% of the observations fall
- the first quintile is the 0.20th quantile – the value of a distribution below which 20% of the observations fall
- the first decile is the 0.10th quantile – the value of a distribution below which 10% of the observations fall
- the first percentile is the 0.01th quantile – the value of a distribution below which 1% of the observations fall
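The quantile definitions above can be illustrated with a minimal sketch. The `profit` values are hypothetical and chosen only to be right-skewed, and `empirical_quantile` is an illustrative helper, not part of any package. It assumes one common convention (the smallest observed value with at least a fraction p of the observations at or below it); statistical packages offer several interpolation rules that can give slightly different answers.

```python
import math

def empirical_quantile(values, p):
    """Smallest observed value with at least a fraction p of observations at or below it.

    Assumed convention for illustration; other quantile rules interpolate.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(values)
    # Position of the order statistic: ceil(n * p), shifted to a 0-based index.
    k = max(math.ceil(len(ordered) * p) - 1, 0)
    return ordered[k]

# Hypothetical, right-skewed Profit values (illustrative only).
profit = [5, 8, 9, 12, 14, 15, 18, 22, 30, 250]

median = empirical_quantile(profit, 0.50)
first_quartile = empirical_quantile(profit, 0.25)
first_decile = empirical_quantile(profit, 0.10)
mean = sum(profit) / len(profit)

print("mean:", mean)
print("median:", median)
print("first quartile:", first_quartile)
print("first decile:", first_decile)
```

In this made-up sample the single large Profit value pulls the mean well above the median, which is the kind of skew that makes a mean prediction a poor summary.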
The idea of the QRM is not new, as it dates back to 1760 when Josip Boscovich sought computational advice for his hopeful method of median regression. Today, QRM is widely available in major statistical packages, thanks to the computational power of the personal computer. The computational estimation of the QRM runs parallel to the OLS optimization, whose measure is: minimize the sum of squared errors. The median regression optimization minimizes the sum of absolute errors. For a specific quantile other than the median, the optimization measure is: minimize an adjusted sum of absolute errors, where the adjustment factor weights positive and negative errors differently to accommodate the specific quantile (location of the distribution). The value of QRM in database marketing is that it provides an encompassing methodology to go from a complete picture of a distribution to a zoomed-in segment of special interest. For example, the 0.99th quantile regression model provides reliable Profit estimates for targeting best customers, and the key drivers (predictor variables) necessary to develop marketing strategies aimed at best customers. Heretofore, OLS regression provided only grand summary (mean) Profit predictions of unknown reliability, along with misguided key drivers.
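The optimization measures described above can be sketched for the simplest possible model, one with no predictor variables, where the prediction is a single constant. This is an illustration under stated assumptions, not the estimation routine of any statistical package: `check_loss` and `best_constant` are hypothetical helper names, the `profit` values are made up, and the adjustment is assumed to take the usual form of weighting positive errors by p and negative errors by 1 - p.

```python
def check_loss(errors, p):
    """Adjusted sum of absolute errors for quantile p.

    Positive errors (under-predictions) are weighted by p,
    negative errors (over-predictions) by 1 - p.
    With p = 0.5 this is half the plain sum of absolute errors.
    """
    return sum(p * e if e >= 0 else (p - 1) * e for e in errors)

def best_constant(values, p):
    """Constant prediction minimizing the check loss.

    The search is restricted to the observed values, which is enough
    because the loss is piecewise linear with kinks at the data points.
    """
    return min(sorted(values),
               key=lambda c: check_loss([y - c for y in values], p))

# Hypothetical, right-skewed Profit values (illustrative only).
profit = [5, 8, 9, 12, 14, 15, 18, 22, 30, 45, 250]

fit_median = best_constant(profit, 0.5)   # minimizes the sum of absolute errors
fit_q90 = best_constant(profit, 0.9)      # minimizes the adjusted sum for p = 0.9
mean = sum(profit) / len(profit)          # minimizes the sum of squared errors

print("mean (squared-error fit):", round(mean, 2))
print("median fit:", fit_median)
print("0.9th quantile fit:", fit_q90)
```

A full quantile regression replaces the constant with a linear function of the predictor variables and minimizes the same loss over its coefficients, typically by linear programming.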
The purpose of this article is to compare and contrast the parametric, assumption-full quantile regression model with a machine learning alternative quantile regression model, the GenQR© Model. GenQR is an assumption-free, nonparametric model (model-free, in that the data define the model) that automatically data mines for new variables, performs variable selection, and then specifies the model based on the machine learning paradigm of genetic programming. The GenQR Model offers a clear advantage over the statistical quantile regression, whose performance is dependent on theoretical assumptions, a pre-specified parametric model, and data restrictions. Pointedly, the GenQR Model automatically determines the best set of predictor variables (from the original variables and the newly constructed, genetically data mined variables) based on a virtually unbiased assessment of all variables under consideration, an achievement not possible with statistical methods.