Data defines the model by dint of genetic programming, producing the best decile table.


Real World Data are Dirty:
Data Cleaning and the "Noise" Problem

Bruce Ratner, Ph.D.

Data cleaning (aka data cleansing or scrubbing) is step 0 (the first task) of any data analysis or statistical modeling effort; its purpose is to ensure the quality and soundness of the data. After a good data scrub that detects and removes errors and inconsistencies in the data, the resultant analysis/model can be stamped: “Results with Confidence.” Otherwise, if the analysis/model is performed without concern for the caliber of the data, then the stamp should read: “Results are Wanting.” Although the list of steps for cleaning “dirty” data is as varied as the analysts doing the “dirty” work, a common core applies to nearly every project: the Ten Basics.

After the ten basic and analyst-specific checks are done, data cleaning is not complete until the noise in the data is eliminated. Noise is the idiosyncrasies of the data: the particulars, the “nooks and crannies” that are not part of the sought-after essence (e.g., the predominant pattern) of the data with regard to the objective of the analysis/model. Ergo, the data particulars are lonely, not-really-belonging-to pieces of information that happen to be both in the population from which the data were drawn and in the data themselves (what an example of a double-chance occurrence!). Paradoxically, as the analyst includes more and more of the prickly particulars in the analysis/model, the analysis/model fits the data at hand better and better, yet its validation on fresh data becomes worse and worse. This is the signature of overfitting. Noise must be eliminated from the data.
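The paradox described above can be seen in a small experiment. The following is a minimal sketch, not the article's method: it assumes hypothetical data whose essence is the line y = 2x with unit-variance noise added, and it uses a k-nearest-neighbor average as a stand-in model. A small k lets the model absorb every particular of the training data; a larger k smooths the particulars away. All names here (`knn_predict`, `make_sample`, `results`) are illustrative.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model (built on train) over data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def make_sample(rng, n):
    # Hypothetical data: essence y = 2x, plus Gaussian noise (the particulars).
    xs = [rng.random() for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

rng = random.Random(0)
train = make_sample(rng, 400)   # data used to build the model
valid = make_sample(rng, 400)   # fresh data used only for validation

results = {}
for k in (1, 5, 20):
    results[k] = (mse(train, train, k), mse(train, valid, k))
    print(f"k={k:>2}  training MSE={results[k][0]:.3f}  validation MSE={results[k][1]:.3f}")
```

With k = 1 the model reproduces every training record exactly, so its training error is zero, yet it validates worse than the smoother k = 20 model. The better fit comes from memorizing noise, not from capturing the essence.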

The purpose of this article is to provide the outline of the Ten Basics of data cleaning, and the procedure for eliminating noise from data. The GenIQ Model© is used to 1) identify the idiosyncrasies, and 2) delete the actual records that define the idiosyncrasies of the data. Now, the analysis/model can be built with “cleaned” data that reliably represent the sought-after essence of the data, yielding a well-conducted analysis and a well-fitted model.
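The two-step idea (identify, then delete) can be illustrated with a generic stand-in. The sketch below is not GenIQ, whose internals are not described here; it swaps in a plain least-squares line plus a median-absolute-deviation rule on the residuals, purely as an assumption for illustration. The records, the `cutoff` value, and the function names are all hypothetical.

```python
from statistics import mean, median

def fit_line(records):
    """Ordinary least-squares line through (x, y) records."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in records) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def split_noise(records, cutoff=3.0):
    """Step 1: flag records whose residual is far from the typical residual.
    Step 2: return the records with the flagged ones deleted."""
    slope, intercept = fit_line(records)
    residuals = [y - (slope * x + intercept) for x, y in records]
    center = median(residuals)
    # 1.4826 * MAD approximates a standard deviation for normal data.
    scale = 1.4826 * median([abs(r - center) for r in residuals])
    flagged = [abs(r - center) > cutoff * scale for r in residuals]
    noisy = [rec for rec, bad in zip(records, flagged) if bad]
    cleaned = [rec for rec, bad in zip(records, flagged) if not bad]
    return cleaned, noisy

# Hypothetical records: roughly y = 2x, with one idiosyncratic record at x = 5.
records = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
           (5, 30.0), (6, 12.1), (7, 13.8), (8, 16.0)]

cleaned, noisy = split_noise(records)
print("deleted:", noisy)
print("kept:", len(cleaned), "records")
```

The median-based scale is used because a single extreme record inflates an ordinary standard deviation and can hide itself; any rule of this kind needs a cutoff chosen with the analysis objective in mind.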

Do not be diffident: request by email a PowerPoint presentation of GenIQ as a new, unique procedure for eliminating noise from data. When you do, you will be different, having the know-how to eliminate noise from your data for efficient, effective data cleaning.


Hope to hear from you!

BRoverfit



For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at br@dmstat1.com.