
When Data Are Too Large to Handle in the Memory of Your Computer
Bruce Ratner, Ph.D.

Data preparation for data sets too large to fit in a computer's memory is a growing area of inquiry. The prevailing approach to handling big data is to subsample the original data in a way that does not sacrifice accuracy. The purpose of this article is to acquaint the data analyst with the latest such subsampling procedures, which use partitioning and bagging. The procedures have two advantages: 1) the subsample size can be set to whatever amount of the original data your computer can handle comfortably, and 2) the procedures potentially yield better accuracy (via "committee averaging" of the models built on the subsamples) than a single model built on all the data. The procedures are:

  1. Disjoint Partition: Divide the original data into N disjoint partitions, each 1/N-th the size of the original data, assigning elements to partitions at random. Thus, replication can occur neither within nor across the partitions.
  2. Small Bags: Create N “bags,” each a new dataset 1/N-th the size of the original data. Each bag is created independently by random sampling with replacement from the original data. Thus, replication can occur both within and across the bags.
  3. No Replication-Small Bags: Similar to small bags in #2, but each individual bag is sampled without replacement. Thus, there is no replication within a bag, but replication may occur across bags.
  4. Disjoint Bags: Begin with the disjoint partitions in #1; then, independently for each partition, randomly select a number of its elements with replacement and add them to that partition. The number of added elements equals the average number of repeated elements in a small bag in #2.
These latest procedures for handling data too large for your computer will be a welcome addition to the data analyst's data preparation toolkit.
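
The four procedures listed above, together with committee averaging, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the article's implementation: all function names are hypothetical, the data are assumed to be a list small enough to index for demonstration, a "model" is assumed to be any callable that returns a numeric score, and the "average number of repeated elements in a small bag" is assumed to be the bag size minus the expected number of distinct elements in the bag.

```python
import random

def disjoint_partitions(data, n_parts, rng):
    """Procedure 1: shuffle once, then deal the rows into non-overlapping parts."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]

def small_bags(data, n_bags, rng):
    """Procedure 2: each bag is drawn WITH replacement from the full data."""
    size = len(data) // n_bags
    return [rng.choices(data, k=size) for _ in range(n_bags)]

def no_replication_small_bags(data, n_bags, rng):
    """Procedure 3: each bag is drawn WITHOUT replacement; bags may overlap."""
    size = len(data) // n_bags
    return [rng.sample(data, size) for _ in range(n_bags)]

def expected_repeats(n_total, bag_size):
    """Assumed definition: bag size minus the expected number of distinct
    elements when bag_size draws are made with replacement from n_total."""
    expected_distinct = n_total * (1 - (1 - 1 / n_total) ** bag_size)
    return round(bag_size - expected_distinct)

def disjoint_bags(data, n_bags, rng):
    """Procedure 4: disjoint partitions, each padded with repeats of its own rows."""
    extra = expected_repeats(len(data), len(data) // n_bags)
    parts = disjoint_partitions(data, n_bags, rng)
    return [part + rng.choices(part, k=extra) for part in parts]

def committee_average(models, x):
    """Average the scores of the models fitted on the individual subsamples."""
    return sum(model(x) for model in models) / len(models)

rng = random.Random(0)           # fixed seed so the sketch is repeatable
data = list(range(1000))         # stand-in for the rows of a large file
bags = disjoint_bags(data, 10, rng)
print(len(bags), len(bags[0]))   # 10 bags, each 100 rows plus the added repeats
```

In practice one model would be fitted per partition or bag, so that only one subsample needs to be in memory at a time, and the fitted models would then be passed to committee_average to score new cases.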

For more information about this article, call me at 516.791.3544, or e-mail,
My publisher owns the copyright of the article that this abstract describes. The article will appear in my forthcoming book.
My publisher has granted me permission to discuss the article's content orally, but not to provide an outline, a draft, or a proof-ready copy of the article.
