CHAID, a technique whose original intent was to detect interaction between variables (i.e., "combination" variables), recursively partitions a population into separate and distinct groups, which are defined by a set of independent (predictor) variables, such that the CHAID Objective is met: the variance of the dependent (target) variable is minimized within the groups, and maximized across the groups.
CHAID stands for CHi-squared Automatic Interaction Detection:
- CHi-squared
- Automatic
- Interaction
- Detection (not detector)
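The "chi-squared" in the name refers to the test of independence that CHAID applies, for a categorical target, when choosing which predictor to split on at each step. What follows is a minimal sketch of that single step, not a full CHAID implementation: it omits category merging, the Bonferroni adjustment, and the recursion into child groups. The data are invented for illustration (1000 individuals, 10% overall response), and the names `chi2_sf`, `chi2_test`, `best_split`, and `make_rows` are hypothetical.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (closed forms for integer df)."""
    if df == 0:
        return 1.0  # a one-category predictor carries no evidence of association
    half = x / 2.0
    if df % 2 == 0:
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    extra = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra

def chi2_test(rows, predictor, target):
    """Pearson chi-squared test of independence between one predictor and the target."""
    cells = Counter((r[predictor], r[target]) for r in rows)
    row_tot = Counter(r[predictor] for r in rows)
    col_tot = Counter(r[target] for r in rows)
    n = len(rows)
    stat = 0.0
    for a in row_tot:
        for b in col_tot:
            expected = row_tot[a] * col_tot[b] / n
            stat += (cells[(a, b)] - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, chi2_sf(stat, df)

def best_split(rows, predictors, target):
    """Pick the predictor with the smallest p-value (strongest association)."""
    return min(predictors, key=lambda p: chi2_test(rows, p, target)[1])

def make_rows(marital, gender, responders, total):
    """Build `total` hypothetical individuals, the first `responders` of whom respond."""
    return [{"marital": marital, "gender": gender, "response": int(i < responders)}
            for i in range(total)]

# Invented data: 1000 individuals, 100 responders (10% overall response).
rows = (make_rows("married", "F", 45, 250) + make_rows("married", "M", 35, 250)
        + make_rows("single", "F", 10, 250) + make_rows("single", "M", 10, 250))

winner = best_split(rows, ["marital", "gender"], "response")
marital_stat, marital_p = chi2_test(rows, "marital", "response")
```

In this invented dataset, marital status separates a 16% group from a 4% group, while gender separates 11% from 9%, so marital status is chosen as the first split. The same step would then be repeated within each resulting group.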
Its advantage is that its output is highly visual and contains no equations. It commonly takes the form of an organization chart, more commonly referred to as a tree display. As an illustration, consider the Response CHAID Tree, below. The tree can "loosely" be interpreted as: The overall Response of 10% (from a population of size 1000) is explained and predicted primarily by Marital Status, and secondarily by Gender and Pet Ownership. Note: CHAID does not work well with small sample sizes, as respondent groups can quickly become too small for reliable analysis.
In addition to detecting interaction between independent variables, for explanatory studies concerned with the impact that many variables have on each other, CHAID is often used as a prediction method. (In the Response Tree above, Marital Status & Gender and Marital Status & Pet Ownership are two interaction variables, as they differentially affect response rates across the bottom respondent groups.) Using CHAID, the data analyst can uncover relationships between a dependent variable (say, response to a mail solicitation) and its predictors, and can identify important interaction variables and other predictor variables of response. Accordingly, the result is a CHAID regression tree that allows the data analyst to predict which individuals are most likely to respond in the future to a similar mail solicitation.
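Prediction from a tree of this kind amounts to a lookup: each individual falls into exactly one bottom group and is assigned that group's observed response rate. A minimal sketch follows, assuming invented group counts (they are not the figures of the Response Tree) and the hypothetical names `LEAVES` and `predict`:

```python
# Hypothetical bottom groups of a response tree:
# (marital status, second split) -> (responders, group size).
# Married individuals are split further by gender, single individuals by pet ownership.
LEAVES = {
    ("married", "female"): (45, 250),
    ("married", "male"): (35, 250),
    ("single", "pet owner"): (15, 200),
    ("single", "no pet"): (5, 300),
}

def predict(marital, second):
    """Predicted response rate: the observed rate of the group the individual falls in."""
    responders, size = LEAVES[(marital, second)]
    return responders / size

# Rank the groups from most to least likely to respond to a similar solicitation.
ranked = sorted(LEAVES, key=lambda leaf: predict(*leaf), reverse=True)
```

Because the second split differs by branch (gender for one, pet ownership for the other), the lookup key itself encodes the interaction between Marital Status and the other two variables.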
The above describes CHAID's original intent and frequent usage. Today, however, it is mostly used as an exploratory method and as an alternative to the multiple regression model, especially when the dataset is not well suited to the formality of parametric (i.e., rigid) statistical multiple regression analysis. Supplementary to what is already presented, I have worked out nine common problems, beyond CHAID's original intent, for which CHAID is quite useful. Additionally, I compare CHAID, as the statistical forerunner of the machine learning tree, to the newer genetic regression-tree. Lastly, I provide a historical reference for CHAID: a perspective on CHAID's origin.
- CHAID for Uncovering Relationships: A Data Mining Tool
- A Regression-tree Approach for Optimizing Price and Package Offerings
- Quasi-MAID: An Alternative Method for Multivariate Regression
- CHAID: A Method for Filling In Missing Values
- CHAID for Specifying a Model with Interaction Variables
- Market Segment Classification Modeling with Logistic Regression and CHAID
- CHAID for Interpreting a Logistic Regression Model
- Market Segmentation: Defining Target Markets with CHAID
- Data Smoothing: An Application of CHAID
- CHAID Regression-tree vs. Genetic Regression-tree
- Interpretation of Coefficient-free Models
Reference: A Pithy History of CHAID and its Offspring