Multicollinearity
Multicollinearity is a linear relationship between two or more independent variables in a regression analysis of a dependent variable on a set of independent variables.
Multicollinearity is a serious problem in regression modelling. Including independent variables that are linearly related to each other inflates the standard errors of the parameter estimates, which makes those estimates unreliable. With unreliable parameters, the regression model becomes unstable.
Unstable models perform badly on validation and test samples. Over time, an unstable model's performance deteriorates much faster than that of a stable model, even when it is scored on samples drawn from the same population. In such a situation an analyst must investigate multicollinearity before finalizing the model.
The next question is how to investigate it, and which variable to keep if some variables turn out to be linearly related. This is done by regressing each independent variable on all of the other independent variables. The R-square from each of these regressions gives a metric called the variance inflation factor (VIF), equal to 1/(1 - R-square); the quantity 1 - R-square is called the tolerance.
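As an illustration (the numbers here are only assumed for the example), if regressing one predictor on all the remaining predictors gives R-square = 0.75, then the tolerance is 1 - 0.75 = 0.25 and the VIF is 1/0.25 = 4, meaning the variance of that coefficient is four times what it would be with no collinearity.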
This can be calculated in MS Excel, R, MATLAB, SAS and many other statistical packages available in the market. Here I discuss how to use proc reg in SAS to measure multicollinearity among the independent variables.
proc reg data = Dev_samp_cust;
   model spend = salary age education tax_paid travelling_spend wealth family_members / vif tol collin;
run;
quit;
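If you want to see where a particular VIF comes from, you can reproduce it by hand. The sketch below (reusing the assumed dataset and variable names from the example above) regresses one predictor on the remaining ones; its VIF is then 1/(1 - R-square) from that run.

/* Manual check of one predictor's VIF: regress salary on the other predictors,
   read R-Square from the output, and compute VIF = 1/(1 - R-Square) */
proc reg data = Dev_samp_cust;
   model salary = age education tax_paid travelling_spend wealth family_members;
run;
quit;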
The VIF option adds the variance inflation factor to the parameter estimates, the TOL option adds the tolerance, and the COLLIN option requests collinearity diagnostics. The tolerance is the proportion of variance in a given predictor that is NOT explained by all of the other predictors, and the VIF is simply 1 / tolerance. The VIF represents the factor by which the variance of the estimated coefficient is multiplied due to the multicollinearity in the model. Which of the correlated variables to exclude depends on the objective of the model, domain knowledge and the VIF.
In the banking domain I used to drop a variable if its VIF was more than 1.2, but I have also seen analysts in other domains, such as retail, dropping a variable only if its VIF is more than 10. At this stage, knowledge of the data and understanding of the business are very important, as they help in deciding the VIF cut-off as well as in selecting one variable among the linearly related ones.
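As a rough sketch of the workflow (the variable dropped here, tax_paid, is only an assumption for illustration): remove the predictor with the highest VIF, re-run the diagnostics, and repeat until every VIF is below the chosen cut-off.

/* Re-run the diagnostics after dropping the highest-VIF predictor (assumed
   here to be tax_paid); repeat until all VIFs are below the chosen cut-off */
proc reg data = Dev_samp_cust;
   model spend = salary age education travelling_spend wealth family_members / vif tol collin;
run;
quit;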