
Outlier detection using Mahalanobis distance
One limitation of the Euclidean distance measure is its failure to consider the correlation between highly correlated variables. An alternative approach is to scale the contribution of individual variables to the distance value based on each variable's variability.
This approach is the Mahalanobis distance, which distinguishes itself from Euclidean distance by accounting for correlations between variables. It is defined as follows.
\[\begin{align} MD = \sqrt{(X-\mu)^T \Sigma^{-1}(X-\mu)} \end{align}\]
Assuming that \(X\) is a p-dimensional vector having multivariate normal distribution, \(X\sim N(\mu, \Sigma)\), the squared Mahalanobis distance has a chi-square distribution with p degrees of freedom. Therefore, we can tag as outlier any observation which has a Mahalanobis squared distance \({MD}^2\) lying above a predefined quantile of the \(\chi^2\) distribution with \(p\) degrees of freedom
R code
The following R code detects outliers in two stock indices using Mahalanobis distance in a simple manner. mahalanobis() function returns the squared Mahalanobis distance (\({MD}^2\)).
| graphics.off(); rm(list = ls()) library(quantmod) #------------------------------------------- # Data : Two stock indices #------------------------------------------- sdate <- as.Date("2005-01-01") edate <- as.Date("2010-12-31") getSymbols("^GSPC", from=sdate, to=edate) getSymbols("^IXIC", from=sdate, to=edate) price <- as.matrix(cbind(GSPC$GSPC.Adjusted,                           IXIC$IXIC.Adjusted)) colnames(price) <- c("SP500", "Nasdaq") nr <- nrow(price) # Calculate daily returns returns <- log(price[2:nr,]) - log(price[1:(nr-1),]) #------------------------------------------- # Outlier detection using MD #------------------------------------------ # Calculate Mahalanobis Distance mah_dist <- mahalanobis(returns,                          colMeans(returns),                          cov(returns)) # Set a threshold threshold <- qchisq(0.99, df = ncol(returns)) # Identify outliers outliers <- which(mah_dist > threshold) # Print outliers cat("Outliers:", paste(names(outliers),                         collapse = ", "), "\n") # Plot x11() plot(index(IXIC)[-1], mah_dist, type = "b",       pch = 19, cex=0.8, col = 2,       main = "Outlier detection using Mahalanobis Distance",       xlab = "Date", ylab = "Mahalanobis Distance") abline(h = threshold, col = "blue", lty = 2, lwd=3) | cs | 
As expected, a group of outliers is evident around the 2008 Global Financial Crisis.

 
No comments:
Post a Comment