R: Outlier Detection with Mahalanobis Distance in R

This post demonstrates outlier detection using Mahalanobis distance with R.



Outlier detection using Mahalanobis distance



One limitation of the Euclidean distance measure is its failure to consider the correlation between highly correlated variables. An alternative approach is to scale the contribution of individual variables to the distance value based on each variable's variability.

This approach is the Mahalanobis distance, which distinguishes itself from Euclidean distance by accounting for correlations between variables. It is defined as follows.

\[\begin{align} MD = \sqrt{(X-\mu)^T \Sigma^{-1}(X-\mu)} \end{align}\]
Assuming that \(X\) is a p-dimensional vector having multivariate normal distribution, \(X\sim N(\mu, \Sigma)\), the squared Mahalanobis distance has a chi-square distribution with p degrees of freedom. Therefore, we can tag as outlier any observation which has a Mahalanobis squared distance \({MD}^2\) lying above a predefined quantile of the \(\chi^2\) distribution with \(p\) degrees of freedom


R code


The following R code detects outliers in two stock indices using Mahalanobis distance in a simple manner. mahalanobis() function returns the squared Mahalanobis distance (\({MD}^2\)).

graphics.off(); rm(list = ls())
 
library(quantmod)
 
#-------------------------------------------
# Data : Two stock indices
#-------------------------------------------
sdate <- as.Date("2005-01-01")
edate <- as.Date("2010-12-31")
getSymbols("^GSPC"from=sdate, to=edate)
getSymbols("^IXIC"from=sdate, to=edate)
price <- as.matrix(cbind(GSPC$GSPC.Adjusted, 
                         IXIC$IXIC.Adjusted))
colnames(price) <- c("SP500""Nasdaq")
nr <- nrow(price)
 
# Calculate daily returns
returns <- log(price[2:nr,]) - log(price[1:(nr-1),])
 
 
#-------------------------------------------
# Outlier detection using MD
#------------------------------------------
 
# Calculate Mahalanobis Distance
mah_dist <- mahalanobis(returns, 
                        colMeans(returns), 
                        cov(returns))
# Set a threshold
threshold <- qchisq(0.99, df = ncol(returns))
 
# Identify outliers
outliers <- which(mah_dist > threshold)
 
# Print outliers
cat("Outliers:", paste(names(outliers), 
                       collapse = ", "), "\n")
# Plot
x11()
plot(index(IXIC)[-1], mah_dist, type = "b"
     pch = 19, cex=0.8, col = 2
     main = "Outlier detection using Mahalanobis Distance"
     xlab = "Date", ylab = "Mahalanobis Distance")
abline(h = threshold, col = "blue", lty = 2, lwd=3)
 
cs


As expected, a group of outliers is evident around the 2008 Global Financial Crisis.





No comments:

Post a Comment