Extract data from an image file using the digitize R package
In the previous post, we dealt with how to scan numeric data from a figure image by using Excel.
This work can be done more precisely with the digitize R package.
Data Plot
We use the same image file (brl_yield_spread.png) as in the previous post. It shows a time series of yield spreads from 1995/01 to 2007/03, with 147 observations.
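As a quick sanity check on the observation count, the dates can be generated in base R. This is only a sketch: it assumes the series is monthly with no gaps, which is consistent with 147 observations but is not stated explicitly, and the names `dates` and `n_obs` are made up for illustration.

```r
# Monthly dates from 1995/01 to 2007/03 (assumes a gap-free monthly series)
dates <- seq(as.Date("1995-01-01"), as.Date("2007-03-01"), by = "month")

# 12 full years (144 months) plus January to March 2007 (3 months)
n_obs <- length(dates)
n_obs
```

The same count is used later as the length of the interpolation grid (1:147).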
R code
Given the above plot image, we run the following R code together with some manual actions. While digitize() is running, we click 4 points to set the x and y axes and then click all available points along the target line graph in the plot panel of RStudio.
Note that the R code should be run in two separate parts. The first part is lines 1 to 17 (up to and including the call to digitize()) and the second part is lines 18 to 41 (the remaining work).
Do not run the whole code at once: after running the first part, we need to click the 4 axis points and all relevant data points. Once this is done, the remaining part can be run on the result of digitize().
    #========================================================#
    # Quantitative Financial Econometrics & Derivatives
    # ML/DL using R, Python, Tensorflow by Sang-Heon Lee
    #
    # https://shleeai.blogspot.com
    #--------------------------------------------------------#
    # Extract data from an image file
    #========================================================#

    graphics.off(); rm(list = ls())

    library(digitize)

    setwd("your folder")

    # extract data from the given image file (end of the first part)
    df <- digitize("brl_yield_spread.png")
    df

    # draw extracted data
    x11(); plot(df[,1], df[,2], type="l", lty=1, lwd = 5,
         main = "extracted data from image file",
         xlim = c(0,148), ylim = c(-20,30), col=4)

    # round x axis to preserve data points during interpolation
    df.x_round <- df; df.x_round$x <- round(df.x_round$x,0)
    df.x_round

    # first data is set to x = 1
    df.x_round$x <- df.x_round$x - (df.x_round$x[1]-1)
    df.x_round

    # interpolation (xout is passed as a named argument)
    df.inter <- approx(df.x_round[,1], df.x_round[,2],
              xout = 1:147, method = "linear", rule = 2)
    df.inter

    # draw interpolation of the extracted data
    x11(); plot(df.inter$x, df.inter$y, type="l", lty=1, lwd=5,
         main = "interpolated data from extracted data",
         xlim = c(0,148), ylim = c(-20,30), col=2)
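Because the clicks cannot be replayed, it may be convenient to save the digitized points right after the first part, so that the second part can be re-run later without clicking again. The following is a minimal sketch and not part of the original code. It assumes that `df` is a data frame with columns `x` and `y`, as the code above uses it. The file name `brl_yield_spread_points.csv` is hypothetical, and a small stand-in data frame replaces the real digitize() result so that the snippet runs without any clicking.

```r
# Stand-in for the result of digitize(); use the real df after clicking
df <- data.frame(x = c(1.662921, 4.751204, 8.314607),
                 y = c(-12.4901186, 20.5138340, -3.9920949))

# Hypothetical file name; a temporary folder is used for this illustration
points_file <- file.path(tempdir(), "brl_yield_spread_points.csv")

# Save the clicked points once ...
write.csv(df, points_file, row.names = FALSE)

# ... and reload them later instead of repeating the clicks
df_saved <- read.csv(points_file)
df_saved
```

With the points stored on disk, the plotting, rounding and interpolation steps can start from `df_saved` in a fresh R session.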
Clicking 4 axis points
When digitize() is called, it first prints the following instruction and asks for four calibration clicks (x1, x2, y1, y2) and the axis value of each.
    > df <- digitize("brl_yield_spread.png")
    ...careful how you calibrate.
    Click IN ORDER: x1, x2, y1, y2

        Step 1 ----> Click on x1
      |
      |
      |
      |
      |________x1__________________

        Step 2 ----> Click on x2
      |
      |
      |
      |
      |_____________________x2_____

        Step 3 ----> Click on y1
      |
      |
      |
      y1
      |____________________________

        Step 4 ----> Click on y2
      |
      y2
      |
      |
      |____________________________

    What is the return of x1 ?
    0
    What is the return of x2 ?
    148
    What is the return of y1 ?
    -20
    What is the return of y2 ?
    30
Following the above instruction, we click the minimum and maximum values of the x and y axes (x1, x2, y1, and y2) in order, as in the following figure, and then enter their axis values (0, 148, -20, and 30 here). In this case, x1 and y1 are clicked at the same point, the lower-left corner of the plot. This is not required: you can choose any reference points on the two axes whose values you know.
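Conceptually, the calibration is a linear map from the clicked screen position to an axis value. The sketch below illustrates this idea for one axis at a time. It is not the package's internal code: the helper name `calibrate_axis` is invented for illustration and the screen positions are made-up numbers. Only the axis values 0, 148, -20 and 30 come from the prompt above.

```r
# Map a clicked screen position to an axis value by linear interpolation
# between two reference clicks with known axis values.
# (Hypothetical helper: pos1/pos2 are screen positions, val1/val2 axis values.)
calibrate_axis <- function(pos, pos1, pos2, val1, val2) {
  val1 + (pos - pos1) * (val2 - val1) / (pos2 - pos1)
}

# Made-up screen positions: x1 clicked at 50 and x2 at 650,
# with the axis values 0 and 148 entered at the prompt
x_mid <- calibrate_axis(350, pos1 = 50, pos2 = 650, val1 = 0, val2 = 148)
x_mid   # a click halfway between the two reference points

# Same idea for the y axis with the values -20 and 30
y_top <- calibrate_axis(400, pos1 = 100, pos2 = 400, val1 = -20, val2 = 30)
y_top   # a click exactly on the y2 reference point
```

Because the map is linear, the two reference points on each axis do not have to be the axis limits. Any two points with known values define the same line.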
Clicking all relevant data points
Next, the following instruction appears.
    ..............NOW .............

    Click all the data. (Do not hit ESC, close the window or press any mouse key.)

    Once you are done - exit:

     - Windows: right click on the plot area and choose 'Stop'!

     - X11: hit any mouse button other than the left one.

     - quartz/OS X: hit ESC
Following this instruction, we click all the relevant data points in order, as in the next figure. To finish, press the Finish button at the top right of the plot panel. For illustration, the first 4 and the last 4 points are marked with green circles. Of course, you also need to click all the relevant points in between.
Output
In the following result, df.extract (raw digitized points) has non-integer x values, and df.x_round has integer x values that are not equidistant. To obtain equidistant x values, linear interpolation is applied on the grid x = 1, 2, ..., 147, and the new x values with their corresponding y values are stored in df.inter.
    > cbind(df.extract[1:20,], df.x_round[1:20,], df.inter$x[1:20], df.inter$y[1:20])
               x           y  x            y df.inter$x[1:20] df.inter$y[1:20]
    1   1.662921 -12.4901186  1 -12.81437126                1     -12.81437126
    2   4.751204  20.5138340  3  20.71856287                2       3.95209581
    3   8.314607  -3.9920949  7  -3.98203593                3      20.71856287
    4  11.165329  -4.3873518 10  -4.28143713                4      14.54341317
    5  14.016051  -7.3517787 13  -7.57485030                5       8.36826347
    6  16.866774  -2.8063241 16  -2.48502994                6       2.19311377
    7  20.905297   0.3557312 19   0.05988024                7      -3.98203593
    8  23.043339   0.9486166 22   1.10778443                8      -4.08183633
    9  26.369181  -5.9683794 25  -5.92814371                9      -4.18163673
    10 29.457464   2.7272727 28   2.75449102               10      -4.28143713
    11 32.308186   3.9130435 31   3.95209581               11      -5.37924152
    12 35.634029  -8.1422925 34  -8.47305389               12      -6.47704591
    13 38.484751  -3.7944664 37  -3.68263473               13      -7.57485030
    14 41.810594  11.2252964 41  11.58682635               14      -5.87824351
    15 44.661316   8.8537549 44   8.89221557               15      -4.18163673
    16 47.987159 -10.3162055 47 -10.11976048               16      -2.48502994
    17 51.075441  -6.3636364 50  -6.37724551               17      -1.63672655
    18 54.163724  -0.4347826 53  -0.38922156               18      -0.78842315
    19 57.014446  12.6086957 56  12.78443114               19       0.05988024
    20 63.428571   1.3438735 62   1.25748503               20       0.40918164
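The interpolation step can be reproduced on the first four rounded points from the output above. This is a small sketch using base R's approx(), and the names `pts` and `inter` are chosen only for illustration.

```r
# First four rounded points (x, y), taken from the df.x_round columns above
pts <- data.frame(x = c(1, 3, 7, 10),
                  y = c(-12.81437126, 20.71856287, -3.98203593, -4.28143713))

# Linear interpolation onto the equidistant grid x = 1, 2, ..., 10
inter <- approx(pts$x, pts$y, xout = 1:10, method = "linear", rule = 2)

# x = 2 lies halfway between x = 1 and x = 3, so its y is the average of both
data.frame(x = inter$x, y = inter$y)
```

The interpolated values at x = 2 and x = 4 agree with the df.inter$y column in the output (3.95209581 and 14.54341317), while the clicked points themselves (x = 1, 3, 7, 10) are left unchanged.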
Comparison of the target image file and extracted data
The two line graphs are very similar in shape and magnitude. Visually, there is virtually no difference between them.
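If the original series were available, the visual comparison could be backed by a simple numeric check. The sketch below is an assumption-laden illustration and not something done in this post: the helper `compare_series` is hypothetical, and the two vectors are made-up numbers, not the yield spread data.

```r
# Hypothetical helper: summarise the gap between two equally long series
compare_series <- function(original, extracted) {
  stopifnot(length(original) == length(extracted))
  diff <- extracted - original
  c(rmse = sqrt(mean(diff^2)), max_abs = max(abs(diff)))
}

# Made-up numbers purely for illustration (not the data from this post)
original  <- c(1.0, 2.0, 3.0, 4.0)
extracted <- c(1.1, 1.9, 3.0, 4.2)
res <- compare_series(original, extracted)
res
```

With real data, `extracted` would be df.inter$y and `original` the 147 source observations.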