R code : Extract data from an image file using digitize R package

This post shows how to extract data from an image file using the digitize R package. Clicking the relevant data points on the image returns the corresponding data values. To make the scanned data more usable, I impose equal spacing between adjacent data points by rounding the x values and then applying a linear interpolation.


Extract data from an image file by using digitize R package



In the previous post, we dealt with how to scan numeric data from a figure image by using Excel. This work can be done more precisely by using the digitize R package.



Data Plot


We use the same image file (brl_yield_spread.png) as in the previous post, which is a time series of yield spreads from 1995/01 to 2007/03 with 147 monthly observations.
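Since the x axis of the plot is simply the observation number, it can help to have the matching calendar months at hand. The following is a minimal sketch; the name obs_dates and the choice of the first day of each month are assumptions made for illustration and are not part of the original code.

```r
# Monthly dates for the 147 observations (1995/01 to 2007/03).
# `obs_dates` is a hypothetical name; using the first day of each
# month is an arbitrary convention chosen for this sketch.
obs_dates <- seq(as.Date("1995-01-01"), by = "month", length.out = 147)

# First and last months covered by the series
format(range(obs_dates), "%Y/%m")
```

With such a vector, observation number 1 corresponds to 1995/01 and observation number 147 to 2007/03.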

Excel : Scan Numeric Data From a Time Series Plot Image
brl_yield_spread.png


R code


Given the above plot image, we need to run the following R code together with some manual actions. While digitize() is running, we click 4 points to set the x and y axes and then click all available points along the target line graph in the plot panel of RStudio.

Note that the following R code should be run in two separate parts. The first part runs from the top of the code through the call to digitize(), and the second part is the remaining work after that call.

Do not run the whole code at once: after running the first part, we need to click the 4 axis points and all relevant data points. Once this is done, the remaining part can be run on the result of digitize().

#========================================================#
# Quantitative Financial Econometrics & Derivatives 
# ML/DL using R, Python, Tensorflow by Sang-Heon Lee 
#
# https://shleeai.blogspot.com
#--------------------------------------------------------#
# Extract data from an image file
#========================================================#
 
graphics.off(); rm(list = ls())

library(digitize)

setwd("your folder")

# Part 1: extract data from the given image file
# (interactive: click the 4 axis points, then all data points)
df <- digitize("brl_yield_spread.png")
df

# Part 2: run only after the clicking is finished
# draw extracted data (dev.new() opens a new plot window on any platform)
dev.new(); plot(df[,1], df[,2], type="l", lty=1, lwd = 5,
            main = "extracted data from image file",
            xlim = c(0,148), ylim = c(-20,30), col=4)

# round x to integers so that the clicked points are preserved
# as grid points during interpolation
df.x_round <- df
df.x_round$x <- round(df.x_round$x, 0)
df.x_round

# shift x so that the first data point is at x = 1
df.x_round$x <- df.x_round$x - (df.x_round$x[1] - 1)
df.x_round

# linear interpolation on the equidistant grid x = 1, ..., 147
# (rule = 2 uses the nearest clicked value outside the clicked range)
df.inter <- approx(df.x_round[,1], df.x_round[,2],
                   xout = 1:147, method = "linear", rule = 2)
df.inter

# draw interpolation of the extracted data
dev.new(); plot(df.inter$x, df.inter$y, type="l", lty=1, lwd=5,
            main = "interpolated data from extracted data",
            xlim = c(0,148), ylim = c(-20,30), col=2)

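Because the clicking step takes some effort, one option is to store the clicked points in a file so that the second part can be rerun later without clicking again. This is only a sketch and not part of the original code: df_demo stands in for the data frame returned by digitize(), its values are made up, and the file name is hypothetical.

```r
# Stand-in for the data frame returned by digitize();
# the values are made up for illustration only.
df_demo <- data.frame(x = c(1.7, 4.8, 8.3), y = c(-12.5, 20.5, -4.0))

# Save the clicked points once (hypothetical file name in a temp folder)
out_file <- file.path(tempdir(), "brl_yield_spread_points.csv")
write.csv(df_demo, out_file, row.names = FALSE)

# Reload them later instead of clicking again
df_reloaded <- read.csv(out_file)
df_reloaded
```

In the actual workflow, df would take the place of df_demo, and the reloaded data frame would feed the rounding and interpolation steps.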


Clicking 4 axis points


When digitize() is called, the console prints the following instructions. We select the 4 points that set the x and y axes in the plot panel of RStudio and then enter the axis value of each point; the data points along the target line graph are selected afterwards.

> df <- digitize("brl_yield_spread.png")
...careful how you calibrate.
Click IN ORDER: x1, x2, y1, y2
 
    Step 1 ----> Click on x1
  |
  |
  |
  |
  |________x1__________________
   
    Step 2 ----> Click on x2
  |
  |
  |
  |
  |_____________________x2_____
  
 
    Step 3 ----> Click on y1
  |
  |
  |
  y1
  |____________________________
  
 
    Step 4 ----> Click on y2
  |
  y2
  |
  |
  |____________________________
  
 
What is the return of x1 ?
0
What is the return of x2 ?
148
What is the return of y1 ?
-20
What is the return of y2 ?
30
 


Following the above instructions, we click the minimum and maximum values of the x and y axes (x1, x2, y1, and y2) in order, as in the following figure. In this case, x1 and y1 are clicked at the same point, the lower-left corner of the plot, which corresponds to x = 0 and y = -20 in the values entered above. This is not required, and you can choose other suitable reference points on the two axes.
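The reason two reference points per axis are enough is that a clicked position can be converted to data units by a linear rescaling between them. The following sketch illustrates that idea under stated assumptions: rescale_axis is a hypothetical helper, not a function of the digitize package, and the device positions are made up. Only the axis values 0, 148, -20, and 30 come from the console session above.

```r
# Map a clicked position to data units by linear rescaling between two
# reference points. `rescale_axis` is a hypothetical helper written for
# this illustration; it is not part of the digitize package.
rescale_axis <- function(pos, pos1, pos2, val1, val2) {
  val1 + (pos - pos1) * (val2 - val1) / (pos2 - pos1)
}

# Made-up device positions of the clicked x1 and x2 points,
# combined with the axis values entered in the console (0 and 148)
x_pos1 <- 0.10
x_pos2 <- 0.90
x_mid <- rescale_axis(0.50, x_pos1, x_pos2, val1 = 0, val2 = 148)
x_mid  # a click halfway between x1 and x2 maps to about 74

# Same idea for the y axis with the entered values -20 and 30
y_pos1 <- 0.20
y_pos2 <- 0.80
y_low <- rescale_axis(0.20, y_pos1, y_pos2, val1 = -20, val2 = 30)
y_low  # a click exactly on y1 maps to -20
```

This also shows why careful calibration matters: any error in the four reference clicks shifts or stretches every extracted value.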


Clicking all relevant data points


Next, the following instruction appears.

 
..............NOW .............
 
Click all the data. (Do not hit ESC, close the window or press any mouse key.)
 
Once you are done - exit:
 
 - Windows: right click on the plot area and choose 'Stop'!
 
 - X11: hit any mouse button other than the left one.
 
 - quartz/OS X: hit ESC
 


Following the above instruction, we click all relevant data points in order, as in the next figure. To finish, press the Finish button at the top right of the plot panel. For illustration, I mark the first 4 and the last 4 points with green circles; of course, you need to click all the relevant points in between as well.
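Since the points are meant to be clicked in order along the line, a quick check after clicking can catch a point that was clicked out of sequence or two points that would fall on the same integer x after rounding. This is a hedged sketch with made-up points; pts, pts_sorted, and x_round are hypothetical names and the check is not part of the original code.

```r
# Made-up clicked points; the third one was clicked out of order
pts <- data.frame(x = c(1.7, 8.3, 4.8, 11.2),
                  y = c(-12.5, -4.0, 20.5, -4.4))

# TRUE if any point lies to the left of the previous one
is.unsorted(pts$x)

# Re-order by x so that the points follow the time axis
pts_sorted <- pts[order(pts$x), ]

# Rounded x values that occur more than once (none in this example)
x_round <- round(pts_sorted$x)
x_round[duplicated(x_round)]
```

If duplicates do show up, it is worth deciding which of the clicked points to keep before interpolating.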


Output


In the following result, df.extract (the data frame returned by digitize(), named df in the code above) has non-integer x values, and df.x_round has integer x values that are not equidistant. To make the x values equidistant, a linear interpolation is applied, which generates new equidistant x values and the corresponding y values in df.inter.

> cbind(df.extract[1:20,], df.x_round[1:20,], df.inter$x[1:20], df.inter$y[1:20])
           x           y  x            y df.inter$x[1:20] df.inter$y[1:20]
 
1   1.662921 -12.4901186  1 -12.81437126                1     -12.81437126
2   4.751204  20.5138340  3  20.71856287                2       3.95209581
3   8.314607  -3.9920949  7  -3.98203593                3      20.71856287
4  11.165329  -4.3873518 10  -4.28143713                4      14.54341317
5  14.016051  -7.3517787 13  -7.57485030                5       8.36826347
6  16.866774  -2.8063241 16  -2.48502994                6       2.19311377
7  20.905297   0.3557312 19   0.05988024                7      -3.98203593
8  23.043339   0.9486166 22   1.10778443                8      -4.08183633
9  26.369181  -5.9683794 25  -5.92814371                9      -4.18163673
10 29.457464   2.7272727 28   2.75449102               10      -4.28143713
11 32.308186   3.9130435 31   3.95209581               11      -5.37924152
12 35.634029  -8.1422925 34  -8.47305389               12      -6.47704591
13 38.484751  -3.7944664 37  -3.68263473               13      -7.57485030
14 41.810594  11.2252964 41  11.58682635               14      -5.87824351
15 44.661316   8.8537549 44   8.89221557               15      -4.18163673
16 47.987159 -10.3162055 47 -10.11976048               16      -2.48502994
17 51.075441  -6.3636364 50  -6.37724551               17      -1.63672655
18 54.163724  -0.4347826 53  -0.38922156               18      -0.78842315
19 57.014446  12.6086957 56  12.78443114               19       0.05988024
20 63.428571   1.3438735 62   1.25748503               20       0.40918164
> 
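To see how the interpolated column follows from the rounded points, the first four rounded points of the table above can be interpolated on their own. This is a small sketch; the vector names x, y, and inter are chosen here for illustration, while the numbers are taken from the df.x_round columns of the output.

```r
# First four rounded points from the output table above
x <- c(1, 3, 7, 10)
y <- c(-12.81437126, 20.71856287, -3.98203593, -4.28143713)

# Linear interpolation on the equidistant grid 1, 2, ..., 10
inter <- approx(x, y, xout = 1:10, method = "linear", rule = 2)
cbind(x = inter$x, y = inter$y)
```

At the clicked positions (x = 1, 3, 7, 10) the y values are kept as they are, and the values in between lie on the straight line joining the neighboring points. For example, the value at x = 2 is the average of the values at x = 1 and x = 3, which matches the 3.95209581 shown in the table.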


Comparison of the target image file and extracted data


The two line graphs are very similar in magnitude. Visually, there is practically no difference between them.



