R : a combined usage of split, lapply and do.call

This post explains a combined usage of split(), lapply() and do.call() R functions, so called Split-Apply-Combine approach. These are frequently used for group based calculations such as weighted average or aggregation. It will be useful especially when we calculate relative weights within each group.


Before going into the details, there are many alternative methods for the problem solving which we deal with in this post. I just show you one of them.


lapply, split, and do.call


lapply R function takes a list as its input and apply built-in or use-defined function to its list members.

1
2
lapply(list, function)
 
cs


Raw data is typically the data.frame not the list. When we want to perform lapply() on data.frame, It is therefore needed to convert this data.frame to the corresponding list. For this purpose, we use split() R function, which take data.frame and a key column as input and return list object separated by key column.

1
2
split(data.frame, key column of data.frame)
 
cs


Although lapply() is very useful, it is somewhat annoying to deal with its returning list object. We convert this list object to the corresponding data.frame using do.call() R function in the following way.

1
2
do.call(rbind, list) 
 
cs


Structure of this approach


We want to aggregate weighted sensitivities (ws) within its currency (currency) from input data.frame by using split() and lapply() R functions. Finally, output data.frame is constructed by using do.call() R function. This overall process is illustrated by the following figure.

Split-Apply-Combine R functions


Case 1) Single output


The following R code perform a summationof ws by currency group.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
#=========================================================================#
# Financial Econometrics & Derivatives, ML/DL using R, Python, Tensorflow  
# by Sang-Heon Lee 
#
# https://shleeai.blogspot.com
#-------------------------------------------------------------------------#
# A versatile usage of lapply
#=========================================================================#
 
graphics.off()  # clear all graphs
rm(list = ls()) # remove all files from your workspace
 
#--------------------------------------------------------------------------
# Input data.frame
#--------------------------------------------------------------------------
    
    str.data <- 
        "currency    maturity    ws
         USD        3m            285000000
         USD        3m            456000000
         USD        1y            112000000
         USD        2y            56000000
         EUR        3m            1785000000
         EUR        6m            200000000
         EUR        1y            250000000
         EUR        1y            1855000000
         CNY        6m            84000000
         CNY        6m            42000000
         CNY        6m            144000000
         AUD        6m            213000000
         AUD        2y            106000000
         AUD        2y            214000000"
    df.data <- read.table(text=str.data, header = TRUE)
    print(df.data)
 
 
#--------------------------------------------------
# 1) Single output
#--------------------------------------------------
 
    # split input data by key columns
    
    lt.data <- split(df.data, df.data$currency)
 
    # single output
    lt.out  <- lapply(
        lt.data, 
        function(x){ data.frame(sum_ws = sum(x$ws))})
    
    # concatenate rows
    df.out <- do.call(rbind,lt.out) 
    rownames(df.out) <- NULL; print(df.out)
 
 
---------------------------------------------------
 
> print(df.out)
    sum_ws
1 5.33e+08
2 2.70e+08
3 4.09e+09
4 9.09e+08
 
cs



Case 2) Multiple output


The following R code calculate a summation and maximum of ws by currency group with each corresponding currency


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#--------------------------------------------------
# 2) Multiple output
#--------------------------------------------------
    
    # split input data by key column and do lapply
    lt.out  <- lapply( 
        
        # split by currency for outer lapply
        split(df.data, df.data$currency), 
        function(x){
            data.frame(curr   = max(x$currency),
                       sum_ws = sum(x$ws),
                       max_ws = max(x$ws))
        })
    
    # concatenate rows
    df.out <- do.call(rbind,lt.out) 
    rownames(df.out) <- NULL; print(df.out)
 
 
---------------------------------------------------
 
>  print(df.out)
  curr   sum_ws     max_ws
1  AUD 5.33e+08  214000000
2  CNY 2.70e+08  144000000
3  EUR 4.09e+09 1855000000
4  USD 9.09e+08  456000000
 
cs



Case 3) Multiple output while original row is preserved


The following R code calculates a summation of ws by currency group with each corresponding currency and also calculate weight within its currency. This case happens when we need to calculate some variables with both ungrouped and grouped (aggregated) variables such as weights within group. For this purpose, we need to preserve the row of original input data. This can be done by returning key or other column variables without group operation such as curr = x$currency, ws = x$ws except for grouped variables such as sum_ws = sum(x$ws).


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
#--------------------------------------------------
# 3) Multiple output while original row is preserved
#--------------------------------------------------
    
    # split input data by key column and do lapply
    lt.out  <- lapply( 
        
        # split by currency for lapply
        split(df.data, df.data$currency), 
        function(x){
            data.frame(curr   = x$currency,
                       ws     = x$ws,
                       sum_ws = sum(x$ws))
        })
    
    # concatenate rows
    df.out <- do.call(rbind,lt.out)
    rownames(df.out) <- NULL; print(df.out)
    
    # add another group based calculation
    df.out$group_wgt <- df.out$ws/df.out$sum_ws
    print(df.out)
 
 
---------------------------------------------------
 
>     print(df.out)
   curr         ws   sum_ws  group_wgt
1   AUD  213000000 5.33e+08 0.39962477
2   AUD  106000000 5.33e+08 0.19887430
3   AUD  214000000 5.33e+08 0.40150094
4   CNY   84000000 2.70e+08 0.31111111
5   CNY   42000000 2.70e+08 0.15555556
6   CNY  144000000 2.70e+08 0.53333333
7   EUR 1785000000 4.09e+09 0.43643032
8   EUR  200000000 4.09e+09 0.04889976
9   EUR  250000000 4.09e+09 0.06112469
10  EUR 1855000000 4.09e+09 0.45354523
11  USD  285000000 9.09e+08 0.31353135
12  USD  456000000 9.09e+08 0.50165017
13  USD  112000000 9.09e+08 0.12321232
14  USD   56000000 9.09e+08 0.06160616
 
cs



Case 4) Multiple output with multiple key columns


The following R code calculates a summation of ws by currency and maturity group with each corresponding currency. In this case, nested lapply() is used.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
#--------------------------------------------------
# 4) Multiple output with multiple key columns
#--------------------------------------------------
    
    # outer lapply
    lt.out  <- lapply( 
        
        # split by currency for outer lapply
        split(df.data, df.data$currency ), 
        function(x){
            
            # inner lapply
            y <- lapply(
                
                # split by maturity for inner lapply
                split(x, x$maturity), 
                function(x) {
                    data.frame(curr = max(x$currency),
                           mat  = max(x$maturity),
                           sum_ws = sum(x$ws),
                           max_ws = max(x$ws))})
            
            # concatenate inner rows
            do.call(rbind,y)
        })
    
    # concatenate outer rows
    df.out <- do.call(rbind,lt.out)
    rownames(df.out) <- NULL; print(df.out)
 
---------------------------------------------------
 
> print(df.out)
  curr mat     sum_ws     max_ws
1  AUD  2y  320000000  214000000
2  AUD  6m  213000000  213000000
3  CNY  6m  270000000  144000000
4  EUR  1y 2105000000 1855000000
5  EUR  3m 1785000000 1785000000
6  EUR  6m  200000000  200000000
7  USD  1y  112000000  112000000
8  USD  2y   56000000   56000000
9  USD  3m  741000000  456000000
 
cs



Case 5) Multiple output with multiple key columns while original row is preserved


The following R code is the same version of case 3) when original row is preserved.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
#--------------------------------------------------
# 5) Multiple output with multiple key columns
#    while original row is preserved
#--------------------------------------------------
    
    # outer lapply
    lt.out  <- lapply( 
        
        # split by currency for outer lapply
        split(df.data, df.data$currency ), 
        function(x){
            
            # inner lapply
            y <- lapply(
                
                # split by maturity for inner lapply
                split(x, x$maturity), 
                function(x) {
                    data.frame(curr = x$currency,
                               mat  = x$maturity,
                               ws   = x$ws,
                               sum_ws = sum(x$ws))})
            # concatenate inner rows
            do.call(rbind,y)
        })
    
    # concatenate outer rows
    df.out <- do.call(rbind,lt.out) 
    rownames(df.out) <- NULL
    
    # add another group based calculation
    df.out$group_wgt <- df.out$ws/df.out$sum_ws
    print(df.out)   
 
 
---------------------------------------------------
 
>     print(df.out)   
   curr mat         ws     sum_ws group_wgt
1   AUD  2y  106000000  320000000 0.3312500
2   AUD  2y  214000000  320000000 0.6687500
3   AUD  6m  213000000  213000000 1.0000000
4   CNY  6m   84000000  270000000 0.3111111
5   CNY  6m   42000000  270000000 0.1555556
6   CNY  6m  144000000  270000000 0.5333333
7   EUR  1y  250000000 2105000000 0.1187648
8   EUR  1y 1855000000 2105000000 0.8812352
9   EUR  3m 1785000000 1785000000 1.0000000
10  EUR  6m  200000000  200000000 1.0000000
11  USD  1y  112000000  112000000 1.0000000
12  USD  2y   56000000   56000000 1.0000000
13  USD  3m  285000000  741000000 0.3846154
14  USD  3m  456000000  741000000 0.6153846
 
cs


From this post, we can find that combination of split, lapply and do.call R runctions deliver output which we want to generate for the group operations. \(\blacksquare\)


2 comments:

  1. Why would you not just use the aggregate function?!

    ReplyDelete
    Replies
    1. Thanks for your interest and advice.

      There are many alternatives for this kind of job.

      I agree with you that aggregate() is a good choice since it is parsimonious and not lengthy. Very good.

      My focus is on combining original values with aggregated values(Case 3 & Case 5) such as relative weights with it currency group.

      In this case, I think that the above approach is not bad because it does not require an additional merge or join between original data.frame and newly generated aggregated data.frame.

      Thank you.

      Delete