R : data.frame merge while keeping orders of row and column

This post makes a useful wrapper R function merge() for left outer join, which preserves the orders of row and column of input x data. It is not a must but useful when we prefer these fixed orders in some case.


Left Outer Join which we try to do


The following figure demonstrates what we are trying to do. The first is to perform left outer join for determining parameters for the corresponding Bucket and Label1, which are, in fact, currency volatility classification and maturity respectively. The second is to apply the original orders of row and column of x data to this merged result.

R dataframe merge while keeping row and column orders


Read a data.frame from a String


It is typical to read data from Excel or csv file but for the purpose of quick and easy exercises, a string also can be read as a data.frame by using read.table() function. The following is to read x data.frame.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# LEFT data which has joining key fields
str.x <- "Qualifier Bucket Label1
          eur      1     3y
          cny      3     2w
          jpy      2     2w
          jpy      2     5y
          mxn      3     1y
          jpy      2     5y
          eur      1     5y
          eur      1     3y
          eur      1     3y"
<- read.table(text=str.x, header = TRUE)
 
---------------------------------------------
 
> x
  Qualifier Bucket Label1
1       eur      1     3y
2       cny      3     2w
3       jpy      2     2w
4       jpy      2     5y
5       mxn      3     1y
6       jpy      2     5y
7       eur      1     5y
8       eur      1     3y
9       eur      1     3y
> 
 
cs


In this way, we can also read and make y_sc, y_dc data.frames. y_sc (y_dc) is a right data with same (different) key field (column) names.

R dataframe merge while keeping row and column orders



Output of merge()


Using merge(), we can get the result of left outer join. There is no problem in this output and this result is generally recommended in a viewpoint of key columns.

In some case, it is somewhat inconvenient when we need the original orders of row and column for some reason because row and column of the result from merge() are rearranged in the following way.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# using merge()
merge(x = x, y = y_sc, 
      by.x = c("Bucket""Label1"), 
      by.y = c("Bucket""Label1"), 
      all.x = TRUE)
 
--------------------------------------
> 
  Bucket Label1 Qualifier Parameter
1      1     3y       eur        49
2      1     3y       eur        49
3      1     3y       eur        49
4      1     5y       eur        51
5      2     2w       jpy        14
6      2     5y       jpy        20
7      2     5y       jpy        20
8      3     1y       mxn        97
9      3     2w       cny        85
> 
cs


Output with orders of row and column of left input data.frame


To preserve orders of the original row and column (of x data), we have only to track information of row and column of x data. For this purpose, we can make the following user-defined function(f_loj_krc()).


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
#-----------------------------------------------
# Function : f_loj_krc
#-----------------------------------------------
# Left outer join while keeping 
# orders of input rows and columns
# Meaning of input arguments are 
# the same as those of merge() 
#-----------------------------------------------
f_loj_krc <- function(x, y, by.x, by.y) {
 
    # save row id
    x.temp <- x; 
    x.temp$temp.id <- 1:nrow(x.temp); 
    
    # each column names
    x.cn <- colnames(x); y.cn <- colnames(y)
    
    # replace column names of y with same names of x
    # to avoid duplicate fields
    for(i in 1:length(by.y)) {
        colnames(y)[which(y.cn == by.y[i])] <- by.x[i]
    }
    
    # since two fields are the same now
    by.y <- by.x 
    
    # new column names of y
    y.cn <- colnames(y)
    
    # remove only key fields which are redundant
    # and keep only new informative fields
    y.cn.not.key <- setdiff(y.cn, by.y)
    
    # left outer join
    df <- merge(x = x.temp, y = y, 
                by.x=by.x, by.y=by.y, all.x = TRUE)
    
    # recover the original orders of row and column
    df <- df[order(df$temp.id),c(x.cn, y.cn.not.key)]
    rownames(df) <- NULL
    
    return(df)
}
 
cs


f_loj_krc() function keeps information of x data, uses merge() for left outer join, and recovers orders of the x's row and column. In particular, as x (left) and y (right) data may have either same or different key field names, we can replace key field names of y with those of x in case of two set of names are different. It's because it is preferable and parsimonious to remove redundant columns in merged output.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
# same key fields names between x and y
f_loj_krc(x = x, y = y_sc, 
          by.x=c("Bucket""Label1"), 
          by.y=c("Bucket""Label1"))
 
--------------------------------------
>
  Qualifier Bucket Label1 Parameter
1       eur      1     3y        49
2       cny      3     2w        85
3       jpy      2     2w        14
4       jpy      2     5y        20
5       mxn      3     1y        97
6       jpy      2     5y        20
7       eur      1     5y        51
8       eur      1     3y        49
9       eur      1     3y        49
> 
 
 
# different key fields names between x and y
f_loj_krc(x = x, y = y_dc, 
          by.x = c("Bucket""Label1"),
          by.y = c("Bkt""Lab1"))
 
--------------------------------------
>
  Qualifier Bucket Label1 Parameter
1       eur      1     3y        49
2       cny      3     2w        85
3       jpy      2     2w        14
4       jpy      2     5y        20
5       mxn      3     1y        97
6       jpy      2     5y        20
7       eur      1     5y        51
8       eur      1     3y        49
9       eur      1     3y        49
> 
cs


In some case, it is convenient to keep orders of row and column until calculation is completed. But merge() function for joining data.frames delivers rearranged output with respect to key fields. To sidestep this rearranging effect, we make a user-defined function based on merge(), which preserves information of input x data. \(\blacksquare\)


4 comments:

  1. You might like to look at the dplyr join functions. They preserve the order of rows and columns of the left data frame "as much as possible".

    ReplyDelete
    Replies
    1. Thank you very much for your suggestion.
      I've heard that dplyr is a very useful and versatile R package.
      I'll give it a try for the above-mentioned purpose.

      Delete
  2. You could simply start with adding an ID (or order, or...) column at the beginning in your dataframe X: X <- 1:9. Then, in the end, you can just sort your dataframe as per their ID number, and remove the ID column if you so please.

    ReplyDelete
    Replies
    1. Thank you for giving me useful advices.
      I also use the similar approach as you said and add an additional column sorting.

      Delete