SHLee AI Financial Model: Basic R : Read so many CSV files

This post presents basic R code snippets for reading all files with given file extensions, such as csv or txt. The approach is simple but very useful when there are too many files to read manually.

If we have a large number of csv files or their variants (say, 1000 files), reading them one by one by hand is impractical.

For example, let's assume that there are 6 files (USD.CSV, EUR.csv, CNY.TXT, AUD.txt, CNY2.ttxt, USD.CCSV) in a target directory. The first four files (with extensions CSV, csv, TXT, txt) are the ones we want to read. The contents of these files are simple and are shown in the output later in this post.

In this case, we can use the list.files() R function to get the names of the files that have certain file extensions.

list.files() function

list.files() is a built-in R function which returns a character vector of the names of files that match a given pattern.

list.files(path, pattern="\\.(csv|txt)$", 
           ignore.case = TRUE, full.names = FALSE)
 

In the above R command, the "\\.(csv|txt)$" pattern specifies that 1) the match is anchored at the end of the file name ($), 2) multiple file extensions such as csv or txt are allowed ((csv|txt)), and 3) the extension must come directly after a literal dot (\\.), so similar extensions such as ccsv or ttxt are rejected. Both csv and CSV, and both txt and TXT, are accepted because case is ignored (ignore.case = TRUE).
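The behaviour of this pattern can be checked without touching the file system. The following is a minimal sketch that applies the same regular expression to the six example file names with grepl(); the variable names files, keep and matched are only illustrative and are not part of the original code.

```r
# the six example file names from this post
files <- c("USD.CSV", "EUR.csv", "CNY.TXT", "AUD.txt",
           "CNY2.ttxt", "USD.CCSV")

# same pattern as in list.files(), with case ignored
keep <- grepl("\\.(csv|txt)$", files, ignore.case = TRUE)

# keep only the names whose extension is exactly csv or txt
matched <- files[keep]
print(matched)
# [1] "USD.CSV" "EUR.csv" "CNY.TXT" "AUD.txt"
```

CNY2.ttxt and USD.CCSV are dropped because the character right before txt or CSV is not a dot.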

R code

The following R code is simple and self-contained: 1) it reads each csv (CSV) or txt (TXT) file into a separate data.frame and 2) it reads all of them and collects them into one data.frame.

#========================================================#
# Quantitative Financial Econometrics & Derivatives 
# ML/DL using R, Python, Tensorflow by Sang-Heon Lee 
#
# https://shleeai.blogspot.com
#--------------------------------------------------------#
# Basic R : read all csv files when these are so many
#========================================================#
 
    graphics.off()  # close all open graphics devices
    rm(list = ls()) # remove all objects from your workspace
    
    # working directory
    setwd("D:/SHLEE/blog/R/many_csv")
    
    # directory where csv files are located
    path <- getwd()
 
#-------------------------------------------------------
# make a list of all file names with csv or txt ext.
#-------------------------------------------------------
# $         : end of file name
# (csv|txt) : multiple file extensions
# \\.       : avoid unwanted cases such as .ccsv
#-------------------------------------------------------
    v.filename <- list.files(
        path, pattern="\\.(csv|txt)$", 
        ignore.case = TRUE, 
        full.names = FALSE)
 
#-------------------------------------------------------
# Test 1) read and make each data.frame
#-------------------------------------------------------
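    for(fn in v.filename) {
        
        # file names are relative, so this relies on the
        # working directory being the same as path
        df.each <- read.csv(fn)
        
        # to do with df.each 
        
        # print
        print(fn); print(df.each)
        
        # use of assign()
        #
        # save each file as its own data.frame
        # named after the file (e.g. USD.CSV);
        # reuse df.each instead of reading the file again
        assign(fn, df.each)
    }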
 
#-------------------------------------------------------
# Test 2) read and collect into one data.frame
#-------------------------------------------------------
    df.all <- do.call(
        rbind, lapply(v.filename, read.csv))
    print(df.all)
    
    # use of vroom package
    library(vroom)
    df.all.vroom <- vroom(v.filename)
    print(df.all.vroom)
 

Output

We can see that only the 4 files with the correct file extensions are read, while the 2 unwanted files (.CCSV and .ttxt) are ignored.

[Figure: Read so many CSV files, list.files, file extensions]
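Once the files are stacked into one data.frame, it is no longer visible which row came from which file. As a hedged sketch (not part of the original code), a column with the file name can be added before binding. The temporary directory, the sample values, the helper read_with_source() and the column name source are all assumptions made for illustration.

```r
# build a small demo directory (hypothetical sample data)
tmp <- file.path(tempdir(), "many_csv_source_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(currency = "USD", maturity = "3m", ws = 100),
          file.path(tmp, "USD.CSV"), row.names = FALSE)
write.csv(data.frame(currency = "AUD", maturity = "6m", ws = 200),
          file.path(tmp, "AUD.txt"), row.names = FALSE)
writeLines("ignored", file.path(tmp, "USD.CCSV"))  # must be skipped

# full.names = TRUE returns full paths, so no setwd() is needed
v.path <- list.files(tmp, pattern = "\\.(csv|txt)$",
                     ignore.case = TRUE, full.names = TRUE)

# read one file and tag every row with the name of its file
read_with_source <- function(f) {
    df <- read.csv(f)
    df$source <- basename(f)
    df
}

df.all <- do.call(rbind, lapply(v.path, read_with_source))
print(df.all)
```

Using full.names = TRUE here is a design choice: the reading step then works from any working directory.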

Additional Output - assign() function

Josep Pueyo-Ros kindly advised me to use the assign() function to save each file as a separate data.frame named after the file. I think this is very useful, so I added it to the R code. It produces the following 4 data.frames, which can be seen in the Environment pane (under Data) in RStudio.

USD.CSV
EUR.csv
CNY.TXT
AUD.txt
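An alternative to assign(), shown here only as a sketch and not taken from the original code, is to keep the data.frames in one named list. This avoids creating many separate objects in the workspace and also works for file names that are not valid R variable names. The temporary directory, the sample values and the name l.df are assumptions made for illustration.

```r
# build a small demo directory (hypothetical sample data)
tmp <- file.path(tempdir(), "many_csv_list_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(currency = "USD", maturity = "3m", ws = 100),
          file.path(tmp, "USD.CSV"), row.names = FALSE)
write.csv(data.frame(currency = "EUR", maturity = "6m", ws = 200),
          file.path(tmp, "EUR.csv"), row.names = FALSE)

v.path <- list.files(tmp, pattern = "\\.(csv|txt)$",
                     ignore.case = TRUE, full.names = TRUE)

# one list element per file, named after the file
l.df <- lapply(v.path, read.csv)
names(l.df) <- basename(v.path)

# access a single data.frame by its file name
print(l.df[["USD.CSV"]])
```

The list can be processed with lapply() or combined later with do.call(rbind, l.df).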

Additional Output - vroom package

HP (a kind blog visitor) suggested using vroom, a recently developed R package designed specifically for speed. I think it has many merits, such as fast reading and suitability for more complicated tasks. The result of the vroom() function is as follows.

> df.all.vroom <- vroom(v.filename)
Rows: 14 Columns: 3                  
-- Column specification ------------------- 
Delimiter: ","
chr (2): currency, maturity
dbl (1): ws
 
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
 
> print(df.all.vroom)
# A tibble: 14 x 3
   currency maturity         ws
   <chr>    <chr>         <dbl>
 1 AUD      6m        213000000
 2 AUD      2y        106000000
 3 AUD      2y        214000000
 4 CNY      6m         84000000
 5 CNY      6m         42000000
 6 CNY      6m        144000000
 7 EUR      3m       1785000000
 8 EUR      6m        200000000
 9 EUR      1y        250000000
10 EUR      1y       1855000000
11 USD      3m        285000000
12 USD      3m        456000000
13 USD      1y        112000000
14 USD      2y         56000000
> 

This R code is efficient and useful, especially when there are too many files to read manually. $\blacksquare$

4 comments:

Josep Pueyo-Ros, September 3, 2021 at 9:16 AM
Thanks for the code. Just a little suggestion to make more useful the first test. After create df.each, you can save as a different df with the name of the file using assign(). So, when the loop is finished, you get one df per each file.
hp, September 3, 2021 at 12:58 PM
Hi, did you try the vroom package?

... see https://youtu.be/RA9AjqZXxMU
