Pre-process large text files in R

I'm writing a script, which I'm going to make public, to open the RAIS microdata (de-identified, available here) in R using MonetDB. However, the database does not accept a comma (,) as the decimal separator. Each RAIS file (UFano.txt) is quite large (it can reach 7 GB), so the solution cannot require loading the whole file into RAM. Two alternatives:

A) import everything into the database as strings and then, within SQL, run an UPDATE that creates new columns for the numeric variables and replaces "," with "." (a rough sketch of this is shown after alternative B).

B) pre-process the file, replacing every comma in the .txt with a period.
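
For context, a rough sketch of what alternative A could look like using DBI with the MonetDB.R driver. The connection details, the table name (rais) and the column names (remuneracao, remuneracao_num) are hypothetical, and the availability of replace() inside SQL depends on the MonetDB version:

library(DBI)
library(MonetDB.R)

# illustrative connection; adjust host, port, dbname, user and password as needed
con <- dbConnect(MonetDB.R(), dbname = "rais")

# add a numeric column and fill it inside SQL, swapping ',' for '.'
# (hypothetical table and column names)
dbSendUpdate(con, "ALTER TABLE rais ADD COLUMN remuneracao_num DOUBLE")
dbSendUpdate(con, "UPDATE rais SET remuneracao_num = CAST(replace(remuneracao, ',', '.') AS DOUBLE)")

dbDisconnect(con)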

The question is about alternative B.

Is there any efficient way to do this substitution? AjDamico indicates a slow approach, replacing line by line, here.
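
To make it concrete, this is roughly what that line-by-line version looks like (my sketch, not the code from the link; the output file name is illustrative). It runs in constant memory but processes one line per iteration, which is slow:

incon  <- file("AC2012.txt", "r")
outcon <- file("AC2012_ponto.txt", "w")
while (length(linha <- readLines(incon, n = 1, encoding = "latin1")) > 0) {
  writeLines(gsub(",", ".", linha, fixed = TRUE), outcon)
}
close(incon); close(outcon)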

As an example we can start with the 2012 file for Acre (AC2012.txt), which is available at this link.

Since it is to be packaged as an R command, the solution cannot depend on the OS or require installing anything outside of R.

Author: Carlos Cinelli, 2014-10-09

3 answers

I made a partial solution (probably not the most efficient one) that reads the file in chunks and saves it to a CSV with the decimal separator already converted to a period. You only need to set a chunk size that fits in your RAM. It worked with the Acre test file, and in theory it should work with larger files.

chunks <- 40000  # rows read per iteration; adjust to fit your RAM

# first chunk: read with the header and create the output file
acre <- read.csv2("AC2012.txt", fileEncoding="latin1", nrows=chunks, stringsAsFactors=FALSE)
write.table(acre, "teste.csv", sep=",")
oldrows <- chunks

# remaining chunks: skip what has already been read and append to the output
while(nrow(acre) > 0){
  acre <- read.csv2("AC2012.txt", fileEncoding="latin1", nrows=chunks, skip=oldrows,
                    stringsAsFactors=FALSE)
  cat("Read: lines", oldrows, "to", oldrows + nrow(acre), "\n")
  oldrows <- oldrows + nrow(acre)
  write.table(acre, "teste.csv", col.names=FALSE, append=TRUE, sep=",")
}

# testing
original <- read.csv2("AC2012.txt", fileEncoding="latin1",stringsAsFactors=FALSE)
novo <- read.csv("teste.csv", header=TRUE,sep=",", stringsAsFactors=FALSE, row.names=NULL)
all.equal(novo[-1], original)
TRUE

I also tested the solution proposed in Damico's question, with chunks, and it is faster, at least in this example:

file.create("acre.txt")
outcon <- file("acre.txt", "w")
incon  <- file("AC2012.txt", "r")
# read 40,000 lines at a time, replace commas with periods and write them out
while(length(one.line <- readLines(incon, 40000, encoding="latin1")) > 0){
  one.line <- gsub(',', '.', one.line)
  writeLines(one.line, outcon)
}
close(incon); close(outcon)
Author: Carlos Cinelli, 2017-05-23 12:37:27

Very good solution, @CarlosCinelli. An alternative is to use the iterators package. The change_dot() function basically reads a line, swaps ',' for '.' and writes the line to a text file.

library(iterators)

change_dot <- function(file, saida='teste.txt', chunk=1) {
  con1 <- file(file, 'r')
  con2 <- file(saida, open='w')
  linha <- 0
  it <- ireadLines(con1, n=chunk)  # iterator that returns 'chunk' lines per call

  # write the first block; nextElem() throws an error once the file is exhausted
  out <- tryCatch(expr=write(x=gsub(pattern=',', replacement='.', x=nextElem(it)), con2),
                  error=function(e) e)

  while(!any(class(out) == "error")) {
    linha <- linha + 1
    print(paste('Written line', linha))
    out <- tryCatch(expr=write(x=gsub(pattern=',', replacement='.', x=nextElem(it)), con2, append=TRUE),
                    error=function(e) e)
  }

  close(con1); close(con2)
}

system.time(change_dot(file = 'AC2012.txt', saida = 'saida.csv'))



 user  system elapsed 
  48.65    4.70   53.04

For that file, AC2012.txt, the procedure took 48 seconds on my machine.

Note that it is possible to increase the chunk size to values greater than 1. For example, increasing it to 40,000 I got the following times with this solution and with the Damico solution:

change_ponto <- function() {
  file.create("acre.txt")
  outcon <- file("acre.txt", "w")
  incon  <- file("AC2012.txt", "r")
  while(length(one.line <- readLines(incon, 40000, encoding="latin1")) > 0){
    one.line <- gsub(',', '.', one.line)
    writeLines(one.line, outcon)
  }
  close(incon); close(outcon)
}

system.time(change_ponto())

 user  system elapsed 
   6.53    0.82    7.36

system.time(change_dot(file = 'AC2012.txt', saida = 'teste4.csv', chunk = 40000))

 user  system elapsed 
   6.71    3.12    9.92 

And now testing if the files are equal:

teste2 <- read.csv("acre.txt", header=F, sep=";", stringsAsFactors=FALSE, row.names=NULL)
teste4 <- read.csv("teste4.csv", header=F, sep=";", stringsAsFactors=FALSE, row.names=NULL)
all.equal(teste2, teste4)

[1] TRUE

I wrote a post about iterators on my blog a while ago: http://www.rmining.com.br/2015/09/07/preparacao-de-dados-parte-2/

Author: Flavio Barros, 2019-04-21 03:09:15

I thank @CarlosCinelli and @FlavioBarros for the great answers. I settled on a third alternative to my question above: using the "ff" package to import the data into R as an "ffdf" and then export it to a traditional CSV:

library(ff)

dat <- read.csv2.ffdf(file="AC2012.txt", strip.white=TRUE,
                      na.strings=c('{ñclass}','{ñ','{ñ class}','{ñc','000-1'))
write.csv.ffdf(dat, "AC2012.csv")

I measured the total elapsed time. For Acre it takes 4.65 s to read and 5.17 s to save, on an SSD drive. For the complete set of state files of the 2002 RAIS, whose .txt files occupy 11.8 GB, it took 29 min in total to import and 35 min to export.
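
For reference, these times can be measured by wrapping the same calls in system.time() (illustrative, using the file and arguments shown above):

system.time(dat <- read.csv2.ffdf(file="AC2012.txt", strip.white=TRUE,
                                  na.strings=c('{ñclass}','{ñ','{ñ class}','{ñc','000-1')))
system.time(write.csv.ffdf(dat, "AC2012.csv"))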

Using the ff package has some advantages:

  • the ff package decides how to balance disk and RAM when importing the data. The user does not have to think about it and uses a function with the usual syntax (similar to read.csv2).
  • just like read.csv2, the read.csv2.ffdf function assumes the .csv is European/Latin, using "," as the decimal separator and ";" as the field separator
  • the function also handles other quirks of the current version of the RAIS files, such as fields with blank spaces before the data (e.g. "999") and the strings used to flag missing values ("{ñclass}", "{ñ", "{ñ class}", "{ñc", "000-1"). After fixing these issues, the size of the 2002 files drops to 4.7 GB, less than half.

Perhaps one disadvantage is that the ff package does not handle extremely large datasets, while the other solutions in this post scale without major problems.

Author: LucasMation, 2014-10-13 13:10:45