Read error with the fread function of the data.table package

After downloading the 2019 database of per diems and tickets from the Portal da Transparência (http://www.portaltransparencia.gov.br/download-de-dados/viagens), I used the code below to read the file of per diems received in 2019, but I got errors with the fread function and also with the read.delim function; in the latter case, the "daily value" column is not parsed correctly. Here are the codes with their respective errors:

rm(list=ls())
library("tidyverse")
library("readr")
library("data.table")
library("stringr")
library("lubridate")
#unzip("2019_20190630_Viagens.zip")
options(datatable.fread.input.cmd.message=FALSE)
# convert the file from ISO-8859-1 to UTF-8 on the fly; fread runs this shell command
Diaria2019_Via <- "iconv -f ISO-8859-1 -t UTF-8 2019_Viagem.csv"
Diaria2019 <- data.table::fread(Diaria2019_Via, dec = ",")


Warning messages:
1: In data.table::fread(Diaria2019_Via, dec = ",") :
  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(Diaria2019_Via, dec = ",") :
  Stopped early on line 7378. Expected 16 fields but found 18. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"0000000000015769552";"Realizada";"53000";"Ministério do Desenvolvimento Regional";"53000";"Ministério do Desenvolvimento Regional - Unidades com vínculo direto";"***.043.57*-**";"ARMIN AUGUSTO BRAUN";"";"20190115";"20190116";"São Paulo/SP";"Representar a Secretaria Nacional de Proteção e Defesa Civil - SEDEC, no Seminário "Proteção e Defesa Civil Aplicada", onde Ministrará palestra sobre "Apoio Federal na Resposta a Desastres"; participará reunião com pessoal do Hospital Albert Eins>>

The message above suggests using quote="" and fill=TRUE; however, neither works. The code below misreads the Daily Value column:

Diaria_2019 <- read_delim("2019_Viagem.csv", 
                          ";", escape_double = FALSE, locale = locale(decimal_mark = ".",encoding = "ISO-8859-1"), 
                          trim_ws = TRUE)

Can anyone suggest another function that reads the 2019_Viagem.csv database correctly?

Author: fsbmat, 2019-07-15

2 answers

The file is filled in incorrectly at the source. As far as I can tell, it will be impossible to read it, as-is, within R. I discovered this by running the following command in the terminal:

cat 2019_Viagem.csv | awk -F";" '{print NF-1}' | head -7379 | tail -3

Taking it step by step, this command does the following:

  1. cat 2019_Viagem.csv: prints the contents of the file 2019_Viagem.csv to the terminal screen

  2. awk -F";" '{print NF-1}': awk splits each line of 2019_Viagem.csv on the ; separator and prints NF-1, the number of fields minus one, which is the number of ; occurrences in that line. Since ; is the column separator, a file with 16 columns should have 15 ; characters per line

  3. head -7379 and tail -3: work like their R namesakes, in which head shows the first n lines of its input and tail the last m. Combined, they isolate lines 7377 through 7379

The character | is used to chain commands, playing a role identical to the %>% operator from the magrittr and dplyr packages. So what I got was this result:
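As a quick illustration of how the chained head and tail isolate a window of lines (a toy example, unrelated to the dataset):

```shell
# seq prints 1..10; head keeps the first 7 lines of that;
# tail keeps the last 3 of those, isolating lines 5 through 7
seq 10 | head -7 | tail -3
# prints:
# 5
# 6
# 7
```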

15
17
15
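The counting step itself can be checked on a tiny synthetic file (sample.csv below is a made-up three-line sample, not the real data): a stray ; inside a quoted field inflates the separator count exactly as in the malformed row.

```shell
# Three sample lines: the middle one has an unescaped ";" inside a
# quoted field, so a naive split on ";" sees extra separators
printf '%s\n' 'a;b;c' 'a;b;"x;y";c' 'a;b;c' > sample.csv

# NF-1 is the separator count per line: 2, 4, 2
awk -F";" '{print NF-1}' sample.csv
```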

This means that there are 16 columns in rows 7377 and 7379 of file 2019_Viagem.csv, while in row 7378 there are 18 columns. Notice that this result has everything to do with the warning you received in your attempt:

2: In data.table::fread(Diaria2019_Via, dec = ",") :
  Stopped early on line 7378. Expected 16 fields but found 18.

Since R doesn't know how to deal with this mismatch in the number of columns across rows, you end up having trouble reading this data. Out of curiosity, these are the three lines whose ; occurrences I counted and displayed above:

"0000000000015769550";"Realizada";"30000";"Minist�rio da Justi�a e Seguran�a P�blica";"30108";"Departamento de Pol�cia Federal";"***.000.00*-**";"Informa��es protegidas por sigilo";"Informa��es protegidas por sigilo";"20190117";"20190117";"Informa��es protegidas por sigilo";"Informa��o protegida por sigilo nos termos da legisla��o vigente";",00";"3792,34";",00"
"0000000000015769552";"Realizada";"53000";"Minist�rio do Desenvolvimento Regional";"53000";"Minist�rio do Desenvolvimento Regional - Unidades com v�nculo direto";"***.043.57*-**";"ARMIN AUGUSTO BRAUN";"";"20190115";"20190116";"S�o Paulo/SP";"Representar a Secretaria Nacional de Prote��o e Defesa Civil - SEDEC, no Semin�rio "Prote��o e Defesa Civil Aplicada", onde Ministrar� palestra sobre "Apoio Federal na Resposta a Desastres"; participar� reuni�o com pessoal do Hospital Albert Einstein para prepara��o de Simulado de Preven��o de desastres; participar� de reuni�o com Diretor do Centro Nacional de Monitoramento e Alertas de Desastres Naturais - CEMADEN, que ser�o realizados no dia 16 de janeiro de 2019.";"475,70";"2201,04";",00"
"0000000000015769553";"Realizada";"30000";"Minist�rio da Justi�a e Seguran�a P�blica";"30108";"Departamento de Pol�cia Federal";"***.000.00*-**";"Informa��es protegidas por sigilo";"Informa��es protegidas por sigilo";"20190101";"20190103";"Informa��es protegidas por sigilo";"Informa��o protegida por sigilo nos termos da legisla��o vigente";"725,96";",00";",00"

I see two courses of action to solve your problem:

The first would be to manually correct the problematic lines. And yes, there are many lines with problems. Below is a table with the number of lines per count of ; occurrences across the entire dataset, obtained with the command cat 2019_Viagem.csv | awk -F";" '{print NF-1}' | sort | uniq -c:

266878 15
  2929 16
  1282 17
   500 18
   521 19
   253 20
    80 21
    54 22
    27 23
    31 24
     6 25
     6 26
     2 27
     7 28
     3 29
     4 33
     2 37

Note that 266878 rows have the expected 15 column separators, 2929 have 16, and so on, up to the extreme of 2 lines with 37 ;. Personally, I don't see how to create an algorithm that can handle all the exceptions that may arise in such a problematic dataset.
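That said, if discarding the malformed rows is acceptable for your analysis, one partial workaround (a sketch, under that assumption) is to keep only the lines with exactly 16 naively-split fields before handing the file to R:

```shell
# Keep only lines with exactly 15 ";" separators (16 fields);
# malformed rows are silently dropped, so compare line counts afterwards
awk -F";" 'NF == 16' 2019_Viagem.csv > 2019_Viagem_clean.csv
```

Beware that this also discards rows whose quoted text legitimately contains ;, since awk does not honor the quoting, so you lose valid data along with the broken rows.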

The other solution I see is to contact the data provider directly and point out these inconsistencies. Who knows, maybe someone in the federal government can help you?

 4
Author: Marcus Nunes, 2019-07-15 11:56:43

The new files, with the proper corrections, are now available at http://portaldatransparencia.gov.br/download-de-dados/viagens. I appreciate your help. If necessary, contact us directly through the official channels of the CGU (LAI, e-Ouv, Facebook and Telegram). More information at: http://cgu.gov.br/.

 1
Author: Guest12345, 2019-07-24 13:03:54