What is the best way to scrape the Datasus website in Python?

The link is this: http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def

I'm trying to send a POST through requests with a dictionary containing the categories I want, but then the URL remains static.

Do you think Selenium would be more suitable for this? Has anyone ever done anything like this?

Author: Victor Serra, 2017-12-15

1 answer

I do not recommend using Selenium. As Sasa Buklijas argues in "Do Not Use Selenium for Web Scraping", Selenium is not a specialized web-scraping tool (web scraping being the data-extraction technique used to collect data from websites), but rather a tool for automated testing of web applications. Tools such as Scrapy, or Beautiful Soup together with Requests, are recommended instead.

I think the Datasus website would be difficult to scrape with Selenium: the site has many checkboxes, which produces many combinations you would need to walk through to download all the content. Doing that in Selenium would be very laborious, and there are better tools for this purpose.

I've done something similar to fetch all the Enem results using a Bash script and cURL, following these steps:

  1. Use Google Chrome.
  2. Open the Datasus website.
  3. Right-click the page and select "Inspect"; the developer tools will open on the right side of the browser.
  4. Click the "Network" tab in the developer tools.
  5. On the Datasus website on the left, select the desired options and click the "Show" button that appears on the page.
  6. In the developer tools, right-click the request sent to the server, named "tabcgi.exe?sih/cnv/nrbr.def", and select Copy -> Copy as...

The exact option depends on where you will develop the script: on Linux select "Copy as cURL (bash)"; on Windows, "Copy as cURL (cmd)".

With the copied cURL command, just paste it into a Linux Bash shell and it will make the same request the browser made. You can then modify the parameters of the cURL request to retrieve other information from the site.

An example cURL request:

curl 'http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: http://tabnet.datasus.gov.br' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Referer: http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def' \
  -H 'Accept-Language: pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7' \
  -H 'Cookie: TS014879da=01e046ca4c72569773aca201f18700eeeba156dca36d80d2164402d50541a167b0fe28c1eed5e284f878cbb97def8098f34d4600bd' \
  --data 'Linha=Macrorregi%E3o_de_Sa%FAde&Coluna=--N%E3o-Ativa--&Incremento=Interna%E7%F5es&Arquivos=nrbr2003.dbf&pesqmes1=Digite+o+texto+e+ache+f%E1cil&SMunic%EDpio=1&pesqmes2=Digite+o+texto+e+ache+f%E1cil&SCapital=1&pesqmes3=Digite+o+texto+e+ache+f%E1cil&SRegi%E3o_de_Sa%FAde_%28CIR%29=1&pesqmes4=Digite+o+texto+e+ache+f%E1cil&SMacrorregi%E3o_de_Sa%FAde=TODAS_AS_CATEGORIAS__&pesqmes5=Digite+o+texto+e+ache+f%E1cil&SMicrorregi%E3o_IBGE=TODAS_AS_CATEGORIAS__&pesqmes6=Digite+o+texto+e+ache+f%E1cil&SRegi%E3o_Metropolitana_-_RIDE=TODAS_AS_CATEGORIAS__&pesqmes7=Digite+o+texto+e+ache+f%E1cil&STerrit%F3rio_da_Cidadania=TODAS_AS_CATEGORIAS__&pesqmes8=Digite+o+texto+e+ache+f%E1cil&SMesorregi%E3o_PNDR=TODAS_AS_CATEGORIAS__&SAmaz%F4nia_Legal=TODAS_AS_CATEGORIAS__&SSemi%E1rido=TODAS_AS_CATEGORIAS__&SFaixa_de_Fronteira=TODAS_AS_CATEGORIAS__&SZona_de_Fronteira=TODAS_AS_CATEGORIAS__&SMunic%EDpio_de_extrema_pobreza=TODAS_AS_CATEGORIAS__&SCar%E1ter_atendimento=TODAS_AS_CATEGORIAS__&SRegime=TODAS_AS_CATEGORIAS__&pesqmes16=Digite+o+texto+e+ache+f%E1cil&SCap%EDtulo_CID-10=TODAS_AS_CATEGORIAS__&pesqmes17=Digite+o+texto+e+ache+f%E1cil&SLista_Morb__CID-10=TODAS_AS_CATEGORIAS__&pesqmes18=Digite+o+texto+e+ache+f%E1cil&SFaixa_Et%E1ria_1=3&pesqmes19=Digite+o+texto+e+ache+f%E1cil&SFaixa_Et%E1ria_2=TODAS_AS_CATEGORIAS__&SSexo=TODAS_AS_CATEGORIAS__&SCor%2Fra%E7a=TODAS_AS_CATEGORIAS__&zeradas=exibirlz&formato=prn&mostre=Mostra' \
  --compressed \
  --insecure 
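Since the original question asks about Python, the same POST can be reproduced there. A sketch, with the parameter list abbreviated (the names come from the `--data` payload above); one detail worth noting is that the Tabnet form is URL-encoded in Latin-1, not UTF-8, which may be why a plain Requests dictionary seemed to have no effect:

```python
# Sketch: reproduce the browser's POST in Python. The parameter names
# come from the cURL --data payload above; the list here is abbreviated.
from urllib.parse import urlencode

URL = "http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sih/cnv/nrbr.def"

payload = {
    "Linha": "Macrorregião_de_Saúde",
    "Coluna": "--Não-Ativa--",
    "Incremento": "Internações",
    "Arquivos": "nrbr2003.dbf",  # which .dbf file (period) to query
    "zeradas": "exibirlz",
    "formato": "prn",
    "mostre": "Mostra",
}

# Tabnet expects Latin-1 (ISO-8859-1) URL-encoded form data -- that is
# why the cURL payload shows %E3 for "ã" and %FA for "ú". Requests
# encodes a plain dict as UTF-8, so encode the body explicitly instead.
body = urlencode(payload, encoding="latin-1")

# The pre-encoded string can then be sent with, e.g.:
#   requests.post(URL, data=body,
#       headers={"Content-Type": "application/x-www-form-urlencoded"})
print(body.split("&")[0])  # first encoded key=value pair
```

The `requests.post` call is left as a comment so the sketch runs without network access; sending the pre-encoded string as `data` bypasses Requests' own UTF-8 form encoding.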

In the `--data` payload you then change the values (for example, the file selected by the `Arquivos` field) through all the possibilities, making a new cURL request each time, until you have downloaded all the information from the site.
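The iteration over periods can be sketched by varying only the `Arquivos` parameter; the year range below is an assumption for illustration, so check the form on the site for the files that actually exist:

```python
# Sketch: generate one request body per period by varying the Arquivos
# parameter. The 2003-2017 range is an assumption -- inspect the site's
# form for the real list of available .dbf files.
from urllib.parse import urlencode

base = {
    "Linha": "Macrorregião_de_Saúde",
    "Coluna": "--Não-Ativa--",
    "Incremento": "Internações",
    "formato": "prn",
    "mostre": "Mostra",
}

bodies = [
    urlencode(dict(base, Arquivos=f"nrbr{year}.dbf"), encoding="latin-1")
    for year in range(2003, 2018)
]
print(len(bodies))  # one encoded body per assumed year
```

Each body would then be POSTed as in the previous sketch, ideally with a small delay between requests to avoid hammering the server.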

The cURL requests return HTML; if you prefer to strip all the HTML tags, you can pipe the output through lynx.

$ curl <command> | lynx -dump -stdin > resultado1.txt
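If you prefer to stay in Python, the tag stripping can also be done with the standard library's html.parser; a sketch on a small inline example:

```python
# Sketch: strip HTML tags from a response in Python instead of piping
# through lynx, using only the standard library.
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

# A tiny stand-in for the HTML that the site returns:
html = "<html><body><pre>Norte;123</pre></body></html>"
stripper = TagStripper()
stripper.feed(html)
print(stripper.text())  # -> Norte;123
```

For heavier parsing (tables, links), Beautiful Soup's `get_text()` does the same job with less boilerplate.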
Author: Rodrigo Eggea, 2020-05-19 00:35:39