Scraping data using Robobrowser

Question

I'm trying to fill in a form, attach a file, and submit it using RoboBrowser.

To open the page I do:

browser.open('url')

To pick up the form I do:

form = browser.get_form(id='id_form')

To enter the data in the form I do:

form['data_dia'] = '25'  # for example

To submit the form I do:

browser.submit_form(form, form['btnEnviar'])

Or just

browser.submit_form(form)

But this is not working: the form is not being submitted. When I tried to list all the inputs on the page, I found that the submit button is not in the HTML that RoboBrowser receives.

When I run:

todos_inputs = browser.find_all('input')

for t in todos_inputs:
    print(t)

I don't get the input tag with id 'btnEnviar', which is inside the form in the page's HTML. The other form inputs, such as 'day', 'month' and 'year', do show up.

I have not posted the HTML because the page requires a login and password.

The problem is that RoboBrowser is only retrieving part of the page's HTML, not all of it, so I cannot submit the form. Is there a solution to this? Or is there another way to fill out and submit a form with tools other than RoboBrowser and BeautifulSoup?

1

python web-scraping beautifulsoup

Author: Rafael, 2018-12-12

Source

1 answer

score 3 · Accepted Answer

RoboBrowser is a module that combines requests, to download pages, with BeautifulSoup, to parse them.

Your problem is that the button you want to click probably doesn't even exist in the page you downloaded! It is quite likely that this site, like many others on the internet, serves its pages incomplete, without all their elements, and only afterwards adds those elements with JavaScript code that runs in your browser once the page has loaded.

So when you inspect the page in your browser, the JavaScript has already run and created those elements dynamically, which is why you find the button there. BeautifulSoup does not run JavaScript, so the button does not exist in the page that your script parsed in memory.
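To make the difference concrete, here is a minimal sketch that uses only Python's standard-library html.parser instead of BeautifulSoup, so it runs without installing anything. The HTML below is invented for illustration (the real page is behind a login); the assumption is that the site builds the 'btnEnviar' button from a script in roughly this way.

```python
from html.parser import HTMLParser

# Hypothetical HTML as the server might send it: the submit button is only
# created later by a script, so it is not part of the static markup.
RAW_HTML = """
<form id="id_form">
  <input name="data_dia">
  <input name="data_mes">
  <script>
    var b = document.createElement("input");
    b.type = "submit";
    b.id = "btnEnviar";
    document.getElementById("id_form").appendChild(b);
  </script>
</form>
"""


class InputCollector(HTMLParser):
    """Collect the id (or, failing that, the name) of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            self.inputs.append(attributes.get("id") or attributes.get("name"))


collector = InputCollector()
collector.feed(RAW_HTML)  # parses the markup; the script is never executed
found_inputs = collector.inputs
print(found_inputs)  # ['data_dia', 'data_mes']
```

The button is missing from the list because the parser only sees the script as text. Any parser that does not execute JavaScript behaves the same way here.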

This is very common on web pages nowadays, which are quite dynamic. That leaves you with two options:

Read the JavaScript code and find out where it creates the button, or else analyze what the button does. You can follow the JavaScript manually until you find a way to imitate what clicking that button does, which parameters it sends, and so on, and then write Python code that simulates those actions. It is not an easy task, but the result would be quite efficient, because it is plain Python that does not have to open a real browser, which is the second option:
Use a real browser that runs JavaScript. The Selenium library lets your script open and control a real browser window. Because the page is opened in a browser, the JavaScript runs and you can click the button. The downside is that launching a browser is heavy and slow, and it loads many elements and images that the process does not need, so it is not as efficient as accessing the source directly.
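As a rough sketch of the first option, the request that the button triggers can be rebuilt by hand once you know where it goes and what it sends. Everything specific here is an assumption: the URL, the value sent for 'btnEnviar', and the idea that the form is a plain URL-encoded POST all have to be confirmed in the site's JavaScript or in the browser's network tab. The example uses only the standard library and builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the real one has to be read from the site's
# JavaScript or from the browser's network tab.
FORM_ACTION_URL = "https://example.com/form/submit"


def build_submit_request(fields):
    """Build (but do not send) the POST request the button would trigger."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        FORM_ACTION_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Field names taken from the question; the value for btnEnviar is a guess.
request = build_submit_request({"data_dia": "25", "btnEnviar": "Enviar"})
print(request.get_method(), request.full_url)
print(request.data)
```

Sending it would be a call to urllib.request.urlopen(request), or the equivalent post through a logged-in requests session so the login cookies are included. A form that uploads a file is normally encoded as multipart/form-data rather than URL-encoded, so the body would have to be built differently in that case.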