Web scraping a Microsoft Forms form returns None [python]

Hello, I'm having a hard time scraping a form made with Microsoft Forms. (Note: the form was created by me.)

I have the following code:

from bs4 import BeautifulSoup
import requests

linkForms01 = 'https://forms.office.com/Pages/AnalysisPage.aspx?id=vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u&AnalyzerToken=qTmVTXSAWoyMXQcd56doC9W6W20G51UR'

# Fetch the analysis page and decode it with the detected encoding
page03 = requests.get(linkForms01)
page03.encoding = page03.apparent_encoding

soup03 = BeautifulSoup(page03.text, 'html.parser')
texto03 = soup03.get_text('\n')
# This prints None: the element is not present in the static HTML
xxxx = soup03.find(class_="analyze-view-detail-text-lines")
print(xxxx)

In general, I can extract a lot of information from this form, but I can't get the questionnaire answers. I thought about pulling the information from the getaggregatesurveydata request, which can be seen under Inspect → Network → XHR, but I'm not sure whether that's possible.

If anyone can help, I'll be grateful :)

Author: Jonathan Cardoso, 2020-01-20

2 answers

Microsoft Forms has an undocumented REST service (API). The page gets its information from this JSON endpoint.

First, let's figure out where the GET request that retrieves the information goes:

https://forms.office.com/formapi/api/f149d0bc-0eb5-4f9a-9e82-24a76eacf8de/users/709c42d2-9f33-4bdd-ab6a-94c9c2dd4e5e/light/analysisForms('vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u')?$expand=questions($expand=choices)

OK, now we need to see what information is sent along with this request. If you look at the link you posted, it includes a token for form access:

AnalyzerToken=qTmVTXSAWoyMXQcd56doC9W6W20G51UR

This token is the credential the site presents to the server to be allowed to extract the information.

With this you can now build your bot. Remember that it's good practice to always send a 'User-Agent' header, as it is common for sites to block scrapers and crawlers.

import requests

# Undocumented Forms API endpoint found in the Network tab
url = "https://forms.office.com/formapi/api/f149d0bc-0eb5-4f9a-9e82-24a76eacf8de/users/709c42d2-9f33-4bdd-ab6a-94c9c2dd4e5e/light/analysisForms('vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u')?$expand=questions($expand=choices)"
# The User-Agent helps avoid basic bot blocking;
# the AnalyzerToken grants access to the form data
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
           'AnalyzerToken': 'qTmVTXSAWoyMXQcd56doC9W6W20G51UR'}
print(requests.get(url, headers=headers).text)

I recommend studying the HTTP protocol to better understand web crawling/scraping. Most sites nowadays use APIs, and in a good share of cases it pays off more to work out how the site's API works than to brute-force the HTML, which is much less performant and much more tedious.

Author: SakuraFreak, 2020-02-05 14:30:25

You can't retrieve the information the way you're doing it.

With requests.get you retrieve only the HTML the server returns for that URL - and nothing else. No external resources of the page are fetched, whether images or data that the page obtains when it executes JavaScript code in the browser.

It's easy to check that the information is not present in the page: open the URL above and use the browser's "view page source" option - you'll see that the <body> part of the page is minimal, despite dozens of kilobytes of JavaScript in <head>. The form data could still be embedded in that JavaScript, without relying on further requests to the server - but searching, for example, for the name "Ademir", which appears in the populated form, finds no occurrence in the raw text of the page.
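This check can also be done programmatically. A minimal sketch (the has_answer helper and the "Ademir" probe are illustrative; any value you know appears in the rendered form would do):

```python
import requests

# URL from the question; the AnalyzerToken is part of the query string
FORM_URL = ('https://forms.office.com/Pages/AnalysisPage.aspx?'
            'id=vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u'
            '&AnalyzerToken=qTmVTXSAWoyMXQcd56doC9W6W20G51UR')

def has_answer(html, needle):
    """Return True if a known response value appears in the raw markup."""
    return needle.lower() in html.lower()

def raw_page_contains(url, needle):
    """Fetch the URL with requests (no JavaScript executed) and probe it."""
    return has_answer(requests.get(url).text, needle)
```

Calling raw_page_contains(FORM_URL, 'Ademir') should return False, confirming that the answers are only loaded later, by JavaScript, and never reach the HTML that requests sees.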

From here you have two options. One is to reverse engineer the client-side code and see what requests the JavaScript makes to the server - you can then replicate those requests using Python's requests. This can vary in difficulty from medium to practically impossible (if the author of the page, or the frameworks it runs, are determined to hide this data - not the case here; it should be closer to "medium difficulty").

The other way is to use Selenium instead of requests. Selenium is a tool that drives a "real" browser from a Python library: when the page is opened via Selenium, the JavaScript in it is executed, the data is fetched from the server, and the DOM of the associated browser (which may or may not be visible on screen, depending on how you configure Selenium) is populated - only then do you access the page's DOM, after it has been filled with the server data.
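A minimal sketch of the Selenium route (assuming the selenium package and a Chrome driver are installed; the function name and the fixed wait are illustrative):

```python
import time

def fetch_rendered_html(url, wait_seconds=5):
    """Open the page in headless Chrome, let its JavaScript run,
    and return the DOM after it has been populated."""
    # Imported here so the sketch can be read without selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless=new')  # browser need not be visible
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait; WebDriverWait is more robust
        return driver.page_source
    finally:
        driver.quit()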

Author: jsbueno, 2020-01-20 20:17:53