In requests, how to correctly read ISO-8859-1 encoding?
In Python 3, with beautifulsoup4 and requests, I want to extract some information from a site that reports 'ISO-8859-1' encoding. I tried this strategy to display the text correctly:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
req.encoding
encoding = req.encoding
text = req.content
decoded_text = text.decode(encoding)
sopa = BeautifulSoup(decoded_text, "lxml")
sopa.find("h1")
And the result that appears is:
<h1>
CÃMARA MUNICIPAL DE SÃO PAULO<br/></h1>
When I copy and paste the output here it appears correct, but on my computer all the accented characters are wrong.
I'm on a machine running Ubuntu.
Does anyone know the correct way to read this encoding?
Edited June 2, 2019
I had help from @snakecharmerb here.
In his answer he explained that when the HTTP headers contain no explicit charset and the Content-Type header contains text, RFC 2616 specifies that the default character set must be ISO-8859-1, which is the case with this site.
But the text is clearly UTF-8, so I set the encoding manually and it works. My code ended up like this:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
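req.encoding                   # -> 'ISO-8859-1' (the default requests falls back to)
req.headers['content-type']    # -> 'text/html' (no charset declared)
req.encoding = 'UTF-8'         # override the encoding before reading req.text
sopa = BeautifulSoup(req.text, 'lxml')
sopa.find('h1').text           # -> '\r\n CÂMARA MUNICIPAL DE SÃO PAULO'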
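The mismatch described above can be reproduced without any network access. This is only a minimal sketch, assuming the server sends the heading as UTF-8 bytes; the names `raw`, `wrong` and `right` are illustrative and not part of the original code.

```python
# The heading text, encoded the way the server is assumed to send it (UTF-8).
raw = "CÂMARA MUNICIPAL DE SÃO PAULO".encode("utf-8")

# Decoding with the ISO-8859-1 default turns every multi-byte UTF-8
# sequence into several unrelated characters (mojibake).
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding the bytes were really written in gives the text back.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

ISO-8859-1 maps every byte to exactly one character, so the wrong decode never raises an error; it silently produces garbled text, which is why the problem only shows up on screen.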
2 answers
In fact, requests already does the decoding for you, using the encoding it determines for the response. Instead of accessing the .content attribute, just access the .text attribute of the response object:
In [382]: import requests
In [383]: data = requests.get("https://slashdot.org")
In [384]: type(data.content)
Out[384]: bytes
In [385]: type(data.text)
Out[385]: str
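When the server declares no charset, as in the question, one possible approach is to decode the bytes yourself with a fallback. The sketch below is an assumption-laden illustration, not how requests works internally: `charset_from_content_type` and `decode_body` are hypothetical helpers, and the "try UTF-8, then fall back to ISO-8859-1" policy is just one reasonable choice.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw, content_type):
    """Decode response bytes, trusting a declared charset and guessing otherwise."""
    charset = charset_from_content_type(content_type)
    if charset is not None:
        try:
            return raw.decode(charset, errors="replace")
        except LookupError:
            pass  # unknown charset label: fall through to the guess below
    try:
        # Text that is not really UTF-8 rarely decodes as UTF-8 without errors.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 accepts any byte sequence, so this never fails.
        return raw.decode("iso-8859-1")
```

With a requests response this could be called as `decode_body(req.content, req.headers.get("content-type", ""))`.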
But understanding encodings, rather than being caught off guard by what happens, is vital in this industry. I can't recommend enough the article on the subject originally written in 2003 by the creator of Stack Overflow.
If you mean the display in a terminal or in the Windows Command Prompt, then to fix it you must declare the charset in the Python script, like this:
# -*- coding: latin-1 -*-
import requests
from bs4 import BeautifulSoup
...
That applies if the page is in ISO-8859-1 or Windows-1252; if it is in UTF-8, use this:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
...
It seems to me that the page you linked is in UTF-16, so I think it would look like this:
# -*- coding: utf-16 -*-
import requests
from bs4 import BeautifulSoup
...
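For context, a coding declaration like the ones above tells Python how to decode the bytes of the source file itself (PEP 263). The sketch below illustrates this under stated assumptions: `run_with_declaration` is a hypothetical helper that compiles the same source bytes under different declarations, and the literal `São` is just sample data.

```python
# The same bytes on disk: the literal 'São' saved as UTF-8.
LITERAL_BYTES = "São".encode("utf-8")


def run_with_declaration(codec):
    """Compile a tiny source file with the given coding declaration and return s."""
    source = (
        b"# -*- coding: " + codec.encode("ascii") + b" -*-\n"
        b"s = '" + LITERAL_BYTES + b"'\n"
    )
    namespace = {}
    # compile() honors the coding declaration when given bytes.
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace["s"]


print(run_with_declaration("utf-8"))
print(repr(run_with_declaration("latin-1")))
```

The declaration only changes how string literals written in the file are read; the encoding used for a downloaded page is chosen separately, when the response bytes are decoded.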