How do I correctly read ISO-8859-1 encoded content with requests?

Question

In Python 3, with beautifulsoup4 and requests, I want to extract some information from a site that has 'ISO-8859-1' encoding. I tried this strategy to display the text correctly:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
req.encoding

encoding = req.encoding
text = req.content

decoded_text = text.decode(encoding)

sopa = BeautifulSoup(decoded_text, "lxml")

sopa.find("h1")

And the result that appears is:

<h1>
                        CÃMARA MUNICIPAL DE SÃO PAULO<br/></h1>

When I copy and paste it here it appears correct, but on my computer all the accented characters are wrong.

I'm on an Ubuntu machine.

Does anyone know the correct way to read this encoding?

Edited June 2, 2019

I got help from @snakecharmerb.

In the answer, he explained that when the HTTP headers carry no explicit charset and the Content-Type header contains text, RFC 2616 specifies that the default character set must be ISO-8859-1, which is the case with this site.
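As a minimal sketch of why the output looked wrong (the heading string is copied from the page; nothing here touches the network), UTF-8 bytes decoded with the ISO-8859-1 default turn each accented letter into two characters:

```python
# The heading as it should read, and the bytes a UTF-8 page would contain.
original = "CÂMARA MUNICIPAL DE SÃO PAULO"
raw_bytes = original.encode("utf-8")

# Decoding as ISO-8859-1 never raises, because every byte maps to some
# character, but each two-byte UTF-8 sequence is split into two characters.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the encoding the page actually uses restores the text.
right = raw_bytes.decode("utf-8")

print(repr(wrong))  # repr() makes the otherwise invisible control characters visible
print(right)
```

The second character of each pair is a C1 control character, which most terminals do not draw, so only the stray "Ã" is visible.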

But the text is clearly UTF-8, so I set the encoding manually and it works. My code ended up like this:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
req.encoding

'ISO-8859-1'

req.headers['content-type']
'text/html'

req.encoding = 'UTF-8'

sopa = BeautifulSoup(req.text,'lxml')

sopa.find('h1').text
'\r\n                        CÂMARA MUNICIPAL DE SÃO PAULO'
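To avoid hard-coding 'UTF-8' for every page, one possible approach is to trust an explicit charset when the server sends one and otherwise test whether the body is valid UTF-8. This is only a sketch under that assumption; pick_encoding is a hypothetical helper, not part of requests:

```python
def pick_encoding(content_type, body):
    """Return an encoding name for `body` (bytes), given a Content-Type header value."""
    # Trust an explicit charset parameter if the server sent one.
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("'\"")
    # No charset declared: try strict UTF-8 first, because ISO-8859-1 text
    # with accented letters rarely happens to be valid UTF-8.
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"


print(pick_encoding("text/html", "SÃO PAULO".encode("utf-8")))
print(pick_encoding("text/html", "SÃO PAULO".encode("iso-8859-1")))
```

With requests, this could be applied as req.encoding = pick_encoding(req.headers.get('content-type', ''), req.content) before reading req.text. requests also exposes req.apparent_encoding, a guess made from the bytes themselves, which may be worth comparing.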

python character-encoding python-requests beautifulsoup

Author: Reinaldo Chaves, 2019-05-31

2 answers

score 5 · Answer 1

In fact, requests already does the decoding for you, using the encoding it determines from the response headers.

Just access the .text attribute of the response object instead of the .content attribute:

In [382]: import requests

In [383]: data = requests.get("https://slashdot.org")

In [384]: type(data.content)
Out[384]: bytes

In [385]: type(data.text)
Out[385]: str
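Roughly speaking, .text is .content decoded with whatever .encoding holds at the moment it is read. The class below is a hypothetical, simplified model of that relationship (SketchResponse is not the requests class and skips its encoding detection), just to show why assigning to .encoding changes what .text returns:

```python
class SketchResponse:
    """Hypothetical, simplified stand-in for a response object."""

    def __init__(self, content, encoding):
        self.content = content    # raw bytes, exactly as received
        self.encoding = encoding  # encoding name used to build .text

    @property
    def text(self):
        # Decode on every access so a changed .encoding takes effect.
        return self.content.decode(self.encoding, errors="replace")


resp = SketchResponse("SÃO PAULO".encode("utf-8"), "iso-8859-1")
garbled = resp.text      # decoded with the header-derived default
resp.encoding = "utf-8"  # manual override, as in the question's edit
fixed = resp.text
print(repr(garbled), repr(fixed))
```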

But understanding encodings, rather than guessing at what is happening, is kind of vital in this industry. I can't recommend the following article enough, originally written in 2003 by the creator of Stack Overflow:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

score 0 · Answer 2

If you mean the display in a terminal or the Windows Command Prompt, then to fix it you must set the charset in the Python script, like this:

# -*- coding: latin-1 -*-

import requests
from bs4 import BeautifulSoup

...

That is for a page in ISO-8859-1 or Windows-1252. If it is in UTF-8, use this instead:

# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

...

It seems to me that the page you linked is in UTF-16, so I think it would look like this:

# -*- coding: utf-16 -*-

import requests
from bs4 import BeautifulSoup

...
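One caveat worth checking on the terminal side: the coding comment declares how the source file itself is encoded, while what a terminal shows also depends on the encoding of the output stream (sys.stdout.encoding). The sketch below uses a hypothetical helper, bytes_for_stream, and an in-memory stream to show which bytes the same text becomes under two stream encodings:

```python
import io
import sys


def bytes_for_stream(text, stream_encoding):
    """Return the bytes written when `text` goes to a stream with this encoding."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=stream_encoding, errors="replace")
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


# The same word produces different bytes depending on the stream encoding,
# and the terminal must interpret them with the matching encoding.
print(bytes_for_stream("SÃO", "utf-8"))
print(bytes_for_stream("SÃO", "iso-8859-1"))

# Encoding Python reports for standard output in the current environment.
print(getattr(sys.stdout, "encoding", None))
```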