How do I determine the encoding of a string and convert it to utf8?
I connect to the server through IMAP and get a list of messages. The problem is the encoding of the message body: in Google it is one thing, in Yandex another. I want the system to automatically detect the encoding and convert it to UTF-8.
import cchardet
import imaplib
import mailparser

def convert_encoding(data, new_coding = 'UTF-8'):
    encoding = cchardet.detect(data)['encoding']
    if new_coding.upper() != encoding.upper():
        data = data.decode(encoding, data).encode(new_coding)
    return data

def get_mails(login, password):
    # server is defined elsewhere in the script
    print("Connecting to {}...".format(server))
    imap = imaplib.IMAP4_SSL(server)
    print("Connected! Logging in as {}...".format(login))
    imap.login(login, password)
    print("Logged in! Listing messages...")
    status, select_data = imap.select('INBOX')
    nmessages = select_data[0].decode('utf-8')
    status, search_data = imap.search(None, 'ALL')
    for msg_id in search_data[0].split():
        status, msg_data = imap.fetch(msg_id, '(RFC822)')
        msg_raw = msg_data[0][1].decode("utf8")
        mail = mailparser.parse_from_string(msg_raw)
        telo = convert_encoding(mail.body.encode()) # This is where the trouble is
For example, mail.body contains the following text:
'<div>\\u041f\\u0440\\u043e\\u0432\\u0435\\u0440\\u043a\\u0430 \\u0441</div>'
The call returns the error:
TypeError: decode() argument 2 must be str, not bytes
2 answers
Try using chardet to determine the encoding. The library returns its best guess for the encoding of the bytes together with a confidence value for that guess.
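import chardet

# A bytes literal cannot contain non-ASCII characters, so encode a str instead.
rawdata = "Тело сообщения".encode('utf-8')  # the text means "Message body"
meta = chardet.detect(rawdata)
try:
    text = rawdata.decode(meta['encoding'])
except (TypeError, LookupError, UnicodeDecodeError):
    # detect() may return None, or a codec name Python cannot use
    print('Encoding is unknown')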
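The detection step above can be folded into a small conversion helper. This is only a sketch: the name to_utf8 and its fallback parameter are hypothetical, and the optional import is an assumption made so the example still runs when chardet is not installed.

```python
def to_utf8(data, fallback="utf-8"):
    """Return `data` (bytes) re-encoded as UTF-8, guessing the source encoding."""
    try:
        import chardet  # third-party; optional in this sketch
        guess = chardet.detect(data)["encoding"]
    except ImportError:
        guess = None

    # detect() may return None when it cannot decide.
    encoding = guess or fallback
    try:
        text = data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode leniently instead of failing.
        text = data.decode(fallback, errors="replace")
    return text.encode("utf-8")


converted = to_utf8(b"hello")
```

Detection on short inputs is a guess, so the lenient fallback keeps the function from raising, at the cost of possibly replacing some characters with U+FFFD.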
To decode character references in the HTML, use html.unescape:
import html
print(html.unescape('&pound;682m'))
Author: Tihon, 2019-07-19 11:08:41
data = data.decode(encoding, str(data)).encode(new_coding)
Try converting to a string with str().
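For context on the TypeError, the second positional argument of bytes.decode() is the name of an error handler, such as "strict" or "replace", so it has to be a str. A minimal sketch of that call shape, assuming the intent was to control how undecodable bytes are handled (the sample values are illustrative):

```python
raw = "Проверка".encode("cp1251")

# Passing bytes as the second argument reproduces the TypeError from the question.
try:
    raw.decode("cp1251", raw)
    error = None
except TypeError as exc:
    error = exc

# The second argument names an error handler; "strict" is the default.
text = raw.decode("cp1251", "strict")
utf8_bytes = text.encode("utf-8")

# "replace" substitutes U+FFFD for bytes that cannot be decoded.
lossy = b"abc\xff".decode("utf-8", "replace")
```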
Author: Dan Konshin, 2019-07-19 10:54:56