How do I determine the encoding of a string and convert it to utf8?

Question

I connect to the server over IMAP and fetch a list of messages. The problem is the encoding of the message body: mail from Google uses one encoding, mail from Yandex another. I want the system to detect the encoding automatically and convert the text to UTF-8.

import imaplib

import cchardet
import mailparser

def convert_encoding(data, new_coding = 'UTF-8'):
    encoding = cchardet.detect(data)['encoding']
    if new_coding.upper() != encoding.upper():
        data = data.decode(encoding, data).encode(new_coding)
    return data

def get_mails(login, password):
    print("Connecting to {}...".format(server))
    imap = imaplib.IMAP4_SSL(server)
    print("Connected! Logging in as {}...".format(login))
    imap.login(login, password)
    print("Logged in! Listing messages...")
    status, select_data = imap.select('INBOX')
    nmessages = select_data[0].decode('utf-8')
    status, search_data = imap.search(None, 'ALL')
    for msg_id in search_data[0].split():
        status, msg_data = imap.fetch(msg_id, '(RFC822)')
        msg_raw = msg_data[0][1].decode("utf8")

        mail = mailparser.parse_from_string(msg_raw)
        telo = convert_encoding(mail.body.encode())  # this is where the trouble is

For example - mail.body contains the following text

'<div>\\u041f\\u0440\\u043e\\u0432\\u0435\\u0440\\u043a\\u0430 \\u0441</div>'

Returns the error

TypeError: decode() argument 2 must be str, not bytes

2

python encoding imap

Author: MaxU, 2019-07-19

2 answers

Try converting the second argument to a string with str():

data = data.decode(encoding, str(data)).encode(new_coding)
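
For context, the second positional argument of bytes.decode() is the name of an error handler such as 'strict', 'replace' or 'ignore', which is why passing bytes there raises the TypeError quoted in the question. A minimal sketch of the call with a valid handler; the sample bytes and the choice of cp1251 are assumptions made only for illustration:

```python
# Illustrative sample: the word "Проверка" encoded as windows-1251.
data = "Проверка".encode("cp1251")

# errors="strict" (the default) raises UnicodeDecodeError on bad bytes.
text = data.decode("cp1251", "strict")

# errors="replace" substitutes U+FFFD for bytes that do not fit the codec,
# here because cp1251 bytes are not valid UTF-8.
lossy = data.decode("utf-8", "replace")

# Once the text is a str, converting to UTF-8 is a plain encode().
utf8_bytes = text.encode("utf-8")
print(text)
print(repr(lossy))
```

The handler only controls what happens to undecodable bytes; it does not help pick the right encoding.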

2

Author: Dan Konshin, 2019-07-19 10:54:56

score 4 · Accepted Answer

Try using chardet to determine the encoding. The library guesses the most likely encoding of a byte string and reports how confident it is in that guess.

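import chardet

# A bytes literal may only contain ASCII, so encode a str to get sample bytes
# (cp1251 is used here only as an example).
rawdata = "Тело сообщения".encode("cp1251")
meta = chardet.detect(rawdata)
try:
    text = rawdata.decode(meta['encoding'])
except (TypeError, LookupError, UnicodeDecodeError):
    # The encoding is None, unknown to Python, or does not fit the data.
    print('Unknown encoding')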
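
A detector can return None for the encoding or guess wrong on short inputs, so a deterministic fallback may be useful. The sketch below does not use chardet; it swaps in a simpler technique and tries a fixed list of candidate encodings in order. The function name to_utf8 and the candidate list are hypothetical, chosen for illustration, and not part of any library:

```python
def to_utf8(data, candidates=("utf-8", "cp1251", "koi8-r")):
    """Return data re-encoded as UTF-8, trying each candidate in order."""
    for encoding in candidates:
        try:
            text = data.decode(encoding)  # strict: raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8")
    # No candidate fit: decode as UTF-8 and mark bad bytes with U+FFFD.
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Illustrative input: Russian text encoded as windows-1251.
sample = "Проверка связи".encode("cp1251")
print(to_utf8(sample).decode("utf-8"))
```

Order matters here: strict UTF-8 goes first because single-byte encodings such as cp1251 accept almost any byte sequence and would otherwise mask valid UTF-8.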

To decode HTML character references (entities) in the text, use html.unescape:

import html
print(html.unescape('&pound;682m'))
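
For mail fetched over IMAP, the charset is normally declared in each part's Content-Type header, so guessing may not be needed at all. The following is a sketch that uses only the standard library's email package instead of mailparser; the sample message is built by hand purely for illustration and stands in for the raw bytes returned by imap.fetch:

```python
import email
from email import policy

# Hand-built sample message (an assumption for this example).
raw = (
    b"From: a@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="koi8-r"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
) + "<div>Проверка</div>\r\n".encode("koi8-r")

# policy.default yields an EmailMessage with the modern API.
msg = email.message_from_bytes(raw, policy=policy.default)

# Pick the text part, preferring plain text over HTML.
part = msg.get_body(preferencelist=("plain", "html"))

# get_content() returns a str decoded with the declared charset.
body = part.get_content()
print(part.get_content_charset(), body.strip())
```

With the body already a str, body.encode("utf-8") gives UTF-8 bytes; detection would then only be a fallback for messages that declare no charset or a wrong one.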