Bytes-translation from a string

Question

data = "Hello".encode("utf-8")  # convert to bytes

I can't make sense of bytes: no matter what I try, Python shows me b'Hello'. How do I get the actual bytes of this string?

With Cyrillic text, it works as expected:

    data = "Привет".encode("utf-8") # b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

Or am I misunderstanding what bytes are? Please explain.

6

python python-3.x byte

Author: jfs, 2016-02-23

2 answers

Try this. The ord function returns a character's Unicode code point as a decimal integer:

text = "Hello world!"
[(c, ord(c), hex(ord(c))) for c in text]

Result:

[('H', 72, '0x48'), ('e', 101, '0x65'), ('l', 108, '0x6c'), 
('l', 108, '0x6c'), ('o', 111, '0x6f'), (' ', 32, '0x20'), 
('w', 119, '0x77'), ('o', 111, '0x6f'), ('r', 114, '0x72'), 
('l', 108, '0x6c'), ('d', 100, '0x64'), ('!', 33, '0x21')]
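One caveat, shown as a small sketch: ord gives Unicode code points, and these coincide with the UTF-8 bytes only for ASCII characters. For the Cyrillic string from the question the two differ (the variable names here are just illustrative):

```python
text = "Привет"

# ord() yields one Unicode code point per character.
code_points = [ord(c) for c in text]

# Encoding yields bytes; in UTF-8 each of these Cyrillic letters takes two bytes.
utf8_bytes = list(text.encode("utf-8"))

print(code_points)
print(utf8_bytes)
print(len(code_points), len(utf8_bytes))
```

So for non-ASCII text, the byte values come from encode(), not from ord().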

0

Author: gil9red, 2016-02-23 20:23:02

score 5 · Accepted Answer

Both b'Hello' and b'\xd0\x9f\xd1\x80...' are objects of the same type, bytes.

b'Hello' == b'\x48\x65\x6c\x6c\x6f'. Bytes that correspond to printable ASCII characters (0x20..0x7e) are shown as those characters in the text representation repr(data). That representation uses the syntax of bytes literals in Python source code, so eval(repr(data)) == data.
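A quick way to check this yourself, as a minimal sketch assuming Python 3 (the variable names are illustrative):

```python
data = "Hello".encode("utf-8")

# The same five bytes, written two different ways in source code.
same = data == b"\x48\x65\x6c\x6c\x6f"

# Iterating over a bytes object yields integers, not characters.
values = list(data)

# repr() produces a bytes literal that evaluates back to an equal object.
round_trip = eval(repr(data)) == data

print(same, values, round_trip)
```

Only the display differs; the underlying byte values are the numbers in values.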

Showing some bytes as characters instead of hexadecimal codes can be misleading, as it was here. If you need a hex dump, it is easy to get one:

>>> b'Hello'.hex()
'48656c6c6f'
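If you want the output to look like the Cyrillic example from the question, here is a sketch of two options. It assumes Python 3.8 or later for the separator argument of bytes.hex(); the names spaced and escaped are made up for illustration:

```python
data = "Hello".encode("utf-8")

# Hex dump with a separator between bytes (separator argument needs Python 3.8+).
spaced = data.hex(" ")

# Build a \xNN-style string by hand, formatting each byte as two hex digits.
escaped = "".join(f"\\x{b:02x}" for b in data)

print(spaced)
print(escaped)
```

Both results are ordinary str objects intended for display; data itself is unchanged.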

The likely motivation for showing b'Hello' instead of b'\x48\x65\x6c\x6c\x6f' is that many popular protocols, such as HTTP, freely mix text (in an ASCII-compatible encoding) with binary data. Showing characters instead of hex codes can therefore help with debugging.

The disadvantage of showing b'Hello' instead of b'\x48\x65\x6c\x6c\x6f' is that people confuse text (Unicode strings) with binary data (bytes), which leads to garbled output (mojibake). The problem was particularly acute in Python 2, where str was bytes. See "Stop displaying elements of bytes objects as printable ASCII characters in CPython 3" [python-ideas mailing list (2014)].

Without an explicitly specified encoding, a byte sequence (a bytes object) is just a sequence of numbers. It becomes text only when the bytes are decoded using a suitable encoding:

unicode_text = bytestring.decode(character_encoding)
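A small sketch of what "suitable" means here. The choice of cp1251 as the wrong encoding is only an example:

```python
raw = "Привет".encode("utf-8")

# Decoding with the encoding that produced the bytes restores the text.
text = raw.decode("utf-8")

# Decoding the same bytes with a different encoding raises no error here,
# but each byte is read as a separate character, so the result is garbled.
garbled = raw.decode("cp1251")

print(text)
print(garbled)
```

The bytes are identical in both cases; only the interpretation changes.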