Size of a character (ASCII vs other encodings) in bytes

Seeing this question I had a doubt. Coming from PHP, and having had "problems" with character encoding in the past (e.g., strpos vs mb_strpos), I knew that all ASCII characters take 1 byte, but I thought special characters would take more; I associated a character being "special" with it also being multi-byte.

I.e., if I write a simple .txt file with the character "a", it takes 1 byte, but if I write it with the character "ã" it takes two bytes. Yet this example indicates that the special character takes 4 bytes:

#include <iostream>
using namespace std;

int main() {
    char a = 'a';
    cout << sizeof(char) << "\n"; // 1
    cout << sizeof(a) << "\n"; // 1
    cout << sizeof('ã')  << "\n"; // 4
}

Where am I going wrong?

Author: Comunidade, 2017-01-19

2 answers

The type char in C, and therefore in C++, is poorly named. I actually think it should be called byte, because that is what it is. Its use as a character is just a detail.

Contrary to popular belief, C is a weakly typed language. It is statically typed, but weakly typed; people often confuse these terms. C can interpret the same data as if it were of a different type or form than originally intended. This can be observed in this code:

char a = 'a';
printf("%c\n", a); // a
printf("%d\n", a); // 97

See it working on ideone, and on repl.it. I also put it on GitHub for future reference.

The same data can be displayed as a number or character.

Some C functions let you do this interpretation as a character; in general you need to say explicitly that it should be so. That is what %c does: it indicates that the data should be treated as a character. Otherwise, a char is treated as a plain number.

Any character encoding that fits in 1 byte can be stored in a char. When C was created, there was essentially only ASCII (at least nothing else relevant).

Later, more complete encodings emerged that use the whole byte to represent more characters. As that got complicated, code pages (charsets) were created. To "simplify" and enable even more characters, the multi-byte character was created. At that point it was no longer possible to use char as the type to store a character, since it is guaranteed to have only 1 byte.

Nothing prevents you from using a string of chars and saying it is just one character, but that will be your own solution, and your functions must know what to do with it. The C library, third-party libraries, and even operating systems will not know how to deal with it, so nobody does this. Many people do not understand that C is a language for working with things in a raw way: you can do whatever you want, however you want. Stepping outside the convention is your problem.

When we need multi-byte characters we usually use the type wchar_t. Its size can vary with the implementation; the specification leaves it open. In some cases we use char16_t and char32_t, which have their sizes guaranteed by the specification. This is standardized.

Let's run this code to better understand:

char a = 'a';
char b = 'ã';
wchar_t c = 'a';
wchar_t d = 'ã';
cout << sizeof(char) << "\n";
cout << sizeof(a) << "\n";
cout << sizeof('a') << "\n";
cout << sizeof(b) << "\n";
cout << sizeof(c) << "\n";
cout << sizeof(d) << "\n";
cout << sizeof('ã') << "\n";

See it working on ideone, and on repl.it. I also put it on GitHub for future reference.

Did you notice that the accent does not take up more bytes? That I declared b as a char and it has only one byte, even with an accent? And that c has 4 bytes even though it holds a character that fits in ASCII? The size is determined by the type of the data, or of the variable. Where I explicitly said it is a char, 1 byte was used. Where the compiler could infer that a char is enough, it used 1 byte. Where I explicitly said it is a wchar_t, it occupied 4 bytes. Where it inferred that more than one byte was needed to represent the character, it adopted 4 bytes. So your sizeof('ã') gave 4 bytes because the literal was inferred to be of a type wider than char.

It is clear that in this compiler wchar_t has 4 bytes.

Every C and C++ library understands wchar_t as a type that stores characters rather than numbers, even though what is actually stored is always numbers; computers do not know what characters are, it is just a trick to show them to the people who want to see them.

Again, in C you do as you wish. If you want to make every character take one byte, you can, even accented ones. Of course, there are only 256 possible values in one byte, so you cannot have all possible characters in this situation.

 5
Author: Maniero, 2020-11-20 14:54:24

TL;DR: it depends on the encoding and on some language/platform details.

Each UTF-8 character occupies 1 to 4 bytes (older definitions of the encoding allowed up to 6),

Each UTF-16 character occupies 16 bits (32 for characters outside the BMP, encoded as surrogate pairs)

Each UTF-32 character occupies 32 bits

Each character of an ASCII string occupies 1 byte

Source


Well, I think it is good to remember that each language/platform is free to decide how it will allocate memory for each of the types mentioned above.

C

In the case of C, it does the minimum work: it allocates just enough space for the type, and may add a few extra bytes of padding to be friendlier to the cache and to reads/writes in memory.

See this question for more information on C

C#

In the case of C#, for example, all non-primitive objects have an overhead of 8 or 16 bytes; this question also clarifies why.

Python

Objects

Python also uses a technique similar to C#'s. The answer to this question on SOEN indicates that every object in Python occupies an extra 16 bytes (on 64-bit builds). It seems that every object stores a reference count and a reference to the object's type. The official Python documentation explains how an object is structured.

I also found a quite detailed article on this subject.

It seems that Python also pads objects, up to 256 bytes: if you allocate an object of 10 bytes, it will actually occupy 16.

Strings

It also gives more details about the size a string occupies.

An empty string occupies 37 bytes, and each additional character adds one byte to the size. Unicode strings are similar, but have an overhead of 50 bytes, and each additional character occupies 4 bytes (I believe the author made an error there). In Python 3 the overhead is 49 bytes.

The information seems somewhat contradictory to what is given in a SOEN question, but it will depend on the version of Python you are using, so it stays here for reference.

This other question on SOEN also has a table that explains how much space each object occupies.

Bytes  type        empty + scaling notes
24     int         NA
28     long        NA
37     str         + 1 byte per additional character
52     unicode     + 4 bytes per additional character
56     tuple       + 8 bytes per additional item
72     list        + 32 for first, 8 for each additional
232    set         sixth item increases to 744; 22nd, 2280; 86th, 8424
280    dict        sixth item increases to 1048; 22nd, 3352; 86th, 12568 *
64     class inst  has a __dict__ attr, same scaling as dict above
16     __slots__   class with slots has no dict, seems to store in 
                   mutable tuple-like structure.
120    func def    doesn't include default args and other attrs
904    class def   has a proxy __dict__ structure for class attrs
104    old class   makes sense, less stuff, has real dict though.
 3
Author: Bruno Costa, 2017-05-23 12:37:27