What do characters of the type "\u0…" mean?

What do these characters mean?

"\u0001"
"\u0013"
"\u0004"
"\u0003"
"\u0018"
"\u001A"
"\u0016"
"\u0002"

They appear when I press the Ctrl key. I'm using the Qt IDE and a program that displays the keys pressed. Thank you.

 1
Author: Mariano, 2016-10-11

4 answers

In C++, \u and \U introduce Unicode characters.

\u takes 4 hexadecimal digits (a 16-bit value); \U takes 8 (a 32-bit value).

Unicode is a standard for representing the characters of every language and region: each character gets the same representation everywhere, so it can be considered the universal representation of that character.


  • As an example, the representation of the letter

    **ñ**

    used in the Spanish language, would be:

'\u00F1' or '\U000000F1'

  • Another example, the Euro symbol €. Its representation would be:

'\u20AC' or '\U000020AC'


Usage example:

cout << "I want to print these characters: \u00F1 and \u20AC" << endl;

Output:

I want to print these characters: ñ y €
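
For reference, a minimal complete program (a sketch, assuming a terminal configured for UTF-8) that prints the same characters together with their numeric code points:

    #include <iostream>

    int main() {
        std::cout << "I want to print these characters: \u00F1 and \u20AC\n";

        // The numeric values behind the escapes (printed in hexadecimal):
        std::cout << std::hex
                  << static_cast<unsigned int>(U'\u00F1') << ' '   // f1
                  << static_cast<unsigned int>(U'\u20AC') << '\n'; // 20ac
        return 0;
    }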

Here you have a table with all the characters:

http://unicode-table.com/en/#control-character

 5
Author: Jorgesys, 2016-10-11 22:34:48

In one of your comments you say "if I want to change the unicodes to the representation in char [...]". I think that question shows that you are not clear on how character encodings work, and on what happens when such characters are displayed in a window or in a terminal. So I'll explain it to you, in order to clear up any other doubts that come along the way.

ASCII

Let's start with ASCII. ASCII is an encoding born to standardize the "characters" used in teletypes (machines similar to typewriters). There were a total of 128 different characters, numbered from 0 to 127. Most are printable, such as a, 0 or =. Others are control characters, such as carriage return (on a typewriter, used to put the "carriage" back at the beginning of the line) or line feed (to "turn the wheel" and be able to write on a new line). Other special characters among those 128 are form feed (page change), end of header (for special protocols), horizontal and vertical tabs (i.e. indentation), etc.

They are called control characters because, when a teletype received one of them, rather than "printing" something, it triggered an action. That is, they are characters used to control the flow of text or of the communication.

To store those 128 characters you only need 7 bits (simple math: 2^7 = 128). Since processors handle 8 bits better than 7, each ASCII code (the numeric value corresponding to each ASCII character) is stored in a 1-byte word. And since 8 bits can hold up to 256 values but ASCII only defines 128, you have half of them left over!
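
This is exactly what appears in the question: sequences such as "\u0001" or "\u0013" are ASCII control characters, and in a terminal they are generated with Ctrl plus a letter (the code is the letter's value minus 64). A minimal sketch (C++11):

    #include <iostream>

    int main() {
        const char *ctrl_a = "\u0001";   // the first sequence shown in the question
        const char *ctrl_s = "\u0013";   // another one (0x13 = 19)

        std::cout << static_cast<int>(ctrl_a[0]) << ' ' << 'A' - 64 << '\n';  // 1 1   -> Ctrl+A
        std::cout << static_cast<int>(ctrl_s[0]) << ' ' << 'S' - 64 << '\n';  // 19 19 -> Ctrl+S
        return 0;
    }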

CODE PAGES

What do we do with that other half? Put more characters into it. Thus code pages appeared (MS-DOS and Windows made heavy use of them), which are nothing more than different encodings that reuse the other 128 values available in a byte. The values from 128 to 255 were used to represent different characters depending on the language: Greek, Spanish (for the accents), Russian or Hebrew, for example.

The emoticons you can find in the typical MS-DOS terminal when printing certain texts are nothing more than a way, in some code pages, of replacing control characters that were no longer used with printable characters.

What does that imply? That if a text editor thinks it is going to receive text encoded in the Greek code page, it will print the values from 128 to 255 differently than if it thinks the code page is the Russian one. Therefore, the same text (the same sequence of bytes) will be printed differently in different editors or terminals, depending on the encoding each editor or terminal has configured for interpreting the text it receives.

Obviously, if I send an email from my home to a Russian, he will receive a meaningless text, because he will print my Spanish mail with Russian characters. However, the first 128 values (from 0 to 127), that is, the "ASCII part", will be printed the same in every case, since that first half of the value set is standardized and "not touched".
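
A small sketch of the idea (assuming you can change the encoding of the terminal the program writes to): the program only emits a byte; what you see depends entirely on the code page or encoding the terminal uses to interpret it.

    #include <cstdio>

    int main() {
        // The byte 0xF1 has no meaning on its own. Under ISO-8859-1 a terminal
        // renders it as 'ñ'; other single-byte code pages map the same value to
        // a completely different letter. The terminal decides, not the program.
        unsigned char b = 0xF1;
        std::printf("raw value: 0x%02X -> ", static_cast<unsigned>(b));
        std::fputc(b, stdout);
        std::fputc('\n', stdout);
        return 0;
    }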

Unicode

Obviously, the code page business was a bit chaotic (although it is not extinct), so many standards appeared to try to put some order into it. Thus there are different standards such as ISO-8859-1, which covers the Latin alphabet (ASCII + accented vowels, eñes, the copyright symbol, etc.), ISO-8859-15 (a variant of 8859-1 that adds, for example, the Euro symbol), etc.

Most of them are 1 byte, i.e. all with a limit of 256 characters. Until Unicode appears. Unicode tries to be able to represent every character in the world, and it is a "21-bit encoding" (beware of this, I'll clarify it later). There are symbols for all the languages of the world, for linguists, musical symbols, meteorology, mathematics, miscellaneous (even soccer balls or the symbol of communism); in short, everything.

Original question

" if I want to change the unicodes to the representation in char [...]"

A char occupies one byte. In a char you can only store characters that fit in one byte: any ASCII, ISO-8859-1 or ISO-8859-15 character, etc., but not any arbitrary Unicode character. C and C++ also have "wide characters" (wchar_t), which occupy two bytes (four on some platforms), so a two-byte wchar_t can hold up to 65536 different characters.

As you can see, a Unicode code point may need up to 3 bytes (although the vast majority of Unicode characters fit in 2), so not every Unicode character can be stored in a char. Moreover, even if you store characters in wchar_t, that does not mean that, if you want to "display them on screen", they will be displayed correctly. That depends on the encoding of the terminal or text editor where you want to see those characters, or on whether the editor detects the character encoding that the text uses.
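
A quick sketch of those sizes and limits (the exact size of wchar_t depends on the platform: typically 2 bytes on Windows and 4 on Linux):

    #include <iostream>

    int main() {
        std::cout << "sizeof(char)    = " << sizeof(char)    << '\n';  // always 1
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';  // 2 or 4, depending on the platform

        wchar_t enye = L'\u00F1';  // U+00F1 (ñ): fits in 16 bits
        wchar_t euro = L'\u20AC';  // U+20AC (€): fits in 16 bits, but NOT in a single char
        std::cout << static_cast<unsigned long>(enye) << ' '
                  << static_cast<unsigned long>(euro) << '\n';         // 241 8364
        return 0;
    }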

When you program, however, the binary object you produce is not a text file; it is a binary file, and therefore it has no character encoding. The value that you have stored in a wchar_t is simply kept in memory as its binary representation.

When you "print" (printf / cout) that character, the terminal that has to display it only receives a binary sequence, and it does not know whether it has received two chars or one wchar_t, so it does not know whether to interpret that sequence as two ASCII/ISO-8859-* characters or as a single 16-bit character in whatever 16-bit encoding may exist.

Hence the problem of specifying "which encoding" is being sent, that is, of telling the terminal how it should interpret the byte sequences it receives: under what encoding (whether characters take 1 or 2 bytes, and which particular character each value corresponds to).

UTF-32 and UTF-16

Unicode, however, is a standard, not an implementation. By that I mean that Unicode defines a mapping between characters and numeric values, but not how to implement the "translation machinery". UTF-8, UTF-16 and UTF-32 are three of those implementations (encodings).

In UTF-32, each character occupies 4 bytes, so any Unicode character can be represented in UTF-32. The thing is that most characters of virtually any text in any language only need at most 16 bits. Imagine an English speaker typing "Hello" in a UTF-32 file. Since the file is UTF-32, each character is stored using 32 bits, so the string "Hello" occupies (5 characters) * (4 bytes per character) = 20 bytes, when in ISO-8859-1 it occupies only 5 bytes. It is a huge waste.
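
You can check that waste directly with string literals (a sketch; sizeof counts the terminating null character, hence the subtraction):

    #include <iostream>

    int main() {
        std::cout << sizeof("Hello")  - sizeof(char)     << '\n';  // 5  bytes: 1 byte per character
        std::cout << sizeof(U"Hello") - sizeof(char32_t) << '\n';  // 20 bytes: 4 bytes per character (UTF-32)
        return 0;
    }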

In UTF-16, each character occupies 2 bytes (16 bits). If a character does not fit in 2 bytes, 4 bytes are used for that character, so UTF-16 needs a mechanism to know whether the next character in the text takes one word or two. This way, "Hello" takes up 10 bytes (half as much); you save a lot of space and, in the particular cases where you need a larger character, you can still represent it.

Obviously, if I open a text encoded in UTF-16 in a text editor that knows it (because it is told so or detects it), the editor will display each character correctly, because it knows how the encoding works and checks whether each character occupies 16 or 32 bits before rendering it.
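
A sketch of that mechanism using C++11 UTF-16 literals: the Euro symbol fits in one 16-bit unit, while a character outside the 16-bit range (here the musical G clef, U+1D11E) needs two units (a surrogate pair):

    #include <iostream>
    #include <string>

    int main() {
        std::u16string euro = u"\u20AC";      // € (U+20AC): one 16-bit unit
        std::u16string clef = u"\U0001D11E";  // 𝄞 (U+1D11E): two 16-bit units

        std::cout << euro.size() << '\n';  // 1 code unit  (2 bytes)
        std::cout << clef.size() << '\n';  // 2 code units (4 bytes), yet it is a single character
        return 0;
    }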

UTF-8

UTF-8 is the other extreme: it uses 1-byte words. If a character does not fit in 1 byte, it uses 2; if it needs 3 or 4 bytes, it uses 3 or 4. Like UTF-16, it has a mechanism for arranging the bits so that both the Unicode value of the character and the number of bytes it occupies (1, 2, 3 or 4) can be recovered from the byte sequence, and if the text editor knows it has to display a UTF-8 file, it applies that same mechanism to work out the size of the next character to print.

In purely ASCII text, UTF-8 looks exactly the same, since the first 128 Unicode characters are the ASCII ones. In addition, it does not waste memory. The issue is that, UTF-8 being a variable-size encoding (each character can occupy 1, 2, 3 or 4 bytes), space is minimized ("Hello" occupies 5 bytes), but processing is maximized, because for each character it must be determined whether it occupies 1, 2, 3 or 4 bytes, which makes the algorithm slower than UTF-16, and even more so than UTF-32.
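
The variable size is easy to see with C++11 u8 literals (a sketch; sizeof includes the terminating null byte, hence the subtraction):

    #include <iostream>

    int main() {
        std::cout << sizeof(u8"a")          - 1 << '\n';  // 1 byte
        std::cout << sizeof(u8"\u00F1")     - 1 << '\n';  // 2 bytes (ñ)
        std::cout << sizeof(u8"\u20AC")     - 1 << '\n';  // 3 bytes (€)
        std::cout << sizeof(u8"\U0001F642") - 1 << '\n';  // 4 bytes (🙂)
        return 0;
    }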

Qt / GUI

If you want to display a text in a window, a widget or whatever, you can store it as a string, "inserting" into the string sequences such as "Hola\u00F1" to indicate the corresponding Unicode value. What the compiler does is replace the occurrence of the sequence "\u00F1" with the (normally UTF-8) binary representation of the Unicode character corresponding to the value 0x00F1.

Then, if you pass such a string to QString::fromUtf8, that function will read the received bytes as if they were a UTF-8 string (which they are, since the prefix "Hola" is UTF-8 because UTF-8 is ASCII-compatible, and the character 0x00F1 has been replaced by its corresponding UTF-8 sequence), and will store the text in its own internal representation.

Then, if you pass that QString with the correctly stored text to a widget to be drawn, Qt will take care of talking properly to the "graphical manager" of your system so that it is the one that finally prints the string in the proper way. Everything works nicely from there on.

Note: actually, a string literal ("...") containing Unicode characters can be transformed into UTF-8, UTF-16 or any other encoding. It is not standardized which encoding should be used in those cases (although it will normally be UTF-8). In C++11, however, the notations u8"..." (UTF-8), u"..." (UTF-16) and U"..." (UTF-32) were added to state the encoding explicitly.
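
A minimal Qt sketch of the whole path (assuming a Qt Widgets project; the narrow literal relies on the compiler encoding the \u escapes as UTF-8, as described above):

    #include <QApplication>
    #include <QLabel>

    int main(int argc, char *argv[]) {
        QApplication app(argc, argv);

        // The compiler turns "\u00F1" and "\u20AC" into their UTF-8 byte
        // sequences, and QString::fromUtf8 decodes them into a QString.
        QLabel label(QString::fromUtf8("Hola \u00F1 \u20AC"));
        label.show();

        return app.exec();
    }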

In short,

You will understand that you cannot fit an arbitrary Unicode character into a simple char.

 3
Author: Peregring-lk, 2016-10-14 11:08:38

Unicode characters are very useful for representing special characters. I leave you a list of the ones that have helped me the most:

Á \u00C1
á \u00E1
É \u00C9
é \u00E9
Í \u00CD
í \u00ED
Ó \u00D3
ó \u00F3
Ú \u00DA
ú \u00FA
Ü \u00DC
ü \u00FC
Ñ \u00D1
ñ \u00F1
& \u0026
" \u0022
' \u0027
© \u00A9
® \u00AE
€ \u20AC
¼ \u00BC
½ \u00BD
¾ \u00BE

This way, in your code you can replace an á with \u00E1 and keep it from getting garbled.
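
For example (a sketch), the source file below contains only ASCII characters, yet it prints accented text correctly on a terminal whose encoding matches the compiler's:

    #include <iostream>

    int main() {
        // "Año nuevo © 2016" written only with Unicode escapes:
        std::cout << "A\u00F1o nuevo \u00A9 2016\n";
        return 0;
    }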

 1
Author: Trigon219, 2016-10-11 19:54:34

They are called Unicode characters. They serve to unify into a single set the different character sets that existed before, so that the same character can be used in more than one region.

For example, with old systems, an encoding that gave you the Euro symbol € might give you something different in another region of the world (depending on its encoding), such as ♥ (it is just an example so that you can see the problem this used to cause).

On Wikipedia you have a list of the most commonly used ones.

 1
Author: Francisco Romero, 2016-10-11 20:02:35