
Character size in Java

You have misunderstood it. Let me put it in plain terms:

Suppose I give you two houses, each 100 square meters, one with a pig in it and the other with a chicken. You can say the two houses are the same size, and that much is true.

But is the pig as big as the chicken?

The pig has the pig's code and the chicken has the chicken's code, and the two are different!

Also, one byte may look small, but it has 8 bits and can represent 2 to the power of 8, that is, 256 states.

Two bytes give 2 to the power of 16, that is, 65,536 states. Is that clear now?
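The house analogy can be checked directly in Java. The following is a minimal sketch, not an authoritative reference; the class name CharSizeDemo and the helper statesForBits are made up for illustration. It shows that a Java char always occupies two bytes, while the numeric code stored inside it differs from character to character.

```java
// Sketch: a Java char is always 16 bits, whatever character it holds.
// CharSizeDemo and statesForBits are hypothetical names for this example.
class CharSizeDemo {
    // Number of distinct states that the given number of bits can represent.
    static long statesForBits(int bits) {
        return 1L << bits;
    }

    public static void main(String[] args) {
        char pig = 'A';          // a Latin letter
        char chicken = '\u4e2d'; // the Chinese character U+4E2D

        // Same "house size": every char occupies Character.BYTES bytes.
        System.out.println(Character.BYTES);   // 2
        // Different "code names": the numeric values differ.
        System.out.println((int) pig);         // 65
        System.out.println((int) chicken);     // 20013
        System.out.println(statesForBits(8));  // 256
        System.out.println(statesForBits(16)); // 65536
    }
}
```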

What follows is background reading. Read it carefully; I copied it from a senior user on Zhihu, and it will definitely help you in the future:

Friends passing by, if you find it useful after reading, please give it a like. Thank you.

A long time ago, a group of people decided to combine eight switchable transistors into different states to represent everything in the world.

They saw that this was good, so they called it a "byte".

Later, they built machines that could process these bytes. Once a machine was started, it could use bytes to combine many states, and the states began to change.

They saw that this was good, so they called this machine a "computer".

At first, computers were used only in America. An 8-bit byte can take 256 different states in total.

They assigned special purposes to the 32 states starting from 0. Whenever a terminal or printer received one of these agreed bytes, it performed an agreed action:

On receiving 0x0A, the terminal moves to a new line;

On receiving 0x07, the terminal beeps at people;

On receiving 0x1B, the printer prints in reverse video, or the terminal displays letters in color.

They saw that this was good, so they called the byte states below 0x20 "control codes".
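As a rough illustration of the control-code range, the sketch below tests a few byte values against the boundary 0x20. The class ControlCodeDemo and the helper isAsciiControl are hypothetical names invented for this example. Java's own Character.isISOControl covers this range as well, but it also returns true for 0x7F to 0x9F, so a hand-written check is used here.

```java
// Sketch: classify byte values as ASCII control codes (below 0x20) or not.
// ControlCodeDemo and isAsciiControl are hypothetical names for this example.
class ControlCodeDemo {
    // True for the 32 control states, 0x00 through 0x1F.
    static boolean isAsciiControl(int value) {
        return value >= 0x00 && value < 0x20;
    }

    public static void main(String[] args) {
        // 0x07 bell, 0x0A line feed, 0x1B escape, 0x20 space, 0x41 'A'
        int[] samples = {0x07, 0x0A, 0x1B, 0x20, 0x41};
        for (int value : samples) {
            System.out.printf("0x%02X control=%b%n", value, isAsciiControl(value));
        }
    }
}
```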

They also assigned all the spaces, punctuation marks, digits, and uppercase and lowercase letters to consecutive byte states, up to number 127, so that computers could store English text using different bytes.

Everyone saw that this was good, so everyone called this scheme ANSI's "ASCII" encoding (American Standard Code for Information Interchange).

At that time, all computers in the world used the same ASCII scheme to store English text.

Later, as with the building of the Tower of Babel, computers came into use all over the world, but many countries do not use English, and many of their letters are not in ASCII. To store their own writing in computers, they decided to use the space after 127 for these new letters and symbols. They also added many shapes needed for drawing tables, such as horizontal lines, vertical lines and crosses, and kept numbering until they reached the last state, 255.

The characters numbered 128 to 255 are called the "extended character set".

After that, greedy humanity had no new states left to use. American imperialism probably never imagined that people in third world countries would want to use computers too!

By the time people in China got computers, there were no byte states left to represent Chinese characters, and more than 6,000 commonly used Chinese characters needed to be stored.

But that did not stump the clever Chinese people. We unceremoniously dropped the odd symbols after 127 and ruled that a byte smaller than 127 keeps its original meaning, but two bytes larger than 127 placed together represent one Chinese character.

The first byte (called the high byte) ranges from 0xA1 to 0xF7.

The second byte (the low byte) ranges from 0xA1 to 0xFE.

This lets us encode more than 7,000 simplified Chinese characters.
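The byte ranges above can be turned into a small check. This is only a sketch based on the two ranges quoted in the text; the class Gb2312RangeDemo and its methods are hypothetical names. It also computes how many two-byte combinations the ranges allow, which is an upper bound on the code space rather than the number of characters actually assigned.

```java
// Sketch: test whether two bytes fall in the two-byte ranges quoted above.
// Gb2312RangeDemo and its methods are hypothetical names for this example.
class Gb2312RangeDemo {
    // High byte 0xA1..0xF7, low byte 0xA1..0xFE (values given as 0..255).
    static boolean isGb2312Pair(int high, int low) {
        return high >= 0xA1 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    // Number of (high, low) combinations the two ranges allow.
    static int codeSpaceSize() {
        int highCount = 0xF7 - 0xA1 + 1; // 87 possible high bytes
        int lowCount = 0xFE - 0xA1 + 1;  // 94 possible low bytes
        return highCount * lowCount;
    }

    public static void main(String[] args) {
        System.out.println(isGb2312Pair(0xD6, 0xD0)); // true
        System.out.println(isGb2312Pair(0x41, 0x42)); // false, plain ASCII
        System.out.println(codeSpaceSize());          // 8178
    }
}
```

The 8,178 available slots are consistent with the "more than 7,000" characters mentioned above, since not every slot is assigned.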

In this encoding we also included mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation and letters that already exist in ASCII were re-encoded as two-byte codes. These are usually called "full-width" characters, while the original characters below 127 are called "half-width" characters.

The Chinese people saw that this was good, so they called this Chinese character scheme "GB2312".

GB2312 is a Chinese extension of ASCII. But China has too many Chinese characters, and we soon found that many people's names could not be typed, especially the names of certain people who are very good at making trouble for others.

So we had to keep digging out the unused code points in GB2312 and putting them to use without ceremony.

Later even that was not enough, so the low byte was no longer required to be a code after 127. As long as the first byte is greater than 127, it always marks the start of a Chinese character, whether or not the byte that follows belongs to the extended character set.

The resulting extended encoding scheme is called the GBK standard. It contains everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.

Later, ethnic minorities also needed to use computers, so we extended it again and added several thousand new characters for minority scripts. GBK was thus extended to GB18030.

From then on, the culture of the Chinese nation could be passed down in the computer age.

Chinese programmers saw that this series of Chinese character encoding standards was good, so they collectively called them "DBCS" (Double Byte Character Set).

The most distinctive feature of the DBCS family is that two-byte Chinese characters and single-byte English characters coexist in the same encoding. To support Chinese text, a program therefore has to check the value of every byte in a string: if a byte is greater than 127, a character from the double-byte character set is considered to have started.
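The byte-by-byte rule described above can be sketched as follows. This is a simplified illustration under the assumption that any byte above 127 is a lead byte followed by exactly one trail byte; it does not validate real GBK or BIG5 ranges. DbcsScanDemo and countCharacters are hypothetical names.

```java
// Sketch: count characters in DBCS-style bytes using the "greater than 127" rule.
// DbcsScanDemo and countCharacters are hypothetical names for this example.
class DbcsScanDemo {
    static int countCharacters(byte[] text) {
        int count = 0;
        int i = 0;
        while (i < text.length) {
            int value = text[i] & 0xFF; // Java bytes are signed, so mask to 0..255
            // A lead byte above 127 starts a two-byte character,
            // unless it is the last byte (truncated input counts as one).
            boolean twoBytes = value > 127 && i + 1 < text.length;
            i += twoBytes ? 2 : 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', one two-byte character (0xD6 0xD0), then 'B'
        byte[] sample = {0x41, (byte) 0xD6, (byte) 0xD0, 0x42};
        System.out.println(sample.length);           // 4 bytes
        System.out.println(countCharacters(sample)); // 3 characters
    }
}
```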

At that time, all the blessed programmer monks who could code recited the following mantra hundreds of times a day: "One Chinese character counts as two English characters! One Chinese character counts as two English characters ..."

Because at that time every country, like China, devised its own encoding standard, nobody understood anyone else's encoding and nobody supported anyone else's encoding. Even mainland China and Taiwan, only 150 nautical miles apart and using the same language, adopted different DBCS encoding schemes. Back then, anyone in China who wanted to display Chinese characters on a computer had to install a "Chinese character system", dedicated to handling the display and input of Chinese characters. For example, a fortune-telling program written by some superstitious person in Taiwan required a separate system supporting BIG5 encoding, such as the "ETen Chinese System". Install the wrong character system and the display turns to garbage! What was one to do?

Moreover, there are still poor peoples in the world who have no computers for the time being. What about their writing?

Truly a Tower of Babel problem for computers!

At this moment, the archangel Gabriel appeared just in time: an international organization called ISO decided to tackle the problem.

Their method was simple: abolish all regional encoding schemes and create one new code that includes every culture and every letter and symbol on earth!

They planned to call it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "Unicode".

By the time Unicode was being drawn up, computer memory capacity had grown enormously, and space was no longer a problem.

Therefore, ISO stipulated outright that every character must be represented by two bytes, that is, 16 bits. For the "half-width" characters in ASCII, Unicode keeps the original code values but extends the length from 8 bits to 16 bits, while the characters of all other cultures and languages are re-encoded. Because the "half-width" English symbols need only the low 8 bits, their high 8 bits are always 0, so this generous scheme takes twice the space when storing English text.
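The "high 8 bits are always 0" observation can be seen in Java by encoding ASCII text as UTF-16. This sketch assumes that big-endian UTF-16 is an acceptable stand-in for the two-byte form described above (the two agree for characters that fit in 16 bits). WideAsciiDemo and utf16 are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Sketch: ASCII text stored in a 16-bit-per-character form doubles in size.
// WideAsciiDemo and utf16 are hypothetical names for this example.
class WideAsciiDemo {
    // Encode as big-endian UTF-16 without a byte order mark.
    static byte[] utf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] a = utf16("A");
        // High byte is 0x00, low byte is the original ASCII value 0x41.
        System.out.printf("%02X %02X%n", a[0], a[1]); // 00 41

        String english = "hello";
        System.out.println(english.getBytes(StandardCharsets.US_ASCII).length); // 5
        System.out.println(utf16(english).length);                              // 10
    }
}
```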

At this point, programmers who had come over from the old world noticed something strange: their strlen function was no longer reliable, and a Chinese character no longer counted as two characters, but as one!

Yes, from Unicode onward, a half-width English letter and a full-width Chinese character are each uniformly "one character"!

At the same time, each is uniformly "two bytes". Please note the difference between a "character" and a "byte": a byte is an 8-bit physical storage unit, while a character is a culturally meaningful symbol.
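Java strings make the character versus byte distinction easy to observe. The sketch below is illustrative only; CharVsByteDemo and its methods are hypothetical names. Note that String.length() counts 16-bit char units, which matches "one character" for the letters and Chinese characters used here.

```java
import java.nio.charset.StandardCharsets;

// Sketch: the same string measured in characters and in bytes.
// CharVsByteDemo and its methods are hypothetical names for this example.
class CharVsByteDemo {
    // Number of 16-bit char units in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes when each char unit is stored as two bytes.
    static int byteCountUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String mixed = "a\u4e2d"; // one English letter and one Chinese character
        System.out.println(charCount(mixed));      // 2 characters
        System.out.println(byteCountUtf16(mixed)); // 4 bytes
    }
}
```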

In Unicode, a character is two bytes. The era when one Chinese character counted as two English characters was coming to an end. But Unicode is not perfect either, and two problems remain. The first is: how do we tell Unicode apart from ASCII?

How does the computer know whether three bytes represent one symbol or three separate symbols?

The second problem is that, as we already know, one byte is enough for an English letter. If Unicode required every symbol to be represented by three or four bytes, every English letter would have to be padded with two or three zero bytes in front. That is a great waste of storage, and text files would become two or three times larger, which is unacceptable.

For a long time Unicode could not become widespread, until the Internet emerged. To solve the problem of how to transmit Unicode over networks, a number of UTF (UCS Transformation Format) standards appeared. As the names suggest, UTF-8 handles data in units of 8 bits, while UTF-16 handles data in units of 16 bits.

UTF-8 is the most widely used Unicode implementation on the Internet. It was designed specifically for transmission and makes encoding borderless, so that characters of all the world's cultures can be displayed.

One of the biggest characteristics of UTF-8 is that it is a variable-length coding method.

It uses 1 to 4 bytes to represent a symbol, and the number of bytes varies with the symbol. A character in the ASCII range is represented by one byte, so the single-byte ASCII encoding is preserved as a part of UTF-8. (Note that a Chinese character takes two bytes in the two-byte Unicode form and three bytes in UTF-8.)
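The variable length can be observed by encoding a few sample characters with Java's built-in UTF-8 charset. This is a small sketch; Utf8LengthDemo and utf8Length are hypothetical names, and the sample characters are chosen only to hit each of the four lengths.

```java
import java.nio.charset.StandardCharsets;

// Sketch: UTF-8 uses 1 to 4 bytes depending on the character.
// Utf8LengthDemo and utf8Length are hypothetical names for this example.
class Utf8LengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1, ASCII letter
        System.out.println(utf8Length("\u00e9"));       // 2, Latin letter e with acute
        System.out.println(utf8Length("\u4e2d"));       // 3, Chinese character U+4E2D
        System.out.println(utf8Length("\uD83D\uDE00")); // 4, emoji U+1F600
    }
}
```

The last sample is written as two escapes because characters beyond U+FFFF do not fit in a single 16-bit Java char and are stored as a surrogate pair.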

Unicode code values do not map byte-for-byte onto UTF-8; converting between the two requires some algorithms and rules.
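One way to picture the conversion rules is to write the bit shuffling by hand. The following is a simplified sketch of the standard UTF-8 bit layout, not production code: it does not reject surrogate values or code points above U+10FFFF. Utf8EncodeDemo and encode are hypothetical names; in practice String.getBytes with StandardCharsets.UTF_8 does this for you.

```java
// Sketch: hand-written conversion from a Unicode code point to UTF-8 bytes.
// Utf8EncodeDemo and encode are hypothetical names; no input validation is done.
class Utf8EncodeDemo {
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        // U+4E2D, a Chinese character, becomes three bytes.
        for (byte b : encode(0x4E2D)) {
            System.out.printf("%02X ", b); // E4 B8 AD
        }
        System.out.println();
    }
}
```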

Finally, a brief summary:

By extending and reworking ASCII, the Chinese produced the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters.

There are too many Chinese characters, including traditional characters and all kinds of other characters, so the GBK encoding was produced. It includes the codes in GB2312 and extends them considerably.

China is a multi-ethnic country, and almost every ethnic group has its own independent writing system. To represent these characters, GBK was further extended into GB18030.

Every country encoded its own language the way China did, so a variety of encodings appeared. Without the corresponding encoding installed, a computer cannot interpret what the encoded content is meant to express.

Finally, an organization called ISO could stand it no longer. They created an encoding, Unicode, that is large enough to hold any character or symbol in the world. So as long as a computer supports Unicode, any text in the world that is saved in a Unicode encoding can be interpreted correctly by other computers.

For transmitting Unicode over networks, there are two standards, UTF-8 and UTF-16, which work in units of 8 bits and 16 bits respectively.

So some people may ask: since UTF-8 can store so many characters and symbols, why do so many people in China still use GBK and similar encodings?

Because UTF-8 and similar encodings take more space for Chinese text (three bytes per Chinese character in UTF-8 versus two in GBK), GBK and similar encodings remain an option when most of the users are in China.
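The size difference can be measured with a short experiment. This sketch assumes the running JDK ships the GBK charset, which is common but not guaranteed on minimal runtimes, so the code checks Charset.isSupported first. GbkVsUtf8Demo and encodedLength are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: compare the encoded size of the same Chinese text in UTF-8 and GBK.
// GbkVsUtf8Demo and encodedLength are hypothetical names for this example.
class GbkVsUtf8Demo {
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        System.out.println(encodedLength(text, StandardCharsets.UTF_8)); // 6

        // GBK is an optional charset; skip the comparison if it is missing.
        if (Charset.isSupported("GBK")) {
            System.out.println(encodedLength(text, Charset.forName("GBK"))); // 4
        } else {
            System.out.println("GBK charset not available in this runtime");
        }
    }
}
```

For pure ASCII text the two encodings take the same space, one byte per character, so the saving applies only to the Chinese portion of a file.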

If you are satisfied with this answer, please give it a like. Thank you!