
Character codes 32-127. Encoding of text information

Hello, dear blog readers. Today we will talk about where mojibake (krakozyabry) comes from on sites and in programs, about which text encodings exist and which of them should be used. Let us examine in detail the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern encodings of the Unicode consortium, UTF-16 and UTF-8.

To some, this information may seem unnecessary, but if you only knew how many questions reach me concerning precisely mojibake (an unreadable set of characters). Now I will have the opportunity to refer everyone to the text of this article so that they can track down their own mistakes independently. Well, get ready to absorb the information and try to follow the narration.

ASCII - the basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and during this time they managed to undergo quite a few changes. Historically, it all began with EBCDIC (a name rather awkward to pronounce in Russian), which made it possible to encode the letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters.

Still, the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, which in Russian is usually pronounced "aski"). It describes the 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

Into these 128 characters described by ASCII, some service characters also had to be squeezed: brackets, hash marks, asterisks, and so on. Actually, you can see them yourself:

It is these 128 characters from the initial version of ASCII that became the standard, and in any other encoding you are certain to meet them, standing in exactly this order.

But the fact is that with one byte of information you can encode not 128 but as many as 256 different values (two to the eighth power equals 256), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, besides the 128 basic characters, characters of a national encoding (for example, Russian) could also be encoded.

Here it is probably worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", if anyone studied it at an institute or at school). Each bit position corresponds to a power of two, starting with two to the zeroth power and going up to two to the seventh:

It is not difficult to understand that all possible combinations of zeros and ones in such a construction number only 256. Converting a number from the binary system to decimal is quite simple: you just need to add up all the powers of two over which ones stand.

In our example this works out to 1 (2 to the zeroth power) plus 8 (two to the third power), plus 32 (two to the fifth power), plus 64 (to the sixth), plus 128 (to the seventh). The total is 233 in the decimal number system. As you can see, everything is very simple.
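This arithmetic is easy to check in a couple of lines of Python (a minimal sketch; the byte 11101001 is the one from the example above):

```python
# The byte from the example: ones stand in positions 0, 3, 5, 6 and 7.
bits = "11101001"

# Add up the powers of two over which a one stands (position 0 is the rightmost bit).
value = sum(2 ** i for i, bit in enumerate(reversed(bits)) if bit == "1")

print(value)          # 1 + 8 + 32 + 64 + 128 = 233
print(int(bits, 2))   # the built-in conversion gives the same result
```

The built-in `int(..., 2)` does exactly the same summation for us.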

But if you look closely at the table with the ASCII characters, you will see that they are presented in hexadecimal notation. For example, the asterisk corresponds to the hexadecimal number 2A. You probably know that in the hexadecimal number system, besides decimal digits, the Latin letters from A (meaning ten) to F (meaning fifteen) are used.

Well then, to convert a binary number to hexadecimal, the following simple and visual method is used. Each byte of information is broken into two halves of four bits each, as shown in the screenshot above. In each half of the byte, the binary code can encode only sixteen values (two to the fourth power), which can easily be represented by a single hexadecimal digit.

Moreover, in the left half of the byte the powers must be counted starting from zero again, and not as shown in the screenshot. As a result, after some simple calculations, we get that the number E9 is encoded in the screenshot. I hope that the course of my reasoning and the solution of this puzzle were clear to you. Well, now let us continue, in fact, talking about text encodings.
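The split into half-bytes can also be sketched in Python, using the same byte E9 from the example:

```python
byte = 0b11101001          # the byte from the example, 233 in decimal

high = byte >> 4           # left half:  1110 -> 14 -> hexadecimal E
low = byte & 0x0F          # right half: 1001 ->  9 -> hexadecimal 9

print(high, low)           # 14 9
print(f"{byte:02X}")       # E9
```

Each half-byte (nibble) maps to exactly one hexadecimal digit, which is why the notation is so convenient.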

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8).

Initially it contained only 128 characters of the Latin alphabet, Arabic numerals and a few other things, but in the extended version it became possible to use all 256 values that can be encoded in one byte of information. That is, it became possible to add the letters of one's own language to ASCII.

Here it is necessary to digress once more to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector shapes (representations) of all kinds of characters (they live in font files) and a code that allows you to pull out of this set of vector shapes (the font file) exactly the character that needs to be inserted in the right place.

It is clear that the fonts are responsible for the vector shapes, while the operating system and the programs used in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text.

The program that displays this text on the screen (a text editor, a browser, etc.), when parsing the code, reads the encoding of the next character and looks for the corresponding vector shape in the required font file connected for displaying this text document. Everything is simple and banal.

So, to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must exist in the font used, and the character must be encodable in one byte in an extended ASCII encoding. That is why a whole bunch of such variants exists: for encoding the characters of the Russian language alone there are several varieties of extended ASCII.

For example, CP866 appeared first; it allowed the use of the characters of the Russian alphabet and was an extended version of ASCII.

That is, its upper part completely coincided with basic ASCII (128 characters of the Latin alphabet, digits and so on), which is shown in the screenshot just above, while the lower part of the CP866 code table had the form shown in the screenshot slightly below and allowed encoding another 128 characters (Russian letters and all kinds of pseudographics):

You can see that in the right column the numbers begin with 8, because the numbers from 0 to 7 refer to the basic part of ASCII (see the first screenshot). Thus, the Russian letter "М" in CP866 has the code 8C (it sits at the intersection of row 8 and column C in the hexadecimal number system), which can be written in one byte of information; and if a suitable font with Russian characters is available, this letter will be displayed in the text without problems.
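You can check this directly in Python, whose standard codecs include CP866 (a small sketch; the capital letter М occupies the byte 8C):

```python
# One byte with the hexadecimal value 8C, decoded via the CP866 code page.
print(b"\x8c".decode("cp866"))          # М (Cyrillic capital letter Em)

# The reverse direction: the whole letter fits into a single byte.
print("М".encode("cp866"))              # b'\x8c'
print(len("Привет".encode("cp866")))    # 6 - exactly one byte per character
```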

Where did this abundance of pseudographics in CP866 come from? The thing is that this encoding for Russian text was developed back in those years when graphical operating systems were not as widespread as they are now. And in DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the appearance of text, and therefore CP866, like all its other peers from the category of extended versions of ASCII, abounds in it.

CP866 was distributed by IBM, but besides it, a number of other encodings were developed for the characters of the Russian language; for example, KOI8-R can be attributed to the same type (extended ASCII):

The principle of its operation remains the same as that of the CP866 described earlier: each text character is encoded by one single byte. The screenshot shows the second half of the KOI8-R table, because the first half fully corresponds to basic ASCII, which is shown in the first screenshot in this article.

Among the features of the KOI8-R encoding, it can be noted that the Russian letters in its table do not stand in alphabetical order, as, for example, was done in CP866.

If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters are located in the same cells of the table as the consonant letters of the Latin alphabet from the first part of the table. This was done for the convenience of switching from Russian characters to Latin ones by discarding just one bit (two to the seventh power, or 128).
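This bit trick is easy to demonstrate in Python (a sketch: clearing the eighth bit of a KOI8-R byte leaves an ASCII letter with a similar sound, with the case flipped):

```python
# KOI8-R places Russian letters so that clearing bit 7 (value 128)
# yields a phonetically similar Latin letter.
for ru in "юабц":
    koi_byte = ru.encode("koi8-r")[0]     # one byte per character
    ascii_char = chr(koi_byte & 0x7F)     # discard the high bit
    print(ru, "->", ascii_char)           # ю -> @, а -> A, б -> B, ц -> C
```

This also explains the non-alphabetical order noted above: the Russian letters follow the order of their Latin "cousins", not the Cyrillic alphabet.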

Windows 1251 - a modern version of extended ASCII and where mojibake comes from

The further development of text encodings was driven by the fact that graphical operating systems were gaining popularity and the need for pseudographics in them was disappearing. As a result, a whole group of encodings arose which, in their essence, were still extended versions of ASCII (one text character is encoded by just one byte of information), but without the use of pseudographic characters.

These were the so-called ANSI encodings, developed under the auspices of the American National Standards Institute. Colloquially, the name Cyrillic was also used for the variant with Russian language support. An example is Windows 1251.

It differed favorably from the previously used CP866 and KOI8-R in that the place of the pseudographic characters in it was taken by the missing characters of Russian typography (apart from the stress mark), as well as by characters used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Because of such an abundance of Russian-language encodings, font manufacturers and software manufacturers constantly had headaches, while you and I, dear readers, often got that very notorious mojibake whenever there was confusion about which version was used in the text.

Very often it crept out when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables which, in fact, could not solve the problem at the root; and users often resorted to transliteration in their correspondence in order to avoid the notorious mojibake when using Russian encodings like CP866, KOI8-R or Windows 1251.

In essence, the mojibake appearing instead of Russian text was the result of using the wrong encoding for this language: one that did not match the encoding in which the text message had originally been encoded.

Suppose you try to display characters encoded in CP866 using the Windows 1251 code table: that very mojibake (a meaningless set of characters) will crawl out, completely replacing the text of the message.
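The effect is easy to reproduce in Python (a sketch: bytes written in CP866 are misread using the Windows 1251 table):

```python
original = "Привет"                      # "Hello" in Russian
raw = original.encode("cp866")           # stored as CP866 bytes

garbled = raw.decode("windows-1251")     # misread with the wrong code table
print(garbled)                           # a meaningless set of characters
```

The byte values themselves never change; only the table used to turn them back into letters does, which is exactly why the damage is reversible as long as you can guess the original encoding.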

A similar situation very often arises on forums or blogs, when text with Russian characters is mistakenly saved in an encoding other than the one used on the site by default, or in the wrong text editor, one that adds its own additions to the code, invisible to the naked eye.

In the end, many people grew tired of this situation with numerous encodings and constantly creeping mojibake, and the prerequisites appeared for creating a new universal variation that would replace all the existing ones and finally solve, at the root, the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - the universal encodings UTF-8, UTF-16 and UTF-32

The thousands of characters of the Southeast Asian language group could not possibly be described in the single byte of information allocated for encoding characters in the extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created through the collaboration of many leaders of the IT industry (those who produce software, make hardware and create fonts), who were interested in the emergence of a universal text encoding.

The first variation published under the auspices of the Unicode Consortium was UTF-32. The number in the name of the encoding means the number of bits used to encode one character. 32 bits make up the 4 bytes of information needed to encode one single character in the new universal UTF encoding.

As a result, the same file with text, encoded in the extended version of ASCII and in UTF-32, will in the latter case have a size (weight) four times larger. That is bad, but now we have the opportunity to encode a number of characters equal to two to the thirty-second power (billions of characters, which covers any really needed value with a colossal margin).

But many countries with languages of the European group had no need at all for such a huge number of characters in the encoding; yet when using UTF-32, they got, for nothing, a fourfold increase in the weight of text documents and, as a result, an increase in Internet traffic and the volume of stored data. That is a lot, and no one could afford such waste.

As a result of the development of Unicode, UTF-16 appeared, which turned out to be so successful that it was adopted by default as the basic space for all the characters we use. It uses two bytes to encode one character. Let's see how this thing looks.

In the Windows operating system, you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". As a result, a table opens with the vector shapes of all the fonts installed in your system. If in the "Advanced options" you choose the Unicode character set, you can see for each font separately the entire range of characters included in it.

By the way, by clicking on any of them, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:

How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the sixteenth power), and it is this number that was adopted as the basic space in Unicode. In addition, there are ways to encode about two million more characters with it, but these were limited to an extended space of a million text characters.
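These numbers are easy to verify in Python (a sketch; a character outside the basic space is stored as a surrogate pair of two 16-bit units):

```python
# The basic (BMP) space holds two to the sixteenth power of code positions.
print(2 ** 16)                          # 65536

# A letter from the basic space takes two bytes in UTF-16...
print(len("Я".encode("utf-16-le")))     # 2

# ...while a character beyond it takes a surrogate pair, i.e. four bytes.
print(len("\U0001D11E".encode("utf-16-le")))   # 4 (musical G clef, U+1D11E)
```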

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, say, programs only in English, because for them, after the transition from the extended version of ASCII to UTF-16, the weight of documents doubled (one byte per character in ASCII and two bytes for the same character in UTF-16).

It was precisely to satisfy everyone and everything that the Unicode consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in the name, it really does have a variable length: each character of the text can be encoded into a sequence of one to six bytes.

In practice, UTF-8 uses only the range from one to four bytes, because four bytes of code are enough to represent the entire Unicode code space. All Latin characters are encoded in one byte, just as in the good old ASCII.

What is noteworthy, in the case of encoding only Latin characters, even those programs that do not understand Unicode will still read what is encoded in UTF-8. That is, the basic part of ASCII simply carried over into this brainchild of the Unicode Consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and, for example, Georgian ones in three bytes. The Unicode Consortium, after creating UTF-16 and UTF-8, solved the main problem: now we have a single code space in fonts. And now their manufacturers can only fill it with vector shapes of text characters according to their strengths and capabilities.
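These byte lengths can be checked in Python in one line per alphabet (a sketch):

```python
# UTF-8 byte lengths grow with the code position of the character.
print(len("A".encode("utf-8")))      # 1 - Latin, same as ASCII
print(len("Я".encode("utf-8")))      # 2 - Cyrillic
print(len("ა".encode("utf-8")))      # 3 - Georgian
print(len("😀".encode("utf-8")))     # 4 - emoji, the practical maximum
```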

In the character table below, you can see that different fonts support different numbers of characters. Some fonts rich in Unicode characters can weigh quite a lot. But now they differ not in having been created for different encodings, but in whether or not the font manufacturer has filled the single code space with particular vector shapes to the end.

Mojibake instead of Russian letters - how to fix it

Let's now see how mojibake appears instead of text, or, in other words, how the correct encoding for Russian text is chosen. Actually, it is set in the program in which you create or edit this very text, or code using text fragments.

For editing and creating text files, I personally use what is, in my opinion, a very good editor, Notepad++. It can highlight the syntax of a good hundred programming and markup languages, and it can also be extended with plugins. Read a detailed review of this wonderful program at the link.

In the Notepad++ top menu there is an "Encoding" item, where you will be able to convert an existing variant into the one used on your site by default:

In the case of a site on Joomla 1.5 and above, as well as in the case of a blog on WordPress, you should choose the option UTF-8 without BOM in order to avoid the appearance of mojibake. So what is the BOM prefix?

The fact is that when the UTF-16 encoding was being developed, for some reason it was decided to attach to it the ability to write a character code both in direct byte order (for example, 0A15) and in reverse (150A). And so that programs would understand in which order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented, which amounted to adding a few extra bytes to the very beginning of documents (two in UTF-16, three in UTF-8).

The Unicode Consortium did not make the BOM necessary in the UTF-8 encoding, and therefore the addition of a signature (those very notorious extra three bytes at the beginning of a document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we must always choose the option without BOM (without signature). In this way you protect yourself from mojibake in advance.

What is noteworthy, some programs in Windows cannot do this (cannot save text in UTF-8 without BOM); for example, the same notorious Windows Notepad. It saves the document in UTF-8, but still adds the signature (three extra bytes) to its beginning. Moreover, these bytes are always the same: read the code in direct order. But on servers, because of this little thing, a problem can arise: mojibake will crawl out.
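What such a Notepad does can be reproduced in Python (a sketch: the "utf-8-sig" codec writes the same three-byte signature, while plain "utf-8" omits it):

```python
import codecs

text = "Привет"

with_bom = text.encode("utf-8-sig")      # what old Windows Notepad saved
without_bom = text.encode("utf-8")       # "UTF-8 without BOM"

print(codecs.BOM_UTF8)                             # b'\xef\xbb\xbf'
print(with_bom == codecs.BOM_UTF8 + without_bom)   # True

# Decoding with "utf-8-sig" strips the signature if it is present.
print(with_bom.decode("utf-8-sig") == text)        # True
```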

Therefore, never use the ordinary Windows Notepad to edit the documents of your site if you do not want mojibake to appear. I consider the best and simplest option to be the already mentioned Notepad++ editor, which has practically no drawbacks and consists of nothing but advantages.

In Notepad++, when choosing an encoding, you will also be able to convert text to the UCS-2 encoding, which is very close in essence to the Unicode standard. Notepad++ can also encode text in ANSI, i.e., with regard to the Russian language, this will be the Windows 1251 we have already described just above. Where does this information come from?

It is written in the registry of your Windows operating system: which encoding to choose in the case of ANSI, and which to choose in the case of OEM (for the Russian language it will be CP866). If you set another default language on your computer, these encodings will be replaced by similar ones from the ANSI or OEM category for that same language.

After you save a document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see its name in the lower right corner of the editor:

To avoid mojibake, in addition to the actions described above, it will be useful to write information about this encoding into the header of the source code of all the site's pages, so that no confusion arises on the server or the local host.

In general, in all hypertext markup languages other than HTML, a special XML declaration is used, which specifies the text encoding.

Before starting to parse the code, the browser finds out which version is used and how exactly the character codes of that language need to be interpreted. But what is noteworthy, if you save the document in the default Unicode, this XML declaration can be omitted (the encoding will be considered UTF-8 if there is no BOM, or UTF-16 if there is a BOM).

In the case of an HTML document, the meta element is used to specify the encoding; it is written between the opening and closing HEAD tags:

<head>
...
<meta charset="utf-8">
...
</head>

This notation differs quite a bit from the one adopted in earlier HTML standards, but it fully complies with the HTML 5 standard that is slowly being introduced, and it will be understood absolutely correctly by any browser currently in use.

In theory, it is better to place the META element specifying the HTML document's encoding as high as possible in the document header, so that by the moment the first character outside basic ASCII (which is always read correctly in any variation) is met in the text, the browser already has the information on how to interpret the codes of these characters.

Good luck to you! See you soon on the pages of the blog.


Symbol overlay

Thanks to the BS (backspace) character, one character can be printed over another on a printer. In ASCII, this was used to add diacritics to letters, for example:

  • a BS ' → á
  • a BS ` → à
  • a BS ^ → â
  • o BS / → ø
  • c BS , → ç
  • n BS ~ → ñ

Note: in old fonts, the apostrophe ' was drawn slanting to the left, and the tilde ~ was shifted upward, so they fit the role of an acute accent and a tilde above the letter quite well.

If the same character is superimposed on a character, the effect of a bold font is obtained, and if an underscore is superimposed on a character, underlined text is obtained.

  • a BS a → bold a
  • a BS _ → underlined a

Note: this is used, for example, in the man reference system.
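The man system still emits such sequences ("b BS b" for bold, "_ BS a" for underline); a filter in the spirit of the col -b utility can be sketched in Python:

```python
def strip_overstrike(line: str) -> str:
    """Remove printer overstrike sequences, keeping the last character struck."""
    out = []
    for ch in line:
        if ch == "\b":      # BS: the next character overprints the previous one
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)

# "bold" printed in bold by man: each letter is struck twice.
print(strip_overstrike("b\bbo\bol\bld\bd"))   # bold
# "a" underlined: underscore, backspace, letter.
print(strip_overstrike("_\ba"))               # a
```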

National variants of ASCII

The ISO 646 (ECMA-6) standard provides for the possibility of placing national characters in place of @ [ \ ] ^ ` { | } ~. In addition to this, £ may be placed in place of #, and ¤ in place of $. Such a system is well suited for European languages, where only a few additional characters are needed. The version of ASCII without national characters is called US-ASCII, or the "International Reference Version".

Subsequently, it turned out to be more convenient to use 8-bit encodings (code pages), where the lower half of the code table (0-127) is occupied by US-ASCII characters, and the upper half (128-255) by additional characters, including a set of national characters. Thus, until the ubiquitous introduction of Unicode, the upper half of the ASCII table was actively used to represent localized characters, the letters of the local language. The absence of a single standard for placing Cyrillic characters in the ASCII table caused many encoding problems (KOI-8, Windows-1251 and others). Other languages with non-Latin writing also suffered from the existence of several different encodings.

[The code chart that stood here did not survive conversion. It showed an early (1963) version of the ASCII table: the control characters (NUL, SOM, EOA, EOM, EQT, WRU, RU, BELL, BKSP, HT, LF, VT, FF, CR, SO, SI, DC0-DC4, ERR, SYNC, LEM, S0-S7), the space, punctuation marks and digits, and the uppercase and lowercase Latin letters together with ESC and DEL.]

On those computers where the minimum addressable unit of memory was a 36-bit word, 6-bit characters were initially used (1 word = 6 characters). After the transition to ASCII, such computers began to place either five 7-bit characters in one word (1 bit remained spare) or four 9-bit characters.

ASCII codes are also used to determine which key has been pressed during programming. For a standard QWERTY keyboard, the code table looks like this:

Encoding data on a computer means the process of transforming it into a form that allows more convenient transmission, storage or automatic processing of the data. Various tables are used for this purpose. The ASCII encoding is the first system, developed in the United States for working with English-language text, which subsequently spread throughout the world. The article presented below is devoted to its description, features, properties and further use.

Display and storing information in computer

The characters on a computer monitor or a mobile digital gadget are formed on the basis of sets of vector shapes of all kinds of characters and a code that allows finding among them exactly the character that needs to be inserted in the right place. This code is a bit sequence. Thus, each character must correspond to a unique set of zeros and ones standing in a definite order.

How it all began

Historically, the first computers were English-language. To encode symbolic information in them, it was enough to use only 7 bits of memory, whereas 1 byte, consisting of 8 bits, was allocated for this purpose. The number of characters understood by the computer in this case was 128. These characters included the English alphabet with its punctuation marks, digits and some special characters. The English-language seven-bit encoding with its corresponding table (code page), developed in 1963, was named the American Standard Code for Information Interchange. The abbreviation "ASCII" was usually used to designate it, and is still used to this day.

The transition to multilingualism

Over time, computers became widely used in non-English-speaking countries. In this regard, a need arose for encodings allowing the use of national languages. It was decided not to reinvent the wheel and to take ASCII as the basis. The code table in the new edition expanded significantly. The use of the 8th bit made it possible to translate 256 characters into computer language.

Description

The ASCII encoding has a table that is divided into 2 parts. Only its first half is considered the generally accepted international standard. The full table includes:

  • Characters with sequence numbers from 0 to 31, encoded by the sequences from 00000000 to 00011111. These are control characters that manage the process of outputting text to the screen or printer, sounding a signal, and so on.
  • Characters with numbers from 32 to 127, encoded by the sequences from 00100000 to 01111111, make up the standard part of the table. These include the space (No. 32), the letters of the Latin alphabet (lowercase and uppercase), the ten digits from 0 to 9, punctuation marks, brackets of various styles and other symbols.
  • Characters with sequence numbers from 128 to 255, encoded by the sequences from 10000000 to 11111111. These are the letters of national alphabets other than Latin. It is this alternative part of the ASCII table that is used to convert Russian characters into computer form.

Some properties

Among the features of the ASCII encoding is that the letters "A" to "Z" in the lower and upper cases differ by only one bit. This circumstance greatly simplifies case conversion, as well as checking whether a character belongs to a given range of values. In addition, all letters in the ASCII system are represented by their own sequence numbers in the alphabet, written with 5 digits in the binary number system, preceded by 010 for uppercase letters and 011 for lowercase ones.

Another feature of the ASCII encoding is the way the ten digits "0" to "9" are represented. Their codes begin with 0011 and end with the binary values of the digits themselves. Thus, 0101 is equivalent to the decimal number five, so the character "5" is written as 0011 0101. Relying on this, you can easily convert a binary-coded decimal number into an ASCII string by prepending the bit sequence 0011 to each nibble.
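Both properties are easy to check in Python with the standard ord and chr functions (a sketch):

```python
# Upper and lower case differ by a single bit (value 32, i.e. 0010 0000).
print(bin(ord("A")), bin(ord("a")))   # 0b1000001 0b1100001
print(chr(ord("a") ^ 0b0100000))      # A - flipping that bit converts the case

# The digits start with 0011: '5' is 0011 0101.
print(f"{ord('5'):08b}")              # 00110101
print(chr(0b0011_0000 | 5))           # 5 - the nibble 0101 plus the 0011 prefix
```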

"Unicode"

As you know, thousands of characters are required to display texts in the languages of the Southeast Asian group. Such a quantity cannot be described in any way in one byte of information, so even the extended versions of ASCII could no longer satisfy the increased needs of users from different countries.

Thus, the need arose to create a universal text encoding, the development of which, in cooperation with many leaders of the world IT industry, was undertaken by the Unicode Consortium. Its specialists created the UTF-32 system. In it, 32 bits, constituting 4 bytes of information, were allocated for encoding 1 character. The main drawback was a sharp increase in the amount of memory required, by as much as 4 times, which entailed many problems.

At the same time, for most countries with official languages belonging to the Indo-European group, a number of characters equal to 2^32 is more than redundant.

As a result of the further work of the specialists of the Unicode Consortium, the UTF-16 encoding appeared. It became the variant of converting symbolic information that suited everyone, both in the amount of memory required and in the number of encoded characters. That is why UTF-16 was accepted by default, and in it 2 bytes must be reserved for one character.

Even this rather advanced and successful version of Unicode had some drawbacks: after the transition from the extended version of ASCII to UTF-16, the weight of a document doubled.

In this regard, it was decided to use the variable-length encoding UTF-8. In it, each character of the source text is encoded as a sequence of 1 to 6 bytes.

Relationship with the American Standard Code for Information Interchange

All characters of the Latin alphabet in variable-length UTF-8 are encoded in 1 byte, just as in the ASCII encoding system.

A feature of UTF-8 is that, in the case of a text in Latin letters without the use of other characters, even programs that do not understand Unicode will still be able to read it. In other words, the basic part of the ASCII text encoding simply carried over into the new variable-length UTF. Cyrillic characters in UTF-8 occupy 2 bytes, and, for example, Georgian ones 3 bytes. The creation of UTF-16 and UTF-8 solved the main problem: the creation of a single code space in fonts. Since then, font manufacturers have only to fill the table with vector shapes of text characters according to their needs.

Different operating systems give preference to different encodings. To be able to read and edit texts typed in another encoding, programs for transcoding Russian text are used. Some text editors contain built-in transcoders and allow reading text regardless of the encoding.

Now you know how many characters there are in the ASCII encoding and how and why it was developed. Of course, today Unicode has become the most widespread encoding in the world. However, it must not be forgotten that it was created on the basis of ASCII, so the contribution of its developers to the IT field should be appreciated.

Let us recall some facts already known to us:

The set of characters with which a text is written is called an alphabet.

The number of characters in an alphabet is called its power.

The formula for determining the amount of information: N = 2^b,

where N is the power of the alphabet (the number of characters),

and b is the number of bits (the information weight of one character).

An alphabet with a power of 256 characters can hold almost all the necessary characters. Such an alphabet is called sufficient.

Since 256 = 2^8, the weight of 1 character is 8 bits.

The unit of measurement of 8 bits was given the name of 1 byte:

1 byte = 8 bits.

The binary code of each character in computer text occupies 1 byte of memory.
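The formula N = 2^b is easy to check in Python (a sketch):

```python
import math

b = 8                      # information weight of one character, in bits
N = 2 ** b                 # the power of the alphabet
print(N)                   # 256

# The reverse direction: how many bits does a 256-character alphabet need?
print(math.log2(256))      # 8.0
```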

In what way is text information represented in the computer's memory?

The encoding consists in the fact that each character is matched with a unique decimal code from 0 to 255, or with the corresponding binary code from 00000000 to 11111111. Thus, a person distinguishes characters by their appearance, and the computer by their code.

The convenience of byte-per-symbol encoding is obvious: the byte is the smallest addressable unit of memory, so the processor can access each character individually when processing text. On the other hand, 256 characters is quite enough to represent a wide variety of symbolic information.

Now the question arises: which eight-bit binary code should be assigned to each symbol?

Clearly this is a matter of convention; many encoding schemes could be devised.

The international standard for PCs became the ASCII table (American Standard Code for Information Interchange, pronounced "aski").

Only the first half of the table is the international standard, i.e. the symbols with numbers from 0 (00000000) to 127 (01111111).

Serial number | Binary code | Symbols
0 - 31 | 00000000 - 00011111 | Control characters. Their function is to control the output of text to the screen or printer, the sound signal, text markup, etc.
32 - 127 | 00100000 - 01111111 | Standard part of the table.
128 - 255 | 10000000 - 11111111 | Alternative part of the table.

The second half of the ASCII code table, called the code page (128 codes, from 10000000 to 11111111), can have different variants; each variant has its own number.


Note that in the encoding table the letters (uppercase and lowercase) are arranged in alphabetical order, and the digits are ordered by increasing value. This adherence to lexicographic order in the arrangement of symbols is called the principle of sequential coding of the alphabet.


The most widespread encoding at present is the Microsoft Windows encoding, abbreviated CP1251.

Since the end of the 1990s the problem of standardizing symbol coding has been addressed by the introduction of a new international standard called Unicode. This is a 16-bit encoding, i.e. each symbol is allocated 2 bytes of memory. The amount of occupied memory thereby doubles, but the code table can include up to 65536 characters. The complete specification of the Unicode standard includes all existing, extinct and artificially created alphabets of the world, as well as many mathematical, musical, chemical and other symbols.
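The two-bytes-per-symbol scheme described here corresponds to the UTF-16 form of Unicode; a quick Python check (sample characters chosen arbitrarily):

```python
# In UTF-16, characters of the Basic Multilingual Plane occupy
# exactly 2 bytes each ("utf-16-le" omits the byte-order mark).
for ch in "AЯऄ":
    encoded = ch.encode("utf-16-le")
    print(f"U+{ord(ch):04X}: {len(encoded)} bytes")
```

Every character here, Latin or not, occupies the same 2 bytes, which is exactly the doubling of memory the text mentions.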

Let's use the ASCII table to see how words look in the computer's memory.

Word | Memory representation
file | 01100110 01101001 01101100 01100101
disk | 01100100 01101001 01110011 01101011
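A short Python sketch decodes such 8-bit codes back into text (the byte values are taken from the table above):

```python
# Decode sequences of 8-bit ASCII codes back into words.
def decode_bits(groups):
    return "".join(chr(int(g, 2)) for g in groups)

word1 = decode_bits(["01100110", "01101001", "01101100", "01100101"])
word2 = decode_bits(["01100100", "01101001", "01110011", "01101011"])
print(word1, word2)  # -> file disk
```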

When text information is entered into a computer, the characters (letters, digits, signs) are encoded using code systems: sets of code tables laid out on the pages of the corresponding text-encoding standards. In such tables each character is assigned a specific numeric code in the hexadecimal or decimal number system; that is, code tables record the correspondence between character images and numeric codes and serve for encoding and decoding text information. When text is entered from the computer keyboard, each character is converted to a numeric code; when text information is sent to an output device (display, printer or plotter), the character's image is built from its numeric code. The assignment of a specific numeric code to a character is the result of agreement between the relevant organizations of different countries. At present there is no single universal code table that satisfies the national alphabets of all countries.

Modern code tables include an international part and a national part, i.e. they contain the letters of the Latin and national alphabets, digits, arithmetic and punctuation signs, mathematical and control symbols, and pseudographic symbols. The international part of a code table, based on the ASCII standard (American Standard Code for Information Interchange), encodes the first half of the table's symbols with numeric codes from 0 to 7F in hexadecimal, or from 0 to 127 in decimal. Codes from 0 to 20 in hexadecimal (0 to 32 in decimal) correspond to the function keys (F1, F2, F3, etc.) of the personal computer keyboard. Fig. 3.1 shows the international part of a code table based on the ASCII standard; the table cells are numbered in the decimal and hexadecimal number systems.

Fig. 3.1. International part of the code table (ASCII standard) with cell numbers given in the decimal (a) and hexadecimal (b) number systems


The national part of a code table contains the codes of a national alphabet; it is also called the character set table (charset).

At present several code tables (encodings) exist to support the letters of the Russian alphabet (Cyrillic). They are used by different operating systems, which is a significant drawback and in some cases leads to problems in decoding the numeric values of symbols. Table 3.1 lists the names of the code pages (standards) on which the Cyrillic code tables (encodings) are laid out.

Table 3.1

One of the first standards for Cyrillic coding on computers was KOI8-R. The national part of this standard's code table is shown in Fig. 3.2.

Fig. 3.2. National part of the KOI8-R code table


A code table laid out on page CP866 of the text-encoding standard is also currently used for Cyrillic coding; it is employed in the MS-DOS operating system and in MS-DOS sessions (Fig. 3.3, a).

Fig. 3.3. National part of the code table laid out on page CP866 (a) and on page CP1251 (b) of the text-encoding standard


The code table laid out on page CP1251 of the corresponding standard has become the most widespread for Cyrillic coding; it is used in the Windows family of operating systems from Microsoft (Fig. 3.3, b). In all the code tables presented, except the Unicode table, 8 binary digits (8 bits) are allocated for encoding one symbol.

At the end of the last century a new international standard, Unicode, appeared, in which one character is represented by a two-byte binary code. The adoption of this standard continues the development of a universal international standard and makes it possible to solve the problem of compatibility between national symbol encodings. This standard can encode 2^16 = 65536 different characters. Fig. 3.4 shows code table 0400 (the Russian alphabet) of the Unicode standard.
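The Cyrillic block starting at code point 0400 can be inspected from Python (a small sketch):

```python
# Cyrillic letters live in the Unicode block starting at U+0400;
# the capital letters А..Я occupy U+0410..U+042F.
for ch in "АБЯ":
    print(f"{ch} -> U+{ord(ch):04X}")

assert 0x0400 <= ord("А") <= 0x04FF  # inside the Cyrillic block
```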

Fig. 3.4. UNICODE standard code table 0400


Let us illustrate what has been said about the coding of text information with an example.

Example 3.1.

Encode the word "computer" as a sequence of decimal and hexadecimal numbers using the CP1251 encoding. What characters will be displayed in the CP866 and KOI8-R code tables if the resulting code is used?

The hexadecimal and binary code sequences of the word "computer" based on the CP1251 encoding table (see Fig. 3.3, b) look as follows:

This code sequence, read in the CP866 and KOI8-R encodings, produces the following symbols:
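The same experiment can be reproduced in Python, whose standard codecs cover all three code pages (the Russian word for "computer" is used; codec names as in the Python standard library):

```python
# Encode the Russian word "компьютер" with CP1251, then decode the raw
# bytes with CP866 and KOI8-R to see what a mismatched viewer displays.
word = "компьютер"
raw = word.encode("cp1251")
print("CP1251 bytes:", raw.hex())  # eaeeeceffcfef2e5f0

as_cp866 = raw.decode("cp866")     # what a CP866 viewer would show
as_koi8 = raw.decode("koi8_r")     # what a KOI8-R viewer would show
print(as_cp866)
print(as_koi8)                     # -> ЙНЛОЭЧРЕП
```

The bytes are unchanged throughout; only the decoding table differs, which is exactly why the same file looks like gibberish under the wrong encoding.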

Special converter programs are used to convert Russian-language text documents from one text-encoding standard to another. Converters are usually embedded in other programs. An example is the browser Internet Explorer (IE), which has a built-in converter. A browser is a special program for viewing the content of Web pages on the global Internet. We will use this program to confirm the character-display results obtained in Example 3.1. To do this, perform the following steps.

1. Start the Notepad program. In the Windows XP operating system, Notepad is started with the command: [Start button - Programs - Accessories - Notepad]. In the Notepad window that opens, type the word "computer" using the syntax of HTML (HyperText Markup Language), the hypertext document markup language used to create documents for the Internet. The text should look like this:

<H1>Computer</H1>

where <H1> and </H1> are tags (special constructs) of the HTML language for marking up headers. Fig. 3.5 shows the result of these actions.

Fig. 3.5. Text Display in the Notepad window


Save this text by executing the command: [File - Save As...] in an appropriate folder on the computer; when saving, give the file the name approx with the extension .html.

2. Start Internet Explorer with the command: [Start button - Programs - Internet Explorer]. When the program starts, the window shown in Fig. 3.6 appears.

Fig. 3.6. Offline access window


Select and activate the Work Offline button; the computer will then not connect to the global Internet. The main window of Microsoft Internet Explorer, shown in Fig. 3.7, will appear.

Fig. 3.7. Main window of Microsoft Internet Explorer


Execute the command: [File - Open]. A window will appear (Fig. 3.8) in which you should specify the file name and click OK, or click the Browse... button and find the file approx.html.

Fig. 3.8. Window "Open"


The main Internet Explorer window will take the form shown in Fig. 3.9: the word "computer" appears in the window. Next, using the program's top menu, execute the command: [View - Encoding - Cyrillic (DOS)]. After this command, the symbols shown in Fig. 3.10 will be displayed in the program window. After executing the command [View - Encoding - Cyrillic (KOI8-R)], the symbols shown in Fig. 3.11 will be displayed.

Fig. 3.9. Symbols displayed when encoding CP1251


Fig. 3.10. Symbols displayed when the CP866 encoding is turned on for a code sequence prepared in the CP1251 encoding


Fig. 3.11. Symbols displayed when the KOI8-R encoding is turned on for a code sequence prepared in the CP1251 encoding


Thus the character sequences obtained with Internet Explorer coincide with the character sequences obtained using the CP866 and KOI8-R code tables in Example 3.1.

3.2. Coding graphic information

Graphic information presented in the form of drawings, photographs, slides, moving images (animation, video), diagrams and drafts can be created and edited on a computer, being encoded appropriately. A rather large number of application programs for processing graphic information now exist, but they all implement three types of computer graphics: raster, vector and fractal.

If you look closely at a graphic image on a computer monitor, you can see a large number of multicolored dots (pixels, from the English picture element), which together form the graphic image. From this we can conclude that a graphic image in a computer is encoded and must be represented as a graphic file. A file is the basic structural unit of organization and storage of data in a computer, and in this case it must contain information on how to render this set of dots on the monitor screen.

Files created on the basis of vector graphics contain information in the form of mathematical relationships (mathematical functions describing linear dependencies) and the corresponding data on how to build the image of an object from line segments (vectors) when it is output to the computer monitor screen.

Files created on the basis of raster graphics store data about each individual point of the image. No complex mathematical calculations are required to display raster graphics: it is enough to obtain the data about each image point (its coordinates and color) and display it on the monitor screen.

In the process of encoding an image, its spatial sampling is performed: the image is divided into separate points and each point is assigned a color code (yellow, red, blue, etc.). Encoding the color of each point of a color graphic image relies on decomposing an arbitrary color into its principal components, for which three primary colors are used: red (from the English word Red, denoted by the letter R), green (Green, denoted G) and blue (Blue, denoted B). Any color of a point perceived by the human eye can be obtained by additive (proportional) mixing of the three primary colors: red, green and blue. This coding system is called the RGB color system. Graphic image files that use the RGB color system represent each point of the image as a color triplet: three numeric values R, G and B corresponding to the intensities of red, green and blue. The encoding of a graphic image is carried out by various technical means (scanner, digital camera, digital video camera, etc.), resulting in a raster image. When a color graphic image is reproduced on a color monitor screen, the color of each point (pixel) of the image is obtained by mixing the three primary colors R, G and B.
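For instance, a 24-bit RGB triplet can be packed into a single integer, one byte per channel (a minimal sketch; the helper names and the sample color are my own):

```python
# Pack an (R, G, B) triplet into one 24-bit value and unpack it again.
def pack_rgb(r, g, b):
    return (r << 16) | (g << 8) | b

def unpack_rgb(value):
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF

color = pack_rgb(255, 128, 0)  # an orange-like color
print(hex(color))              # -> 0xff8000
print(unpack_rgb(color))       # -> (255, 128, 0)
```

Each channel occupies 8 bits, so the triplet takes 24 bits per point, one common color depth for raster images.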

The quality of a raster image is determined by two main parameters: the resolution (the number of points horizontally and vertically) and the palette of colors used (the number of colors available for each point of the image). The resolution is specified as the number of points horizontally and vertically, for example 800 by 600 points.

Between the number of colors a raster image point can take and the amount of information needed to store the color of a point there is a dependence given by the relation (R. Hartley's formula):

N = 2^I, (3.1)

where I is the amount of information and N is the number of colors a point can take.

The amount of information required to store the color of a point is also called the color depth, or color quality.

So, if the number of colors defined for an image point is N = 256, then the amount of information needed to store it (the color depth), in accordance with formula (3.1), is I = 8 bits.
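In code, the color depth is simply the base-2 logarithm of the palette size (a small sketch; the function name is my own):

```python
import math

# Color depth I (bits per point) for a palette of N colors: N = 2**I.
def color_depth(n_colors):
    return int(math.log2(n_colors))

print(color_depth(256))       # -> 8
print(color_depth(65536))     # -> 16
print(2 ** color_depth(256))  # back to the palette size: 256
```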

Computers use various graphic modes of the monitor to display graphic information. Note that besides the graphic mode there is also a text mode, in which the monitor screen is conventionally divided into 25 rows of 80 characters each. Graphic modes are characterized by the screen resolution and the quality of color reproduction (color depth). To set the graphic mode of the screen in MS Windows XP, execute the command: [Start button - Settings - Control Panel - Display]. In the "Display Properties" dialog box (Fig. 3.12), select the "Settings" tab and use the screen-resolution slider to choose a suitable resolution (800 by 600 points, 1024 by 768 points, etc.). The color-quality list lets you select the color depth: "Highest (32 bit)", "Medium (16 bit)", etc.; the number of colors available to each image point will then be, respectively, 2^32 (4294967296), 2^16 (65536), and so on.

Fig. 3.12. Dialog box "Properties: Screen"


To implement each of the graphic modes, the monitor screen requires a certain amount of the computer's video memory. The required video memory volume V is determined from the relation

V = K · I, (3.2)

where K is the number of image points on the monitor screen (K = A · B), A is the number of points horizontally, B is the number of points vertically, and I is the amount of information (the color depth).

So, if the monitor screen has a resolution of 1024 by 768 points and a palette of 65536 colors, then the color depth according to formula (3.1) is I = log2 65536 = 16 bits, the number of image points is K = 1024 × 768 = 786432, and the required video memory volume according to (3.2) is

V = 786432 · 16 bits = 12582912 bits = 1572864 bytes = 1536 KB = 1.5 MB.
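The same calculation can be written as a Python function following formulas (3.1) and (3.2) (the function name is my own):

```python
import math

# Required video memory V = A * B * I, where I = log2(number of colors).
def video_memory_bytes(width, height, n_colors):
    depth_bits = int(math.log2(n_colors))  # color depth I
    total_bits = width * height * depth_bits
    return total_bits // 8                 # bits -> bytes

v = video_memory_bytes(1024, 768, 65536)
print(v, "bytes =", v / 1024, "KB =", v / 1024 / 1024, "MB")
# -> 1572864 bytes = 1536.0 KB = 1.5 MB
```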

In conclusion, it should be noted that besides the characteristics listed, the most important characteristics of a monitor are the geometric dimensions of its screen and of its image points. The screen size is specified by the length of its diagonal. The diagonal of a monitor is given in inches (1 inch = 1" = 25.4 mm) and takes values such as 14", 15", 17", 21", etc. Modern monitor manufacturing technology can provide an image point size of 0.22 mm.

Thus, for each monitor there is a physically maximum possible screen resolution, determined by the size of its diagonal and the size of an image point.

Exercises for self-execution

1. Using MS Excel, convert the ASCII, CP866, CP1251 and KOI8-R code tables into tables of the following form: in the cells of the first column write, in alphabetical order, the uppercase and then the lowercase letters of the Latin and Cyrillic alphabets; in the cells of the second column, the codes corresponding to the letters in the decimal number system; in the cells of the third column, the corresponding codes in the hexadecimal number system. The codes should be taken from the corresponding code tables.

2. Encode and record as sequences of numbers in the decimal and hexadecimal number systems the following words:

a) Internet Explorer; b) Microsoft Office; c) CorelDRAW.

Perform the coding using the rearranged ASCII encoding table obtained in the previous exercise.

3. Using the rearranged KOI8-R encoding table, decode the following sequences of numbers written in the hexadecimal number system:

a) Fc CB DA C9 D3 D4 C5 CE C3 C9 D1;

b) EB CF CE C6 CF D2 CD C9 DA CD;

c) Fc CB D3 D0 D2 C5 D3 C9 CF CE C9 DA CD.

4. How will the word "cybernetics", recorded in the CP1251 encoding, be displayed when the CP866 and KOI8-R encodings are used? Check the results with Internet Explorer.

5. Using the code table shown in Fig. 3.1, a, decode the following code sequences written in the binary number system:

a) 01010111 01101111 01110010 01100100;

b) 01000101 01111000 01100011 01100101 01101100;

c) 01000001 01100011 01100011 01100101 01110011 01110011.

6. Determine the information volume of the word "economy" encoded using the CP866, CP1251, Unicode and KOI8-R code tables.

7. Determine the information volume of the file obtained by scanning a color image measuring 12 by 12 cm. The resolution of the scanner used for this image is 600 dpi. The scanner sets the color depth of an image point to 16 bits.

A scanner resolution of 600 dpi (dots per inch) means that on a segment 1 inch long the scanner can distinguish 600 points.

8. Determine the information volume of the file obtained by scanning a color A4 image. The resolution of the scanner used for this image is 1200 dpi. The scanner sets the color depth of an image point to 24 bits.

9. Determine the number of colors in the palette at a color depth of 8, 16, 24 and 32 bits.

10. Determine the required volume of video memory for the graphic screen modes 640 by 480, 800 by 600, 1024 by 768 and 1280 by 1024 points at image-point color depths of 8, 16, 24 and 32 bits. Summarize the results in a table. Develop an MS Excel program to automate the calculations.

11. Determine the maximum number of colors that may be used to store an image of 32 by 32 points if the computer has allocated 2 KB of memory for the image.

12. Determine the maximum possible screen resolution of a monitor with a 15" diagonal and an image point size of 0.28 mm.

13. Which graphic screen modes can be supported by video memory with a volume of 64 MB?

Contents

I. The history of information coding
II. Information coding
III. Coding of text information
IV. Types of encoding tables
V. Calculation of the amount of text information
Bibliography

I. The history of information coding

Humanity has used encryption (coding) of text from the very moment the first secret information appeared. Here are several techniques for coding text that were invented at various stages in the development of human thought:

cryptography: secret writing, a system of altering letters to make a text incomprehensible to the uninitiated;

the Morse code, an uneven-length telegraph code in which each letter or sign is represented by its own combination of short elementary pulses of electric current (dots) and elementary pulses of triple duration (dashes);

sign language, a gesture language used by people with hearing impairments.

One of the earliest known encryption methods bears the name of the Roman emperor Julius Caesar (1st century BC). The method is based on replacing each letter of the plaintext with the letter a fixed number of positions further along the alphabet, with the alphabet read in a circle, i.e. the letter after the last one is the first. Thus the word "byte", shifted two characters to the right, is encoded as "davg". To decrypt, each letter of the ciphertext is replaced by the letter two positions to its left.
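A minimal Python sketch of this shift cipher over the Latin alphabet (the function name is my own):

```python
# Caesar cipher: shift each letter by `key` positions, wrapping around
# the alphabet; non-letter characters are left unchanged.
def caesar(text, key):
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr(base + (ord(ch) - base + key) % 26))
        else:
            result.append(ch)
    return "".join(result)

print(caesar("byte", 2))   # -> davg
print(caesar("davg", -2))  # decryption shifts back -> byte
```

Decryption is the same operation with the opposite key, which matches the "two positions to the left" rule above.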

II. Information coding

A code is a set of conventional designations (or signals) for recording (or transmitting) certain predefined concepts.

Information coding is the process of forming a particular representation of information. In a narrower sense, the term "coding" often means the transition from one form of representing information to another that is more convenient for storage, transmission or processing.

Usually each image, when encoded (sometimes one says encrypted), is represented by a separate sign.

A sign is an element of a finite set of mutually distinct elements.

A computer can process text information. When text is entered into the computer, each letter is encoded by a certain number, and when it is output to an external device (screen or printer), images of the letters are built from these numbers for human perception. The correspondence between the set of letters and the numbers is called a character encoding.

As a rule, all numbers in a computer are represented with zeros and ones (rather than the ten digits familiar to people). In other words, computers usually operate in the binary number system, because the processing devices are then much simpler to build. Numbers can be entered into the computer and output for human reading in the usual decimal form; all the necessary conversions are performed by the programs running on the computer.

III. Coding of text information

The same information can be presented (encoded) in several forms. With the appearance of computers it became necessary to encode all the types of information dealt with both by individuals and by humanity as a whole. But humanity began solving the task of encoding information long before computers appeared. The grand achievements of humanity, writing and arithmetic, are nothing other than systems for encoding speech and numeric information. Information never appears in pure form; it is always presented somehow, encoded somehow.

Binary coding is one of the common ways of representing information. In computing machines, in robots and in numerically controlled machine tools, as a rule, all the information the device deals with is encoded as words of the binary alphabet.

Since the late 1960s, computers have been used more and more for processing text information, and at present the bulk of the world's personal computers (and most of their time) is occupied with processing text. All these types of information are represented in the computer in binary code, i.e. an alphabet with a power of two is used (only two symbols, 0 and 1). This is because it is convenient to represent information as a sequence of electrical pulses: a pulse is absent (0) or present (1).

Such coding is called binary, and the logical sequences of zeros and ones are called machine language.

From the computer's point of view, a text consists of individual characters: not only letters (uppercase or lowercase, Latin or Russian) but also digits, punctuation marks, special symbols such as "=" and "(", and even (pay special attention!) the spaces between words.

Texts are entered into the computer's memory from the keyboard, whose keys bear letters, digits, punctuation marks and other symbols. Into RAM they pass in binary code: each symbol is represented by an 8-bit binary code.

Traditionally, an amount of information equal to 1 byte (i = 1 byte = 8 bits) is used to encode one character. Using the formula that links the number of possible events K with the amount of information i, one can calculate how many different characters can be encoded (treating characters as possible events): K = 2^i = 2^8 = 256, i.e. an alphabet with a power of 256 characters can be used to represent text information.

This number of characters is quite sufficient for representing text information, including the uppercase and lowercase letters of the Russian and Latin alphabets, digits, signs, graphic symbols, etc.

Coding consists in assigning each symbol a unique decimal code from 0 to 255, or the corresponding binary code from 00000000 to 11111111. A person distinguishes characters by their shape; the computer distinguishes them by their code.

The convenience of byte-per-symbol encoding is obvious: the byte is the smallest addressable unit of memory, so the processor can access each character individually when processing text. On the other hand, 256 characters is quite enough to represent a wide variety of symbolic information.

When a symbol is output to the computer screen, the reverse process, decoding, is performed: the symbol's code is converted into its image. Importantly, the assignment of a specific code to a symbol is a matter of agreement, which is fixed in a code table.

Now the question arises which eight-bit binary code to assign to each symbol. Clearly this is a matter of convention; many encoding schemes can be devised.

All the symbols of the computer alphabet are numbered from 0 to 255. Each number corresponds to an eight-bit binary code from 00000000 to 11111111: this code is simply the symbol's sequence number written in the binary number system.

IV. Types of encoding tables

A table in which all the characters of the computer alphabet are put in correspondence with their sequence numbers is called an encoding table.

Different types of computers use different encoding tables.

The ASCII code table (American Standard Code for Information Interchange) has been adopted as the international standard; it encodes the first half of the characters with numeric codes from 0 to 127 (codes from 0 to 32 are assigned not to symbols but to function keys).

The ASCII code table is divided into two parts.

Only the first half of the table is the international standard, i.e. the symbols with numbers from 0 (00000000) to 127 (01111111).

Structure of the ASCII encoding table

Serial number | Binary code | Symbols
0 - 31 | 00000000 - 00011111 | Control characters. Symbols with numbers from 0 to 31 are called control characters; their function is to control the output of text to the screen or printer, the sound signal, text markup, etc.
32 - 127 | 00100000 - 01111111 | Standard part of the table (English). It includes the lowercase and uppercase letters of the Latin alphabet, decimal digits, punctuation marks, all kinds of brackets, commercial and other symbols. Symbol 32 is the space, i.e. an empty position in the text; all the others are represented by particular signs.
128 - 255 | 10000000 - 11111111 | Alternative part of the table (Russian).

The second half of the ASCII code table, called the code page (128 codes, from 10000000 to 11111111), can have different variants; each variant has its own number.

The code page is primarily used to accommodate national alphabets other than Latin. In Russian national encodings the symbols of the Russian alphabet are placed in this part of the table.

The first half of the ASCII codes table

Note that in the encoding table the letters (uppercase and lowercase) are arranged in alphabetical order, and the digits are ordered by increasing value. This adherence to lexicographic order in the arrangement of symbols is called the principle of sequential coding of the alphabet.

The principle of sequential coding is also observed for the letters of the Russian alphabet.

The second half of the ASCII codes table

Unfortunately, there currently exist five different Cyrillic encodings (KOI8-R, Windows, MS-DOS, Macintosh and ISO). Because of this, problems often arise when transferring Russian text from one computer to another, from one software system to another.

Chronologically, one of the first standards for coding Russian letters on computers was KOI8 ("information interchange code, 8-bit"). This encoding was used in the 1970s on computers of the ES EVM series, and from the mid-1980s it began to be used in the first Russified versions of the UNIX operating system.

From the early 1990s, the era of the MS-DOS operating system's dominance, dates the CP866 encoding ("CP" stands for "Code Page").

Apple computers running the Mac OS operating system use their own Mac encoding.

In addition, the International Organization for Standardization (ISO) has approved another encoding, called ISO 8859-5, as a standard for the Russian language.

The most widespread encoding at present is the Microsoft Windows encoding, abbreviated CP1251. It was introduced by Microsoft, and given the broad distribution of that company's operating systems and other software products in the Russian Federation, it has become widely adopted.

Since the end of the 1990s the problem of standardizing symbol coding has been addressed by the introduction of a new international standard called Unicode.

This is a 16-bit encoding, i.e. each symbol is allocated 2 bytes of memory. The amount of occupied memory thereby doubles, but the code table can include up to 65536 characters. The complete specification of the Unicode standard includes all existing, extinct and artificially created alphabets of the world, as well as many mathematical, musical, chemical and other symbols.

Internal representation of words in computer memory using the ASCII table

It sometimes happens that a text consisting of Russian letters, received from another computer, cannot be read: some kind of "abracadabra" appears on the monitor screen. This happens because the computers use different encodings of the symbols of the Russian language.

Thus each encoding is specified by its own code table. As can be seen from the table, the same binary code is matched to different characters in different encodings.

For example, in the CP1251 encoding the sequence of numeric codes 221, 194, 204 forms the word "ЭВМ" ("computer"), whereas in other encodings it will be a meaningless set of characters.
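This is easy to verify in Python (a minimal check of the codes quoted above; codec names as in the Python standard library):

```python
# The same three bytes decode to different characters under
# different code pages.
codes = bytes([221, 194, 204])
for encoding in ("cp1251", "cp866", "koi8_r"):
    print(encoding, "->", codes.decode(encoding))
```

Only under CP1251 do the bytes spell a meaningful Russian word; the other code pages map them to unrelated characters.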

Fortunately, in most cases the user need not worry about transcoding text documents, because this is done by special converter programs built into applications.

V. Calculation of the amount of text information

Task 1: Encode the word "ROME" using the KOI8-R and CP1251 encoding tables.

Solution:

Task 2: Assuming that each character is encoded by one byte, estimate the information volume of the following sentence:

"My uncle, of most honest rules,

When he in earnest fell ill,

He forced respect upon himself

And could devise nothing better."

Solution: this phrase contains 108 characters, counting punctuation marks, quotes and spaces. Multiplying this number by 8 bits gives 108 × 8 = 864 bits.

Task 3: Two texts contain the same number of characters. The first text is written in Russian, and the second in the language of the Naguri tribe, whose alphabet consists of 16 characters. Which text carries more information?

Solution:

1) I = K * a (the information volume of a text equals the number of characters times the information weight of one symbol).

2) Since both texts contain the same number of characters (K), the difference depends on the information weight of one symbol of the alphabet (a).

3) 2^a1 = 32, i.e. a1 = 5 bits; 2^a2 = 16, i.e. a2 = 4 bits.

4) I1 = K * 5 bits, I2 = K * 4 bits.

5) Thus, the text written in Russian carries 5/4 times more information.
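The same reasoning in code; the base-2 logarithm of the alphabet power gives the information weight of one symbol:

```python
from math import log2

a_russian = log2(32)  # 5 bits per symbol
a_naguri = log2(16)   # 4 bits per symbol

# For equal character counts K, the volumes differ by the ratio of weights.
print(a_russian / a_naguri)  # 1.25, i.e. 5/4
```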

Task 4: A message containing 2048 characters occupies 1/512 of a megabyte. Determine the power of the alphabet.

Solution:

1) I = 1/512 * 1024 * 1024 * 8 = 16384 bits; the information volume converted into bits.

2) a = I / K = 16384 / 2048 = 8 bits per symbol of the alphabet.

3) N = 2^8 = 256 characters; the power of the alphabet used.
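The arithmetic of the three steps can be checked directly:

```python
MB_BITS = 1024 * 1024 * 8   # bits in one megabyte

i = MB_BITS // 512          # message volume in bits
k = 2048                    # number of characters in the message
a = i // k                  # information weight of one character
n = 2 ** a                  # power of the alphabet

print(i, a, n)              # 16384 8 256
```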

Task 5: A Canon LBP laser printer prints at an average of 6.3 Kbit per second. How long will it take to print an 8-page document, if one page contains on average 45 lines of 70 characters each (1 character = 1 byte)?

Solution:

1) Find the amount of information on 1 page: 45 * 70 * 8 bits = 25200 bits.

2) Find the amount of information on 8 pages: 25200 * 8 = 201600 bits.

3) Convert to common units of measurement. Translate Kbits into bits: 6.3 * 1024 = 6451.2 bits/s.

4) Find the print time: 201600 / 6451.2 = 31.25, about 31 seconds.
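The four steps, line for line, in Python:

```python
page_bits = 45 * 70 * 8     # information on one page: 25200 bits
doc_bits = page_bits * 8    # information on 8 pages: 201600 bits
speed = 6.3 * 1024          # printer speed in bits per second

print(doc_bits / speed)     # about 31 seconds
```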


Material for self-study for Lecture 2

Encoding ASCII.

The ASCII encoding table (ASCII, American Standard Code for Information Interchange).

In total, the ASCII encoding table (Figure 1) can encode 256 different characters. The table is divided into two parts: the main part (codes from 00h to 7Fh) and an extended part (from 80h to FFh; the letter h indicates that the codes are given in the hexadecimal number system).

Figure 1

One character from the table is allotted 8 bits (1 byte). When processing text information, one byte may contain the code of some symbol: a letter, a digit, a punctuation mark, an operation sign, etc. Each character corresponds to its own code in the form of an integer. All the codes are collected in special tables called encoding tables. With their help, a symbol's code is converted into its visible representation on the monitor screen. As a result, any text in computer memory is represented as a sequence of bytes containing character codes.

For example, the word Hello! will be encoded as follows (Table 1).

Table 1

Symbol:        H        e        l        l        o        !
Decimal code:  72       101      108      108      111      33
Binary code:   01001000 01100101 01101100 01101100 01101111 00100001
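The codes in Table 1 can be verified with a couple of lines of Python; ord() returns a character's code, and format(code, "08b") gives its 8-bit binary notation:

```python
# Print each character of the word with its decimal and binary ASCII code.
for ch in "Hello!":
    code = ord(ch)
    print(ch, code, format(code, "08b"))
```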

Figure 1 shows the characters included in the standard (English) and extended (Russian) ASCII encodings.

The first half of the ASCII table is standardized. It contains the control codes (from 00h to 1Fh, plus 7Fh); these codes are withdrawn from the table, as they do not correspond to text elements. It also contains punctuation marks and mathematical signs: 21h is !, 26h is &, 28h is (, 2Bh is +, ..., as well as uppercase and lowercase Latin letters: 41h is A, 61h is a.

The second half of the table contains national fonts and pseudographic symbols, from which tables can be built, as well as special mathematical signs. The lower part of the encoding table can be replaced using appropriate drivers (auxiliary control programs). This technique makes it possible to use several fonts and typefaces.

For each symbol code the display must show the symbol's image: not the digital code itself, but the corresponding picture, since each symbol has its own shape. The shape of each character is stored in a special display memory, the character generator. On the screen of the IBM PC, for example, a symbol is drawn using points that form a symbol matrix. Each pixel in such a matrix is an image element and can be bright or dark. A dark point is encoded by the number 0, a bright one by 1. If the dark pixels in the matrix field are depicted by dots and the bright ones by asterisks, the shape of the symbol can be portrayed graphically.
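This dot-and-asterisk picture is easy to imitate; a sketch with a hypothetical 8x8 glyph for the letter T (the bit rows below are made up for illustration, not taken from any real character generator):

```python
# Each byte is one row of the matrix: bit 1 is a bright pixel, bit 0 a dark one.
GLYPH_T = [
    0b11111111,
    0b00011000,
    0b00011000,
    0b00011000,
    0b00011000,
    0b00011000,
    0b00011000,
    0b00000000,
]

for row in GLYPH_T:
    # Dark pixels as dots, bright pixels as asterisks.
    print(format(row, "08b").replace("0", ".").replace("1", "*"))
```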

People in different countries use symbols to write the words of their native languages. Nowadays most applications, including email systems and web browsers, are purely 8-bit, that is, they can display and correctly interpret 8-bit characters according to the ISO-8859-1 standard.

There are more than 256 characters in the world (counting Cyrillic, Arabic, Chinese, Japanese, Korean and Thai), and new symbols keep appearing. This creates the following problems for many users:

It is not possible to use characters from different encoding sets in the same document. Since each text document uses its own encoding set, automatic text recognition becomes very difficult.

New characters appear (for example, the euro sign), and so ISO developed a new standard, ISO-8859-15, which is very similar to ISO-8859-1. The difference is as follows: symbols for old currencies that are no longer in use were removed from the ISO-8859-1 encoding table to make room for newly introduced characters (such as the euro). As a result, identical documents may be stored on users' disks in different encodings. The solution to these problems is the adoption of a single international encoding set, called universal coding, or Unicode.
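The swap is visible in code: in ISO-8859-15 the euro sign took slot A4, which in ISO-8859-1 held the generic currency sign:

```python
print("€".encode("iso8859_15"))           # the euro encodes as byte A4

try:
    "€".encode("iso8859_1")               # no euro in the older table
except UnicodeEncodeError:
    print("no euro sign in ISO-8859-1")

print(bytes([0xA4]).decode("iso8859_1"))  # the old occupant of slot A4
```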

Encoding Unicode.

The standard was proposed in 1991 by the non-profit organization Unicode Consortium (Unicode Inc.). Applying this standard makes it possible to encode a very large number of symbols from different writing systems: Chinese characters, mathematical symbols, letters of the Greek alphabet, Latin and Cyrillic can coexist in Unicode documents, and switching between code pages becomes unnecessary.

The standard consists of two main sections: the universal character set (UCS, Universal Character Set) and the family of encodings (UTF, Unicode Transformation Format). The universal character set defines a definite correspondence of characters to codes, elements of the code space representing non-negative integers. The encoding family defines the machine representation of a sequence of UCS codes.

The Unicode standard was designed to create a single encoding for the symbols of all modern and many ancient written languages. Each symbol in this standard is encoded with 16 bits, which allows it to cover incomparably more characters than the 8-bit encodings adopted earlier. Another important difference between Unicode and other encoding systems is that it not only assigns a unique code to each symbol but also defines various characteristics of that symbol, for example:

    symbol type (uppercase letter, lowercase letter, digit, punctuation mark, etc.);

    symbol attributes (display left to right or right to left, space, line break, etc.);

    appropriate uppercase or lowercase letter (for lowercase and uppercase letters, respectively);

    the corresponding numeric value (for digital characters).

The entire range of codes from 0 to FFFF is divided into several standard subsets, each of which corresponds either to the alphabet of some language or to a group of special characters with similar functions. The following scheme contains the full list of Unicode 3.0 subsets (Figure 2).

Figure 2.

The Unicode standard is the basis for storing text in many modern computer systems. However, it is not compatible with most Internet protocols, since its codes may contain any byte values, while the protocols usually use the bytes 00 - 1F and FE - FF as service bytes. To achieve compatibility, several Unicode transformation formats (UTFs, Unicode Transformation Formats) were developed, of which the most common today is UTF-8. This format defines the following rules for converting each Unicode code into a sequence of bytes (from one to three) suitable for transport by Internet protocols.

0xxxxxxx (codes 0 - 7F), 110yyyyy 10xxxxxx (codes 80 - 7FF), 1110zzzz 10yyyyyy 10xxxxxx (codes 800 - FFFF). Here x, y, z denote the bits of the source code, which are taken starting from the least significant bit and entered into the result bytes from right to left until all the indicated positions are filled.
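The two-byte rule can be reproduced by hand; a minimal sketch that packs an 11-bit code into the 110xxxxx 10xxxxxx pattern and checks the result against Python's built-in UTF-8 codec:

```python
def utf8_two_bytes(code: int) -> bytes:
    # Valid only for code points that need exactly two UTF-8 bytes.
    assert 0x80 <= code <= 0x7FF
    byte1 = 0b11000000 | (code >> 6)           # marker 110 + top 5 bits
    byte2 = 0b10000000 | (code & 0b00111111)   # marker 10 + low 6 bits
    return bytes([byte1, byte2])

cp = ord("Я")                      # U+042F, Cyrillic capital Ya
print(utf8_two_bytes(cp))          # b'\xd0\xaf'
print("Я".encode("utf-8"))         # the built-in codec agrees
```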

The further development of the Unicode standard is associated with the addition of new language planes, i.e. symbols in the ranges 10000 - 1FFFF, 20000 - 2FFFF, etc., where it is planned to include encodings for the scripts of dead languages that did not fit into the table above. A new format, UTF-16, was developed to encode these additional characters.

Thus, there are 4 main ways of encoding Unicode characters into bytes:

UTF-8: 128 characters are encoded with one byte (ASCII characters), 1920 characters with 2 bytes (Roman, Greek, Cyrillic, Coptic, Armenian, Hebrew and Arabic symbols), and 63488 characters with 3 bytes (Chinese, Japanese and others). The remaining 2,147,418,112 characters (not yet used) can be encoded with 4, 5 or 6 bytes.

UCS-2: each symbol is represented by 2 bytes. This encoding includes only the first 65,535 characters of the Unicode format.

UTF-16: an extension of UCS-2 that covers 1,114,112 Unicode characters. The first 65,535 characters are represented by 2 bytes, the rest by 4 bytes.

UCS-4: each character is encoded by 4 bytes.
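The byte counts of these schemes are easy to compare in Python (utf-32 corresponds to UCS-4 here; the "-le" suffix merely omits the byte-order mark):

```python
# One character from each size class, from plain ASCII to outside the BMP.
for ch in ("A", "Я", "€", "𝄞"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3 or 4 bytes
          len(ch.encode("utf-16-le")),  # 2 bytes, or 4 outside the BMP
          len(ch.encode("utf-32-le")))  # always 4 bytes
```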

Unicode is a character encoding standard. Simply put, it is a table of correspondence between text characters (digits, letters, punctuation elements) and binary codes. The computer understands only sequences of zeros and ones. For it to know what exactly should be displayed on the screen, each symbol must be assigned its own unique number. In the eighties, characters were encoded with one byte, that is, eight bits (each bit is 0 or 1). It thus turned out that one table (also called an encoding or character set) can hold only 256 characters. This may not be enough even for one language. Therefore many different encodings appeared, and the confusion between them often meant that strange garbled characters ("krakozyabry") appeared on the screen instead of readable text. A unified standard was needed, and Unicode became it. The most widely used encoding, UTF-8 (Unicode Transformation Format), uses from 1 to 4 bytes to represent a symbol.

Symbols

Symbols in Unicode tables are numbered with hexadecimal numbers. For example, the Cyrillic capital letter М is denoted U+041C. This means that it stands at the intersection of row 041 and column C. It can simply be copied and then pasted somewhere. So as not to rummage through the multi-kilometer list, you should use the search. On a symbol's page you will see its Unicode number and the way it is drawn in different fonts. You can also enter the sign itself into the search string, even if a square is drawn in its place, if only to find out what it was. Also, this site has special (and random) sets of icons of the same type, collected from different sections for convenient use.
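The same lookup can be done without any website, using Python's standard unicodedata module:

```python
import unicodedata

ch = chr(0x041C)                   # build the character from its code point
print(ch)                          # the Cyrillic capital letter М
print(unicodedata.name(ch))        # its official Unicode name
print(f"U+{ord(ch):04X}")          # and back from the character to its number
```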

The Unicode standard is international. It includes signs of almost all the world's writing systems, including those no longer in use: Egyptian hieroglyphs, Germanic runes, Mayan writing, cuneiform and the alphabets of ancient states. Designations of measures and weights, musical notation and mathematical concepts are also represented.

The Unicode Consortium itself does not invent new symbols. Characters that find use in society are added to the tables. For example, the ruble sign was actively used for six years before it was added to Unicode. The emoji pictograms also first gained widespread use in Japan before they were included in the encoding. But trademarks and company logos are not added on principle, not even the Apple apple or the Windows flag. To date, about 120 thousand characters are encoded in version 8.0.