
Escaping characters. Which special characters must be escaped in regular expressions? What is escaping in HTML?

Programming languages, text command interfaces, and text markup languages (HTML, TeX, wiki markup) usually deal with structured text in which some characters (and combinations of characters) are used as control characters, including ones that control the structure of the text. When such a character has to be used as an ordinary character of the language, escaping is applied.

Escaping can be roughly divided into three types:

  • escaping a single character
  • escaping a group of characters with a "start escaping" character sequence and an "end escaping" sequence
  • using a "start escaping" command sequence together with an "end of escaping" marker that is defined at the start of the escaped text.

Lack of escaping as a cause of vulnerability

Escaping deserves special attention when structured text is generated automatically. Including arbitrary string data requires escaping the control characters in it. Very often, though, real strings contain no such characters, which lets the programmer skip this step altogether and get a simpler program that works correctly with "any reasonable" string data. Such simplified code has a hidden vulnerability, however, because a third party (the author of the string data) gains an unauthorized way to influence the structure of the generated text. The vulnerability becomes serious if the generated text is itself a program. Traditionally, such problems affect SQL (see SQL injection) and HTML (see cross-site scripting).

Examples

Escaping a single character

  • In the C programming language, characters inside strings are escaped with the "\" character placed before the escaped character. (The "\" character can also escape itself: the combination "\\" is used to output a backslash.) The same character is used for escaping in the UNIX command shell.
  • In the Microsoft Windows command prompt, some characters are escaped with the "^" character placed before the escaped character.

Escaping a group of characters

  • In the Python programming language, a group of characters in a string is escaped by placing the letter r (from the English "raw") before the string, i.e. the characters are escaped by the sequence r"escaped text".
  • In wiki markup, text is escaped with the pseudo-tags <nowiki> and </nowiki>. If you need to write the pseudo-tag <nowiki> itself, this is done with character entities (&lt;nowiki&gt;).
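
JavaScript has a comparable way to escape a whole group of characters: the built-in String.raw tag for template literals. The sketch below only illustrates the idea; the sample path is made up.

```javascript
// In an ordinary string literal every backslash must itself be escaped.
const escapedOneByOne = "C:\\new\\table.txt";

// String.raw switches off backslash escape processing for the whole literal,
// much like the r"..." prefix in Python.
const escapedAsGroup = String.raw`C:\new\table.txt`;

console.log(escapedOneByOne === escapedAsGroup); // true
console.log(escapedAsGroup.length);              // 16
```

Without String.raw, the \n and \t inside the path would have been read as a newline and a tab.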

Escaping text with a terminator

When a text contains many control characters, it needs many escape characters and becomes hard to read. For such cases an alternative escaping method is used: a terminator. All control characters are then treated as ordinary characters (they carry no control function), and the text ends when the compiler encounters a certain sequence, the terminator.

To know when and what to escape without trial and error, you need to understand precisely the chain of contexts the string passes through. You specify the string at the far end of that chain, and its final destination is the memory processed by the regular expression parsing code.

Be aware of how the string reaches memory: it can be a plain string inside the code or a string entered on the command line, and the latter can be an interactive command line or a command line written in a shell script file; it can sit in a variable mentioned by the code, be a (string) argument that is evaluated further, or be a string containing dynamically generated code with any kind of encapsulation...

Each of these contexts assigns special functionality to some characters.

If you want to pass a character literally, without its special function (which is local to a context), you must escape it for the next context... which may need other escape characters, which may in turn need to be escaped in the preceding context(s). In addition, there can be things like character encoding (the most insidious is UTF-8, because it looks like ASCII for common characters but may additionally be interpreted even by the terminal, depending on its settings, so it can behave differently) and the encoding attribute of HTML/XML; you need to understand the process precisely to get it right.

For example, a regular expression on a command line starting with perl -npe has to be passed to a set of exec system calls connected by pipes that handle the files. Each of these exec system calls simply has a list of arguments that were separated by (unescaped) spaces, and possibly pipes (|) and redirections (> N> N>&M), parentheses, interactive expansion of * and ?, $(()) ... (All of these are special characters used by the *sh shells, which may appear to interfere with the characters of the regular expression in the next context, but they are evaluated in order: the command line comes first.) The command line is read by a program such as bash/sh/csh/tcsh/zsh. Inside double or single quotes escaping is simpler, but quoting a string on the command line is not required: mostly a space has to be prefixed with a backslash, and leaving out the quotes keeps expansion available for the characters * and ?. Unquoted text is, however, parsed as a different context from quoted text. Then, once the command line has been evaluated, the regular expression obtained in memory (not as written on the command line) receives the same treatment as it would in a source file. Within a regular expression, square brackets [ ] open a character-set context, and a Perl regular expression can be delimited by a large set of non-alphanumeric characters (for example, m// or m:/better/for/path: ...).

Another answer gives more details about the characters, which are very specific to the final regular expression context. As I noted, you mention that you arrive at regexp escaping by trial and error; that is probably because different contexts have different sets of special characters, which confused your memory of those attempts (the backslash is often the character used in those other contexts as well to escape a literal character and strip it of its function).
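
The chain of contexts can be seen on a small scale in JavaScript, where the same pattern may be written as a regex literal (one context) or as a string handed to the RegExp constructor (two nested contexts). This is a minimal sketch with made-up sample strings, not a description of the Perl or shell case above.

```javascript
// Context 1: a regex literal. One backslash escapes the dot for the regex engine.
const fromLiteral = /\.$/;

// Context 2 wraps context 1: the pattern is first a string literal, so the
// backslash must be escaped for the string parser before the regex engine sees it.
const fromString = new RegExp("\\.$");

// Forgetting the outer layer: "\.$" is just ".$" in memory,
// so the dot matches any character.
const underEscaped = new RegExp("\.$");

console.log(fromLiteral.source === fromString.source); // true
console.log(fromString.test("He is a hero!"));         // false
console.log(underEscaped.test("He is a hero!"));       // true (unintended match)
```

Each extra layer the pattern travels through (shell, string literal, template) adds its own round of escaping on top of the regex one.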


I am glad to welcome everyone again to the pages of the blog dedicated to all the intricacies of successfully creating and promoting websites, Site. ON! In today's PHP lesson we will touch on the following topics: variable types, escaping, special characters, and heredoc syntax in PHP.

Types of variables

PHP has eight different variable types:

4 scalar types:

  • boolean (Boolean, or logical, type)
  • integer (whole numbers)
  • float (floating-point number)
  • string

2 compound types:

  • array
  • object

2 special types:

  • resource
  • NULL

Before looking at each type in more detail, it is worth clarifying that PHP is not a strictly typed language but a dynamically typed one. This means that we do not need to declare the type of each variable in advance (when creating it). PHP itself works out the type of a variable from what we put into it. It also means that, unlike in strictly typed languages, we can take a variable that holds a number (integer) and put a string into it, and that will not be an error! This is one of the PHP features that people new to programming tend to like a lot. As a rule, in the end everyone comes to see it as a minus of the language rather than a plus.

Boolean (logical) is the simplest type. It can take only two values, true or false, and they are case-insensitive (you can write true, TRUE, True, and so on). An illustrative example:

<?php $name = true; $name2 = false;
echo $name, "<br>", $name2; ?>

Result:

As you can see, a Boolean has no textual form of its own: when you output true or false, PHP converts it to the string "1" or to an empty string, and that is what the browser displays.

When converting to the Boolean type, the following values are treated as false:

  • the integer 0 (zero)
  • the floating-point number 0.0 (zero)
  • the empty string and the string "0" (or '0')
  • an empty array
  • the special type NULL (including unset variables)

All other values are treated as true.

// Integers (decimal, negative, octal, hexadecimal):
$int = -5;   // negative number
$int = 05;   // octal number
$int = 0x1a; // hexadecimal number
// Floating-point numbers (real):
$flt = 1.4;
$flt = 1.2e3;
$flt = 7E-10;
?>

However, the most commonly used type in PHP is probably the string. Strings can be written in either single or double quotes, but I advise against writing strings in double quotes, because that makes the PHP interpreter parse your string for variables, which slows things down, even if only slightly. Even if you want to use variables in your string, you can do so with single quotes plus concatenation (joining two or more strings into one). Why are double quotes needed at all, then? For example, when we want to use special characters (\n, \r, etc.), but more on them a little later.

It is also worth noting that single quotes plus concatenation make the code much more readable than putting everything inside double quotes. But enough preamble; now you will see and understand everything for yourself:

<?php
$number = 2; // integer
$hand1 = "Number of hands a person has: "; // string + makes PHP look for variables
$hand2 = 'Number of hands a person has: '; // string
// Add the variable $number to these strings:
$hand1 = "Number of hands a person has: $number and some more text..."; // I do not recommend
$hand2 = 'Number of hands a person has: ' . $number . ' and some more text...'; // I recommend!
echo $hand1, "<br>", $hand2; ?>

Result:

We will talk more about concatenation in the next article.

NULL is a special type. A variable is considered NULL if:

  • it has been assigned the constant NULL
  • it has not yet been assigned any value
  • it has been removed with unset()

Studying the remaining variable types at this stage would be pointless. We will run into the other types and examine them as we study PHP more deeply.

Escaping in PHP

What if we do not want the variable's value in our string but want to write $number literally? Consider two options:

<?php
$hand1 = "Number of hands a person has: \$number and some more text..."; // I do not recommend
$hand2 = 'Number of hands a person has: $number and some more text...'; // I recommend!
echo $hand1, "<br>", $hand2; ?>

Result:

In the first option (with double quotes) we escaped the special dollar character, so that it lost its special purpose (marking variables) and became an ordinary dollar sign.

In the second option (with single quotes), as you already know, the PHP interpreter did not even try to look for variables in the string, so no escaping was required.

Special characters in PHP

Especially for the readers of the Site. ON! blog I have prepared a short list of special characters in the PHP programming language:

  • \n newline
  • \r carriage return
  • \t horizontal tab
  • \\ backslash
  • \$ dollar sign
  • \" double quote

Let's look at how special characters work using \n as an example: it moves output to a new line (like Enter). Browsers ignore it when rendering (and that is how it should be), but the result of its work can be seen in the page source:

echo $rule, "<br>", $rule2; ?>

Result:

Source code (Ctrl + U):

If the special character is not displayed in any way for visitors in the browser, then what is the point of it?

First, with special characters, and \n in particular, you can conveniently format the generated code of the page (as in the example above).

Second, \n can be used, for example, when writing to a file, to make a line break (Enter) and continue writing on a new line.

An alternative to this kind of formatting is heredoc syntax.

Heredoc Syntax in PHP

Result:

Source code (Ctrl + U):

The result speaks for itself; now let's see how it all works:

  • The string begins with three angle brackets <<<, followed by the identifier name.
  • The line with the opening identifier (label) must not contain any other characters after it, not even a space. In other words, immediately after the label we press Enter: no space, just Enter!

SQL injection, cross-site request forgery, corrupted XML... Scary, terrible things that we would all like to protect ourselves from, if only we knew why they happen. This article explains the fundamental concept behind all of them: strings and the handling of strings within strings.

The main problem

It is just text. Yes, just text, and that is the main problem. Almost everything in a computer system is represented as text (which, in turn, is represented as bytes). Some text is meant for the computer and some for people, but all of it is still text. To show what I mean, here is a small example:
Homo Sapiens. Suppose, there is the English text, which I don't wanna translate into Russian

Believe it or not, this is text. Some people call it XML, but it is just text. It may not be fit to show to an English teacher, but it is still just text. You can print it on a poster and carry it to rallies, you can write it in a letter to your mother... It is text.

Nevertheless, we want certain parts of this text to mean something to our computer. We want the computer to be able to extract the author of the text and the text itself separately, so that something can be done with them. For example, to convert the above into this:
Suppose, there is the English text, which I don't wanna translate into Russian by Homo Sapiens
How does the computer know how to do that? Well, because we cleverly wrapped certain parts of the text in special words inside funny angle brackets (tags). Since we did that, we can write a program that looks for those parts, extracts the text, and uses it for whatever we have in mind.

In other words, we used certain rules in our text to mark a special meaning that anyone observing the same rules can make use of.
Okay, that is not so hard to understand. But what if we want to use those funny brackets, which carry a special meaning, in our text without invoking that meaning? Something like this:

Homo Sapiens. Basic math tells us that if x < n and y > n, x cannot be larger than y.

The characters "<" and ">" are not special in themselves. They can legitimately be used anywhere, in any text, as in the example above. But what about our idea of special words in angle brackets? Does this mean that "< n and y >" is some kind of keyword? In XML, maybe yes. And maybe no. It is ambiguous. Since computers do not cope well with ambiguity, the result may be unpredictable unless we dot all the i's and eliminate the ambiguity.
This dilemma can be solved by replacing the ambiguous characters with something unambiguous.
Homo Sapiens. Basic math tells us that if x &lt; n and y &gt; n, x cannot be larger than y.

Now the text should be completely unambiguous. "&lt;" is equivalent to "<", and "&gt;" to ">".
The technical term for this is escaping: we escape special characters when we do not want them to have their special meaning.
escape | iˈskāp | [no obj.] break free [with obj.] fail to be noticed or remembered by [...] [with obj.] Computing: cause to be interpreted differently [...]
If certain characters or sequences of characters in a text have a special meaning, there must be rules for the situations where those characters have to be used without invoking that meaning. In other words, escaping answers the question: "If these characters are so special, how do I use them in my text?"
As you may have noticed in the example above, the ampersand (&) is also a special character. But what do we do if we want to write "&lt;" without it being interpreted as "<"? In XML, the escape sequence for & is "&amp;", i.e. we must write "&amp;lt;".
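
The replacement rules just described can be sketched in a few lines of JavaScript. The function name escapeXmlText is hypothetical, and the sketch covers only the three characters discussed here, not every rule of XML.

```javascript
// Hypothetical helper: escape the characters that are special in XML text content.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;") // must run first, or the entities added below would be escaped again
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeXmlText("x < n and y > n")); // x &lt; n and y &gt; n
console.log(escapeXmlText("&lt;"));            // &amp;lt;
```

The order matters: escaping the ampersand last would turn the freshly produced "&lt;" into "&amp;lt;".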

Other examples

XML is not the only thing that "suffers" from special characters. Source code in any programming language shows the same effect:
var name = "Homo Sapiens";
var contents = "Suppose, there is the English text, which I don't wanna translate into Russian";
It is all simple: the ordinary text is clearly separated from the "non-text" by double quotes. My text from the math course can be used in the same way:
var name = "Homo Sapiens";
var contents = "Basic math tells us that if x < n and y > n, x cannot be larger than y.";
Cool! And there is no need to escape anything! But wait, what if I want to quote someone?
var name = "Homo Sapiens";
var contents = "Plato is said to once have said "Lorem ipsum dolor sit amet".";
Hmm... how sad. As a human you can tell where the text begins and ends and where the quotation is. For a computer, however, it has become ambiguous again. We have to come up with some escaping rules that help us distinguish a literal " from the " that means the end of the text. Most programming languages use the backslash:
var name = "Homo Sapiens";
var contents = "Plato is said to once have said \"Lorem ipsum dolor sit amet\".";
"\" makes the character after it non-special. But that, in turn, means that "\" is itself a special character. To write this character unambiguously in a text, you have to prefix it with the same character: "\\". Funny, right?

Attack!

Things would not be so bad if all we had to do was escape. It is a nuisance, of course, but not a terrible one. The problems begin when some programs write text for other programs to read. And no, this is not science fiction; it happens all the time. For example, when you publish a message on this site, you do not type it by hand in HTML; you write only the text, which the site then converts into HTML, after which the browser converts the generated HTML back into readable text.
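
When one program writes text for another, the escaping is usually best left to a routine that knows the target format. As a small sketch, JavaScript's built-in JSON.stringify turns an arbitrary string into a double-quoted literal with quotes and backslashes escaped; the sample sentence is the Plato quotation used earlier.

```javascript
const quote = 'Plato is said to once have said "Lorem ipsum dolor sit amet".';

// Written by hand: each inner double quote is escaped with a backslash.
const byHand = "Plato is said to once have said \"Lorem ipsum dolor sit amet\".";

// Generated: JSON.stringify applies the same escaping automatically.
const literal = JSON.stringify(quote);

console.log(byHand === quote);             // true
console.log(literal);                      // "Plato is said to once have said \"Lorem ipsum dolor sit amet\"."
console.log(JSON.parse(literal) === quote); // true: the reader recovers the original text
```

The generated literal can be handed to any program that reads JSON, and the inner quotes no longer end the text prematurely.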

Another common example, and the source of many security problems, is SQL queries. SQL is a language designed to simplify communication with databases:
SELECT phone_number FROM users WHERE name = 'Alex'
This text contains practically no special characters, mostly English words. And yet each word in SQL has a special meaning. SQL is used from many programming languages all over the world in one form or another, for example:
$query = "SELECT phone_number FROM users WHERE name = 'Alex'";
$result = mysql_query($query);
These two simple lines hide from us the terribly complex task of asking a database program for data that meets our requirements. The database sifts through possibly terabytes of bits and bytes to return a neatly formatted result to the program that made the request. Seriously, all of that is encapsulated in a simple English-like sentence.

To make this useful, such queries are not hard-coded but are built from user input. Here is the same sentence adapted for use by different users:
$name = $_POST['name'];
$query = "SELECT phone_number FROM users WHERE name = '$name'";
$result = mysql_query($query);
In case you are just skimming this article: this is an anti-example! This is the worst thing you could ever do! This is a security nightmare! Every time you write something like this, an innocent kitten dies! Cthulhu will devour your soul for it!

Now let's see what happens here. $_POST['name'] is a value that a random user entered into a random form on your random website. Your program builds an SQL query that uses this value as the name of the user you want to find in the database. Then this SQL "sentence" is sent straight to the database.

That does not sound so terrible, does it? Let's try a few random values that might be entered on your random website and see what queries they produce:

Alex
SELECT phone_number FROM users WHERE name = 'Alex'
Mc'Donalds
SELECT phone_number FROM users WHERE name = 'Mc'Donalds'
Joe'; DROP TABLE users; --
SELECT phone_number FROM users WHERE name = 'Joe'; DROP TABLE users; --'
The first query does not look scary; quite nice, in fact, right? Number 2 seems to "somewhat" damage our syntax because of the ambiguous '. Damn German! Number 3 is just stupid. Who would write that? It makes no sense...
But not to the database processing the query... The database has no idea where the query came from or what it was supposed to mean. All it sees is two queries: find the number of a user named Joe, and then delete the users table (followed by the comment '), and that is exactly what it will do.
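
The three cases can be reproduced with plain string concatenation. The sketch below is deliberately vulnerable and only builds the query text; it assumes single-quoted SQL string syntax, and the function name buildQueryUnsafely is made up.

```javascript
// Deliberately vulnerable, for illustration only:
// the value is spliced into the SQL text without any escaping.
function buildQueryUnsafely(name) {
  return "SELECT phone_number FROM users WHERE name = '" + name + "'";
}

for (const input of ["Alex", "Mc'Donalds", "Joe'; DROP TABLE users; --"]) {
  console.log(buildQueryUnsafely(input));
}
```

The last input closes the string early, so everything after the quote is read by the database as SQL rather than as part of the name.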

None of this should be news to you. If it is, please read this article again, because you are either new to programming or have spent the last 10 years in a cave. This example illustrates the basics of SQL injection, which is used all over the world to delete data, to obtain data that should not be obtainable, to log in without the rights to do so, and so on. And all because the database takes the English-like "sentence" too literally.

Oooeeeee!

Next step: XSS attacks. They work in the same way, only they apply to HTML.
Suppose you have solved the problems with the database: you receive data from the user, write it to the database, and output it back on the website for other users to see. This is what a typical forum, comment system, and so on does. Somewhere on your site there is something like this:

Posted by [author] on [date]


If your users are good and kind, they will post quotes from old philosophers, and the messages will look something like this:

Posted by Plato on January 2, 15:31

I am said to have said "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."


If your users are smart alecks, they will probably talk about mathematics, and there will be messages like this:

Posted by Pascal on November 23, 04:12

Basic math tells us that if x < n and y > n, x cannot be larger than y.


Hmm... those darn brackets of ours again. Well, from a technical point of view they may be ambiguous, but the browser will forgive us that, right?


Hold on, what the hell? What joker put JavaScript tags into your forum? Anyone who views this message on your site now loads and executes scripts, in the context of your site, that can do who knows what. And that is not good.

Do not take it literally

In the cases above we want to tell our database or browser somehow: this is just text, do not do anything with it! In other words, we want to "remove" the special meaning of all special characters and keywords from any information supplied by the user, because we do not trust the user. What do we do?

What? What are you saying, boy? Oh, you say "escaping"? You are absolutely right, have a cookie!
If we apply escaping to the user data before combining it with the query, the problem is solved. For our database queries it will look something like this:
$name = $_POST['name'];
$name = mysql_real_escape_string($name);
$query = "SELECT phone_number FROM users WHERE name = '$name'";
$result = mysql_query($query);
Just one more line of code, but now nobody can "hack" our database this way. Let's look again at what the SQL queries look like, depending on the user input:
Alex
SELECT phone_number FROM users WHERE name = 'Alex'
Mc'Donalds
SELECT phone_number FROM users WHERE name = 'Mc\'Donalds'
Joe'; DROP TABLE users; --
SELECT phone_number FROM users WHERE name = 'Joe\'; DROP TABLE users; --'
mysql_real_escape_string indiscriminately places a backslash in front of everything that might have some special meaning.


For HTML, we apply the htmlspecialchars function to all user data before outputting it. Now the pest's message looks like this:

Posted by jacktr on July 18, 12:56


Please note that the values received from users are not actually "damaged". Any browser parses this as HTML and displays everything in the right form.
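
The same idea can be sketched in JavaScript. The escapeHtml function below is a hand-written stand-in for what htmlspecialchars does in PHP, and renderPost, its field names, and the sample script are assumptions made up for this illustration.

```javascript
// Hand-written stand-in for PHP's htmlspecialchars, for illustration only.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#039;");
}

// Hypothetical post template: every user-supplied field is escaped on output.
function renderPost(post) {
  return "<p>Posted by " + escapeHtml(post.author) + " on " + escapeHtml(post.date) + "</p>" +
         "<p>" + escapeHtml(post.body) + "</p>";
}

console.log(renderPost({
  author: "jacktr",
  date: "July 18, 12:56",
  body: '<script>alert("hi")</script>',
}));
```

The browser then shows the script as visible text instead of running it, because it never sees an actual tag.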

Which brings us back to...

All of the above demonstrates a problem common to many systems: text within text must be escaped if it is not supposed to contain special characters. When placing text values into SQL, they must be escaped according to SQL rules. When placing text values into HTML, they must be escaped according to HTML rules. When placing text values into (technology name), they must be escaped according to the rules of (technology name). That is all.

For the complete picture

There are, of course, other ways to deal with user input that may or may not contain special characters:
  • Validation
    You can check whether the user input matches a given specification. If you require a number and the user enters something else, the program must tell the user so and reject the input. If all this is organized properly, there is no risk of getting "DROP TABLE users" where the user was expected to enter "42". This is not very practical as a defense against HTML/SQL injection, because you often have to accept free-form text that may contain such "tricks". Validation is usually used in addition to other measures.
  • Sanitization
    You can also quietly remove any characters that you consider dangerous. For example, simply remove anything that looks like an HTML tag to keep it from being added to your forum. The problem is that you may delete perfectly legitimate parts of the text.
  • Prepared SQL statements
    There are special functions that do just what we were trying to achieve: they make the database understand the difference between the SQL query and the information provided by users. In PHP they look like this:
    $stmt = $pdo->prepare("SELECT phone_number FROM users WHERE name = ?");
    $stmt->execute(array($_POST["name"]));
    Here the data is sent in two stages, with the query and the variables clearly separated. The database can first understand the structure of the query and then fill it with values.

In the real world, all of this is used together as different layers of protection. You should always use validation to make sure that the user enters correct data. Then you can (but do not have to) sanitize the input. If the user is clearly trying to slip you a script, you can simply delete it. Then you must always escape user data before placing it in an SQL query (the same applies to HTML).
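
How validation and the separation of query and values fit together can be sketched without a real database. The function phoneNumberQuery, the 100-character limit, and the returned object shape are assumptions made for this illustration, not the API of any particular driver.

```javascript
// Hypothetical helper: the SQL text is a constant, the user value travels separately.
function phoneNumberQuery(name) {
  // Validation first: reject input that cannot be a name at all.
  if (typeof name !== "string" || name.length === 0 || name.length > 100) {
    throw new Error("invalid name");
  }
  return {
    sql: "SELECT phone_number FROM users WHERE name = ?",
    params: [name],
  };
}

const harmless = phoneNumberQuery("Alex");
const hostile = phoneNumberQuery("Joe'; DROP TABLE users; --");

console.log(harmless.sql === hostile.sql); // true: the query structure never changes
console.log(hostile.params[0]);            // Joe'; DROP TABLE users; --
```

Because the value never becomes part of the query text, there is nothing in it for the database to interpret as SQL.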

The regular expression reference has a section called "Metacharacters (escaped)". These metacharacters (also called special characters) are what this article is about.

Special characters are characters that are neither letters nor digits. That is, all characters except letters and digits.

Special characters include the dot, the asterisk, the plus sign, the question mark, the hash sign, and others.

As we know from previous articles, some special characters play a special role in regular expressions. That is, each such special character has a function of its own.

For example, the dot stands for any character (by default, any character except a line break). The asterisk is a quantifier meaning zero or more repetitions. The plus sign is also a quantifier, meaning one or more repetitions. The caret (^) is an anchor that marks the beginning of the string, and the dollar sign ($) marks the end of the string; the dollar sign is an anchor too. We also know that ^ has another role when placed inside square brackets. We discussed all of these meanings in previous articles.

In this article I will answer the question "How do you use special characters literally in regular expressions?".

To cancel the special role of a special character in a regular expression, the character must be escaped. The escaped character then stands for exactly the character it is. That is, an escaped dot means a dot and not any character. An escaped asterisk means an asterisk and not a repetition quantifier.

Escaping is done with the backslash. That is, to escape a special character, you put a backslash in front of it.

Suppose we have the task "Check whether there is a dot at the end of the string." For the dot in the regular expression to be exactly a dot, and not any character, it must be escaped.

var str = "He is a hero.";
var reg = /.*\.$/;
alert(reg.test(str)); // true

As we can see, the result of testing the string against the regular expression is true. If we remove the dot from the end of the string, the result will be false.

Other special characters are escaped in the same way.

var str = "x+y=.n*m=/co\\la";
var reg = /x\+y=\.n\*m=\/co\\la/;
alert(reg.test(str)); // true

Here we have escaped the plus sign (\+), the dot (\.), the asterisk (\*), the forward slash (\/), and the backslash (\\). Note that a backslash inside a string literal is written as two backslashes, and in a regular expression it is likewise escaped with a second backslash.

If we output the string from the str variable with alert, we will see only one backslash instead of two.

All the characters listed in the "Metacharacters" section of the reference are escaped in the same way.
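
When the text to be matched is not known in advance, the escaping can be automated. The helper below follows a widely used pattern; the name escapeRegExp is a convention rather than a built-in, and the sample string is made up.

```javascript
// Put a backslash in front of every regex metacharacter in the text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // "$&" stands for the matched character
}

const str = "x+y=.n*m=/co\\la";
const reg = new RegExp("^" + escapeRegExp(str) + "$");

console.log(escapeRegExp("x+y=.n*m")); // x\+y=\.n\*m
console.log(reg.test(str));            // true
```

The forward slash needs no escaping here because the pattern is passed to the RegExp constructor as a string, not written between slashes.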

And that, perhaps, is all. From this short article you now know how to escape special characters and how to use them when writing regular expressions.

Tasks

  1. Suppose we need to check a string such as "I won $ 400." Write a regular expression that checks for the presence of a dollar sign at the end of the string. Test the string against it.