Cleaning text of unnecessary HTML tags - Parsing from A to Z. Removing HTML tags from a string in PHP

Sooner or later, absolutely everyone faces the task of cleaning HTML of unnecessary tags.

The first thing that comes to mind is the PHP function strip_tags():
string strip_tags(string str [, string allowable_tags])

The function returns a string with the tags stripped. Tags that must not be removed are passed in the allowable_tags argument. The function works but is, to put it mildly, imperfect. It does not check the validity of the code along the way, which can lead to the removal of text that is not part of any tag.
Enterprising developers have not sat idle: modified functions can be found online. A good example is strip_tags_smart.
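
The imperfection is easy to reproduce. The snippet below is a minimal sketch with made-up input strings: the first call shows ordinary stripping, the second shows how a stray "<" that is never closed makes strip_tags() drop the rest of the text.

```php
<?php
// Well-formed markup: only the tags are removed.
$clean = strip_tags('<p>Hello <b>world</b></p>');
echo $clean, "\n"; // Hello world

// A "<" followed by a non-space character is treated as the start of a tag.
// It is never closed here, so everything after it is discarded.
$broken = strip_tags('Apples cost <5 dollars today');
echo $broken, "\n";
```

This is why input that may contain comparison signs or broken markup deserves a look before strip_tags() is trusted with it.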

Whether or not to use ready-made solutions is each programmer's personal choice. As it happens, I usually do not need a "universal" handler and find it more convenient to clean the code with regular expressions.

What does the choice of one processing method or another depend on?

1. On the source material and the complexity of analyzing it.
If you need to process fairly simple HTML texts without any tricky layout, clear as day :), you can use the standard functions.
If the texts have particular features that must be taken into account, special handlers are written. Some can simply rely on str_replace. For example:

$s = array(
    "â€™" => "'", // right-apostrophe (e.g. in I'm)
    "â€œ" => '"', // opening speech mark
    "â€“" => "-", // long dash
    "â€" => '"', // closing speech mark
    "Ã©" => "é", // e acute accent
    chr(226) . chr(128) . chr(153) => "'", // right-apostrophe again
    chr(226) . chr(128) . chr(147) => "-", // long dash again
    chr(226) . chr(128) . chr(156) => '"', // opening speech mark
    chr(226) . chr(128) . chr(148) => "-", // m dash again
    chr(226) . chr(128) => '"', // right speech mark
    chr(195) . chr(169) => "é", // e acute again
);

foreach ($s as $needle => $replace)
{
    $htmlText = str_replace($needle, $replace, $htmlText);
}

Others can be based on regular expressions. As an example:

function getTextFromHTML($htmlText)
{
    $search = array("'<script[^>]*?>.*?</script>'si", // Remove JavaScript
        "'<style[^>]*?>.*?</style>'si", // Remove styles
        "'<xml[^>]*?>.*?</xml>'si", // Remove XML tags
        "'<[\/\!]*?[^<>]*?>'si", // Remove HTML tags
        "'([\r\n])[\s]+'", // Remove whitespace after line breaks
        "'&(quot|#34);'i", // Replace HTML special chars
        "'&(amp|#38);'i",
        "'&(lt|#60);'i",
        "'&(gt|#62);'i",
        "'&(nbsp|#160);'i",
        "'&(iexcl|#161);'i",
        "'&(cent|#162);'i",
        "'&(pound|#163);'i",
        "'&(copy|#169);'i");

    $replace = array("",
        "",
        "",
        "",
        "\\1",
        "\"",
        "&",
        "<",
        ">",
        " ",
        chr(161),
        chr(162),
        chr(163),
        chr(169));

    $text = preg_replace($search, $replace, $htmlText);

    // Numeric entities. The original used "'&#(\d+);'e" (evaluate as PHP),
    // but the /e modifier was removed in PHP 7, so a callback does the job.
    return preg_replace_callback("'&#(\d+);'", function ($m) {
        return chr((int) $m[1]);
    }, $text);
}
(At moments like this you are glad that preg_replace can work with arrays as parameters.) Extend the arrays with your own regular expressions as needed. A regular expression builder may help you compose them. Beginner developers may find the article "All About HTML Tags. 9 Regular Expressions to Strip HTML Tags" useful. Look at the examples there and analyze the logic.
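
To make the array form of preg_replace concrete, here is a small self-contained sketch. The patterns and the input string are illustrative assumptions, not the ones from the function above; the point is only that the patterns are applied one after another, in array order.

```php
<?php
// Each pattern is paired with the replacement at the same index.
$search = array(
    '~<script[^>]*>.*?</script>~si', // script blocks together with their content
    '~<[^<>]+>~',                    // any remaining tag
    '~&nbsp;~i',                     // non-breaking space entity
    '~\s+~',                         // runs of whitespace
);
$replace = array('', '', ' ', ' ');

$html = "<p>Price:&nbsp;<b>10</b></p>\n<script>track();</script>";
$text = trim(preg_replace($search, $replace, $html));
echo $text, "\n"; // Price: 10
```

Because the order matters, the script pattern has to come before the generic tag pattern; otherwise only the script tags would be removed and their content would stay in the text.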

2. On the volume.
Volume is directly related to the complexity of the analysis (see the previous point). A large number of texts increases the likelihood that, while trying to foresee and clean everything with regular expressions, you will miss something. In this case a "multistage" cleaning method is suitable. That is, clean first with, say, the strip_tags_smart function (do not delete the source texts, just in case). Then selectively review some of the texts to identify "anomalies". Finally, clean up the anomalies with regular expressions.
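
The review step of such a multistage cleanup can be partly automated. The sketch below is an assumption about how that might look: find_anomalies() is a hypothetical helper, and plain strip_tags() stands in for the first-stage cleaner. It reports leftover angle brackets and unconverted entities so that only suspicious texts need a manual look.

```php
<?php
/**
 * Hypothetical helper: returns suspicious fragments left after cleaning,
 * namely stray angle brackets and HTML entities.
 */
function find_anomalies($text)
{
    preg_match_all('~&[a-z][a-z0-9]*;|&#\d+;|[<>]~i', $text, $m);
    return $m[0];
}

$texts = array(
    'Plain text, nothing to report.',
    'Tom &amp; Jerry <b>cartoon</b>',
);

foreach ($texts as $i => $text) {
    $stage1 = strip_tags($text);          // stage 1: generic cleaner
    $found  = find_anomalies($stage1);    // stage 2: look for leftovers
    echo $i, ': ', $found ? implode(', ', $found) : 'clean', "\n";
}
```

The anomalies found this way suggest which extra regular expressions are worth adding for the third stage.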

3. On what has to be obtained as a result.
The processing algorithm can be simplified in different ways depending on the situation. The case I described in one of the previous articles demonstrates this well. Let me remind you: the text sat in a div that also contained a div with "breadcrumbs", AdSense advertising and a list of similar articles. Analysis of a sample of articles showed that they contained no pictures and were simply split into paragraphs with <p> tags. Instead of cleaning the "main" div of foreign elements, you can find all the paragraphs (with Simple HTML DOM Parser this is very simple) and join their contents. So before composing cleanup regular expressions, check whether you can get by with less effort.
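
The paragraph-collecting idea can be sketched as follows. Note the substitution: this example uses PHP's bundled DOMDocument class instead of Simple HTML DOM Parser, so that it runs without third-party code. The function name extract_paragraphs() and the sample markup are assumptions made for illustration.

```php
<?php
/**
 * Collects the text of every <p> element and joins the pieces with blank lines.
 */
function extract_paragraphs($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);                   // tolerate sloppy real-world markup
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);  // hint that the input is UTF-8
    libxml_clear_errors();

    $parts = array();
    foreach ($doc->getElementsByTagName('p') as $p) {
        $parts[] = trim($p->textContent);               // text only, inner tags dropped
    }
    return implode("\n\n", $parts);
}

$html = '<div><div class="crumbs">Home / Blog</div>'
      . '<p>First paragraph.</p><div class="ads">Ad</div>'
      . '<p>Second <b>paragraph</b>.</p><ul><li>Related</li></ul></div>';

echo extract_paragraphs($html), "\n";
```

The breadcrumbs, the advertising block and the list never have to be matched by a regular expression: they are simply not paragraphs, so they are never collected.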

In general, real holy wars have flared up online between supporters of parsing HTML with regular expressions and supporters of parsing based on the DOM structure of the document. Here, for example, on Stack Overflow.

Validating and processing incoming data is one of the common tasks in programming. PHP is mostly used for web applications, so removing HTML tags from text is especially relevant here, because tags are the usual vehicle for third-party injections. In this article I want to remind you of good old strip_tags() and its features, and also offer solutions for removing sectional HTML tags, plus a couple of useful bonuses along the way.

So, the main tool for removing HTML tags from text is the strip_tags() function. We pass it a string value, and it removes HTML and PHP tags from it, for example:

$s = "<p>Paragraph.</p> Still text.";
echo strip_tags($s);

This example will display the string:

Paragraph. Still text.

It is worth noting that the function has a second parameter (optional, but useful) whose value is a string listing the allowed HTML tags, for example:

$s = "<p>Paragraph.</p> Still text.";
echo strip_tags($s, "<p>");

This example will display the string:

<p>Paragraph.</p> Still text.

In my opinion, this is very convenient. Nevertheless, it does not solve one important problem: removing sectional HTML tags, for example script, noscript and style, which are the most common. When I need to remove such sectional tags, as well as constructs that start with "<" and end with ">", I use the following PHP code:

$p = array(
    "'<script[^>]*?>.*?</script>'si",
    "'<noscript[^>]*?>.*?</noscript>'si",
    "'<style[^>]*?>.*?</style>'si",
    "'<[\/\!]*?[^<>]*?>'si",
);
$r = array(" ", " ", " ", " ");
$s = preg_replace($p, $r, $s);

Here the variable $p contains an array of regular expressions, and $r is an array of their replacements (I use spaces). All that remains is to perform the replacement on the string, and the HTML trash is gone from the text.

Obviously, the two solutions above can be combined. First I apply the replacement through regular expressions, then strip_tags(), and the result is my own function nohtml().
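
The article names nohtml() but does not show its body, so the following is only a guess at what such a combination might look like: sectional tags are removed with their content first, and strip_tags() handles whatever markup remains.

```php
<?php
// Hypothetical reconstruction; the original nohtml() is not shown in the article.
function nohtml($s)
{
    // 1. Drop sectional tags together with their content.
    $s = preg_replace(
        array(
            '~<script[^>]*>.*?</script>~si',
            '~<noscript[^>]*>.*?</noscript>~si',
            '~<style[^>]*>.*?</style>~si',
        ),
        ' ',
        $s
    );

    // 2. Let strip_tags() remove the remaining markup.
    return strip_tags($s);
}

$dirty = '<style>p{color:red}</style><p>Text</p><script>alert(1)</script>';
echo trim(nohtml($dirty)), "\n"; // Text
```

Doing the regular expression pass first matters: strip_tags() alone would remove the script and style tags but leave their content in the text.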

Finally, I want to offer a few more useful solutions. Tabs in text are better replaced with spaces: the browser renders both identically, and there will be less trouble, for example:

$s = str_replace("\t", " ", $s);

If you do not need line breaks, they can also be replaced with spaces, for example:

$s = str_replace(array("\n", "\r"), " ", $s);

Extra spaces can be removed with a simple regular expression, for example:

$s = preg_replace("/\s+/", " ", $s);
$s = trim($s); // will not hurt
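
The three whitespace steps fit naturally into one helper. The function name normalize_spaces() is an assumption made for this sketch.

```php
<?php
// Hypothetical helper that chains the whitespace cleanups described above.
function normalize_spaces($s)
{
    $s = str_replace(array("\t", "\n", "\r"), ' ', $s); // tabs and line breaks become spaces
    $s = preg_replace('/\s+/', ' ', $s);                 // collapse runs of whitespace
    return trim($s);                                     // drop leading and trailing spaces
}

echo normalize_spaces("  one\t two\r\n\r\n three  "), "\n"; // one two three
```

Strictly speaking the regular expression alone already covers tabs and line breaks, since \s matches them too; the explicit str_replace step is kept to mirror the order of the text.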

That is all from me. Thank you for your attention. Good luck!


strip_tags

(PHP 3 >= 3.0.8, PHP 4, PHP 5)

strip_tags - Removes HTML and PHP tags from a string

Description

string strip_tags(string str [, string allowable_tags])

This function returns the string str with HTML and PHP tags removed. To remove tags, it uses the same state machine as the fgetss() function.

The optional second argument can be used to specify tags that should not be removed.

Note: The allowable_tags argument was added in PHP 3.0.13 and PHP 4.0b3. HTML comments are also removed as of PHP 4.3.0.

Warning

Because strip_tags() does not validate the HTML, incomplete tags can lead to the removal of text that is not part of any tag.

Example 1. Using strip_tags()

$text = "<p>Paragraph.</p> A little more text";
echo strip_tags($text);
echo "\n\n-------\n";

// Do not remove <p>
echo strip_tags($text, "<p>");

This example will output:

Paragraph. A little more text

-------
<p>Paragraph.</p> A little more text

Warning
This function does not modify the attributes of the tags listed in the allowable_tags argument, including style and onmouseover.

As of PHP 5.0.0, strip_tags() is binary safe.
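
The warning about attributes can be checked directly. In the sketch below the input string is made up, and the second step is a deliberately naive follow-up that assumes attribute values never contain a ">" character.

```php
<?php
$html = '<b onmouseover="alert(1)">bold</b> <i>italic</i>';

// <b> is allowed, so it survives together with its event handler attribute.
$kept = strip_tags($html, '<b>');
echo $kept, "\n"; // <b onmouseover="alert(1)">bold</b> italic

// Naive follow-up: keep only the tag name in opening tags.
$bare = preg_replace('~<([a-z][a-z0-9]*)\b[^>]*>~i', '<$1>', $kept);
echo $bare, "\n"; // <b>bold</b> italic
```

For untrusted input a regular expression like this is a stopgap at best; it is shown only to illustrate that allowing a tag also allows everything written inside it.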

This function has a significant drawback: words get glued together when tags are removed. In addition, the function has vulnerabilities. An alternative function, an analog of strip_tags:
<?php
/**
 * - input like "a < b > c" is processed correctly
 * - "dirty" HTML is processed correctly, when the characters < > may occur in tag attribute values
 * - broken HTML is processed correctly
 * - comments, scripts, styles, PHP, Perl, ASP code, MS Word tags and CDATA are cut out
 * - the text is formatted automatically if it contains HTML code
 * - protection against forgeries like: "<<fake>script>alert('hi')</</fake>script>"
 *
 * @param   string  $s
 * @param   array   $allowable_tags     Array of tags that will not be cut out
 *                                      Example: 'b' - the tag stays with its attributes, '<b>' - the tag stays without attributes
 * @param   bool    $is_format_spaces   Format spaces and line breaks?
 *                                      The output text (plain) looks as close as possible to the input text as rendered by a browser.
 *                                      In other words, it competently converts text/html to text/plain.
 *                                      The text is formatted only if some tags were actually cut out.
 * @param   array   $pair_tags   Array of names of paired tags that will be removed together with their content
 *                               See the default values
 * @param   array   $para_tags   Array of names of paired tags that will be treated as paragraphs (if $is_format_spaces = true)
 *                               See the default values
 * @return  string
 *
 * @license  http://creativecommons.org/licenses/by-sa/3.0/
 * @author   Nasibullin Rinat, http://orangetie.ru/
 * @charset  ANSI
 * @version  4.0.14
 */
function strip_tags_smart(
    /*string*/ $s,
    array $allowable_tags = null,
    /*boolean*/ $is_format_spaces = true,
    array $pair_tags = array('script', 'style', 'map', 'iframe', 'frameset', 'object', 'applet', 'comment', 'button', 'textarea', 'select'),
    array $para_tags = array('p', 'td', 'th', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'div', 'form', 'title', 'pre')
)
{
    //return strip_tags($s);
    static $_callback_type  = false;
    static $_allowable_tags = array();
    static $_para_tags      = array();
    #regular expression for tag attributes
    #correct processes dirty and broken HTML in a singlebyte or multibyte UTF-8 charset!
    static $re_attrs_fast_safe =  '(?![a-zA-Z\d])  #statement, which follows after a tag
                                   #correct attributes
                                   (?>
                                       [^>"\']+
                                     | (?<=[\=\x20\r\n\t]|\xc2\xa0) "[^"]*"
                                     | (?<=[\=\x20\r\n\t]|\xc2\xa0) \'[^\']*\'
                                   )*
                                   #incorrect attributes
                                   [^>]*+';

    if (is_array($s))
    {
        if ($_callback_type === 'strip_tags')
        {
            $tag = strtolower($s[1]);
            if ($_allowable_tags)
            {
                #tag with attributes
                if (array_key_exists($tag, $_allowable_tags)) return $s[0];

                #tag without attributes
                if (array_key_exists('<' . $tag . '>', $_allowable_tags))
                {
                    if (substr($s[0], 0, 2) === '</') return '</' . $tag . '>';
                    if (substr($s[0], -2) === '/>')   return '<' . $tag . ' />';
                    return '<' . $tag . '>';
                }
            }
            if ($tag === 'br') return "\r\n";
            if ($_para_tags && array_key_exists($tag, $_para_tags)) return "\r\n\r\n";
            return '';
        }
        trigger_error('Unknown callback type "' . $_callback_type . '"!', E_USER_ERROR);
    }

    if (($pos = strpos($s, '<')) === false || strpos($s, '>', $pos) === false)  #speed improve
    {
        #tags are not found
        return $s;
    }

    $length = strlen($s);

    #unpaired tags (opening, closing, !DOCTYPE, MS Word namespace)
    $re_tags = '~  <[/!]?+
                   (
                       [a-zA-Z][a-zA-Z\d]*+
                       (?>:[a-zA-Z][a-zA-Z\d]*+)?
                   ) #1
                   ' . $re_attrs_fast_safe . '
                   >
                ~sxSX';

    $patterns = array(
        '/<([\?\%]) .*? \\1>/sxSX',     #embedded PHP, Perl, ASP code
        '/<\!\[CDATA\[ .*? \]\]>/sxSX', #CDATA blocks
        #'/<\!\[  [\x20\r\n\t]* [a-zA-Z] .*?  \]>/sxSX',  #:DEPRECATED: MS Word tags like <![if! vml]>...<![endif]>

        '/<\!--.*?-->/sSX', #comments

        #MS Word tags like "<![if! vml]>...<![endif]>",
        #conditional code for IE like "<!--[if expression]> HTML <![endif]-->"
        #conditional code for IE like "<![if expression]> HTML <![endif]>"
        #see http://www.tigir.com/comments.htm
        '/ <\! (?:--)?+
               \[
               (?> [^\]"\']+ | "[^"]*" | \'[^\']*\' )*
               \]
               (?:--)?+
           >
         /sxSX',
    );
    if ($pair_tags)
    {
        #paired tags together with their content:
        foreach ($pair_tags as $k => $v) $pair_tags[$k] = preg_quote($v, '/');
        $patterns[] = '/ <((?i:' . implode('|', $pair_tags) . '))' . $re_attrs_fast_safe . '(?<!\/)>
                         .*?
                         <\/(?i:\\1)' . $re_attrs_fast_safe . '>
                       /sxSX';
    }
    #d($patterns);

    $i = 0; #protection against an endless loop
    $max = 99;
    while ($i < $max)
    {
        $s2 = preg_replace($patterns, '', $s);
        if (preg_last_error() !== PREG_NO_ERROR)
        {
            $i = 999;
            break;
        }

        if ($i == 0)
        {
            $is_html = ($s2 != $s || preg_match($re_tags, $s2));
            if (preg_last_error() !== PREG_NO_ERROR)
            {
                $i = 999;
                break;
            }
            if ($is_html)
            {
                if ($is_format_spaces)
                {
                    /*
                    In the PCRE library for PHP, \s is any whitespace character, namely the character class
                    [\x09\x0a\x0c\x0d\x20\xa0] or, in other words, [\t\n\f\r \xa0]
                    If \s is used with the /u modifier, \s is treated as [\x09\x0a\x0c\x0d\x20]
                    The browser makes no distinction between whitespace characters;
                    consecutive whitespace characters are rendered as one
                    */
                    #$s2 = str_replace(array("\r", "\n", "\t"), ' ', $s2);
                    #$s2 = strtr($s2, "\x09\x0a\x0c\x0d", '    ');
                    $s2 = preg_replace('/  [\x09\x0a\x0c\x0d]++
                                         | <((?i:pre|textarea))' . $re_attrs_fast_safe . '(?<!\/)>
                                           .+?
                                           <\/(?i:\\1)' . $re_attrs_fast_safe . '>
                                           \K
                                        /sxSX', ' ', $s2);
                    if (preg_last_error() !== PREG_NO_ERROR)
                    {
                        $i = 999;
                        break;
                    }
                }

                #array of tags that will not be cut out
                if ($allowable_tags) $_allowable_tags = array_flip($allowable_tags);

                #paired tags that will be treated as paragraphs
                if ($para_tags) $_para_tags = array_flip($para_tags);
            }
        }#if

        #tags processing
        if ($is_html)
        {
            $_callback_type = 'strip_tags';
            $s2 = preg_replace_callback($re_tags, __FUNCTION__, $s2);
            $_callback_type = false;
            if (preg_last_error() !== PREG_NO_ERROR)
            {
                $i = 999;
                break;
            }
        }

        if ($s === $s2) break;
        $s = $s2; $i++;
    }#while
    if ($i >= $max) $s = strip_tags($s); #too many cycles for replace...

    if ($is_format_spaces && strlen($s) !== $length)
    {
        #remove a duplicate spaces
        $s = preg_replace('/\x20\x20++/sSX', ' ', trim($s));
        #remove a spaces before and after new lines
        $s = str_replace(array("\r\n\x20", "\x20\r\n"), "\r\n", $s);
        #replace 3 and more new lines to 2 new lines
        $s = preg_replace('/[\r\n]{3,}+/sSX', "\r\n\r\n", $s);
    }
    return $s;
}
?>
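
The word-gluing drawback of the built-in strip_tags() mentioned earlier is easy to reproduce, and for simple cases a much lighter workaround than a full replacement function may be enough. The helper strip_tags_spaced() below is a hypothetical sketch, not part of strip_tags_smart.

```php
<?php
$html = '<p>First</p><p>Second</p>';

// The tags vanish and nothing separates the two words any more.
$glued = strip_tags($html);
echo $glued, "\n"; // FirstSecond

// Workaround sketch: put a space in front of every tag before stripping,
// then collapse the extra whitespace.
function strip_tags_spaced($html)
{
    $spaced = str_replace('<', ' <', $html);
    return trim(preg_replace('/\s+/', ' ', strip_tags($spaced)));
}

echo strip_tags_spaced($html), "\n"; // First Second
```

The price of this shortcut is that inline tags inside a word also produce spaces, so "un<b>believ</b>able" would come out as three separate pieces.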
See also the description of the function

The task of removing all or only certain HTML tags from a string often arises wherever any visitor must be able to add new information. The most common example is a guestbook or a comment system on a site. Text added this way may contain many different tags, added accidentally when copying text or deliberately to make a message somehow "very original." It is also worth noting malicious attempts to inject harmful code in script tags or to spoil the page layout with extra tags.

In any of these cases, the new information must be cleaned of unnecessary HTML tags before it is saved.

Full cleaning of text from HTML tags

Regular expressions are often used for such tasks, but in this article we will consider the easiest method: removing tags with the PHP function strip_tags. This function simply removes the tags from the string passed as its parameter.

$str_in = "
My text with various tags.
";
$str_out = strip_tags($str_in);
echo $str_out;

As a result of this processing, the $str_out variable will contain the string without tags:

My text with various tags.

* Note that the strip_tags function removes only the tags themselves and leaves the content between the opening and closing tags.
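
The note about content being left in place has a practical consequence worth seeing once. The markup in this sketch is an assumption made for illustration.

```php
<?php
// Hypothetical input; the tags are chosen only to illustrate the behavior.
$str_in = '<p>My <b>text</b> with <i>various</i> tags.</p><script>alert(1)</script>';

$str_out = strip_tags($str_in);
echo $str_out, "\n"; // My text with various tags.alert(1)
```

The script tags are gone, but the code that was between them is now part of the visible text, which is why script and style blocks usually have to be removed separately.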

Removing individual HTML tags from text

Sometimes only certain tags need to be removed from a string. Here we again use the strip_tags function, but this time the second (optional) parameter lists the tags to keep.

For example, when processing a string, you need to keep only links:

$str_in = "
My text with various tags.
";
$str_out = strip_tags($str_in, "<a>");
echo $str_out;

As a result of this processing, the $str_out variable will contain:

My text with various tags.

Thus, you can list all the tags that are allowed in the string, and all others will be removed.
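
A runnable version of the "keep only links" case might look like this. The input markup and the href value are assumptions made for illustration.

```php
<?php
// Hypothetical input with a paragraph, bold text and one link.
$str_in = '<p>My <b>text</b> with <a href="/tags">various</a> tags.</p>';

// Keep links only.
$links_only = strip_tags($str_in, '<a>');
echo $links_only, "\n"; // My text with <a href="/tags">various</a> tags.

// Several allowed tags are simply written one after another.
$links_and_bold = strip_tags($str_in, '<a><b>');
echo $links_and_bold, "\n"; // My <b>text</b> with <a href="/tags">various</a> tags.
```

Note that the allowed link keeps its href attribute unchanged, so the allow list controls tag names only, not what is written inside the tags.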

This article covers the easiest way to clean a string of tags. As I look into other options, I will expand it. I will be glad if you share your own solutions to this task in the comments or by email.