
Correct methods for removing duplicate pages. How to get rid of duplicate pages the right way

The reason for writing this article was yet another panicked call from an accountant just before submitting the VAT reports. Last quarter she had spent a lot of time cleaning up duplicate counterparties. And here they are again — the same ones plus new ones. Where do they come from?

I decided to take the time and deal with the cause rather than the consequence. The situation is mainly relevant where automatic uploading via exchange plans has been set up from the trade management application (in my case UT 10.3) into the company's accounting application (in my case BP 2.0).

Several years ago these configurations were installed and automatic exchange between them was configured. We then ran into a peculiarity of the sales department's counterparty catalog: for one reason or another they kept creating duplicate counterparty cards (with the same INN / KPP / name), scattering the same counterparty across different groups. Accounting voiced its disapproval, and it was decided that whatever they had over there did not matter — on import the cards should be merged into one. I had to intervene in the object transfer process through the exchange rules: for counterparties I removed the search by internal identifier and left a search by INN + KPP + name. However, this had its own pitfalls in the form of people fond of renaming counterparties (as a result, the rules themselves were now creating duplicates in BP). Everyone got together, discussed it, and agreed that the duplicates in UT had to be removed; they were removed, and we returned to the standard rules.

The catch is that after this "combing" of duplicates in UT and BP, the internal identifiers of many counterparties no longer matched. And since the standard exchange rules search for objects exclusively by internal identifier, the next batch of documents produced new counterparty duplicates in BP (wherever those identifiers differed). But universal XML data exchange would not be universal if this problem could not be worked around. Since the identifier of an existing object cannot be changed by standard means, the situation can be handled through a special mapping register, "Compliance of Objects for Exchange", which is available in all standard 1C configurations.

So that new duplicates would not appear, the duplicate-cleaning algorithm became the following:

1. In BP, using the "Search and Replace Duplicate Elements" processing (it is a standard one: it can be taken from the Trade Management configuration or from the ITS disk, or you can pick a suitable variation among the many available on Infostart), I find a duplicate, choose the correct element, and click "Execute replacement".

2. I get the internal identifier of the single object left after the replacement (I wrote a simple processing specifically for this, so that the internal identifier is automatically copied to the clipboard).

3. I open the "Compliance of Objects for Exchange" register in UT and filter it by the reference to that object.


Fighting duplicate pages

The site owner may not even suspect that some pages on his site have copies — and most often that is exactly the case. The pages open, their content is in order, but if you pay attention to the URLs, you will notice that the same content is available at different addresses. What does this mean? For live users, nothing, since they are interested in the information on the pages; but the soulless search engines perceive this phenomenon quite differently — for them these are completely different pages with the same content.

Are duplicate pages harmful? While an ordinary user may not even notice the duplicates on your site, the search engines will spot them immediately. What reaction should you expect? Since the copies are essentially seen as separate pages, their content stops being unique, and that already has a negative effect on ranking.

The presence of duplicates also blurs the link juice that the optimizer tried to concentrate on the target page. Because of a duplicate, it may end up on a page other than the one he wanted to promote. In other words, the effect of internal linking and of external links can be reduced many times over.

In the overwhelming majority of cases duplicates appear because of incorrect settings and a lack of proper attention from the optimizer, which generates clear copies. Many CMSs sin with this, for example Joomla. It is hard to offer a universal recipe for the problem, but you can try one of the plug-ins designed for deleting copies.

Fuzzy duplicates, whose content is not fully identical, usually appear through the webmaster's fault. Such pages are often found on online-store sites, where product pages differ by only a few sentences of description, while the rest of the content, consisting of site-wide blocks and other elements, is the same.

Many specialists argue that a small number of duplicates will not hurt a site, but if their share exceeds 40-50%, the resource may be in for serious difficulties. In any case, even if there are not many copies, it is worth dealing with them — that way you are guaranteed to avoid problems with duplicates.

Searching for page copies

There are several ways to search for duplicate pages, but first you should ask several search engines how they see your site — you only need to compare the number of pages in the index of each. This is quite simple and requires no additional tools: in Yandex or Google, just enter site:yoursite.ru in the search bar and look at the number of results.




If such a simple check shows that the numbers differ greatly — by 10-20 times — then with some probability this indicates duplicates in one of the indexes. Page copies may not be to blame for such a difference, but it does give grounds for a further, more thorough search. If the site is small, you can count the real pages by hand and compare the result with the numbers reported by the search engines.

You can also search for duplicates by looking at the URLs in the search results. If the site is supposed to use human-readable URLs, then pages with addresses made of unintelligible characters, such as "index.php?s=0f6b2903d", will immediately stand out from the general list.

Another way to detect duplicates using the search engines is to search by text fragments. The procedure is simple: enter a fragment of 10-15 words from a page into the search bar and analyze the result. If two or more pages appear in the results, there are copies; if there is only one result, the page has no duplicates and you need not worry.

Logically, if the site consists of a large number of pages, such a check can turn into an impracticable routine for the optimizer. To minimize the time spent, you can use special programs. One such tool, probably familiar to experienced specialists, is Xenu's Link Sleuth.


To check a site, open a new project by choosing "Check URL" from the "File" menu, enter the address and click "OK". After that the program will start processing all of the site's URLs. When the check is finished, export the received data to any convenient editor and start looking for duplicates.

In addition to the methods above, the Yandex.Webmaster and Google Webmaster Tools panels offer tools for checking page indexing that can also be used to look for duplicates.

Methods for solving the problem

Once all the duplicates have been found, they will need to be eliminated. This, too, can be done in several ways, but each specific case needs its own method, and it is possible that you will have to use all of them.

  • Page copies can be deleted manually, but this method is suitable mostly for those duplicates that were created manually through the webmaster's carelessness.
  • A 301 redirect is great for gluing together page copies whose URLs differ by the presence or absence of www.
  • The canonical tag can solve the problem of fuzzy copies. It can be used, for example, for product categories in an online store that have duplicates differing only by sorting parameters. Canonical also suits print versions of pages and other similar cases. It is applied quite simply: the rel="canonical" link is added to all the copies, while the main, most relevant page needs nothing. The code should look something like this: <link rel="canonical" href="http://yoursite.ru/osnovnaya-stranica" />, where the href points to the main page, and it must sit inside the head tag.
  • Configuring the robots.txt file can also help in the fight against duplicates. The Disallow directive lets you close duplicates off from search robots. You can read more about the syntax of this file in our mailing list.

Duplicates are pages on the same domain with identical or very similar content. They most often appear because of the particular way a CMS works, mistakes in robots.txt directives, or errors in setting up 301 redirects.

Why duplicates are dangerous

1. Incorrect identification of the relevant page by the search robot. Suppose one and the same page is available at two URLs:

https://site.ru/kepki/

https://site.ru/catalog/kepki/

You have invested money in promoting the page https://site.ru/kepki/. Thematic resources now link to it, and it has reached the top 10. But at some point the robot drops it from the index and adds https://site.ru/catalog/kepki/ instead. Naturally, this page ranks worse and attracts less traffic.

2. Longer time needed for robots to crawl the site. A robot allocates a limited time to scanning each site. If there are many duplicates, it may never reach the main content, and indexing will be delayed. The problem is especially relevant for sites with thousands of pages.

3. Sanctions from the search engines. By themselves duplicates are not a reason to pessimize a site — as long as the search algorithms do not conclude that you are creating duplicates intentionally in order to manipulate the results.

4. Problems for the webmaster. If the work of eliminating duplicates is put off indefinitely, they can accumulate to the point where it becomes physically difficult for the webmaster to process the reports, systematize the causes of the duplicates, and make corrections. A large amount of work increases the risk of errors.

Duplicates are conventionally divided into two groups: explicit and implicit.

Explicit duplicates (the page is available at two or more URLs)

There are many variants of such duplicates, but they are all alike in essence. Here are the most common.

1. URL with a trailing slash and without it

https://site.ru/list/

https://site.ru/list

What to do: configure the server to respond with "HTTP 301 Moved Permanently" (a 301 redirect).

How to do it:

    • find the .htaccess file in the root folder of the site and open it (if there is none, create it as a plain-text file, name it .htaccess and put it in the site root);
    • add commands to the file that redirect URLs with a trailing slash to URLs without one:

RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} ^(.+)/$
RewriteRule ^(.+)/$ /$1 [R=301,L]

    • the reverse operation (redirect from the URL without a slash to the URL with one):

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]

    • if the file is created from scratch, all the redirects must be written inside lines like these:
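A typical skeleton (assuming Apache with mod_rewrite enabled) is shown below; the RewriteCond / RewriteRule lines from the points above go between the opening and closing tags:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# the redirect rules go here
</IfModule>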



Configuring a 301 redirect via .htaccess is suitable only for sites running Apache. For nginx and other servers, the redirect is configured in other ways.

Which URL is preferable: with or without the slash? Purely technically there is no difference. Look at the situation: if more pages are indexed with the slash, keep that variant, and vice versa.

2. URL with www and without www

https://www.site.ru/1

https://site.ru/1

What to do: Specify the main mirror of the site in the webmaster panel.

How to do this in Yandex:

    • go to Yandex.Webmaster;
    • in the panel, select the site from which the redirection will go (most often the redirect points to the URL without www);
    • go to the "Indexing / Site move" section, uncheck the "Add www" box and save the changes.

Within 1.5-2 weeks Yandex will re-glue the mirrors and re-index the pages, after which only the URLs without www will appear in the search.

Important! Previously, to specify the main mirror you had to add the Host directive to the robots.txt file. It is no longer supported. Some webmasters, "just to be safe", still include this directive and, for even greater certainty, set up a 301 redirect — this is unnecessary; it is enough to configure the mirror gluing in the webmaster panel.

How to glue mirrors in Google:

    • in Search Console, select the site from which the redirection will go;
    • click the gear icon in the upper right corner, choose "Site Settings" and select the preferred domain.

As in the case of Yandex, no additional manipulations with 301 redirects are needed, although the gluing can also be implemented with a redirect.
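If you do choose the redirect route, a common .htaccess sketch (assuming Apache with mod_rewrite, with site.ru as the main mirror — substitute your own domain) looks like this:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.site\.ru$ [NC]
RewriteRule ^(.*)$ https://site.ru/$1 [R=301,L]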

What should be done:

    • unload the list of indexed URLs from Yandex.Webmaster;
    • load this list into the SeoPult tool — as a list or via an XLS file (detailed instructions for using the tool are available);
    • run the analysis and download the result.

In this example, the pagination pages are indexed by Yandex but not by Google. The reason is that they are closed from indexing in robots.txt only for the Yandex bot. The solution is to set up canonicalization for the pagination pages.

Using the SeoPult parser, you will see whether pages are duplicated in both search engines or only in one. This will let you choose the optimal tools for solving the problem.

If you do not have the time or experience to deal with duplicates, order an audit — in addition to the duplicates you will get a lot of useful information about your resource: errors in the HTML code, headings, meta tags, structure, internal linking, usability, content optimization, and so on. As a result you will have ready-made recommendations in hand that will make the site more attractive to visitors and raise its position in the search.

Duplicate pages on sites and blogs: where they come from and what problems they can create.
That is exactly what we will talk about in this post: we will try to make sense of this phenomenon and find ways to minimize the potential trouble that duplicate pages can bring us.

So, let us continue.

What are duplicate pages?

Duplicate pages on any web resource mean that the same information is accessible at different addresses. Such pages are also called internal duplicates of the site.

If the text on the pages is completely identical, such duplicates are called full or clear. With a partial match, the duplicates are called incomplete or fuzzy.

Incomplete duplicates are category pages, product list pages and similar pages that contain announcements of the site's materials.

Full duplicate pages are print versions, pages with different extensions, archive pages, site search pages, pages with comments, and so on.

Sources of duplicate pages.

At the moment, most duplicate pages are generated by modern CMSs — content management systems, also called site engines.

These include WordPress, Joomla, DLE and other popular CMSs. This phenomenon seriously annoys site optimizers and webmasters and gives them extra trouble.

In online stores duplicates may appear when goods are displayed sorted by various attributes (manufacturer, purpose, date of manufacture, price, and so on).

You also need to remember the notorious www prefix and decide, when creating, developing and promoting the site, whether it will be part of the domain name.

As you can see, the sources of duplicates can be different; I have listed only the main ones, but they are all well known to specialists in the field.

Duplicate pages: the downside.

Despite the fact that many people pay no particular attention to the appearance of duplicates, this phenomenon can create serious problems when promoting a site.

A search engine may regard duplicates as spam and, as a result, seriously lower the positions of both those pages and the site as a whole.

When promoting a site with links, the following may happen: at some point the search engine decides that the most relevant page is the duplicate, not the one you are promoting with links, and all your efforts and expenses turn out to be in vain.

There are, however, people who try to use duplicates to pass weight to the page they need — the main page, for example, or any other.

Methods of dealing with duplicate pages

How can you avoid duplicates, or how can you reduce the negative effects when they appear?
And is it worth fighting this at all, or should everything be left to the mercy of the search engines? Let them sort it out, since they are so smart.

Using robots.txt

Robots.txt is a file placed in the root directory of our site that contains directives for search robots.

In these directives we indicate which pages of our site should be indexed and which should not. We can also specify the site's main domain and the file containing the site map.

The Disallow directive is used to forbid the indexing of pages. It is the one webmasters use to close duplicate pages off from indexing — and not only duplicates, but any other information that is not directly related to the content of the pages. For example:

Disallow: /search/ — closes the site search pages
Disallow: /*? — closes pages whose URL contains a question mark ("?")
Disallow: /20* — closes the archive pages
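Put together, a minimal robots.txt of this kind (the paths and the sitemap address are purely illustrative) might look as follows:

User-agent: *
Disallow: /search/
Disallow: /*?
Disallow: /20*

Sitemap: https://site.ru/sitemap.xml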

Using the .htaccess file

The .htaccess file (it has no name before the extension) is also placed in the root directory of the site. To combat duplicates, a 301 redirect is configured in it.
This method helps preserve the site's indicators when changing the site's CMS or changing its structure. The result is correct redirection without loss of link weight: the weight of the page at the old address is passed to the page at the new address.
A 301 redirect is also used when determining the main domain of the site — with www or without www.

Using the rel="canonical" tag

With this tag the webmaster points the search engine to the original source, that is, the page that should be indexed and participate in the ranking. This page is called canonical. The entry in the HTML code looks like this:
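(The address below is illustrative; it should point to the canonical page.)

<link rel="canonical" href="http://site.ru/kanonicheskaya-stranitsa/" />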

When using WordPress, this can be done in the settings of a useful plugin such as All in One SEO Pack.

Additional measures against duplicates for CMS WordPress

Having applied all of the above methods of dealing with duplicate pages on my blog, I kept feeling that I had not done everything I could. So, after digging around the Internet and consulting with professionals, I decided to do something more. I will now describe it.

I decided to eliminate the duplicates that are created on the blog when anchors are used — I wrote about them in the article "HTML Anchors". On blogs running WordPress, anchors are formed by the "#more" tag and by comments. Whether they are worth using is rather debatable, but they clearly breed duplicates.
Now, how I eliminated this problem.

First, let us deal with the #more tag.

I found the file where it is formed. Or rather, guessed it.
It is ../wp-includes/post-template.php
Then I found this fragment of code:

$output .= apply_filters( 'the_content_more_link', ' <a href="' . get_permalink() . "#more-{$post->ID}\" class=\"more-link\">$more_link_text</a>", $more_link_text );

I removed the fragment marked in red:

#more-{$post->ID}

and ended up with a line like this:

$output .= apply_filters( 'the_content_more_link', ' <a href="' . get_permalink() . "\" class=\"more-link\">$more_link_text</a>", $more_link_text );
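A less invasive alternative, as a rough sketch, would be to hook the same 'the_content_more_link' filter from the active theme's functions.php instead of editing the core file:

// functions.php of the active theme
// rough sketch: strips the "#more-NNN" anchor from the "read more" link
// without modifying wp-includes
add_filter( 'the_content_more_link', function ( $link ) {
    return preg_replace( '/#more-\d+/', '', $link );
} );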

Removing the #comment anchors from comments

Now let us turn to the comments. This part I figured out myself.
I likewise identified the file: ../wp-includes/comment-template.php
and found the needed fragment of code:

return apply_filters( 'get_comment_link', $link . '#comment-' . $comment->comment_ID, $comment, $args );

Similarly, I removed the fragment marked in red — very carefully, down to every single dot:

. '#comment-' . $comment->comment_ID

As a result we get the following line of code:

return apply_filters( 'get_comment_link', $link, $comment, $args );
}
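For the comment links, a similar sketch through the 'get_comment_link' filter would avoid touching wp-includes as well:

// functions.php of the active theme
// rough sketch: strips the "#comment-NNN" anchor from comment permalinks
add_filter( 'get_comment_link', function ( $link ) {
    return preg_replace( '/#comment-\d+$/', '', $link );
} );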

Naturally, I did all this only after copying the files in question to my computer, so that in case of failure it would be easy to restore everything to the state before the changes.

As a result of these changes, clicking the "Read the rest of this entry..." link now gives me a page with a canonical address, without the "#more-..." tail appended to it. Likewise, clicking on a comment gives a normal canonical address without the "#comment-..." suffix.

Thus, the number of duplicate pages on the site has decreased somewhat. But what else our WordPress will generate I cannot yet say; we will keep tracking the problem.

In conclusion, I offer you a very good and informative video on this topic. I strongly recommend watching it.

Good health and success to you all. Until our next meetings.


Duplicate pages are one of the many reasons for losing positions in the search results and even for falling under a filter. To prevent this, you need to keep them from getting into the search engine index.

You can detect duplicates on a site and get rid of them in different ways, but the seriousness of the problem is that duplicates are not always useless pages — they simply should not be in the index.

We will solve this problem now, but first let us find out what duplicates are and how they arise.

What duplicate pages are

A duplicate page is a copy of the content of the canonical (main) page, but with a different URL. It is important to note here that duplicates can be either full or partial.

Full duplication is an exact copy with its own address; the difference may show up in a trailing slash, the www prefix, or substituted parameters such as index.php?, page=1, page/1, and so on.

Partial duplication shows up as incomplete copying of content and is tied to the site structure: it happens when the article announcements of a catalog, archives, sidebar content, pagination pages and other site-wide elements present on the canonical page get indexed. This is inherent in most CMSs and in online stores, where the catalog is an integral part of the structure.

We have already spoken about the consequences of duplicates appearing: the link mass gets split between the duplicates, pages get swapped in the index, the content loses its uniqueness, and so on.

How to find duplicate pages on a site

The following methods can be used to search for duplicates:

  • the Google search bar. Using the construct site:myblog.ru, where myblog.ru is your URL, you get the pages from the main index. To see the duplicates, go to the last page of the search results and click the "Show hidden results" link;
  • the "Advanced search" feature in Yandex. Enter your site's address in the special field and, in quotes, one of the sentences of the indexed article being checked — you should get only one result. If there are more, there are duplicates;
  • the toolbars for webmasters in the search engines;
  • manually, by adding a trailing slash, www, html, asp, php, and upper- or lower-case letters to the address bar. In all cases a redirect must occur to the page with the main address;
  • special programs and services: Xenu, MegaIndex, etc.

Removing duplicate pages

There are also several ways to remove duplicates. Each has its own impact and consequences, so there is no point in declaring any one of them the most effective. Remember that physically destroying an indexed duplicate is not a way out: the search engines will still remember it. Therefore, the best method of dealing with duplicates is to prevent their appearance by configuring the site correctly.

Here are some of the ways to eliminate duplicates:

  • Configuring robots.txt. This lets you close specific pages off from indexing. But while Yandex robots respect this file, Google picks up even closed pages without paying much attention to its recommendations. Besides, removing already indexed duplicates via robots.txt is very difficult;
  • A 301 redirect. It glues a duplicate to the canonical page. The method works, but is not always useful: it cannot be applied when the duplicates must remain as independent pages yet should not be indexed;
  • Assigning a 404 error to indexed duplicates. The method is very good for removing them, but it will take some time before the effect shows (a server-side sketch follows this list).
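For the redirect and error-code options, the server-side part often comes down to a couple of lines in .htaccess (assuming Apache; the paths here are illustrative):

# glue a duplicate to the canonical page
Redirect 301 /catalog/kepki/ https://site.ru/kepki/

# make a removed duplicate answer with 410 Gone (a plain 404 works as well)
Redirect gone /old-duplicate-page/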

When there is nothing to glue and nothing to delete, but you do not want to lose page weight or get a penalty from the search engines, the rel canonical href attribute is used.

The rel canonical attribute in the fight against duplicates

I will start with the example. In the online store there are two pages with identical content cards, but on the same goods are alphabetically, and on the other in cost. Both are needed and redirected is not allowed. At the same time, for search engines it is a clear double.

In this case, rational use of the tag link Rel Canonicalindicating the canonical page that is indexed, but the main page remains available to users.

This is done as follows: In the HEAD block of pages-duplicate, reference is specified. "Link REL \u003d" Canonical "href \u003d" http://site.ru/osnovnaya stranitsa "/"where Stranitsa is the address of the canonical page.

With this approach, the user can freely visit any page of the site, but a robot, reading the Rel Canonical attribute code, will go index only the address of which is listed in the link.

This attribute can also be useful for pagination pages. In this case a "Show all" page is created (one long page) and taken as canonical, and the pagination pages send the robot to it via rel canonical.
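On the pagination pages themselves the entry might look like this (the URLs are purely illustrative):

<!-- on http://site.ru/catalog/?page=2, ?page=3 and so on -->
<link rel="canonical" href="http://site.ru/catalog/show-all/" />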

Thus, the choice of the method of fighting page duplication depends on the nature of their origin and on whether they need to be present on the site.