
Which pages to close from indexing and how. Preventing page indexing in the robots meta tag

Recently the author of this guide shared an observation with me: many sites that come to us for audit have the same errors. Moreover, these mistakes cannot always be called trivial - even experienced webmasters make them. So the idea arose to write a series of articles with instructions for finding and fixing such errors. First in line is a guide to setting up site indexing. I give the floor to the author.

For a site to be indexed well and its pages to rank better, the search engine must crawl the key promoted pages of the site and be able to accurately pick out the main content on those pages without getting lost in an abundance of service and auxiliary information.
Websites that come to us for analysis have two types of errors:

1. When promoting a site, its owners do not think about what the search bot sees and adds to the index. In this case, the index may end up containing more junk pages than promoted pages, and the pages themselves may be overloaded.

2. On the contrary, the owners were too zealous in cleaning up the site. Along with unnecessary information, data important for promotion and page evaluation may have been hidden.

Today we want to consider what is really worth hiding from search robots and how best to do it. Let's start with the content of the pages.

Content

Problems related to closing content on the site:

Search robots evaluate a page comprehensively, not only by text indicators. When site owners get carried away with closing various blocks, they often remove information that matters for evaluating the page's usefulness and for ranking.

Here are the most common mistakes:
- the site header is hidden. It usually contains contact information and links. If the header is closed, search engines may not know that you have taken care of visitors and placed important information in a prominent place;

- filters, the search form and sorting are hidden from indexing. The presence of such features in an online store is an important commercial signal that is better to show than to hide;
- information about payment and delivery is hidden. This is done to increase the uniqueness of product cards, but it is also information that a high-quality product card should contain;
- the menu is "cut out" of the pages, which impairs the assessment of how easy it is to navigate the site.

Why is part of the content closed on the site?
There are usually several goals:
- to focus attention on the main content of the page by removing auxiliary information, service blocks and menus from the index;
- to make the page more unique and useful by removing blocks duplicated across the site;
- to remove "extra" text and increase the text relevance of the page.

All of this can be achieved without hiding some of the content!
Do you have a very large menu?
Display on the pages only those items that are directly related to the section.

Many choices in filters?
Display only popular ones in the main code. Load the rest of the options only if the user clicks the "show all" button. Yes, scripts are used here, but there is no deception - the script is triggered at the user's request. The search engine will be able to find all the items, but when evaluated, they will not receive the same value as the main content of the page.

Is there a large news block on the page?
Reduce the number of items, display only the headlines, or simply remove the news block if users rarely follow its links or if the page has little main content.

Search robots, although far from ideal, are constantly improving. Google already flags scripts blocked from indexing as an error in the Google Search Console panel (the "Blocked Resources" report). Hiding part of the content from robots can indeed be useful, but this is not an optimization method - rather a temporary "crutch" to be used only when absolutely necessary.

We recommend:
- treat content hiding as a "crutch" and resort to it only in extreme situations, trying instead to improve the page itself;
- when removing part of the content from a page, look not only at text indicators but also at usability and at the information that affects the page's evaluation;
- before hiding content, run an experiment on several test pages. Search bots are able to parse pages, and your fears about a drop in relevance may be unfounded.

Let's take a look at the methods used to hide content:

Noindex tag

This method has several disadvantages. First of all, the tag is only taken into account by Yandex, so it is useless for hiding text from Google. In addition, it is important to understand that the tag only prohibits indexing and displaying the text in search results; other content inside it, such as links, is not affected.
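For illustration, a minimal sketch of how such a block is usually marked up (Yandex also documents a comment-style variant, which keeps the HTML valid):

<noindex>Text that should be hidden from Yandex.</noindex>

<!--noindex-->Text that should be hidden from Yandex.<!--/noindex-->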

Yandex support doesn't really cover how noindex works. There is a little more information in one of the discussions on the official blog.

User question:

“The mechanics of how the noindex tag works and its influence on ranking are not entirely clear. Below I will explain why this is so puzzling. For now there are two hypotheses, and I would like to find out which is true.

# 1 Noindex does not affect the ranking / relevance of the page at all

Under this assumption, the only thing it does is block part of the content from appearing in search results. In this case, the page is evaluated as a whole, including the closed blocks, and relevance and related parameters (uniqueness, compliance, etc.) are calculated over all the content in the code, even the closed parts.

# 2 Noindex affects ranking and relevance, since content closed in the tag is not rated at all. Accordingly, the opposite is true. The page will be ranked according to the content open to robots. "

When the tag can be useful:
- if there is a suspicion that a page has been downgraded in Yandex search results due to over-optimization, while it still occupies top positions for important phrases in Google. Understand that this is a quick and temporary fix: if the entire site falls under "Baden-Baden", noindex, as Yandex representatives have repeatedly confirmed, will not help;
- to hide general legal or corporate information that you are required to display on the page;
- to correct snippets in Yandex if they pick up unwanted content.

Hiding content with AJAX

This is a universal method: it allows you to hide content from both Yandex and Google. If you want to clean a page of content that dilutes its relevance, this is the better option. Search engine representatives, of course, do not welcome this method and recommend that search robots see the same content as users.
The technology behind AJAX is widespread, and as long as you do not engage in outright cloaking, there is no threat of sanctions for using it. The disadvantage is that you still have to block access to the scripts, although Yandex and Google do not recommend doing this.

Site pages

For successful promotion, it is important not only to get rid of unnecessary information on the pages, but also to clear the search index of useless junk pages.
First, this speeds up the indexing of the main promoted pages of the site. Second, a large number of junk pages in the index negatively affects how the site is evaluated and promoted.

Let's immediately list the pages that are advisable to hide:

- order and request checkout pages, user carts;
- site search results;
- personal information of users;
- product comparison results pages and similar auxiliary modules;
- pages generated by search filters and sorting;
- pages of the administrative part of the site;
- print versions.

Let's consider the ways in which you can close pages from indexing.

Close in robots.txt

This is not the best method.

Firstly, the robots file is not designed to combat duplicates and clean sites from junk pages. For these purposes, it is better to use other methods.

Secondly, a robots file is not a guarantee that a page will not be indexed.

Google's help says directly that robots.txt is not a way to keep a page out of the search results: a page blocked in robots.txt can still be indexed if other sites link to it.

Noindex meta tag

To ensure that pages are excluded from the index, it is best to use this meta tag.

Below is a variant of the meta tag that both search engines understand:
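<meta name="robots" content="noindex, nofollow" />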

An important point!

For Googlebot to see the noindex meta tag, you need to open access to pages that are closed in the robots.txt file. If this is not done, the robot may simply not go to these pages.

X-Robots-Tag Headers

A significant advantage of this method is that the ban can be placed not only in the page code, but also through the root .htaccess file.
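As a rough sketch (assuming Apache with mod_headers enabled), such a rule in .htaccess might look like this; the .pdf mask is only an illustration:

# prohibit indexing of all PDF files via an HTTP response header
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>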

This method is not very common on the Russian-speaking internet. We believe the main reason is that Yandex did not support it for a long time.
This year Yandex employees wrote that the method is now supported.

The support response can hardly be called detailed. Before prohibiting indexing with X-Robots-Tag, it is better to make sure the method actually works for Yandex. We have not yet run our own experiments on this, but perhaps we will in the near future.

Password protection

If you need to hide an entire site - for example, a test version - we also recommend this method. Perhaps the only drawback is that it can be difficult to crawl a password-protected domain with analysis tools.
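A minimal sketch of such protection via .htaccess (the path to the .htpasswd file here is hypothetical; the file itself is created with the htpasswd utility):

AuthType Basic
AuthName "Restricted area"
AuthUserFile /home/user/.htpasswd
Require valid-user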

Eliminate junk pages with AJAX

The point is not just to prohibit indexing of pages generated by filters, sorting, etc., but not to create such pages on the site at all.

For example, if a user selected a set of parameters in the search filter for which you did not create a separate page, changes in the products displayed on the page occur without changing the URL itself.

The difficulty with this method is that it usually cannot be applied to all cases at once. Some of the generated pages are used for promotion.

For example, filter pages. For "refrigerator + Samsung + white" we need a page, but for "refrigerator + Samsung + white + two-compartment + no frost" we don't.

Therefore, you need to make a tool that involves the creation of exceptions. This complicates the task of programmers.

Use the search engines' own tools to prohibit indexing

URL Parameters in Google Search Console

This tool lets you tell Google how it should treat new parameters that appear in page URLs.

Clean-param directive in robots.txt

In Yandex, a similar ban for URL parameters can be set using the Clean-param directive.
You can read about it in the Yandex help.
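A minimal sketch (the parameter names and the path are just an illustration): it tells the Yandex robot to treat URLs under /catalog/ that differ only in the listed tracking parameters as one page:

User-agent: Yandex
Clean-param: utm_source&utm_medium&utm_campaign /catalog/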

Canonical addresses as prevention of garbage pages on the site
This attribute of the link tag was created specifically to combat duplicates and junk pages on the site. We recommend specifying it throughout the site as prevention against duplicate and garbage pages appearing in the index.
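For reference, a sketch of such a tag in the page's <head> (the URL is hypothetical):

<link rel="canonical" href="https://site.ru/catalog/smartphones/" />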

Tools for spot deletion of pages from the Yandex and Google index

If you urgently need to remove information from the index without waiting for search robots to notice your ban, you can use the tools in the Yandex.Webmaster panel and Google Search Console.

In Yandex, this is "Remove URL":

In Google Search Console "Remove URL":

Internal links

Internal links are sometimes closed from indexing in order to redistribute internal link weight to the main promoted pages. But keep in mind:
- such redistribution can harm the overall linking structure between pages;
- links in template (site-wide) blocks usually carry less weight or may not be counted at all.

Consider the options that are used to hide links:

Noindex tag

This tag is useless for hiding links. It only applies to text.

Rel = "nofollow" attribute

Currently, the attribute does not allow you to save weight on the page. Using rel = ”nofollow” simply loses weight. By itself, using the tag for internal links does not seem very logical.
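For illustration, a hypothetical internal link marked this way:

<a href="/compare/" rel="nofollow">Compare products</a>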

Hiding links with scripts

This is in fact the only working method for hiding links from search engines: you can use AJAX and load link blocks after the page has loaded, or have a script insert the links in place of a neutral placeholder tag. Keep in mind that search algorithms are able to recognize scripts.

As with content, this is a crutch that can sometimes solve a problem. If you are not sure that you will get a positive effect from the hidden link block, it is better not to use such methods.

Conclusion

Removing bulky end-to-end blocks from a page can really have a positive effect on ranking. It is better to do this by shortening the page and displaying only the content that visitors need. Hiding content from a search engine is a crutch, which should be used only in cases where it is impossible to reduce end-to-end blocks in other ways.

When removing some of the content from the page, do not forget that not only text criteria are important for ranking, but also completeness of information and commercial factors.

The situation is similar with internal links. Yes, sometimes it can be useful, but artificially redistributing link weight within the site is a questionable method. It is much safer and more reliable to simply remove links you are not sure about.

With the site's pages, things are more clear-cut: it is important to ensure that junk pages of little use do not end up in the index. There are many methods for this, which we have collected and described in this article.

You can always ask us for advice on the technical aspects of optimization, or order turnkey promotion.

Most robots are well designed and do not pose any problems for site owners. But if a bot is written by an amateur, or if "something goes wrong", it can create a significant load on the site it crawls. By the way, spiders do not enter the server like viruses at all - they simply request the pages they need remotely (in effect they are analogous to browsers, but without the page-viewing function).

Robots.txt - user-agent directive and search engine bots

Robots.txt has a completely uncomplicated syntax, which is described in great detail, for example, in the Yandex help and the Google help. It usually specifies which search bot the directives below are intended for ("User-agent"), the allowing ("Allow") and prohibiting ("Disallow") rules themselves, and "Sitemap" is also actively used to tell search engines exactly where the sitemap file is located.

The standard was created a long time ago, and some things were added later. There are directives and formatting rules that only the robots of certain search engines understand. On the Russian-speaking web, only Yandex and Google matter, which means it is their help sections on compiling robots.txt that you should study in detail (I gave the links in the previous paragraph).

For example, the Yandex search engine used to find it useful to have the main mirror of your web project indicated in the special "Host" directive, which only this search engine understood (well, Mail.ru too, since their search is powered by Yandex). True, at the beginning of 2018 Yandex cancelled Host, and its function, as with other search engines, is now performed by a 301 redirect.

Even if your resource has no mirrors, it is useful to indicate which spelling of the address (with or without www) is the main one.

Now let's talk a little about the syntax of this file. Robots.txt directives look like this:

<field>:<space><value><space>

The correct code should contain at least one "Disallow" directive after each "User-agent" entry. An empty file assumes permission to index the entire site.

User-agent

The User-agent directive should contain the name of the search bot. With it, you can configure rules of behavior for each specific search engine (for example, create a ban on indexing a certain folder only for Yandex). An example of a "User-agent" entry addressed to all bots visiting your resource looks like this:

User-agent: *

If you want to set certain conditions in the "User-agent" only for one bot, for example, Yandex, then you need to write as follows:

User-agent: Yandex

The names of search engine robots and their role in the robots.txt file

Every search engine's bot has its own name (for example, for Rambler it is StackRambler). Here I will list the most famous ones:

Google - http://www.google.com - Googlebot
Yandex - http://www.ya.ru - Yandex
Bing - http://www.bing.com/ - bingbot

Major search engines, in addition to their main bots, often have separate bots for indexing blogs, news, images, and so on. A lot of information about the bot types can be found in the help sections (for Yandex) and (for Google).

So what should you do? If you need a prohibition rule that all types of Google robots must obey, use the name Googlebot and all the other spiders of this search engine will follow it too. However, you can also prohibit, for example, only the indexing of images by specifying Googlebot-Image as the User-agent. This may not be very clear right now, but with examples it will be easier.

Examples of using the Disallow and Allow directives in robots.txt

I will give a few simple examples of using the directives, with an explanation of what they do.

  1. The code below allows all bots (indicated by the asterisk in User-agent) to index all content without any exceptions. This is done with an empty Disallow directive: User-agent: * Disallow:
  2. The following code, on the contrary, completely prohibits all search engines from adding pages of this resource to the index. This is done with Disallow and "/" in the value field: User-agent: * Disallow: /
  3. In this case, all bots are prohibited from viewing the contents of the /image/ directory (http://mysite.ru/image/ is the absolute path to this directory): User-agent: * Disallow: /image/
  4. To block a single file, it is enough to specify its absolute path: User-agent: * Disallow: /katalog1/katalog2/private_file.html

    Running a little ahead, I will say that it is easier to use the asterisk (*) symbol so as not to write the full path:

    Disallow: /*private_file.html

  5. In the example below, the "image" directory will be prohibited, as well as all files and directories starting with the characters "image", i.e. files "image.htm", "images.htm" and directories "image", "images1", "image34", etc.: User-agent: * Disallow: /image The point is that, by default, an asterisk is implied at the end of the record, matching any characters, including none at all. Read about it below.
  6. The Allow directive permits access and complements Disallow well. For example, with the following rules we prohibit the Yandex search robot from downloading (indexing) everything except web pages whose address begins with /cgi-bin: User-agent: Yandex Allow: /cgi-bin Disallow: /

    Well, or such an obvious example of using the Allow and Disallow combination:

    User-agent: * Disallow: /catalog Allow: /catalog/auto

  7. When describing paths in Allow and Disallow directives, you can use the symbols "*" and "$", thus setting certain logical expressions.
    1. The "*" symbol (asterisk) means any sequence of characters, including an empty one. The following example prohibits all search engines from indexing files with the ".php" extension: User-agent: * Disallow: *.php$
    2. Why is the $ (dollar) sign needed at the end? By the logic of the robots.txt file, a default asterisk is added at the end of every directive (it is not written, but it is effectively there). For example, we write: Disallow: /images

      implying that this is the same as:

      Disallow: /images*

      I.e. this rule prohibits indexing of all files (web pages, images and other file types) whose address begins with /images, followed by anything at all (see the example above). The $ symbol simply cancels that implied trailing asterisk. For example:

      Disallow: /images$

      only prohibits indexing of the /images file, but not /images.html or /images/primer.html. And in the first example we prohibited indexing only of files ending in .php (having that extension), so as not to catch anything extra:

      Disallow: *.php$

  • In many engines, users get human-readable (SEF) URLs, while URLs generated by the system contain a question mark "?". You can take advantage of this and write the following rule in robots.txt: User-agent: * Disallow: /*?

    The asterisk after the question mark suggests itself but, as we found out a little higher, it is already implied at the end. Thus we prohibit the indexing of search pages and other service pages created by the engine that the search robot can reach. This will not be superfluous, because the question mark is most often used by a CMS as a session identifier, which can lead to duplicate pages in the index.

  • Sitemap and Host directives (for Yandex) in Robots.txt

    In order to avoid unpleasant problems with site mirrors, it was previously recommended to add the Host directive to robots.txt, which pointed the Yandex bot to the main mirror.

    Host directive - specifies the main site mirror for Yandex

    For example, before you switched to the secure protocol, it was not necessary to specify the full URL in Host - just the domain name (without http://, i.e. myhost.ru). If you have already switched to https, then you need to specify the full URL (like https://myhost.ru).

    A great tool for dealing with duplicate content: the search engine simply will not index a page if a different URL is specified in Canonical. For example, for such a page of my blog (a page with pagination), Canonical points to the main page, and there should be no problems with duplicate titles.

    But I was distracted ...

    If your project is based on any engine, content duplication will most likely occur, which means you need to fight it, including with a ban in robots.txt and especially in the meta tag, because in the first case Google may ignore the ban, but it cannot ignore the meta tag (that is how it is built).

    For example, in WordPress, pages with very similar content can get into the search engine index if indexing is allowed for category contents, tag archive contents and temporary archives. But if, using the Robots meta tag described above, you prohibit the tag archive and the temporary archive (or keep the tags and instead disable indexing of category contents), then there will be no content duplication. How to do this is described at the link given just above (for the All in One SEO Pack plugin).

    Summing up, I will say that the Robots file is designed to set global rules for denying access to entire directories of the site, or to files and folders whose names contain specified characters (by mask). You can see examples of setting such prohibitions just above.

    Now let's look at specific examples of robots.txt designed for different engines: Joomla, WordPress and SMF. Naturally, the three variants created for different CMSs will differ significantly (if not radically) from each other. True, they all have one thing in common, and it is connected with the Yandex search engine.

    Because Yandex carries quite a lot of weight on the Russian-speaking web, you need to take into account all the nuances of how it works, and here the Host directive will help us. It explicitly points this search engine to the main mirror of your site.

    For it, a separate User-agent block intended only for Yandex is advised (User-agent: Yandex). This is because other search engines may not understand Host, and including it in a User-agent record intended for all search engines (User-agent: *) could lead to negative consequences and incorrect indexing.

    It is hard to say how things really stand, because search algorithms are a thing in themselves, so it is better to do as advised. But in that case you will have to duplicate in the User-agent: Yandex block all the rules you set for User-agent: *. If you leave User-agent: Yandex with an empty Disallow:, you will allow Yandex to go anywhere and drag everything into the index.

    Robots for WordPress

    I will not give the example file recommended by the developers - you can look it up yourself. Many bloggers do not restrict the Yandex and Google bots at all in their walks through the content of the WordPress engine. Most often on blogs you can find a robots.txt filled in automatically by a plugin.

    But, in my opinion, you should still help the search engines with the difficult task of separating the wheat from the chaff. First, it will take the Yandex and Google bots a lot of time to index this junk, and there may be no time left for adding web pages with your new articles to the index. Second, bots crawling through the engine's junk files create an additional load on your host's server, which is not good.

    You can look at my version of this file for yourself. It is old and has not changed for a long time, but I try to follow the principle of "don't fix what isn't broken"; it is up to you whether to use it, make your own, or copy someone else's. Until recently I also had a ban on indexing pages with pagination there (Disallow: */page/), but I removed it, relying on Canonical, which I wrote about above.

    But in general, the single correct file for WordPress probably does not exist. You can, of course, implement any requirements you like in it, but who is to say they will be correct? There are many variants of the "ideal" robots.txt on the web.

    I will give two extremes:

    1. You can find a megafile with detailed explanations (the # symbol introduces comments, which would be better removed in a real file):

      User-agent: *               # general rules for robots, except Yandex and Google, since their rules are below
      Disallow: /cgi-bin          # folder on the hosting
      Disallow: /?                # all request parameters on the main page
      Disallow: /wp-              # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
      Disallow: /wp/              # if there is a subdirectory /wp/ where the CMS is installed (if not, the rule can be deleted)
      Disallow: *?s=              # search
      Disallow: *&s=              # search
      Disallow: /search/          # search
      Disallow: /author/          # author archive
      Disallow: /users/           # authors archive
      Disallow: */trackback       # trackbacks, notifications in comments about an open link to an article
      Disallow: */feed            # all feeds
      Disallow: */rss             # rss feed
      Disallow: */embed           # all embeds
      Disallow: */wlwmanifest.xml # Windows Live Writer xml manifest file (if not used, the rule can be deleted)
      Disallow: /xmlrpc.php       # WordPress API file
      Disallow: *utm=             # links with utm tags
      Disallow: *openstat=        # links with openstat tags
      Allow: */uploads            # open the folder with uploads

      User-agent: GoogleBot       # rules for Google (comments not duplicated)
      Disallow: /cgi-bin
      Disallow: /?
      Disallow: /wp-
      Disallow: /wp/
      Disallow: *?s=
      Disallow: *&s=
      Disallow: /search/
      Disallow: /author/
      Disallow: /users/
      Disallow: */trackback
      Disallow: */feed
      Disallow: */rss
      Disallow: */embed
      Disallow: */wlwmanifest.xml
      Disallow: /xmlrpc.php
      Disallow: *utm=
      Disallow: *openstat=
      Allow: */uploads
      Allow: /*/*.js              # open js scripts inside /wp- (/*/ for priority)
      Allow: /*/*.css             # open css files inside /wp- (/*/ for priority)
      Allow: /wp-*.png            # images in plugins, cache folder, etc.
      Allow: /wp-*.jpg            # images in plugins, cache folder, etc.
      Allow: /wp-*.jpeg           # images in plugins, cache folder, etc.
      Allow: /wp-*.gif            # images in plugins, cache folder, etc.
      Allow: /wp-admin/admin-ajax.php  # used by plugins so as not to block JS and CSS

      User-agent: Yandex          # rules for Yandex (comments not duplicated)
      Disallow: /cgi-bin
      Disallow: /?
      Disallow: /wp-
      Disallow: /wp/
      Disallow: *?s=
      Disallow: *&s=
      Disallow: /search/
      Disallow: /author/
      Disallow: /users/
      Disallow: */trackback
      Disallow: */feed
      Disallow: */rss
      Disallow: */embed
      Disallow: */wlwmanifest.xml
      Disallow: /xmlrpc.php
      Allow: */uploads
      Allow: /*/*.js
      Allow: /*/*.css
      Allow: /wp-*.png
      Allow: /wp-*.jpg
      Allow: /wp-*.jpeg
      Allow: /wp-*.gif
      Allow: /wp-admin/admin-ajax.php
      Clean-Param: utm_source&utm_medium&utm_campaign  # Yandex recommends not closing such URLs from indexing but stripping the tag parameters; Google does not support this directive
      Clean-Param: openstat                            # similarly

      # Specify one or more Sitemap files (no need to duplicate them for each User-agent).
      # Google XML Sitemaps creates 2 sitemaps, as in the example below.
      Sitemap: http://site.ru/sitemap.xml
      Sitemap: http://site.ru/sitemap.xml.gz

      # Specify the main mirror of the site as in the example below (with or without WWW; if HTTPS,
      # write the protocol; if you need to specify the port, indicate it). Host is understood by
      # Yandex and Mail.RU; Google does not take it into account.
      Host: www.site.ru
    2. But you can take an example of minimalism:

      User-agent: *
      Disallow: /wp-admin/
      Allow: /wp-admin/admin-ajax.php
      Host: https://site.ru
      Sitemap: https://site.ru/sitemap.xml

    The truth probably lies somewhere in the middle. Also, do not forget to add the Robots meta tag for "extra" pages, for example with the plugin mentioned above; it will also help you configure Canonical.

    Correct robots.txt for Joomla

    User-agent: *
    Disallow: /administrator/
    Disallow: /bin/
    Disallow: /cache/
    Disallow: /cli/
    Disallow: /components/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /layouts/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/

    In principle, almost everything is taken into account here and it works well. The only thing is to add a separate User-agent: Yandex block to it in order to insert the Host directive defining the main mirror for Yandex, and also to specify the path to the Sitemap file.

    Therefore, in the final form, the correct robots for Joomla, in my opinion, should look like this:

    User-agent: Yandex
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /component/tags*
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php
    Host: vash_sait.ru (or www.vash_sait.ru)

    User-agent: *
    Allow: /*.css?*$
    Allow: /*.js?*$
    Allow: /*.jpg?*$
    Allow: /*.png?*$
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php

    Sitemap: http://path to your sitemap in XML format

    Yes, also note that the second variant contains Allow directives permitting the indexing of styles, scripts and images. This is written specifically for Google, because its Googlebot sometimes complains that it is prohibited from indexing these files, for example in the folder of the theme in use, and even threatens to lower the ranking for this.

    Therefore, in advance, we allow the whole thing to be indexed using Allow. By the way, the same was in the example file for WordPress.

    Good luck to you! See you soon on the pages of the blog site




    CMS Joomla has one drawback: duplicate page addresses. A duplicate is when one article is available at two addresses. For example:

    http://site/dizayn/ikonki-sotsial-noy-seti-vkonrtakte.html

    index.php?option=com_content&view=article&id=99:vkontakteicons&catid=5:design&Itemid=5

    How do duplicate pages appear? Very simply: in the example above we see two links to the same material. The first is a nice human-readable (SEF) link created by the JoomSEF component, which converts all links on the site into this readable form. The second is Joomla's internal system link; if the Artio JoomSEF component were not installed, all links on the site would look like the second one - unclear and ugly. Now about how scary this is and how to deal with duplicates.

    How harmful are duplicates for a site? I would not call them a very big drawback, since in my opinion search engines should not heavily ban or pessimize a site for such duplicates: they are not made on purpose but are part of the CMS. Moreover, this is a very popular system on which millions of sites are built, which means search engines have learned to understand this "feature". But still, if there is the opportunity and desire, it is better to hide such duplicates from the eyes of big brother.

    How to deal with duplicates in Joomla and other cms

    1) Two duplicates of one page blocked in robots.txt

    For example, the following two addresses of one page are included in the search engine index:

    http://site.ru/page.html?replytocom=371
    http://site.ru/page.html?iframe=true&width=900&height=450

    To close such duplicates in robots.txt you need to add:

    Disallow: /*?*
    Disallow: /*?

    With this we have closed from indexing all site links containing the "?" character. This option is suitable for sites where SEF (human-readable) URLs are enabled and normal links do not contain question marks.

    2. Use the rel = "canonical" tag

    Let's say there are two links on the same page with different addresses. Google search engines and Yahoo can specify which URL on the page is the main one. To do this, in the tag you need to add the rel = "canonical" tag. Yandex does not support this option.

    For Joomla, I found two extensions for setting the rel="canonical" tag: 1) plg_canonical_v1.2 and 2) 098_mod_canonical_1.1.0. You can test them. But I would act differently and simply prohibit indexing of all links that contain a question mark, as shown in the example above.

    3. Prohibit indexing of Joomla duplicates (pages starting with index.php) and other unnecessary pages in robots.txt.

    Since all duplicate pages in Joomla begin with index.php, you can prohibit them all from being indexed with one line in robots.txt: Disallow: /index.php. This also prohibits the duplicate home page that is available at both "http://site.ru/" and "http://site.ru/index.php".

    4. Gluing a domain with and without www using 301 redirects (redirects).

    To glue the domain with www and without it, you need to set up a 301 redirect. To do this, write in the .htaccess file:

    RewriteEngine on
    RewriteCond %{HTTP_HOST} ^www.site.ru
    RewriteRule ^(.*)$ http://site.ru/$1 [R=301,L]

    If you need to redirect from http://site.ru to www.site.ru on the contrary, the entry will look like this:

    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^site.ru
    RewriteRule ^(.*)$ http://www.site.ru/$1 [R=301,L]

    5. The Host directive defines the main domain with or without www for Yandex.

    For those webmasters who have just created their site: do not rush to carry out the steps described in this point; first you need to compose a correct robots.txt and add the Host directive, which defines the main domain in the eyes of Yandex.

    It will look like this:

    User-Agent: Yandex
    Host: site.ru

    The Host directive is understood only by Yandex. Google doesn't understand it.

    6. Joomla duplicate pages are glued together in the .htaccess file.

    Very often the main page of a Joomla site is available at http://site.ru/index.html or http://site.ru/index.php, etc. - these are duplicates of the main page (http://site.ru). Of course, you can get rid of them by closing them in robots.txt, but it is better to do it via .htaccess. To do this, add the following to that file:

    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\ HTTP/
    RewriteRule ^index\.php$ http://your_site.ru/ [R=301,L]

    Use this code if you need to get rid of the index.php duplicate; do not forget to put your own domain in the code instead of http://your_site.ru/.

    To check whether it worked, simply enter the duplicate address (http://site.ru/index.php) in the browser: if everything is fine, you will be redirected to http://site.ru. The same happens with search bots, and they will not see these duplicates.

    By analogy, you can glue Joomla duplicates with other suffixes to the URL of your main page - just edit the code given above.

    7. Specify sitemap in robots.txt

    Although this does not relate to duplicates, since we are at it anyway, I also recommend specifying the path to the XML sitemap in the robots.txt file for search engines:

    Sitemap: http://domain.ru/sitemap.xml.gz
    Sitemap: http://domain.ru/sitemap.xml

    Outcome

    To summarize the above, for Joomla I would write these lines in robots.txt:

    Disallow: /index.php

    Specify the main host for Yandex

    User-Agent: Yandex
    Host: site.ru

    And these are the lines in .htaccess

    # Gluing the domain with and without www

    RewriteEngine on
    RewriteCond %{HTTP_HOST} ^www.site.ru
    RewriteRule ^(.*)$ http://site.ru/$1 [R=301,L]

    # Gluing duplicate pages

    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\ HTTP/
    RewriteRule ^index\.php$ http://your_site.ru/ [R=301,L]

    If you use other ways to eliminate duplicates, you know how to improve the above, or you just have something to say on this topic - write, I’m waiting in the comments.

    How can I prevent the indexing of certain pages?

    All search engines take permissions and prohibitions for indexing from the robots.txt file located in the root directory of the server. A ban on indexing certain pages may be needed, for example, for reasons of confidentiality or out of a desire not to index identical documents in different encodings. The smaller your server, the faster the robot will crawl it. Therefore, prohibit in robots.txt all documents that make no sense to index (for example, statistics files or directory listings). Pay special attention to CGI and ISAPI scripts - our robot indexes them along with other documents.

    In its simplest form (everything is allowed except the script directory) the robots.txt file looks like this:

    User-Agent: *
    Disallow: /cgi-bin/

    A detailed description of the file specification can be found on the robots exclusion standard page.

    When writing robots.txt, pay attention to the following common mistakes:

    1. The line with the User-Agent field is required and must precede the lines with the Disallow field. For example, the following robots.txt file does not prohibit anything:

    Disallow: /cgi-bin
    Disallow: /forum

    2. Blank lines in the robots.txt file are significant: they separate entries for different robots. For example, in the following fragment of a robots.txt file, the line Disallow: /forum is ignored because it is not preceded by a line with the User-Agent field.

    User-Agent: *
    Disallow: /cgi-bin

    Disallow: /forum

    3. A line with the Disallow field can prohibit indexing of documents with only one prefix. To prohibit several prefixes, you need to write several lines. For example, the file below prohibits indexing of documents starting with "/cgi-bin /forum", which most likely do not exist (and not documents with the prefixes /cgi-bin and /forum).

    User-Agent: *
    Disallow: /cgi-bin /forum

    4. Lines with the Disallow field contain relative prefixes, not absolute ones. That is, the file

    User-Agent: *
    Disallow: www.myhost.ru/cgi-bin

    prohibits, for example, indexing of the document http://www.myhost.ru/www.myhost.ru/cgi-bin/counter.cgi, but does NOT prohibit indexing of the document http://www.myhost.ru/cgi-bin/counter.cgi.

    5. Lines with the Disallow field specify prefixes and nothing else. Thus, the file:

    User-Agent: *
    Disallow: *

    prohibits indexing of documents starting with the "*" character (which do not exist in nature), and is very different from a file:

    User-Agent: *
    Disallow: /

    which prohibits indexing of the entire site.

    If you cannot create or modify the robots.txt file, all is not lost: just add the Robots meta tag to the HTML code of your page (inside the <head> tag):
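    <meta name="robots" content="noindex" />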

    Then this document also will not be indexed.

    You can also use the tag
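    <meta name="robots" content="nofollow" />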

    It means that the search engine robot should not follow the links from this page.

    To simultaneously prohibit indexing of the page and crawl links from it, use the tag
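    <meta name="robots" content="noindex, nofollow" />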

    How to prevent indexing of certain parts of the text?

    To prevent indexing of certain portions of text in the document, mark them with tags
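    <noindex>the part of the text that should not be indexed</noindex>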

    Attention! The NOINDEX tag must not break the nesting of other tags. If you specify the following erroneous construction:


    <noindex>
    ... code1 ...
    <table>
    ... code2 ...
    </noindex>
    ... code3 ...
    </table>
    the ban on indexing will include not only "code1" and "code2", but also "code3".

    How do I select a master virtual host from multiple mirrors?

    If your site is located on the same server (one IP), but is visible in the outside world under different names (mirrors, different virtual hosts), Yandex recommends that you choose the name under which you want to be indexed. Otherwise, Yandex will choose the main mirror on its own, and the rest of the names will be prohibited from indexing.

    In order for your chosen mirror to be indexed, it is enough to prohibit indexing of all the other mirrors using robots.txt. This can be done with the non-standard robots.txt extension - the Host directive - specifying the name of the main mirror as its parameter. If www.glavnoye-zerkalo.ru is the main mirror, then robots.txt should look something like this:

    User-Agent: *
    Disallow: /forum
    Disallow: /cgi-bin
    Host: www.glavnoye-zerkalo.ru

    For compatibility with robots that do not fully follow the standard when processing robots.txt, the Host directive must be added in the group starting with the User-Agent record, immediately after the Disallow records.

    The argument of the Host directive is the domain name, optionally followed by a colon and a port number (80 by default). If a site is not specified as the argument of Host, it is assumed to have a Disallow: / directive, i.e. a complete prohibition on indexing (as long as there is at least one correct Host directive in the group). Thus, robots.txt files of the form

    User-Agent: *
    Host: www.myhost.ru

    User-Agent: *
    Host: www.myhost.ru:80

    are equivalent and prohibit indexing of both www.otherhost.ru and www.myhost.ru:8080.

    The Host directive parameter must consist of one correct host name (i.e. one that complies with RFC 952 and is not an IP address) and a valid port number. Incorrectly composed Host lines are ignored.

    # Examples of ignored Host directives
    Host: www.myhost-.ru
    Host: www.-myhost.ru
    Host: www.myhost.ru:0
    Host: www.my_host.ru
    Host: .my-host.ru:8000
    Host: my-host.ru.
    Host: my..host.ru
    Host: www.myhost.ru/
    Host: www.myhost.ru:8080/
    Host: http://www.myhost.ru
    Host: www.mysi.te
    Host: 213.180.194.129
    Host: www.firsthost.ru, www.secondhost.ru
    Host: www.firsthost.ru www.secondhost.ru

    If you have an Apache server, then instead of the Host directive you can set up robots.txt using SSI directives:


    User-Agent: *
    Disallow: /

    In this file, the robot is prohibited from crawling all hosts except www.main_name.ru

    You can read how to enable SSI in the documentation for your server, or ask your system administrator. You can check the result by simply requesting the pages:

    http://www.main_name.ru/robots.txt
    http://www.other_name.ru/robots.txt, etc. The results should be different.

    Recommendations for the Russian Apache web server

    In robots.txt on sites running Russian Apache, all encodings except the main one should be closed to robots.

    If the encodings are split across ports (or servers), then DIFFERENT robots.txt files must be served on the different ports (servers). Namely, all robots.txt files for every port/server except the "main" one should contain:

    User-Agent: *
    Disallow: /

    To do this, you can use the SSI mechanism described above.

    If the encodings in your Apache are distinguished by the names of "virtual" directories, then you need to write a single robots.txt containing approximately the following lines (depending on the directory names):

    User-Agent: *
    Disallow: /dos
    Disallow: /mac
    Disallow: /koi