Sunday, April 29th, 2007...11:14 am
Eliminate Duplicate Content Using robots.txt
A few weeks ago I put together a quick SEO checklist and I’ve been meaning to expand on some of those points.
One issue in particular I’ve had to deal with recently is duplicate content.
I’ve been using the SEO in Firefox extension to check various stats about my site from time to time. It gives you a quick snapshot of your site including, but not limited to, PageRank, Alexa rank, the number of cached pages, and what caught my eye yesterday, how many of those pages are in the supplemental index. It was showing that I had 680 pages in the index, and all of them were supplemental.
As it turns out, when I was redesigning this site I forgot that each tag you create with Ultimate Tag Warrior gets its own page with the full content of every post carrying that tag. All of these tag pages had been indexed by Google and must have been triggering a duplicate content flag.
So the steps I needed to take were:
- Prevent Google from indexing these pages in the future.
- Remove the offending pages from the Google index.
To prevent Google from indexing these pages I went to my robots.txt file.
Here is what mine looks like:
User-agent: *
Disallow: /*/feed/
Disallow: /*/feed/rss/
Disallow: /*/trackback/
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /tag/
Sitemap: http://thewrongadvices.com/sitemap.xml
Here is a quick rundown of what each of those lines does:
- The first line (User-agent: *) specifies which web crawlers the following directives should apply to. In this case the * means it should apply to all web crawlers.
- Each of the “Disallow:” lines tells web crawlers not to fetch the directories specified, as well as any subdirectories, which keeps their content out of the index. I’ve added /tag/ to prevent those pages from being indexed.
- The last option, “Sitemap:”, is a new one. All major search engines now support autodiscovery of sitemaps. You can automatically regenerate your sitemap after each post you make by using the Google Sitemap Generator for WordPress.
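To make the matching a bit more concrete, here is a rough Python sketch of how the Disallow patterns above could be tested against a URL path. It assumes Google-style wildcard handling (a “*” matches any run of characters, a trailing “$” anchors the end of the path, and everything else is a plain prefix match). The helper names rule_to_regex and is_blocked and the sample paths are made up for illustration; this is not how any particular crawler is actually implemented.

```python
import re

# The Disallow values from the robots.txt shown above.
RULES = [
    "/*/feed/",
    "/*/feed/rss/",
    "/*/trackback/",
    "/wp-",
    "/feed/",
    "/trackback/",
    "/tag/",
]


def rule_to_regex(rule):
    """Turn one Disallow value into a compiled regex (hypothetical helper)."""
    # Escape everything, then turn the escaped "*" back into "match anything".
    pattern = re.escape(rule).replace(r"\*", ".*")
    # A trailing "$" in the rule anchors the match to the end of the path.
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)


def is_blocked(path, rules=RULES):
    """Return True if any rule matches the start of the given URL path."""
    # re.match anchors at the start, which gives the prefix behaviour.
    return any(rule_to_regex(rule).match(path) for rule in rules)


if __name__ == "__main__":
    # Sample paths are assumptions, not taken from the real site.
    for path in ("/tag/seo/", "/2007/04/my-post/feed/", "/2007/04/my-post/"):
        print(path, "blocked" if is_blocked(path) else "allowed")
```

Note that “/*/feed/” on its own would not cover the top-level “/feed/” path, which is presumably why both lines appear in the file.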
Now that my robots.txt file has been updated to disallow the /tag/ directory, those pages should drop out of the index the next time Google crawls my site, and hopefully that will address the supplemental problem.
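If you would rather not wait for the next crawl to find out whether the new rule works, a quick local check is possible with Python’s standard urllib.robotparser module. This is only a sketch: example.com and the sample paths are placeholders, and the wildcard lines are left out because, as far as I know, the standard-library parser treats each Disallow value as a plain path prefix and does not understand “*”.

```python
from urllib.robotparser import RobotFileParser

# Only the plain-prefix rules from the file above; the wildcard lines are
# omitted because the standard-library parser does no "*" matching.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /tag/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    # example.com and these paths are placeholders for your own URLs.
    for url in (
        "http://example.com/tag/seo/",
        "http://example.com/2007/04/a-post/",
    ):
        verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
        print(url, verdict)
```

To check a live file instead, the same class can load it with set_url() followed by read() in place of parse().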
Google also provides a facility in its Webmaster Tools to request expedited removal of pages from the index, but you should be careful when using this because you can potentially remove your whole site from the index.
17 Comments
April 30th, 2007 at 7:22 am
This is really excellent advice. I’ve just updated my robots files on both of my domains. Thanks
April 30th, 2007 at 11:52 am
This is exactly what has happened on my site... but can you explain how to use the index removal tool in Webmaster Tools to remove all the tag pages?
April 30th, 2007 at 12:05 pm
In the removal tool you need to click on the second radio button, labeled “A directory and all subdirectories on your site”. The directory you want to put in is http://yoursite.com/tag
This will remove all the indexed pages in the tag subdirectories.
It’s not instantaneous though. I submitted a removal request yesterday and it is still pending.
April 30th, 2007 at 3:32 pm
Amazing! Thanks Dan!
April 30th, 2007 at 4:49 pm
I’ve been meaning to read up on this, but you’ve just explained it beautifully. Nice and concise - thanks!
May 1st, 2007 at 2:33 am
Great!
Thanks for the tips
May 1st, 2007 at 6:46 am
Thanks for the tip Dan.
Also, I guess users should validate the robots.txt once from within the Google webmaster console.
By the way, if I make the /tag pages show a partial listing of the posts, is that fine, or will it also trigger duplicate content?
May 1st, 2007 at 8:02 am
Venu - Only showing partial posts should help with duplicate content issues. I’ve removed the tag directories completely because really there is no need for them to be indexed. I’ll check my search engine ranking in a few days and see what the results are.
May 22nd, 2007 at 12:24 pm
Just out of curiosity: do you really recommend disallowing tag pages for robots? When I checked your robots.txt at http://thewrongadvices.com/robots.txt, I couldn’t find a disallow statement for tag.
May 22nd, 2007 at 12:32 pm
I removed it again to do some testing. I’m tracking my rank and seeing if indexed tag pages that contain excerpts hurt or help ranking. I’m not sure how much of an effect it will have either way but I’m curious.
May 22nd, 2007 at 12:51 pm
Keep us posted on your findings... It would help a lot of others like me.
June 3rd, 2007 at 6:15 am
How about using the same name for a category and a tag, like having a category named “seo” and tagging its topics with “seo” as well? This generates 2 similar pages with the same titles and content. Is that considered content duplication?
June 3rd, 2007 at 11:10 pm
Andy, as long as you aren’t showing full content on category and tag pages you should be fine.
September 17th, 2007 at 9:57 am
But what happens with /content, /author, and /year (e.g. /2007)? Why don’t you add those folders to your robots.txt?
October 20th, 2007 at 11:15 am
[…] Eliminate Duplicate Content Using robots.txt from The Wrong Advices […]
November 12th, 2007 at 5:26 pm
An example is to block all those sections of the blog where the posts are archived. Do it with robots.txt.
Like the folder: archives..
November 13th, 2007 at 4:23 am
That’s not a solution, because those are the only pages that link to your content. The best you can do is to create a meta noindex,follow for those categories.
XeD