So I have been designing websites for quite a while now. I realize the importance of a robots.txt file and the implications it holds for search engine optimization (SEO) and preventing duplicate content. However, for some reason, I had never bothered looking into building a better robots.txt file for my personal blog/WordPress-powered portfolio. Weird, right? Honestly, what made me realize I needed to do something better was Google’s Webmaster Tools, which I have been using since re-launching my website this month. After looking through a few weeks of data in my account, I quickly realized that I would be running into duplicate content issues at the very least. So, I started researching best practices for writing robots.txt files for WordPress installations.
My first stop, of course, was the WordPress Codex. Suffice it to say, I wasn’t disappointed. They have a VERY good example of a robots.txt file written specifically for a base WordPress install.
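The Codex example has evolved over time, so rather than quote it verbatim, here is a sketch of a base-install robots.txt in the same spirit (treat the exact paths as illustrative):

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: /category/*/*
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /*?

    # Google's image crawler may see everything
    # (an empty Disallow means nothing is off-limits for this bot)
    User-agent: Googlebot-Image
    Disallow:

    # The AdSense crawler (Mediapartners-Google) may see everything
    User-agent: Mediapartners-Google
    Disallow: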
Most of this is quite useful just the way it is. It solves almost all duplicate content problems (comments, categories, query-string URLs), and it keeps the robots out of unnecessary or private directories such as the ‘wp-’ group of folders (wp-content, wp-admin, etc.). It also allows a few important bots into all directories, such as the Google Image bot and the Google AdSense bot (if you are displaying AdSense ads).
An interesting and, I feel, important item is that they block ‘duggmirror.’ DuggMirror is a site that mirrors content from other sites that have been dugg (see Digg). DuggMirror can be a duplicate content nightmare; it can cause Google to index THEIR site instead of YOUR site, or to value their copy of the content more highly. So, what does that mean for you? It means saying “bye-bye” to the traffic that you have earned and deserve. Blocking DuggMirror solves that problem.
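In robots.txt terms, that protection is nothing more than a two-line group that shuts the door completely:

    # DuggMirror gets nothing
    User-agent: duggmirror
    Disallow: /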
So, for the most part, I have decided to keep this robots.txt file intact. However, I did make a few changes, and my robots.txt file is below, along with an explanation of the changes I made.
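Here is what the result looks like, reconstructed from the changes described below (a sketch, so treat the specific paths as illustrative rather than a copy of my live file):

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-*
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: /category/*/*
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /*?

    # Google's image crawler can still see everything
    User-agent: Googlebot-Image
    Disallow:

    # DuggMirror is still shut out completely
    User-agent: duggmirror
    Disallow: /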
For starters, I got rid of most of the ‘wp-’ directives and shortened them to the single Disallow: /wp-* directive using the robots.txt wildcard character. The wildcard character allows you to match any sequence of characters in a URL. Historically there was a bit of controversy over using wildcards in robots.txt files because early on most crawlers lacked support for them. However, most engines (and all of the engines that I really care about) now support wildcards. You do need to be careful when using wildcards, though, because a pattern that is broader than you intended can block pages you actually want indexed and cause some unwanted headaches.
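To make the consolidation concrete, this is the kind of change I am talking about (directory names are just examples):

    # Before: one rule per directory
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins

    # After: a single wildcard rule covers them all
    Disallow: /wp-*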
I also got rid of the directives that reference the Google AdSense bot. I don’t use Google AdSense, and I doubt I ever will, so keeping them in there serves no purpose.
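For reference, the dropped section would have looked something like this (the AdSense crawler announces itself as Mediapartners-Google):

    # Removed: no AdSense here, so no need for a special allowance
    User-agent: Mediapartners-Google
    Disallow: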
So there you have it. That is my strategy so far with my robots.txt file when it comes to my WordPress-powered website. It’s definitely a work in progress that I will have to continually monitor to make sure I am getting the results that I want. Thankfully, Google Webmaster Tools makes that pretty simple.