Seeing 12,000 crawl errors staring back at you in Webmaster Tools can make eradicating them seem like an insurmountable task. The key is to know which errors are the most crippling to your site, and which ones are simply informational and can be brushed aside so you can deal with the real meaty problems. It’s important to keep a religious eye on your errors because of the impact they have on your users and on Google’s crawler.
Having thousands of 404 errors, especially for URLs that are indexed or linked to by other pages, makes for a poor user experience. If users land on multiple 404 pages in one session, their trust in your site decreases, which of course leads to frustration and bounces.
You also don’t want to miss out on the link juice from other sites that point to a dead URL on your site. If you fix that crawl error and redirect the dead URL to a good one, you can capture that link to help your rankings.
Additionally, Google does have a set crawl budget allotted to your site, and if a lot of the robot’s time is spent crawling your error pages, it doesn’t have the time to get to your deeper, more valuable pages that are actually working.
Without further ado, here are the main categories that show up in the crawl errors report of Google Webmaster Tools:
The HTTP section usually lists pages that returned errors such as 403s, which are not the biggest problems in Webmaster Tools. For a list of all the HTTP status codes, check out Google’s own help pages. Also check out SEO Gadget’s amazing Server Headers 101 infographic on SixRevisions.
Errors in sitemaps are often caused by old sitemaps that have since 404’d, or pages listed in the current sitemap that return a 404 error. Make sure that all the links in your sitemap are quality working links that you want Google to crawl.
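One way to check that every link in a sitemap works is to pull the URLs out of the file and test each one. Here is a minimal Python sketch of that idea. The function names are made up for illustration, and `get_status` is an assumed helper that you would supply yourself (for example, a thin wrapper around your HTTP client of choice that returns the status code for a URL).

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def broken_sitemap_urls(xml_text, get_status):
    """Return (url, status) pairs for sitemap URLs that do not answer 200.

    `get_status(url)` is a hypothetical callable returning an HTTP status code.
    """
    broken = []
    for url in sitemap_urls(xml_text):
        status = get_status(url)
        if status != 200:
            broken.append((url, status))
    return broken
```

Anything this reports is a candidate for removal from the sitemap, or for a fix on the page itself.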
One frustrating thing Google does is continually crawl old sitemaps that you have since deleted, to check that the sitemap and its URLs are in fact dead. If you have an old sitemap that you have removed from Webmaster Tools and you don’t want it crawled, make sure you let that sitemap 404 and that you are not redirecting it to your current sitemap.
From Google employee Susan Moskwa:
“The best way to stop Googlebot from crawling URLs that it has discovered in the past is to make those URLs (such as your old Sitemaps) 404. After seeing that a URL repeatedly 404s, we stop crawling it. And after we stop crawling a Sitemap, it should drop out of your "All Sitemaps" tab.”
Most of these errors are caused by redirect problems. Make sure you minimize redirect chains, set any redirect timer to a short period, and don’t use meta refreshes in the head of your pages.
Matt Cutts has a good YouTube video on redirect chains; start 2:45 in if you want to skip ahead.
Google crawler exhausted after a redirect chain.
What to watch for after implementing redirects:
- When you redirect pages permanently, make sure they return the proper HTTP status code, 301 Moved Permanently.
- Make sure you do not have any redirect loops, where the redirects point back to themselves.
- Make sure the redirects point to valid pages and not 404 pages, or other error pages such as 503 (service unavailable) or 403 (forbidden).
- Make sure your redirects actually point to a page and are not empty.
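The checklist above can be turned into a small script. The following Python sketch is only one way to do it: `audit_redirect` is a made-up name, `fetch` is an assumed helper you would write yourself (it should request a URL without following redirects and return the status code plus the `Location` header), and the `max_hops` default of 5 is an arbitrary choice, not a documented Googlebot limit.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def audit_redirect(start_url, fetch, max_hops=5):
    """Follow a redirect chain and collect problems from the checklist.

    `fetch(url)` is a hypothetical callable returning
    (status_code, location_header_or_None) without following redirects.
    Returns (list_of_urls_visited, list_of_problem_strings).
    """
    problems = []
    seen = [start_url]
    url = start_url
    status, location = fetch(url)
    hops = 0
    while status in REDIRECT_CODES:
        if status != 301:
            problems.append(f"{url} redirects with {status}, not 301")
        if not location:
            problems.append(f"{url} redirects to an empty target")
            break
        target = urljoin(url, location)  # resolve relative Location values
        if target in seen:
            problems.append(f"redirect loop back to {target}")
            break
        seen.append(target)
        hops += 1
        if hops > max_hops:
            problems.append(f"chain longer than {max_hops} hops")
            break
        url = target
        status, location = fetch(url)
    else:
        # The chain ended normally; the final page should be a working one.
        if len(seen) > 1 and status != 200:
            problems.append(f"chain ends at {url} with status {status}")
    return seen, problems
```

An empty problem list means the redirect is a 301 that lands on a working page within the hop limit.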
Tools to use:
- Check your redirects with a response header checker tool like URI Valet or the Check Server Headers Tool.
- Screaming Frog is an excellent tool to check which pages on your site are showing a 301 redirect, and which ones are showing 404 errors or 500 errors. The free version caps out at 500 pages on the site; beyond this you would need to buy the full version.
- The SiteOpSys Search Engine Indexing Checker is an excellent tool where you can put in a list of the URLs that you submitted as redirects. It lets you check your URLs in bulk to see which ones are indexed and which ones are not. If the original URLs that you redirected are no longer indexed, that means Google removed the old URLs from its index after it saw the 301 redirect, and you can now remove those redirect lines from your .htaccess file.
Always use absolute and not relative links. If content scrapers scrape your images or links, they can reference your relative links on their site, and if those links are improperly parsed you may see not followed errors show up in your Webmaster Tools. This has happened with one of our sites before, and it’s almost impossible to find out where the source link that caused the error is coming from.
Not found errors are by and large 404 errors on your site. 404 errors can occur a few ways:
- You delete a page on your site and do not 301 redirect it
- You change the name of a page on your site and don’t 301 redirect it
- You have a typo in an internal link on your site, which links to a page that doesn’t exist
- Someone else from another site links to you but has a typo in their link
- You migrate a site to a new domain and the subfolders do not match up exactly
Best practice: if you are getting good links to a 404’d page, you should 301 redirect it to the page the link was supposed to go to, or if that page has been removed then to a similar or parent page. You do not have to 301 redirect all 404 pages. This can in fact slow down your site if you have way too many redirects. If you have an old page or a large set of pages that you want completely erased, it is ok to let these 404. It is actually the Google recommended way to let the Googlebot know which pages you do not want anymore.
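To make the "redirect only what is worth redirecting" idea concrete, here is a rough Python sketch that writes Apache `Redirect 301` lines only for 404'd paths that have inbound links and a chosen destination. The function name and the two input dictionaries are assumptions for illustration; you would build them from your own link data. It also assumes the paths contain no spaces.

```python
def redirect_rules(not_found, targets):
    """Build Apache `Redirect 301` lines for 404'd paths worth redirecting.

    not_found: dict mapping a 404'd path to its number of inbound links.
    targets:   dict mapping a 404'd path to the live URL that should replace it.
    Paths with no inbound links, or with no chosen target, are left to 404.
    """
    rules = []
    for path, link_count in sorted(not_found.items()):
        target = targets.get(path)
        if link_count > 0 and target:
            rules.append(f"Redirect 301 {path} {target}")
    return rules
```

For example, `redirect_rules({"/old-page": 12, "/junk": 0}, {"/old-page": "http://example.com/new-page"})` produces a single rule for `/old-page` and leaves `/junk` to 404.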
There is an excellent Webmaster Central Blog post on how Google views 404 pages and handles them in webmaster tools. Everyone should read it as it dispels the common “all 404s are bad and should be redirected” myth.
Rand also has a great post on whether 404s are always bad for SEO.
Restricted by robots.txt
These errors are more informational, since they show that some of your URLs are being blocked by your robots.txt file. The first step is to check your robots.txt file and ensure that you really do want to block the URLs listed.
Sometimes there will be URLs listed in here that are not explicitly blocked by the robots.txt file. These should be looked at on an individual basis, as some of them may have strange reasons for being in there. A good way to investigate is to run the questionable URLs through URI Valet and see the response code each one returns. Also check your .htaccess file to see if there is a rule that is redirecting the URL.
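If you want to confirm which of the listed URLs your robots.txt really blocks, Python's standard library includes a robots.txt parser. The sketch below is a minimal example; `blocked_urls` is a made-up helper name. Keep in mind that this parser does simple prefix matching and does not understand the wildcard patterns Google supports, so treat its answer as a first pass rather than the final word.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]
```

Any URL that shows up in the report but not in this function's output is one of the strange cases worth checking by hand.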
If you have pages that have very thin content, or that look like a landing page, they may be categorized as soft 404s. This classification is not ideal. If you want a page to 404, make sure it returns a hard 404; if a page listed as a soft 404 is one of your main content pages, you need to fix that page so it doesn’t get this error.
If you are returning a 404 page and it is listed as a Soft 404, it means that the HTTP response header does not return the 404 Not Found status code. Google recommends “that you always return a 404 (Not found) or a 410 (Gone) response code in response to a request for a non-existing page.”
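A quick way to test this on your own site is to request a URL that cannot possibly exist and look at the status code that comes back. The Python sketch below shows the idea; `returns_hard_404` is a made-up name and `get_status` is an assumed helper that follows redirects and returns the final HTTP status code.

```python
import uuid
from urllib.parse import urljoin


def returns_hard_404(base_url, get_status):
    """Request a path that cannot exist and check for a 404 or 410 status.

    `get_status(url)` is a hypothetical callable that follows redirects
    and returns the final HTTP status code.
    """
    # A random hex name makes it very unlikely that the page really exists.
    probe_url = urljoin(base_url, "/" + uuid.uuid4().hex + ".html")
    return get_status(probe_url) in (404, 410)
```

If this returns `False`, for example because missing pages redirect to a page that answers 200, the site is serving soft 404s.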
We saw a bunch of these errors with one of our clients when we redirected a ton of broken URLs to a temporary landing page which only had an image and a few lines of text. Google saw this as a custom 404 page, even though it was just a landing page, and categorized all the redirecting URLs as Soft 404s.
If a page takes too long to load, the Googlebot will stop trying to fetch it after a while. Check your server logs for any issues and check the load speed of the pages that are timing out.
Types of timed out errors:
- DNS lookup timeout – the Googlebot request could not get to your domain’s server, check DNS settings. Sometimes this is on Google’s end if everything looks correct on your side. Pingdom has an EXCELLENT tool to check out the DNS health of your domain and it will show you any issues that pop up.
- URL timeout – an error from one of your specific pages, not the whole domain.
- Robots.txt timeout – If your robots.txt file exists but the server timed out when Google tried to crawl it, Google will postpone the crawl of your site until it can reach the robots.txt file to make sure it doesn’t crawl any URLs that were blocked by the robots.txt file. Note that if you do not have a robots.txt and Google gets a 404 from trying to access your robots.txt, it will continue on to crawl the site as it assumes that the file doesn’t exist.
Unreachable errors can occur from internal server errors or DNS issues. A page can also be labeled as Unreachable if the robots.txt file is blocking the crawler from visiting a page. Possible errors that fall under the unreachable heading are “No response”, “500 error”, and “DNS issue” errors.
There is a long list of possible reasons for unreachable errors, so rather than list it here, I’ll point you to Google’s own reference guide here. Rand also touched on the impact of server issues back in 2008.
Google Webmaster Tools is far from perfect. While we all appreciate Google’s transparency with showing us what they are seeing, there are still some things that need to be fixed. To start with, Google is the best search engine in the universe, yet you cannot search through your error reports to find that one URL from a month ago that was keeping you up at night. At least they could have supplemented this with good pagination, but nope, you have to click through 20 pages of data to get to page 21. One workaround is to change the page number at the end of the URL string, which shows what part of the errors list you are looking at. You can download all of the data into an Excel document, which is the best solution, but Google should still upgrade Webmaster Tools to allow searching from within the application.
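Until searching exists inside the tool, the downloaded data can be searched with a few lines of code. This Python sketch assumes the export is saved as a CSV file with a header row; the function name is made up, and no particular column names are assumed because every field is searched.

```python
import csv


def find_errors(csv_file, needle):
    """Return (header, rows) for rows in an exported crawl-error file
    where any field contains `needle`, ignoring case.

    `csv_file` is an open file or any iterable of CSV lines.
    """
    needle = needle.lower()
    reader = csv.reader(csv_file)
    header = next(reader, None)  # first row is assumed to be the header
    matches = [row for row in reader
               if any(needle in field.lower() for field in row)]
    return header, matches
```

With a real export you would open the file first, for example `with open("crawl_errors.csv", newline="") as f:` (the file name here is hypothetical), and pass `f` together with the URL fragment you are hunting for.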
Also, the owner of the site should have the ability to delete ALL sitemaps on the domain they own, even ones that someone else uploaded a year ago. Currently you can only delete the sitemaps that you yourself uploaded through your Webmaster Tools account. If Jimmy from Agency X uploaded an image sitemap a year ago before you let them go, it will still show up in the All Sitemaps tab. The solution is to let the sitemap 404 and it will drop off eventually, but it can be a thorn in your side to have to see it every day until it leaves.
Perhaps, as Bing starts to upgrade its own Webmaster Tools, we will begin to see some more competition between the two search engines in their product offerings. Then one day, just maybe, we will get complete transparency and complete control of our sites in the search engines.