Ways in which we can control robots and search engines
- robots.txt (e.g. http://www.google.com/robots.txt)
- lives at yoursite.com/robots.txt
- tells crawlers what they should and shouldn’t access
- isn’t always respected by search engines
- can save crawl bandwidth
- doesn’t work well for keeping a page out of the index; a disallowed page can still show up in results if other sites link to it, because crawlers never see the page (or any noindex tag on it)
- don’t use this method to canonicalize non-www vs. www; use 301 redirects for that instead
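- example: a minimal robots.txt sketch (the disallowed paths here are hypothetical):
    User-agent: *
    Disallow: /tmp/
    Disallow: /scripts/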
- meta robots (e.g. <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">)
- lives in the <head> of a page’s HTML, meaning you can only control a single page at a time
- tells search engines whether the page should be indexed and whether its links should be followed
- requires crawl budget, since the crawler has to fetch the page to see the tag
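- example: where the tag sits (a sketch; the noindex, follow variant keeps the page out of the index while still letting its links be followed):
    <head>
      <meta name="robots" content="noindex, follow">
    </head>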
- nofollow tag (e.g. <a href="signin.php" rel="nofollow">)
- applied to specific links
- signals whether the linked page is editorially vouched for, and whether you want to pass PageRank and link equity metrics through that link
- Google Search Console (previously known as Webmaster Tools)
- can be used to restrict access to pages/files, but the settings apply per search engine, inside each engine’s own webmaster console
- URL status codes
- 410 permanently removes ("Gone"); if you later want that URL back in the index, it may take a long time to return
- 301 permanently redirects (passes most link equity to the new URL)
- 302 temporarily redirects (signals the move isn’t permanent)
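For example, an Apache .htaccess sketch using mod_alias (the paths are hypothetical; adapt to your own server):
    # 301: permanent move, passes link equity to the new URL
    Redirect 301 /old-page.html /new-page.html
    # 302: temporary move
    Redirect 302 /sale.html /holiday-sale.html
    # 410: permanently gone (a 410 takes no destination URL)
    Redirect 410 /retired-page.html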
Scenarios & Specific Use Cases
1) Robots.txt and meta robots tags working together
Question: What if we take a page like blogtest.html, tell all user agents in robots.txt that they can’t crawl /blogtest.html, and set a meta robots tag of <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">, yet the page still appears in search results?
- The search engine can’t see the noindex, because the robots.txt disallow stops it from crawling the page at all.
- To truly remove a page, use the meta noindex and let crawlers reach it (remove the disallow from robots.txt).
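A sketch of the conflict and the fix, using the hypothetical /blogtest.html above:
    robots.txt rule that hides the noindex from crawlers:
        User-agent: *
        Disallow: /blogtest.html
    the fix: delete that Disallow line and keep this in the page’s <head>:
        <meta name="robots" content="noindex, follow">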
2) Still have low quality content that you are improving?
- Large quantity of pages => use robots.txt
- Small quantity of pages
- use meta robots noindex
- remove on each page that has been improved
- submit an XML sitemap in Google Search Console to let Google know the pages are ready for re-crawling (a sketch follows below)
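A minimal XML sitemap sketch for that re-submission (the URL and date are hypothetical):
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/improved-page.html</loc>
        <lastmod>2024-01-15</lastmod>
      </url>
    </urlset>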
3) Large amount of duplicate URLs or thin content?
- Use rel=canonical
- Allow indexing
- Allow it to be crawled
- Could also put meta noindex, follow on those pages, but it’s not strictly necessary and may interfere with rel=canonical
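A sketch of the rel=canonical tag, placed in the <head> of each duplicate or thin variant (the URL is hypothetical):
    <link rel="canonical" href="http://www.example.com/original-page.html">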
4) To pass link equity/crawling without them appearing in search results:
- On individual pages, use a meta robots noindex, follow tag
- Don’t disallow those pages in robots.txt
5) What should I do with search results type pages that are indexed?
- Turn the most common and popular individual sets of search results into category-style landing pages, and add content to each so it offers unique value
- If the pages are not useful for visitors, disallow with robots.txt but be careful – check your traffic to those pages first.
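For the disallow option, a robots.txt sketch (assuming internal search result URLs live under a hypothetical path like /search):
    User-agent: *
    Disallow: /search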