Video Details

Ways in which we can control robots and search engines

  • robots.txt (e.g. http://www.google.com/robots.txt)
    • lives at yoursite.com/robots.txt
    • tells crawlers what they should and shouldn’t access
    • isn’t always respected by search engines
    • can save crawl bandwidth
    • doesn’t work well when trying to stop a page from being indexed
    • don't use this method to canonicalize non-www vs www
  • meta robots (e.g. <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">)
    • lives in the <head> of an individual page, meaning you can only control one page at a time
    • tells search engines whether the page should be indexed and whether its links should be followed
    • requires crawl budget (the page must be crawled for the tag to be seen)
  • nofollow tag (e.g. <a href="signin.php" rel="nofollow">)
    • applied to specific links
    • signals whether the linked page is editorially vouched for and whether you want to pass PageRank and link equity metrics to it
  • Google Search Console (previously known as Webmaster Tools)
    • can be used to restrict access to pages/files, but the settings apply only within that search engine; each engine's own webmaster console must be configured separately
  • URL status codes
    • 410 (Gone) permanently removes a page from the index (if you later want that URL indexed again, it may take a long time to return)
    • 301 permanently redirects
    • 302 temporarily redirects
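
As an illustration of these status codes, here is a minimal sketch of the raw HTTP responses a server would return (the URLs are hypothetical placeholders):

HTTP/1.1 301 Moved Permanently
Location: http://www.yoursite.com/new-page.html

HTTP/1.1 302 Found
Location: http://www.yoursite.com/temporary-page.html

HTTP/1.1 410 Gone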

Scenarios & Specific Use Cases

1) Robots.txt and meta robots tags working together

Question: What if we take a page like blogtest.html and use robots.txt to tell all user agents that they can't crawl /blogtest.html

i.e.

User-agent: *
Disallow: /blogtest.html

AND

set a meta robots tag to <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

and the page still appears in search results?

Answer:

  • The search engine can't see the noindex because the robots.txt disallow stops it from crawling the page, so it never reads the meta tag.
  • To truly remove a page, you have to add the meta noindex and let the engines crawl the page (i.e. remove the disallow from robots.txt).
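
For example, a working removal setup (sticking with the hypothetical /blogtest.html) would be a robots.txt that does not block the page, plus the noindex tag in the page itself.

robots.txt (no disallow rule for the page, so crawlers can reach it):

User-agent: *
Disallow:

In the <head> of blogtest.html:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">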

2) Still have low-quality content that you are improving?

  • Large quantity of pages => disallow them in robots.txt
  • Small quantity of pages
    • use meta robots noindex
    • remove the noindex from each page once it has been improved
    • submit an XML sitemap in Google Search Console to let Google know the pages are ready for re-crawling
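
A minimal sketch of the XML sitemap you would submit (the URL and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- an improved page whose noindex tag has been removed -->
    <loc>http://www.yoursite.com/improved-page.html</loc>
    <lastmod>2015-06-01</lastmod>
  </url>
</urlset>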

3) Large amount of duplicate URLs or thin content?

  • Use rel=canonical
  • Allow indexing
  • Allow it to be crawled
  • Could also put meta noindex, follow on those pages, but it's not strictly necessary and may interfere with rel=canonical
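
As a sketch, each duplicate or thin variant would carry a canonical tag in its <head> pointing at the version you want indexed (the URLs here are hypothetical):

<!-- on http://www.yoursite.com/product?color=red and other duplicate variants -->
<link rel="canonical" href="http://www.yoursite.com/product">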

4) To pass link equity and allow crawling without pages appearing in search results:

  • On individual pages, use meta robots noindex, follow
  • Don’t disallow those pages in robots.txt

5) What should I do with search results type pages that are indexed?

  • Turn the most common and popular sets of search results into category-style landing pages, and add content to those pages so they offer unique value
  • If the pages are not useful for visitors, disallow them in robots.txt, but be careful: check your traffic to those pages first (see the example below).
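
A sketch of such a robots.txt rule, assuming the internal search results live under a hypothetical /search path:

User-agent: *
Disallow: /search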