Alexander Huth

Crawler Friendly Sitemaps for Large Scale Websites

Having millions of landing pages for your keywords makes crawler management one of the most important SEO tasks. While the solution sounds simple (make your website crawler friendly and don't bury pages deeper than four or five levels), the execution can be difficult at times.

But fear not, sitemaps can come in very handy when you're trying to improve discovery and indexation by search engines. And when I say sitemaps, I mean the classic HTML page. You know, the one page linking to a lot of other pages of your website… And not its fancy XML sibling you submit to Google Webmaster Tools.

One sitemap per letter

The concept of having a sitemap is as old as the Web itself. Most sitemaps are used on small websites with fewer than one hundred pages – and most of them try to recreate a file system hierarchy or just try to look like a tree. This sitemap is a little different, though.

It starts by breaking down your landing pages by their first letter. Now find a nice place in your homepage footer where you can link to /sitemap, or put the 26 links to /sitemap-a, /sitemap-b, … and /sitemap-z there directly, saving one level already. As you can see, every letter gets its own sitemap… In case you use special characters and/or numbers in page names, group those pages as well.
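As a rough sketch of that first step, the bucketing could look like this (the `other` bucket name for digits and special characters is my assumption; call it whatever fits your URL scheme):

```python
from collections import defaultdict

def sitemap_key(page_name: str) -> str:
    """Return the sitemap bucket for a landing page name.

    Names starting with an ASCII letter go to /sitemap-<letter>;
    everything else (digits, special characters) shares one bucket.
    """
    first = page_name[:1].lower()
    return first if first.isalpha() and first.isascii() else "other"

def group_pages(page_names):
    """Group landing page names by their sitemap bucket."""
    buckets = defaultdict(list)
    for name in sorted(page_names):
        buckets[sitemap_key(name)].append(name)
    return buckets

pages = ["shoes", "Socks", "7-inch-records", "coats", "_drafts"]
grouped = group_pages(pages)
# grouped["s"] holds "Socks" and "shoes"; "7-inch-records" and
# "_drafts" land in the shared "other" bucket
```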

2,000 links per sitemap page

According to the Internet, the frequencies of initial letters in an English dictionary look something like this [1]: S (12.28 %), C (8.02 %), P (7.09 %), A (6.53 %), D (6.20 %), M (5.45 %), B (5.44 %), R (5.35 %), T (5.09 %), I (4.63 %), E (4.53 %), F (4.22 %).

What's the result? If the website in need of better crawlability has two million pages, the sitemap for the letter S alone would need to hold close to 250,000 links (12.28 % of 2,000,000 is 245,600).

Putting so many links on just one page is obviously not the best idea: Loading such a big page would take ages, and Google and other search engines can only digest a certain number of links per page [2]. But at least that infamous limit of 100 links per page was officially dropped some time ago, so let's just go with 2,000 links per page.
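To get a feeling for the numbers, here is a quick back-of-the-envelope calculation using the frequencies quoted above (only the top four letters shown):

```python
import math

TOTAL_PAGES = 2_000_000   # landing pages on the site
LINKS_PER_PAGE = 2_000    # links per sitemap page

# first-letter frequencies quoted above
frequencies = {"s": 0.1228, "c": 0.0802, "p": 0.0709, "a": 0.0653}

pages_needed = {
    letter: math.ceil(TOTAL_PAGES * freq / LINKS_PER_PAGE)
    for letter, freq in frequencies.items()
}
# the letter S alone needs 123 paginated sitemap pages
```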

Pagination, the efficient way

2,000 links per page means that we have to paginate our per-letter sitemaps. There are different ways to design an efficient pagination (if you're interested in them, I recommend reading the Guide to Pagination on the strucr website), but let's not overthink it: A pagination that simply links to all paginated pages for the specific letter is all we need. If you want to be fancy and find plain numbers too boring, you could name each page's anchor after the first and last landing-page names it contains.

The naming scheme for the pagination page URLs could be something like /sitemap-a/2 or /sitemap-a?page=2. Make sure not to block the pagination pages or their URL parameters, as you can't have it both ways: Better crawlability and robots.txt rules or Google Webmaster Tools settings stopping the bots don't go together well…
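Splitting a letter's pages into chunks of 2,000 is straightforward. A sketch using the /sitemap-a/2 scheme from above (keeping page 1 at the bare /sitemap-a URL is my choice, not a requirement):

```python
def paginate(urls, letter, per_page=2_000, base="/sitemap"):
    """Split one letter's landing-page URLs into sitemap pages.

    Yields (page_url, chunk) pairs; page 1 keeps the bare
    /sitemap-<letter> URL, later pages get /sitemap-<letter>/<n>.
    """
    for i in range(0, len(urls), per_page):
        page_no = i // per_page + 1
        suffix = "" if page_no == 1 else f"/{page_no}"
        yield f"{base}-{letter}{suffix}", urls[i:i + per_page]

landing_pages = [f"/landing/s-{n}" for n in range(4500)]
sitemap_pages = list(paginate(landing_pages, "s"))
# 4,500 links at 2,000 per page -> three sitemap pages:
# /sitemap-s, /sitemap-s/2, /sitemap-s/3 (the last holds 500 links)
```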

Millions of pages, only four levels deep

Our sitemap page now consists of the following elements:

  • The usual chrome (header, footer, sidebar)
  • The list of 26 links to all sitemaps
  • The 2,000 links to our landing pages
  • The pagination with as many elements as we have paginated pages for this letter
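Putting those elements together, a bare-bones renderer for one such sitemap page could look like this (a sketch only; the markup and helper names are mine, and a real page would sit inside your site's chrome):

```python
def sitemap_url(letter, page_no):
    """Page 1 keeps the bare /sitemap-<letter> URL, as suggested above."""
    return f"/sitemap-{letter}" if page_no == 1 else f"/sitemap-{letter}/{page_no}"

def render_sitemap_page(letter, links, total_pages):
    """Assemble one sitemap page: the 26-letter index, the
    landing-page links, and the pagination (chrome omitted)."""
    letter_index = " | ".join(
        f'<a href="/sitemap-{c}">{c.upper()}</a>'
        for c in "abcdefghijklmnopqrstuvwxyz"
    )
    landing_links = "\n".join(
        f'<li><a href="{href}">{anchor}</a></li>' for href, anchor in links
    )
    pagination = " ".join(
        f'<a href="{sitemap_url(letter, p)}">{p}</a>'
        for p in range(1, total_pages + 1)
    )
    return (
        f"<nav>{letter_index}</nav>\n"
        f"<ul>\n{landing_links}\n</ul>\n"
        f"<nav>{pagination}</nav>"
    )

html = render_sitemap_page("s", [("/shoes", "Shoes")], total_pages=123)
```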

The sitemaps themselves can't replace a coherent navigation, an adequate hierarchy and efficient internal linking. But setting up sitemaps this way makes every page reachable in three clicks or fewer. They can act as the backbone or a parallel crawler map of the website, allowing bots to discover and crawl pages that would otherwise have been buried several levels deep.

  1. This is different from ETAOIN SHRDLU because we're talking about the most frequent first letters of words in an English dictionary, not overall letter frequency in running text. The data shown was found in this forum post