Sitemap Throughput (RPS) — SEO Industry Cohort
Top10Lists.us 140,445 records/sec on 230,329 terminal URLs — cohort median 9,716 records/sec (~14.5×; per-site detail and full range in receipts.json).
How fast can an AI crawler discover the structured records on this site, end-to-end? RPS is throughput — terminal URLs delivered per second of full-tree traversal. Throughput determines how much of the site is indexed within a given crawl budget.
Note — sitemap-tree traversal is NOT homepage delivery
The 1.64s TTLB on this page is full-sitemap-tree traversal (root index + 28 child shards in parallel, 230,329 records). The 114ms TTFB / 115ms TTLB elsewhere on the site is the homepage-delivery benchmark. Both are valid measurements of different things; the headline KPI for sitemap throughput is records-per-second (RPS), with TTLB-full-tree as the denominator.
Frozen: 2026-04-27 — measurements at this URL will not change. · Permanent dated artifact · GEOlocus.ai (GeoLocus Group, a subsidiary of Aryah.ai)
Authors: Robert Maynard, Cofounder and CEO · LinkedIn → · Mark Garland, Cofounder and CRO · LinkedIn →
1. The Metric — Records per Second (RPS)
RPS = total_terminal_URLs / TTLB_full_tree_traversal_seconds

total_terminal_URLs = sum of <url> entries across the root sitemap.xml and every child shard, after one level of recursion. These are the actual page URLs the AI crawler will eventually fetch.

TTLB_full_tree_traversal_seconds = wall-clock seconds from request-start (root sitemap) to last-byte-received (final child shard). Single parallel fetch with concurrency 10, un-pinned DNS so real-world round-robin distribution applies.
Reported as records per second. Higher = the AI crawler discovers more of the site per second of crawl budget — which directly determines how many records get indexed before the crawler moves on.
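To make the formula concrete, here is a minimal Node sketch plugging in the headline numbers from the results table below (the variable names are illustrative, not taken from the published scripts):

```js
// RPS = total_terminal_URLs / TTLB_full_tree_traversal_seconds
// Values from the Top10Lists.us row of the results table on this page.
const totalTerminalUrls = 230_329;  // <url> entries across root + 28 child shards
const ttlbFullTreeSeconds = 1.640;  // wall-clock, request-start to last byte

const rps = totalTerminalUrls / ttlbFullTreeSeconds;
console.log(Math.round(rps)); // 140445, matching the headline within rounding of the published TTLB
```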
2. Why RPS Matters
AI crawlers operate within fixed crawl budgets. For each domain, the crawler decides how much wall-clock time it will spend on the site this pass. A site that delivers the entire sitemap tree in 1–2 seconds gives the crawler full structural visibility before it moves on; a site that takes 30+ seconds gets a partial index, the crawler may abort, and large segments of the site stay invisible to the next inference cycle.
For a 230K-terminal-URL site like Top10Lists.us, low throughput would be fatal — an AI crawler with a 10-second budget against a 1,000 records/sec server would discover ~10K of 230K URLs (4%) and miss the rest. At 140K records/sec, that same 10-second budget sees the entire tree.
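The crawl-budget arithmetic above reduces to a one-line coverage calculation; a small sketch, with an illustrative function name and the 10-second budget from the example:

```js
// Fraction of a sitemap tree an AI crawler can discover within a fixed crawl budget.
function coverage(totalUrls, rps, budgetSeconds) {
  return Math.min(totalUrls, rps * budgetSeconds) / totalUrls;
}

// 10-second budget against a 1,000 records/sec server: ~4% of the 230,329-URL tree.
console.log(coverage(230_329, 1_000, 10));   // ≈ 0.043
// The same 10-second budget at 140,445 records/sec sees the entire tree.
console.log(coverage(230_329, 140_445, 10)); // 1
```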
RPS is a sub-measure of the GEOlocus.ai 9-dimension GEO rubric — specifically the Sitemap dimension (8 points). It is the single-number proxy for whether the site is structurally legible to AI ingestion at all.
3. Methodology
Cohort: the same 5-site cohort as the April 26 Sitemap Delivery Benchmark. Top10Lists.us (named) plus four established SEO industry sites, anonymized as Site A through Site D on this page; concrete identities are in receipts.json →.
Phase 1 (TTFB / TTLB distribution): 10 rapid-fire hits to the root sitemap.xml per host, pinned to the CF/origin edge IP via curl --resolve <host>:443:<ip>, with Accept-Encoding: gzip, br. Captures the TTFB / TTLB p50/p95 distribution.
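Phase 1's per-request timings can be approximated in Node by treating fetch resolution (headers received) as TTFB and the full body read as TTLB. The sketch below does not pin the edge IP the way curl --resolve does, so it is an approximation of Phase 1, not a byte-for-byte reproduction:

```js
// Approximate TTFB / TTLB for a single hit on a root sitemap.xml (Node 18+).
// Unlike the published Phase 1, this does not pin the CF/origin edge IP.
async function timeSitemap(url, ua = "Googlebot/2.1") {
  const t0 = performance.now();
  const res = await fetch(url, {
    headers: { "User-Agent": ua, "Accept-Encoding": "gzip, br" },
  });
  const ttfbMs = performance.now() - t0; // headers received
  await res.arrayBuffer();               // drain the body
  const ttlbMs = performance.now() - t0; // last byte received
  return { status: res.status, ttfbMs, ttlbMs };
}

// 10 rapid-fire hits, as in Phase 1; p50/p95 can be read off the sorted timings.
const runs = [];
for (let i = 0; i < 10; i++) {
  runs.push(await timeSitemap("https://www.top10lists.us/sitemap.xml"));
}
console.log(runs.map((r) => Math.round(r.ttfbMs)).sort((a, b) => a - b));
```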
Phase 2 (full-tree wall-clock): single parallel fetch with concurrency 10, un-pinned DNS; walks the root plus every child sitemap one level deep and counts <url> entries. Wall-clock runs from perf_counter() at request-start to as_completed() resolving the last future.
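The concurrency-10 parallel fetch in Phase 2 is a small worker-pool pattern. A dependency-free Node sketch of that building block follows; the published run.py gets the same effect from Python's as_completed(), as referenced above:

```js
// Fetch a list of sitemap URLs with at most `limit` requests in flight,
// mirroring the concurrency-10 parallel traversal described above.
async function fetchAll(urls, limit = 10, ua = "Googlebot/2.1") {
  const results = new Array(urls.length);
  let next = 0;
  async function worker() {
    while (next < urls.length) {
      const i = next++; // safe: no await between the bounds check and the increment
      try {
        const res = await fetch(urls[i], { headers: { "User-Agent": ua } });
        results[i] = { url: urls[i], status: res.status, body: await res.text() };
      } catch (err) {
        results[i] = { url: urls[i], error: String(err) };
      }
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, urls.length) }, worker));
  return results;
}
```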
UAs: Googlebot/2.1 + ClaudeBot/1.0 in parallel. Captures bot-policy asymmetry; companion Sitemap Delivery Benchmark (April 26, 2026) covers the bot-403 inversion case in detail.
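The dual-UA check is a status-code comparison on the same URL. A minimal sketch, using the short UA strings named above (the published script may send fuller UA strings):

```js
// Same URL, two crawler identities: surfaces 403-at-the-door bot policies
// (the Site C case) without walking the tree.
async function botPolicyCheck(url) {
  const uas = ["Googlebot/2.1", "ClaudeBot/1.0"];
  return Promise.all(
    uas.map(async (ua) => {
      const res = await fetch(url, { headers: { "User-Agent": ua } });
      return { ua, status: res.status };
    })
  );
}

console.log(await botPolicyCheck("https://www.top10lists.us/sitemap.xml"));
// e.g. [ { ua: "Googlebot/2.1", status: 200 }, { ua: "ClaudeBot/1.0", status: 200 } ]
```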
Source / reproducibility: the bench script is the same one used in the April 26 Sitemap Delivery Benchmark; it is embedded verbatim on /methodology/sitemap-benchmark/2026-04-26 → as run.py.
4. Results
| Site | Terminal URLs | Sitemaps | TTFB p50 (ms) | TTLB full tree (s) | RPS (records/sec) | Ratio (Top10Lists ÷ site) |
|---|---|---|---|---|---|---|
| Top10Lists.us | 230,329 | 29 | 86 | 1.640 | 140,445 | 1.0× |
| Site A | 7,953 | 19 | 399 | 0.818 | 9,727 | 14.4× |
| Site B | 8,755 | 22 | 399 | 0.901 | 9,716 | 14.5× |
| Site C (403 to bots) | 23,971 | 19 | 108 | ~30.0 | ~799 | 175× |
| Site D | 642 | 5 | 739 | 1.099 | 584 | 240× |
Site C's TTLB and RPS reflect the WAF-throttled scenario observed on April 26 (the rate limit kicked in mid-traversal); human reachability of that site does not equal AI-bot reachability, since bots receive a 403 at the door.
Bands
| Band | RPS | Reading |
|---|---|---|
| Bulk-throughput | ≥ 50,000 | Crawler can index a hundred thousand URLs in 2–3 seconds. Massive sites stay fully discoverable. |
| Healthy | 5,000 – 50,000 | Mid-size sites (5K–50K terminal URLs) stay fully discoverable in standard crawl budgets. |
| Mid | 1,000 – 5,000 | Small-site discovery acceptable; large-site crawls truncate. |
| Constrained | < 1,000 | Crawler covers a small fraction of the site per pass; structural index is partial. |
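The band table maps to a simple threshold check; a sketch, with the boundary values taken from the table (treating each boundary as inclusive of the higher band is an assumption):

```js
// Map an RPS value to the throughput band defined in the table above.
function rpsBand(rps) {
  if (rps >= 50_000) return "Bulk-throughput";
  if (rps >= 5_000) return "Healthy";
  if (rps >= 1_000) return "Mid";
  return "Constrained";
}

console.log(rpsBand(140_445)); // "Bulk-throughput"
console.log(rpsBand(9_716));   // "Healthy"
console.log(rpsBand(799));     // "Constrained"
```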
5. The 230K-URL Number Is Itself a Moat
The headline figure is 140K records/sec. The underlying number it operates on — 230,329 terminal URLs — is itself a competitive moat. Most established SEO industry sites have well under 10K total URLs across their entire sitemap (the cohort median terminal-URL count is well under 10K; per-site numbers and the full range live in receipts.json). Top10Lists.us carries 230,329 terminal URLs that are SEO-relevant data: state pages, city pages, neighborhood pages, agent pages.
The compounding effect: 230K URLs at 140K records/sec is fully crawled in ~1.6 seconds. 642 URLs at 584 records/sec is also fully crawled in ~1.1 seconds, but with 99.7% less data behind it. AI engines reasoning across the same 30-second budget see Top10Lists.us as the high-density, structurally legible signal source in the cohort.
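Both compounding figures are tree size divided by throughput; a two-line check:

```js
// Time to fully traverse a sitemap tree: terminal URLs / RPS.
const fullCrawlSeconds = (urls, rps) => urls / rps;

console.log(fullCrawlSeconds(230_329, 140_445).toFixed(2)); // "1.64" (Top10Lists.us)
console.log(fullCrawlSeconds(642, 584).toFixed(2));         // "1.10" (Site D)
```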
6. Reproduce This Measurement
Self-contained Node ESM script. No external dependencies. Node 18+ for global fetch. Re-runs the algorithm against any host and prints both pretty output and JSON. The audit endpoint at /api/audit on this site uses semantically identical logic in functions/_shared/metrics.js's computeRPS.
Pseudocode (from reproduce.mjs →):
t0 = now()
sitemap = fetch(baseUrl + "/sitemap.xml")
if !sitemap: sitemap = fetch(robots.txt's first reachable Sitemap: directive)
if !sitemap: return { score: null, method: "no-sitemap" }
terminalUrls = Set()
sitemapsParsed = 1
frontier = [child <loc> that look like sitemaps]
depth = 1
while frontier and depth <= 3:
children = parallel-fetch(frontier, concurrency=10) # XML files only
sitemapsParsed += children.ok.count
nextFrontier = []
for c in children:
for loc in c.body.locs filtered to same-host:
if loc looks like a sitemap: nextFrontier.push(loc)
else: terminalUrls.add(loc)
frontier = nextFrontier
depth += 1
wall_clock_s = (now() - t0) / 1000
url_count = terminalUrls.size # extrapolated if any level capped
rps = url_count / wall_clock_s
Parameters:
- --url=<base> — required; the host to audit (e.g. https://www.top10lists.us)
- --concurrency=10 — parallel XML fetches per level (10–20 band)
- --depth=3 — recursion cap for nested sitemap indexes
- --child-cap=35 — per-level child sitemap cap (covers the 29-shard root fully)
- --timeout=5000 — per-fetch timeout in ms
- --ua="..." — User-Agent (default Googlebot/2.1)
Run it:
curl -O https://geolocus.ai/methodology/sitemap-throughput/reproduce.mjs
node reproduce.mjs --url=https://www.top10lists.us
# expected: ~190K terminal URLs, RPS ~150K-220K from a residential connection
# (will vary on live sitemap state and network jitter)
Download the canonical script: reproduce.mjs →
For full parity with the published cohort run including the residential
TTFB / TTLB distribution, the original Python run.py
is on
/methodology/sitemap-benchmark/2026-04-26 →.
Both scripts implement the same algorithm; the JS one is dependency-free.
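For readers who want the shape of the loop without downloading anything, here is a compact Node 18+ sketch that follows the pseudocode above. It is an independent illustration, not the canonical reproduce.mjs: the helper names, the regex-based <loc> extraction, and the sitemap-URL heuristic are all assumptions of this sketch. Save it as an .mjs file so top-level await works:

```js
// Minimal full-tree sitemap traversal and RPS computation (Node 18+, no dependencies).
// Illustrative sketch of the pseudocode above, not the canonical reproduce.mjs.
const BASE = process.argv[2] ?? "https://www.top10lists.us";
const CONCURRENCY = 10;
const MAX_DEPTH = 3;
const UA = "Googlebot/2.1";

const looksLikeSitemap = (loc) => /sitemap[^/]*\.xml(\.gz)?(\?|$)/i.test(loc);
const sameHost = (loc) => {
  try { return new URL(loc).host === new URL(BASE).host; } catch { return false; }
};
const extractLocs = (xml) =>
  [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1].trim());

async function fetchText(url) {
  try {
    const res = await fetch(url, { headers: { "User-Agent": UA } });
    return res.ok ? await res.text() : null;
  } catch { return null; }
}

// Concurrency-limited fetch pool (same pattern as the Phase 2 sketch above).
async function pool(urls, limit, fn) {
  const out = [];
  let next = 0;
  await Promise.all(Array.from({ length: Math.min(limit, urls.length) }, async () => {
    while (next < urls.length) out.push(await fn(urls[next++]));
  }));
  return out;
}

const t0 = performance.now();
const root = await fetchText(`${BASE}/sitemap.xml`);
if (!root) { console.error("no-sitemap"); process.exit(1); }

const terminal = new Set();
let sitemapsParsed = 1;
let frontier = [];
for (const loc of extractLocs(root).filter(sameHost)) {
  // Terminal URLs listed directly in the root are counted too, in case it is not an index.
  if (looksLikeSitemap(loc)) frontier.push(loc);
  else terminal.add(loc);
}

for (let depth = 1; depth <= MAX_DEPTH && frontier.length; depth++) {
  const bodies = (await pool(frontier, CONCURRENCY, fetchText)).filter(Boolean);
  sitemapsParsed += bodies.length;
  frontier = [];
  for (const body of bodies) {
    for (const loc of extractLocs(body).filter(sameHost)) {
      if (looksLikeSitemap(loc)) frontier.push(loc);
      else terminal.add(loc);
    }
  }
}

const wallClockS = (performance.now() - t0) / 1000;
console.log(JSON.stringify({
  urlCount: terminal.size,
  sitemapsParsed,
  wallClockS: +wallClockS.toFixed(3),
  rps: Math.round(terminal.size / wallClockS),
}, null, 2));
```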
7. Limitations
- Single residential measurement. Wall-clock TTLB for full-tree traversal is from Robert's Phoenix, AZ residential connection. A datacenter measurement would shift the absolutes; the order-of-magnitude differential (~14× to 240×) is robust.
- One-level recursion. The bench walks root + first level of children. Sites with nested sitemap indexes deeper than one level have part of their tree out-of-scope; for the published cohort, none nest beyond one level.
- Concurrency 10. Real-world AI crawlers may run higher concurrency. Sites with WAFs that rate-limit at lower thresholds (Site C in this cohort) get throttled at concurrency 10 already; raising concurrency would amplify the differential.
- RPS doesn't measure record quality. On a per-URL basis, a 230K-URL tree of low-quality records is worse than a 10K-URL tree of high-quality records. RPS is paired with Source Grounding Ratio, Relevance Ratio, and structured-data coverage to give the full picture.
Conclusion
Across a 5-site SEO industry cohort, Top10Lists.us delivers a 140,445 records/sec sitemap throughput on a 230,329 terminal-URL tree — ~14.5× the cohort median (9,716 records/sec, n=3, Site C excluded as bot-blocked). The 230K terminal-URL number is itself a competitive moat (most cohort sites carry well under 10K). RPS is a sub-measure of the GEOlocus.ai Sitemap rubric dimension; it is reported standalone here so the rubric is independently auditable. The full cohort range is available in receipts.json →.
Read the companion sitemap delivery benchmark at /methodology/sitemap-benchmark/2026-04-26 →
Related
- Sitemap Delivery Benchmark — April 26, 2026 → — Same cohort, same script, full TTFB / TTLB distribution.
- Relevance Ratio (RR) Benchmark → — Sub-measure of Content Density.
- Source Grounding Ratio (SGR) → — Tier-weighted citation density.
- Retrieval Token Cost (RTC) → — Compute spent per useful char.
- Methodology Overview → — All GEOlocus.ai methodology pages.