Sitemap Throughput (RPS) — SEO Industry Cohort
Top10Lists.us 140,445 records/sec on 230,329 terminal URLs — cohort median 9,716 records/sec (~14.5×; per-site detail and full range in receipts.json).
How fast can an AI crawler discover the structured records on this site, end-to-end? RPS is throughput — terminal URLs delivered per second of full-tree traversal. Throughput determines how much of the site is indexed within a given crawl budget.
Note — sitemap-tree traversal is NOT homepage delivery
The 1.64s TTLB on this page is full-sitemap-tree traversal (root index + 28 child shards in parallel, 230,329 records). The 114ms TTFB / 115ms TTLB elsewhere on the site is the homepage-delivery benchmark. Both are valid measurements of different things; the headline KPI for sitemap throughput is records-per-second (RPS), with TTLB-full-tree as the denominator.
Frozen: 2026-04-27 — measurements at this URL will not change. · Permanent dated artifact · GEOlocus.ai (GeoLocus Group, a subsidiary of Aryah.ai)
Authors: Robert Maynard, Cofounder and CEO · LinkedIn → · Mark Garland, Cofounder and CRO · LinkedIn →
1. The Metric — Records per Second (RPS)
RPS = total_terminal_URLs / TTLB_full_tree_traversal_seconds

total_terminal_URLs = sum of <url> entries across the root sitemap.xml and every child shard, after one level of recursion. These are the actual page URLs the AI crawler will eventually fetch.

TTLB_full_tree_traversal_seconds = wall-clock seconds from request-start (root sitemap) to last-byte-received (final child shard). Single parallel fetch with concurrency 10, un-pinned DNS so real-world round-robin distribution applies.
Reported as records per second. Higher = the AI crawler discovers more of the site per second of crawl budget — which directly determines how many records get indexed before the crawler moves on.
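To make the formula concrete, here is a minimal Node sketch plugging in the headline numbers from the results table below (the variable names are illustrative, not taken from the published scripts):

```js
// RPS = total_terminal_URLs / TTLB_full_tree_traversal_seconds
// Values from the Top10Lists.us row of the results table on this page.
const totalTerminalUrls = 230_329;  // <url> entries across root + 28 child shards
const ttlbFullTreeSeconds = 1.640;  // wall-clock, request-start to last byte

const rps = totalTerminalUrls / ttlbFullTreeSeconds;
console.log(Math.round(rps)); // 140445, matching the headline within rounding of the published TTLB
```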
2. Why RPS Matters
AI crawlers operate within fixed crawl budgets. For each domain, the crawler decides how much wall-clock time it will spend on the site this pass. A site that delivers the entire sitemap tree in 1–2 seconds gives the crawler full structural visibility before it moves on; a site that takes 30+ seconds gets a partial index, the crawler may abort, and large segments of the site stay invisible to the next inference cycle.
For a 230K-terminal-URL site like Top10Lists.us, low throughput would be fatal — an AI crawler with a 10-second budget against a 1,000 records/sec server would discover ~10K of 230K URLs (4%) and miss the rest. At 140K records/sec, that same 10-second budget sees the entire tree.
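The crawl-budget arithmetic above reduces to a one-line coverage calculation; a small sketch, with an illustrative function name and the 10-second budget from the example:

```js
// Fraction of a sitemap tree an AI crawler can discover within a fixed crawl budget.
function coverage(totalUrls, rps, budgetSeconds) {
  return Math.min(totalUrls, rps * budgetSeconds) / totalUrls;
}

// 10-second budget against a 1,000 records/sec server: ~4% of the 230,329-URL tree.
console.log(coverage(230_329, 1_000, 10));   // ≈ 0.043
// The same 10-second budget at 140,445 records/sec sees the entire tree.
console.log(coverage(230_329, 140_445, 10)); // 1
```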
RPS is a sub-measure of the GEOlocus.ai 9-dimension GEO rubric — specifically the Sitemap dimension (8 points). It is the single-number proxy for whether the site is structurally legible to AI ingestion at all.
3. Methodology
Cohort: the same 5-site cohort as the April 26 Sitemap Delivery Benchmark. Top10Lists.us (named) plus four established SEO industry sites, anonymized as Site A through Site D on this page; concrete identities are in receipts.json →.
Phase 1 (TTFB / TTLB distribution): 10 rapid-fire hits to the root sitemap.xml per host, pinned to the CF/origin edge IP via curl --resolve <host>:443:<ip>, with Accept-Encoding: gzip, br. Captures the TTFB / TTLB p50/p95 distribution.
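Phase 1's per-request timings can be approximated in Node by treating fetch resolution (headers received) as TTFB and the full body read as TTLB. The sketch below does not pin the edge IP the way curl --resolve does, so it is an approximation of Phase 1, not a byte-for-byte reproduction:

```js
// Approximate TTFB / TTLB for a single hit on a root sitemap.xml (Node 18+).
// Unlike the published Phase 1, this does not pin the CF/origin edge IP.
async function timeSitemap(url, ua = "Googlebot/2.1") {
  const t0 = performance.now();
  const res = await fetch(url, {
    headers: { "User-Agent": ua, "Accept-Encoding": "gzip, br" },
  });
  const ttfbMs = performance.now() - t0; // headers received
  await res.arrayBuffer();               // drain the body
  const ttlbMs = performance.now() - t0; // last byte received
  return { status: res.status, ttfbMs, ttlbMs };
}

// 10 rapid-fire hits, as in Phase 1; p50/p95 can be read off the sorted timings.
const runs = [];
for (let i = 0; i < 10; i++) {
  runs.push(await timeSitemap("https://www.top10lists.us/sitemap.xml"));
}
console.log(runs.map((r) => Math.round(r.ttfbMs)).sort((a, b) => a - b));
```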
Phase 2 (full-tree wall-clock): single parallel fetch with concurrency 10, un-pinned DNS; walks the root plus every child sitemap one level deep and counts <url> entries. Wall-clock runs from perf_counter() at request-start to as_completed() resolving the last future.
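The concurrency-10 parallel fetch in Phase 2 is a small worker-pool pattern. A dependency-free Node sketch of that building block follows; the published run.py gets the same effect from Python's as_completed(), as referenced above:

```js
// Fetch a list of sitemap URLs with at most `limit` requests in flight,
// mirroring the concurrency-10 parallel traversal described above.
async function fetchAll(urls, limit = 10, ua = "Googlebot/2.1") {
  const results = new Array(urls.length);
  let next = 0;
  async function worker() {
    while (next < urls.length) {
      const i = next++; // safe: no await between the bounds check and the increment
      try {
        const res = await fetch(urls[i], { headers: { "User-Agent": ua } });
        results[i] = { url: urls[i], status: res.status, body: await res.text() };
      } catch (err) {
        results[i] = { url: urls[i], error: String(err) };
      }
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, urls.length) }, worker));
  return results;
}
```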
UAs: Googlebot/2.1 + ClaudeBot/1.0 in parallel. Captures bot-policy asymmetry; companion Sitemap Delivery Benchmark (April 26, 2026) covers the bot-403 inversion case in detail.
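The dual-UA check is a status-code comparison on the same URL. A minimal sketch, using the short UA strings named above (the published script may send fuller UA strings):

```js
// Same URL, two crawler identities: surfaces 403-at-the-door bot policies
// (the Site C case) without walking the tree.
async function botPolicyCheck(url) {
  const uas = ["Googlebot/2.1", "ClaudeBot/1.0"];
  return Promise.all(
    uas.map(async (ua) => {
      const res = await fetch(url, { headers: { "User-Agent": ua } });
      return { ua, status: res.status };
    })
  );
}

console.log(await botPolicyCheck("https://www.top10lists.us/sitemap.xml"));
// e.g. [ { ua: "Googlebot/2.1", status: 200 }, { ua: "ClaudeBot/1.0", status: 200 } ]
```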
Source / reproducibility: the bench script is the same one used in the April 26 Sitemap Delivery Benchmark; it is embedded verbatim on /methodology/sitemap-benchmark/2026-04-26 → as run.py.
4. Results
| Site | Terminal URLs | Sitemaps | TTFB p50 (ms) | TTLB full tree (s) | RPS (records/sec) | Ratio (Top10Lists ÷ site) |
|---|---|---|---|---|---|---|
| Top10Lists.us | 230,329 | 29 | 86 | 1.640 | 140,445 | 1.0× |
| Site A | 7,953 | 19 | 399 | 0.818 | 9,727 | 14.4× |
| Site B | 8,755 | 22 | 399 | 0.901 | 9,716 | 14.5× |
| Site C (403 to bots) | 23,971 | 19 | 108 | ~30.0 | ~799 | 175× |
| Site D | 642 | 5 | 739 | 1.099 | 584 | 240× |
Site C's TTLB and RPS reflect the WAF-throttled scenario observed on April 26 (the rate limit kicked in mid-traversal); human reachability of that site does not equal AI-bot reachability, since bots receive a 403 at the door.
Bands
| Band | RPS | Reading |
|---|---|---|
| Bulk-throughput | ≥ 50,000 | Crawler can index a hundred thousand URLs in 2–3 seconds. Massive sites stay fully discoverable. |
| Healthy | 5,000 – 50,000 | Mid-size sites (5K–50K terminal URLs) stay fully discoverable in standard crawl budgets. |
| Mid | 1,000 – 5,000 | Small-site discovery acceptable; large-site crawls truncate. |
| Constrained | < 1,000 | Crawler covers a small fraction of the site per pass; structural index is partial. |
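The band table maps to a simple threshold check; a sketch, with the boundary values taken from the table (treating each boundary as inclusive of the higher band is an assumption):

```js
// Map an RPS value to the throughput band defined in the table above.
function rpsBand(rps) {
  if (rps >= 50_000) return "Bulk-throughput";
  if (rps >= 5_000) return "Healthy";
  if (rps >= 1_000) return "Mid";
  return "Constrained";
}

console.log(rpsBand(140_445)); // "Bulk-throughput"
console.log(rpsBand(9_716));   // "Healthy"
console.log(rpsBand(799));     // "Constrained"
```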
5. The 230K-URL Number Is Itself a Moat
The headline figure is 140K records/sec. The underlying number it operates on — 230,329 terminal URLs — is itself a competitive moat. Most established SEO industry sites have well under 10K total URLs across their entire sitemap (the cohort median terminal-URL count is well under 10K; per-site numbers and the full range live in receipts.json). Top10Lists.us carries 230,329 terminal URLs that are SEO-relevant data: state pages, city pages, neighborhood pages, agent pages.
The compounding effect: 230K URLs at 140K records/sec is fully crawled in ~1.6 seconds. 642 URLs at 584 records/sec is also fully crawled in ~1.1 seconds, but with 99.7% less data behind it. AI engines reasoning across the same 30-second budget see Top10Lists.us as the high-density, structurally legible signal source in the cohort.
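Both compounding figures are tree size divided by throughput; a two-line check:

```js
// Time to fully traverse a sitemap tree: terminal URLs / RPS.
const fullCrawlSeconds = (urls, rps) => urls / rps;

console.log(fullCrawlSeconds(230_329, 140_445).toFixed(2)); // "1.64" (Top10Lists.us)
console.log(fullCrawlSeconds(642, 584).toFixed(2));         // "1.10" (Site D)
```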
6. Reproduce This Measurement
Self-contained Node ESM script. No external dependencies. Node 18+ for global fetch. Re-runs the algorithm against any host and prints both pretty output and JSON. The audit endpoint at /api/audit on this site uses semantically identical logic in functions/_shared/metrics.js's computeRPS.
Pseudocode (from reproduce.mjs →):
t0 = now()
sitemap = fetch(baseUrl + "/sitemap.xml")
if !sitemap: sitemap = fetch(robots.txt's first reachable Sitemap: directive)
if !sitemap: return { score: null, method: "no-sitemap" }
terminalUrls = Set()
sitemapsParsed = 1
frontier = [child <loc> that look like sitemaps]
depth = 1
while frontier and depth <= 3:
children = parallel-fetch(frontier, concurrency=10) # XML files only
sitemapsParsed += children.ok.count
nextFrontier = []
for c in children:
for loc in c.body.locs filtered to same-host:
if loc looks like a sitemap: nextFrontier.push(loc)
else: terminalUrls.add(loc)
frontier = nextFrontier
depth += 1
wall_clock_s = (now() - t0) / 1000
url_count = terminalUrls.size # extrapolated if any level capped
rps = url_count / wall_clock_s
Parameters:
- --url=<base> — required; the host to audit (e.g. https://www.top10lists.us)
- --concurrency=10 — parallel XML fetches per level (10–20 band)
- --depth=3 — recursion cap for nested sitemap indexes
- --child-cap=35 — per-level child sitemap cap (covers the 29-shard root fully)
- --timeout=5000 — per-fetch timeout in ms
- --ua="..." — User-Agent (default Googlebot/2.1)
Run it:
curl -O https://geolocus.ai/methodology/sitemap-throughput/reproduce.mjs
node reproduce.mjs --url=https://www.top10lists.us
# expected: ~190K terminal URLs, RPS ~150K-220K from a residential connection
# (will vary on live sitemap state and network jitter)
Download the canonical script: reproduce.mjs →
For full parity with the published cohort run including the residential
TTFB / TTLB distribution, the original Python run.py
is on
/methodology/sitemap-benchmark/2026-04-26 →.
Both scripts implement the same algorithm; the JS one is dependency-free.
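For readers who want the shape of the loop without downloading anything, here is a compact Node 18+ sketch that follows the pseudocode above. It is an independent illustration, not the canonical reproduce.mjs: the helper names, the regex-based <loc> extraction, and the sitemap-URL heuristic are all assumptions of this sketch. Save it as an .mjs file so top-level await works:

```js
// Minimal full-tree sitemap traversal and RPS computation (Node 18+, no dependencies).
// Illustrative sketch of the pseudocode above, not the canonical reproduce.mjs.
const BASE = process.argv[2] ?? "https://www.top10lists.us";
const CONCURRENCY = 10;
const MAX_DEPTH = 3;
const UA = "Googlebot/2.1";

const looksLikeSitemap = (loc) => /sitemap[^/]*\.xml(\.gz)?(\?|$)/i.test(loc);
const sameHost = (loc) => {
  try { return new URL(loc).host === new URL(BASE).host; } catch { return false; }
};
const extractLocs = (xml) =>
  [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1].trim());

async function fetchText(url) {
  try {
    const res = await fetch(url, { headers: { "User-Agent": UA } });
    return res.ok ? await res.text() : null;
  } catch { return null; }
}

// Concurrency-limited fetch pool (same pattern as the Phase 2 sketch above).
async function pool(urls, limit, fn) {
  const out = [];
  let next = 0;
  await Promise.all(Array.from({ length: Math.min(limit, urls.length) }, async () => {
    while (next < urls.length) out.push(await fn(urls[next++]));
  }));
  return out;
}

const t0 = performance.now();
const root = await fetchText(`${BASE}/sitemap.xml`);
if (!root) { console.error("no-sitemap"); process.exit(1); }

const terminal = new Set();
let sitemapsParsed = 1;
let frontier = [];
for (const loc of extractLocs(root).filter(sameHost)) {
  // Terminal URLs listed directly in the root are counted too, in case it is not an index.
  if (looksLikeSitemap(loc)) frontier.push(loc);
  else terminal.add(loc);
}

for (let depth = 1; depth <= MAX_DEPTH && frontier.length; depth++) {
  const bodies = (await pool(frontier, CONCURRENCY, fetchText)).filter(Boolean);
  sitemapsParsed += bodies.length;
  frontier = [];
  for (const body of bodies) {
    for (const loc of extractLocs(body).filter(sameHost)) {
      if (looksLikeSitemap(loc)) frontier.push(loc);
      else terminal.add(loc);
    }
  }
}

const wallClockS = (performance.now() - t0) / 1000;
console.log(JSON.stringify({
  urlCount: terminal.size,
  sitemapsParsed,
  wallClockS: +wallClockS.toFixed(3),
  rps: Math.round(terminal.size / wallClockS),
}, null, 2));
```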
7. Limitations
- Single residential measurement. Wall-clock TTLB for full-tree traversal is from Robert's Phoenix, AZ residential connection. A datacenter measurement would shift the absolutes; the order-of-magnitude differential (~14× to 240×) is robust.
- One-level recursion. The bench walks root + first level of children. Sites with nested sitemap indexes deeper than one level have part of their tree out-of-scope; for the published cohort, none nest beyond one level.
- Concurrency 10. Real-world AI crawlers may run higher concurrency. Sites with WAFs that rate-limit at lower thresholds (Site C in this cohort) get throttled at concurrency 10 already; raising concurrency would amplify the differential.
- RPS doesn't measure record quality. On a per-URL basis, a 230K-URL tree of low-quality records is worse than a 10K-URL tree of high-quality records. RPS is paired with Source Grounding Ratio, Relevance Ratio, and structured-data coverage to give the full picture.
Conclusion
Across a 5-site SEO industry cohort, Top10Lists.us delivers a 140,445 records/sec sitemap throughput on a 230,329 terminal-URL tree — ~14.5× the cohort median (9,716 records/sec, n=3, Site C excluded as bot-blocked). The 230K terminal-URL number is itself a competitive moat (most cohort sites carry well under 10K). RPS is a sub-measure of the GEOlocus.ai Sitemap rubric dimension; it is reported standalone here so the rubric is independently auditable. The full cohort range is available in receipts.json →.
Read the companion sitemap delivery benchmark at /methodology/sitemap-benchmark/2026-04-26 →
Related
- Sitemap Delivery Benchmark — April 26, 2026 → — Same cohort, same script, full TTFB / TTLB distribution.
- Relevance Ratio (RR) Benchmark → — Sub-measure of Content Density.
- Source Grounding Ratio (SGR) → — Tier-weighted citation density.
- Retrieval Token Cost (RTC) → — Compute spent per useful char.
- Methodology Overview → — All GEOlocus.ai methodology pages.