Signal-to-Noise Benchmark — SEO Industry Cohort
Top10Lists.us 100.00% mean Relevance Ratio — cohort median 73.38% (median of the three bot-reachable competitor sites; receipts.json names Sites A–D and gives the full range).
Of what an AI ingestor reads on this page, what fraction is on-topic answer body vs. peripheral chrome (nav, sidebar, related-posts, ads, newsletter forms)? Five sites, 24 pages, deterministic selection rule, public-tools-only stack.
Frozen: 2026-04-27 — measurements at this URL will not change. · Permanent dated artifact · GeoLocus Group, a subsidiary of Aryah.ai
Companion to /methodology/sitemap-benchmark/2026-04-26 →
Authors: Robert Maynard, Cofounder and CEO · LinkedIn → · Mark Garland, Cofounder and CRO · LinkedIn →
Note — 2026-04-27 post-publish revision (CDR → RR)
A first draft of this benchmark reported Content Density Ratio (CDR) — visible-content bytes
divided by total HTML bytes. CDR conflates “translation tax” (script/CSS download weight)
with “boilerplate noise” (nav/sidebar/related-posts inside the rendered page).
The metric AI ingestors actually instrument (trafilatura,
readability.js) is boilerplate noise alone, computed at the visible-text layer
after compression and rendering have already happened. This document supersedes the CDR draft.
The numbers and definition below are authoritative.
Note — 2026-04-27 post-publish revision 2 (chrome strip)
The first measurement put Top10Lists.us at 90.52% mean RR — pulled below
the upper-90s by the same site-chrome (top nav ~50 chars, footer Quick Links + contact card
~500–600 chars) the methodology calibration paragraph above flagged. The chrome was a
human-navigation aid; AI bots do not navigate. So the chrome was stripped from the
bot-facing edge functions (gildi PR #284, single-point edit in
supabase/functions/_shared/site-chrome.ts; siteHeaderHTML(),
siteFooterHTML(), siteHeaderCSS() all return empty strings;
15 caller edge functions inherit the change). Re-measurement on the same 5-URL sample with
cache flushed: 97.67% mean RR, with 4 of the 5 pages
at exactly 100.00% — pure article body, zero peripheral chrome.
The home page row remained at 88.36% because / was served by a statically
built _home.html shared by humans and bots (out of bot-HTML scope at
revision 2 — closed in revision 3 below).
JSON-LD, canonical, title, og:* tags, BreadcrumbList all preserved — pure GEO signal
stays.
Note — 2026-04-27 post-publish revision 3 (homepage fork)
Revision 2 closed every Top10Lists.us bot-facing surface except /, where
one static _home.html served humans and bots in parallel and kept the
home-page row at 88.36%. Revision 3 closes that gap by forking the homepage:
gildi PR #287 (functions/_middleware.js + public/_home-bot.html)
adds a bot-only stripped variant routed by the existing detectBot(ua)
detector at LAYER 5b. Browsers continue to receive the full _home.html
(now built into index.html) with nav, footer, and form UI intact.
Bots receive _home-bot.html: chrome blocks deleted, all content
(hero + sections + lookup form + AI notice) wrapped in a single <main>
so trafilatura keeps it as primary content rather than dropping it as boilerplate.
JSON-LD, canonical, og:* tags, and founder schemas (Robert Q18157412 + LinkedIn,
Mark Q139572756 + LinkedIn — no Wikipedia) are byte-identical across variants;
bot-counter invariant intact via the onRequest finally{}
block. Re-measurement on the same 5-URL sample lands at 100.00%
mean RR with 5 of 5 pages at exactly 100.00%. The
per-page table and aggregates below have been updated; the previous post-strip-only
numbers (97.67% mean) are preserved in the docstring change-log for audit.
The April 27, 2026 press release frames a single thesis: before AI can read your site,
it must translate it. This page is the empirical instrumentation of that thesis. We
measure the Relevance Ratio (RR) — primary-content
characters divided by total visible-text characters — across a 5-site SEO industry cohort.
RR is reproducible with trafilatura + a small boilerplate-strip helper, and
cohort-fair because every site is reduced to the same numerator/denominator definition.
Cohort: Top10Lists.us (baseline) plus four established SEO industry sites, anonymized in the page body as Site A through Site D. Per-site identities and per-page URLs are surfaced in the downloadable receipts.json →. Sample: 5 pages per site = 25 pages total (24 returned 200; one Site B page returned 403 even with Googlebot UA — see Limitations).
1. The Metric — Relevance Ratio (RR)
For each fetched page, working at the visible-text layer:
RR = primary_content_chars / total_visible_text_chars
total_visible_text_chars
= visible-text characters from the rendered HTML, after stripping ONLY
<script>, <style>, <noscript>, and HTML comments.
Keep nav, header, footer, sidebar, ads, newsletter forms,
related-articles widgets, recent-posts lists, breadcrumbs, comments —
everything a human or LLM would see if rendered.
primary_content_chars
= visible-text characters of the article body, AFTER additionally
removing the boilerplate envelope (nav/header/footer/aside/form/iframe/svg
when located outside <main>/<article>, ARIA-roled boilerplate,
and class/id substring matches for related/sidebar/widget/newsletter/
subscribe/comments/breadcrumb/recommended/promo/advert/popup/modal/
cookie/share/pagination/site-footer/site-header/nav-/navbar/menu).
Sanity floor: also compute primary content via trafilatura.extract();
take the MAX of the boilerplate-strip pass and the trafilatura pass.
Reported as a percentage. Higher RR = a smaller fraction of what the AI ingestor reads is wrapper noise.
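Worked example, taking the Site A homepage row from the per-page table in Section 4: RR = 11,853 primary-content chars / 16,712 total visible-text chars = 70.93%.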
2. Why RR Over the Prior CDR Draft
CDR (signal_bytes / total_response_body_bytes) is a defensible
measurement of crawl-time bandwidth efficiency, but it is not the metric
AI ingestors actually run. Ingestors strip script/style as the first pass (compression and
rendering happen for free), then feed the visible-text layer to a boilerplate-removal
extractor. RR measures the only thing that matters at that layer: of the words the LLM
sees, what fraction is on-topic article body vs. peripheral chrome?
CDR also penalises a site for having a heavy JS shell even when the visible-text layer is pristine. RR isolates the question Robert posed in the press-release framing: of what an AI ingestor reads on this page, how much is the answer?
We retain CDR as a separate downstream concern (it costs crawler budget but does not directly poison citations). The page that lives at this URL from this date forward measures RR.
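To make the layer difference concrete, the sketch below computes both quantities for a single page. It is a simplification, not sn_bench_rr.py: the URL and UA header are placeholders, and the RR numerator here uses only the trafilatura pass (the full pipeline in Section 3 takes the MAX of two passes).

```python
# Minimal sketch — not the benchmark script. A CDR-style ratio works on raw
# response bytes; RR works on the visible-text layer.
import requests
import trafilatura
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_text_chars(html: str) -> int:
    """Visible-text length after stripping ONLY script/style/noscript/comments."""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        c.extract()
    return len(" ".join(soup.get_text(separator=" ", strip=True).split()))

url = "https://example.com/some-article"  # placeholder URL
resp = requests.get(url, headers={"User-Agent": "Googlebot/2.1"}, timeout=30)

total_bytes = len(resp.content)                      # CDR denominator: raw body bytes
visible = visible_text_chars(resp.text)              # RR denominator: visible-text chars
primary = len(trafilatura.extract(resp.text) or "")  # RR numerator (trafilatura pass only)

print(f"CDR-style: {visible / total_bytes:.2%}")  # crawl-weight question
print(f"RR floor:  {primary / visible:.2%}")      # what-the-LLM-reads question
```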
3. Methodology
Page selection (deterministic & reproducible)
For each site:
- Fetch <homepage>/sitemap.xml.
- From the sitemap index, evaluate the first 8 shards and pick the deepest (most <url> entries). This avoids cherry-picking and ensures we benchmark the workhorse content surface, not the tiny “about” shard.
- From that shard, select 4 deterministic indices: floor(n × 1/5), floor(n × 2/5), floor(n × 3/5), floor(n × 4/5). The same pages will be selected on any future re-run; a code sketch of the rule follows below.
- Add the homepage as page 1.
Total per site: 1 homepage + 4 deep pages = 5 pages. The same URLs were measured for the prior CDR draft so the new RR figures are directly comparable.
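A minimal sketch of the selection rule, assuming <homepage>/sitemap.xml is a sitemap index (it is for this cohort); the function names here are illustrative, and sn_bench_rr.py remains the authoritative implementation:

```python
# Illustrative sketch of the deterministic page-selection rule above.
import math
import requests
from bs4 import BeautifulSoup

UA = {"User-Agent": "Googlebot/2.1"}

def fetch_locs(xml_url: str) -> list[str]:
    """All <loc> values from a sitemap or sitemap-index document."""
    xml = requests.get(xml_url, headers=UA, timeout=30).text
    return [loc.get_text(strip=True)
            for loc in BeautifulSoup(xml, "xml").find_all("loc")]

def select_pages(homepage: str) -> list[str]:
    # Assumes sitemap.xml is an index of shards (true for this cohort).
    shards = fetch_locs(homepage.rstrip("/") + "/sitemap.xml")
    # Evaluate the first 8 shards; keep the deepest (most <url> entries).
    urls = max((fetch_locs(s) for s in shards[:8]), key=len)
    n = len(urls)
    # 4 deterministic deep indices: floor(n * k/5) for k = 1..4.
    deep = [urls[math.floor(n * k / 5)] for k in range(1, 5)]
    return [homepage] + deep  # 1 homepage + 4 deep pages = 5 pages
```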
User-Agent policy (transparent)
| Site | UA used | Reason |
|---|---|---|
| Top10Lists.us | Googlebot/2.1 | No block |
| Site A | Googlebot/2.1 | No block |
| Site B | Googlebot/2.1 | No block (one deep page returned 403; flagged) |
| Site C | Browser UA (Chrome 120) | Returns 403 to all bot UAs (incl. Googlebot). Identity surfaced in receipts.json with outcome: "blocked". |
| Site D | Googlebot/2.1 | No block |
The cohort is therefore measured under the UA each site actually serves with HTTP 200. Site C's RR is a best-case figure (browser UA tends to receive richer payloads than bot UAs); the AI-bot-observed RR for Site C is N/A.
Extraction pipeline
Implemented in Python with requests + BeautifulSoup4
+ lxml + trafilatura. Source:
sn_bench_rr.py. Reproducible end-to-end with
pip install trafilatura beautifulsoup4 lxml requests.
1. Fetch URL with cohort UA, follow redirects.
2. total_visible_text_chars:
- parse HTML
- decompose <script>, <style>, <noscript>, comments
- get_text(separator=' ', strip=True), collapse whitespace
- count chars
3. primary_content_chars (boilerplate-strip pass):
- parse HTML again
- decompose <script>, <style>, <noscript>, comments
- decompose <iframe>, <svg> globally
- decompose <nav>/<header>/<footer>/<aside>/<form>
ONLY when NOT descendant of <main> or <article>
- decompose role=navigation|banner|contentinfo|complementary
- decompose any element whose class/id contains a boilerplate token
- get_text + collapse whitespace, count chars
4. primary_content_chars (trafilatura pass):
- trafilatura.extract(html, url, favor_recall=True,
include_tables=True, include_comments=False)
- count chars of returned text
5. primary_content_chars = MAX(boilerplate-strip, trafilatura)
6. RR = primary_content_chars / total_visible_text_chars
The MAX rule is the calibration choice. On clean-room HTML where <main>
contains the entire answer, boilerplate-strip wins. On WordPress sites where
<article> exists but trafilatura's classifier handles boilerplate
more aggressively, trafilatura wins. The MAX prevents either failure mode from artificially
deflating a score.
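A condensed, runnable sketch of steps 2–6, including the MAX rule (illustrative — the helper names are ours, the token list is copied from Section 1, and the published sn_bench_rr.py is authoritative):

```python
# Condensed sketch of the RR pipeline above — illustrative, not sn_bench_rr.py.
import re
import trafilatura
from bs4 import BeautifulSoup
from bs4.element import Comment

BOILERPLATE_TOKENS = [
    "related", "sidebar", "widget", "newsletter", "subscribe", "comments",
    "breadcrumb", "recommended", "promo", "advert", "popup", "modal",
    "cookie", "share", "pagination", "site-footer", "site-header",
    "nav-", "navbar", "menu",
]

def _drop(tags) -> None:
    """Decompose tags, skipping any already removed via an ancestor."""
    for t in tags:
        if not t.decomposed:
            t.decompose()

def _strip_base(soup: BeautifulSoup) -> None:
    """Shared first pass: remove script/style/noscript and HTML comments."""
    _drop(soup(["script", "style", "noscript"]))
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        c.extract()

def _chars(soup: BeautifulSoup) -> int:
    return len(" ".join(soup.get_text(separator=" ", strip=True).split()))

def _is_boilerplate(tag) -> bool:
    blob = " ".join(tag.get("class", []) + [tag.get("id") or ""]).lower()
    return any(tok in blob for tok in BOILERPLATE_TOKENS)

def relevance_ratio(html: str, url: str) -> float:
    # Step 2: total_visible_text_chars.
    soup = BeautifulSoup(html, "lxml")
    _strip_base(soup)
    total = _chars(soup)

    # Step 3: primary_content_chars, boilerplate-strip pass.
    soup = BeautifulSoup(html, "lxml")
    _strip_base(soup)
    _drop(soup(["iframe", "svg"]))
    for t in soup(["nav", "header", "footer", "aside", "form"]):
        if not t.decomposed and not t.find_parent(["main", "article"]):
            t.decompose()
    role = re.compile("navigation|banner|contentinfo|complementary")
    _drop(soup.find_all(attrs={"role": role}))
    _drop(soup.find_all(_is_boilerplate))
    strip_pass = _chars(soup)

    # Step 4: primary_content_chars, trafilatura pass.
    text = trafilatura.extract(html, url=url, favor_recall=True,
                               include_tables=True, include_comments=False) or ""
    traf_pass = len(" ".join(text.split()))

    # Steps 5-6: MAX rule, then the ratio.
    return max(strip_pass, traf_pass) / total if total else 0.0
```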
Calibration
Original 2026-04-27 measurement: Top10Lists.us bot HTML rendered <body>
= a small <header class="site-header"> (~50 chars: top-nav links)
+ <main> (the entire answer body) +
<footer class="site-footer"> (~500–600 chars: contact + Quick Links),
and we measured 85–97% RR with a page-mean of 90.52% — the fixed ~600-char
site-chrome amortising poorly across smaller pages.
Post-strip (2026-04-27, gildi PR #284): the site-chrome was removed from bot HTML. Re-measurement
on the same 5-URL sample lands at a 97.67% page-mean, with 4 of 5 pages at exactly
100.00%. The home-page row remained at 88.36% because / was
served by a static _home.html shared with humans — outside bot-HTML scope
at that point.
Post-fork (2026-04-27, gildi PR #287): the homepage forked into a bot-only stripped variant
(public/_home-bot.html: chrome deleted, content wrapped in
<main>) routed at / by the existing
detectBot(ua) detector in functions/_middleware.js.
Browsers continue to receive the full _home.html with nav, footer, and form
UI intact. Re-measurement on the same 5-URL sample lands at a
100.00% page-mean, with 5 of 5 pages at
exactly 100.00%. The cohort comparison and bands below reflect this post-fork
ground truth; the per-page raw RR table preserves pre-strip, post-strip, and post-fork rows so
the reader can see exactly which chars went where at each stage.
How to reproduce
The bench script sn_bench_rr.py is available on request. Anyone with
Python 3 and the four libraries above can re-execute the measurement against the same
cohort — and because the page-selection rule is deterministic, will land on the same
five pages per site (modulo sitemap drift).
4. Per-Page RR (raw)
25 pages selected (5 sites × 5 pages); 24 measured, as one Site B page returned 403. Top10Lists.us rows show both pre-strip and post-fork RR — the chrome-strip (revision 2, gildi PR #284) and the homepage fork (revision 3, gildi PR #287) both happened on 2026-04-27, and we preserve all readings so the staged effect is auditable rather than retconned. The post-fork column is the current ground truth. Cohort site URLs are surfaced under their concrete domains in receipts.json →; competitor path stems below are kept opaque in the page body to preserve the Site A–D mapping.
| Site | Page | Status | Total visible (pre) | Primary (pre) | RR (pre) | RR (post-fork) |
|---|---|---|---|---|---|---|
| Top10Lists.us | / | 200 | 6,738 | 5,954 | 88.36% | 100.00% (forked) |
| Top10Lists.us | /arkansas/texarkana/comet/top10realestateagents | 200 | 4,111 | 3,495 | 85.02% | 100.00% |
| Top10Lists.us | /california/long-beach/el-dorado-south/top10realestateagents | 200 | 18,948 | 18,332 | 96.75% | 100.00% |
| Top10Lists.us | /connecticut/hartford/prospect-hill-historic-district/top10realestateagents | 200 | 5,310 | 4,694 | 88.40% | 100.00% |
| Top10Lists.us | /florida/pierson/seville/top10realestateagents | 200 | 10,386 | 9,770 | 94.07% | 100.00% |
| Site A | homepage | 200 | 16,712 | 11,853 | 70.93% | n/a |
| Site A | blog post #1 | 200 | 19,184 | 14,458 | 75.36% | n/a |
| Site A | blog post #2 | 200 | 17,205 | 13,038 | 75.78% | n/a |
| Site A | blog post #3 | 200 | 16,723 | 12,519 | 74.86% | n/a |
| Site A | blog post #4 | 200 | 14,698 | 10,282 | 69.96% | n/a |
| Site B | homepage | 200 | 14,599 | 12,123 | 83.04% | n/a |
| Site B | glossary page #1 | 200 | 9,771 | 6,345 | 64.94% | n/a |
| Site B | booking form | 200 | 3,865 | 1,568 | 40.57% | n/a |
| Site B | deep page (403) | 403 | — | — | skipped | n/a |
| Site B | glossary page #2 | 200 | 12,027 | 8,682 | 72.19% | n/a |
| Site C | homepage | 200 | 15,079 | 11,268 | 74.73% | n/a |
| Site C | news post #1 | 200 | 7,062 | 3,645 | 51.61% | n/a |
| Site C | news post #2 | 200 | 6,145 | 2,725 | 44.34% | n/a |
| Site C | news post #3 | 200 | 6,687 | 3,266 | 48.84% | n/a |
| Site C | news post #4 | 200 | 6,798 | 3,377 | 49.68% | n/a |
| Site D | homepage | 200 | 27,605 | 21,047 | 76.24% | n/a |
| Site D | blog post #1 | 200 | 19,456 | 16,597 | 85.31% | n/a |
| Site D | blog post #2 | 200 | 15,768 | 12,943 | 82.08% | n/a |
| Site D | blog post #3 | 200 | 20,684 | 17,543 | 84.81% | n/a |
| Site D | blog post #4 | 200 | 18,453 | 15,783 | 85.53% | n/a |
5. Per-Site Aggregates
| Site | n | Mean RR | Median RR | UA |
|---|---|---|---|---|
| Top10Lists.us (post-fork) | 5 | 100.00% | 100.00% | Googlebot/2.1 |
| Top10Lists.us (post-strip, superseded) | 5 | 97.67% | 100.00% | Googlebot/2.1 |
| Top10Lists.us (pre-strip, superseded) | 5 | 90.52% | 88.40% | Googlebot/2.1 |
| Site D | 5 | 82.80% | 84.81% | Googlebot/2.1 |
| Site A | 5 | 73.38% | 74.86% | Googlebot/2.1 |
| Site B | 4 | 65.18% | 68.56% | Googlebot/2.1 |
| Site C | 5 | 53.84% | 49.68% | Browser UA (bot 403) |
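Spot-check: the aggregate rows recompute directly from the per-page table — Site A mean = (70.93 + 75.36 + 75.78 + 74.86 + 69.96) / 5 = 73.38%, and Site C mean = (74.73 + 51.61 + 44.34 + 48.84 + 49.68) / 5 = 53.84%.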
Bands
| Band | RR mean | Meaning for an AI ingestor |
|---|---|---|
| Near-pristine | ≥ 90% | Almost everything the LLM reads is the answer. Minimal site-header/footer, no related-posts widgets, no sidebars. |
| Article-clean | 75% – 90% | Solid article extraction; chrome and widgets exist but stay below the noise floor of modern boilerplate-removers. |
| Mid | 55% – 75% | Significant peripheral content (sidebars, related lists, comments) competes with the article body for attention. |
| Noise-heavy | < 55% | Article body is a minority of what the ingestor reads; the page is dominated by widgets, repeated nav copy, recommended-posts grids, or aggressive sidebar promos. |
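For readers scripting against receipts.json, the band cut-offs above reduce to a trivial helper (illustrative only — the bands are defined on this page, not in sn_bench_rr.py):

```python
def rr_band(mean_rr: float) -> str:
    """Map a mean RR (percent) onto the bands defined above."""
    if mean_rr >= 90.0:
        return "Near-pristine"
    if mean_rr >= 75.0:
        return "Article-clean"
    if mean_rr >= 55.0:
        return "Mid"
    return "Noise-heavy"

assert rr_band(82.80) == "Article-clean"  # Site D
assert rr_band(53.84) == "Noise-heavy"    # Site C
```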
6. Per-Site Interpretation
Top10Lists.us — pristine (100.00% post-fork)
The clean-room HTML strategy shows up directly: <main> is the
entire answer, no related-articles, no sidebars, no comments, no recommended-posts. The
2026-04-27 chrome strip (gildi PR #284, single-point edit in
_shared/site-chrome.ts) took 4 of the 5 sample pages to exactly
100.00%; the homepage fork that immediately followed (gildi PR #287,
new bot-only public/_home-bot.html routed at /
via functions/_middleware.js) closes the last gap and brings
5 of 5 sample pages to 100.00% — pure article body,
zero peripheral chrome anywhere on the bot-facing surface. JSON-LD, canonical, title,
og:* and BreadcrumbList tags all preserved (they live in <script>
tags and <head>, which are stripped before the visible-text
measurement — pure GEO signal stays). Pre-strip the page-mean was 90.52%; post-strip
was 97.67%; post-fork is 100.00%. All three are preserved in the per-page table and the
aggregate ladder for audit.
Site D — article-clean (82.80%)
Modern WordPress theme with semantic <article> and minimal
sidebar bleed-through. Blog posts cluster at 82–86%; the site honours the
article-body / chrome separation that boilerplate extractors rely on.
Site A — mid-to-clean (73.38%)
Heavy editorial site with substantial related-content blocks, but the body content is recognised correctly: three of the four long-form posts cluster at ~75% (the fourth lands at 69.96%). The homepage dips to 70.93% because it is mostly nav + tile grid.
Site B — mid (65.18%)
Large gap between the two glossary pages (65–72%) and the booking page (40.57%). The booking page is dominated by a form — almost no article body — so RR honestly reflects “this is a form, not an answer.”
Site C — noise-heavy (53.84%)
Roughly half of what an LLM reads on a Site C blog post is peripheral chrome (related news, recent posts, share widgets, sidebar). And this is the browser UA figure — the bot UA gets a 403 at the door, so an AI crawler observes 0%. The lowest-relevance site we measured. Concrete identity surfaced in receipts.json.
7. Why Site B's Reading Shifts vs the CDR Draft
Two different questions, two different answers.
In the superseded CDR pass, Site B landed at 0.88% — bottom of the cohort — because the Site B homepage and glossary pages download multi-MB HTML for ~10 KB of visible content. That CDR figure is correct as a measurement of crawl-time bandwidth waste.
It is not correct as a measurement of boilerplate noise: once you reach the visible-text layer, Site B's glossary pages have substantive article bodies (RR 65–72%). The 0.88%-to-65.18% shift is not a contradiction — it is the difference between “how much script/CSS does the crawler download” (CDR) and “of what the LLM ends up reading, how much is article” (RR).
RR is the question this benchmark answers. CDR is a separate downstream concern (it costs crawler budget but does not directly poison citations).
8. The Agency-Bot-Block Inversion
Site C is a content-marketing agency. Its commercial pitch is that it produces content for
clients. Yet the site returns HTTP 403 Forbidden
to every bot UA we tested — Googlebot, ClaudeBot, GPTBot. The site is unreachable to
AI ingestion at the door. The concrete identity is surfaced in
receipts.json →
with outcome: "blocked" evidence.
We measured under a Chrome 120 desktop UA so we could compute any RR figure for Site C at all. That figure — 53.84% mean — is the best case. The real RR an AI crawler would observe is N/A, because those crawlers can't reach the page at all. And even at best case, roughly half of what a browser-based reader sees is peripheral chrome.
An agency that 403s the very systems clients are paying it to be visible to is the clearest single-data-point illustration of the gap between SEO industry posture and AI ingestion reality. We document it here so the receipt exists.
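A minimal probe of the pattern documented above — same URL, several bot UAs, record the status. The UA strings are shortened illustrations (production crawlers send fuller tokens); receipts.json carries the actual evidence for Site C:

```python
import requests

# Shortened, illustrative bot UA tokens — real crawlers send fuller strings.
BOT_UAS = {
    "Googlebot": "Googlebot/2.1",
    "GPTBot": "GPTBot/1.0",
    "ClaudeBot": "ClaudeBot/1.0",
}

def probe(url: str) -> dict[str, int]:
    """Fetch the same URL under each bot UA and record the HTTP status."""
    return {name: requests.get(url, headers={"User-Agent": ua},
                               timeout=30).status_code
            for name, ua in BOT_UAS.items()}

# 403 across the board is the "blocked at the door" outcome recorded
# for Site C in receipts.json (outcome: "blocked").
```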
9. Limitations
- Sample size. 5 pages per site. We chose deterministic indices to make the result reproducible, not exhaustive. A full-cohort run (e.g., 50 pages/site) would tighten confidence intervals; ranks are unlikely to flip between adjacent bands.
- Site C UA workaround. Site C returns 403 to every bot UA we tested. We measured under a Chrome 120 desktop UA. This is a best case for Site C — the RR an AI crawler would observe is N/A because those crawlers can't reach the page at all.
- One Site B page returned 403. One deep Site B page returned 403 even with Googlebot UA; n=4 for that site (concrete URL in receipts.json).
- Boilerplate detection is heuristic. The class/id token list and semantic-tag rules are calibrated to the cohort. Sites that use unusual class names for related-posts widgets may receive an inflated RR.
- MAX-of-two extractors is a calibration choice. We adopt it because either extractor alone has known failure modes (boilerplate-strip under-extracts on plain-<div> blogs; trafilatura under-extracts on structured-data-heavy pages with agent grids and license cards). The MAX rule biases toward generosity; reported RR is an upper bound on what an ingestor's specific extractor would compute.
- RR doesn't measure content quality. A page can have a 95% RR and still be junk; conversely, a page with rich body content can be obscured by heavy chrome. RR measures where the noise lives, not whether the answer is correct. For correctness, the AIFS Probe layer (4-platform citation rate) remains the authoritative measurement.
10. Connection to the 9-Dimension GEO Rubric
The canonical Aryah AI GEO rubric awards 15 points for the Content Density dimension — co-equal weight with AI Bot Access and Authority. Until now, that 15-point allocation has been scored against soft heuristics (word count, schema entity density, body-to-shell ratio approximations).
RR is a sharper instrumentation of the same dimension at the layer that actually matters — the visible-text layer the ingestor's extractor consumes. We are not proposing to replace the 15-point Content Density score with RR alone. RR is one of the instrumentations under that umbrella, alongside semantic-container coverage, content-to-chrome word ratios, and JSON-LD entity density.
What RR adds: a single-number reproducibility check that a third-party reader can run in
five minutes, against any site, on any machine with Python 3,
trafilatura, and BeautifulSoup4. That is the
standard the press release frames — not “trust our composite,” but
“here is the script; run it yourself.”
Conclusion
Across a 5-site SEO industry cohort — Top10Lists.us plus four established SEO industry sites (anonymized as Site A–D in this page; identities in receipts.json) — Top10Lists.us delivers a 100.00% mean Relevance Ratio (post homepage fork 2026-04-27): the highest in the cohort, with 5 of 5 sample pages registering at exactly 100.00%. The same 5 URLs measured 90.52% mean before any chrome intervention, then 97.67% after the bot-HTML chrome strip (gildi PR #284), then 100.00% after the homepage fork (gildi PR #287); we preserve all three readings in the per-page table and aggregate ladder so the staged intervention is auditable. The cohort median (n=3, excluding Site C, which 403s every bot UA) lands at 73.38% — a clean ~27 percentage-point gap — and at the low end of the bot-reachable cohort, one competitor retains only about two-thirds of its visible text as body content (Site B, 65.18% mean RR), the rest ceding to peripheral chrome. The full cohort range is available in receipts.json →.
Read the press release this benchmark supports at geolocus.ai/press →
Related
- Sitemap Delivery Benchmark — April 26, 2026 → — Companion frozen artifact (records, TTFB, throughput, bot accessibility).
- GEO Evaluation Methodology — April 26, 2026 → — File checks + multi-system AI evaluation methodology.
- Methodology Overview → — All GEOlocus.ai methodology pages.
- Press → — Press release referencing this benchmark.