Signal-to-Noise Benchmark — SEO Industry Cohort
Top10Lists.us 100.00% mean Relevance Ratio — cohort median 73.38% (median of the three bot-reachable competitor sites; receipts.json names Sites A–D and gives the full range).
Of what an AI ingestor reads on this page, what fraction is on-topic answer body vs. peripheral chrome (nav, sidebar, related-posts, ads, newsletter forms)? Five sites, 24 pages, deterministic selection rule, public-tools-only stack.
Frozen: 2026-04-27 — measurements at this URL will not change. · Permanent dated artifact · GeoLocus Group, a subsidiary of Aryah.ai
Companion to /methodology/sitemap-benchmark/2026-04-26 →
Authors: Robert Maynard, Cofounder and CEO · LinkedIn → · Mark Garland, Cofounder and CRO · LinkedIn →
Note — 2026-04-27 post-publish revision (CDR → RR)
A first draft of this benchmark reported Content Density Ratio (CDR) — visible-content bytes
divided by total HTML bytes. CDR conflates “translation tax” (script/CSS download weight)
with “boilerplate noise” (nav/sidebar/related-posts inside the rendered page).
The metric AI ingestors actually instrument (trafilatura,
readability.js) is boilerplate noise alone, computed at the visible-text layer
after compression and rendering have already happened. This document supersedes the CDR draft.
The numbers and definition below are authoritative.
Note — 2026-04-27 post-publish revision 2 (chrome strip)
The first measurement put Top10Lists.us at 90.52% mean RR — pulled below
the upper-90s by the same site-chrome (top nav ~50 chars, footer Quick Links + contact card
~500–600 chars) the methodology calibration paragraph above flagged. The chrome was a
human-navigation aid; AI bots do not navigate. So the chrome was stripped from the
bot-facing edge functions (gildi PR #284, single-point edit in
supabase/functions/_shared/site-chrome.ts; siteHeaderHTML(),
siteFooterHTML(), siteHeaderCSS() all return empty strings;
15 caller edge functions inherit the change). Re-measurement on the same 5-URL sample with
cache flushed: 97.67% mean RR, with 4 of the 5 pages
at exactly 100.00% — pure article body, zero peripheral chrome.
The home page row remained at 88.36% because / was served by a statically
built _home.html shared by humans and bots (out of bot-HTML scope at
revision 2 — closed in revision 3 below).
JSON-LD, canonical, title, og:* tags, BreadcrumbList all preserved — pure GEO signal
stays.
Note — 2026-04-27 post-publish revision 3 (homepage fork)
Revision 2 closed every Top10Lists.us bot-facing surface except /, where
one static _home.html served humans and bots in parallel and kept the
home-page row at 88.36%. Revision 3 closes that gap by forking the homepage:
gildi PR #287 (functions/_middleware.js + public/_home-bot.html)
adds a bot-only stripped variant routed by the existing detectBot(ua)
detector at LAYER 5b. Browsers continue to receive the full _home.html
(now built into index.html) with nav, footer, and form UI intact.
Bots receive _home-bot.html: chrome blocks deleted, all content
(hero + sections + lookup form + AI notice) wrapped in a single <main>
so trafilatura keeps it as primary content rather than dropping it as boilerplate.
JSON-LD, canonical, og:* tags, and founder schemas (Robert Q18157412 + LinkedIn,
Mark Q139572756 + LinkedIn — no Wikipedia) are byte-identical across variants;
bot-counter invariant intact via the onRequest finally{}
block. Re-measurement on the same 5-URL sample lands at 100.00%
mean RR with 5 of 5 pages at exactly 100.00%. The
per-page table and aggregates below have been updated; the previous post-strip-only
numbers (97.67% mean) are preserved in the docstring change-log for audit.
The April 27, 2026 press release frames a single thesis: before AI can read your site,
it must translate it. This page is the empirical instrumentation of that thesis. We
measure the Relevance Ratio (RR) — primary-content
characters divided by total visible-text characters — across a 5-site SEO industry cohort.
RR is reproducible with trafilatura + a small boilerplate-strip helper, and
cohort-fair because every site is reduced to the same numerator/denominator definition.
Cohort: Top10Lists.us (baseline) plus four established SEO industry sites, anonymized in the page body as Site A through Site D. Per-site identities and per-page URLs are surfaced in the downloadable receipts.json →. Sample: 5 pages per site = 25 pages total (24 returned 200; one Site B page returned 403 even with Googlebot UA — see Limitations).
1. The Metric — Relevance Ratio (RR)
For each fetched page, working at the visible-text layer:
RR = primary_content_chars / total_visible_text_chars
total_visible_text_chars
= visible-text characters from the rendered HTML, after stripping ONLY
<script>, <style>, <noscript>, and HTML comments.
Keep nav, header, footer, sidebar, ads, newsletter forms,
related-articles widgets, recent-posts lists, breadcrumbs, comments —
everything a human or LLM would see if rendered.
primary_content_chars
= visible-text characters of the article body, AFTER additionally
removing the boilerplate envelope (nav/header/footer/aside/form/iframe/svg
when located outside <main>/<article>, ARIA-roled boilerplate,
and class/id substring matches for related/sidebar/widget/newsletter/
subscribe/comments/breadcrumb/recommended/promo/advert/popup/modal/
cookie/share/pagination/site-footer/site-header/nav-/navbar/menu).
Sanity floor: also compute primary content via trafilatura.extract();
take the MAX of the boilerplate-strip pass and the trafilatura pass.
Reported as a percentage. Higher RR = a smaller fraction of what the AI ingestor reads is wrapper noise.
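Worked example, taking the Site A homepage row from the per-page table in Section 4: RR = 11,853 primary-content chars / 16,712 total visible-text chars = 70.93%.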
2. Why RR Over the Prior CDR Draft
CDR (signal_bytes / total_response_body_bytes) is a defensible
measurement of crawl-time bandwidth efficiency, but it is not the metric
AI ingestors actually run. Ingestors strip script/style as the first pass (compression and
rendering happen for free), then feed the visible-text layer to a boilerplate-removal
extractor. RR measures the only thing that matters at that layer: of the words the LLM
sees, what fraction is on-topic article body vs. peripheral chrome?
CDR also penalises a site for having a heavy JS shell even when the visible-text layer is pristine. RR isolates the question Robert posed in the press-release framing: of what an AI ingestor reads on this page, how much is the answer?
We retain CDR as a separate downstream concern (it costs crawler budget but does not directly poison citations). The page that lives at this URL from this date forward measures RR.
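To make the layer difference concrete, the sketch below computes both quantities for a single page. It is a simplification, not sn_bench_rr.py: the URL and UA header are placeholders, and the RR numerator here uses only the trafilatura pass (the full pipeline in Section 3 takes the MAX of two passes).

```python
# Minimal sketch — not the benchmark script. A CDR-style ratio works on raw
# response bytes; RR works on the visible-text layer.
import requests
import trafilatura
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_text_chars(html: str) -> int:
    """Visible-text length after stripping ONLY script/style/noscript/comments."""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        c.extract()
    return len(" ".join(soup.get_text(separator=" ", strip=True).split()))

url = "https://example.com/some-article"  # placeholder URL
resp = requests.get(url, headers={"User-Agent": "Googlebot/2.1"}, timeout=30)

total_bytes = len(resp.content)                      # CDR denominator: raw body bytes
visible = visible_text_chars(resp.text)              # RR denominator: visible-text chars
primary = len(trafilatura.extract(resp.text) or "")  # RR numerator (trafilatura pass only)

print(f"CDR-style: {visible / total_bytes:.2%}")  # crawl-weight question
print(f"RR floor:  {primary / visible:.2%}")      # what-the-LLM-reads question
```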
3. Methodology
Page selection (deterministic & reproducible)
For each site:
- Fetch <homepage>/sitemap.xml.
- From the sitemap index, evaluate the first 8 shards and pick the deepest (most <url> entries). This avoids cherry-picking and ensures we benchmark the workhorse content surface, not the tiny “about” shard.
- From that shard, select 4 deterministic indices: floor(n × 1/5), floor(n × 2/5), floor(n × 3/5), floor(n × 4/5). The same pages will be selected on any future re-run; a code sketch of the rule follows below.
- Add the homepage as page 1.
Total per site: 1 homepage + 4 deep pages = 5 pages. The same URLs were measured for the prior CDR draft so the new RR figures are directly comparable.
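A minimal sketch of the selection rule, assuming <homepage>/sitemap.xml is a sitemap index (it is for this cohort); the function names here are illustrative, and sn_bench_rr.py remains the authoritative implementation:

```python
# Illustrative sketch of the deterministic page-selection rule above.
import math
import requests
from bs4 import BeautifulSoup

UA = {"User-Agent": "Googlebot/2.1"}

def fetch_locs(xml_url: str) -> list[str]:
    """All <loc> values from a sitemap or sitemap-index document."""
    xml = requests.get(xml_url, headers=UA, timeout=30).text
    return [loc.get_text(strip=True)
            for loc in BeautifulSoup(xml, "xml").find_all("loc")]

def select_pages(homepage: str) -> list[str]:
    # Assumes sitemap.xml is an index of shards (true for this cohort).
    shards = fetch_locs(homepage.rstrip("/") + "/sitemap.xml")
    # Evaluate the first 8 shards; keep the deepest (most <url> entries).
    urls = max((fetch_locs(s) for s in shards[:8]), key=len)
    n = len(urls)
    # 4 deterministic deep indices: floor(n * k/5) for k = 1..4.
    deep = [urls[math.floor(n * k / 5)] for k in range(1, 5)]
    return [homepage] + deep  # 1 homepage + 4 deep pages = 5 pages
```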
User-Agent policy (transparent)
| Site | UA used | Reason |
|---|---|---|
| Top10Lists.us | Googlebot/2.1 | No block |
| Site A | Googlebot/2.1 | No block |
| Site B | Googlebot/2.1 | No block (one deep page returned 403; flagged) |
| Site C | Browser UA (Chrome 120) | Returns 403 to all bot UAs (incl. Googlebot). Identity surfaced in receipts.json with outcome: "blocked". |
| Site D | Googlebot/2.1 | No block |
The cohort is therefore measured under the UA each site actually serves with HTTP 200. Site C's RR is a best-case figure (browser UA tends to receive richer payloads than bot UAs); the AI-bot-observed RR for Site C is N/A.
Extraction pipeline
Implemented in Python with requests + BeautifulSoup4
+ lxml + trafilatura. Source:
sn_bench_rr.py. Reproducible end-to-end with
pip install trafilatura beautifulsoup4 lxml requests.
1. Fetch URL with cohort UA, follow redirects.
2. total_visible_text_chars:
- parse HTML
- decompose <script>, <style>, <noscript>, comments
- get_text(separator=' ', strip=True), collapse whitespace
- count chars
3. primary_content_chars (boilerplate-strip pass):
- parse HTML again
- decompose <script>, <style>, <noscript>, comments
- decompose <iframe>, <svg> globally
- decompose <nav>/<header>/<footer>/<aside>/<form>
ONLY when NOT descendant of <main> or <article>
- decompose role=navigation|banner|contentinfo|complementary
- decompose any element whose class/id contains a boilerplate token
- get_text + collapse whitespace, count chars
4. primary_content_chars (trafilatura pass):
- trafilatura.extract(html, url, favor_recall=True,
include_tables=True, include_comments=False)
- count chars of returned text
5. primary_content_chars = MAX(boilerplate-strip, trafilatura)
6. RR = primary_content_chars / total_visible_text_chars
The MAX rule is the calibration choice. On clean-room HTML where <main>
contains the entire answer, boilerplate-strip wins. On WordPress sites where
<article> exists but trafilatura's classifier handles boilerplate
more aggressively, trafilatura wins. The MAX prevents either failure mode from artificially
deflating a score.
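A condensed, runnable sketch of steps 2–6, including the MAX rule (illustrative — the helper names are ours, the token list is copied from Section 1, and the published sn_bench_rr.py is authoritative):

```python
# Condensed sketch of the RR pipeline above — illustrative, not sn_bench_rr.py.
import re
import trafilatura
from bs4 import BeautifulSoup
from bs4.element import Comment

BOILERPLATE_TOKENS = [
    "related", "sidebar", "widget", "newsletter", "subscribe", "comments",
    "breadcrumb", "recommended", "promo", "advert", "popup", "modal",
    "cookie", "share", "pagination", "site-footer", "site-header",
    "nav-", "navbar", "menu",
]

def _drop(tags) -> None:
    """Decompose tags, skipping any already removed via an ancestor."""
    for t in tags:
        if not t.decomposed:
            t.decompose()

def _strip_base(soup: BeautifulSoup) -> None:
    """Shared first pass: remove script/style/noscript and HTML comments."""
    _drop(soup(["script", "style", "noscript"]))
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        c.extract()

def _chars(soup: BeautifulSoup) -> int:
    return len(" ".join(soup.get_text(separator=" ", strip=True).split()))

def _is_boilerplate(tag) -> bool:
    blob = " ".join(tag.get("class", []) + [tag.get("id") or ""]).lower()
    return any(tok in blob for tok in BOILERPLATE_TOKENS)

def relevance_ratio(html: str, url: str) -> float:
    # Step 2: total_visible_text_chars.
    soup = BeautifulSoup(html, "lxml")
    _strip_base(soup)
    total = _chars(soup)

    # Step 3: primary_content_chars, boilerplate-strip pass.
    soup = BeautifulSoup(html, "lxml")
    _strip_base(soup)
    _drop(soup(["iframe", "svg"]))
    for t in soup(["nav", "header", "footer", "aside", "form"]):
        if not t.decomposed and not t.find_parent(["main", "article"]):
            t.decompose()
    role = re.compile("navigation|banner|contentinfo|complementary")
    _drop(soup.find_all(attrs={"role": role}))
    _drop(soup.find_all(_is_boilerplate))
    strip_pass = _chars(soup)

    # Step 4: primary_content_chars, trafilatura pass.
    text = trafilatura.extract(html, url=url, favor_recall=True,
                               include_tables=True, include_comments=False) or ""
    traf_pass = len(" ".join(text.split()))

    # Steps 5-6: MAX rule, then the ratio.
    return max(strip_pass, traf_pass) / total if total else 0.0
```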
Calibration
Original 2026-04-27 measurement: Top10Lists.us bot HTML rendered <body>
= a small <header class="site-header"> (~50 chars: top-nav links)
+ <main> (the entire answer body) +
<footer class="site-footer"> (~500–600 chars: contact + Quick Links),
and we measured 85–97% RR with a page-mean of 90.52% — the fixed ~600-char
site-chrome amortising poorly across smaller pages.
Post-strip (2026-04-27, gildi PR #284): the site-chrome was removed from bot HTML. Re-measurement
on the same 5-URL sample lands at a 97.67% page-mean, with 4 of 5 pages at exactly
100.00%. The home-page row remained at 88.36% because / was
served by a static _home.html shared with humans — outside bot-HTML scope
at that point.
Post-fork (2026-04-27, gildi PR #287): the homepage forked into a bot-only stripped variant
(public/_home-bot.html: chrome deleted, content wrapped in
<main>) routed at / by the existing
detectBot(ua) detector in functions/_middleware.js.
Browsers continue to receive the full _home.html with nav, footer, and form
UI intact. Re-measurement on the same 5-URL sample lands at a
100.00% page-mean, with 5 of 5 pages at
exactly 100.00%. The cohort comparison and bands below reflect this post-fork
ground truth; the per-page raw RR table preserves pre-strip, post-strip, and post-fork rows so
the reader can see exactly which chars went where at each stage.
How to reproduce
The bench script sn_bench_rr.py is available on request. Anyone with
Python 3 and the four libraries above can re-execute the measurement against the same
cohort — and because the page-selection rule is deterministic, will land on the same
five pages per site (modulo sitemap drift).
4. Per-Page RR (raw)
25 pages selected (5 sites × 5 pages); 24 measured, as one Site B page returned 403. Top10Lists.us rows show both pre-strip and post-fork RR — the chrome-strip (revision 2, gildi PR #284) and the homepage fork (revision 3, gildi PR #287) both happened on 2026-04-27, and we preserve all readings so the staged effect is auditable rather than retconned. The post-fork column is the current ground truth. Cohort site URLs are surfaced under their concrete domains in receipts.json →; competitor path stems below are kept opaque in the page body to preserve the Site A–D mapping.
| Site | Page | Status | Total visible (pre) | Primary (pre) | RR (pre) | RR (post-fork) |
|---|---|---|---|---|---|---|
| Top10Lists.us | / | 200 | 6,738 | 5,954 | 88.36% | 100.00% (forked) |
| Top10Lists.us | /arkansas/texarkana/comet/top10realestateagents | 200 | 4,111 | 3,495 | 85.02% | 100.00% |
| Top10Lists.us | /california/long-beach/el-dorado-south/top10realestateagents | 200 | 18,948 | 18,332 | 96.75% | 100.00% |
| Top10Lists.us | /connecticut/hartford/prospect-hill-historic-district/top10realestateagents | 200 | 5,310 | 4,694 | 88.40% | 100.00% |
| Top10Lists.us | /florida/pierson/seville/top10realestateagents | 200 | 10,386 | 9,770 | 94.07% | 100.00% |
| Site A | homepage | 200 | 16,712 | 11,853 | 70.93% | n/a |
| Site A | blog post #1 | 200 | 19,184 | 14,458 | 75.36% | n/a |
| Site A | blog post #2 | 200 | 17,205 | 13,038 | 75.78% | n/a |
| Site A | blog post #3 | 200 | 16,723 | 12,519 | 74.86% | n/a |
| Site A | blog post #4 | 200 | 14,698 | 10,282 | 69.96% | n/a |
| Site B | homepage | 200 | 14,599 | 12,123 | 83.04% | n/a |
| Site B | glossary page #1 | 200 | 9,771 | 6,345 | 64.94% | n/a |
| Site B | booking form | 200 | 3,865 | 1,568 | 40.57% | n/a |
| Site B | deep page (403) | 403 | — | — | skipped | n/a |
| Site B | glossary page #2 | 200 | 12,027 | 8,682 | 72.19% | n/a |
| Site C | homepage | 200 | 15,079 | 11,268 | 74.73% | n/a |
| Site C | news post #1 | 200 | 7,062 | 3,645 | 51.61% | n/a |
| Site C | news post #2 | 200 | 6,145 | 2,725 | 44.34% | n/a |
| Site C | news post #3 | 200 | 6,687 | 3,266 | 48.84% | n/a |
| Site C | news post #4 | 200 | 6,798 | 3,377 | 49.68% | n/a |
| Site D | homepage | 200 | 27,605 | 21,047 | 76.24% | n/a |
| Site D | blog post #1 | 200 | 19,456 | 16,597 | 85.31% | n/a |
| Site D | blog post #2 | 200 | 15,768 | 12,943 | 82.08% | n/a |
| Site D | blog post #3 | 200 | 20,684 | 17,543 | 84.81% | n/a |
| Site D | blog post #4 | 200 | 18,453 | 15,783 | 85.53% | n/a |
5. Per-Site Aggregates
| Site | n | Mean RR | Median RR | UA |
|---|---|---|---|---|
| Top10Lists.us (post-fork) | 5 | 100.00% | 100.00% | Googlebot/2.1 |
| Top10Lists.us (post-strip, superseded) | 5 | 97.67% | 100.00% | Googlebot/2.1 |
| Top10Lists.us (pre-strip, superseded) | 5 | 90.52% | 88.40% | Googlebot/2.1 |
| Site D | 5 | 82.80% | 84.81% | Googlebot/2.1 |
| Site A | 5 | 73.38% | 74.86% | Googlebot/2.1 |
| Site B | 4 | 65.18% | 68.56% | Googlebot/2.1 |
| Site C | 5 | 53.84% | 49.68% | Browser UA (bot 403) |
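Spot-check: the aggregate rows recompute directly from the per-page table — Site A mean = (70.93 + 75.36 + 75.78 + 74.86 + 69.96) / 5 = 73.38%, and Site C mean = (74.73 + 51.61 + 44.34 + 48.84 + 49.68) / 5 = 53.84%.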
Bands
| Band | RR mean | Meaning for an AI ingestor |
|---|---|---|
| Near-pristine | ≥ 90% | Almost everything the LLM reads is the answer. Minimal site-header/footer, no related-posts widgets, no sidebars. |
| Article-clean | 75% – 90% | Solid article extraction; chrome and widgets exist but stay below the noise floor of modern boilerplate-removers. |
| Mid | 55% – 75% | Significant peripheral content (sidebars, related lists, comments) competes with the article body for attention. |
| Noise-heavy | < 55% | Article body is a minority of what the ingestor reads; the page is dominated by widgets, repeated nav copy, recommended-posts grids, or aggressive sidebar promos. |
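For readers scripting against receipts.json, the band cut-offs above reduce to a trivial helper (illustrative only — the bands are defined on this page, not in sn_bench_rr.py):

```python
def rr_band(mean_rr: float) -> str:
    """Map a mean RR (percent) onto the bands defined above."""
    if mean_rr >= 90.0:
        return "Near-pristine"
    if mean_rr >= 75.0:
        return "Article-clean"
    if mean_rr >= 55.0:
        return "Mid"
    return "Noise-heavy"

assert rr_band(82.80) == "Article-clean"  # Site D
assert rr_band(53.84) == "Noise-heavy"    # Site C
```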
6. Per-Site Interpretation
Top10Lists.us — pristine (100.00% post-fork)
The clean-room HTML strategy shows up directly: <main> is the
entire answer, no related-articles, no sidebars, no comments, no recommended-posts. The
2026-04-27 chrome strip (gildi PR #284, single-point edit in
_shared/site-chrome.ts) took 4 of the 5 sample pages to exactly
100.00%; the homepage fork that immediately followed (gildi PR #287,
new bot-only public/_home-bot.html routed at /
via functions/_middleware.js) closes the last gap and brings
5 of 5 sample pages to 100.00% — pure article body,
zero peripheral chrome anywhere on the bot-facing surface. JSON-LD, canonical, title,
og:* and BreadcrumbList tags all preserved (they live in <script>
tags and <head>, which are stripped before the visible-text
measurement — pure GEO signal stays). Pre-strip the page-mean was 90.52%; post-strip
was 97.67%; post-fork is 100.00%. All three are preserved in the per-page table and the
aggregate ladder for audit.
Site D — article-clean (82.80%)
Modern WordPress theme with semantic <article> and minimal
sidebar bleed-through. Blog posts cluster at 82–86%; the site honours the
article-body / chrome separation that boilerplate extractors rely on.
Site A — mid-to-clean (73.38%)
Heavy editorial site with substantial related-content blocks, but the body content is recognised correctly: three of the four long-form posts cluster at ~75% (the fourth lands at 69.96%). The homepage dips to 70.93% because it is mostly nav + tile grid.
Site B — mid (65.18%)
Large gap between the two glossary pages (65–72%) and the booking page (40.57%). The booking page is dominated by a form — almost no article body — so RR honestly reflects “this is a form, not an answer.”
Site C — noise-heavy (53.84%)
Roughly half of what an LLM reads on a Site C blog post is peripheral chrome (related news, recent posts, share widgets, sidebar). And this is the browser UA figure — the bot UA gets a 403 at the door, so an AI crawler observes 0%. The lowest-relevance site we measured. Concrete identity surfaced in receipts.json.
7. Why Site B's Reading Shifts vs the CDR Draft
Two different questions, two different answers.
In the superseded CDR pass, Site B landed at 0.88% — bottom of the cohort — because the Site B homepage and glossary pages download multi-MB HTML for ~10 KB of visible content. That CDR figure is correct as a measurement of crawl-time bandwidth waste.
It is not correct as a measurement of boilerplate noise: once you reach the visible-text layer, Site B's glossary pages have substantive article bodies (RR 65–72%). The 0.88%-to-65.18% shift is not a contradiction — it is the difference between “how much script/CSS does the crawler download” (CDR) and “of what the LLM ends up reading, how much is article” (RR).
RR is the question this benchmark answers. CDR is a separate downstream concern (it costs crawler budget but does not directly poison citations).
8. The Agency-Bot-Block Inversion
Site C is a content-marketing agency. Its commercial pitch is that it produces content for
clients. Yet the site returns HTTP 403 Forbidden
to every bot UA we tested — Googlebot, ClaudeBot, GPTBot. The site is unreachable to
AI ingestion at the door. The concrete identity is surfaced in
receipts.json →
with outcome: "blocked" evidence.
We measured under a Chrome 120 desktop UA so we could compute any RR figure for Site C at all. That figure — 53.84% mean — is the best case. The real RR an AI crawler would observe is N/A, because those crawlers can't reach the page at all. And even at best case, roughly half of what a browser-based reader sees is peripheral chrome.
An agency that 403s the very systems clients are paying it to be visible to is the clearest single-data-point illustration of the gap between SEO industry posture and AI ingestion reality. We document it here so the receipt exists.
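A minimal probe of the pattern documented above — same URL, several bot UAs, record the status. The UA strings are shortened illustrations (production crawlers send fuller tokens); receipts.json carries the actual evidence for Site C:

```python
import requests

# Shortened, illustrative bot UA tokens — real crawlers send fuller strings.
BOT_UAS = {
    "Googlebot": "Googlebot/2.1",
    "GPTBot": "GPTBot/1.0",
    "ClaudeBot": "ClaudeBot/1.0",
}

def probe(url: str) -> dict[str, int]:
    """Fetch the same URL under each bot UA and record the HTTP status."""
    return {name: requests.get(url, headers={"User-Agent": ua},
                               timeout=30).status_code
            for name, ua in BOT_UAS.items()}

# 403 across the board is the "blocked at the door" outcome recorded
# for Site C in receipts.json (outcome: "blocked").
```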
9. Limitations
- Sample size. 5 pages per site. We chose deterministic indices to make the result reproducible, not exhaustive. A full-cohort run (e.g., 50 pages/site) would tighten confidence intervals; ranks are unlikely to flip between adjacent bands.
- Site C UA workaround. Site C returns 403 to every bot UA we tested. We measured under a Chrome 120 desktop UA. This is a best case for Site C — the RR an AI crawler would observe is N/A because those crawlers can't reach the page at all.
- One Site B page returned 403. One deep Site B page returned 403 even with Googlebot UA; n=4 for that site (concrete URL in receipts.json).
- Boilerplate detection is heuristic. The class/id token list and semantic-tag rules are calibrated to the cohort. Sites that use unusual class names for related-posts widgets may receive an inflated RR.
- MAX-of-two extractors is a calibration choice. We adopt it because either extractor alone has known failure modes (boilerplate-strip under-extracts on plain-<div> blogs; trafilatura under-extracts on structured-data-heavy pages with agent grids and license cards). The MAX rule biases toward generosity; reported RR is an upper bound on what an ingestor's specific extractor would compute.
- RR doesn't measure content quality. A page can have a 95% RR and still be junk; conversely, a page with rich body content can be obscured by heavy chrome. RR measures where the noise lives, not whether the answer is correct. For correctness, the AIFS Probe layer (4-platform citation rate) remains the authoritative measurement.
10. Connection to the 9-Dimension GEO Rubric
The canonical Aryah AI GEO rubric awards 15 points for the Content Density dimension — co-equal weight with AI Bot Access and Authority. Until now, that 15-point allocation has been scored against soft heuristics (word count, schema entity density, body-to-shell ratio approximations).
RR is a sharper instrumentation of the same dimension at the layer that actually matters — the visible-text layer the ingestor's extractor consumes. We are not proposing to replace the 15-point Content Density score with RR alone. RR is one of the instrumentations under that umbrella, alongside semantic-container coverage, content-to-chrome word ratios, and JSON-LD entity density.
What RR adds: a single-number reproducibility check that a third-party reader can run in
five minutes, against any site, on any machine with Python 3,
trafilatura, and BeautifulSoup4. That is the
standard the press release frames — not “trust our composite,” but
“here is the script; run it yourself.”
Conclusion
Across a 5-site SEO industry cohort — Top10Lists.us plus four established SEO industry sites (anonymized as Site A–D in this page; identities in receipts.json) — Top10Lists.us delivers a 100.00% mean Relevance Ratio (post homepage fork 2026-04-27): the highest in the cohort, with 5 of 5 sample pages registering at exactly 100.00%. The same 5 URLs measured 90.52% mean before any chrome intervention, then 97.67% after the bot-HTML chrome strip (gildi PR #284), then 100.00% after the homepage fork (gildi PR #287); we preserve all three readings in the per-page table and aggregate ladder so the staged intervention is auditable. The cohort median (n=3, excluding Site C, which 403s every bot UA) lands at 73.38% — a clean ~27 percentage-point gap — and at the low end of the bot-reachable cohort, one competitor retains only about two-thirds of its visible text as body content (Site B, 65.18% mean RR), the rest ceding to peripheral chrome. The full cohort range is available in receipts.json →.
Read the press release this benchmark supports at geolocus.ai/press →
Related
- Sitemap Delivery Benchmark — April 26, 2026 → — Companion frozen artifact (records, TTFB, throughput, bot accessibility).
- GEO Evaluation Methodology — April 26, 2026 → — File checks + multi-system AI evaluation methodology.
- Methodology Overview → — All GEOlocus.ai methodology pages.
- Press → — Press release referencing this benchmark.