Original research
No. 07 · Discoverability · Original scrape data
llms.txt in the wild.
What 62 popular dev, SaaS, AI, and SEO sites are actually shipping — and the three things 90% are getting right.
We scraped llms.txt files from 62 well-known tech sites: AI labs, dev tools, SaaS platforms, SEO incumbents. The goal: see what's actually being shipped now that Google Lighthouse 13.3 added llms.txt as a formal audit under a new "Agentic Browsing" category.
Important framing up front. The Lighthouse audit is for agentic browsing readiness — not Google Search ranking. Google Search has separately stated llms.txt isn't needed for AI features. So this isn't "Google now uses llms.txt for SEO." It's "Google now formally measures whether your site is legible to AI agents." Different surface, different optimization, same file.
Adoption rate: 63%. 39 of the 62 sites had a valid llms.txt file. Higher than expected. Solid coverage among AI labs and dev tool companies. Notable gaps among legacy SEO incumbents (Semrush, Backlinko, and SEJ all lacked a working file).
The numbers
Adoption is solid. Quality varies wildly.
- Adoption rate: 39 of 62 sites (63%) have a valid llms.txt
- Has the > summary blockquote: 68%, meaning 32% silently violate the spec's most basic structural requirement
- Has at least one ## section: 92%, so most files have at least some topic structure
- Has an H1: 90%
- Zero structure (bare list of links): 7%, rarer than expected, but the offenders are large sites
- Size range: 546 chars (Continue.dev) → 284,256 chars (PostHog), a roughly 520× spread
- Median size: ~12K chars
The good ones
Five files that nail it.
Each of these treats llms.txt as a curated map for an agent — not a sitemap dump.
- docs.railway.app: 37K chars · 15 sections · 236 links
  Every link carries a one-sentence description of what an agent will find at that URL. 99.6% description ratio. Reads like a hand-curated guide, not an export.
- clerk.com: 19K chars · 5 sections · 78 links
  Tight, complete, 100% described. Proof that you don't need 100K chars to ship a useful file. Five sections is enough when each one earns its place.
- planetscale.com: 100K chars · 3 sections · 503 links
  Big file, but organized. 75% description ratio. Shows that "large" isn't the same as "lazy": they grouped 500 links under 3 well-named sections rather than dumping them flat.
- frase.io: 10K chars · 10 sections · 62 links
  Short, focused, 100% described. The cleanest small-file example in the sample.
- docs.stripe.com: 93K chars · 26 sections · 528 links
  Follows Stripe's docs information architecture exactly. 92% description ratio. The file teaches the agent the same mental model Stripe's human readers use.
The weak ones
Three patterns to avoid.
1. The wall of bare links
Sentry ships a 20K-char file with zero ## sections and only 17 links. Moz ships 62K chars, zero sections, 139 bare links. Both files look complete by size, but to an AI agent crawling for navigation, they're barely better than a sitemap with markdown wrappers. No curation, no descriptions, no signal about what matters.
2. The too-thin file
Continue.dev's file is 546 characters total — basically just a description and 5 links. Probably auto-generated from minimal config and forgotten about. Buttondown ships a 996-char file with no sections. These look like the team enabled an "llms.txt generator" plugin and never refined the output.
3. The silent spec violation
32% of files in this sample skip the > summary blockquote. That's the line the original spec defines as the single-sentence description of your whole site — what an agent pulls when it has space for only one quote. Skipping it means a spec-respecting agent has nothing to quote, and may fall back to the meta description or the first <p> tag, which is rarely what you'd choose.
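To see what a spec-respecting agent would have to quote from a given file, a minimal sketch along these lines may help. It assumes the summary is the first content after the H1 and may span several consecutive > lines; the function name extract_summary is illustrative and not part of the spec or of any tool.

```python
def extract_summary(text):
    """Return the '>' summary that follows the H1, or None if absent.

    Assumption: the summary is the first non-blank content after the H1
    and may span several consecutive '>' lines.
    """
    seen_h1 = False
    summary_parts = []
    for line in text.splitlines():
        stripped = line.strip()
        if not seen_h1:
            # A single '#' plus a space marks the H1; '## ' does not match.
            if stripped.startswith("# "):
                seen_h1 = True
            continue
        if not stripped:
            if summary_parts:
                break  # a blank line ends the blockquote
            continue  # skip blank lines between the H1 and the summary
        if stripped.startswith(">"):
            summary_parts.append(stripped.lstrip(">").strip())
        else:
            break  # any other content means there is no summary here
    summary = " ".join(part for part in summary_parts if part)
    return summary or None
```

A return value of None corresponds to the "nothing to quote" case described above.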
Practical takeaways
Five rules if you're shipping or refactoring an llms.txt.
- 01. Include the > blockquote right under the H1. Make it dense. Assume the agent uses only this line.
- 02. Group links under ## headers by topic. 5–15 sections is the sweet spot. Zero sections = bare list. 50+ sections (PostHog ships 55) creates navigation overhead.
- 03. Every link gets a one-sentence description unless the title alone is unambiguous (comparison pages, vendor pairs). Aim for 80%+ description ratio.
- 04. Keep total size under ~30K chars unless you have a genuine docs-tree reason to go big. Agents may not parse the whole file. Front-load value.
- 05. No HTML, no JavaScript, no markdown tables. Plain markdown links + descriptions. The spec is intentionally minimal. Don't over-engineer.
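The five rules can be approximated with a small linter. The sketch below is one possible reading, not an official checker: the thresholds come from the rules above, but the function name lint_llms_txt, the assumed bullet shape (- [title](url): description), and the crude HTML and table detection are all assumptions made for illustration. It needs Python 3.8 or later.

```python
import re

# Assumed link shape: "- [title](url)" optionally followed by a description.
LINK_BULLET = re.compile(r"^\s*[-*]\s*\[[^\]]+\]\([^)]+\)(.*)$")
# Crude HTML tag detector; it deliberately ignores autolinks like <https://...>.
HTML_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^>]*)?>")


def lint_llms_txt(text, max_chars=30_000, min_described=0.80):
    """Return a list of human-readable warnings for the five rules."""
    warnings = []
    lines = [ln.rstrip() for ln in text.splitlines()]
    content = [ln for ln in lines if ln.strip()]

    # Rule 01: an H1 first, then a '>' blockquote right under it.
    if not content or not content[0].startswith("# "):
        warnings.append("missing H1 on the first non-blank line")
    elif len(content) < 2 or not content[1].lstrip().startswith(">"):
        warnings.append("no > summary blockquote under the H1")

    # Rule 02: links grouped under '## ' headers; 0 and 50+ are flagged.
    sections = sum(1 for ln in lines if ln.startswith("## "))
    if sections == 0:
        warnings.append("no ## sections: file is a bare list")
    elif sections >= 50:
        warnings.append(f"{sections} sections: likely navigation overhead")

    # Rule 03: most links should carry text after the closing parenthesis.
    tails = [m.group(1) for ln in lines if (m := LINK_BULLET.match(ln))]
    if tails:
        described = sum(1 for tail in tails if tail.strip(" :-"))
        ratio = described / len(tails)
        if ratio < min_described:
            warnings.append(f"only {ratio:.0%} of links have a description")

    # Rule 04: keep the file under the suggested size budget.
    if len(text) > max_chars:
        warnings.append(f"{len(text):,} chars exceeds ~{max_chars:,}")

    # Rule 05: no HTML tags and no markdown tables.
    if HTML_TAG.search(text):
        warnings.append("contains HTML-like tags")
    if any(ln.lstrip().startswith("|") for ln in lines):
        warnings.append("contains a markdown table")

    return warnings
```

An empty list means none of the checks fired. It does not mean the file is well curated, since description quality still needs a human read.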
What we don't yet know
Three honest open questions.
Does file size matter at the upper end?
PostHog ships 284K chars, the largest file in the sample, and is widely cited in AI search results. But correlation is not causation: PostHog may be cited because of its content rather than its llms.txt structure.
Is the > blockquote actually pulled by any major LLM today?
We don't have observable evidence. The spec defines it, but Lighthouse audits for the H1, not specifically the summary blockquote. Whether ChatGPT, Claude, or Perplexity actually use that line is an open question.
Is one big file better than a hub plus per-section files?
Several big docs sites are now experimenting with the split approach (/docs/llms.txt, /api/llms.txt). No clear winner. The spec doesn't address it. We may know more once Lighthouse adds more audits to the Agentic Browsing category.
The dataset
Raw scrape data — 39 valid files.
Probed via curl against https://{domain}/llms.txt with a 6-second timeout. A response counted as valid if it returned HTTP 200, contained more than 100 characters, and did not start with < (which filters out HTML 404 pages). Sections counted as lines matching ^## . Description ratio computed as the percentage of bullet lines containing a : after the link (a proxy for "has a description"). Licensed CC0, so use freely.
| domain | chars | sections | links | summary |
|---|---|---|---|---|
| docs.railway.app | 37,214 | 15 | 236 | yes |
| planetscale.com | 100,038 | 3 | 503 | yes |
| clerk.com | 19,087 | 5 | 78 | yes |
| frase.io | 9,693 | 10 | 62 | yes |
| docs.stripe.com | 93,238 | 26 | 528 | no |
| nuxt.com | 51,916 | 4 | 320 | yes |
| neon.tech | 27,721 | 19 | 185 | yes |
| posthog.com | 284,256 | 55 | 2,594 | yes |
| python.langchain.com | 200,238 | 2 | 1,384 | yes |
| docs.anthropic.com | 166,613 | 3 | 1,544 | no |
| vercel.com | 166,728 | — | — | — |
| mantine.dev | 41,182 | 12 | 414 | no |
| weaviate.io | 36,687 | — | — | — |
| pinecone.io | 35,651 | 12 | 242 | yes |
| bun.sh | 33,272 | 2 | 317 | no |
| sentry.io | 20,420 | 0 | 17 | no |
| athenahq.ai | 12,329 | 2 | 78 | yes |
| scrunchai.com | 12,307 | 5 | 81 | no |
| qdrant.tech | 11,748 | 2 | 61 | yes |
| cursor.com | 9,476 | 20 | 167 | no |
| linear.app | 9,443 | 2 | 141 | yes |
| framer.com | 8,148 | 6 | 40 | yes |
| intercom.com | 8,047 | 14 | 63 | yes |
| nextjs.org | 7,464 | 4 | 34 | yes |
| notion.com | 6,930 | 7 | 49 | yes |
| resend.com | 5,601 | 14 | 45 | yes |
| composio.dev | 4,759 | 12 | 26 | no |
| honeycomb.io | 4,734 | 13 | 18 | yes |
| amplitude.com | 2,774 | 7 | 6 | no |
| prisma.io | 2,409 | 2 | 13 | yes |
| crewai.com | 2,160 | 4 | 27 | yes |
| otterly.ai | 2,033 | 4 | 13 | yes |
| svelte.dev | 1,676 | — | — | — |
| railway.app | 1,405 | 3 | 9 | yes |
| supabase.com | 1,258 | 2 | 19 | no |
| continue.dev | 546 | 1 | 5 | yes |
| moz.com | 61,882 | 0 | 139 | no |
| tryprofound.com | 13,019 | — | — | — |
| buttondown.email | 996 | 0 | 4 | no |
Dashes (—) indicate the file was valid but our parser failed on that specific metric (typically because the file format threw off the regex assumptions). The file itself is still listed because it counts as adopted.
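The headline size figures can be recomputed from the chars column. The short sketch below does that arithmetic; the values were copied by hand from the table above, and 62 is the number of sites probed.

```python
from statistics import median

# 'chars' column copied from the table above, in table order (39 files).
CHARS = [
    37214, 100038, 19087, 9693, 93238, 51916, 27721, 284256, 200238,
    166613, 166728, 41182, 36687, 35651, 33272, 20420, 12329, 12307,
    11748, 9476, 9443, 8148, 8047, 7464, 6930, 5601, 4759, 4734, 2774,
    2409, 2160, 2033, 1676, 1405, 1258, 546, 61882, 13019, 996,
]
SITES_PROBED = 62

adoption = len(CHARS) / SITES_PROBED      # share of sites with a valid file
size_ratio = max(CHARS) / min(CHARS)      # largest file relative to smallest
median_size = median(CHARS)               # middle value of the 39 sizes

print(f"adoption: {len(CHARS)} of {SITES_PROBED} = {adoption:.0%}")
print(f"size range: {min(CHARS):,} to {max(CHARS):,} chars ({size_ratio:.1f}x)")
print(f"median size: {median_size:,} chars")
```

The median comes out at 11,748 chars, which is the ~12K quoted earlier, and the largest file is a little over 520 times the size of the smallest.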
Methodology
Probed 62 well-known dev tool, SaaS, AI lab, and SEO platform domains via plain curl with a 6-second timeout. Each successful response was validated as HTTP 200, content length greater than 100 characters, and not starting with < (which would indicate an HTML 404 page rather than a real markdown file). Section count is the number of lines matching ^## . Link count is the number of https?:// matches in the file. Description ratio is the percentage of ^- bullet lines containing a colon after the link, used as a proxy for "this link has a description." All measurements are point-in-time on May 21, 2026, and may not reflect current state if a site has updated its file since.
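The validation and counting rules above can be sketched in Python roughly as follows. This is a reconstruction from the prose, not the original scrape script. The function names are illustrative, urllib stands in for curl, and the regex for "a colon after the link" is one assumed reading of that rule. How the summary column was detected is not described, so it is left out.

```python
import re
import urllib.request


def is_valid_llms_txt(status, body):
    """Acceptance checks: HTTP 200, more than 100 chars, not HTML."""
    return status == 200 and len(body) > 100 and not body.startswith("<")


def measure(body):
    """Compute the per-file metrics described in the methodology."""
    lines = body.splitlines()
    sections = sum(1 for ln in lines if re.match(r"^## ", ln))
    links = len(re.findall(r"https?://", body))
    bullets = [ln for ln in lines if ln.startswith("- ")]
    # Assumed reading: a bullet is "described" if a colon follows the
    # closing parenthesis of its markdown link.
    described = sum(1 for ln in bullets if re.search(r"\)\s*:", ln))
    ratio = described / len(bullets) if bullets else None
    return {
        "chars": len(body),
        "sections": sections,
        "links": links,
        "description_ratio": ratio,
    }


def probe(domain, timeout=6):
    """Fetch https://{domain}/llms.txt; return metrics, or None if invalid."""
    url = f"https://{domain}/llms.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers HTTP errors, DNS failures, and timeouts.
        return None
    return measure(body) if is_valid_llms_txt(status, body) else None
```

Because the link count matches every https?:// occurrence, a URL that appears twice in a file is counted twice. Some servers may also respond differently to Python's default user agent than to curl, so results from probe could differ from the table.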
This dataset is published under the CC0 1.0 license — public domain, no attribution required. Use freely.