How to Create a Perfect llms.txt File (AI-Powered, 2026)
Hand-curating an llms.txt for a 500-page documentation site is brutal. You're manually deciding which URLs make the cut, which H2 section each one belongs in, what one-line description gives the AI useful signal. For most teams, that's a half-day of work spread across product, marketing, and dev — and the file gets stale within weeks. There's a faster way: use AI to generate it. This guide walks through the exact 4-step workflow we use, with real prompt examples and the 14-point validation checklist.
TL;DR Summary
- 4 steps: crawl, AI categorize, edit, ship. Total time: 5-10 minutes.
- Crawl pulls up to 50 pages, prioritized by homepage status + internal link count.
- AI groups pages into H2 sections, writes the H1 + blockquote summary, and outputs editable markdown.
- Always edit before publishing — the AI is good at structure, less good at brand voice.
- Our generator: 50 credits per run, refunded on failure. Uses DeepSeek v4 primary with Gemini and Claude fallbacks.
- After shipping, validate with the LLMs.txt Checker across all 14 spec-grade parameters.
- Regenerate quarterly or when navigation changes.
1. Why Use AI to Generate llms.txt
The honest answer: hand-curation is fine for small sites. If you have 20 marketing pages and one product, you can write a great llms.txt over coffee. The problem is what happens beyond that.
The hand-curation tax
Once your site has 100+ pages, hand-curation gets painful:
- You have to review every page to decide if it's worth surfacing to AI
- You have to group pages into logical H2 sections — what counts as "Documentation" vs "Guides" vs "Reference"?
- You have to write a one-line description for each page that's useful to an AI
- You have to decide what goes in the Optional section
- You have to do this again every time the site changes
For a 500-page docs site, that's 4-8 hours of focused work. Most teams don't have that time, so they either ship a bad llms.txt (3 H2 sections, 12 links, no descriptions) or ship none at all.
What AI does well
Large language models are genuinely good at:
- Page categorization: looking at 50 URLs + titles + descriptions, picking sensible H2 groupings, deciding what belongs where
- Concise descriptions: turning a 200-word meta description into a 12-word link description
- Structural consistency: keeping every H2 section similarly sized, every link in the same format
- Identifying Optional content: changelog, press, legal — the AI knows what humans typically skip
These are exactly the parts that take humans the longest. Delegating them to AI cuts the work from hours to minutes.
What AI does badly
AI is less good at:
- Brand voice: the blockquote summary the AI writes will be technically accurate but generic
- Product nuance: the AI doesn't know that your "Webhooks" section is critical to enterprise customers but irrelevant to free-tier users
- Tribal knowledge: which obscure page secretly answers 30% of your support tickets? The AI can't see that signal
- Strategic curation: the AI doesn't know which pages you're actively de-emphasizing for business reasons
That's why every workflow ends with a human editing pass. AI handles the volume; you handle the judgment.
2. The 4-Step AI Workflow
Every AI-powered llms.txt generator follows roughly the same flow. Here's the version we ship in the InstaRank SEO LLMs.txt Generator:
- Crawl — discover the URLs that matter (10-60 seconds)
- AI categorize — group + describe + structure (5-15 seconds)
- Edit — human refinement (2-5 minutes)
- Ship — upload to production (1 minute)
Total elapsed time: under 10 minutes for the first run. Subsequent regenerations are faster because you already know the editing patterns.
3. Step 1: Crawl Your Site
The generator can't recommend pages it doesn't know exist. So step 1 is discovery. The crawl phase has three jobs: find candidate URLs, fetch their titles + meta descriptions, and prioritize the list.
What the crawler pulls
For each candidate page:
- URL: the absolute URL, with protocol and path
- Title: from the HTML <title> tag, fallback to H1
- Meta description: from <meta name="description">
- Status code: only 2xx pages are kept
- Internal link count: used as a popularity proxy
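As a rough illustration of the title and description rules above, here is a minimal sketch that pulls both fields out of the initial HTML. The names `PageMeta` and `extractMetadata` are hypothetical, and the regex matching is an assumption made to keep the example short; it is not the generator's actual code, and a production crawler would more likely use a real HTML parser.

```typescript
// Hypothetical shape for the metadata of one crawled page.
interface PageMeta {
  title: string | null;
  description: string | null;
}

// Strip nested tags and collapse whitespace inside a captured fragment.
function cleanText(fragment: string): string {
  return fragment.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
}

// Simplified regex extraction: assumes the name attribute comes before
// content in the meta tag. Not a full HTML parser.
function extractMetadata(html: string): PageMeta {
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const h1Match = /<h1[^>]*>([\s\S]*?)<\/h1>/i.exec(html);
  const descMatch =
    /<meta\s+name=["']description["']\s+content=(["'])([\s\S]*?)\1/i.exec(html);

  const title = titleMatch ? cleanText(titleMatch[1]) : "";
  const h1 = h1Match ? cleanText(h1Match[1]) : "";

  return {
    // Prefer <title>, fall back to the first <h1>.
    title: title || h1 || null,
    description: descMatch ? cleanText(descMatch[2]) : null,
  };
}
```

A page with neither a title nor an H1 comes back with `title: null`, which is a useful signal that the page will give the AI thin metadata.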
Page prioritization
We cap at 50 pages because that's the sweet spot for AI context budget. To pick the top 50 from a larger site, we use deterministic ordering:
- Homepage first — always. It sets the project framing.
- Then by internal link count — descending. Pages with the most inbound internal links are usually the most important ones.
- Then alphabetical — as a stable tiebreaker.
Deterministic ordering matters because it makes runs reproducible. Run the generator twice on the same site and it selects the same set of pages to send to the AI. No random jitter.
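The three ordering rules can be sketched as a single comparator. This is a minimal sketch, not the generator's real code: `CrawledPage`, `isHomepage`, and `prioritizePages` are hypothetical names, and sorting alphabetically by URL (rather than by title) is an assumption, since either would serve as a stable tiebreaker.

```typescript
// Hypothetical record for a crawled page; field names are illustrative.
interface CrawledPage {
  url: string;
  internalLinkCount: number;
}

// True when the URL is the site root, e.g. https://example.com/
function isHomepage(url: string): boolean {
  const parsed = new URL(url);
  return parsed.pathname === "/" && parsed.search === "";
}

// Homepage first, then inbound internal links (descending),
// then URL order as a stable tiebreaker. Returns a new array.
function prioritizePages(pages: CrawledPage[], cap: number = 50): CrawledPage[] {
  const sorted = pages.slice().sort((a, b) => {
    const homeA = isHomepage(a.url) ? 0 : 1;
    const homeB = isHomepage(b.url) ? 0 : 1;
    if (homeA !== homeB) return homeA - homeB;
    if (a.internalLinkCount !== b.internalLinkCount) {
      return b.internalLinkCount - a.internalLinkCount;
    }
    // Plain string comparison avoids locale-dependent ordering.
    return a.url < b.url ? -1 : a.url > b.url ? 1 : 0;
  });
  return sorted.slice(0, cap);
}
```

Because the comparator never returns 0 for two distinct URLs, the result does not depend on the input order or on the sort implementation.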
A note on JavaScript-rendered sites
The crawl phase uses HTML-only fetching by default — no Puppeteer, no JS rendering — for speed. If your site is fully SPA-rendered and the meta descriptions only appear after JS executes, the crawl will surface URLs but with thin metadata. Workaround: ensure your SPA outputs title + meta description in the initial HTML server-side (Next.js metadata API, Nuxt useHead, etc.). That's also better for traditional SEO.
4. Step 2: AI Categorization
Once we have the page list, we feed it to the AI. Our generator uses our centralized AI service (services/ai/unified-client.ts) which routes to DeepSeek v4 as the primary model, with Gemini and Claude as fallbacks.
The AI prompt structure
The system prompt establishes the spec:
You are an expert SEO and AI optimization engineer producing
a spec-conformant /llms.txt file for the site {SITE_NAME}
({DOMAIN}).
The /llms.txt standard requires this exact markdown structure:
# <Site / Project Name>
> <One-sentence summary>
## <Section Name>
- [Page Title](url): Optional description
- [Another Page](url)
## Optional
- [Lower-priority page](url): Skippable section
Rules:
1. H1 is required (site/project name).
2. Blockquote (>) directly after H1.
3. H2 sections group pages by purpose.
4. Every list item MUST be markdown link format.
5. Use absolute URLs only.
6. Include "## Optional" section.
7. Output PURE markdown only: no code fences.
The user prompt provides the crawled page data:
Generate the /llms.txt file for {SITE_NAME} ({DOMAIN}).
Input pages:
1. URL: https://example.com/
Title: Acme Analytics — Real-time product analytics
Description: Connect to your warehouse in 5 minutes...
2. URL: https://example.com/docs/quickstart
Title: Quickstart Guide
Description: Five-minute setup from npm install...
[...up to 50 pages...]
Output the complete markdown content of /llms.txt only.
Generation parameters
We set specific generation parameters that matter:
- maxTokens: 16384 (supports up to ~64KB output, enough for the largest reasonable llms.txt)
- temperature: 0.4 (low enough for consistency, high enough for natural prose in descriptions)
- taskType: 'content' (routes through our content-style provider chain)
- timeout: 90_000 (90 second cap; typical run finishes in 5-15 seconds)
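To make the prompt format and parameters concrete, here is a minimal sketch that assembles the user prompt from a page list. `PromptPage`, `buildUserPrompt`, and the `generationOptions` object are hypothetical names; the option values are the ones quoted above, but the exact option names and call signature of the real AI client are an assumption.

```typescript
// Hypothetical input shape: one entry per crawled page.
interface PromptPage {
  url: string;
  title: string;
  description: string;
}

// Values quoted in this guide; the option names the real client
// accepts are an assumption.
const generationOptions = {
  maxTokens: 16384,
  temperature: 0.4,
  taskType: "content",
  timeout: 90_000, // milliseconds
};

// Builds the numbered "Input pages" listing shown in the prompt example.
function buildUserPrompt(siteName: string, domain: string, pages: PromptPage[]): string {
  const entries = pages.map((page, index) =>
    [
      `${index + 1}. URL: ${page.url}`,
      `Title: ${page.title}`,
      `Description: ${page.description}`,
    ].join("\n")
  );
  return [
    `Generate the /llms.txt file for ${siteName} (${domain}).`,
    "Input pages:",
    ...entries,
    "Output the complete markdown content of /llms.txt only.",
  ].join("\n");
}
```

Keeping the prompt builder a pure function makes it easy to snapshot-test, which helps when you tweak the wording later.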
Output validation
AI output is validated against the same parser our checker uses. We detect two failure modes:
- Truncation: the AI hit its token limit mid-section. We detect this by looking for dangling list markers (- [ with no closing) or an empty trailing H2 section. Retry once with reduced section count.
- Hallucination: the AI invented URLs that weren't in the input. We compare every URL in the output against the input allowlist. Retry once with an explicit allowlist constraint in the prompt.
If retries don't fix it, the generator falls back to: strip hallucinated lines from the output, return what we have, mark as validatedOk: false in the response, refund the credits.
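The two checks and the strip-and-return fallback might look roughly like this. It is a minimal sketch under stated assumptions: `validateOutput`, `looksTruncated`, and `extractLinkUrls` are hypothetical names, URLs are compared by exact string match (so a trailing-slash difference would count as a mismatch), and the retry and refund logic is left out.

```typescript
interface ValidationResult {
  truncated: boolean;
  hallucinatedUrls: string[];
  cleaned: string; // output with hallucinated lines removed
  validatedOk: boolean;
}

// Collect every URL used as a markdown link target: [text](url)
function extractLinkUrls(markdown: string): string[] {
  const urls: string[] = [];
  const linkPattern = /\[[^\]]*\]\(([^)\s]+)\)/g;
  let match: RegExpExecArray | null;
  while ((match = linkPattern.exec(markdown)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Heuristic: a list item that opens "[" without a complete link,
// or a final H2 heading with nothing under it.
function looksTruncated(markdown: string): boolean {
  const lines = markdown.split("\n").map((l) => l.trim()).filter((l) => l !== "");
  const dangling = lines.some(
    (l) => l.startsWith("- [") && !/^- \[[^\]]*\]\([^)\s]+\)/.test(l)
  );
  const last = lines.length > 0 ? lines[lines.length - 1] : "";
  return dangling || last.startsWith("## ");
}

function validateOutput(markdown: string, allowlist: string[]): ValidationResult {
  const allowed = new Set(allowlist);
  const hallucinatedUrls = extractLinkUrls(markdown).filter((u) => !allowed.has(u));
  const bad = new Set(hallucinatedUrls);
  // Fallback: drop any line that links to a URL outside the allowlist.
  const cleaned = markdown
    .split("\n")
    .filter((line) => !extractLinkUrls(line).some((u) => bad.has(u)))
    .join("\n");
  const truncated = looksTruncated(markdown);
  return {
    truncated,
    hallucinatedUrls,
    cleaned,
    validatedOk: !truncated && hallucinatedUrls.length === 0,
  };
}
```

A caller would retry once when `validatedOk` is false, and only fall back to `cleaned` if the retry fails too.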
5. Step 3: Edit the Draft
The generator returns an editable markdown textarea. Always edit before publishing. Here's what to look at.
H1 wording
The AI extracts your H1 from the homepage title. That's usually right, but sometimes the title is stripped ("Acme | Real-time analytics" becomes just "Acme") when you wanted the full thing. Make sure the H1 reads like your actual brand, not a stripped title tag.
Blockquote summary
The AI writes a generic one-sentence summary. Tighten it. The blockquote is the first thing AI agents see — it sets the framing for everything below. A good summary tells the AI what your product does and who it's for in one sentence.
Bad: > Acme is a software company providing solutions for businesses. (generic, says nothing)
Good: > Acme is real-time product analytics for engineering teams — connects to your warehouse in 5 minutes, sub-second query performance. (specific, named audience, one differentiator)
Section names
The AI picks sensible H2 names, but you know your users better. Match how your users would phrase the question. If your customers ask "how much does it cost", the section should be ## Pricing, not ## Plans Overview.
Link descriptions
The AI uses your meta descriptions as link descriptions. That's often fine, but meta descriptions are written for Google SERPs — they're sometimes too markety. Rewrite any link description that reads like ad copy.
Bad: [API Reference](url): The world's most powerful analytics API with unmatched scalability
Good: [API Reference](url): Full endpoint catalog with code samples in 8 languages
Optional section curation
The AI's default Optional section is usually too small. Move more content there. Things that belong in Optional:
- Changelog
- Press coverage
- Press releases
- About / Team / Careers
- Legal / Privacy / Terms
- Old or deprecated content you can't remove yet
- Marketing pages (sometimes — depends on your strategy)
The Optional section lets AI agents drop those URLs first when context budget is tight, preserving the budget for what actually matters.
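If you want a quick first pass before editing by hand, a path-based check can flag likely Optional candidates. This is a minimal sketch: `isOptionalCandidate` and the `OPTIONAL_PATH_HINTS` list are hypothetical and simply mirror the categories listed above; your own URL structure may need different hints.

```typescript
// Path segments that usually signal skippable content. Illustrative
// only; tune the list to your own site.
const OPTIONAL_PATH_HINTS = [
  "changelog", "press", "about", "team", "careers",
  "legal", "privacy", "terms",
];

// Matches whole path segments, so /blog/express-setup is not flagged
// just because it contains "press".
function isOptionalCandidate(url: string): boolean {
  const segments = new URL(url).pathname
    .toLowerCase()
    .split("/")
    .filter((segment) => segment !== "");
  return segments.some((segment) => OPTIONAL_PATH_HINTS.indexOf(segment) !== -1);
}
```

Treat the result as a suggestion list for the editing pass, not an automatic move: marketing pages and deprecated content still need a human call.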
Size check
Keep total file size under 50KB. If you're over, either trim links or move bulk content to /llms-full.txt. The generator displays the current size in the editor.
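If you script the check yourself, measure bytes rather than characters. A minimal sketch, assuming 50KB means 50 × 1024 bytes (the guide does not say which definition the editor uses) and using the hypothetical names `fileSizeBytes` and `isWithinSizeLimit`:

```typescript
// Assumption: 50KB is treated as 50 * 1024 bytes.
const SIZE_LIMIT_BYTES = 50 * 1024;

// Measure UTF-8 bytes, not string length: non-ASCII characters
// take more than one byte each.
function fileSizeBytes(content: string): number {
  return new TextEncoder().encode(content).length;
}

function isWithinSizeLimit(content: string, limit: number = SIZE_LIMIT_BYTES): boolean {
  return fileSizeBytes(content) < limit;
}
```

The distinction matters mostly for non-English sites, where a file can look comfortably small by character count and still be over the limit in bytes.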
6. Step 4: Ship to Production
Once the file is edited, you ship it. Three common deployment patterns:
Static hosting (Vercel, Netlify, Cloudflare Pages)
Drop the file in your static-asset directory:
# Next.js
public/llms.txt
# Vite / Vue
public/llms.txt
# Nuxt
public/llms.txt
# Astro
public/llms.txt
# Hugo
static/llms.txt
Most static hosts serve .txt files with Content-Type: text/plain by default — that's fine. Some hosts let you override per-file Content-Type via a config file (e.g., _headers on Cloudflare Pages, vercel.json on Vercel) if you want text/markdown specifically.
Next.js route handler
For Next.js, you can also serve it as a dynamic route handler if you want to generate the file on demand:
// app/llms.txt/route.ts
import { NextResponse } from 'next/server';

// Serves /llms.txt on demand from a route handler.
export async function GET() {
  // The template literal stays flush-left so the served markdown has no leading spaces.
  const content = `# Acme Analytics
> Real-time product analytics for engineering teams.
## Product
- [Features](https://acme.com/features): Real-time dashboards
- [Pricing](https://acme.com/pricing): Free / Pro / Enterprise
## Documentation
- [Quickstart](https://acme.com/docs): Five-minute setup
`;
  return new NextResponse(content, {
    headers: { 'Content-Type': 'text/markdown; charset=utf-8' },
  });
}
Custom server
Nginx, Apache, or any custom server — configure to serve /llms.txt with the right Content-Type. Example Nginx config:
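location = /llms.txt {
    alias /var/www/yoursite/llms.txt;
    # Clear the extension map so .txt is not served as text/plain, then set
    # the type here. add_header Content-Type would send a second, duplicate header.
    types { }
    default_type "text/markdown; charset=utf-8";
    add_header Cache-Control "public, max-age=3600";
}
Verify after deploy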
After deployment, hit the URL with curl:
curl -I https://yourdomain.com/llms.txt
# Want to see:
# HTTP/2 200
# Content-Type: text/markdown; charset=utf-8
curl https://yourdomain.com/llms.txt | head -5
# Want to see your H1 + blockquote on top
7. The 14-Point Validation Checklist
Before declaring victory, run through the full validation list. Easiest way: paste your domain into our LLMs.txt Checker. The checker runs all 14 in about 30 seconds. Doing it manually? Here's the list:
- File exists at /llms.txt: HTTP 200, HTTPS, root path
- Has H1 title: first non-blank line starts with #
- Has H2 sections with markdown links: at least one ## Section followed by a markdown list of links
- All sampled URLs return 200: spot check 5-10 links manually, or use our checker for a 25-link sample
- Has blockquote summary: a > line directly after the H1 (blank lines between are okay)
- Correct Content-Type: text/markdown or text/plain, never text/html
- No duplicate links: every URL appears at most once
- Markdown link format only: no bare URLs, no HTML <a> tags, no generic anchor text ("click here", "read more")
- No auth-walled or paywalled links: every URL is publicly accessible
- Healthy file size: under 50KB ideal, under 150KB OK, over 500KB warning
- Linked URLs not blocked by robots.txt: cross-check against your own robots.txt
- Site's robots.txt allows AI bots: GPTBot, ClaudeBot, PerplexityBot, Google-Extended must be Allowed
- Bonus: /llms-full.txt companion exists (paired full-content file)
- Bonus: HTML discovery tag, <link rel="llms" href="/llms.txt"> in homepage <head>
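Several of the structural items (H1, blockquote, linked H2 section, duplicates, generic anchor text) can be checked offline against the file contents. The following is a minimal sketch of that subset, not the checker's actual implementation; `checkStructure`, `StructureReport`, and the `GENERIC_ANCHORS` list are hypothetical, and the network-dependent checks (status codes, Content-Type, robots.txt) are deliberately out of scope.

```typescript
interface StructureReport {
  hasH1: boolean;
  hasBlockquote: boolean;
  hasLinkedSection: boolean;
  duplicateUrls: string[];
  genericAnchors: string[]; // URLs whose link text is generic
}

// Illustrative list taken from the checklist wording.
const GENERIC_ANCHORS = ["click here", "read more"];

function checkStructure(markdown: string): StructureReport {
  const lines = markdown.split("\n").map((l) => l.trim()).filter((l) => l !== "");
  const hasH1 = lines.length > 0 && lines[0].startsWith("# ");
  // Blank lines are already filtered out, so "directly after" means index 1.
  const hasBlockquote = hasH1 && lines.length > 1 && lines[1].startsWith(">");

  const linkItem = /^- \[([^\]]*)\]\(([^)\s]+)\)/;
  const seen = new Set<string>();
  const duplicateUrls: string[] = [];
  const genericAnchors: string[] = [];
  let inSection = false;
  let hasLinkedSection = false;

  for (const line of lines) {
    if (line.startsWith("## ")) {
      inSection = true;
      continue;
    }
    const match = linkItem.exec(line);
    if (!match) continue;
    if (inSection) hasLinkedSection = true;
    const text = match[1].trim().toLowerCase();
    const url = match[2];
    if (seen.has(url) && duplicateUrls.indexOf(url) === -1) duplicateUrls.push(url);
    seen.add(url);
    if (GENERIC_ANCHORS.indexOf(text) !== -1) genericAnchors.push(url);
  }

  return { hasH1, hasBlockquote, hasLinkedSection, duplicateUrls, genericAnchors };
}
```

Running this in CI on every change to the file catches structural regressions early, before a full checker run.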
8. Prompt Engineering for Custom Generators
If you're building your own llms.txt generator (e.g., for a specific platform or with a different AI provider), here are the prompt-engineering lessons we've learned the hard way.
Be explicit about the spec
Don't assume the model knows the llms.txt spec. Include the full structure example in the system prompt every time. Models trained before September 2024 may not have seen the spec at all.
Forbid code fences explicitly
Models love wrapping output in ```markdown ... ``` fences. The output needs to be pure markdown — no fences. Add "Output PURE markdown — no code fences, no commentary, no preamble." to your system prompt, and strip any leading/trailing fences in post-processing as belt-and-suspenders.
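The post-processing step can be as small as this. A minimal sketch, with `stripCodeFences` as a hypothetical name; it only removes one fence line at the very start and one at the very end, and leaves any fences inside the content alone.

```typescript
// Three backticks, built at runtime so this snippet stays easy to embed.
const FENCE = "`".repeat(3);

// Remove a leading fence line (with or without a language tag) and a
// trailing fence line, if present.
function stripCodeFences(output: string): string {
  const lines = output.trim().split(/\r?\n/);
  if (lines.length > 0 && lines[0].startsWith(FENCE)) {
    lines.shift();
  }
  if (lines.length > 0 && lines[lines.length - 1].trim() === FENCE) {
    lines.pop();
  }
  return lines.join("\n").trim();
}
```

Output that was never fenced passes through unchanged apart from trimming, so it is safe to run on every response.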
Constrain to allowlisted URLs
Pass the URL list in the prompt and explicitly say "Only use URLs from the provided list. Do NOT invent URLs.". Then validate output against the allowlist programmatically — don't trust the constraint to hold every time.
Set output size limits
For a 50-page input, set maxTokens: 16384 or more. Truncation mid-section is far more annoying than hitting the limit cleanly.
Two-pass for quality
For higher-quality output, do two passes: first pass generates structure + sections + link descriptions. Second pass rewrites the blockquote and refines descriptions. More expensive (~2x tokens) but the output is noticeably better. Our public generator runs single-pass for cost reasons; internal tools can splurge.
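A two-pass flow could be wired up roughly like this. It is a sketch under loud assumptions: `Complete` stands in for whatever model client you use (its signature is invented for illustration), `generateTwoPass` is a hypothetical name, and the refinement instruction is example wording rather than a tested prompt.

```typescript
// Hypothetical signature for a model call: system prompt in, text out.
type Complete = (systemPrompt: string, userPrompt: string) => Promise<string>;

// Example wording only; adjust to your own prompt style.
const REFINE_INSTRUCTION =
  "Rewrite the blockquote summary and tighten the link descriptions. " +
  "Keep every URL and every section heading unchanged.";

async function generateTwoPass(
  complete: Complete,
  specPrompt: string,
  pageList: string
): Promise<string> {
  // Pass 1: structure, sections, and first-draft descriptions.
  const draft = await complete(specPrompt, pageList);
  // Pass 2: refine wording only, using the draft as input.
  return complete(specPrompt + "\n" + REFINE_INSTRUCTION, draft);
}
```

Injecting the model call as a function keeps the flow provider-agnostic and makes it testable with a stub. The second pass can still invent URLs, so the allowlist validation should run on the final output, not the draft.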
9. When to Regenerate
llms.txt isn't a ship-it-and-forget-it file. AI agents reward freshness. Regenerate when:
- Major content launches: new product area, doc rewrite, new section
- Navigation changes: URL restructure, renamed sections, deprecated routes
- Quarterly cadence: even if nothing major changed; AI agents reward updated Last-Modified headers
- Broken link issues: when our checker shows sampled URLs returning 404, regenerate to pick up the current URL set
- Brand voice changes: new positioning means re-writing the blockquote
Most sites under-update. A regeneration takes 5 minutes — easier than waiting for someone to complain.
10. Frequently Asked Questions
How much does AI-generated llms.txt cost?
50 credits per run on InstaRank SEO. Credits are refunded automatically if the AI fails or returns malformed output, so worst-case cost is zero. Other vendors price similarly — running locally with the OpenAI API is roughly $0.05-0.20 per generation.
Which AI model is best for this?
We use DeepSeek v4 as primary because it's cost-effective and good at structured markdown output. Gemini 2.5 Pro is a close second. Claude is more verbose but produces slightly better blockquote summaries. Honestly, any frontier model works — the prompt matters more than the model.
Can the AI hallucinate URLs that don't exist?
Yes, occasionally. We detect hallucinations by validating every output URL against the input allowlist. On detection, we retry once with explicit constraints. If hallucinations persist, we strip them before returning the file.
What if my site has 5000 pages?
The generator caps at 50 pages because that's the AI context sweet spot. For very large sites, the right move is to pick your top 30-50 manually, then use the generator on that curated set. Or ship a tighter llms.txt + a more exhaustive llms-full.txt.
Can I use the generator for a competitor's domain?
You can scan any domain you have authorization to crawl. Whether you should deploy a competitor's generated llms.txt on your own site is a different question — that file is meant to shape how AI describes you, not them. The legitimate use case is competitor research: see what their llms.txt looks like for your own benchmarking.
Does this work for non-English sites?
Yes. Frontier LLMs handle most major languages well. The structure (H1, blockquote, H2 sections) is language-agnostic. The descriptions will be in whatever language the source pages use.
How do I A/B test different llms.txt versions?
You can't directly — there's only one /llms.txt path. Indirect approach: change version A → wait 30 days → measure AI-referred traffic (UTM tags help) → switch to version B → measure. Slow, but the only signal worth tracking.
Should I include affiliate links or marketing CTAs?
No. AI agents will quote your llms.txt content verbatim in answers. You don't want a customer's AI assistant to surface "Buy Acme Pro now — 30% off this week!" mid-conversation. Keep llms.txt informational. Save the CTAs for the linked pages themselves.
Generate your llms.txt now
AI-powered. 50 credits per run. Refunded automatically on failure. Under 2 minutes from URL to publishable markdown.
Open the LLMs.txt Generator