You already know AI search matters. What you probably do not have is the order of operations. Most advice on how to optimize for AI search throws twenty tactics at you at once, with no sense of what to do first or why.
This is the playbook we run, in sequence. Each step depends on the one before it. Skip ahead and you waste effort optimizing pages an AI engine cannot even read.
The Order That Matters
Do these seven steps in order. The sequence is the point.
1. Confirm AI bots can reach your pages
2. Pick the pages to start with
3. Restructure them for extraction (answer-first)
4. Add citable substance
5. Clarify your entities and schema
6. Build off-site presence the engines trust
7. Measure whether it is working
Here is why order matters. There is no point rewriting a page for answer-first extraction if a crawler is blocked from fetching it. There is no point adding schema to a page you have not restructured. Reachability gates everything, structure gates substance, and measurement only means something once the rest is in place. Work top to bottom.
If you only have an afternoon, do Step 1 and Step 3 on your five best pages. In our experience, that does more than a month of scattered tweaks.
Step 1: Confirm AI Bots Can Reach Your Pages
Before anything else, make sure the AI crawlers can actually fetch the pages you care about. If a bot is blocked, you cannot be cited, no matter how good the content is. It is the classic silent failure in AI search optimization, and the easiest to fix.
Each major engine uses named AI crawlers you can allow or block independently of Googlebot.
| Crawler | Owner / purpose | To allow in robots.txt |
|---|---|---|
| OAI-SearchBot | OpenAI, ChatGPT search index (gates citations) | User-agent: OAI-SearchBot / Allow: / |
| ChatGPT-User | OpenAI, real-time fetch when a user asks | User-agent: ChatGPT-User / Allow: / |
| Claude-SearchBot | Anthropic, Claude search index (gates citations) | User-agent: Claude-SearchBot / Allow: / |
| Claude-User | Anthropic, real-time fetch when a user asks | User-agent: Claude-User / Allow: / |
| PerplexityBot | Perplexity, search index (gates citations) | User-agent: PerplexityBot / Allow: / |
| Perplexity-User | Perplexity, real-time fetch when a user asks | n/a (Perplexity says it generally ignores robots.txt) |
| GPTBot | OpenAI, model training only | User-agent: GPTBot / Allow: / |
| ClaudeBot | Anthropic, model training only | User-agent: ClaudeBot / Allow: / |
| Google-Extended | Google, Gemini training control (a robots.txt token, not a crawler) | User-agent: Google-Extended / Allow: / |
Note the split: the search and fetch bots in the first six rows gate citations, while the training bots in the last three rows only control whether your content trains future models. One caveat on the fetch agents: OpenAI's documentation says that because ChatGPT-User actions are user-initiated, "robots.txt rules may not apply," and Perplexity states that Perplexity-User "generally ignores" robots.txt. So your allow rules matter most for the index crawlers: OpenAI's and Perplexity's user-fetch agents follow a user's request either way, while Anthropic says blocking Claude-User does stop those fetches.
The exact user-agent strings and rules are documented by the platforms themselves: OpenAI's crawler overview now documents four bots (GPTBot, OAI-SearchBot, ChatGPT-User, plus OAI-AdsBot for ad validation), Anthropic's crawler page covers the three Claude bots, Perplexity's crawler doc covers PerplexityBot and Perplexity-User, and Google's common crawlers reference covers Google-Extended.
Google-Extended controls Gemini training and grounding without affecting your normal Google Search ranking. AI Overviews are fed by ordinary Googlebot, so blocking Google-Extended neither changes your rankings nor removes you from AI Overviews.
A Robots.txt Block That Allows Them All
If you want every major AI crawler to reach your pages, an explicit allow block leaves no ambiguity. Drop this at the top of your robots.txt:
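```
# Search index and user-fetch agents (these gate citations)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-only tokens (these control model training, not citations)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /
```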
A few things to watch. A blanket User-agent: * / Disallow: / block later in the file does not override these named blocks, because robots.txt matches the most specific user-agent group, but a stray Disallow: rule inside one of these named groups will. Check that no path you care about sits under a Disallow: line in the matching group. And remember robots.txt is a fetch directive, not an access control: it tells well-behaved crawlers what to skip, so the real failures usually live one layer down, in the two causes below.
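To see the group-matching rule in action, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It is a simplified stand-in for how a real crawler reads the file, not a copy of any engine's parser, and the path `/guides/ai-search` and the agent name `SomeOtherBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: one named allow group plus a blanket disallow.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The named group wins for OAI-SearchBot; any other bot falls to the * group.
named_bot_allowed = can_fetch(ROBOTS_TXT, "OAI-SearchBot", "/guides/ai-search")
other_bot_allowed = can_fetch(ROBOTS_TXT, "SomeOtherBot", "/guides/ai-search")
print(named_bot_allowed, other_bot_allowed)
```

Paste your own robots.txt into `ROBOTS_TXT` and test the paths you care about against each crawler name from the table. A pass here only covers the robots.txt layer.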
The Two Blocks That Catch People
Two things block these crawlers more often than deliberate robots.txt rules:
- WAF (web application firewall) and bot-management rules. A Cloudflare or similar rule that challenges non-browser traffic will catch AI crawlers as collateral, even when robots.txt allows them. In our experience scanning sites with Geotoolbox, this is the single most common reachability problem, and the site owner almost never intended it.
- JavaScript rendering. If your main content loads client-side, the bot receives a nearly empty page: a Vercel and MERJ study of traffic across Vercel's network (December 2024) found that none of the major AI crawlers render JavaScript, while Googlebot does.
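One rough way to spot a WAF block is to look for error responses served to AI crawlers in your own access logs. The sketch below assumes the common "combined" log format and matches the user-agent tokens from the crawler table; the sample log lines are invented for illustration, and real user-agent strings will look different.

```python
import re
from collections import Counter

# User-agent tokens taken from the crawler table in this guide.
AI_BOT_TOKENS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "GPTBot", "ClaudeBot",
]

# Combined log format: "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def blocked_bot_hits(log_lines):
    """Count 4xx/5xx responses per AI crawler token."""
    blocked = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or int(match.group("status")) < 400:
            continue
        user_agent = match.group("ua").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                blocked[token] += 1
                break
    return blocked


# Invented sample lines: two blocked AI crawlers, one allowed, one non-AI client.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2026:10:00:00 +0000] "GET /guide HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.6 - - [01/Jan/2026:10:00:05 +0000] "GET /guide HTTP/1.1" 200 48211 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.7 - - [01/Jan/2026:10:00:09 +0000] "GET /pricing HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.9 - - [01/Jan/2026:10:00:12 +0000] "GET /guide HTTP/1.1" 403 512 "-" "curl/8.5.0"',
]
blocked = blocked_bot_hits(SAMPLE_LOG)
print(dict(blocked))
```

A steady run of 403s or challenge responses for a bot that robots.txt allows points at the firewall layer, not at your content.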
The check is binary: either the bot reaches the page or it does not. The free AI Crawler Checker shows what your robots.txt permits; verifying what crawlers actually receive past WAF rules and JavaScript is what the paid Content Analyzer scan does. Run one before you spend a minute on content, and fix any block here first.
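For the JavaScript side, a quick sanity check is to count the words present in the HTML as served, before any script runs. This is a minimal sketch using only the standard library; it takes the HTML as a string (fetching it is left out), and both sample pages are made up for illustration. It approximates what a non-rendering crawler receives and does not replace a real fetch as each bot.

```python
from html.parser import HTMLParser


class ServedTextCounter(HTMLParser):
    """Counts words in the HTML as served, skipping script and style bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.word_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.word_count += len(data.split())


def served_word_count(html: str) -> int:
    """Return the number of words a non-rendering client would see."""
    parser = ServedTextCounter()
    parser.feed(html)
    parser.close()
    return parser.word_count


# Made-up examples: a client-rendered shell and a server-rendered page.
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>renderApp({"title": "Long guide text"})</script></body></html>'
)
SERVER_RENDERED = (
    "<html><body><h1>Best send times</h1>"
    "<p>Send Tuesday to Thursday, 9 to 11 a.m.</p></body></html>"
)
print(served_word_count(CLIENT_RENDERED), served_word_count(SERVER_RENDERED))
```

If the count for a long article comes back near zero, the content is probably being injected client-side and the bot is seeing the empty shell.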
Step 2: Pick the Pages to Start With
Do not optimize your whole site. Pick the pages most likely to be pulled into an AI answer and start there. Trying to do everything at once is why most people freeze or spray effort thinly across hundreds of URLs.
Prioritize on two signals. First, informational pages you already have authority on, the ones that already rank or earn links. They are the pages an engine is most likely to retrieve in the first place, so improving their extractability compounds. Second, pages that answer specific questions, since question-shaped content maps directly to how people prompt AI tools.
Skip, for now, thin pages, pure transactional pages, and anything with no existing search footprint. They can come later. A practical first batch is five to ten pages: your best guides, your most-linked explainers, and the posts that answer the questions your customers actually ask.
A quick way to rank the shortlist: score each candidate page from 1 to 3 on two things, existing authority (does it already rank or earn links) and question-fit (does it answer a clear, specific question someone would ask an AI tool). Add the two scores and start with the 5s and 6s. A page that already ranks for a real question is the fastest path to a citation, because the engine is already likely to retrieve it. Depth on ten pages beats shallow edits on a hundred.
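The scoring fits in a few lines if you prefer to run it over a spreadsheet export. This is a sketch of the rule described above; the URLs and scores are placeholders, and `min_score=5` simply encodes "start with the 5s and 6s".

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    authority: int     # 1-3: does it already rank or earn links?
    question_fit: int  # 1-3: does it answer a clear, specific question?

    @property
    def score(self) -> int:
        return self.authority + self.question_fit


def first_batch(candidates, min_score=5):
    """Keep pages scoring min_score or more, highest score first."""
    picked = [c for c in candidates if c.score >= min_score]
    return sorted(picked, key=lambda c: c.score, reverse=True)


# Placeholder pages; replace with your own shortlist and scores.
pages = [
    Candidate("/guides/email-send-times", authority=3, question_fit=3),
    Candidate("/pricing", authority=2, question_fit=1),
    Candidate("/blog/what-is-dmarc", authority=2, question_fit=3),
]
batch = first_batch(pages)
print([(c.url, c.score) for c in batch])
```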
Step 3: Restructure for Extraction (Answer-First)
This is the core rewrite, and it is mostly about where you put the answer. Generative engines lift self-contained statements. If your answer is buried in the fourth paragraph after setup and throat-clearing, the model has nothing clean to quote. Placement is measurable: a Kevin Indig analysis of 1.2 million AI answers, published February 2026, found 44.2% of its 18,012 verified citations pointed to the first 30% of the content.
Put the Answer in the First Sentence
Answer-first means the first sentence under a heading directly answers the question in the heading, then you elaborate. Compare:
Before: "When it comes to email send times, there are many factors to consider. Every audience is different, and what works for one brand may not work for another. That said, after analyzing our data..."
After: "The best time to send marketing emails is Tuesday to Thursday, 9 to 11 a.m. in the recipient's time zone. Here is the data behind that, and when it does not hold."
The second version can be quoted as-is. The first cannot. Notice what changed: the claim, the specifics, and the qualifier all moved into the opening, so a model can lift one sentence and still represent you accurately. The setup that used to come first did not disappear; it moved below the answer, where a human reader who wants context can still find it.
Three Rules That Make Content Extractable
Three rules do most of the work:
- Self-contained chunks. Each section should make sense if a model lifts it alone, without the reader having seen the rest of the page. Avoid dangling references like "as mentioned above."
- Short paragraphs and clear headings. One idea per paragraph. Use a question-shaped heading, then answer it immediately.
- Lists and tables for structured facts. Comparisons, steps, and specs are easier to extract as a list or table than as prose.
You do not need to rewrite the whole page. Often, moving the answer to the top of each section and tightening the opening sentence is most of the win. The sentence-level craft (self-containment, originality, quotability) is covered in our guide to writing passages that LLMs cite.
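The self-contained rule is the easiest of the three to check mechanically. Below is a rough lint that flags phrases like "as mentioned above" per section. The phrase list is an illustrative starting point, not an exhaustive one, and the sample sections are invented.

```python
import re

# Phrases that only make sense if the reader saw another part of the page.
DANGLING_PATTERNS = [
    r"\bas (?:mentioned|noted|discussed|described|shown) "
    r"(?:above|earlier|below|previously)\b",
    r"\bsee (?:above|below)\b",
    r"\bas above\b",
    r"\bthe (?:previous|preceding|following) section\b",
]
DANGLING_RE = re.compile("|".join(DANGLING_PATTERNS), re.IGNORECASE)


def find_dangling_references(sections):
    """Map each section heading to the dangling phrases found in its body."""
    findings = {}
    for heading, body in sections.items():
        hits = [match.group(0) for match in DANGLING_RE.finditer(body)]
        if hits:
            findings[heading] = hits
    return findings


# Invented sample page, keyed by heading.
sample_sections = {
    "When should I send?": "As mentioned above, timing varies by audience.",
    "What is the best day?": "Tuesday to Thursday works best for most lists.",
}
findings = find_dangling_references(sample_sections)
print(findings)
```

A hit is a prompt to restate the referenced fact inside the section, not an automatic failure.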
Step 4: Add Citable Substance
Once a page is reachable and structured, give the engine something worth quoting. Models reach for specific, attributable facts over vague assertions.
Why Specificity Wins
This is the tactic with the strongest research behind it. The Princeton-led study that defined GEO (generative engine optimization) found that GEO methods can lift visibility in generative-engine responses by up to 40%, with adding sources, quotations, and statistics among the most effective. Specificity is the lever. A larger 2026 experiment points the same way: a 252,000-trial study across six models (arXiv, May 2026) found that topical relevance, fresh timestamps, and explicit price information shifted which page gets cited, while formatting-only changes barely moved selection.
What to Add, and How
In practice that means:
- Replace "many businesses see strong results" with a real number and where it came from.
- Add a short quote from a named expert or a primary source where it supports a claim.
- Cite the origin of every statistic inline, so the claim carries its own credibility.
Inline attribution matters more than a footnote or a sources list at the bottom. When the number and its source sit in the same sentence, a model can lift the whole self-contained unit and reproduce the attribution, which is exactly the behavior you want, since a cited claim is far more likely to be repeated than a bare one. A statistic stranded in a reference list at the foot of the page loses that pairing the moment a section is extracted alone.
The shape to aim for, using your own real data:
Vague: "Switching to our platform can significantly improve your conversion rates."
Citable: "In our 2025 test across [N] stores, switching the checkout flow cut cart abandonment from [X]% to [Y]%."
Fill the brackets with real numbers, a real sample size, and a real source. That specificity is what an engine lifts into an answer and attributes to you.
One caution. The same research is not permission to manufacture data. Stuffing invented or unsourced numbers into every paragraph degrades the page and the trust you are trying to build. Add facts because they are true and useful, not to game a model. If you do not have a real statistic, do not invent one.
Step 5: Clarify Your Entities and Schema (Realistically)
Here is where most advice oversells. You do not need a secret schema trick. Google's own guidance on AI features and your website states there are no additional requirements and no special structured data for appearing in AI Overviews.
So treat schema as housekeeping, not a growth hack. Article, FAQPage, and Organization markup help machines parse your page and disambiguate your brand, which is useful. They are not a switch that turns on citations. The measurement now backs that up: an Ahrefs study published May 2026 tracked 1,885 pages already being cited by AI that added JSON-LD against 4,000 control pages and found citations moved within statistical noise on ChatGPT and Google AI Mode, while AI Overviews citations fell 4.6%. Cited pages were almost three times more likely to carry JSON-LD, but the study attributes that correlation to overall site quality, not the markup.
What matters more is entity clarity in the prose: state plainly who you are, what you do, and the facts about your topic. Make sure your brand is described consistently across your site and your off-site profiles, so the model resolves you to one clear entity rather than a fuzzy one.
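If you do add Organization markup as housekeeping, the useful part is keeping it consistent with your prose and your off-site profiles. Here is a minimal sketch that builds the JSON-LD from one set of values; the brand name, URLs, and description are hypothetical, and the properties used (`name`, `url`, `description`, `sameAs`) are standard schema.org Organization fields.

```python
import json


def organization_jsonld(name, url, description, same_as):
    """Build an Organization JSON-LD script tag from a single set of brand facts."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "description": description,
        # Off-site profiles that describe the same entity.
        "sameAs": list(same_as),
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical brand used only for illustration.
markup = organization_jsonld(
    name="Example Co",
    url="https://www.example.com",
    description="Example Co makes email scheduling software for small teams.",
    same_as=[
        "https://www.linkedin.com/company/example-co",
        "https://www.youtube.com/@exampleco",
    ],
)
print(markup)
```

Reusing the same description string on your About page and your profiles is the point; the markup only mirrors what the prose already says.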
On llms.txt specifically: Google does not use it as a Search or AI Overviews ranking signal, so it will not lift your rankings or citations. But in May 2026 Google added an llms.txt audit to Chrome Lighthouse as an agentic-browsing best practice, so it is now low-cost infrastructure worth adding to help AI agents navigate your site. Add it for that reason, not for rankings, and do it after reachability and structure.
Step 6: Build Off-Site Presence the Engines Trust
Generative engines corroborate. A claim echoed across several independent, trusted sources is safer to repeat than one that lives only on your own domain. That makes your off-site footprint part of the optimization, not a separate marketing track.
Where Models Look for Corroboration
Three places carry weight because models lean on them:
- Reference and community sites. Wikipedia (if you genuinely qualify), and active discussion on Reddit and niche forums, show up disproportionately in AI citations for many topics.
- Video. A YouTube presence on your topic gives engines another corroborating, citable source.
- Third-party lists and roundups. Being included in "best X" articles and industry roundups gets you mentioned in the exact comparative answers buyers prompt for.
Mentions Beat Raw Link Count
The shift in mindset: for AI citation, consistent brand mentions across trusted sources often matter more than the raw backlink count that wins a Google position. A model deciding whether to repeat a claim about you is weighing how many independent places describe you the same way, not how many links point at your homepage.
So an accurate, consistent mention on a relevant forum thread or roundup can carry more citation weight than a high-authority link with no surrounding context. Earned, accurate description of your brand in the places models trust is the off-site half of GEO, and it is the half most teams ignore because it does not show up in a backlink report.
Step 7: Measure Whether It Is Working
You cannot manage what you cannot see, and AI search is partly invisible. Search Console only began covering Google's own AI surfaces in June 2026 (impressions for AI Overviews and AI Mode, rolling out to a subset of sites), nothing equivalent exists for the chat engines, and traffic from an AI tool usually lands in analytics as direct or referral with no keyword attached. Rankings no longer tell the whole story either: Ahrefs' 2026 re-run, using December 2025 data, found AI Overviews correlate with a 58% lower click-through rate for the top organic result, up from 34.5% in its March 2025 study, so you can hold position one and still lose traffic. The practical answer is to triangulate and track direction, not a perfect count.
Track four things:
- Citation share. Run your core questions through ChatGPT, Perplexity, and Google's AI Overviews and record whether you appear, versus competitors.
- Branded prompt presence. Ask the engines about your brand and log how accurately they describe you.
- AI referral traffic. Filter analytics for referrers like chatgpt.com, perplexity.ai, and gemini.google.com.
- Reachability status. Re-confirm crawlers can still fetch key pages after site changes.
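For the referral-traffic item, most analytics exports give you a referrer URL per session, which you can bucket yourself. The sketch below uses the three hosts named in the list; treat that mapping as a starting point to extend with whatever referrers you actually observe, and note that the sample referrers are invented.

```python
from collections import Counter
from urllib.parse import urlparse

# Referrer hosts named in the list above; extend with others you observe.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
}


def ai_source(referrer: str):
    """Return the AI tool label for a referrer URL, or None if it is not one."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, label in AI_REFERRER_HOSTS.items():
        if host == domain or host.endswith("." + domain):
            return label
    return None


def count_ai_referrals(referrers):
    """Count sessions per AI tool, ignoring every other referrer."""
    return Counter(label for label in map(ai_source, referrers) if label)


# Invented sample of session referrers.
sample_referrers = [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=best+send+time",
    "https://www.google.com/",
    "https://chatgpt.com/c/abc",
    "",
]
counts = count_ai_referrals(sample_referrers)
print(dict(counts))
```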
Set realistic expectations. Optimizing a page does not guarantee a citation; engines only retrieve and cite a fraction of eligible pages, and which ones shift over time. Judge progress by the trend across weeks, not a single before-and-after. A monitoring view that tracks your AI visibility over time turns scattered manual checks into a baseline you can compare against. Record your starting point before you optimize.
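A baseline can be as simple as a log of manual checks rolled up by week. This sketch assumes you record one row per question, per engine, per week with a yes/no for whether you were cited; the field names and the sample rows are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    week: str      # ISO week label, e.g. "2026-W01"
    engine: str    # which AI tool you asked
    question: str  # the prompt you ran
    cited: bool    # did your site appear as a source?


def weekly_citation_share(checks):
    """Return {week: fraction of checks where you were cited}, in week order."""
    totals, hits = {}, {}
    for check in checks:
        totals[check.week] = totals.get(check.week, 0) + 1
        hits[check.week] = hits.get(check.week, 0) + int(check.cited)
    return {week: hits[week] / totals[week] for week in sorted(totals)}


# Invented log: the same two questions on two engines, two weeks apart.
log = [
    Check("2026-W01", "ChatGPT", "best time to send email", False),
    Check("2026-W01", "Perplexity", "best time to send email", True),
    Check("2026-W01", "ChatGPT", "what is dmarc", False),
    Check("2026-W01", "Perplexity", "what is dmarc", False),
    Check("2026-W03", "ChatGPT", "best time to send email", True),
    Check("2026-W03", "Perplexity", "best time to send email", True),
    Check("2026-W03", "ChatGPT", "what is dmarc", False),
    Check("2026-W03", "Perplexity", "what is dmarc", False),
]
share = weekly_citation_share(log)
print(share)
```

Keep the question set fixed from week to week so the share moves because of your pages, not because you changed what you asked.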
The Per-Page AI Search Optimization Checklist
Run this on every page you optimize. If you cannot tick the first item, stop and fix it before touching the rest.
| # | Check | Pass condition |
|---|---|---|
| 1 | Reachable | OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot can fetch the page (robots + WAF allow); GPTBot and ClaudeBot too if you also want training inclusion |
| 2 | Rendered without JS | Main content is in the HTML a crawler receives, not loaded client-side only |
| 3 | Answer-first | Each section's first sentence answers its heading directly |
| 4 | Self-contained chunks | Sections make sense lifted alone; no dangling "as above" references |
| 5 | Citable substance | Specific, sourced statistics and at least one named quote where relevant |
| 6 | Entity clarity | Brand, topic, and key facts stated plainly; consistent with off-site profiles |
| 7 | Fresh | Visible publish/update date; content reflects the current state of the topic |
Seven checks, in order. Pages most often fail on 1, 3, or 5. For the full set of best practices behind these checks, grouped by category and rated by evidence, see our answer engine optimization best practices.
Common Mistakes That Waste Effort
Three beliefs send people in the wrong direction.
"GEO is just SEO, so my rankings carry over." Partly true, but it misleads. Ranking and citation are different selection systems, and the gap is widening. By March 2026, only 38% of Google AI Overview citations ranked in the top 10 for the same query, down from 76% in July 2025 (Ahrefs, 863,000 SERPs). Across AI assistants, only about 12% of cited URLs rank in the top 10 for the original prompt (Ahrefs, August 2025), with ChatGPT, Gemini, and Copilot each hovering near 8% and Perplexity the outlier at 29%. High domain authority does not transfer cleanly to AI citations, so chasing more backlinks while ignoring extractability is effort spent on the wrong lever. For the full breakdown of the discipline, see our guide on what generative engine optimization is.
"I need to build an llms.txt file." It was over-prescribed as a ranking hack, which it is not; see Step 5 for what it is actually good for, and note it is no substitute for the seven steps above.
"If I optimize, I will get cited." Optimization improves your odds; it does not guarantee a citation. As Step 7 explains, the citation pool is selective and shifts over time. Expecting a one-to-one payoff leads people to abandon a working approach too early. Track the trend, not a single result.
A useful habit: label the advice you follow as TESTED (you or a credible study verified it) or CLAIMED (it sounds right but nobody has shown it works). Most AI-search advice circulating today is CLAIMED. Spend your time on the TESTED parts first.
Frequently Asked Questions
How do I rank in ChatGPT search? Make sure OAI-SearchBot and ChatGPT-User can reach the page (GPTBot only controls training, not search visibility), structure it answer-first, and back claims with specific sourced facts. ChatGPT leans on its search index plus corroborating sources, so off-site mentions help. Our guide on getting cited in ChatGPT search covers the engine-specific details.
How do I rank in AI answers generally? The same fundamentals transfer across engines: reachable pages, answer-first structure, citable substance, and corroboration. The per-engine tuning is secondary. For Perplexity specifically, see how to get cited in Perplexity.
How long until my content gets cited? Reachability fixes can take effect within days. Content and citation changes are slower and harder to attribute, because AI sourcing shifts gradually and is only partly observable. Judge it over weeks, by trend.
Why isn't my content cited even though I optimized it? Optimization improves odds, not certainty (see Step 7). Check the basics first: is the page actually reachable, is the answer extractable, and is the claim corroborated elsewhere.
Is SEO dead in 2026? No. Search is shifting toward synthesized answers, but the underlying work (reachable, clear, trustworthy content) still decides who gets surfaced. AI search optimization is an extension of SEO, not its replacement.
Do I need an llms.txt file? Not for rankings or citations, for the reasons covered in Step 5; it is worth adding only as low-cost infrastructure for AI agents. Our full breakdown of whether llms.txt is worth it covers the few site types that are the exception. Spend the time on reachability and structure first.
Start With Step 1
You do not need a platform or a budget to begin, just the discipline to go in order. The cheapest, highest-value move is also the first one: confirm AI engines can actually reach your best pages. In our experience, it is common to find at least one silent block the owner never intended.
Geotoolbox's free AI Crawler Checker shows which of 34 AI crawlers your robots.txt allows or blocks in seconds; the paid Content Analyzer goes further by fetching the page as each bot and grading how extractable it is. Start there, fix what it flags, then work down the seven steps.
Sources
- Overview of OpenAI Crawlers - OpenAI
- Does Anthropic crawl data from the web? - Anthropic
- Perplexity crawlers - Perplexity
- Google's common crawlers - Google Search Central
- AI features and your website - Google Search Central
- The rise of the AI crawler - Vercel and MERJ, December 2024
- GEO: Generative Engine Optimization - Aggarwal et al., KDD 2024
- What Gets Cited: Competitive GEO in AI Answer Engines - arXiv, May 2026
- 44% of ChatGPT citations come from the first third of content - Search Engine Land (Kevin Indig), February 2026
- Schema markup and AI citations - Ahrefs, May 2026
- AI Overview citations vs top-10 rankings - Ahrefs, March 2026
- AI search citation overlap with Google's top 10 - Ahrefs, August 2025
- AI Overviews reduce clicks: December 2025 update - Ahrefs, February 2026
- llms.txt in Lighthouse agentic browsing - Chrome Developers, May 2026