AI SEO Research & Gap Analysis

What the industry and academic research say about optimizing content for AI search engines, what actually moves the needle for AI citations, and where our audit tool has room to grow.

Part 1: What The Research Says

The Princeton GEO Paper (The Foundation)

The foundational academic work on GEO is "GEO: Generative Engine Optimization" by Aggarwal et al. from Princeton/IIT Delhi, presented at KDD 2024. This paper coined the term and established the field.

Key findings from the study:

Optimization Method	Visibility Boost (Position-Adjusted Word Count)	Visibility Boost (Subjective Impression)
Cite Sources	30-40%	15-30%
Quotation Addition	30-40%	15-30%
Statistics Addition	30-40%	15-30%
Fluency Optimization	-	15-30%
Easy-to-Understand	-	15-30%

The most dramatic finding: adding citations to content increased visibility by 115.1% for websites that were originally ranked 5th in traditional search. Meanwhile, the top-ranked website's visibility actually decreased by 30.3% in generative responses. GEO disproportionately helps content that isn't already dominant.

Domain-specific insights:

Quotation Addition works best for People & Society, Explanation, and History content
Statistics Addition works best for Law & Government and Opinion content
Fluency and readability improvements had a broad positive effect across all domains

What Gets Cited: The Data

Research from Search Engine Land analyzing 8,000+ AI citations found clear patterns:

Answer Capsules are the strongest signal. 72.4% of blog posts cited by ChatGPT contained an identifiable "answer capsule" - a concise, self-contained explanation of 120-150 characters placed directly after a question-framed H2. This is the single most predictive formatting trait for getting cited.
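A minimal sketch of how an audit could look for answer capsules in markdown source. It assumes that "question-framed" means the H2 text ends with a question mark and that the capsule is the first paragraph under that heading; the function name `find_answer_capsules` and both assumptions are illustrative, not part of the cited research.

```python
import re

# Character range for an "answer capsule", taken from the paragraph above.
CAPSULE_MIN, CAPSULE_MAX = 120, 150

# Matches an H2 line only ("## Heading"), not H1 or H3 and deeper.
H2_PATTERN = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def find_answer_capsules(markdown: str) -> list[tuple[str, str, bool]]:
    """Return (heading, first_paragraph, is_capsule) for each question-framed H2."""
    # Splitting on a pattern with one capture group yields
    # [preamble, heading1, body1, heading2, body2, ...].
    parts = H2_PATTERN.split(markdown)
    results = []
    for heading, body in zip(parts[1::2], parts[2::2]):
        if not heading.endswith("?"):
            continue  # assumption: a question-framed heading ends in "?"
        paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
        # Collapse internal whitespace so hard-wrapped lines are measured fairly.
        first = " ".join(paragraphs[0].split()) if paragraphs else ""
        results.append((heading, first, CAPSULE_MIN <= len(first) <= CAPSULE_MAX))
    return results
```

Headings whose third tuple element is `False` are candidates for a rewritten opening paragraph.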

Answer-first formatting matters. Placing the direct answer in the first 40-60 words of each section lets AI systems extract it without parsing introductory context. One study found this increased ChatGPT citations by 140%.

Section length has a sweet spot. Pages using 120-180 words between headings receive 70% more ChatGPT citations than pages with sections under 50 words. Too short and there's nothing to extract. Too long and it's hard to parse.
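The section-length range can be checked mechanically. The sketch below counts whitespace-separated words under each markdown heading and labels the section against the 120-180 word range quoted above; the function names and the labels "short", "in_range", and "long" are hypothetical.

```python
import re

# Matches any ATX heading line, H1 through H6.
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def section_word_counts(markdown: str) -> dict[str, int]:
    """Map each heading to the number of words between it and the next heading."""
    parts = HEADING_PATTERN.split(markdown)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return {
        heading: len(body.split())
        for heading, body in zip(parts[1::2], parts[2::2])
    }


def classify_section(words: int, low: int = 120, high: int = 180) -> str:
    """Label a section relative to the word range cited in the text."""
    if words < low:
        return "short"
    if words > high:
        return "long"
    return "in_range"
```

Duplicate heading text would overwrite earlier entries in the returned dict, so a real audit would likely key on heading position instead.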

Original data is the second-strongest differentiator. First-party statistics, proprietary research, and unique datasets significantly increase citation likelihood across all platforms.

Content Freshness: The Gate That Overrides Everything

Research from Seer Interactive and Ahrefs shows freshness is a hard gate for AI citations:

65% of AI crawler hits target content published within the past year
79% target content from the last two years
AI-cited content is 25.7% fresher than what traditional Google ranks
ChatGPT shows the strongest freshness bias, citing URLs that are 393-458 days newer than Google organic results
Content that used to stay relevant for 24-36 months now feels outdated in 6-9 months for generative engines

The critical point: authority without recency is rarely sufficient. Even authoritative sources lose AI visibility when their facts are outdated.
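As a rough illustration, a freshness check can compare a page's last-modified date against the 6-9 month window mentioned above. The thresholds of 180 and 270 days are an approximation of that window, and the status labels are hypothetical rather than taken from the cited research.

```python
from datetime import date

REVIEW_AFTER_DAYS = 180  # assumption: roughly 6 months
STALE_AFTER_DAYS = 270   # assumption: roughly 9 months


def freshness_status(last_modified: date, today: date) -> str:
    """Classify content age as 'fresh', 'review', or 'stale'."""
    age_days = (today - last_modified).days
    if age_days < 0:
        raise ValueError("last_modified is in the future")
    if age_days <= REVIEW_AFTER_DAYS:
        return "fresh"
    if age_days <= STALE_AFTER_DAYS:
        return "review"
    return "stale"
```

Passing `today` explicitly keeps the function deterministic; a caller would typically supply `date.today()`.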

Platform-Specific Citation Patterns

Each generative engine has distinct citation preferences, according to Profound:

ChatGPT: Heavily favors Wikipedia, established media, .com domains. 76.4% of most-cited pages updated within 30 days. Strongest freshness bias.

Perplexity: Averages 6.61 citations per response (most citation-dense). Heavily concentrates on Reddit and YouTube. Emphasizes E-E-A-T signals. Real-time retrieval makes freshness especially critical.

Claude: Prioritizes content demonstrating clear reasoning. Responds well to step-by-step explanations and methodology sections. Values logical flow and "why/how" over just "what."

Google AI Overviews: More distributed across source types. Heavily weights E-E-A-T signals. Reddit and Medium disproportionately cited. Powered by Gemini, actively filtering generic AI-generated content.

Cross-platform fragmentation: Only ~11% of domains are cited by both ChatGPT and Perplexity.

AI Crawler Access (robots.txt)

The baseline requirement for any GEO strategy:

Crawler	Owner	Purpose
GPTBot	OpenAI	Training + retrieval
ChatGPT-User	OpenAI	Live browsing requests
ClaudeBot	Anthropic	Training + retrieval
PerplexityBot	Perplexity	Real-time search index
Google-Extended	Google	AI training data
Googlebot	Google	Search + AI Overviews

If your robots.txt blocks these user agents, you don't exist to these engines.
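One way to test this is with Python's standard-library `urllib.robotparser`. The sketch below takes the text of a robots.txt file and reports which of the user agents from the table above would be denied a given URL. The helper name `blocked_ai_crawlers` and the default URL are illustrative, and the standard-library parser may not match every crawler's own interpretation of the rules.

```python
from urllib.robotparser import RobotFileParser

# User agents from the table above.
AI_CRAWLERS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Googlebot",
]


def blocked_ai_crawlers(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the crawlers from AI_CRAWLERS that may not fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]
```

For example, a file containing `User-agent: GPTBot` followed by `Disallow: /`, plus a permissive `User-agent: *` group, would report only GPTBot as blocked.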

The llms.txt Standard

A new proposed standard (llmstxt.org) is emerging alongside robots.txt. The llms.txt file is a markdown document at your site's root providing AI systems with a structured overview of your content, purpose, and key resources. OpenAI, Microsoft, and others are actively crawling these files.
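To make the file's shape concrete, here is a small generator sketch. It follows the layout described in the llmstxt.org proposal as understood here: an H1 with the site name, a blockquote summary, then H2 sections containing link lists. The function name, parameters, and example values are hypothetical, and the proposal itself should be checked for the current format.

```python
def render_llms_txt(
    name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render llms.txt content from a site name, summary, and link sections.

    `sections` maps a section title to (label, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for title, links in sections.items():
        lines.append(f"## {title}")
        lines.append("")
        for label, url, note in links:
            lines.append(f"- [{label}]({url}): {note}")
        lines.append("")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"
```

The returned string would be written to `/llms.txt` at the site root.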

Multimodal AI Readiness (Image Accessibility)

The AIVO Standard v2.2 (2025) establishes a canonical framework for multi-modal AI visibility. As models like Gemini 2.5 and GPT-4o directly ingest images alongside text, image metadata becomes a first-class signal for AI grounding.

Key findings:

Multimodal models use alt attributes and <figcaption> text to understand what an image depicts and how it relates to surrounding content
Images without alt text are content that AI cannot understand, reference, or cite
Semantic image markup (<figure> with <figcaption>) provides richer context than alt text alone
The AIVO Standard positions image readiness alongside text readiness as a core AI visibility requirement
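The image checks described above can be sketched with the standard-library `html.parser`. This version counts `<img>` tags, records those with missing or empty `alt` text, and counts `<figcaption>` elements. Treating an empty `alt=""` as missing is an assumption made here (it is valid for purely decorative images), and the class and function names are illustrative.

```python
from html.parser import HTMLParser


class ImageAudit(HTMLParser):
    """Collects <img> tags lacking alt text and counts <figcaption> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.image_count = 0
        self.missing_alt: list[str] = []  # src values of images without alt text
        self.figcaption_count = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img":
            self.image_count += 1
            # Assumption: empty or whitespace-only alt counts as missing.
            if not (attributes.get("alt") or "").strip():
                self.missing_alt.append(attributes.get("src") or "")
        elif tag == "figcaption":
            self.figcaption_count += 1


def audit_images(html: str) -> dict:
    """Summarize image accessibility signals for one HTML document."""
    parser = ImageAudit()
    parser.feed(html)
    parser.close()
    return {
        "images": parser.image_count,
        "missing_alt": parser.missing_alt,
        "figcaptions": parser.figcaption_count,
    }
```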

Schema Completeness (Beyond Presence)

Research from Semrush and WebFX shows that LLMs use schema completeness to ground citations, not just schema presence. An Article schema with only @type does almost nothing. The same schema with headline, author, and datePublished gives models grounding confidence.

Key findings:

LLMs parse JSON-LD to verify authorship, publication dates, and content type before citing
Incomplete schemas are treated similarly to absent schemas by citation models
The most impactful properties vary by type: Article needs headline/author/datePublished, FAQPage needs mainEntity, HowTo needs name/step
Schema completeness acts as a trust multiplier on other authority signals
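A completeness check can be sketched as a lookup of expected properties per schema type, using the three types and properties listed above. The function name and the `EXPECTED_PROPERTIES` table are illustrative; this version handles a single JSON-LD object or a top-level array, and does not unpack `@graph` containers or list-valued `@type`.

```python
import json

# Properties per type, taken from the findings listed above.
EXPECTED_PROPERTIES = {
    "Article": ["headline", "author", "datePublished"],
    "FAQPage": ["mainEntity"],
    "HowTo": ["name", "step"],
}


def missing_schema_properties(json_ld: str) -> dict[str, list[str]]:
    """Return, per recognized @type, the expected properties that are absent or empty."""
    data = json.loads(json_ld)
    nodes = data if isinstance(data, list) else [data]
    report: dict[str, list[str]] = {}
    for node in nodes:
        if not isinstance(node, dict):
            continue
        schema_type = node.get("@type")
        if isinstance(schema_type, str) and schema_type in EXPECTED_PROPERTIES:
            report[schema_type] = [
                prop for prop in EXPECTED_PROPERTIES[schema_type] if not node.get(prop)
            ]
    return report
```

An empty list for a type means every expected property was populated.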

Brand Entity Consistency

The KnewSearch 2026 AI Search Visibility Benchmark found that brand-controlled sources account for approximately 86% of AI citations. The defining characteristic of a "brand-controlled" source is consistent entity identification across multiple page surfaces.

Key findings:

AI models resolve brand identity by cross-referencing entity names across title, OG tags, JSON-LD schema, and footer/header
Inconsistent entity naming (e.g., "Acme Corp" in the title but "Acme" in the footer and "Acme Corporation" in schema) reduces citation confidence
Pages where the brand name appears consistently across 4+ surfaces are significantly more likely to be cited as authoritative sources
Entity consistency is especially important for branded queries where multiple competing sources exist
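A simple consistency check compares the brand name extracted from each surface. The sketch below assumes the names have already been pulled from the title, OG tags, schema, and footer, and treats two names as matching only when they are identical after case and whitespace normalization. That strict rule, the function name, and the surface keys are assumptions for illustration.

```python
from collections import Counter


def entity_consistency(surfaces: dict[str, str]) -> dict:
    """Compare brand names found on each page surface after light normalization."""
    normalized = {
        surface: " ".join(name.split()).casefold()
        for surface, name in surfaces.items()
        if name and name.strip()
    }
    if not normalized:
        return {"dominant": None, "matching": [], "mismatched": []}
    # The most frequent spelling is treated as the canonical entity name.
    dominant = Counter(normalized.values()).most_common(1)[0][0]
    return {
        "dominant": dominant,
        "matching": [s for s, n in normalized.items() if n == dominant],
        "mismatched": [s for s, n in normalized.items() if n != dominant],
    }
```

With the "Acme" example above, the schema and footer surfaces would be reported as mismatched against a dominant "acme corp".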

Part 2: Emerging Best Practices (Consensus View)

Tier 1: Non-Negotiable

AI Crawler Access - Don't block GPTBot, ClaudeBot, PerplexityBot, Google-Extended in robots.txt
Content Freshness - Published/modified dates must be visible, crawlable, and honest. Update every 6-9 months minimum
Answer-First Formatting - Direct answer in first 40-60 words after every question-framed heading

Tier 2: High Impact (Princeton paper's top methods)

Cite Sources - External authoritative links with formal citation patterns (+115% visibility for sites originally ranked 5th in traditional search)
Include Statistics - First-party data, specific numbers, percentages, quantitative claims
Add Quotations - Expert quotes with attribution

Tier 3: Structural (Makes content extractable)

Heading Hierarchy - One H1, question-framed H2s, H3 sub-topics, 120-180 words per section
Structured Data - JSON-LD schema markup, Open Graph tags, canonical URLs
Schema Completeness - JSON-LD types with all recommended properties populated (not just present)
Lists and Tables - Easiest content formats for AI to extract verbatim
Section Structure - Short paragraphs (30-150 words), bold key phrases, definition patterns
Image Accessibility - Alt text on all images, semantic <figure>/<figcaption> markup

Tier 4: Authority Signals

Author Attribution - Visible bylines with credentials and schema markup
Organization Identity - Organization schema, og:site_name, About/Contact pages
Entity Consistency - Brand name appears consistently across title, OG tags, schema, footer
E-E-A-T Signals - First-hand experience, original visuals, credentials
Cross-Platform Presence - Mentions across Reddit, LinkedIn, YouTube increase citations 2.8x

Tier 5: Emerging

llms.txt - Markdown file at the site root giving AI systems a structured overview of your content, purpose, and key resources
Content Clusters - Pillar pages with 3-7 supporting articles, interlinked
