AI SEO Research & Gap Analysis

What the industry and academic research say about optimizing content for AI search engines, what actually moves the needle for AI citations, and where our audit tool has room to grow.


Part 1: What The Research Says

The Princeton GEO Paper (The Foundation)

The foundational academic work on GEO is "GEO: Generative Engine Optimization" by Aggarwal et al. from Princeton/IIT Delhi, presented at KDD 2024. This paper coined the term and established the field.

Key findings from the study:

| Optimization Method | Visibility Boost (Word Count) | Visibility Boost (Impression) |
|---|---|---|
| Cite Sources | 30-40% | 15-30% |
| Quotation Addition | 30-40% | 15-30% |
| Statistics Addition | 30-40% | 15-30% |
| Fluency Optimization | - | 15-30% |
| Easy-to-Understand | - | 15-30% |

The most dramatic finding: adding citations to content increased visibility by 115.1% for websites that were originally ranked 5th in traditional search. Meanwhile, the top-ranked website's visibility actually decreased by 30.3% in generative responses. GEO disproportionately helps content that isn't already dominant.

Domain-specific insights:

  • Quotation Addition works best for People & Society, Explanation, and History content
  • Statistics Addition works best for Law & Government and Opinion content
  • Fluency and readability improvements had a broad positive effect across all domains

What Gets Cited: The Data

Research from Search Engine Land analyzing 8,000+ AI citations found clear patterns:

Answer Capsules are the strongest signal. 72.4% of blog posts cited by ChatGPT contained an identifiable "answer capsule" - a concise, self-contained explanation of 120-150 characters placed directly after a question-framed H2. This is the single most predictive formatting trait for getting cited.

Answer-first formatting matters. Placing the direct answer in the first 40-60 words of each section lets AI systems extract it without parsing introductory context. One study found this increased ChatGPT citations by 140%.

Section length has a sweet spot. Pages using 120-180 words between headings receive 70% more ChatGPT citations than pages with sections under 50 words. Too short and there's nothing to extract. Too long and it's hard to parse.

Original data is the second-strongest differentiator. First-party statistics, proprietary research, and unique datasets significantly increase citation likelihood across all platforms.
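The formatting heuristics above (question-framed H2s, a 120-150 character answer capsule, 120-180 words per section) can be checked mechanically. The following is a minimal sketch, assuming the content is available as markdown with `## ` headings; the function name `audit_sections` and the report keys are hypothetical, and the thresholds are just the figures quoted above, not calibrated limits.

```python
import re

# Thresholds copied from the figures quoted above; treat them as heuristics.
CAPSULE_CHARS = (120, 150)   # answer capsule length, in characters
SECTION_WORDS = (120, 180)   # words between headings


def audit_sections(markdown_text):
    """Return one report dict per H2 section of a markdown document."""
    # re.split with a capture group yields [preamble, heading, body, heading, body, ...]
    parts = re.split(r"^## +(.+)$", markdown_text, flags=re.MULTILINE)
    reports = []
    for heading, body in zip(parts[1::2], parts[2::2]):
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        first_paragraph = paragraphs[0] if paragraphs else ""
        word_count = len(body.split())
        reports.append({
            "heading": heading.strip(),
            # Assumption: a heading ending in "?" counts as question-framed.
            "question_framed": heading.strip().endswith("?"),
            # Assumption: the first paragraph after the heading is the capsule.
            "capsule_ok": CAPSULE_CHARS[0] <= len(first_paragraph) <= CAPSULE_CHARS[1],
            "word_count": word_count,
            "length_ok": SECTION_WORDS[0] <= word_count <= SECTION_WORDS[1],
        })
    return reports
```

Sections under H3 headings are counted as part of their parent H2 here, which is a simplification; a stricter audit would split on every heading level.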

Content Freshness: The Gate That Overrides Everything

Research from Seer Interactive and Ahrefs shows freshness is a hard gate for AI citations:

  • 65% of AI crawler hits target content published within the past year
  • 79% target content from the last two years
  • AI-cited content is 25.7% fresher than what traditional Google ranks
  • ChatGPT shows the strongest freshness bias, citing URLs that are 393-458 days newer than Google organic results
  • Content that used to stay relevant for 24-36 months now feels outdated in 6-9 months for generative engines

The critical point: authority without recency is rarely sufficient. Even authoritative sources lose AI visibility when their facts are outdated.
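As a rough illustration of the freshness window described above, the sketch below classifies a page by the age of its last-modified date. The 180-day and 270-day cut-offs are an assumed translation of "6-9 months", and `freshness_status` is a hypothetical helper, not part of any existing tool.

```python
from datetime import date

# Assumption: "6-9 months" is approximated as 180 and 270 days.
REVIEW_AFTER_DAYS = 180
STALE_AFTER_DAYS = 270


def freshness_status(last_modified, today):
    """Classify a page as 'fresh', 'review', or 'stale' by its age in days."""
    age_days = (today - last_modified).days
    if age_days > STALE_AFTER_DAYS:
        return "stale"
    if age_days > REVIEW_AFTER_DAYS:
        return "review"
    return "fresh"
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.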

Platform-Specific Citation Patterns

Each generative engine has distinct citation preferences, according to data from Profound:

ChatGPT: Heavily favors Wikipedia, established media, and .com domains. 76.4% of its most-cited pages were updated within the last 30 days. Strongest freshness bias.

Perplexity: Averages 6.61 citations per response (most citation-dense). Heavily concentrates on Reddit and YouTube. Emphasizes E-E-A-T signals. Real-time retrieval makes freshness especially critical.

Claude: Prioritizes content demonstrating clear reasoning. Responds well to step-by-step explanations and methodology sections. Values logical flow and "why/how" over just "what."

Google AI Overviews: More distributed across source types. Heavily weights E-E-A-T signals. Reddit and Medium are disproportionately cited. Powered by Gemini, which actively filters generic AI-generated content.

Cross-platform fragmentation: Only ~11% of domains are cited by both ChatGPT and Perplexity.

AI Crawler Access (robots.txt)

The baseline requirement for any GEO strategy:

| Crawler | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training + retrieval |
| ChatGPT-User | OpenAI | Live browsing requests |
| ClaudeBot | Anthropic | Training + retrieval |
| PerplexityBot | Perplexity | Real-time search index |
| Google-Extended | Google | AI training data |
| Googlebot | Google | Search + AI Overviews |

If your robots.txt blocks these user agents, you don't exist to these engines.
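One way to verify crawler access is to parse a site's robots.txt and test each user agent from the table above. This is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `blocked_ai_crawlers` and the default URL are illustrative assumptions, and the standard-library parser may not match every crawler's own interpretation of the rules.

```python
from urllib.robotparser import RobotFileParser

# User agents listed in the table above.
AI_CRAWLERS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Googlebot",
]


def blocked_ai_crawlers(robots_txt, url="https://example.com/"):
    """Return the AI crawlers that the given robots.txt text disallows for url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]
```

For example, a file containing `User-agent: GPTBot` followed by `Disallow: /` would report only `GPTBot` as blocked, while an empty robots.txt reports no blocked crawlers.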

The llms.txt Standard

A new proposed standard (llmstxt.org) is emerging alongside robots.txt. The llms.txt file is a markdown document at your site's root providing AI systems with a structured overview of your content, purpose, and key resources. OpenAI, Microsoft, and others are actively crawling these files.
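To make the file's shape concrete, here is a small sketch that renders a minimal llms.txt body. The layout (an H1 with the site name, a blockquote summary, then H2 sections of annotated links) follows the general shape of the llmstxt.org proposal; check the current proposal before relying on it. The function `build_llms_txt` and its parameters are hypothetical.

```python
def build_llms_txt(site_name, summary, sections):
    """Render a minimal llms.txt body as a markdown string.

    sections maps a section title to a list of (link_title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for title, links in sections.items():
        lines.append(f"## {title}")
        lines.append("")
        for link_title, url, note in links:
            lines.append(f"- [{link_title}]({url}): {note}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"
```

The returned string would be served as `/llms.txt` at the site root.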

Multimodal AI Readiness (Image Accessibility)

The AIVO Standard v2.2 (2025) establishes a canonical framework for multi-modal AI visibility. As models like Gemini 2.5 and GPT-4o directly ingest images alongside text, image metadata becomes a first-class signal for AI grounding.

Key findings:

  • Multimodal models use alt attributes and <figcaption> text to understand what an image depicts and how it relates to surrounding content
  • Images without alt text are content that AI cannot understand, reference, or cite
  • Semantic image markup (<figure> with <figcaption>) provides richer context than alt text alone
  • The AIVO Standard positions image readiness alongside text readiness as a core AI visibility requirement
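The image signals listed above (alt text, `<figure>`, `<figcaption>`) can be counted with a simple HTML pass. The sketch below uses Python's standard-library `html.parser`; the class `ImageAudit` and helper `audit_images` are hypothetical names, and treating an empty `alt` as missing is an assumption (an empty `alt` is legitimate for purely decorative images).

```python
from html.parser import HTMLParser


class ImageAudit(HTMLParser):
    """Collect basic image-accessibility counts from an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.missing_alt = []   # src of each <img> with no alt or a blank alt
        self.figures = 0
        self.figcaptions = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            self.images += 1
            # Assumption: a blank alt counts as missing for this audit.
            if not (attributes.get("alt") or "").strip():
                self.missing_alt.append(attributes.get("src") or "")
        elif tag == "figure":
            self.figures += 1
        elif tag == "figcaption":
            self.figcaptions += 1


def audit_images(html):
    """Parse html and return the populated ImageAudit instance."""
    audit = ImageAudit()
    audit.feed(html)
    audit.close()
    return audit
```

This only counts tags; it does not verify that each `<figcaption>` sits inside the same `<figure>` as its image.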

Schema Completeness (Beyond Presence)

Research from Semrush and WebFX shows that LLMs use schema completeness to ground citations, not just schema presence. An Article schema with only @type does almost nothing. The same schema with headline, author, and datePublished gives models grounding confidence.

Key findings:

  • LLMs parse JSON-LD to verify authorship, publication dates, and content type before citing
  • Incomplete schemas are treated similarly to absent schemas by citation models
  • The most impactful properties vary by type: Article needs headline/author/datePublished, FAQPage needs mainEntity, HowTo needs name/step
  • Schema completeness acts as a trust multiplier on other authority signals
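The per-type properties named above lend themselves to a simple completeness check. The following is a minimal sketch, assuming a single JSON-LD object with a string `@type`; the mapping `REQUIRED` contains only the properties listed above, and `missing_schema_properties` is a hypothetical helper rather than a schema.org validator.

```python
import json

# Properties named in the findings above; not an exhaustive schema.org list.
REQUIRED = {
    "Article": ["headline", "author", "datePublished"],
    "FAQPage": ["mainEntity"],
    "HowTo": ["name", "step"],
}


def missing_schema_properties(json_ld):
    """Return the listed properties that are absent or empty in a JSON-LD object."""
    data = json.loads(json_ld)
    schema_type = data.get("@type")
    # Assumption: @type is a single string; list-valued types are skipped.
    if not isinstance(schema_type, str):
        return []
    required = REQUIRED.get(schema_type, [])
    return [prop for prop in required if data.get(prop) in (None, "", [], {})]
```

An Article object with only `@type` would report all three properties as missing, which matches the "present but incomplete" case described above.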

Brand Entity Consistency

The KnewSearch 2026 AI Search Visibility Benchmark found that brand-controlled sources account for approximately 86% of AI citations. The defining characteristic of a "brand-controlled" source is consistent entity identification across multiple page surfaces.

Key findings:

  • AI models resolve brand identity by cross-referencing entity names across title, OG tags, JSON-LD schema, and footer/header
  • Inconsistent entity naming (e.g., "Acme Corp" in the title but "Acme" in the footer and "Acme Corporation" in schema) reduces citation confidence
  • Pages where the brand name appears consistently across 4+ surfaces are significantly more likely to be cited as authoritative sources
  • Entity consistency is especially important for branded queries where multiple competing sources exist
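Entity consistency across surfaces can be approximated by comparing the brand string found in each location. This sketch assumes the names have already been extracted (from the title, OG tags, JSON-LD, and footer) and uses exact string comparison after whitespace normalization, which is a deliberately strict assumption; `entity_consistency` and its return keys are hypothetical.

```python
from collections import Counter


def entity_consistency(surfaces):
    """Compare brand names across page surfaces.

    surfaces maps a surface name (e.g. 'title') to the brand name found there.
    """
    # Normalize internal whitespace and drop surfaces with no name.
    names = {
        surface: " ".join(name.split())
        for surface, name in surfaces.items()
        if name and name.strip()
    }
    if not names:
        return {"canonical": None, "consistent_surfaces": 0, "mismatches": {}}
    # Treat the most frequent spelling as the canonical entity name.
    canonical, count = Counter(names.values()).most_common(1)[0]
    mismatches = {s: n for s, n in names.items() if n != canonical}
    return {"canonical": canonical, "consistent_surfaces": count, "mismatches": mismatches}
```

Using the example above, "Acme Corp" in the title and OG tag, "Acme Corporation" in schema, and "Acme" in the footer would yield two consistent surfaces and two mismatches.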

Part 2: Emerging Best Practices (Consensus View)

Tier 1: Non-Negotiable

  1. AI Crawler Access - Don't block GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, or Google-Extended in robots.txt
  2. Content Freshness - Publication and modification dates must be visible, crawlable, and honest. Update content at least every 6-9 months
  3. Answer-First Formatting - Direct answer in first 40-60 words after every question-framed heading

Tier 2: High Impact (Princeton paper's top methods)

  1. Cite Sources - External authoritative links with formal citation patterns (+115% visibility for pages ranked 5th in traditional search)
  2. Include Statistics - First-party data, specific numbers, percentages, quantitative claims
  3. Add Quotations - Expert quotes with attribution

Tier 3: Structural (Makes content extractable)

  1. Heading Hierarchy - One H1, question-framed H2s, H3 sub-topics, 120-180 words per section
  2. Structured Data - JSON-LD schema markup, Open Graph tags, canonical URLs
  3. Schema Completeness - JSON-LD types with all recommended properties populated (not just present)
  4. Lists and Tables - Easiest content formats for AI to extract verbatim
  5. Section Structure - Short paragraphs (30-150 words), bold key phrases, definition patterns
  6. Image Accessibility - Alt text on all images, semantic <figure>/<figcaption> markup

Tier 4: Authority Signals

  1. Author Attribution - Visible bylines with credentials and schema markup
  2. Organization Identity - Organization schema, og:site_name, About/Contact pages
  3. Entity Consistency - Brand name appears consistently across title, OG tags, schema, footer
  4. E-E-A-T Signals - First-hand experience, original visuals, credentials
  5. Cross-Platform Presence - Mentions across Reddit, LinkedIn, YouTube increase citations 2.8x

Tier 5: Emerging

  1. llms.txt - Markdown file at the site root that gives AI systems a structured overview of your content, purpose, and key resources
  2. Content Clusters - Pillar pages with 3-7 supporting articles, interlinked

Sources