This Week in Brief
Perplexity's new 'Search as Code' architecture shifts agentic retrieval toward AI-written Python workflows — a structural change practitioners must monitor for citation-surface implications. Microsoft's MAI-Thinking-1 faces scrutiny over whether Common Crawl-sourced training data meets its 'clean, commercially licensed' claims, reopening enterprise compliance questions. Meanwhile, two RAG-adjacent research threads — Graph RAG and self-augmenting diffusion retrieval — suggest AI answer engines are moving toward more relational, multi-hop retrieval that rewards entity-rich, structured content.
AI Search & ASO
Perplexity Launches 'Search as Code': AI Agents Now Write Their Own Python Retrieval Pipelines
Perplexity has introduced Search as Code, a reference architecture in which AI agents generate Python retrieval workflows using an Agentic Search SDK inside a restricted sandbox. The company claims 100% software-vulnerability detection accuracy and 85.1% lower token consumption versus comparable approaches. For generative engine optimisation (GEO) practitioners, the shift matters because structured, programmatically parseable pages are more likely to be surfaced by agent-driven retrieval pipelines than dense editorial prose. Clean schema markup and answer-first page structure become table stakes, not optimisations. (Unconfirmed: benchmark claims pending outside validation.)
Google's own guidance, cited in Perplexity AI Magazine's 2026 GEO guide, confirms that AI Overviews and AI Mode depend on Google's standard Search ranking and quality systems, retrieval-augmented generation, and query fan-out — not a parallel ranking stack. Practitioners chasing AI Overview citations by ignoring classic SEO fundamentals are optimising the wrong layer. Crawlable pages, entity consistency, structured data, and strong E-E-A-T signals remain the prerequisite, with GEO-specific formatting (answer-first prose, question-format headings, JSON-LD) applied on top.
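As an illustration of the JSON-LD layer mentioned above, here is a minimal Python sketch that emits a Schema.org FAQPage block for a single question-and-answer pair. The function name build_faq_jsonld and the sample text are assumptions made for this example; FAQPage is only one of several Schema.org types a page might use, and nothing here is taken from Google's or Perplexity's guidance.

```python
import json


def build_faq_jsonld(question: str, answer: str) -> str:
    """Return a <script> tag carrying Schema.org FAQPage JSON-LD (illustrative helper)."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
        ],
    }
    payload = json.dumps(data, indent=2)
    # Escape "</" so text inside the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    print(build_faq_jsonld(
        "What is Graph RAG?",
        "Graph RAG builds a knowledge graph from a corpus to support multi-hop retrieval.",
    ))
```

The heading text goes in "name" and the answer-first opening paragraph goes in "text", so the markup mirrors what a reader sees on the page.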
AI Lab Signals
Microsoft MAI-Thinking-1 Training Materials List Common Crawl Despite 'Clean Data' Pitch
Microsoft's in-house MAI-Thinking-1 model was marketed around commercially licensed, clean training data, but published materials list public-web and Common Crawl sources. Microsoft's stated position is that its crawler respects robots.txt opt-out controls — which is distinct from holding a negotiated licence for each publisher page. Enterprise compliance teams must now decide whether that distinction satisfies their procurement standards, and content publishers should treat robots.txt directives as an active data-governance lever, not a legacy SEO artefact.
High-Quality Public Training Data Approaching Exhaustion, Epoch AI Projects
An analysis drawing on Epoch AI research forecasts that the supply of high-quality human-generated public text available for LLM pre-training will be fully utilised in the near term, with synthetic data substitutes showing degraded model performance in early evaluations. For GEO practitioners, scarcity of fresh, high-quality web content increases the relative value of pages that are crawlable, original, and authoritative: sites that produce genuine primary research or unique expert analysis become disproportionately attractive training and retrieval sources. (Note: the Epoch AI projection is presented as a forward-looking research estimate, not a confirmed outcome.)
Dataset Quality Outweighs Architecture in LLM Performance, 2026 Springer Survey Concludes
A 2026 Springer survey covering the full lifecycle of data preparation for large language models concludes that dataset quality — including deduplication, provenance auditing, and diversity — drives model generalisation more than architectural choices. Leading labs invest as heavily in data pipelines as in compute. The practitioner implication is direct: content that passes quality filters (clear authorship, cited sources, low duplication, structured formatting) is more likely to be retained in training corpora and retrieved in inference-time RAG pipelines.
Training Data & Crawl
robots.txt Compliance ≠ Licensing: The Legal Gap Exposed by MAI-Thinking-1
The MAI-Thinking-1 controversy crystallises a distinction that publishers should encode in policy now: robots.txt compliance means a crawler did not take pages it was told to exclude; it does not mean the lab obtained a licence for pages it was permitted to fetch. Publishers seeking training-data opt-out should treat robots.txt as a floor, not a ceiling, and investigate whether platform-level data-licensing agreements or model-specific clauses are available from major lab vendors.
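To make the "floor, not a ceiling" point concrete, the sketch below uses Python's standard-library urllib.robotparser to show what a robots.txt policy does and does not express. The policy text and the example.com URLs are assumptions for illustration. CCBot is the user-agent token used by Common Crawl; ExampleBot is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block Common Crawl's crawler entirely,
# and keep every other crawler out of /drafts/ only.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "ExampleBot"):
        for url in ("https://example.com/research/report",
                    "https://example.com/drafts/x"):
            print(agent, url, allowed(ROBOTS_TXT, agent, url))
```

The check answers only "may this crawler fetch this URL?". It carries no information about licence terms for pages that are fetched, which is the gap described above.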
Common Corpus and Open Datasets Remain Central to LLM Pre-Training Pipelines
The DOT Data Labs 2026 guide confirms that open datasets including Common Corpus continue to anchor LLM pre-training alongside proprietary pipelines at leading labs. Rigorous deduplication and provenance auditing are now standard practice, meaning duplicate or low-signal pages are increasingly filtered out before training. Content teams should audit their own domains for thin or near-duplicate pages that would fail such filters — both for training-data inclusion and for retrieval scoring at inference time.
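A domain audit for near-duplicate pages can start small. The sketch below compares pages by word-level shingles and Jaccard similarity. The 3-word shingle size and the 0.8 threshold are assumptions chosen for illustration, not values taken from the DOT Data Labs guide or from any lab's pipeline, and large-scale filters typically rely on approximations such as MinHash rather than this exact pairwise comparison.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping k-word shingles (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(pages: dict[str, str], threshold: float = 0.8):
    """Return (url_a, url_b, similarity) for page pairs at or above the threshold."""
    sets = {url: shingles(text) for url, text in pages.items()}
    results = []
    for u, v in combinations(sorted(sets), 2):
        score = jaccard(sets[u], sets[v])
        if score >= threshold:
            results.append((u, v, round(score, 2)))
    return results
```

Pairs flagged this way are candidates for consolidation or canonicalisation; the threshold should be tuned against pages a human has already judged to be duplicates.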
Research Radar (arXiv)
Graph RAG: When Knowledge Graphs Beat Vector Search
Standard vector-similarity RAG fails on multi-hop queries — questions whose answers require connecting entities across multiple documents. Graph RAG addresses this by constructing a structured knowledge graph from a document corpus, enabling entity-relation traversal that cosine similarity cannot replicate. For GEO practitioners, the implication is that content explicitly naming entities, their relationships, and their attributes (via structured data and clear prose) is better positioned for citation in Graph RAG-powered answer engines than content optimised solely for keyword density.
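The multi-hop idea can be shown with a toy example. The sketch below stores facts as (subject, relation, object) triples and uses breadth-first search to connect two entities that never appear in the same fact. The company names and relations are invented for illustration, and this is a simplified stand-in for entity-relation traversal, not the construction or query pipeline of any particular Graph RAG system.

```python
from collections import defaultdict, deque

# Invented facts, each imagined as coming from a different document.
TRIPLES = [
    ("Acme Analytics", "acquired_by", "Globex"),
    ("Globex", "headquartered_in", "Berlin"),
]


def build_graph(triples):
    """Index triples by subject so each entity lists its outgoing relations."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph


def find_path(graph, start, goal, max_hops=3):
    """Breadth-first search for a chain of facts linking start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None


if __name__ == "__main__":
    print(find_path(build_graph(TRIPLES), "Acme Analytics", "Berlin"))
```

A question such as "In which city is Acme Analytics' parent company based?" needs both facts. Pages that state each entity and relationship explicitly are easier to turn into triples like these.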
SARDI: Self-Augmenting Retrieval for Diffusion Language Models
(Pre-publication / arXiv) SARDI introduces a training-free dynamic RAG framework that uses low-confidence tokens discarded during the diffusion LM denoising process as lookahead signals to trigger additional retrieval steps, improving performance on multi-hop QA benchmarks over both diffusion and autoregressive RAG baselines. The framework is retriever-agnostic and requires no additional model training. For practitioners, this signals that next-generation retrieval systems will perform more iterative, confidence-weighted lookups — rewarding content that provides clear, high-confidence factual statements over hedged or ambiguous prose.
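The general pattern of confidence-triggered retrieval can be sketched in a few lines. The code below is a toy illustration only and is not the SARDI algorithm: the function name, the 0.5 threshold, and the idea of joining discarded tokens into a query string are all assumptions made for this example, and the retriever is passed in as a plain callable to mirror the retriever-agnostic framing.

```python
from typing import Callable


def denoise_step_with_retrieval(
    predictions: list[tuple[str, float]],
    retrieve: Callable[[str], list[str]],
    threshold: float = 0.5,
) -> tuple[list[str], list[str]]:
    """Keep confident tokens; use the discarded ones as a lookahead query.

    predictions: (token, confidence) pairs from one hypothetical denoising step.
    retrieve:    any function mapping a query string to a list of passages.
    """
    kept = [tok for tok, conf in predictions if conf >= threshold]
    discarded = [tok for tok, conf in predictions if conf < threshold]
    # Retrieval is triggered only when the step produced low-confidence tokens.
    passages = retrieve(" ".join(discarded)) if discarded else []
    return kept, passages


if __name__ == "__main__":
    def stub_retriever(query: str) -> list[str]:
        return [f"passage about: {query}"]

    step = [("The", 0.9), ("capital", 0.8), ("Canberra", 0.3)]
    print(denoise_step_with_retrieval(step, stub_retriever))
```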
Practitioner Takeaway
Audit your highest-traffic pages for structural extractability this week. Every top-level section should open with a 40–60 word self-contained answer; H2s should be phrased as the user query you are targeting; entities (brand names, product names, people, organisations) should be named explicitly rather than referred to by pronoun; and Schema.org JSON-LD should be emitted for every page type. This is the single intervention most directly supported by multiple signals this week: Perplexity's agent-written retrieval pipelines, Google's guidance that AI Overviews rely on its standard Search ranking and quality systems, and the Graph RAG research suggesting that entity-explicit content is better positioned than dense prose in multi-hop retrieval. Do it before sampling your target queries; without it, measurement is premature.
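Parts of this audit can be scripted. The sketch below checks one section against three of the criteria above: a question-style heading, a 40 to 60 word opening paragraph, and an opening that does not lead with a pronoun. The function name and the heuristics are assumptions for illustration. A trailing question mark is only a rough proxy for "phrased as the user query", and the pronoun list is deliberately short.

```python
import re

# Rough heuristic: flag openings that start with a pronoun instead of a named entity.
_PRONOUN_START = re.compile(r"\s*(it|they|this|these|he|she|we)\b", re.IGNORECASE)


def audit_section(heading: str, body: str, lo: int = 40, hi: int = 60) -> dict:
    """Check one section's heading and opening paragraph against the audit criteria."""
    first_para = body.strip().split("\n\n", 1)[0]
    word_count = len(first_para.split())
    return {
        "heading_is_question": heading.strip().endswith("?"),
        "opening_word_count": word_count,
        "opening_in_range": lo <= word_count <= hi,
        "opens_with_pronoun": bool(_PRONOUN_START.match(first_para)),
    }


if __name__ == "__main__":
    print(audit_section("Overview", "It works well.\n\nMore detail follows."))
```

Running this over each H2 section of a page gives a quick pass/fail table to review by hand; it does not replace reading the opening paragraph to confirm it actually answers the heading.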
The 6-phase framework used to structure this newsletter is available as a complete methodology guide — including audit tools, templates, and implementation checklists.