Common questions about the AI Knowledge Signal audit tool and the AI Knowledge Signal Framework.
The audit fetches the page at the URL you submit and evaluates it across five dimensions that correspond to the most consequential failure points in the AI knowledge supply chain: Crawlability & Technical Access; Structural Clarity & Machine Readability; Knowledge Uniqueness & Contribution; Authority, Evidence & Trust; and AI Usability & Retrieval Readiness. A sixth phase, Maintenance, Monitoring & Improvement, is surfaced in the report's prompts section so you can track whether AI assistants find your content over time.
The result is a structured report: an overall score out of 10, a Corpus Survival Likelihood rating, per-dimension scores with specific findings, and a prioritised set of recommendations tied to the AI Knowledge Signal Framework.
Each of the five dimensions is scored 1–5. The overall score (1–10) is a weighted composite calculated as follows:
| Dimension | Weight | Rationale |
|---|---|---|
| Crawlability & Technical Access | 15% | Binary prerequisite; failure here is catastrophic but rare |
| Structural Clarity & Machine Readability | 25% | Highest impact on AI interpretability and embedding quality |
| Knowledge Uniqueness & Contribution | 25% | Core epistemic value; determines signal vs. noise |
| Authority, Evidence & Trust | 20% | Structural proxy for quality in training pipeline filters |
| AI Usability & Retrieval Readiness | 15% | Enhancement layer; differentiating but not foundational |
The weighted 1–5 average is mapped to a true 1–10 scale using the formula: round((weighted_avg − 1) × 2.25 + 1). A page that scores 1/5 on every dimension therefore gets 1/10, and a page that scores 5/5 on every dimension gets 10/10. Score interpretation: 1–2 = critical failures across nearly all dimensions; 3–5 = below threshold, significant weaknesses, corpus survival unlikely without remediation; 6–7 = adequate, passes minimum threshold, improvements recommended; 8–10 = strong corpus survival likelihood. A score of 6 is a genuine passing grade: it means the content clears the minimum threshold, not that it is poor.
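The weighting and formula above can be sketched in a few lines of Python. This is an illustration only, not the audit's own code: the dictionary keys and function names are hypothetical, and the rounding mode for exact halves (rounded up here) is an assumption, since the formula only says "round".

```python
import math
from fractions import Fraction

# Weights from the table above, kept as whole percentages so the arithmetic is exact.
# The short keys are hypothetical names for the five dimensions.
WEIGHTS_PCT = {
    "crawlability": 15,
    "structural_clarity": 25,
    "knowledge_uniqueness": 25,
    "authority": 20,
    "ai_usability": 15,
}


def overall_score(scores: dict[str, int]) -> int:
    """Map five 1-5 dimension scores to the 1-10 overall score."""
    if set(scores) != set(WEIGHTS_PCT):
        raise ValueError("expected exactly the five dimension keys")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension score must be between 1 and 5")
    weighted_avg = Fraction(sum(WEIGHTS_PCT[k] * scores[k] for k in WEIGHTS_PCT), 100)
    scaled = (weighted_avg - 1) * Fraction(9, 4) + 1  # 9/4 is 2.25
    # Assumption: exact halves round up; the source formula only says "round".
    return math.floor(scaled + Fraction(1, 2))


def interpret(score: int) -> str:
    """Return the interpretation band for an overall score of 1-10."""
    if score <= 2:
        return "critical"
    if score <= 5:
        return "below threshold"
    if score <= 7:
        return "adequate"
    return "strong"
```

For example, scores of 5, 3, 2, 3 and 4 (in table order) give a weighted average of 3.2, which maps to (3.2 − 1) × 2.25 + 1 = 5.95 and rounds to 6, an "adequate" result.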
AI training pipelines are not passive collectors. They apply aggressive quality filters at multiple stages — from initial crawl to corpus selection to deduplication. Content that fails at any stage is removed, downweighted, or misrepresented in the final model.
Corpus Survival Likelihood is an assessment of whether your content would survive this entire supply chain — from being found by a crawler through to being retained in a training corpus. It is expressed as High, Medium — At Risk, or Low — Likely Filtered.
Epistemic Risk measures how likely your content is to be misrepresented by AI systems — not just filtered, but actively distorted. This can happen when content uses terms loosely, buries claims in rhetorical language, relies heavily on metaphor, or leaves entity relationships ambiguous.
AI systems ingest text and produce embeddings. Structurally ambiguous content produces distorted embeddings: the model learns something, but it may not be what the author intended. Content with high Epistemic Risk is the content most likely to be misquoted, paraphrased incorrectly, or assigned the wrong meaning by AI systems.
This matters because misrepresentation can be more damaging than non-inclusion. If an AI system summarises your framework incorrectly and distributes that summary at scale, you have no ability to correct it.
Every piece of content plays a specific role in the knowledge ecosystem. The audit classifies your page into one of five roles:
Knowing your page's epistemic role helps you understand what Knowledge Uniqueness score to expect and what the most relevant improvements are. Primary contributions should score higher on uniqueness; reference pages should score higher on structure.
Writing quality and AI training readiness are not the same thing. AI training pipelines cannot evaluate content quality the way a human editor can. They rely on structural proxies: schema markup, heading hierarchy, author metadata, canonical URLs, citation patterns, and factual density.
A polished essay with no schema markup, no author attribution, and a derivative argument will score poorly on Authority, Evidence & Trust and Knowledge Uniqueness & Contribution — regardless of how well it is written. The audit is calibrated not to inflate scores for surface quality.
Check the specific dimension findings in your report. Low scores on Authority, Evidence & Trust usually indicate missing metadata. Low scores on Knowledge Uniqueness & Contribution usually indicate derivative content or no identifiable original contribution. Both are fixable through the framework recommendations in the report.
The URL audit works with any publicly accessible HTML page — articles, blog posts, documentation, case studies, whitepapers, product pages, homepages, or any knowledge-oriented web page that returns a standard HTTP response. The content type selector lets you tell the audit exactly what kind of page you are scoring so it calibrates its expectations accordingly.
The Chrome and Edge extension also lets you paste text or upload a document — useful for newsletters, social posts, or draft content that does not yet have a published URL. It accepts .txt, .md, .docx and .html files: the text is pulled out and scored against the four content dimensions. PDF uploads are not supported, because PDF text does not extract reliably (columns, tables, and scanned image-only pages come out garbled) — for an accurate score, paste the copy directly or upload a .docx instead. This is the draft-content path; it is not the same as the URL audit, which is still HTML-only by design.
The following will not produce a full audit result:
The content type selector tells the audit what kind of content you are scoring, so it can apply the right calibration. The framework's five dimensions mean different things in practice depending on the format, for example:
The available content types are: Homepage / Landing Page, Blog Post / Article, Product or Pricing Page, Case Study / Customer Story, Documentation / How-to Guide, Whitepaper / Research Report, Newsletter / Email, Social Media Post, FAQ Page, Step-by-step Guide, Support Documentation, and Industry Insights.
If you leave it on Auto-detect, the audit infers the content type from the page itself — which works well for most cases. Declaring the type explicitly gives the most accurate and actionable results, particularly for content formats that sit outside the typical web-page rubric (newsletters, social posts, homepages).
The audit is designed to evaluate content as AI training pipelines encounter it on the web — which means evaluating HTML. PDFs require different parsing (text extraction, structure inference), carry different metadata conventions, and are handled inconsistently by web crawlers. A PDF can have excellent content and still fail on all the signals that matter to an AI training pipeline: no schema markup, no heading hierarchy in the HTML sense, no canonical URL, no meta description.
If your primary knowledge asset is a PDF, the most important AI training readiness action is to publish an equivalent HTML version. The framework addresses this in Phase 5 (AI Usability & Retrieval Readiness). An HTML page that references and summarises the PDF is what AI systems will ingest — and what this audit tool can help you improve.
The audit is a structured diagnostic, not a guarantee. Specific limitations to be aware of:
For pages the audit cannot process (PDFs, paywalled content, JS-only rendered pages), see "What types of pages can I audit?" above.
No. The audit is a diagnostic tool that identifies structural weaknesses and opportunities — it does not have access to the actual training pipelines of any AI company. AI training data selection involves many factors beyond any individual page's quality: dataset curation decisions, domain selection, deduplication at corpus scale, and evolving pipeline methodology.
What the audit can tell you is how your content compares against the observable structural criteria that training pipelines are known to favour. A high score means your content has the properties associated with corpus retention. A low score means it has properties associated with filtering or misrepresentation.
The goal is not guaranteed inclusion. The goal is removing the avoidable reasons for exclusion.
The audit uses four layers of accuracy control to ensure scores reflect the real content of your page — not the model's training memory, not a guess:
The only fields that involve model interpretation — rather than deterministic code — are the short natural-language summaries (main claim, corpus survival reason, key findings). These are bounded by the actual fetched text and constrained to specific word counts. They are interpretations of what was found, not inventions.
After making changes recommended in an audit report, re-audit the page to confirm the improvements have been reflected. Beyond that, the framework recommends treating AI knowledge publication as an ongoing practice rather than a one-time task (Phase 6: Maintenance, Monitoring & Improvement).
Re-audit when you make significant structural changes to a page, update your schema markup or metadata, publish new sections or substantially revise the content, or want to benchmark a page ahead of a new AI training window (typically every 6–12 months for major model updates).
No. The URL you submit and the content extracted from it are used only to generate the audit response and are not stored on our servers after the result is returned. The URL and page content are passed to the Anthropic Claude API as part of the analysis. Anthropic processes this subject to their API usage policy. See our Privacy Policy for full details.
When you submit a URL, our server issues a single HTTP GET request to that URL. The request uses a standard browser User-Agent and set of request headers (Accept, Accept-Language, Sec-Fetch-*) — the same headers a real Chrome browser sends on its first page load. We do this so that enterprise WAFs, CDNs, and bot-protection services do not silently reject the fetch and return a challenge page instead of your content. A tool that is blocked before it reaches your origin cannot audit your page.
We also make a separate, identifying request to /robots.txt on your origin using the User-Agent Mozilla/5.0 (compatible; AIKnowledgeAudit/1.0; +https://aiknowledgesignal.io). This request is transparent — if you inspect your logs you will see exactly who audited you. robots.txt is a public file intended to be fetched by crawlers, so the honest User-Agent is the correct choice there.
The page fetch has a 12-second timeout. The robots.txt fetch has a 5-second timeout. Both run in parallel, so the robots.txt result is usually already in memory by the time the page fetch completes.
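The two-request pattern described above might look roughly like the following Python sketch. It is not the tool's implementation. The browser header values are placeholders chosen for illustration, and the function names are hypothetical; only the identifying User-Agent string and the two timeouts come from the text above.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen

PAGE_TIMEOUT_S = 12   # page fetch timeout stated above
ROBOTS_TIMEOUT_S = 5  # robots.txt fetch timeout stated above
AUDIT_UA = "Mozilla/5.0 (compatible; AIKnowledgeAudit/1.0; +https://aiknowledgesignal.io)"

# Assumption: placeholder browser-like headers; the real values are not given in the text.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-AU,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}


def robots_url(page_url: str) -> str:
    """Return the /robots.txt URL on the same origin as page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def build_requests(page_url: str) -> tuple[Request, Request]:
    """Build the browser-like page request and the identifying robots.txt request."""
    page_req = Request(page_url, headers=BROWSER_HEADERS, method="GET")
    robots_req = Request(robots_url(page_url), headers={"User-Agent": AUDIT_UA}, method="GET")
    return page_req, robots_req


def _fetch(req: Request, timeout: float) -> tuple[int, bytes]:
    # urlopen's timeout applies to each blocking socket operation, not the whole
    # transfer, and it raises HTTPError for 4xx/5xx responses.
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()


def fetch_both(page_url: str):
    """Run both fetches in parallel and return (page_result, robots_result)."""
    page_req, robots_req = build_requests(page_url)
    with ThreadPoolExecutor(max_workers=2) as pool:
        page_future = pool.submit(_fetch, page_req, PAGE_TIMEOUT_S)
        robots_future = pool.submit(_fetch, robots_req, ROBOTS_TIMEOUT_S)
        return page_future.result(), robots_future.result()
```

Because the robots.txt request has the shorter timeout and runs concurrently, its result is normally available before the page fetch returns.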
The free audit allows up to 10 requests per hour per IP address. Each audit response includes standard rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) and the audit tool displays your remaining quota beneath the result. When the quota resets, you can run more audits from the same IP address.
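As a rough illustration of how a quota like this can produce those three headers, here is a minimal fixed-window limiter in Python. The class name, the fixed-window strategy, and the use of epoch seconds for X-RateLimit-Reset are all assumptions; the hosted tool may count requests differently.

```python
import time

LIMIT = 10          # requests per window, as stated above
WINDOW_S = 60 * 60  # one hour


class HourlyQuota:
    """Hypothetical fixed-window limiter keyed by IP address."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so the limiter can be tested without waiting
        self._windows = {}    # ip -> (window_start, requests_counted)

    def check(self, ip: str):
        """Return (allowed, headers) for one request from ip."""
        now = self._clock()
        start, count = self._windows.get(ip, (now, 0))
        if now - start >= WINDOW_S:
            start, count = now, 0  # the previous window has expired
        allowed = count < LIMIT
        if allowed:
            count += 1
        self._windows[ip] = (start, count)
        headers = {
            "X-RateLimit-Limit": str(LIMIT),
            "X-RateLimit-Remaining": str(LIMIT - count),
            # Assumption: reset time expressed as Unix epoch seconds.
            "X-RateLimit-Reset": str(int(start + WINDOW_S)),
        }
        return allowed, headers
```

A client can read X-RateLimit-Remaining after each audit and pause until the reset time once it reaches zero.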
If you need higher volume — for example, auditing every page on a large site — the full framework download includes the complete SKILL.md, the GEO Framework PDF, and the How-To Guide for running structural checks yourself without using the hosted tool. The Chrome and Edge extension (included with a subscription) supports URL scans, paste-and-score, and document upload without the free audit's hourly cap, which is useful both for volume and for draft content not yet published to a URL.
We are deliberately specific about what this tool is and is not, because over-claiming damages the signal the tool is designed to produce. The audit does not:
Rather than returning a generic error, the audit classifies the specific block type and shows you exactly what happened. Common categories:
Each category maps to a specific panel with a concrete next step. The goal is that every non-success outcome is actionable — you know what type of block you hit, why an AI training crawler would face the same block, and what to do about it.
Yes. The audit fetches your robots.txt and parses rules for User-agent: * and the 11 named AI-training crawlers (GPTBot, CCBot, ClaudeBot, Anthropic-AI, Google-Extended, Cohere-AI, Bytespider, PerplexityBot, Diffbot, OmgiliBot, ImageSiftBot). If any of these directives disallow the path, the audit surfaces this as a hard Crawlability signal — you cannot score well if the major AI crawlers are explicitly told not to fetch your page.
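You can approximate this check yourself with Python's standard-library robots.txt parser. The sketch below is not the audit's parser: `blocked_ai_crawlers` is a hypothetical helper, and `urllib.robotparser` applies its own matching rules, which may differ in edge cases from those the audit uses.

```python
from urllib.robotparser import RobotFileParser

# The 11 crawler names listed above.
AI_CRAWLERS = [
    "GPTBot", "CCBot", "ClaudeBot", "Anthropic-AI", "Google-Extended",
    "Cohere-AI", "Bytespider", "PerplexityBot", "Diffbot", "OmgiliBot",
    "ImageSiftBot",
]


def blocked_ai_crawlers(robots_txt: str, path: str) -> list[str]:
    """Return the named AI crawlers that robots_txt disallows from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A crawler without its own group falls back to the "User-agent: *" group.
    return [ua for ua in AI_CRAWLERS if not parser.can_fetch(ua, path)]


sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocked_ai_crawlers(sample, "/blog/post"))  # only GPTBot is blocked here
```

With the sample file, a path under /private/ is blocked for all 11 crawlers: GPTBot by its own group and the rest by the wildcard group.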
We also parse the X-Robots-Tag HTTP response header for AI-opt-out directives including noai, noimageai, noindex, noml, and none. When any of these are present, the audit rejects the page before analysis and tells you exactly which directive your server is sending.
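A simplified version of the header check could look like this. It is a sketch under stated assumptions: the function name is hypothetical, and the handling of an optional crawler-name prefix (for example "googlebot: noindex") is a deliberately crude approximation, not the audit's parsing logic.

```python
# The opt-out directives named above.
AI_OPT_OUT = {"noai", "noimageai", "noindex", "noml", "none"}


def opt_out_directives(header_value: str) -> list[str]:
    """Return the AI opt-out directives present in an X-Robots-Tag header value."""
    found = []
    for token in header_value.split(","):
        # Keep only the part after an optional "botname:" prefix.
        directive = token.split(":")[-1].strip().lower()
        if directive in AI_OPT_OUT and directive not in found:
            found.append(directive)
    return found
```

For instance, `opt_out_directives("noai, noimageai")` returns both directives, while `opt_out_directives("index, follow")` returns an empty list, meaning the page would proceed to analysis.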
We do not ignore these signals to "audit anyway". If you have explicitly told AI crawlers not to use your content, we treat that as the authoritative answer for AI ingestion and do not override it with a Claude score.
The audit runs a weekly regression test against a curated list of 50 Australian enterprise URLs spanning jobs, property, automotive, travel, SaaS, telco, health insurance, industrial, FMCG, agri, medical, construction, engineering, IT services, retail, and marketplace sectors. A 10-URL random subset is tested each Monday morning (09:00 AEST) using GitHub Actions. Each run's results are committed back to the repository as a dated JSON file, creating a permanent, auditable history of how the tool has behaved over time.
If a URL that worked last week is now blocked — for example because a site adopted a new WAF — the run fails CI and automatically opens a tracked issue with the specific diff. This means regressions become visible within a week, and the fix is tied to the specific sector or provider change that caused it. The entire testing harness (product/tests/audit-smoke.mjs, audit-smoke-diff.mjs, .github/workflows/audit-smoke.yml) is in our public repository.
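The weekly sampling and regression check can be illustrated with a short Python sketch. This is not the published harness (which lives in the .mjs files named above). The function names, the "ok" status value, and the dated file-name pattern are all hypothetical.

```python
import random
from datetime import date

SUBSET_SIZE = 10  # URLs tested per weekly run, as stated above


def pick_subset(urls: list[str], seed=None) -> list[str]:
    """Pick a random subset of the curated URL list for this week's run."""
    rng = random.Random(seed)
    return rng.sample(urls, SUBSET_SIZE)


def results_filename(run_date: date) -> str:
    # Hypothetical naming scheme for the dated JSON results file.
    return f"audit-smoke-{run_date.isoformat()}.json"


def find_regressions(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return URLs that were 'ok' in the previous run but are not 'ok' now."""
    return sorted(
        url
        for url, status in current.items()
        if status != "ok" and previous.get(url) == "ok"
    )
```

A non-empty result from `find_regressions` is the condition under which a run like this would fail CI and open an issue.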
The full technical specification — how dimensions are scored, what the anti-hallucination architecture does, every reason code the tool can emit, every Change Log entry — is documented in product/audit.md in the repository. This is our intellectual property and it is deliberately transparent. An audit tool that cannot explain itself is not a diagnostic; it is a guess.
The AI Knowledge Signal Framework is a 6-phase methodology for producing, structuring, and publishing content in a way that maximises its likelihood of being accurately ingested and represented by AI training systems.
The six phases are run in order, each one removing a different reason AI systems ignore, misquote, or filter out content:
robots.txt, llms.txt, sitemaps, status codes, login walls, and JavaScript rendering.
The Phase Priorities section in the AI Knowledge Signal Framework identifies the phases that are evidence-based and have the biggest impact.
For example, if your page has no schema markup and no author attribution, Phase Priorities will reference Phase 4 (Authority, Evidence & Trust) and Phase 1 (Crawlability & Technical Access). The full framework contains the specific step-by-step guidance for addressing those gaps.
The Phase Priorities are designed to make the report immediately actionable — you can take the phase references directly into the full framework and find the exact steps required.
SEO tools optimise content for search engines built for human searchers, using ranking signals, keyword density, backlink profiles, and click-through rates. These metrics overlap somewhat with AI training readiness, but they serve a fundamentally different objective.
AI training pipelines are not search engines. They do not rank pages or respond to click signals. They filter and select content for inclusion in training corpora based on structural quality, originality, authority signals, and epistemic clarity. Content that performs well in SEO (high keyword density, lots of internal links, broad reach) can simultaneously score very poorly on AI training readiness (low epistemic uniqueness, low factual density, derivative argument).
The data makes this gap concrete: only 12% of URLs cited in AI-generated answers overlap with Google's top 10 organic results — across a dataset of 15,000 prompts tested against ChatGPT, Gemini, and Copilot (Ahrefs, 2025). Perplexity is the outlier at roughly 1 in 3, but even that means two-thirds of its citations fall outside traditional rankings entirely. Meanwhile, structured GEO publication has been shown to increase visibility in AI-generated responses by up to 40% (arXiv, 2023).
It is also worth noting that GEO is not a website optimisation strategy — it is a whole-of-entity strategy. AI systems synthesise signals from your entire digital presence: your website, LinkedIn, YouTube, third-party reviews, and forum discussions. The target is not a URL. It is the web's consensus about you.
AI systems are not trained on the web — they are trained on a highly compressed, filtered, and biased representation of it. The journey from a web page or other digital content to a model's knowledge runs through five sequential stages, each one a filter:
Content is ruthlessly filtered before it ever has a chance to influence AI systems. Surviving the pipeline requires structural properties — not just good writing.
The cumulative discard rate across a typical training pipeline looks approximately like this:
This is by design. AI training datasets are precision instruments, not archives. Higher-quality sources are upsampled to further increase their influence relative to their raw volume — meaning the actual representation gap between well-structured and poorly-structured content is larger than the raw discard rates suggest.
No — this is a common misconception. During training, tokens are converted to embedding vectors as a computational intermediate. These embeddings pass through transformer layers and are then discarded. They are not stored in the model.
The model's “knowledge” is encoded in its weight matrices. When you interact with a language model, it is not retrieving stored text — it is generating responses by running input tokens through layers of learned weights.
Embeddings as persistent stored representations only appear in retrieval systems — RAG (Retrieval-Augmented Generation) and vector databases — which are architecturally separate from the base model. Optimising for base model training and optimising for RAG retrieval are related but distinct objectives.
Quality filters look for structural proxies of reliability and information density — not writing fluency. The properties that consistently correlate with corpus retention:
Most content fails AI training pipelines not because it is wrong, but because it lacks these structural signals. That is a fixable problem — which is what the AI Knowledge Signal Framework and the free audit tool address.
Run a free audit on any publicly accessible URL and get a structured report in under 15 seconds.