The FREE Audit Tool

The audit fetches the page at the URL you submit. It evaluates your page across five dimensions — Crawlability & Technical Access, Structural Clarity & Machine Readability, Knowledge Uniqueness & Contribution, Authority, Evidence & Trust, and AI Usability & Retrieval Readiness — that correspond to the most consequential failure points in the AI knowledge supply chain. A sixth phase, Maintenance, Monitoring & Improvement, is surfaced in the report's prompts section so you can track whether AI assistants find your content over time.

The result is a structured report: an overall score out of 10, a Corpus Survival Likelihood rating, per-dimension scores with specific findings, and a prioritised set of recommendations tied to the AI Knowledge Signal Framework.

Each of the five dimensions is scored 1–5. The overall score (1–10) is a weighted composite calculated as follows:

| Dimension | Weight | Rationale |
| --- | --- | --- |
| Crawlability & Technical Access | 15% | Binary prerequisite; failure here is catastrophic but rare |
| Structural Clarity & Machine Readability | 25% | Highest impact on AI interpretability and embedding quality |
| Knowledge Uniqueness & Contribution | 25% | Core epistemic value; determines signal vs. noise |
| Authority, Evidence & Trust | 20% | Structural proxy for quality in training pipeline filters |
| AI Usability & Retrieval Readiness | 15% | Enhancement layer; differentiating but not foundational |

The weighted 1–5 average is mapped to a true 1–10 scale using the formula: round((weighted_avg − 1) × 2.25 + 1). A page scoring 1/5 on every dimension therefore receives 1/10, and a page scoring 5/5 on every dimension receives 10/10.

Score interpretation:

  • 1–2: critical failures across nearly all dimensions
  • 3–5: below threshold, significant weaknesses, corpus survival unlikely without remediation
  • 6–7: adequate, passes minimum threshold, improvements recommended
  • 8–10: strong corpus survival likelihood

A score of 6 is a genuine passing grade: it means the content clears the minimum threshold, not that it is poor.
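The weighting and mapping described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the dictionary keys and function names are hypothetical, and the tie-breaking rule (round half up) is an assumption, since the formula only says "round".

```python
import math

# Weights from the table above, as integer percentages (they sum to 100).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "crawlability": 15,
    "structural_clarity": 25,
    "knowledge_uniqueness": 25,
    "authority_trust": 20,
    "ai_usability": 15,
}


def overall_score(dimension_scores: dict) -> int:
    """Map five 1-5 dimension scores to the 1-10 overall score."""
    for name in WEIGHTS:
        if not 1 <= dimension_scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    weighted_avg = sum(dimension_scores[n] * w for n, w in WEIGHTS.items()) / 100
    mapped = (weighted_avg - 1) * 2.25 + 1
    # Assumption: exact halves round up. Python's built-in round() would
    # round halves to the nearest even integer instead.
    return math.floor(mapped + 0.5)


def interpret(score: int) -> str:
    """Return the interpretation band for a 1-10 overall score."""
    if score <= 2:
        return "critical failures"
    if score <= 5:
        return "below threshold"
    if score <= 7:
        return "adequate"
    return "strong"
```

For example, scores of 5, 3, 2, 3 and 4 (in table order) give a weighted average of 3.2, which maps to 5.95 and rounds to an overall score of 6, the lowest passing band.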

AI training pipelines are not passive collectors. They apply aggressive quality filters at multiple stages — from initial crawl to corpus selection to deduplication. Content that fails at any stage is removed, downweighted, or misrepresented in the final model.

Corpus Survival Likelihood is an assessment of whether your content would survive this entire supply chain — from being found by a crawler through to being retained in a training corpus. It is expressed as High, Medium — At Risk, or Low — Likely Filtered.

Epistemic Risk measures how likely your content is to be misrepresented by AI systems — not just filtered, but actively distorted. This can happen when content uses terms loosely, buries claims in rhetorical language, relies heavily on metaphor, or leaves entity relationships ambiguous.

AI systems ingest text and produce embeddings. Content that is structurally ambiguous produces distorted embeddings — the model learns something, but it may not be what the author intended. High Epistemic Risk content is content that will most likely be misquoted, paraphrased incorrectly, or attributed the wrong meaning by AI systems.

This matters because misrepresentation can be more damaging than non-inclusion. If an AI system summarises your framework incorrectly and distributes that summary at scale, you have no ability to correct it.

Every piece of content plays a specific role in the knowledge ecosystem. The audit classifies your page into one of five roles:

  • Primary contribution — introduces new knowledge, frameworks, findings, or claims not found elsewhere
  • Secondary synthesis — summarises, compares, or analyses existing primary sources
  • Applied interpretation — applies established knowledge to a specific case or context
  • Reference page — defines terms, lists resources, or serves as a structured reference
  • Unclear — the epistemic role cannot be identified from the content

Knowing your page's epistemic role helps you understand what Knowledge Uniqueness score to expect and what the most relevant improvements are. Primary contributions should score higher on uniqueness; reference pages should score higher on structure.

Writing quality and AI training readiness are not the same thing. AI training pipelines cannot evaluate content quality the way a human editor can. They rely on structural proxies: schema markup, heading hierarchy, author metadata, canonical URLs, citation patterns, and factual density.

A polished essay with no schema markup, no author attribution, and a derivative argument will score poorly on Authority, Evidence & Trust and Knowledge Uniqueness & Contribution — regardless of how well it is written. The audit is calibrated not to inflate scores for surface quality.

Check the specific dimension findings in your report. Low scores on Authority, Evidence & Trust usually indicate missing metadata. Low scores on Knowledge Uniqueness & Contribution usually indicate derivative content or no identifiable original contribution. Both are fixable through the framework recommendations in the report.

The URL audit works with any publicly accessible HTML page — articles, blog posts, documentation, case studies, whitepapers, product pages, homepages, or any knowledge-oriented web page that returns a standard HTTP response. The content type selector lets you tell the audit exactly what kind of page you are scoring so it calibrates its expectations accordingly.

The Chrome and Edge extension also lets you paste text or upload a document — useful for newsletters, social posts, or draft content that does not yet have a published URL. It accepts .txt, .md, .docx and .html files: the text is pulled out and scored against the four content dimensions. PDF uploads are not supported, because PDF text does not extract reliably (columns, tables, and scanned image-only pages come out garbled) — for an accurate score, paste the copy directly or upload a .docx instead. This is the draft-content path; it is not the same as the URL audit, which is still HTML-only by design.

The following will not produce a full audit result:

  • PDF files — the tool analyses HTML only; non-HTML content types return a Crawlability failure
  • Pages behind a login or paywall — if AI crawlers cannot access the content, neither can the audit
  • Pages that render content entirely via JavaScript — the audit scores the raw HTML; if your content only appears after JS execution, word count and structural signals will be very low
  • URLs returning 4xx or 5xx errors
  • Intranet or localhost addresses not reachable from the audit server

The content type selector tells the audit what kind of content you are scoring, so it can apply the right calibration. The framework's five dimensions mean different things in practice depending on the format, for example:

  • A homepage should not be penalised for low academic originality — it is an entry point, not a knowledge asset. The audit instead evaluates whether it communicates a clearly differentiated value proposition that AI systems can accurately represent.
  • A social post cannot have heading hierarchy or schema markup — applying the same structural expectations as a research paper would produce a meaningless score. The audit instead evaluates claim clarity, hook quality, and whether the post has a specific, quotable perspective.
  • A newsletter typically does not have Article schema or a datePublished field in HTML — so the audit evaluates sender identity, inline attribution, and insight density instead.
  • A whitepaper is held to the strictest expectations: abstract, named methodology, author credentials, cited primary sources, and clear primary findings are all required for a strong score.

The available content types are: Homepage / Landing Page, Blog Post / Article, Product or Pricing Page, Case Study / Customer Story, Documentation / How-to Guide, Whitepaper / Research Report, Newsletter / Email, Social Media Post, FAQ Page, Step-by-step Guide, Support Documentation, and Industry Insights.

If you leave it on Auto-detect, the audit infers the content type from the page itself — which works well for most cases. Declaring the type explicitly gives the most accurate and actionable results, particularly for content formats that sit outside the typical web-page rubric (newsletters, social posts, homepages).

The audit is designed to evaluate content as AI training pipelines encounter it on the web — which means evaluating HTML. PDFs require different parsing (text extraction, structure inference), carry different metadata conventions, and are handled inconsistently by web crawlers. A PDF can have excellent content and still fail on all the signals that matter to an AI training pipeline: no schema markup, no heading hierarchy in the HTML sense, no canonical URL, no meta description.

If your primary knowledge asset is a PDF, the most important AI training readiness action is to publish an equivalent HTML version. The framework addresses this in Phase 5 (AI Usability & Retrieval Readiness). An HTML page that references and summarises the PDF is what AI systems will ingest — and what this audit tool can help you improve.

The audit is a structured diagnostic, not a guarantee. Specific limitations to be aware of:

  • JavaScript-rendered content — pages that load content via client-side JavaScript will return sparse HTML. When word count after HTML stripping is between 50 and 199, the report includes a caution note that content may be incomplete. Technical (crawlability) checks still run accurately; content-dependent scores may be understated.
  • Long pages — the audit scores up to the first 9,000 characters of extracted text (approximately 1,500 words). For longer pages, the report includes a truncation notice so you know the Knowledge Uniqueness and AI Usability scores are based on a partial view.
  • Minimum content threshold — pages with fewer than 50 words after HTML stripping are rejected before analysis. A page with insufficient content cannot be meaningfully audited.
  • Single-page scope — the audit evaluates one URL at a time. Site-wide signals (domain authority, linking patterns, content breadth) are not captured.
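The word-count and truncation limits above can be illustrated with a minimal Python sketch. The thresholds (50 words, 199 words, 9,000 characters) come from the list; everything else is an assumption made for illustration, including the function name, the returned fields, the way text is stripped, and the choice to count words before truncating.

```python
import re
from html.parser import HTMLParser

MAX_CHARS = 9_000     # truncation limit stated above
MIN_WORDS = 50        # below this the page is rejected
CAUTION_BELOW = 200   # 50-199 words triggers the caution note


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def gate_content(html: str) -> dict:
    """Hypothetical gate: reject, caution, or accept based on stripped text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    words = len(text.split())
    if words < MIN_WORDS:
        return {"status": "rejected", "words": words, "truncated": False, "text": ""}
    return {
        "status": "caution" if words < CAUTION_BELOW else "ok",
        "words": words,
        "truncated": len(text) > MAX_CHARS,  # drives the truncation notice
        "text": text[:MAX_CHARS],
    }
```

A page whose body is injected entirely by a script yields almost no text under this approach, which is why JavaScript-only pages tend to fall into the rejected or caution bands.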

For pages the audit cannot process — PDFs, paywalled content, JS-only rendered pages — see What types of pages can I audit? above.

No. The audit is a diagnostic tool that identifies structural weaknesses and opportunities — it does not have access to the actual training pipelines of any AI company. AI training data selection involves many factors beyond any individual page's quality: dataset curation decisions, domain selection, deduplication at corpus scale, and evolving pipeline methodology.

What the audit can tell you is how your content compares against the observable structural criteria that training pipelines are known to favour. A high score means your content has the properties associated with corpus retention. A low score means it has properties associated with filtering or misrepresentation.

The goal is not guaranteed inclusion. The goal is removing the avoidable reasons for exclusion.

The audit uses four layers of accuracy control to ensure scores reflect the real content of your page — not the model's training memory, not a guess:

  • Server-side fetch first — the server fetches your URL directly before any AI analysis runs. The language model receives the live HTML at the time of the audit, not a cached version.
  • Deterministic pre-analysis — before the language model processes anything, server-side code extracts 19 structured facts from the raw HTML: HTTP status, canonical tag, heading counts, schema markup fields, author metadata, and more. These are passed as hard ground truth. The model is explicitly instructed not to re-derive facts the technical layer has already established.
  • Temperature zero: the API call runs at temperature 0, which minimises sampling variability. The same page submitted twice should produce the same or near-identical scores, so score changes between audits generally reflect genuine content changes rather than model randomness.
  • Response schema validation — after the model responds, the server validates every field: scores are clamped to valid ranges (1–5 per dimension, 1–10 overall), enum fields are normalised to their allowed values, and any deviation is logged. Out-of-schema output never reaches the client.
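The response-validation layer can be sketched as follows. This is a hypothetical illustration: the field names, the fallback to "Unclear", and the shape of the deviation log are assumptions. Only the score ranges and the five epistemic roles are taken from this page.

```python
# The five epistemic roles listed earlier on this page.
ALLOWED_ROLES = (
    "Primary contribution",
    "Secondary synthesis",
    "Applied interpretation",
    "Reference page",
    "Unclear",
)


def _clamp_int(value, low: int, high: int) -> int:
    """Coerce a value to an integer and clamp it to [low, high]."""
    try:
        number = int(round(float(value)))
    except (TypeError, ValueError, OverflowError):
        return low  # assumption: unparseable scores fall back to the minimum
    return max(low, min(high, number))


def validate_response(raw: dict):
    """Return (clean_response, deviations) for a hypothetical model output."""
    deviations = []
    clean = {"dimensions": {}}

    for name, value in raw.get("dimensions", {}).items():
        fixed = _clamp_int(value, 1, 5)
        if fixed != value:
            deviations.append(f"dimensions.{name}: {value!r} -> {fixed}")
        clean["dimensions"][name] = fixed

    overall = _clamp_int(raw.get("overall"), 1, 10)
    if overall != raw.get("overall"):
        deviations.append(f"overall: {raw.get('overall')!r} -> {overall}")
    clean["overall"] = overall

    # Normalise the enum case-insensitively; unknown values become "Unclear".
    role = str(raw.get("epistemic_role", "")).strip().lower()
    matched = next((r for r in ALLOWED_ROLES if r.lower() == role), "Unclear")
    if matched != raw.get("epistemic_role"):
        deviations.append(f"epistemic_role: {raw.get('epistemic_role')!r} -> {matched}")
    clean["epistemic_role"] = matched

    return clean, deviations
```

Only the cleaned dictionary would be returned to the client; the deviation list is what a server would write to its logs.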

The only fields that involve model interpretation — rather than deterministic code — are the short natural-language summaries (main claim, corpus survival reason, key findings). These are bounded by the actual fetched text and constrained to specific word counts. They are interpretations of what was found, not inventions.

After making changes recommended in an audit report, re-audit the page to confirm the improvements have been reflected. Beyond that, the framework recommends treating AI knowledge publication as an ongoing practice rather than a one-time task (Phase 6: Maintenance, Monitoring & Improvement).

Re-audit when you make significant structural changes to a page, update your schema markup or metadata, publish new sections or substantially revise the content, or want to benchmark a page ahead of a new AI training window (typically every 6–12 months for major model updates).

No. The URL you submit and the content extracted from it are used only to generate the audit response and are not stored on our servers after the result is returned. The URL and page content are passed to the Anthropic Claude API as part of the analysis. Anthropic processes this subject to their API usage policy. See our Privacy Policy for full details.

Transparency & Limits

When you submit a URL, our server issues a single HTTP GET request to that URL. The request uses a standard browser User-Agent and set of request headers (Accept, Accept-Language, Sec-Fetch-*) — the same headers a real Chrome browser sends on its first page load. We do this so that enterprise WAFs, CDNs, and bot-protection services do not silently reject the fetch and return a challenge page instead of your content. A tool that is blocked before it reaches your origin cannot audit your page.

We also make a separate, identifying request to /robots.txt on your origin using the User-Agent Mozilla/5.0 (compatible; AIKnowledgeAudit/1.0; +https://aiknowledgesignal.io). This request is transparent — if you inspect your logs you will see exactly who audited you. robots.txt is a public file intended to be fetched by crawlers, so the honest User-Agent is the correct choice there.

The page fetch has a 12-second timeout. The robots.txt fetch has a 5-second timeout. Both run in parallel, so the robots.txt result is usually already in memory by the time the page fetch completes.
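The parallel fetch with separate timeouts might look like the sketch below. It assumes an asyncio-style server and takes the two fetchers as injected coroutine functions (`fetch_page` and `fetch_robots` are hypothetical names), so no particular HTTP client is implied. A timed-out fetch is represented here as `None`, which is also an assumption.

```python
import asyncio

PAGE_TIMEOUT_S = 12    # page fetch timeout stated above
ROBOTS_TIMEOUT_S = 5   # robots.txt fetch timeout stated above


async def _with_timeout(coro, seconds: float):
    """Await a coroutine, returning None if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def fetch_both(fetch_page, fetch_robots,
                     page_timeout: float = PAGE_TIMEOUT_S,
                     robots_timeout: float = ROBOTS_TIMEOUT_S):
    """Run both fetches concurrently, each under its own timeout."""
    page, robots = await asyncio.gather(
        _with_timeout(fetch_page(), page_timeout),
        _with_timeout(fetch_robots(), robots_timeout),
    )
    return page, robots
```

Because both coroutines start together, the total wait is bounded by the slower of the two timeouts rather than their sum.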

The free audit allows up to 10 requests per hour per IP address. Each audit response includes standard rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) and the audit tool displays your remaining quota beneath the result. When the quota resets, you can run more audits on the same IP.
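A per-IP hourly quota with these headers can be sketched as a fixed-window counter. The windowing strategy is an assumption (the page does not say whether the real limiter uses fixed or sliding windows), as are the class name and the choice to express X-RateLimit-Reset as a Unix timestamp.

```python
import time


class FixedWindowLimiter:
    """Hypothetical in-memory limiter: `limit` requests per window per IP."""

    def __init__(self, limit: int = 10, window_s: int = 3600, clock=time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._windows = {}  # ip -> (window_start, request_count)

    def check(self, ip: str):
        """Return (allowed, headers) and record the request if allowed."""
        now = self.clock()
        start, count = self._windows.get(ip, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # window expired: start a fresh one
        allowed = count < self.limit
        if allowed:
            count += 1
        self._windows[ip] = (start, count)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - count),
            # Assumption: reset is reported as epoch seconds.
            "X-RateLimit-Reset": str(int(start + self.window_s)),
        }
        return allowed, headers
```

Under this sketch, the eleventh request inside one hour is refused with X-RateLimit-Remaining at 0, and the counter starts again once the window has elapsed.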

If you need higher volume — for example, auditing every page on a large site — the full framework download includes the complete SKILL.md, the GEO Framework PDF, and the How-To Guide for running structural checks yourself without using the hosted tool. The Chrome and Edge extension (included with a subscription) supports URL scans, paste-and-score, and document upload without the free audit's hourly cap, which is useful both for volume and for draft content not yet published to a URL.

We are deliberately specific about what this tool is and is not, because over-claiming damages the signal the tool is designed to produce. The audit does not:

  • Guarantee inclusion in AI training corpora. We do not have access to the training pipelines of OpenAI, Anthropic, Google, or any other provider. We evaluate your content against the observable, published criteria those pipelines are known to use. A high score is strong structural evidence, not a commitment.
  • Execute JavaScript. The standard audit reads static HTML — the same content that Common Crawl, GPTBot, CCBot, and ClaudeBot see. If your content is rendered entirely client-side, the audit flags this as low content or a JavaScript-render warning rather than scoring a partial view silently. (A Playwright-based visual audit is planned as a premium tier for sites where this matters.)
  • Bypass WAFs, CAPTCHAs, or paywalls. If your site is behind Cloudflare Bot Management, Datadome, PerimeterX, Akamai Bot Manager, a login, or an IP allowlist, the audit reports that specifically and names the provider — but it will not attempt to circumvent the protection. AI training crawlers face the same block, so the correct remediation is to whitelist AI crawlers at your CDN or remove the gate, not to bypass it at audit time.
  • Crawl your whole site. One URL per audit. Site-wide signals (domain authority, internal linking, sitemap coverage, breadth of content) are out of scope. The framework download covers these at the whole-of-entity level.
  • Score PDFs or non-HTML formats. The audit is HTML-only by design — we score the URL exactly as an AI training crawler sees it. PDFs need an equivalent HTML publication to be auditable.
  • Score pages longer than ~1,500 words in full. Content after the first 9,000 characters of stripped text is truncated before analysis. When this happens, the report includes a truncation notice so you know the Knowledge Uniqueness and AI Usability scores reflect a partial view.
  • Provide real-time search-engine or AI-response data. The audit is a structural diagnostic of your page, not a measurement of current AI visibility. If you want to see how your domain is cited in AI answers, that is a separate research activity using tools like Ahrefs Brand Radar or direct testing against ChatGPT/Claude/Gemini.
  • Replace human editorial judgement. The model scores structural and epistemic signals; it cannot replace a human subject-matter expert reviewing your claims for accuracy. Use the audit to catch structural problems; use a human reviewer to catch factual ones.

Rather than returning a generic error, the audit classifies the specific block type and shows you exactly what happened. Common categories:

  • Cloudflare bot challenge — your site returned a Cloudflare challenge page. Remediation is in your Cloudflare dashboard under Security → Bots.
  • Datadome / PerimeterX / Akamai challenge — similar to Cloudflare, each provider has its own allowlist configuration.
  • WAF IP allowlist — your CDN is rejecting requests by source IP. AI training crawlers originating from cloud infrastructure face the same block.
  • Authentication required (HTTP 401) — the page is behind a login. AI crawlers cannot access it either.
  • Login form with short body — the page looks like a gated content page (password input + little body copy).
  • X-Robots-Tag: noai / noimageai / noindex — the site owner has explicitly declared this content as opted out of AI ingestion via HTTP header. The audit respects this and surfaces the specific directives found.
  • Target rate-limited (HTTP 429) — your site is rate-limiting our fetch. Wait and retry, or check your WAF's rate-limit rules.
  • Low content: the page returned HTML but contains fewer than 50 words of body text after stripping. Usually a JavaScript-rendered page or an empty shell.

Each category maps to a specific panel with a concrete next step. The goal is that every non-success outcome is actionable — you know what type of block you hit, why an AI training crawler would face the same block, and what to do about it.
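A few of the categories above can be told apart from the status code and page body alone. The sketch below is a simplified illustration: the category labels, the check order, and the 100-word ceiling for "little body copy" are hypothetical. Provider-specific challenge detection (Cloudflare, Datadome, PerimeterX, Akamai) and header-based opt-outs are left out because their signatures are not described here.

```python
import re
from typing import Optional

MIN_WORDS = 50              # low-content threshold from the list above
LOGIN_BODY_MAX_WORDS = 100  # assumption: what counts as "little body copy"

_PASSWORD_INPUT = re.compile(r"<input[^>]*type\s*=\s*[\"']?password", re.IGNORECASE)


def classify_block(status: int, html: str, body_text: str) -> Optional[str]:
    """Return a block category label, or None if the page looks auditable."""
    words = len(body_text.split())
    if status == 401:
        return "auth_required"
    if status == 429:
        return "target_rate_limited"
    if _PASSWORD_INPUT.search(html) and words < LOGIN_BODY_MAX_WORDS:
        return "login_form"
    if words < MIN_WORDS:
        return "low_content"
    return None
```

Checking the login form before the low-content rule means a short sign-in page is reported as gated rather than merely empty, which points to a different fix.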

Yes. The audit fetches your robots.txt and parses rules for User-agent: * and the 11 named AI-training crawlers (GPTBot, CCBot, ClaudeBot, Anthropic-AI, Google-Extended, Cohere-AI, Bytespider, PerplexityBot, Diffbot, OmgiliBot, ImageSiftBot). If any of these directives disallow the path, the audit surfaces this as a hard Crawlability signal — you cannot score well if the major AI crawlers are explicitly told not to fetch your page.
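A check of this kind can be approximated with Python's standard-library `urllib.robotparser`. This is a sketch under assumptions: the function name is hypothetical, and the standard-library matcher may resolve edge cases (wildcards, rule precedence) differently from the audit's own parser.

```python
from urllib.robotparser import RobotFileParser

# The 11 named AI-training crawlers listed above.
AI_CRAWLERS = [
    "GPTBot", "CCBot", "ClaudeBot", "Anthropic-AI", "Google-Extended",
    "Cohere-AI", "Bytespider", "PerplexityBot", "Diffbot", "OmgiliBot",
    "ImageSiftBot",
]


def blocked_agents(robots_txt: str, url: str) -> list:
    """Return the user agents ("*" plus the AI crawlers) disallowed for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        agent
        for agent in ["*"] + AI_CRAWLERS
        if not parser.can_fetch(agent, url)
    ]
```

A robots.txt containing `User-agent: GPTBot` followed by `Disallow: /` makes this function report GPTBot as blocked for every path, while crawlers without a named group fall back to the `User-agent: *` rules.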

We also parse the X-Robots-Tag HTTP response header for AI-opt-out directives including noai, noimageai, noindex, noml, and none. When any of these are present the audit rejects before analysis and tells you exactly which directive your server is sending.
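Detecting these directives in an X-Robots-Tag header can be sketched as below. The directive list comes from the paragraph above; the function name and the simplified tokenising (split on commas, ignore any user-agent prefix before the last colon) are assumptions.

```python
# AI-opt-out directives named above.
AI_OPT_OUT_DIRECTIVES = ("noai", "noimageai", "noindex", "noml", "none")


def opt_out_directives(header_values: list) -> list:
    """Return opt-out directives found across X-Robots-Tag header values."""
    found = []
    for value in header_values:
        for token in value.split(","):
            # "googlebot: noindex" -> "noindex"; plain "noai" is unchanged.
            directive = token.rsplit(":", 1)[-1].strip().lower()
            if directive in AI_OPT_OUT_DIRECTIVES and directive not in found:
                found.append(directive)
    return found
```

A non-empty result is the condition under which the audit would stop before analysis and report the directives it found.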

We do not ignore these signals to "audit anyway". If you have explicitly told AI crawlers not to use your content, we treat that as the authoritative answer for AI ingestion and do not overwrite it with a Claude score.

The audit runs a weekly regression test against a curated list of 50 Australian enterprise URLs spanning jobs, property, automotive, travel, SaaS, telco, health insurance, industrial, FMCG, agri, medical, construction, engineering, IT services, retail, and marketplace sectors. A 10-URL random subset is tested each Monday morning (09:00 AEST) using GitHub Actions. Each run's results are committed back to the repository as a dated JSON file, creating a permanent, auditable history of how the tool has behaved over time.

If a URL that worked last week is now blocked — for example because a site adopted a new WAF — the run fails CI and automatically opens a tracked issue with the specific diff. This means regressions become visible within a week, and the fix is tied to the specific sector or provider change that caused it. The entire testing harness (product/tests/audit-smoke.mjs, audit-smoke-diff.mjs, .github/workflows/audit-smoke.yml) is in our public repository.

The full technical specification — how dimensions are scored, what the anti-hallucination architecture does, every reason code the tool can emit, every Change Log entry — is documented in product/audit.md in the repository. This is our intellectual property and it is deliberately transparent. An audit tool that cannot explain itself is not a diagnostic; it is a guess.

The AI Knowledge Signal Framework

The AI Knowledge Signal Framework is a 6-phase methodology for producing, structuring, and publishing content in a way that maximises its likelihood of being accurately ingested and represented by AI training systems.

The six phases are run in order, each one removing a different reason AI systems ignore, misquote, or filter out content:

  1. Phase 1 — Crawlability & Technical Access. Can AI systems reach and ingest your content? Covers robots.txt, llms.txt, sitemaps, status codes, login walls, and JavaScript rendering.
  2. Phase 2 — Structural Clarity & Machine Readability. Is the page shaped so machines can parse it cleanly? Covers heading hierarchy, opening thesis, definitions, paragraph density, and clean HTML.
  3. Phase 3 — Knowledge Uniqueness & Contribution. Does the content add original signal worth retaining? Covers content type, primary vs. derivative material, defined contributions, and novelty.
  4. Phase 4 — Authority, Evidence & Trust. Are the claims provably credible? Covers named authorship, schema markup, citation patterns, source tiers, and epistemic honesty.
  5. Phase 5 — AI Usability & Retrieval Readiness. Can AI assistants cite and retrieve your content accurately? Covers structured data, claim density, terminological consistency, and reusable publishing formats.
  6. Phase 6 — Maintenance, Monitoring & Improvement. Are you tracking and refreshing over time? Covers re-auditing, prompt monitoring, content refreshes, and distribution.

The Phase Priorities section in the AI Knowledge Signal Framework identifies the phases that are evidence-based and have the biggest impact.

For example, if your page has no schema markup and no author attribution, Phase Priorities will reference Phase 4 (Authority, Evidence & Trust) and Phase 1 (Crawlability & Technical Access). The full framework contains the specific step-by-step guidance for addressing those gaps.

The Phase Priorities are designed to make the report immediately actionable — you can take the phase references directly into the full framework and find the exact steps required.

SEO tools optimise content for traditional search engines: ranking signals, keyword density, backlink profiles, and click-through rates. These objectives overlap somewhat with AI training readiness but are fundamentally different.

AI training pipelines are not search engines. They do not rank pages or respond to click signals. They filter and select content for inclusion in training corpora based on structural quality, originality, authority signals, and epistemic clarity. Content that performs well in SEO (high keyword density, lots of internal links, broad reach) can simultaneously score very poorly on AI training readiness (low epistemic uniqueness, low factual density, derivative argument).

The data makes this gap concrete: only 12% of URLs cited in AI-generated answers overlap with Google's top 10 organic results, across a dataset of 15,000 prompts tested against ChatGPT, Gemini, and Copilot (Ahrefs, 2025). Perplexity is the outlier at roughly 1 in 3, but even that means two-thirds of its citations fall outside traditional rankings entirely. Meanwhile, structured GEO (generative engine optimisation) publication has been shown to increase visibility in AI-generated responses by up to 40% (arXiv, 2023).

GEO is not a website optimisation strategy; it is a whole-of-entity strategy. AI systems synthesise signals from your entire digital presence: your website, LinkedIn, YouTube, third-party reviews, and forum discussions. The target is not a URL. It is the web's consensus about you.

How AI Training Pipelines Work

AI systems are not trained on the web — they are trained on a highly compressed, filtered, and biased representation of it. The journey from a web page or other digital content to a model's knowledge runs through five sequential stages, each one a filter:

  1. Raw crawl: web crawlers collect HTML, JavaScript, PDFs, images, and metadata. Everything is gathered indiscriminately at this point.
  2. Filtering & curation: this is the decisive stage. Quality classifiers, heuristics (length, repetition, entropy), and source whitelists remove content that fails to meet structural standards. More than 90% of crawled content is discarded here. Most business websites do not survive this step.
  3. Tokenisation: only after filtering does text become the actual model input format. Algorithms like BPE (Byte Pair Encoding) convert cleaned text into sequences of integer token IDs. The sentence “AI visibility is driven by structure and trust” becomes something like [1543, 9821, 318, 7421, 416, 2937]. Vocabularies are model-specific.
  4. Model training: token sequences feed into the neural network. Weights update via gradient descent. The model's knowledge lives in its weight matrices, not in stored text or fixed embeddings. Embeddings created during the forward pass are temporary computational intermediates.
  5. Embedding usage: embeddings appear as persistent representations in post-training retrieval systems (RAG, vector search). This is architecturally separate from base model training.

Content is ruthlessly filtered before it ever has a chance to influence AI systems. Surviving the pipeline requires structural properties — not just good writing.

The cumulative discard rate across a typical training pipeline looks approximately like this:

  • Raw crawl: 100% collected
  • After HTML cleaning and deduplication: ~30–40% remains
  • After quality filtering: ~5–10% remains
  • After final curation: ~1–5% remains
  • Effective influence on model behaviour: well below 1%

This is by design. AI training datasets are precision instruments, not archives. Higher-quality sources are upsampled to further increase their influence relative to their raw volume — meaning the actual representation gap between well-structured and poorly-structured content is larger than the raw discard rates suggest.

No — this is a common misconception. During training, tokens are converted to embedding vectors as a computational intermediate. These embeddings pass through transformer layers and are then discarded. They are not stored in the model.

The model's “knowledge” is encoded in its weight matrices. When you interact with a language model, it is not retrieving stored text — it is generating responses by running input tokens through layers of learned weights.

Embeddings as persistent stored representations only appear in retrieval systems — RAG (Retrieval-Augmented Generation) and vector databases — which are architecturally separate from the base model. Optimising for base model training and optimising for RAG retrieval are related but distinct objectives.

Quality filters look for structural proxies of reliability and information density — not writing fluency. The properties that consistently correlate with corpus retention:

  • Explicit authorship — named author with verifiable credentials or institutional affiliation
  • Schema markup — structured metadata that machine parsers can extract without ambiguity
  • Clear heading hierarchy — H1/H2/H3 structure that signals document organisation
  • Factual density — specific claims, cited sources, named entities — not generic assertions
  • Low boilerplate ratio — minimal navigation, footer, and ad text relative to body content
  • Original contribution — argument, data, or framework not duplicated across thousands of other pages
  • Clean HTML — content that renders meaningfully without JavaScript execution

Most content fails AI training pipelines not because it is wrong, but because it lacks these structural signals. That is a fixable problem — which is what the AI Knowledge Signal Framework and the free audit tool address.
