Question 1

What does the AI Knowledge Signal audit actually do?

Accepted Answer

The audit fetches the page at the URL you submit and evaluates it across five dimensions that correspond to the most consequential failure points in the AI knowledge supply chain: Crawlability & Technical Access; Structural Clarity & Machine Readability; Knowledge Uniqueness & Contribution; Authority, Evidence & Trust; and AI Usability & Retrieval Readiness. The framework's sixth phase, Maintenance, Monitoring & Improvement, is surfaced in the report's prompts section so you can track whether AI assistants find your content over time.

The result is a structured report: an overall score out of 10, a Corpus Survival Likelihood rating, per-dimension scores with specific findings, and a prioritised set of recommendations tied to the AI Knowledge Signal Framework.

Question 2

What does Corpus Survival Likelihood mean?

Accepted Answer

AI training pipelines apply aggressive quality filters at multiple stages — from initial crawl to corpus selection to deduplication. Corpus Survival Likelihood is an assessment of whether your content would survive this entire supply chain. It is expressed as High, Medium — At Risk, or Low — Likely Filtered.

Question 3

What is Epistemic Risk and why does it matter?

Accepted Answer

Epistemic Risk measures how likely your content is to be misrepresented by AI systems: not just filtered out, but actively distorted. This happens when content uses terms loosely, buries claims in rhetorical language, relies heavily on metaphor, or leaves entity relationships ambiguous. Content with high Epistemic Risk is the content most likely to be misquoted, paraphrased incorrectly, or assigned the wrong meaning by AI systems.

Question 4

Why is my score low if my content is well-written?

Accepted Answer

Writing quality and AI training readiness are not the same thing. AI training pipelines rely on structural proxies: schema markup, heading hierarchy, author metadata, canonical URLs, citation patterns, and factual density. A polished essay with no schema markup, no author attribution, and a derivative argument will score poorly on Authority, Evidence & Trust and Knowledge Uniqueness & Contribution regardless of how well it is written.

Question 5

What types of pages can I audit?

Accepted Answer

The URL audit works with any publicly accessible HTML page — articles, blog posts, documentation, case studies, whitepapers, product pages, homepages, or any knowledge-oriented web page that returns a standard HTTP response. The content type selector lets you tell the audit what kind of page you are scoring so it calibrates its expectations accordingly. The Chrome and Edge extension also lets you paste or upload raw text for content that does not yet have a published URL — useful for newsletters, social posts, or drafts. PDFs, paywalled pages, JavaScript-only rendered pages, and URLs returning errors will not produce a full audit result.

Question 6

What does the content type selector do, and when should I use it?

Accepted Answer

The content type selector tells the audit what kind of content you are scoring so it can apply the right calibration to each dimension. A homepage should not be penalised for low academic originality: it is an entry point, not a knowledge asset. A social post cannot have a heading hierarchy, so applying the same structural expectations as a research paper would produce a meaningless score. A whitepaper is held to the strictest standards: abstract, named methodology, author credentials, and cited primary sources. The available types are Homepage / Landing Page, Blog Post / Article, Product or Pricing Page, Case Study / Customer Story, Documentation / How-to Guide, Whitepaper / Research Report, Newsletter / Email, and Social Media Post. Leave the selector on Auto-detect and the audit infers the type from the page, which works well in most cases. Declaring the type explicitly gives the most accurate results for formats that sit outside the typical web-page rubric.

Question 7

Does a high score guarantee my content will be in AI training data?

Accepted Answer

No. The audit is a diagnostic tool that identifies structural weaknesses and opportunities. AI training data selection involves many factors beyond any individual page's quality: dataset curation decisions, domain selection, deduplication at corpus scale, and evolving pipeline methodology. The goal is not guaranteed inclusion — the goal is removing the avoidable reasons for exclusion.

Question 8

What is the AI Knowledge Signal Framework?

Accepted Answer

The AI Knowledge Signal Framework is a 6-phase methodology for producing, structuring, and publishing content in a way that maximises its likelihood of being accurately ingested and represented by AI training systems. The six phases are run in order: Phase 1 — Crawlability & Technical Access (can AI systems reach your content); Phase 2 — Structural Clarity & Machine Readability (is your content shaped so machines can parse it); Phase 3 — Knowledge Uniqueness & Contribution (does your content add original signal); Phase 4 — Authority, Evidence & Trust (are your claims provably credible); Phase 5 — AI Usability & Retrieval Readiness (can AI cite and retrieve your content accurately); and Phase 6 — Maintenance, Monitoring & Improvement (are you tracking, refreshing, and improving over time).

Question 9

How is this different from SEO tools?

Accepted Answer

SEO tools optimise content for the search engines people use: ranking signals, keyword density, backlink profiles, and click-through rates. AI training pipelines are not search engines. They filter and select content for inclusion in training corpora based on structural quality, originality, authority signals, and epistemic clarity. Content that performs well in SEO can simultaneously score very poorly on AI training readiness. The AI Knowledge Signal audit evaluates your content against AI training criteria specifically.

Question 10

What does Phase Priorities in the report mean?

Accepted Answer

Each of the five audit dimensions corresponds to one of the first five phases of the six-phase framework. The Phase Priorities section in the audit report identifies the phases most relevant to your page's specific weaknesses, making the report immediately actionable.

Question 11

How does content actually get from the web into an AI model?

Accepted Answer

AI systems are not trained on the web — they are trained on a highly compressed, filtered, and biased representation of it. The pipeline runs through five sequential stages: (1) Raw crawl — indiscriminate collection of HTML, PDFs, and metadata; (2) Filtering and curation — quality classifiers and heuristics discard more than 90% of crawled content; most business websites fail here; (3) Tokenisation — surviving text is converted to integer token sequences using algorithms like BPE; (4) Model training — token sequences update neural network weights via gradient descent; the model's knowledge lives in weights, not stored text; (5) Embedding usage — embeddings appear as persistent representations in retrieval systems (RAG, vector search), which are architecturally separate from base model training.
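To make stage (3) more concrete, here is a toy sketch of a single byte-pair-encoding (BPE) merge step followed by a mapping to integer IDs. It is a simplified illustration, not the tokeniser of any particular model; the sample string and the function names are made up for this example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters, then apply one merge.
tokens = list("abababc")
pair = most_frequent_pair(tokens)
tokens = merge_pair(tokens, pair)

# Map each distinct token to an integer ID (toy vocabulary).
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]
```

A real tokeniser repeats the merge step many thousands of times over a large corpus and stores the learned merges; the sketch only shows the shape of the idea.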

Question 12

How much of the web is actually used in AI training data?

Accepted Answer

Cumulative discard rates across a typical AI training pipeline leave less than 5% of crawled web content in curated training corpora — and effective influence on model behaviour is well below 1%, because higher-quality sources are upsampled to dominate training. The discard rate is by design: AI training datasets are precision instruments. Understanding where in the pipeline your content is excluded is the starting point for a content strategy that addresses AI representation.
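The way discard rates compound can be shown with simple arithmetic. The per-stage survival fractions below are invented purely for illustration and are not measurements from any real pipeline; only the multiplication is the point.

```python
from math import prod

# Hypothetical per-stage survival fractions (illustrative assumptions only).
stage_survival = {
    "language_and_format_filter": 0.50,
    "quality_classifier": 0.20,
    "deduplication": 0.40,
}

def cumulative_survival(rates):
    """Multiply per-stage survival fractions to get the share that passes every stage."""
    return prod(rates.values())

share = cumulative_survival(stage_survival)
print(f"{share:.1%} of crawled content survives all stages")
```

Even when each individual stage looks moderate, the product falls quickly, which is why knowing the stage at which a page is dropped matters more than the headline percentage.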

Question 13

What structural properties help content survive AI training pipeline filters?

Accepted Answer

Quality filters look for structural proxies of reliability and information density: explicit authorship with verifiable credentials, schema markup, clear heading hierarchy (H1/H2/H3), factual density with cited sources, low boilerplate ratio, original contribution not duplicated across other pages, and clean HTML that renders without JavaScript. Most content fails AI training pipelines not because it is wrong, but because it lacks these structural signals — a fixable problem that the AI Knowledge Signal Framework addresses systematically.
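As a rough self-check, a few of these signals can be detected from static HTML with Python's standard library. This is a minimal sketch under stated assumptions: the class name, the signal names, and the choice of checks are hypothetical and do not reproduce the audit's actual rules.

```python
from html.parser import HTMLParser

class SignalScanner(HTMLParser):
    """Collects a few structural signals from static HTML (no JavaScript is run)."""

    def __init__(self):
        super().__init__()
        self.headings = []          # heading tags in document order
        self.has_json_ld = False    # <script type="application/ld+json"> seen
        self.has_author = False     # <meta name="author" content="..."> seen
        self.has_canonical = False  # <link rel="canonical" href="..."> seen

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("h1", "h2", "h3"):
            self.headings.append(tag)
        elif tag == "script" and a.get("type") == "application/ld+json":
            self.has_json_ld = True
        elif tag == "meta" and a.get("name") == "author" and a.get("content"):
            self.has_author = True
        elif tag == "link" and a.get("rel") == "canonical" and a.get("href"):
            self.has_canonical = True

def scan(html):
    """Return a dict of boolean signals for one HTML document."""
    scanner = SignalScanner()
    scanner.feed(html)
    scanner.close()
    return {
        "single_h1": scanner.headings.count("h1") == 1,
        "schema_markup": scanner.has_json_ld,
        "author_metadata": scanner.has_author,
        "canonical_url": scanner.has_canonical,
    }

sample = """<html><head>
<meta name="author" content="Jane Example">
<link rel="canonical" href="https://example.com/post">
<script type="application/ld+json">{"@type": "Article"}</script>
</head><body><h1>Title</h1><h2>Section</h2></body></html>"""
report = scan(sample)
```

Signals such as factual density or originality cannot be checked this mechanically; the sketch covers only the markup-level ones.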

Question 14

How does the audit tool actually fetch my page?

Accepted Answer

The server issues a single HTTP GET request with a standard browser User-Agent and browser-style request headers (Accept, Accept-Language, Sec-Fetch-*) so enterprise WAFs and bot-protection services do not silently reject the fetch. A separate, transparent request to /robots.txt uses the identifying User-Agent Mozilla/5.0 (compatible; AIKnowledgeAudit/1.0; +https://aiknowledgesignal.io) so site owners can see who audited them. Page fetch has a 12-second timeout; robots.txt has a 5-second timeout; both run in parallel.
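The fetch pattern described above can be approximated with the standard library. This is a sketch, not the audit's server code: the browser header values are assumed examples (the FAQ names the header types but not their values), and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin
from urllib.request import Request, urlopen

PAGE_TIMEOUT_S = 12   # page fetch timeout stated in the answer above
ROBOTS_TIMEOUT_S = 5  # robots.txt timeout stated in the answer above
AUDIT_UA = "Mozilla/5.0 (compatible; AIKnowledgeAudit/1.0; +https://aiknowledgesignal.io)"

# Assumed browser-style values; the real header set is not published here.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-AU,en;q=0.9",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
}

def build_requests(url):
    """Build the page request and the separate, self-identifying robots.txt request."""
    page = Request(url, headers=BROWSER_HEADERS, method="GET")
    robots = Request(urljoin(url, "/robots.txt"),
                     headers={"User-Agent": AUDIT_UA}, method="GET")
    return page, robots

def _read(request, timeout):
    # The timeout applies to each blocking socket operation, not the whole transfer.
    with urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read()

def fetch_page_and_robots(url):
    """Run both requests in parallel; HTTP errors propagate as exceptions."""
    page_req, robots_req = build_requests(url)
    with ThreadPoolExecutor(max_workers=2) as pool:
        page_future = pool.submit(_read, page_req, PAGE_TIMEOUT_S)
        robots_future = pool.submit(_read, robots_req, ROBOTS_TIMEOUT_S)
        return page_future.result(), robots_future.result()

# Building the requests does not touch the network.
page, robots = build_requests("https://example.com/blog/post?x=1")
```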

Question 15

How many free audits do I get?

Accepted Answer

The free audit allows up to 10 requests per hour per IP address. Each audit response includes standard rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) and the audit tool displays your remaining quota beneath the result. When the quota window resets, additional audits can run on the same IP. The Chrome and Edge extension, included with a subscription, requires account sign-in and is not subject to the free audit's hourly cap — subscribers can score URLs, pasted text, and uploaded documents without the 10-per-hour limit.
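If you call the audit programmatically, the rate-limit headers can be read from the response like this. The sketch assumes only the three header names given above; the format of the reset value is not specified in this FAQ, so it is returned unparsed, and the sample values are invented.

```python
def remaining_quota(headers):
    """Read the rate-limit headers from a response header mapping (case-insensitive)."""
    lower = {k.lower(): v for k, v in headers.items()}
    return {
        "limit": int(lower["x-ratelimit-limit"]),
        "remaining": int(lower["x-ratelimit-remaining"]),
        # Format of the reset value is not documented here, so keep it as a string.
        "reset": lower["x-ratelimit-reset"],
    }

# Invented example values for illustration.
example_headers = {
    "X-RateLimit-Limit": "10",
    "X-RateLimit-Remaining": "7",
    "X-RateLimit-Reset": "1700000000",
}
quota = remaining_quota(example_headers)
can_audit_again = quota["remaining"] > 0
```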

Question 16

What does the audit tool not do?

Accepted Answer

The audit does not guarantee inclusion in AI training corpora, does not execute JavaScript (standard audit reads static HTML only), does not bypass WAFs/CAPTCHAs/paywalls, does not crawl your whole site (one URL per audit), does not score PDFs or non-HTML formats, does not score content beyond the first 9,000 characters of stripped text, does not provide real-time AI-response visibility data, and does not replace human editorial judgement on factual accuracy.
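To see what the 9,000-character limit means for a long page, you can approximate the stripped text yourself. This is a rough sketch: the exact stripping rules the audit uses are not documented here, so skipping script and style content and collapsing whitespace are assumptions.

```python
from html.parser import HTMLParser

MAX_CHARS = 9_000  # limit stated in the answer above

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content (assumed rule)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def scored_text(html):
    """Return whitespace-normalised text, truncated to the first MAX_CHARS characters."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(" ".join(extractor.parts).split())
    return text[:MAX_CHARS]

long_page = "<html><body><script>var x=1;</script><p>" + "word " * 3000 + "</p></body></html>"
kept = scored_text(long_page)
```

In practice this means the claims you most want scored should appear early in the page rather than after long navigation or preamble text.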

Question 17

What happens if my site blocks the audit?

Accepted Answer

The audit classifies the specific block type and returns a named reason code: Cloudflare block (cloudflare_block), Datadome/PerimeterX/Akamai challenge (bot_challenge), WAF IP allowlist (waf_allowlist), authentication required (auth_required), login form with short body (login_form), X-Robots-Tag AI opt-out (ai_blocked_header), target rate limit (target_rate_limited), or low content (low_content). Each reason code maps to a specific remediation panel that explains why AI training crawlers would face the same block.
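If you process audit results in your own tooling, the reason codes can be kept in a simple lookup table. The codes and descriptions below come from the answer above; the shape of the result object, including the "reason" field name, is a hypothetical assumption for illustration.

```python
# Reason codes and descriptions as listed in the answer above.
BLOCK_REASONS = {
    "cloudflare_block": "Cloudflare block",
    "bot_challenge": "Datadome/PerimeterX/Akamai challenge",
    "waf_allowlist": "WAF IP allowlist",
    "auth_required": "Authentication required",
    "login_form": "Login form with short body",
    "ai_blocked_header": "X-Robots-Tag AI opt-out",
    "target_rate_limited": "Target rate limit",
    "low_content": "Low content",
}

def describe_block(result):
    """Turn a result dict into a readable message; 'reason' is an assumed field name."""
    reason = result.get("reason")
    if reason is None:
        return "No block reported"
    return BLOCK_REASONS.get(reason, f"Unrecognised reason code: {reason}")

message = describe_block({"reason": "bot_challenge"})
```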

Question 18

Does the audit respect robots.txt and AI opt-out signals?

Accepted Answer

Yes. The audit parses robots.txt for User-agent: * and 11 named AI training crawlers (GPTBot, CCBot, ClaudeBot, Anthropic-AI, Google-Extended, Cohere-AI, Bytespider, PerplexityBot, Diffbot, OmgiliBot, ImageSiftBot) and treats disallow rules as hard Crawlability signals. It also parses the X-Robots-Tag HTTP header for the noai, noimageai, noindex, noml, and none directives. When an AI opt-out is declared, the audit rejects the request before analysis rather than scoring the page anyway: the owner's declaration is the authoritative answer for AI ingestion.
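You can run a similar check on your own site with Python's built-in robots.txt parser. This is a sketch rather than the audit's implementation: the function names are hypothetical, and the header parsing ignores the optional user-agent prefix that X-Robots-Tag values may carry.

```python
from urllib.robotparser import RobotFileParser

# The 11 crawler names listed in the answer above.
AI_CRAWLERS = [
    "GPTBot", "CCBot", "ClaudeBot", "Anthropic-AI", "Google-Extended", "Cohere-AI",
    "Bytespider", "PerplexityBot", "Diffbot", "OmgiliBot", "ImageSiftBot",
]
OPT_OUT_DIRECTIVES = {"noai", "noimageai", "noindex", "noml", "none"}

def blocked_crawlers(robots_txt, url):
    """Return the user agents (including '*') that robots.txt disallows for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in ["*"] + AI_CRAWLERS if not parser.can_fetch(ua, url)]

def opt_out_directives(x_robots_tag):
    """Return the opt-out directives present in an X-Robots-Tag header value."""
    tokens = {t.strip().lower() for t in x_robots_tag.split(",")}
    return sorted(tokens & OPT_OUT_DIRECTIVES)

robots_txt = """User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/
"""
blocked = blocked_crawlers(robots_txt, "https://example.com/blog/post")
directives = opt_out_directives("noai, noimageai")
```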

Question 19

How are you monitoring the audit for accuracy over time?

Accepted Answer

A weekly regression test (GitHub Actions, Mondays 09:00 AEST) runs a random 10-URL subset of a curated fixture of 50 Australian enterprise URLs against a live audit server. Each run's results are committed to the repository as a dated JSON file, creating an auditable history. If a URL that worked last week is now blocked, the run fails CI and opens a tracked issue with the specific diff. The full technical specification, including every reason code and the anti-hallucination architecture, is documented in product/audit.md in the public repository.
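The core of such a regression check can be sketched in a few lines. This is an illustration only: the function names and the result shape (a mapping from URL to a reason code, or None when the audit succeeded) are assumptions, and the fixture URLs are placeholders.

```python
import random

def pick_weekly_subset(fixture_urls, k=10, seed=None):
    """Choose k distinct URLs at random from the fixture."""
    rng = random.Random(seed)
    return rng.sample(fixture_urls, k)

def newly_blocked(previous, current):
    """URLs that succeeded in the previous run but are blocked in the current one.

    Both arguments map URL -> reason code, or None when the audit succeeded
    (an assumed shape, not the real JSON schema).
    """
    return sorted(
        url for url, reason in current.items()
        if reason is not None and url in previous and previous[url] is None
    )

# Placeholder fixture of 50 URLs.
fixture = [f"https://example.com/page-{i}" for i in range(50)]
subset = pick_weekly_subset(fixture, seed=1)

last_week = {"a": None, "b": "cloudflare_block"}
this_week = {"a": "bot_challenge", "b": "cloudflare_block", "c": "low_content"}
regressions = newly_blocked(last_week, this_week)
```

A CI job would exit with a non-zero status whenever the regressions list is non-empty.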

Dimension	Weight	Rationale
Crawlability & Technical Access	15%	Binary prerequisite; failure here is catastrophic but rare
Structural Clarity & Machine Readability	25%	Highest impact on AI interpretability and embedding quality
Knowledge Uniqueness & Contribution	25%	Core epistemic value; determines signal vs. noise
Authority, Evidence & Trust	20%	Structural proxy for quality in training pipeline filters
AI Usability & Retrieval Readiness	15%	Enhancement layer; differentiating but not foundational
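
One way to read the weights above is as a weighted average of the five dimension scores. The sketch below assumes that reading; the FAQ does not state the exact formula the audit uses, and the example scores are invented.

```python
# Weights from the table above, expressed as fractions.
WEIGHTS = {
    "Crawlability & Technical Access": 0.15,
    "Structural Clarity & Machine Readability": 0.25,
    "Knowledge Uniqueness & Contribution": 0.25,
    "Authority, Evidence & Trust": 0.20,
    "AI Usability & Retrieval Readiness": 0.15,
}

def overall_score(dimension_scores):
    """Weighted average of per-dimension scores (each out of 10), to one decimal place."""
    total = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(total, 1)

# Invented example scores for illustration.
example_scores = {
    "Crawlability & Technical Access": 8,
    "Structural Clarity & Machine Readability": 6,
    "Knowledge Uniqueness & Contribution": 4,
    "Authority, Evidence & Trust": 5,
    "AI Usability & Retrieval Readiness": 6,
}
score = overall_score(example_scores)
```

Under this assumption, the two 25% dimensions move the overall score the most, so a weak Structural Clarity or Knowledge Uniqueness result costs more than an equally weak Crawlability result.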

