§ 1.7.2 ARTICLE
Published Type How-to Sources 8 named

§ 1.7.2 · Cluster 1G · How-to

Training Bots Feed Future Models. Retrieval Bots Decide Today's Citations.

Every major AI vendor runs at least two crawlers: a training-class bot that scrapes content to feed the next model version, and a retrieval-class bot that fetches a page live when a user asks the assistant a question. Blocking the training crawler keeps your content out of future training runs. Blocking the retrieval crawler removes you from live citations. Squarespace's UI does not separate the two.

This page is the per-bot recommended-setting table for a Squarespace owner whose goal is AI citations. It separates the three classes (training, retrieval, search-index), names every documented bot in each, and ends with the implementation on Squarespace given the platform's limits.

The training-vs-retrieval split in one sentence

A training-class crawler scrapes pages in bulk to feed the next version of an AI model; a retrieval-class crawler fetches a single page in real time when a user has just asked the assistant a question. The two classes have different schedules, different volumes, different robots.txt behaviour, and different consequences if you block them. Squarespace's Crawlers panel groups them under one checkbox.

OpenAI was the first major vendor to document the split publicly. Its docs name GPTBot as the training crawler, ChatGPT-User as the user-initiated retrieval agent, and OAI-SearchBot as the ChatGPT Search index crawler [2]. Anthropic followed the same shape with ClaudeBot, Claude-User, and Claude-SearchBot [3]. Perplexity ships PerplexityBot and Perplexity-User [4]. Apple separates Applebot from Applebot-Extended [5]. The split is now industry-standard.

For a Squarespace site that wants AI citations, the rule is short: allow every retrieval-class bot, allow every search-index bot, and decide your training-class stance on its own merits. Blocking training crawlers does not affect today's citations; it only affects whether your content informs future models. Blocking retrieval crawlers does the opposite: it has no effect on training (those bots are already not allowed to train on what they fetch) and a large effect on whether you appear in today's AI answers.

Why the split matters in 2026

37% of consumers start a search with an AI engine before Google, per January 2026 data. (Search Engine Land · 2026-02-23)

~25% drop in traditional search volume Gartner projects for 2026. (Search Engine Land · 2026-02-23)

3 documented crawler classes per vendor on average: training, retrieval, search-index. (OpenAI Docs · 2026)

Training-class crawlers: GPTBot, ClaudeBot, Google-Extended, and friends

A training-class crawler exists to collect content for the next version of an AI model. The named bots in this class: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's robots.txt token for Gemini training), Applebot-Extended (Apple Intelligence opt-out), Meta-ExternalAgent (Llama), CCBot (Common Crawl, downstream of many models), Bytespider (ByteDance), and the smaller training agents from Cohere, AI2, and others.

Each one is well-documented. OpenAI defines GPTBot as the crawler that "collects publicly available web content to train future generations of OpenAI's models" [2]. Anthropic defines ClaudeBot as the crawler that collects "publicly available web content to contribute to the training and improvement of Anthropic's generative AI models" [3]. Google's documentation for Google-Extended states the token "is used in a control capacity" and that blocking it "does not impact a site's inclusion in Google Search nor is it used as a ranking signal" [6]. Apple is explicit that "Applebot-Extended does not crawl webpages" [5]; it is a signal that controls how Applebot's already-crawled content is used.

Blocking the training class is an owner-philosophy choice. Reasons to allow: you want your content to inform future ChatGPT, Claude, and Gemini answers; you treat AI training the way you treat search indexing. Reasons to block: copyright concerns, brand-protection concerns, or a content type (gated research, proprietary frameworks) you do not want absorbed into general-knowledge corpora. There is no single correct answer.

The training decision is, however, separate from the retrieval-class question. Blocking GPTBot does not block ChatGPT from citing your page when a user asks a live question; that is governed by ChatGPT-User. The two settings are independent in robots.txt and should be decided independently.
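The independence of the two settings can be checked with Python's standard-library robots.txt parser. This is a minimal sketch: the robots.txt text and the example.com URL are hypothetical, and the standard-library parser matches user-agent names by substring, which may differ from how each vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only OpenAI's training crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, agent: str, url: str = "https://example.com/pricing") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Check one bot from each class against the same rules.
results = {
    agent: allowed(ROBOTS_TXT, agent)
    for agent in ("GPTBot", "ChatGPT-User", "OAI-SearchBot")
}
print(results)
```

Under these rules only GPTBot is refused; the retrieval and search-index agents fall through to the wildcard group and stay allowed.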

Retrieval-class crawlers: ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User

A retrieval-class crawler fetches a single page when a user is actively asking the AI a question. The named bots: ChatGPT-User (OpenAI), Claude-User (Anthropic), Perplexity-User (Perplexity), MistralAI-User (Mistral). All four vendors document these as user-initiated, not training-related. For a site that wants AI citations, all four should remain allowed.

OpenAI documents ChatGPT-User as the user-agent activated when "users ask ChatGPT or a CustomGPT a question" that requires a live page fetch [2]. Anthropic documents Claude-User as the agent that "retrieves content when a user asks Claude a question that requires access to a webpage" [3]. Perplexity documents Perplexity-User as the agent that activates when "users ask Perplexity a question, it might visit a web page to help provide an accurate answer" [4]. Mistral's docs use almost identical language: MistralAI-User "may visit a web page to help answer and include a link to the source in its response" [7].

All four are absent from Squarespace's 26-bot block list, which means they remain allowed on a default Squarespace site. That is the desired state. The risk is that an owner adds custom robots.txt rules (via the workarounds in the robots-txt-custom leaf) that disallow them inadvertently. The safer pattern is to add no custom rules at all and rely on the platform default.

Search-index crawlers: OAI-SearchBot, Claude-SearchBot, PerplexityBot, Applebot

A search-index crawler sits between the training crawler and the retrieval agent. It indexes your site so that the engine's search-mode results surface (ChatGPT Search, Claude's search answers, Perplexity's results, Apple search) can include your pages. These are the bots that decide AI-search rankings, not AI-conversation citations — but for many sites, AI-search rankings convert at higher rates than chatbot citations.

OAI-SearchBot is OpenAI's. Per the docs, it is "used to link to and surface websites inside AI-driven search results" and "not used to crawl content to train OpenAI's generative AI foundation models" [2]. Claude-SearchBot is Anthropic's; it "indexes content to improve Claude's search answers" [3]. PerplexityBot is Perplexity's; it is "designed to surface and link websites in search results on Perplexity" and "not used to crawl content for AI foundation models" [4]. Applebot powers Apple's Spotlight, Siri, and Safari search surfaces [5].

For a Squarespace site, all four of these (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Applebot) are absent from Squarespace's 26-bot block list and remain allowed by default. The Squarespace panel does not affect search-index crawlers — toggling the AI checkbox on or off does not change their reachability.
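If you have access to request logs, the three classes can be told apart by the bot token in the User-Agent header. The sketch below is illustrative only: the sample header strings are simplified, hypothetical stand-ins rather than the vendors' exact strings, and the two control tokens (Google-Extended, Applebot-Extended) are left out because they are robots.txt signals rather than crawlers that issue requests.

```python
# Bot tokens named in this article, mapped to their crawler class.
BOT_CLASSES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Meta-ExternalAgent": "training",
    "CCBot": "training",
    "Bytespider": "training",
    "ChatGPT-User": "retrieval",
    "Claude-User": "retrieval",
    "Perplexity-User": "retrieval",
    "MistralAI-User": "retrieval",
    "OAI-SearchBot": "search-index",
    "Claude-SearchBot": "search-index",
    "PerplexityBot": "search-index",
    "Applebot": "search-index",
}

# Try longer tokens first so a short token never shadows a longer one.
_TOKENS = sorted(BOT_CLASSES, key=len, reverse=True)


def classify(user_agent: str) -> str:
    """Return the crawler class for a User-Agent string, or 'other'."""
    ua = user_agent.lower()
    for token in _TOKENS:
        if token.lower() in ua:
            return BOT_CLASSES[token]
    return "other"


# Simplified, hypothetical header strings for illustration.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Mozilla/5.0 (Macintosh) AppleWebKit Safari",
]
for ua in samples:
    print(classify(ua), "<-", ua)
```

Counting requests per class over a few weeks gives a rough picture of whether retrieval and search-index agents are actually reaching the site.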

For a Squarespace site whose goal is AI citation visibility, the recommended pattern is simple: allow every retrieval bot, allow every search-index bot, and decide training bots on their own merits. The Squarespace UI does not let you express this with full granularity, but the default state of the Crawlers panel already gets most of it right for free.

Recommendation: per-bot setting for AI-citation visibility on Squarespace

  # TRAINING: owner's choice, no live-citation impact
  GPTBot               owner's call
  ClaudeBot            owner's call
  Google-Extended      owner's call   # zero effect on Google Search
  Applebot-Extended    owner's call   # does not actually crawl
  Meta-ExternalAgent   owner's call
  CCBot                owner's call   # downstream of many datasets
  Bytespider           owner's call   # may ignore robots.txt anyway

  # RETRIEVAL: leave allowed if you want AI citations
  ChatGPT-User         allow
  Claude-User          allow
  Perplexity-User      allow          # ignores robots.txt anyway
  MistralAI-User       allow

  # SEARCH INDEX: leave allowed if you want AI-search visibility
  OAI-SearchBot        allow
  Claude-SearchBot     allow
  PerplexityBot        allow
  Applebot             allow          # powers Spotlight, Siri, Safari

The simplest Squarespace expression of this matrix: leave the AI checkbox unchecked. That allows all 26 named bots, including the training crawlers. If you want to block training selectively while keeping retrieval and search-index agents reachable, the Squarespace UI cannot express that — the panel toggle is all-or-nothing across its 26-bot list. The cleanest compromise is to leave the checkbox off; the alternative is the per-bot custom-robots workaround in the next leaf, which is fiddly.
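For reference, here is what the finer-grained stance (block training, keep retrieval and search-index reachable) looks like as robots.txt rule text. Treat it as a sketch under assumptions: `build_rules` is a hypothetical helper, the output is only usable on a host where robots.txt can be edited, and Squarespace does not expose that file directly.

```python
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
            "Meta-ExternalAgent", "CCBot", "Bytespider"]
RETRIEVAL = ["ChatGPT-User", "Claude-User", "Perplexity-User", "MistralAI-User"]
SEARCH_INDEX = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Applebot"]


def build_rules(block_training: bool) -> str:
    """Render the per-bot matrix as robots.txt groups (hypothetical helper)."""
    groups = []
    if block_training:
        # Training stance is the owner's call; this branch expresses "block".
        groups += [f"User-agent: {bot}\nDisallow: /" for bot in TRAINING]
    # Retrieval and search-index bots stay explicitly allowed either way.
    groups += [f"User-agent: {bot}\nAllow: /" for bot in RETRIEVAL + SEARCH_INDEX]
    return "\n\n".join(groups) + "\n"


rules = build_rules(block_training=True)
print(rules)
```

Each bot gets its own group, so changing the stance on one training crawler never touches the retrieval or search-index entries.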

How to implement this on a Squarespace site

The default state of the Squarespace Crawlers panel (search-engine checkbox on, AI-block checkbox off) is already the recommended setting for most sites. The work is in confirming it, not in changing it. The exception is sites where a previous owner toggled the AI block on; those need the box unchecked again, plus a forty-eight-hour wait for the change to propagate through the AI vendors' caches.

Open Settings → Crawlers. Confirm the search-engine checkbox is on. Confirm the "Block known artificial intelligence crawlers" checkbox is off. Click Save. Open yoursite.com/robots.txt in a private window and verify that no AI-specific Disallow rules appear. That is the entire setup for the default path.
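The final check, reading robots.txt for AI-specific Disallow rules, can also be done programmatically. The sketch below assumes you have pasted the file's contents into a string; the sample text and the `blocked_citation_bots` helper are hypothetical. It uses Python's standard-library parser, which matches user-agent names by substring and so only approximates each vendor's own matching.

```python
from urllib.robotparser import RobotFileParser

# Retrieval and search-index bots that should stay reachable for AI citations.
CITATION_BOTS = [
    "ChatGPT-User", "Claude-User", "Perplexity-User", "MistralAI-User",
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Applebot",
]


def blocked_citation_bots(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """List the citation-critical bots that the given robots.txt text refuses."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in CITATION_BOTS if not parser.can_fetch(bot, url)]


# Hypothetical file in which a retrieval agent was disallowed by mistake.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: Claude-User
Disallow: /
"""

problems = blocked_citation_bots(SAMPLE)
print(problems or "no citation-critical bots are blocked")
```

An empty list is the desired result; any name that appears points to a rule worth removing.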

For sites that want a finer-grained position (allow retrieval and search-index but block training), the implementation requires custom robots.txt rules, which Squarespace does not expose directly. The robots-txt-custom leaf documents the three workarounds (page-level noindex, site-wide meta robots via Code Injection, X-Robots-Tag via Developer Mode) and the trade-offs of each.

For verification once the settings are in place, the diagnose leaf walks through the five live checks that confirm your site is in the state you intended.