Published · Type: Reference · Sources: 9 named · Authored by SquareRank Team
Every AI Bot on Squarespace's Block List, Named and Sourced
Squarespace's Crawlers panel disallows 26 named user-agents when the AI block is enabled1. This page covers each one with the owner, the documented purpose, and the recommended setting for a Squarespace site whose goal is AI citation visibility. It also names the retrieval bots the panel does not cover.
Sources are linked inline and grouped by company. Where a bot is widely reported to disregard robots.txt despite documenting compliance, the section notes it. Where the user-agent string matters for verification or for custom robots.txt rules, the exact string is listed.
§01 The list
The 26 bots, in one sentence
Squarespace's help center lists the exact 26 user-agents disallowed when the AI crawler block is on: AI2Bot, Ai2Bot-Dolma, aiHitBot, Amazonbot, anthropic-ai, Applebot-Extended, Bytespider, CCBot, ClaudeBot, cohere-ai, cohere-training-data-crawler, DuckAssistBot, FacebookBot, Google-Extended, GoogleOther, GoogleOther-Image, GoogleOther-Video, GPTBot, img2dataset, Meta-ExternalAgent, MyCentralAIScraperBot, omgili, omgilibot, Quora-Bot, TikTokSpider, YouBot.
The list comes from Squarespace's own documentation1 and is the verbatim set of user-agents the platform writes into robots.txt when the box is checked. Most of these are training crawlers. A few are bulk web-corpus collectors. None of them are the live retrieval bots that decide whether ChatGPT, Claude, or Perplexity will cite your page when a user is actively asking. The rest of this article groups the 26 by owner so you can decide each one on its merits, then names the retrieval bots Squarespace did not cover at all.
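As a rough illustration of what "writes into robots.txt" means, the sketch below builds one disallow-everything group per listed user-agent and checks the result with Python's standard-library parser. The exact layout Squarespace emits is not shown in this article, so the one-group-per-agent format, the helper names `build_block_rules` and `parse_robots`, and the example URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The 26 user-agent tokens, copied from the list above.
BLOCKED_AI_AGENTS = [
    "AI2Bot", "Ai2Bot-Dolma", "aiHitBot", "Amazonbot", "anthropic-ai",
    "Applebot-Extended", "Bytespider", "CCBot", "ClaudeBot", "cohere-ai",
    "cohere-training-data-crawler", "DuckAssistBot", "FacebookBot",
    "Google-Extended", "GoogleOther", "GoogleOther-Image", "GoogleOther-Video",
    "GPTBot", "img2dataset", "Meta-ExternalAgent", "MyCentralAIScraperBot",
    "omgili", "omgilibot", "Quora-Bot", "TikTokSpider", "YouBot",
]


def build_block_rules(agents):
    """Return robots.txt text with one disallow-everything group per agent.

    Assumed layout; the real file Squarespace writes may be grouped differently.
    """
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


def parse_robots(robots_text):
    """Parse robots.txt text offline with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots_text = build_block_rules(BLOCKED_AI_AGENTS)
parser = parse_robots(robots_text)
page = "https://example.com/blog/post"  # hypothetical URL

print(len(BLOCKED_AI_AGENTS))                  # 26
print(parser.can_fetch("GPTBot", page))        # False: on the list
print(parser.can_fetch("ChatGPT-User", page))  # True: not on the list
```

The parser only models how a compliant crawler would read the file. It says nothing about whether a given bot actually honours the rules.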
What the list is, and what it is not
26 named user-agents disallowed when the Squarespace AI block is on.
§02 OpenAI
OpenAI: GPTBot, ChatGPT-User, and OAI-SearchBot
OpenAI documents three crawlers with three different jobs. GPTBot trains future models, ChatGPT-User fetches a page when a user asks the assistant a live question, and OAI-SearchBot indexes content for ChatGPT Search. Only GPTBot appears on Squarespace's block list. ChatGPT-User and OAI-SearchBot are not on the list, meaning Squarespace's checkbox has no effect on whether ChatGPT cites you in a live answer.
GPTBot. OpenAI describes GPTBot as the crawler that "collects publicly available web content to train future generations of OpenAI's models"2. Blocking GPTBot in robots.txt opts you out of training. It has no effect on live ChatGPT citations. On Squarespace's list.
ChatGPT-User. The user-initiated retrieval agent. OpenAI documents it as the user-agent that "may visit a web page to help answer" when a user asks ChatGPT a question that requires up-to-date information2. Allowing it is the difference between being citable and not being citable in live ChatGPT answers. Not on Squarespace's list.
OAI-SearchBot. The ChatGPT Search index crawler. OpenAI documents it as "used to link to and surface websites inside AI-driven search results" and explicitly "not used to crawl content to train OpenAI's generative AI foundation models"2. Required for inclusion in the ChatGPT Search results surface. Not on Squarespace's list.
§03 Anthropic
Anthropic: ClaudeBot, Claude-User, Claude-SearchBot, and claude-code
Anthropic runs four documented Claude crawlers. ClaudeBot is the training crawler. Claude-User fetches pages on behalf of a user inside a Claude conversation. Claude-SearchBot indexes content to improve Claude's search answers. claude-code is the user-agent sent by the Claude Code CLI's WebFetch tool. Only ClaudeBot (and the older anthropic-ai user-agent) appear on Squarespace's list.
ClaudeBot. The training crawler. Anthropic's documentation states that blocking ClaudeBot in robots.txt will "exclude your site's future content from AI training datasets"3. On Squarespace's list as ClaudeBot (and separately as the older anthropic-ai token).
Claude-User. The retrieval crawler. Anthropic documents it as the user-agent that "retrieves content when a user asks Claude a question that requires access to a webpage"3. Blocking Claude-User means Anthropic cannot fetch your pages in response to user queries. Not on Squarespace's list.
Claude-SearchBot. The search index crawler. Anthropic documents it as the user-agent that "indexes content to improve Claude's search answers"3. Required if you want Claude's search answer feature to surface your pages. Not on Squarespace's list.
claude-code. The user-agent sent by Anthropic's Claude Code CLI when a developer asks the assistant to fetch a URL via the WebFetch tool. Volume is low compared with the other three. Not on Squarespace's list.
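The OpenAI and Anthropic entries above can be condensed into a small lookup table. This is a minimal sketch: the `Agent` record, the role labels ("training", "retrieval", "search-index", "developer-tool", "legacy"), and the helper `untouched_by_toggle` are shorthand invented for this example, not vendor terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    token: str                 # robots.txt user-agent token
    owner: str
    role: str                  # shorthand label, assumed for this sketch
    on_squarespace_list: bool  # per the descriptions above


AGENTS = [
    Agent("GPTBot", "OpenAI", "training", True),
    Agent("ChatGPT-User", "OpenAI", "retrieval", False),
    Agent("OAI-SearchBot", "OpenAI", "search-index", False),
    Agent("ClaudeBot", "Anthropic", "training", True),
    Agent("anthropic-ai", "Anthropic", "legacy", True),
    Agent("Claude-User", "Anthropic", "retrieval", False),
    Agent("Claude-SearchBot", "Anthropic", "search-index", False),
    Agent("claude-code", "Anthropic", "developer-tool", False),
]


def untouched_by_toggle(agents):
    """Tokens the Squarespace AI checkbox does not mention at all."""
    return [a.token for a in agents if not a.on_squarespace_list]


for token in untouched_by_toggle(AGENTS):
    print(token)
```

Every agent labelled retrieval or search-index falls in the untouched group, which is the pattern the rest of this page returns to.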
§04 Google
Google: Google-Extended, GoogleOther
Google's AI training is controlled by a single robots.txt token, Google-Extended. It does not crawl independently; it is a signal that governs whether content fetched by existing Googlebot requests can be used for AI training and grounding. GoogleOther is a separate family of three crawlers for non-Search Google products. Both are on Squarespace's block list. Blocking Google-Extended has no effect on Google Search rankings; Google's documentation is explicit on that.
Google-Extended. The robots.txt token that controls whether Google uses your content to train and ground Gemini models. Google's documentation states: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity"6. The same docs confirm: "Google-Extended does not impact a site's inclusion in Google Search nor is it used as a ranking signal in Google Search."
GoogleOther. A family of three crawlers (GoogleOther, GoogleOther-Image, GoogleOther-Video) used for internal Google product development, research, and one-off projects outside of Google Search. All three appear on Squarespace's list.
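Because Google-Extended is a control token and not a crawler, a rule that names it leaves Googlebot's own group untouched. The sketch below models that with Python's standard-library parser. The robots.txt text and URL are hypothetical, and the parser only approximates how a compliant crawler matches groups; Google's own handling is as described in its documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of Gemini training and grounding only.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/pricing"  # hypothetical URL

# The rule applies to the control token...
print(parser.can_fetch("Google-Extended", url))  # False
# ...while the Search crawler's token is not covered by any group.
print(parser.can_fetch("Googlebot", url))        # True
```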
§05 Apple
Apple: Applebot and Applebot-Extended
Apple separates the search crawler from the AI training opt-out. Applebot powers Spotlight, Siri, and Safari Suggestions. Applebot-Extended is a control signal that lets publishers opt out of having Applebot's crawled content used to train Apple Intelligence. Apple's documentation explicitly states "Applebot-Extended does not crawl webpages." Only Applebot-Extended is on Squarespace's list.
Applebot. Apple's primary web crawler. Apple's documentation states it powers "the search technology that is integrated into many user experiences in Apple's ecosystem including Spotlight, Siri, and Safari"5. Not on Squarespace's list, which means the search-engine checkbox in the Crawlers panel governs it (allowed by default).
Applebot-Extended. The AI training opt-out signal. Apple is explicit: "Applebot-Extended does not crawl webpages. Webpages that disallow Applebot-Extended can still be included in search results"5. Blocking it opts your content out of training Apple Intelligence; it does not affect Apple search visibility. On Squarespace's list.
§06 Meta
Meta: Meta-ExternalAgent and FacebookBot
Meta runs two named crawlers relevant to AI. Meta-ExternalAgent is the Llama training collector, launched in 2024. FacebookBot is Meta's older crawler used in Facebook product surfaces. Both are on Squarespace's list. Meta's separate facebookexternalhit link-preview crawler is not affected by either toggle.
Meta-ExternalAgent. Meta's AI training crawler. Documented user-agent string: meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler). Meta states the crawler collects publicly available content to train Llama family models. On Squarespace's list.
FacebookBot. Meta's older crawler, used for Meta product surfaces and previously for Facebook's training pipelines before Meta-ExternalAgent. On Squarespace's list. Distinct from facebookexternalhit, which is the link-preview fetcher that scrapes Open Graph tags when a URL is shared; that one is not affected by Squarespace's AI toggle.
§07 Perplexity
Perplexity: PerplexityBot and Perplexity-User
Perplexity documents two crawlers. PerplexityBot is the search index crawler with the user-agent ending +https://perplexity.ai/perplexitybot. Perplexity-User is the live retrieval agent that fetches a page when a user asks Perplexity a question. Neither is on Squarespace's block list. Separately, Cloudflare's August 2025 investigation reported that Perplexity also rotates through undeclared crawlers when its declared ones are blocked.
PerplexityBot. Perplexity's index crawler. Documented as "designed to surface and link websites in search results on Perplexity" and "not used to crawl content for AI foundation models"4. User-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot).
Perplexity-User. The live retrieval agent. Perplexity documents that when "users ask Perplexity a question, it might visit a web page to help provide an accurate answer"4. The docs note Perplexity-User "generally ignores robots.txt rules" because a user initiated the request. User-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user).
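When checking traffic against the exact strings above, the declared bot name can be pulled out of the `compatible; Name/version` segment. A minimal sketch follows, assuming the user-agent strings are exactly as documented; the pattern `BOT_TOKEN` and helper `declared_bot` are names invented here, and the browser string is a made-up example. A user-agent header can be set to anything by the sender, so a match shows what a request claims to be, not who sent it.

```python
import re

# User-agent strings as documented in this section.
PERPLEXITYBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"
)
PERPLEXITY_USER_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)"
)

# Matches the "compatible; Name/1.0" segment used by these strings.
BOT_TOKEN = re.compile(
    r"compatible;\s*(?P<name>[A-Za-z][A-Za-z0-9-]*)/(?P<version>\d+(?:\.\d+)*)"
)


def declared_bot(user_agent):
    """Return (name, version) declared in the UA string, or None if absent."""
    match = BOT_TOKEN.search(user_agent)
    if match is None:
        return None
    return match["name"], match["version"]


print(declared_bot(PERPLEXITYBOT_UA))
print(declared_bot(PERPLEXITY_USER_UA))

# Hypothetical ordinary browser string: no "compatible; Name/version" segment.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
print(declared_bot(BROWSER_UA))
```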
§08 Mistral
Mistral: MistralAI-User
Mistral runs one named user-agent for its Le Chat product, MistralAI-User. It is a retrieval agent, not a training crawler. The docs are explicit: "not used for crawling the web in any automatic fashion, nor to crawl content for generative AI training." Not on Squarespace's list.
MistralAI-User. Documented as the user-agent that "may visit a web page to help answer and include a link to the source in its response" when a user asks Le Chat a question7. User-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots). Mistral's docs state the agent honours robots.txt directives, X-Robots-Tag headers, and cache headers.
§09 The rest
The rest of the 26
The remaining bots on Squarespace's list cover the broader training ecosystem: a public web-corpus collector (CCBot), AI2's research crawlers, smaller training agents from Cohere and You.com, and a handful of aggregator and AI-product bots. None of these decide live citations, but several feed widely used research datasets and downstream models.
CCBot (Common Crawl). The largest open web-corpus collector. CCBot identifies itself as CCBot/2.0 (https://commoncrawl.org/faq/) and Common Crawl documents it as respecting robots.txt8. Many academic and commercial training datasets are derived from Common Crawl, so blocking CCBot has compound downstream effects.
Bytespider (ByteDance). User-agent string includes Bytespider with a feedback contact at bytedance.com. Widely reported as ignoring robots.txt in practice, so the Squarespace block on this bot is best understood as a polite request rather than an enforced rule. TikTokSpider is ByteDance's companion crawler and behaves similarly.
cohere-ai and cohere-training-data-crawler. Cohere's two documented user-agents. The training-data-crawler is the bulk collector; cohere-ai is the on-demand retrieval agent for Cohere products. Both are on Squarespace's list, which is unusual for a retrieval agent.
AI2Bot and Ai2Bot-Dolma. The Allen Institute for AI's research crawlers. Dolma is the name of AI2's open-source training corpus, so Ai2Bot-Dolma is the crawler that feeds it.
Amazonbot, DuckAssistBot, YouBot, Quora-Bot. Smaller AI-product crawlers from Amazon, DuckDuckGo (DuckAssist), You.com, and Quora. The same question applies to each (is it collecting for training or retrieving for a live answer?), but the Squarespace toggle blocks them all at once.
aiHitBot, anthropic-ai, img2dataset, MyCentralAIScraperBot, omgili, omgilibot. Long-tail entries. anthropic-ai is the older Anthropic user-agent superseded by ClaudeBot. img2dataset is an open-source bulk image dataset builder. omgili and omgilibot are forum and aggregator crawlers that pre-date the current AI wave but ended up on the list because their corpora feed AI training pipelines.
§10 The gaps
Notable bots not on Squarespace's list
Several user-agents that decide live AI citations are absent from Squarespace's 26-bot list, which means the AI checkbox does not affect them in either direction. The biggest gaps are the OpenAI retrieval agents (ChatGPT-User, OAI-SearchBot), the Anthropic retrieval agents (Claude-User, Claude-SearchBot), the Perplexity retrieval agent (Perplexity-User), and Mistral's MistralAI-User.
For a Squarespace owner whose goal is AI citations, this gap is the entire point. Toggling the AI block on or off does nothing to the bots that decide whether ChatGPT, Claude, Perplexity, and Mistral will pull your page into their next answer. Those decisions are governed by the platform's default robots.txt rules, by any X-Robots-Tag headers your site emits, and by the live behaviour of each bot.
The retrieval agents not on the list, as documented by their respective vendors: ChatGPT-User2, OAI-SearchBot2, Claude-User3, Claude-SearchBot3, Perplexity-User4, MistralAI-User7, Applebot5. By default, all seven are allowed on a Squarespace site because Squarespace's robots.txt does not disallow them. That is the configuration most AI-visibility playbooks recommend, and it is what a fresh Squarespace site already has.
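One way to confirm that these seven agents are allowed is to run a site's robots.txt through a parser and ask about each token. The sketch below does this offline against a made-up file. `SAMPLE_ROBOTS`, the example URL, and the helper `audit_retrieval_access` are assumptions for illustration, not Squarespace's actual output.

```python
from urllib.robotparser import RobotFileParser

# The seven agents named above that are not on Squarespace's list.
RETRIEVAL_AGENTS = [
    "ChatGPT-User", "OAI-SearchBot", "Claude-User", "Claude-SearchBot",
    "Perplexity-User", "MistralAI-User", "Applebot",
]


def audit_retrieval_access(robots_text, url, agents=RETRIEVAL_AGENTS):
    """Map each agent token to True if robots_text lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical robots.txt: two training crawlers blocked, one generic rule.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /config
"""

report = audit_retrieval_access(SAMPLE_ROBOTS, "https://example.com/services")
for agent, allowed in report.items():
    print(f"{agent:18} {'allowed' if allowed else 'BLOCKED'}")
```

To audit a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over HTTP. The result still reflects only what the file permits, not how each bot behaves.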
For per-bot control beyond the Squarespace UI, the workarounds are documented in the robots-txt-custom leaf. For the recommended setting per bot, the training-vs-retrieval leaf covers the full matrix.