Top AI search crawlers & user-agents

A list of major AI crawler bots and user agents that collect website data.

Paul · Co-founder

Your website gets more than just human visitors these days. Check your server logs and you'll see unfamiliar bot names crawling your pages. These aren't the usual search engine bots: they're AI bots, and there are a lot of them.

Some collect content to train AI models. Others gather data to answer search questions in real time. Either way, they read your content, and it's up to you to decide whether that's a good thing.
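A quick way to see which of these bots hit your site is to grep the access log for their user-agent tokens. The sketch below uses a made-up three-line sample log in the common nginx/Apache combined format; the bot names are real tokens covered later in this list, but extend the pattern to match your own policy:

```shell
# Hypothetical excerpt of an access log in combined format
cat > access.log <<'EOF'
203.0.113.5 - - [10/Feb/2026:10:01:22 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
198.51.100.7 - - [10/Feb/2026:10:02:01 +0000] "GET /blog HTTP/1.1" 200 8120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
192.0.2.9 - - [10/Feb/2026:10:03:45 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"
EOF

# Count hits per known AI bot, case-insensitively:
# extract matching tokens, then tally them
grep -oiE 'GPTBot|ClaudeBot|PerplexityBot|CCBot|Bytespider|Amazonbot' access.log \
  | sort | uniq -c | sort -rn
```

On a real server, point the pattern at your actual log path (often `/var/log/nginx/access.log`) instead of the sample file.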

OpenAI

ChatGPT-User

A user agent that browses websites and fetches information when a ChatGPT user asks for something that requires real-time web data.

More info: https://platform.openai.com/docs/bots

OAI-SearchBot

A user agent that browses websites and retrieves real-time information for ChatGPT's search features, so answers can link to up-to-date web content.

More info: https://platform.openai.com/docs/bots

GPTBot

A crawler that browses websites to collect data used to train OpenAI's generative AI models.

More info: https://platform.openai.com/docs/bots

Operator

An AI agent developed by OpenAI that autonomously performed tasks through web browser interactions. It launched in January 2025 and was deprecated in August 2025, with its capabilities folded into ChatGPT's agent mode.

More info: https://openai.com/index/introducing-operator/

Anthropic

ClaudeBot

A crawler to browse public websites to gather content for training its AI language models.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Claude-User

A user agent that visits websites when Claude users ask questions that require real-time information.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Claude-SearchBot

A crawler that browses the web to improve the quality of search results for users. It's unclear how its role differs from that of Claude-User.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

anthropic-ai

An AI agent possibly used by Anthropic to download training data for its large language models that power AI products like Claude. The exact purpose and scope of this agent remain undocumented.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Claude-Web

A web crawler used by Anthropic to gather web content for Claude-related services. This agent enables Claude to reference and discuss web content in conversations.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Amazon

AmazonBot

A crawler used by Amazon to crawl and index web content. The data it gathers enhances services like Alexa, improving search results and the accuracy of spoken responses.

More info: https://developer.amazon.com/amazonbot

Apple

Applebot

A crawler that indexes web content for features like Siri, Spotlight, and Safari search. It also collects data to help train Apple's generative AI models.

More info: https://support.apple.com/en-us/119829

Applebot-Extended

A crawler specifically used to identify and collect web content for training Apple's generative AI models, including Apple Intelligence. This is separate from regular Applebot and can be blocked independently.

More info: https://support.apple.com/en-us/119829

TikTok

Bytespider

A crawler operated by ByteDance, TikTok's parent company, that collects web content for AI model training, including for Doubao, its ChatGPT-style assistant.

No official public documentation available.

Common Crawl

CCBot

A crawler that systematically archives the open web. Its massive dataset is publicly available and widely used for AI training, academic research, and data analysis.

More info: https://commoncrawl.org/ccbot

Perplexity AI

PerplexityBot

A crawler that indexes pages so Perplexity can surface and link to them in its answer citations. According to the company, it is not used to train foundation models.

More info: https://docs.perplexity.ai/guides/bots

Perplexity-User

A user agent that fetches individual pages on‑demand when a Perplexity user’s query requires direct access.

More info: https://docs.perplexity.ai/guides/bots

Meta

Meta-ExternalAgent

A crawler to harvest public web content to train Meta's generative‑AI systems (e.g., Llama, Meta AI).

More info: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/

Meta-ExternalFetcher

A user agent that performs user-initiated fetches of individual links for Meta's AI assistant features, making targeted, on-demand requests to retrieve current information.

More info: https://developers.facebook.com/docs/sharing/webmasters/crawler

FacebookBot

A crawler used by Meta to collect public web content that helps improve its AI models, such as its speech recognition technology. Rich link previews are generated by a separate agent, facebookexternalhit.

More info: https://developers.facebook.com/docs/sharing/webmasters/crawler

Google

Google-Extended

Not a standalone crawler but a robots.txt control token: it governs whether content crawled by Googlebot may be used to train Gemini (formerly Bard) and other Google generative-AI products.

More info: https://support.google.com/webmasters/answer/2723646#google-extended

Cohere

cohere-ai

An AI agent dispatched by Cohere's AI chat products in response to user prompts when it needs to retrieve content from the internet.

More info: https://darkvisitors.com/agents/cohere-ai

cohere-training-data-crawler

A crawler operated by Cohere to download training data for its large language models that power enterprise AI products.

More info: https://darkvisitors.com/agents/cohere-training-data-crawler

You.com

YouBot

A crawler used by You.com to index search results that allow their AI Assistant to answer user questions. The assistant's answers typically contain references to the website as inline sources.

More info: https://darkvisitors.com/agents/youbot

DuckDuckGo

DuckAssistBot

An AI assistant that crawls pages in real-time for DuckDuckGo's AI-assisted answers, which prominently cite their sources. This data is not used to train AI models.

More info: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot

How to manage these bots in robots.txt

  • Allow all bots

    User-agent: *
    Disallow:
  • Block a bot completely

    User-agent: <bot>
    Disallow: /
  • Allow only specific folders

    User-agent: <bot>
    Allow: /public/
    Disallow: /
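Putting the pieces together, a robots.txt that blocks the training crawlers listed above while leaving search and answer bots alone might look like the sketch below. The bot names are taken from this list; which ones you block is your own policy call:

    # Block crawlers that collect training data
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Opt out of Google generative-AI use via the control token
    User-agent: Google-Extended
    Disallow: /

    # Everyone else, including search/answer bots, is allowed
    User-agent: *
    Disallow: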

Remember that user‑initiated agents such as Claude-User and Perplexity-User may not honor robots.txt, since they fetch pages on a user's behalf; use rate limiting or IP blocking if needed.
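One common rate-limiting approach on nginx is to key a `limit_req` zone on the bot's user agent, so ordinary visitors are unaffected. A sketch, with illustrative bot names and rates (requests from agents that don't match get an empty key and are therefore not limited by this zone):

    # In the http {} block: map AI bot user agents to a per-IP key
    map $http_user_agent $ai_bot {
        default  "";
        ~*(GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot)  $binary_remote_addr;
    }

    # Allow matched bots 1 request/second, tracked in a 10 MB zone
    limit_req_zone $ai_bot zone=aibots:10m rate=1r/s;

    server {
        listen 80;
        location / {
            # Queue short bursts of up to 5 requests, reject the rest
            limit_req zone=aibots burst=5 nodelay;
        }
    }

Adjust the rate and burst to taste; aggressive crawlers that ignore both robots.txt and throttling are candidates for outright IP blocking.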

Published Jun 25, 2025

Updated Feb 17, 2026