A list of major AI crawler bots and user agents that collect website data.
Published: Jul 9, 2025
Author: Paul
Your website gets more than just human visitors these days. Check your server logs and you'll see unfamiliar bot names crawling your pages. These aren't traditional search engine crawlers; they're AI bots, and there are a lot of them.
Some collect content to train AI models. Others gather data to answer search questions in real time. Either way, they read your content, and it's up to you to decide whether that's a good thing.
OpenAI
ChatGPT-User
A user agent to browse websites and fetch information when a user asks for something that requires real-time web data in ChatGPT.
More info: https://platform.openai.com/docs/bots
OAI-SearchBot
A user agent to browse websites and retrieve real-time information when users select "live search" in ChatGPT to get up-to-date web content.
More info: https://platform.openai.com/docs/bots
GPTBot
A crawler that browses websites to collect data, which is then used to train and improve OpenAI's AI models.
More info: https://platform.openai.com/docs/bots
Anthropic
ClaudeBot
A crawler to browse public websites to gather content for training its AI language models.
Claude-User
A user agent to visit websites when Claude users ask questions that require real-time information.
Claude-SearchBot
A crawler to browse the web to enhance the quality of search results for users. It's unclear how its role differs from Claude-User's.
Amazon
AmazonBot
A crawler used by Amazon to crawl and index web content. The data it gathers enhances services like Alexa, improving search results and the accuracy of spoken responses.
More info: https://developer.amazon.com/amazonbot
Apple
Applebot
A crawler that indexes web content for features like Siri, Spotlight, and Safari search. It also collects data to help train Apple's generative AI models.
More info: https://support.apple.com/en-us/119829
TikTok
Bytespider
A crawler that collects web content for AI model training, including for Doubao, their ChatGPT-style assistant.
No official public documentation available.
Common Crawl
CCBot
A crawler that systematically archives the open web. Its massive dataset is publicly available and widely used for AI training, academic research, and data analysis.
More info: https://commoncrawl.org/ccbot
Perplexity AI
PerplexityBot
A crawler that indexes pages so Perplexity can surface and link to them in its answer citations. According to the company, it is not used to train foundation models.
More info: https://docs.perplexity.ai/guides/bots
Perplexity-User
A user agent that fetches individual pages on‑demand when a Perplexity user’s query requires direct access.
More info: https://docs.perplexity.ai/guides/bots
Meta
Meta-ExternalAgent
A crawler to harvest public web content to train Meta’s generative‑AI systems (e.g., Llama, Meta AI).
More info: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
Google-Extended
A robots.txt control token, not a separate crawler: it determines whether Gemini (formerly Bard) and other Google generative-AI products may use your content.
More info: https://support.google.com/webmasters/answer/2723646#google-extended
How to manage these bots in robots.txt
Allow all bots
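To allow every crawler full access, use a wildcard user agent with an empty Disallow:

```
User-agent: *
Disallow:
```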
Block a bot completely
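To block one bot entirely, name its user agent and disallow the site root. For example, to block OpenAI's GPTBot:

```
User-agent: GPTBot
Disallow: /
```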
Allow only specific folders
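To let a bot crawl only certain folders, disallow everything and then allow specific paths; most major crawlers honor the more specific Allow rule. The folder names below are placeholders, so substitute your own:

```
User-agent: CCBot
Disallow: /
Allow: /blog/
Allow: /docs/
```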
Remember that user‑initiated agents such as Claude-User and Perplexity-User may ignore robots.txt; use rate limiting or IP blocking if needed.
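If a bot ignores robots.txt, you can refuse it at the web server instead. A minimal sketch for nginx, assuming you serve the site through nginx (adjust the agent list to match the bots you want to stop):

```
# Inside a server block: return 403 Forbidden to selected AI user agents
if ($http_user_agent ~* (Claude-User|Perplexity-User)) {
    return 403;
}
```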