Top AI search crawlers + user agents

A list of major AI crawler bots and user agents that collect website data.

Published

Jul 9, 2025

Author

Paul

Your website gets more than just human visitors these days. If you check your server logs, you'll see strange bot names crawling your pages. These aren't normal search bots—they're AI bots, and there are a lot of them.

Some collect content to train AI models. Others gather data to answer search questions in real time. Either way, they read your content, and it's up to you to decide whether that's a good thing.

OpenAI

ChatGPT-User

A user agent to browse websites and fetch information when a user asks for something that requires real-time web data in ChatGPT.

More info: https://platform.openai.com/docs/bots

OAI-SearchBot

A user agent to browse websites and retrieve real-time information when users select "live search" in ChatGPT to get up-to-date web content.

More info: https://platform.openai.com/docs/bots

GPTBot

A crawler that collects website content, which is then used to improve the training of OpenAI's AI models.

More info: https://platform.openai.com/docs/bots

Anthropic

ClaudeBot

A crawler to browse public websites to gather content for training its AI language models.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Claude-User

A user agent that visits websites when Claude users ask questions that require real-time information.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Claude-SearchBot

A crawler that browses the web to improve the quality of search results for Claude users. It's unclear how its role differs from that of Claude-User.

More info: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Amazon

Amazonbot

A crawler used by Amazon to crawl and index web content. The data it gathers enhances services like Alexa, improving search results and the accuracy of spoken responses.

More info: https://developer.amazon.com/amazonbot

Apple

Applebot

A crawler that indexes web content for features like Siri, Spotlight, and Safari search. It also collects data to help train Apple's generative AI models.

More info: https://support.apple.com/en-us/119829

ByteDance (TikTok)

Bytespider

A crawler that collects web content for AI model training, including for Doubao, their ChatGPT-style assistant.

No official public documentation available.

Common Crawl

CCBot

A crawler that systematically archives the open web. Its massive dataset is publicly available and widely used for AI training, academic research, and data analysis.

More info: https://commoncrawl.org/ccbot

Perplexity AI

PerplexityBot

A crawler that indexes pages so Perplexity can surface and link to them in its answer citations. According to the company, it is not used to train foundation models.

More info: https://docs.perplexity.ai/guides/bots

Perplexity-User

A user agent that fetches individual pages on demand when a Perplexity user’s query requires direct access.

More info: https://docs.perplexity.ai/guides/bots

Meta

Meta-ExternalAgent

A crawler to harvest public web content to train Meta’s generative‑AI systems (e.g., Llama, Meta AI).

More info: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/

Google

Google-Extended

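A robots.txt product token, not a separate crawler, that controls whether Gemini (formerly Bard) and other Google generative-AI products may use your content. It has no user-agent string of its own, so it won't show up in your server logs; the crawling itself is done by Google's existing crawlers.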

More info: https://support.google.com/webmasters/answer/2723646#google-extended
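To see which of the bots listed above actually visit a site, their tokens can be matched against the User-Agent header of each request. The following is a minimal sketch, assuming a case-insensitive substring match is good enough. The names `AI_BOT_TOKENS`, `identify_ai_bot`, and `count_ai_bots` are hypothetical, and the sample User-Agent strings are illustrative rather than copied from real logs. Google-Extended is left out because it is a robots.txt token and is not sent in request headers.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Tokens from the list above, mapped to the operator behind each one.
# Google-Extended is omitted on purpose: it never appears in a User-Agent header.
AI_BOT_TOKENS = {
    "ChatGPT-User": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Claude-User": "Anthropic",
    "Claude-SearchBot": "Anthropic",
    "Amazonbot": "Amazon",
    "Applebot": "Apple",
    "Bytespider": "ByteDance (TikTok)",
    "CCBot": "Common Crawl",
    "PerplexityBot": "Perplexity AI",
    "Perplexity-User": "Perplexity AI",
    "Meta-ExternalAgent": "Meta",
}


def identify_ai_bot(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (operator, token) for the first known token found, else None."""
    ua = user_agent.lower()
    for token, operator in AI_BOT_TOKENS.items():
        if token.lower() in ua:
            return operator, token
    return None


def count_ai_bots(user_agents: Iterable[str]) -> Counter:
    """Count how many of the given User-Agent strings belong to each token."""
    counts: Counter = Counter()
    for ua in user_agents:
        match = identify_ai_bot(ua)
        if match is not None:
            counts[match[1]] += 1
    return counts


# Illustrative User-Agent strings (assumed formats, not real log lines).
sample = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/128.0",
]
print(count_ai_bots(sample))
```

A User-Agent header is easy to spoof, so a match only tells you what a client claims to be. Where a vendor publishes IP ranges, checking against those is more reliable.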

How to manage these bots in robots.txt

  • Allow all bots

    User-agent: *
    Allow: /

  • Block a bot completely

    User-agent: <bot>
    Disallow: /

  • Allow only specific folders

    User-agent: <bot>
    Allow: /<folder>/
    Disallow: /
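Before deploying rules like these, it can help to check how a parser reads them. The sketch below uses Python's standard-library `urllib.robotparser`. The sample robots.txt, the `example.com` URLs, and the helper name `is_allowed` are assumptions made for illustration: the sample blocks GPTBot completely, restricts ClaudeBot to a `/blog/` folder, and allows everything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining the three patterns above.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, bot: str, url: str) -> bool:
    """Parse robots.txt text and report whether `bot` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


for bot, url in [
    ("GPTBot", "https://example.com/pricing"),
    ("ClaudeBot", "https://example.com/blog/post"),
    ("ClaudeBot", "https://example.com/pricing"),
    ("SomeOtherBot", "https://example.com/pricing"),
]:
    print(bot, url, is_allowed(SAMPLE_ROBOTS_TXT, bot, url))
```

Python's parser applies the first matching rule in file order, which is why `Allow` comes before `Disallow` in the ClaudeBot group. Individual crawlers may resolve conflicting rules differently, so treat this as a sanity check rather than a guarantee of how any particular bot behaves.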

Remember that some user-initiated agents, such as Perplexity-User, may ignore robots.txt because they fetch pages at a user's request. Check each vendor's documentation, and use rate limiting or IP blocking if needed.
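As one way to picture rate limiting for agents that don't follow robots.txt, here is a minimal sliding-window sketch. The class name `BotRateLimiter` and its parameters are hypothetical. It assumes each request is keyed by a bot token or an IP address, and that a real deployment would enforce the limit at the web server, CDN, or firewall rather than in application code.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class BotRateLimiter:
    """Allow at most `max_requests` per key within a sliding time window."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Return True and record the request if `key` is under its limit."""
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Example: at most 2 requests per 60 seconds for each bot token.
limiter = BotRateLimiter(max_requests=2, window_seconds=60.0)
for _ in range(3):
    print(limiter.allow("Perplexity-User"))
```

Keying on the User-Agent token is simple but relies on the client identifying itself honestly. Keying on the IP address is stricter, at the cost of also affecting any other traffic that shares that address.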