Overview of AI Crawlers

A reference guide to the main AI crawlers: who they are, what they do, and how to identify them.

2025-02-15

AI companies run dedicated web crawlers to collect content for training, real-time retrieval, and agent use. As a site owner, knowing which crawlers exist and what they do is the first step to controlling how your content is used.

Major AI crawlers

GPTBot (OpenAI)

  • User-agent: GPTBot
  • Purpose: Training data collection for OpenAI's models.
  • Documentation: platform.openai.com/docs/gptbot
  • IP ranges: Published by OpenAI; a request claiming to be GPTBot can be checked against the published list.

OpenAI also runs ChatGPT-User, which fetches pages when a ChatGPT user asks for them (user-initiated, not background crawling), and OAI-SearchBot, which supports ChatGPT search.
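Because a user-agent string is easy to forge, a request that claims to be GPTBot can be compared against the published IP ranges. The following is a minimal sketch of that check, not OpenAI's own tooling. The CIDR blocks are placeholders taken from the reserved documentation ranges, and the function name `ip_in_ranges` is hypothetical; in practice you would load the ranges OpenAI currently publishes.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges.
# These are NOT OpenAI's real ranges; substitute the published list.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


# An address inside the first placeholder block, then one outside both.
print(ip_in_ranges("192.0.2.44", PLACEHOLDER_RANGES))
print(ip_in_ranges("203.0.113.9", PLACEHOLDER_RANGES))
```

A request whose user-agent says GPTBot but whose IP is outside the published ranges is a reasonable candidate for blocking or closer inspection.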

ClaudeBot (Anthropic)

  • User-agent: Claude-Web (older) / ClaudeBot
  • Purpose: Training and improving Claude models.
  • Documentation: anthropic.com/robots
  • Behavior: Generally respectful of robots.txt.

PerplexityBot

  • User-agent: PerplexityBot
  • Purpose: Real-time web search for Perplexity AI answers.
  • Documentation: Check Perplexity's current docs for the latest user-agent string.
  • Behavior: Active crawler for live answer synthesis.

Google-Extended

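  • User-agent: Google-Extended (a robots.txt control token only; crawling is done with Google's existing user-agent strings, so this name does not appear in access logs)
  • Purpose: Controls whether content crawled by Google may be used as training data for Google's AI products (Gemini, formerly Bard).
  • Note: Separate from Googlebot (standard search). Blocking Google-Extended does not affect Google Search ranking.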
  • Documentation: Google Search Central
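The separation between Google-Extended and Googlebot can be checked locally before deploying a robots.txt change. The sketch below uses Python's standard `urllib.robotparser` as a stand-in matcher; it is not Google's own parser, and the rules and the `example.com` URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of AI training use, keep search crawling.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/article"  # placeholder URL
print(parser.can_fetch("Google-Extended", url))  # governed by the Disallow rule
print(parser.can_fetch("Googlebot", url))        # falls through to the * group
```

The same pattern works for any other token in this guide: swap the name in the `User-agent` line and in the `can_fetch` call.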

Applebot-Extended

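  • User-agent: Applebot-Extended (a robots.txt control token only; it does not crawl pages itself, and content is fetched by Applebot)
  • Purpose: Controls whether content crawled by Applebot may be used to train Apple's generative AI models, including Apple Intelligence.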
  • Note: Introduced in 2024. Blocking it does not affect Spotlight or Safari search.

Common Crawl

  • User-agent: CCBot
  • Purpose: Open dataset used by many AI companies for training (including early OpenAI and EleutherAI models).
  • Documentation: commoncrawl.org
  • Note: Does not respect robots.txt in all configurations. Very high crawl volume.

Bytespider (ByteDance / TikTok)

  • User-agent: Bytespider
  • Purpose: Training data for ByteDance AI products.
  • Note: Known for aggressive crawling; often blocked by site owners.
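Since robots.txt is advisory, site owners who block a crawler often do so at the server as well. The following is a minimal sketch of a user-agent check that a request handler could call. The names `BLOCKED_TOKENS`, `is_blocked`, and `status_for` are hypothetical, and the sample user-agent strings in the usage lines are made up for illustration rather than copied from real traffic.

```python
# Hypothetical block list; tokens are matched case-insensitively.
BLOCKED_TOKENS = ("bytespider",)


def is_blocked(user_agent: str) -> bool:
    """Return True if the user-agent contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def status_for(user_agent: str) -> int:
    """Return the HTTP status a handler would send: 403 if blocked, else 200."""
    return 403 if is_blocked(user_agent) else 200


# Made-up sample strings, for illustration only.
print(status_for("Mozilla/5.0 (compatible; Bytespider)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64)"))
```

User-agent matching only stops clients that identify themselves honestly; a crawler that spoofs a browser string passes this check.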

Summary table

Crawler            Company       User-agent         Respects robots.txt
GPTBot             OpenAI        GPTBot             Yes
ClaudeBot          Anthropic     ClaudeBot          Yes
PerplexityBot      Perplexity    PerplexityBot      Yes
Google-Extended    Google        Google-Extended    Yes
Applebot-Extended  Apple         Applebot-Extended  Yes
CCBot              Common Crawl  CCBot              Partially
Bytespider         ByteDance     Bytespider         Partially
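The tokens in the table can be turned into robots.txt rules mechanically. The sketch below is one possible way to do it, assuming you want to disallow the whole site for every listed crawler. The helper name `build_robots_txt` is hypothetical.

```python
# User-agent tokens taken from the summary table.
CRAWLER_TOKENS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Applebot-Extended",
    "CCBot",
    "Bytespider",
]


def build_robots_txt(tokens: list[str], disallow_path: str = "/") -> str:
    """Build one User-agent/Disallow group per token, separated by blank lines."""
    groups = [f"User-agent: {token}\nDisallow: {disallow_path}" for token in tokens]
    return "\n\n".join(groups) + "\n"


print(build_robots_txt(CRAWLER_TOKENS))
```

Pass a shorter list to block only some crawlers, or a different `disallow_path` to protect a single directory.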

How to identify crawlers in your logs

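Look for these strings in the user-agent field of your server access logs. Google-Extended and Applebot-Extended are robots.txt tokens rather than request user agents, so they are left out of the search:

grep -iE "gptbot|claudebot|perplexitybot|ccbot|bytespider" access.log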
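To see how much traffic each crawler generates rather than only which lines match, the same search can be done with a per-crawler count. This is a minimal sketch: the function name `count_crawler_hits` is hypothetical, and the sample log lines are fabricated for illustration (placeholder IPs, simplified user-agent strings). Google-Extended and Applebot-Extended are omitted because they are robots.txt tokens and do not show up as request user agents.

```python
import re
from collections import Counter

# Case-insensitive match on the crawler names that appear in request logs.
CRAWLER_PATTERN = re.compile(
    r"gptbot|claudebot|perplexitybot|ccbot|bytespider", re.IGNORECASE
)


def count_crawler_hits(lines) -> Counter:
    """Count log lines per crawler, keyed by the lowercased crawler name."""
    counts = Counter()
    for line in lines:
        match = CRAWLER_PATTERN.search(line)
        if match:
            counts[match.group(0).lower()] += 1
    return counts


# Fabricated sample lines in a combined-log-like layout.
SAMPLE_LOG = [
    '192.0.2.10 - - [15/Feb/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [15/Feb/2025:10:00:05 +0000] "GET /a HTTP/1.1" 200 734 "-" "CCBot/2.0"',
    '198.51.100.7 - - [15/Feb/2025:10:00:09 +0000] "GET /b HTTP/1.1" 200 120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [15/Feb/2025:10:00:12 +0000] "GET /c HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

for name, hits in count_crawler_hits(SAMPLE_LOG).most_common():
    print(name, hits)
```

For a real log, pass an open file instead of the sample list, for example `count_crawler_hits(open("access.log"))`.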

In Google Analytics 4, you can create a segment filtering by the user-agent dimension (if you collect it via a custom dimension or server-side tagging).

What to do next