How AI Chatbots Access Web Pages: A Deep Dive into Traffic Patterns

Developers and website operators increasingly want to know whether AI chatbots retrieve live pages or rely on pre-existing indexes when responding to user queries. To find out, I set up an nginx probe and prompted several leading chatbots to fetch pages in real time. This article analyzes the differences in AI traffic based on the resulting server logs.

Two Types of AI Traffic

AI traffic falls into two distinct types. In the first, the model itself (the provider side) fetches a web page to read its content. In the second, a user follows a link the chatbot suggested and loads the page in their own browser. Merging the two into a single "AI traffic" metric obscures valuable distinctions in the data. In the probe, each chatbot's requests carried a unique query string, which made them easy to tell apart in the logs. To ensure that temporary cache hits did not hide the retrieval paths, I repeated the queries across multiple sessions.
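As a rough illustration of the query-string tagging described above, the sketch below groups nginx access-log lines by a marker in the request URL. It assumes nginx's default `combined` log format and a hypothetical query parameter named `probe`; the article does not state the actual parameter name or log format used.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

# Assumption: nginx's default "combined" log format.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def group_by_probe(lines, param="probe"):
    """Group (ip, user_agent) pairs by the value of a query parameter.

    `param` is a hypothetical marker name chosen for this sketch.
    Lines that do not parse, or that lack the marker, are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue
        query = urlsplit(match.group("target")).query
        values = parse_qs(query).get(param)
        if values:
            groups[values[0]].append((match.group("ip"), match.group("ua")))
    return dict(groups)
```

Calling `group_by_probe(open("access.log"))` would return a dictionary keyed by marker value, so that each chatbot session can be inspected separately. The file name here is only an example.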

Approaches of Major Chatbots

The six chatbots tested differ in how, and whether, they identify themselves in their user agents. Here are the details of each chatbot's approach:

- **ChatGPT-User** tends to send simultaneous requests from multiple IP addresses, often fetching several candidate pages at once. Logs from the past 24 hours show requests originating from five different Azure IP ranges, consistent with OpenAI’s official documentation.

- **Claude-User** consistently requests /robots.txt before attempting to fetch any page, and its requests come from the IP range designated for Anthropic. This behavior is consistent with Anthropic's crawler documentation, which states that blocking Claude requires a Disallow rule under User-agent: Claude-User.

- **Perplexity-User** fetches pages directly, without sending an Accept header or a referrer. Perplexity can answer from its own index, but it can also retrieve pages in real time.

- **Gemini** produced requests only from a standard browser after a user clicked a link; no requests were initiated from the provider side. This distinction is crucial when measuring AI traffic.

- **Microsoft Copilot** retrieves pages using a standard Chrome user agent, with no observed activity from Bingbot. Consequently, positively identifying Copilot's requests in logs is difficult.

- **Grok** employs standard Mac Safari and Chrome user agents to fetch pages, lacking identifiable signals for xAI in its requests.
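The robots.txt rule mentioned for Claude-User can be checked locally before deploying it. The sketch below uses Python's standard `urllib.robotparser` to confirm that a rule set blocks `Claude-User` while leaving other agents alone. The rule text and `example.com` URL are illustrative assumptions, and Python's parser only approximates how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block Claude-User, allow everything else.
ROBOTS_TXT = """\
User-agent: Claude-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Claude-User", "SomeOtherAgent"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

A check like this only validates the rule syntax. Whether a given fetcher honors the file still has to be confirmed in the server logs.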

For Gemini, Copilot, and Grok, then, provider-side requests are either absent or indistinguishable from regular user visits. Log-based measurements of AI traffic are therefore at significant risk of overlooking these three.
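One way to act on these findings is to tag log entries by the user-agent tokens that do identify themselves and to count everything else separately. The following is a minimal sketch that assumes substring matching on the three tokens named above is sufficient; the function names and the `unidentified` label are choices made for this example, not part of any provider's documentation.

```python
from collections import Counter

# Tokens taken from the list above; matching is case-insensitive.
PROVIDER_TOKENS = {
    "ChatGPT-User": "OpenAI",
    "Claude-User": "Anthropic",
    "Perplexity-User": "Perplexity",
}


def classify_user_agent(user_agent):
    """Return the provider name for a self-identifying fetcher, else None."""
    lowered = user_agent.lower()
    for token, provider in PROVIDER_TOKENS.items():
        if token.lower() in lowered:
            return provider
    return None


def count_by_provider(user_agents):
    """Count requests per provider; unmatched agents go to 'unidentified'."""
    counts = Counter()
    for user_agent in user_agents:
        counts[classify_user_agent(user_agent) or "unidentified"] += 1
    return counts
```

Note that the `unidentified` bucket mixes ordinary visitors with any Gemini, Copilot, or Grok activity, which is exactly the blind spot described above. Because user-agent strings can be spoofed, a stricter version would also verify source IP ranges against each provider's published list.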

Meta's Approach

Meta appears to maintain its own index, much as Google does. Meta AI often returns information that does not exist on the live page, which is consistent with an index-based retrieval path. When Meta does fetch pages live, it sends requests identified as meta-webindexer/1.1. Meta's web crawler documentation also describes Meta-ExternalFetcher, a fetcher that serves user requests for AI features across Facebook, Messenger, Instagram, and WhatsApp.

Conclusion

This investigation clarifies the distinct methods AI chatbots use to access web pages and gives website operators a basis for measuring AI traffic more accurately. Knowing which chatbots identify themselves in their user agents, and which do not, is critical when analyzing website traffic.
