The open web is currently enduring a silent, resource-intensive siege. For decades, the social contract of the internet relied on a simple exchange: websites provided free information, and crawlers indexed that information to make it discoverable. But the rise of Large Language Models has fundamentally broken this equilibrium. Today, the hunger for training data has transformed the humble web crawler into a predatory scraper, leaving the administrators of open-knowledge repositories like the Wikimedia Foundation and game-specific hosting services such as Weird Gloop to fight a losing battle against invisible traffic that threatens to bankrupt their infrastructure.
The Architecture of AI Deception
The conflict has evolved far beyond the era of the polite bot. In the past, AI agents like GPTBot, ClaudeBot, and PerplexityBot operated with a degree of transparency, identifying themselves via the User-Agent header. This digital name tag allowed site administrators to implement straightforward blocklists or rate limits. However, a new generation of scrapers has abandoned this transparency in favor of sophisticated mimicry. These bots now spoof their request headers to appear as legitimate users browsing on the latest version of Google Chrome, effectively blending into the crowd of human traffic.
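To make the difference concrete, here is a minimal sketch of the kind of User-Agent blocklist that worked against self-identifying bots. The function name and the two sample header strings are illustrative assumptions, not taken from any real server configuration; only the three bot names come from the text above.

```python
# Hypothetical blocklist built from the bot names mentioned in the article.
SELF_IDENTIFIED_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def is_declared_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a known bot token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in SELF_IDENTIFIED_BOTS)


# Illustrative header values (assumed, not captured from real traffic).
# A transparent crawler announces itself and is caught by the substring match.
declared = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0)"
# A scraper that copies a browser string carries no bot token at all.
spoofed = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

print(is_declared_bot(declared))  # the declared bot is flagged
print(is_declared_bot(spoofed))   # the spoofed request is not
```

The check depends entirely on the client telling the truth, which is why header spoofing defeats it without any further effort.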
This camouflage is amplified by the use of residential proxies. Unlike data center IP addresses, which are clustered in known ranges and easily blocked, residential proxies route traffic through actual home internet connections. By routing requests through subscribers of ISPs like Comcast, AT&T, and Charter, some scrapers can rotate through as many as 1 million unique IP addresses in a single day. To a server administrator, this does not look like a single bot attacking the site; it looks like a million different people from a million different homes visiting on the same day. This makes traditional IP-based blocking impractical, as banning a residential IP might inadvertently block a real human user.
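A small sketch shows why per-IP limits stop working once traffic is spread across many addresses. The limiter below is a deliberately simplified assumption (a plain request count per address), and the addresses are placeholders drawn from reserved documentation and benchmarking ranges, not real residential IPs.

```python
import ipaddress
from collections import Counter


def blocked_ips(request_ips, per_ip_limit):
    """Return the set of addresses that sent more than per_ip_limit requests."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > per_ip_limit}


# Scenario 1: 10,000 requests from one address (203.0.113.0/24 is a documentation range).
single_source = ["203.0.113.7"] * 10_000

# Scenario 2: the same 10,000 requests, one per address.
# 198.18.0.0/15 is reserved for benchmarking and is used here only as a placeholder pool.
base = ipaddress.IPv4Address("198.18.0.0")
rotated = [str(base + i) for i in range(10_000)]

print(len(blocked_ips(single_source, per_ip_limit=100)))  # one address is blocked
print(len(blocked_ips(rotated, per_ip_limit=100)))        # nothing is blocked
```

The total load is identical in both scenarios, but in the second one no individual address ever crosses the threshold.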
Beyond IP rotation, scrapers are now using trusted third-party services as shields to bypass security filters. A common tactic involves routing requests through the Google Translate URL tool, which makes the traffic appear to originate from Google's own servers. Analysis reveals that 99.99% of requests coming through this specific translation tool come from malicious scrapers. Similar patterns have been observed with facebookexternalhit, the service Facebook uses for link previews. By hiding behind these reputable domains, scrapers can slip through security perimeters that would otherwise trigger an immediate block. The scale of this disruption is staggering: an estimated 95% of server issues in the wiki ecosystem this year have been attributed to these malicious scrapers.
The High Cost of Stupid Crawling
The technical devastation is a result not just of the volume of traffic, but of the inefficiency of the crawling methods. Consider the OSRS Wiki (Old School RuneScape Wiki). While the site contains roughly 40,000 highly useful documents, the total number of reachable URLs exceeds 1 billion. A disciplined crawler would follow a sitemap to extract the core content. Instead, many AI scrapers employ a brute-force approach, visiting every single link they encounter regardless of its value. This results in bots scraping edit screens, temporary draft pages, and obsolete version histories: data that is entirely useless for LLM training but devastating for server health.
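The gap between 40,000 documents and a billion URLs comes largely from query parameters that generate new views of the same page. As a rough sketch of what a disciplined crawler could do, the filter below skips URLs carrying the standard MediaWiki parameters for edit screens, history listings, diffs, and old revisions. The host name, page title, and function name are hypothetical, and a real crawler would still start from the sitemap rather than rely on a filter alone.

```python
from urllib.parse import parse_qs, urlsplit

# MediaWiki query parameters that lead to non-article views.
EXPENSIVE_ACTIONS = {"edit", "history"}
EXPENSIVE_PARAMS = ("diff", "oldid")


def is_core_content(url: str) -> bool:
    """Return True if the URL looks like a plain article view worth crawling."""
    query = parse_qs(urlsplit(url).query)
    # Diffs and old revisions are identified by the presence of these parameters.
    if any(param in query for param in EXPENSIVE_PARAMS):
        return False
    # Edit screens and history listings are selected through the action parameter.
    return not any(a in EXPENSIVE_ACTIONS for a in query.get("action", []))


# Placeholder URLs on a hypothetical wiki host.
urls = [
    "https://wiki.example/w/Example_page",
    "https://wiki.example/w/index.php?title=Example_page&action=edit",
    "https://wiki.example/w/index.php?title=Example_page&action=history",
    "https://wiki.example/w/index.php?title=Example_page&diff=123&oldid=122",
]
for url in urls:
    print(is_core_content(url), url)
```

Only the first URL passes; the other three are exactly the kinds of pages described above as useless for training and costly to serve.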
This inefficiency creates a massive disparity in computing costs. A standard page requested by a human user is typically served from a cache, resulting in a response time of less than 20 milliseconds. However, when a bot requests a diff page, which compares two different versions of a document, the request bypasses the cache entirely. The server must perform a real-time calculation to compare the texts, pushing processing times to 1 or 2 seconds. When these requests are combined with complex, junk query parameters, the cost to process a single bot request can be 50 to 100 times higher than that of a human request.
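The 50 to 100 times figure follows directly from the response times quoted above. The short calculation below uses 20 milliseconds as the cached baseline, which is the upper bound given in the text, so the true ratio for faster cache hits would be higher still.

```python
CACHED_MS = 20                # cached page: "less than 20 milliseconds"
DIFF_MS = (1_000, 2_000)      # uncached diff page: 1 to 2 seconds

# Cost of one diff request expressed in multiples of one cached request.
ratios = tuple(ms / CACHED_MS for ms in DIFF_MS)
print(ratios)  # (50.0, 100.0)
```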
This creates a volatile environment characterized by unpredictable traffic spikes. While the monthly volume of bot requests may average around 250 million, or roughly 100 requests per second, the actual experience is far more chaotic. Scrapers often launch bursts of over 1,000 requests per second, creating patterns that are indistinguishable from a Distributed Denial of Service (DDoS) attack. Even if bots account for only 50% of total CPU usage, they are responsible for 95% of actual service outages because they specifically target the most resource-intensive, uncacheable paths of the server architecture.
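The contrast between the monthly average and the bursts can be sketched in a few lines. The first function checks the arithmetic behind the "roughly 100 requests per second" figure, assuming a 30-day month. The `BurstDetector` class is a hypothetical sliding-window counter, shown only to illustrate how a burst above 1,000 requests per second stands out from steady traffic; it is not a description of any particular wiki's tooling.

```python
from collections import deque

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def average_rps(monthly_requests: int) -> float:
    """Convert a monthly request count into an average per-second rate."""
    return monthly_requests / SECONDS_PER_MONTH


class BurstDetector:
    """Flag moments when more than `threshold` requests arrive within `window` seconds."""

    def __init__(self, threshold: int, window: float = 1.0):
        self.threshold = threshold
        self.window = window
        self._times = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one request and return True if the window is over the threshold."""
        self._times.append(timestamp)
        # Discard requests that have fallen out of the sliding window.
        while timestamp - self._times[0] >= self.window:
            self._times.popleft()
        return len(self._times) > self.threshold


print(round(average_rps(250_000_000), 1))  # 96.5

detector = BurstDetector(threshold=1_000)
# Steady traffic: 100 requests spread evenly over one second.
steady = any(detector.observe(i / 100) for i in range(100))
# Burst: 1,500 requests packed into a single second, ten seconds later.
burst = any(detector.observe(10 + i / 1_500) for i in range(1_500))
print(steady, burst)  # False True
```

An average hides the bursts entirely: the same monthly total can arrive as a manageable trickle or as short spikes ten times above the mean.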
As technical defenses fail, some platforms have turned to the nuclear option: forced authentication. To lock out the bots, Fandom, a major wiki hosting platform, began requiring users to create an account and log in before viewing certain pages. The result was a catastrophic blow to the community's growth, with new contributor activity plummeting by approximately 40%. By building a wall to keep out the machines, the platform inadvertently blocked the very humans who keep the knowledge base alive.
Administrators are now attempting to move toward heuristic detection systems that analyze human-like request patterns, such as mouse movements or specific navigation rhythms. However, the financial barrier to entry is steep. Enterprise-grade bot detection tools often cost six figures annually, a price tag that is prohibitive for small, independent wikis. The industry is now looking toward standardized solutions, such as the crawling API provided by Cloudflare, which aims to funnel bot traffic into a controlled, manageable interface rather than allowing it to roam freely across the server's most vulnerable endpoints.



