The nonsense maze as the only way out
I got a question yesterday and I thought others might benefit from the answer. And I’m afraid I’m going to be talking about AI again.
The question was “How do I stop AI companies from scraping all the content from my website?”
Short answer: you mostly can’t.
Long answer: let’s dig in…
The classic way to do this is to add a file called robots.txt to your website. It politely asks bots and spiders not to visit certain pages on your site. It’s been around since the very early days of the web and works on the honour system: Google’s search engine respects it, Bing does too, and so do most well-behaved bots.
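To give you an idea of what that looks like, here’s a minimal robots.txt (the paths are made up for illustration):

    User-agent: *
    Disallow: /drafts/
    Disallow: /private/

That tells every crawler that identifies itself: please stay out of /drafts/ and /private/. Note the “please” – nothing actually enforces it.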
AI crawlers? Not so much. The big ones (OpenAI’s GPTBot, Anthropic’s ClaudeBot, Google’s Google-Extended) usually respect it. But there are tons of bots out there that don’t care: smaller AI operators, data brokers, random scrapers. Some of them crawl sites so aggressively that they’ve been known to knock them offline.
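If you do want to ask the well-behaved ones to stay away, you can name them explicitly in robots.txt. These user-agent names are correct as far as I know, but the list changes often, so check each company’s documentation:

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

Google-Extended is the switch for Google’s AI training, separate from regular Search indexing, and CCBot is Common Crawl, whose archives feed a lot of training datasets.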
These bots pretend to be human visitors: they rotate IP addresses, send the user-agent strings of ordinary browsers, and do everything they can to blend in with normal traffic… Your nice “No Trespassing” sign isn’t going to stop someone who’s already climbing over the wall.
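One thing that does help against the aggressive ones, whatever they claim to be, is rate limiting at the server. Here’s a sketch for nginx (the zone name and the limits are arbitrary – tune them for your own traffic):

    # Track requests per client IP; allow ~10 requests/second on average.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        location / {
            # Allow short bursts, then start rejecting anyone hammering the site.
            limit_req zone=perip burst=20 nodelay;
        }
    }

This won’t identify AI bots as such, but it stops any single client, human or not, from flattening your server.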
There are some solutions that do mostly work, but they’re not trivial to set up and they add overhead. Tools like Anubis sit in front of your site and weed out bots, in Anubis’s case by making each visitor solve a small proof-of-work challenge before the page is served. Others, like Nepenthes or Cloudflare’s AI Labyrinth, don’t just block a detected bot: they redirect it into an endless maze of auto-generated nonsense. The bot then wastes resources crawling all that junk and, with any luck, absorbs it into its training data.
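To make the maze idea concrete, here’s a toy version in Python. It’s a sketch of the concept only, not how Anubis or the real tarpits are built: every URL returns gibberish plus links to more URLs that also don’t exist, so a link-following crawler never reaches the end.

    # A toy "nonsense maze": every GET request, whatever the path, returns a
    # page of random gibberish with links to more random paths. A crawler
    # that follows links never runs out of pages to fetch.
    import random
    import string
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # A small pool of made-up "words" to build pages and link paths from.
    WORDS = ["".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9)))
             for _ in range(500)]

    def nonsense_page():
        text = " ".join(random.choices(WORDS, k=80))
        links = "".join(
            f'<a href="/{random.choice(WORDS)}/{random.choice(WORDS)}">more</a> '
            for _ in range(10)
        )
        return f"<html><body><p>{text}</p>{links}</body></html>"

    class MazeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = nonsense_page().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), MazeHandler).serve_forever()

A real deployment would only route suspected bots into the maze and serve the genuine site to everyone else – that detection step is the hard part.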
Cloudflare also offers an AI-bot blocking service, built on behaviour observed across the millions of sites that already sit behind their infrastructure. It won’t catch them all, but it will probably filter out the worst offenders.
But here’s the flip side of this issue: more and more people use AI chatbots as their primary search engine (that’s a whole other can of worms – don’t get me started). They type their questions into ChatGPT or Perplexity instead of Google. If these systems can’t access your content, they won’t reference it, summarise it, or point people to it.
If you depend on being found online, blocking AI bots might make you invisible to a whole segment of your potential audience.
So you need to figure out what matters most to you: protecting your content from being ingested without consent, or staying visible to people who no longer use “traditional” search?
There’s no true, clean answer here. Only trade-offs.
Colin