Don’t block Common Crawl! They’re a charity and do great work producing an open ...

JohnFen · on Dec 28, 2023

If you want to avoid having your site contents used to train AI and don't want to make your website unavailable to the open web, then blocking Common Crawl (among others) is absolutely mandatory.

We have very few tools to protect ourselves here, and need to make the most of the ones we have.

flir · on Dec 28, 2023

I want to publish it on the internet, but I want to keep control of it.

Honestly, it seems like we've been here before. I hope y'all have got your right-click-disable scripts in place too.

__loam · on Dec 28, 2023

This is kind of a question of etiquette and legality more than it is a question of what is technically possible. People can and do ignore robots.txt. That doesn't mean copyright infringement is now legal.

flir · on Dec 28, 2023

Doesn't mean the chatbots are infringing copyright, either. Or even that copyright is respected worldwide.

It's going to be an interesting half-decade.

dylan604 · on Dec 28, 2023

People steal cars and break into houses. Both have their versions of locks and alarms, and those are just as useless against a skilled "attacker" as a right-click-disable fix on a website.

dzhiurgis · on Dec 28, 2023

Why use a deliberately open standard then? Create an app like every other successful monopolistic megalomaniac?

JohnFen · on Dec 28, 2023

Some people want to make their stuff freely available to people, but don't want it to be used to train AI. That's the group my comment was talking about. No monopolistic megalomaniacs here.

Personally, I think that such efforts are pointless -- all it takes is one crawler to make it through your defenses and you may as well have not done a thing. The alternative, though, is to remove your sites from the open web entirely. Either way, Common Crawl needs to be excluded if you want to avoid your stuff being used to train AI.
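
For anyone wondering what excluding Common Crawl looks like in practice, here is a minimal sketch, assuming you rely on robots.txt and that the crawler honors it (as noted elsewhere in the thread, not every crawler does). Common Crawl's crawler identifies itself as `CCBot`. The robots.txt contents, the `is_allowed` helper, and the example.com URLs are hypothetical and only for illustration; the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow Common Crawl's crawler ("CCBot")
# everywhere, and leave every other user agent unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Illustrative helper (not part of any library): it parses the rules
    from a string instead of fetching them over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "SomeOtherBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/post")
        print(agent, "allowed" if verdict else "blocked")
```

This only expresses a request. It does nothing against a crawler that ignores robots.txt, which is the limitation discussed above.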

mistrial9 · on Dec 28, 2023

Yeah, there are more cases than those alluded to:

https://apnews.com/article/dungeons-dragons-ai-artificial-in...

A variety of stakeholders are involved, not just "individuals doing pointless things". New tech is making winners and losers right now.

yellowpencil · on Dec 28, 2023

Them being a charity isn't an automatic license to use anything on the internet?

alphabettsy · on Dec 28, 2023

Including all of the other crawlers. Interesting dilemma.