
Don’t block Common Crawl! They’re a charity and do great work producing an open dataset that everyone can use.


If you want to avoid having your site contents used to train AI and don't want to make your website unavailable to the open web, then blocking Common Crawl (among others) is absolutely mandatory.

We have very few tools to protect ourselves here, and need to make the most of the ones we have.


I want to publish it on the internet, but I want to keep control of it.

Honestly, it seems like we've been here before. I hope y'all have got your right-click-disable scripts in place too.


This is kind of a question of etiquette and legality more than it is a question of what is technically possible. People can and do ignore robots.txt. That doesn't mean copyright infringement is now legal.


Doesn't mean the chatbots are infringing copyright, either. Or even that copyright is respected worldwide.

It's going to be an interesting half-decade.


People steal cars and break into houses, and both have their own versions of locks and alarms. Against a skilled "attacker", those are just as useless as a right-click-disable fix is on a website.


Why use a deliberately open standard then? Create an app like every other successful monopolistic megalomaniac?


Some people want to make their stuff freely available to people, but don't want it to be used to train AI. That's the group my comment was talking about. No monopolistic megalomaniacs here.

Personally, I think that such efforts are pointless -- all it takes is one crawler to make it through your defenses and you may as well have not done a thing. The alternative, though, is to remove your sites from the open web entirely. Either way, Common Crawl needs to be excluded if you want to avoid your stuff being used to train AI.
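
For anyone who wants to try the exclusion, here is a rough sketch of what it might look like. It assumes the block is done through robots.txt and that Common Crawl's crawler identifies itself as CCBot; the sample rules, the example.com URL and the is_allowed helper are made up for illustration. It only checks what a well-behaved crawler would conclude from the file, and does nothing about crawlers that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Common Crawl's CCBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-following crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("Other crawler allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

The same check could be pointed at a live file with set_url() and read() instead of parse(), if you would rather test what your site actually serves.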


yeah - more cases than those alluded to ..

https://apnews.com/article/dungeons-dragons-ai-artificial-in...

a variety of stakeholders involved - not just "individuals doing pointless things" .. new tech is making winners and losers right now.


Them being a charity isn't an automatic license to use anything on the internet?


Including all of the other crawlers. Interesting dilemma.



