baldur

“Silence Isn’t Consent”

fgtech

@baldur If you can build a web-crawling bot, then you can be expected to check for a robots.txt and respect its contents. Maybe it’s time for legal repercussions against those who ignore this universally known opt-out mechanism.
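
Checking robots.txt is a solved problem in most languages. A minimal sketch using Python’s standard-library parser (the rules and bot names here are hypothetical, for illustration only):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule for a named training crawler,
# a fallback rule for everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler consults can_fetch() before every request.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("SearchBot", "https://example.com/article"))    # True
print(rp.can_fetch("SearchBot", "https://example.com/private/x"))  # False
```

The check is a handful of lines, which is part of fgtech’s point: ignoring it is a choice, not an oversight.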

odd

@fgtech That would be great!

fgtech

@odd At the very least it would be instructive to learn who objects. The “ignore robots.txt” lobby and its funders would be interesting to inspect and catalog out in the daylight.

odd

@fgtech Yes, definitely. I think there are very few good reasons why someone would slurp up all the images/content on a site that prefers you not to.

baldur

@fgtech The problem is that having your site indexed by a search engine and having it pulled by a dozen outfits a day collecting training data for their ML models are qualitatively different things, yet both are governed by the same robots.txt.

fgtech

@baldur Doesn’t the format of robots.txt cover that case? It is possible to specify different rules for different bots. If there are missing features it would be better to focus efforts on revising one mechanism than inventing whole new opt-out methods, wouldn’t it?

baldur

@fgtech Does robots.txt let you block by use case? Otherwise you’d have to preemptively block a potentially infinite list of user agents.

Also inclusion in an ML training data set should be opt-in, especially if the download utility in question wants to comply with the GDPR.
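
Blocking by use case isn’t expressible in robots.txt itself; the workaround today is exactly the per-agent listing baldur describes. A sketch (GPTBot and CCBot are real crawler tokens, but any such list is necessarily incomplete):

```
# Block known ML-training crawlers by name.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else (e.g. search indexers) may crawl.
User-agent: *
Disallow:
```

Each newly launched crawler has to be added by hand, which is the “potentially infinite list” problem.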

fgtech

@baldur The robots.txt format is admittedly clunky. You can disallow access to file paths for specific bots, or for any bot. There is no “allow” format specified, but it seems like something that could be readily added.

fgtech

@baldur If you want to get into allowing specific uses rather than blocking specific crawlers, then I think we are into the realm of copyright. Creative Commons could be a model to consider. I like the way they came up with “modular” licenses so you could express your preferences with legal force.
