baldur

“Silence Isn’t Consent”

fgtech

@baldur If you can build a web-crawling bot, then you can be expected to check for a robots.txt and respect its contents. Maybe it’s time for legal repercussions against those who ignore this universally known opt-out mechanism.
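
Checking robots.txt is a solved problem in most languages. A minimal sketch using Python’s standard-library parser (the rules and bot names here are hypothetical, for illustration only):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule for a named training crawler,
# a fallback rule for everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler consults can_fetch() before every request.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("SearchBot", "https://example.com/article"))    # True
print(rp.can_fetch("SearchBot", "https://example.com/private/x"))  # False
```

The check is a handful of lines, which is part of fgtech’s point: ignoring it is a choice, not an oversight.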

odd

@fgtech That would be great!

fgtech

@odd At the very least it would be instructive to learn who objects. The “ignore robots.txt” lobby and its funders would be interesting to inspect and catalog out in the daylight.

odd

@fgtech Yes, definitely. I think there are very few good reasons why someone would slurp up all the images/content on a site that prefers you not to.

baldur

@fgtech The problem is that having your site indexed by a search engine and having it pulled by a dozen outfits a day collecting training data for their ML models are qualitatively different things, yet both are governed by the same robots.txt.

fgtech

@baldur Doesn’t the format of robots.txt cover that case? It is possible to specify different rules for different bots. If there are missing features it would be better to focus efforts on revising one mechanism than inventing whole new opt-out methods, wouldn’t it?

baldur

@fgtech Does robots.txt let you block by use case? Otherwise you’d have to preemptively block a potentially infinite list of user agents.

Also inclusion in an ML training data set should be opt-in, especially if the download utility in question wants to comply with the GDPR.
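
Blocking by use case isn’t expressible in robots.txt itself; the workaround today is exactly the per-agent listing baldur describes. A sketch (GPTBot and CCBot are real crawler tokens, but any such list is necessarily incomplete):

```
# Block known ML-training crawlers by name.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else (e.g. search indexers) may crawl.
User-agent: *
Disallow:
```

Each newly launched crawler has to be added by hand, which is the “potentially infinite list” problem.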

fgtech

@baldur The robots.txt format is admittedly clunky. You can disallow access to file paths for specific bots, or for any bot. There is no “allow” format specified, but it seems like something that could be readily added.

fgtech

@baldur If you want to get into allowing specific uses rather than blocking specific crawlers, then I think we are into the realm of copyright. Creative Commons could be a model to consider. I like the way they came up with “modular” licenses so you could express your preferences with legal force.
