manton

I was wondering when someone would bring up that the Internet Archive ignores robots.txt, and today’s Stratechery is the first I’ve seen to raise that point. I view most debates through the lens of what is good for the open web. There’s now so much of an anti-AI undercurrent that we risk over-correcting.

jmanes

@manton The way I see it, what is done with the data is what matters. Archiving public data as history is fine; ripping off artists by copying their unique style and then selling that functionality to folks is another deal entirely.

samgrover

@manton Intention matters. One is a library of the open web, while the other is exploitative. A crude analogy would be conservation vs mining.

manton

@samgrover I agree intention matters, but I don't think it's clear that one is exploitative, or at least not any more exploitative than Google's crawling. Personally, I think any crawling, including the Internet Archive's, should probably respect robots.txt.

SimonWoods

@manton @samgrover @jmanes The word consent has a definition. Many people on the open web — and tech in general — have clearly forgotten that.

samgrover

@manton Google crawling (at least when it started) was a decent deal and not exploitative, IMHO. They helped others find your site and helped you find theirs. I guess what I'm saying is that if your intention is to profit, give something in return or ask permission. And in either scenario, be OK with people saying no. And yes, I'm lenient towards a non-profit and hold them to a different standard.

manton

@samgrover Yep. In a way, well-defined user agents and robots.txt are a convention for that "saying no" part.
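
For example, a robots.txt along these lines (a minimal sketch; GPTBot and CCBot are the published crawler tokens for OpenAI and Common Crawl) opts those two crawlers out while leaving everything else alone:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

A rule like this only works if the crawler identifies itself honestly and honors the file, which is exactly why well-defined user agents matter.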

pratik

@samgrover @manton I get what you are saying, but intentions can differ, right? What one considers exploitative or beneficial can mean different things to different people. If "consent" were the default, Google and other search engines couldn't have existed until every website they crawl had first consented to being indexed. The web was built around opt-out rather than opt-in, so yes, holding AI companies' feet to the fire on respecting robots.txt and having well-defined user agents should be the way to go. Perhaps legislation that imposes big fines if they don't would act as a deterrent. Expecting corporations to act responsibly is always a losing proposition.

manton

@jmanes @SimonWoods There are a lot of questions around art in particular, which is why Adobe (for example) only trains their AI on art they already have a license to. It's possible that the rules for art and text will end up being different.

davereed

@manton I'm only partway through the latest Core Intuition, but I do see Viticci's (MacStories) point. He makes money from ads (and some direct support), so if AI scrapes his site and gives answers that are basically its content, he gets nothing. Search engines (at least historically) provided a link you would click, so you ended up at the site and the ad revenue followed. Of course, now search engines tend to show summaries or more detail, so you don't need to click.

manton

@davereed Yeah, it's a problem for ads especially. Even without AI, Google referrals are apparently down across the web. I hope MacStories can navigate this successfully.
