manton

MacStories has written an open letter asking for AI regulation:

…a wide swath of the tech industry, including behemoths like Apple, Google, Microsoft, and Meta, have joined OpenAI, Anthropic, and Perplexity, to ingest the intellectual property of publishers without their consent, and then used that property to build their own commercial products.

This letter is a great idea. We need regulation and an update to copyright law. I don’t like the repeated use of the word “theft”, though. It risks oversimplifying the gray areas in LLM training (and Google crawling).

markstoneman

@manton Federico Viticci's initial protest got me thinking about those "gray areas" because copyright as currently understood is inadequate to the LLM challenge (https://markstoneman.com/2024/06/13/what-can-we.html). Still, copyright is a good place to start because property rights are something lawmakers tend to understand.

fgtech

@manton I know we differ on this point, but I really have a hard time seeing the same expanse of gray here.

Setting aside their "snippets" for the moment, Google's index could not be used to reconstruct any web pages. It exists to help people find a page.

An AI model, in contrast, captures and retains a version of everything it has processed. The very purpose of the AI model is to provide replacement text, exclusively for its paid users, generated from its version of what was ingested without compensation to those sources. This is very close to theft.
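To make that contrast concrete, here is a minimal sketch (purely illustrative, with made-up URLs, and not how Google or any AI lab actually implements things): a search index is a data structure of pointers back to pages, while a trained model is an array of numbers with no pointers in it.

```python
# A toy inverted index: every term maps back to the pages that contain it.
# Provenance is the data structure itself; results are links, not text.
from collections import defaultdict

index = defaultdict(set)

def add_page(url, text):
    for term in text.lower().split():
        index[term].add(url)

def search(term):
    return index[term.lower()]  # returns source URLs

add_page("https://example.com/a", "LLM training and copyright")
add_page("https://example.com/b", "copyright law for publishers")
print(search("copyright"))
# {'https://example.com/a', 'https://example.com/b'}

# A trained language model, by contrast, is only an array of weights:
#   weights = [0.0213, -1.4071, 0.5562, ...]
# There are no URLs or per-document records in it to point back to.
```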

renevanbelzen

@manton If "theft" is used inappropriately, do you know a better word, or does one have to be invented? Years ago we had a similar case of copyright infringement, when old-style news outlets "stole" videos by reuploading them from the original creators and slapping their own branding on them, without credit. The word "freebooting" was introduced by CGP Grey and Brady Haran, to whom this happened.

manton

@fgtech I guess how we define "replacement text" matters. I don't see most generative AI as an actual replacement, in the same way that CliffsNotes guides are not a replacement for the original novels and are allowed under fair use. The complication (and gray area) is that some AI does go too far and some seems fine, so it's hard to make a blanket statement that it's all inherently illegal.

manton

@renevanbelzen Huh, I didn't know about freebooting. Thanks for the insight. I don't think a single word is possible… there's just too much variety in how AI works, and in what seems like fair use versus what is a rip-off. I think that's my objection to "theft": it simplifies the issue too far.

fgtech

@manton Well said. I also do not want to forgo a future that has useful AI technology.

What is being built and deployed right now is completely the wrong path to that future and needs to be stopped in its tracks so it can be restarted. I think that is the crux of our different viewpoints. Based on what I know of this technology, there is no patch, band-aid, or tweak that will bend OpenAI's approach in a productive direction.

manton

@fgtech Yeah, I'm more optimistic that there is a way forward with the current technology. Part of the issue is trust, and OpenAI has lost that trust with many people. It won't be easy to regain.

fgtech

@manton You are making me realize that my primary objection is indeed not the technology itself but how it is being wielded by OpenAI. And by others, of course, but everyone looks to OpenAI as a leader, and they are completely untrustworthy. We need clear boundaries set, but we cannot expect them to participate in that process in good faith given their past behavior.

fgtech

@manton But this should not become too much about OpenAI. What I mean by "replacement text" is anything presented in answer to a query that obfuscates the source, making it difficult to link back. The design of the deep neural networks that power LLMs makes this very hard. But Google's practice of providing an "answer" to a query in place of a link to its source also runs afoul of this principle.

manton

@fgtech I agree with that. I hope crediting sources can be solved in LLMs. All the new licensing deals between OpenAI and WSJ, Time, Vox, etc. will surely push them in that direction.

fgtech

@manton The deep neural networks used for LLMs have no way to link back to source material. The training process deliberately "fuzzes" the inputs to help generalize what can be learned from them. Developing a modification to the algorithms that would make linking back possible would be a real contribution to the field, but nobody has achieved that. Fundamental breakthroughs would be needed, and I don't see it happening within 10 years.
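For anyone curious why that is, here is a toy sketch (a deliberately simplified illustration with made-up numbers and URLs, not any real training pipeline): every update blends noisy gradients from a shuffled batch of sources, and only the summed weights are kept.

```python
import random

# Toy "model": four weights trained by accumulated, noisy updates.
weights = [0.0] * 4

# Pretend each document contributes a small gradient during training.
corpus = {
    "https://example.com/a": [0.1, -0.2, 0.0, 0.3],
    "https://example.com/b": [0.0, 0.1, -0.1, 0.2],
    "https://example.com/c": [-0.3, 0.0, 0.2, 0.1],
}

for step in range(1000):
    batch = random.sample(list(corpus.values()), k=2)  # shuffled mini-batch
    for grad in batch:
        noisy = [g + random.gauss(0, 0.01) for g in grad]  # the "fuzzing"
        weights = [w + 0.01 * g for w, g in zip(weights, noisy)]

print(weights)
# Only the blended sums survive. Nothing in `weights` records which
# document moved which parameter, so "linking back" would mean
# un-mixing millions of summed, noisy updates after the fact.
```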

fgtech

@manton The people at Time, Vox, etc. are probably being hoodwinked. It's a bad scene.

pratik

@fgtech @manton

"Developing a modification to the algorithms that would make linking back possible would be a real contribution to the field, but nobody has achieved that."

So if this problem is addressed, are we fine with AI and AGI?

"Fundamental breakthroughs would be needed, and I don't see it happening within 10 years."

Maybe. But let's look back at how much we (the casual observers) knew about the capabilities of AI and AGI even in 2021.

fgtech

@pratik

"let's look back at how much we (the casual observers) knew about the capabilities of AI and AGI even in 2021"

I have done work in the field. This paper used natural language processing techniques common in 2007. And this paper from 2017 makes use of deep neural networks for analyzing biological data.

Despite all of the hype, not much has changed in the methods since 2017. The main advance has been scaling up the size of the models (the first L in LLMs) and fine-tuning them. I am not knocking the hard work of the people at OpenAI, but they got where they are mainly by taking very large amounts of data without asking for permission. @manton

pratik

@fgtech Got it. Sorry for doubting your credentials. And yes, about the following, I agree:

"They got where they are mainly by taking very large amounts of data without asking for permission."

I'm not sure how to put the genie back in the bottle. As I mentioned earlier, it's like saying America got rich on the backs of slaves. It's true, but I don't think America is ever going to give that back and have a do-over.

fgtech

@pratik When a violation occurs, shouldn't we seek justice instead of throwing up our hands and saying "oh, well"?

Also: are you sure you want to equate taking data from people on the web with the history of slavery and the need for reparations? These are very different things and trying to draw a comparison feels like standing on the shoulders of straw men.

Even if I take your point, shouldn’t we have learned how to do better by now? A new wrong would not be justified by an old wrong, even if they were comparable.

fgtech

@pratik Oh, about the credentials: I never gave you any reason to think I had expertise, so no need to apologize. That work was long ago, but I have remained very interested in machine learning as a tool and keep tabs on what's going on. It dismays me how poorly the hype reflects reality.

pratik

@fgtech I was mostly making an analogy about theft of labor. The scales are, of course, vastly different. Some countries are still waiting for their antiquities to be returned by the British, in case we want to talk about literal theft of art.

But my point remains, now what? Should AI be banned? Should all models be wiped, and should we start over? The answer to that is maybe yes.

fgtech

@pratik I think it would be fair for anyone whose work was included in the training data to have a say in what they want done with it. Let's see the list of sources that OpenAI used, as a starting point.

pratik

@fgtech That's fair.

fgtech

@pratik If the potential is really as big a benefit to the public as people claim, then let's make the model-building a community effort, on terms people agree with. People who want their work included might even help with curation to improve the quality of the result, and that might also help filter out some of the nastier influences that clearly crept in because of the careless way the data were scraped together.
