Report: Potential NYT lawsuit could force OpenAI to wipe ChatGPT and start over : technology

[–] SatanicNotMessianic@lemmy.ml 58 points 1 year ago (4 children)

While that’s understandable, I think it’s important to recognize that this is something where we’re going to have to tread pretty carefully.

If a human wants to become a writer, we tell them to read. If you want to write science fiction, you should study the craft of writing, ranging from plots and storylines to character development to Stephen King’s advice on avoiding adverbs. You also have to read science fiction so you know what has been done, how the genre handles storytelling, what is allowed versus shunned, and how the genre evolved and where it’s going. The point is not to write exactly like Heinlein (god forbid), but to throw Heinlein into the mix with other classic and contemporary authors.

Likewise, if you want to study fine art, you do so by studying other artists. You learn about composition, perspective, and color by studying works of other artists. You study art history, broken down geographically and by period. You study DaVinci’s subtle use of shading and Mondrian’s bold colors and geometry. Art students will sit in museums for hours reproducing paintings or working from photographs.

Generative AI is similar. Being software (and at a fairly early stage at that), it’s both more naive and in some ways more powerful than human artists. Once trained, it can crank out a hundred paintings or short stories per hour, but some of the people will have 14 fingers and the stories might be formulaic and dull. AI art is always better when glanced at on your phone than when looked at in detail on a big screen.

In both the cases of human learners and generative AI, a neural network(-like) structure is being conditioned to associate weights between concepts, whether it’s how to paint a picture or how to create one by using 1000 words.

A friend of mine who was an attorney used to say “bad facts make bad law.” It means that misinterpretation, over-generalization, politicization, and a sense of urgency can make for both bad legislation and bad court decisions. That’s especially true when the legislators and courts aren’t well educated in the subjects they’re asked to judge.

In a sense, it’s a new technology that we don’t fully understand - and by “we” I’m including the researchers. It’s theoretically and in some ways mechanically grounded in old technology that we also don’t understand - biological neural networks and complex adaptive systems.

We wouldn’t object to a journalism student reading articles online to learn how to write like a reporter, and we rightfully feel anger over the situation of someone like Aaron Swartz. As a scientist, I want my papers read by as many people as possible. I’ve paid thousands of dollars per paper to make sure they’re freely available and not stuck behind a paywall. On the other hand, I was paid while writing those papers. I am not paid for the paper, but writing the paper was part of my job.

I realize that is a case of the copyright holder (me) opening up my work to whoever wants a copy. On the other other hand, we would find it strange if an author forbade their work being read by someone who wants to learn from it, even if they want to learn how to write. We live in a time where technology makes things like DRM possible, which attempts to make it difficult or impossible to create a copy of that work. We live in societies that will send people to prison for copying literal bits of information without a license to do so. You can play a game, and you can make a similar game. You can play a thousand games, and make one that blends different elements of all of them. But if you violate IP, you can be sued.

I think that’s what it comes down to. We need to figure out what constitutes intellectual property and what rights go with it. What constitutes cultural property, and what rights do people have to works made available for reading or viewing? It’s easy to say that a company shouldn’t be able to hack open a paywall to get at WSJ content, but does that also go for people posting open access to Medium?

I don’t have the answers, and I do want people treated fairly. I recognize the tremendous potential for abuse of LLMs in generating viral propaganda, and I recognize that in another generation they may start making a real impact on the economy in terms of dislocating people. I’m not against legislation. I don’t expect the industry to regulate itself, because that’s not how the world works. I’d just like for it to be done deliberately and realistically and with the understanding that we’re not going to get it right and will have to keep tuning the laws as the technology and our understanding continue to evolve.

[–] chaircat@lemdro.id 7 points 1 year ago

This is an astonishingly well written, nuanced, and level headed response. Really on a level I'm not used to seeing on this platform.

[–] PlushySD@lemmy.world 2 points 1 year ago

Well written sir.

[–] Swervish@lemmy.ml 41 points 1 year ago* (last edited 1 year ago) (5 children)

Not trying to argue or troll, but I really don't get this take, maybe I'm just naive though.

Like yea, fuck Big Data, but...

Humans do this naturally, we consume data, we copy data, sometimes for profit. When a program does it, people freak out?

edit well fuck me for taking 10 minutes to write my comment, seems this was already said and covered as I was typing mine lol

[–] QHC@lemmy.world 17 points 1 year ago (1 children)

It's just a natural extension of the concept that entities have some kind of ownership of their creation and thus some say over how it's used. We already do this for humans and human-based organizations, so why would a program not need to follow the same rules?

[–] FaceDeer@kbin.social 18 points 1 year ago

Because we don't already do this. In fact, the raw knowledge contained in a copyrighted work is explicitly not copyrighted, and people can use it as they please. Only the specific expression of that knowledge can be copyrighted.

An AI model doesn't contain the copyrighted works that went into training it. It only contains the concepts that were learned from them.

[–] lily33@lemm.ee 32 points 1 year ago* (last edited 1 year ago) (10 children)

No.

A pen manufacturer should not be able to decide what people can and can't write with their pens.
A computer manufacturer should not be able to limit how people use their computers (I know they do - especially on phones and consoles - and seem to want to do this to PCs too now - but they shouldn't).
In that exact same vein, writers should not be able to tell people what they can use the books they purchased for.

.

We 100% need to ensure that automation and AI benefits everyone, not a few select companies. But copyright is totally the wrong mechanism for that.

[–] BURN@lemmy.world 29 points 1 year ago (1 children)

A pen is not a creative work. A creative work is much different than something that’s mass produced.

Nobody is limiting how people can use their pc. This would be regulations targeted at commercial use and monetization.

Writers can already do that. Commercial licensing is a thing.

[–] lily33@lemm.ee 11 points 1 year ago (4 children)

Nobody is limiting how people can use their pc. This would be regulations targeted at commercial use and monetization.

... Google's proposed Web Integrity API seems like a move in that direction to me.

But that's beside the point. I was trying to establish the principle that people who make things shouldn't be able to impose limitations on how these things are used later on.

A pen is not a creative work. A creative work is much different than something that’s mass produced.

Why should that difference matter, in particular when it comes to the principle I mentioned?

[–] Rottcodd@kbin.social 8 points 1 year ago

Why should that difference matter, in particular when it comes to the principle I mentioned?

Because creative works are rather obviously fundamentally different from physical objects, in spite of a number of shared qualities.

Like physical objects, they can be distinguished one from another - the text of Moby Dick is notably different from the text of Waiting for Godot, for instance.

More to the point, like physical objects, they're products of applied labor - the text of Moby Dick exists only because Herman Melville labored to bring it into existence.

However, they're notably different from physical objects insofar as they're quite simply NOT physical objects. The text of Moby Dick - the thing that Melville labored to create - really exists only conceptually. It's of course presented in a physical form - generally as a printed book - but that physical form is not really the thing under consideration, and more importantly, the thing to which copyright law applies (or in the case of Moby Dick, used to apply). The thing under consideration is more fundamental than that - the original composition.

And, bluntly, that distinction matters and has to be stipulated because selectively ignoring it in order to equivocate on the concept of rightful property is central to the NoIP position, as illustrated by your inaccurate comparison to a pen.

Nobody is trying to control the use of pens (or computers, as they were being compared to). The dispute is over the use of original compositions - compositions that are at least arguably, and certainly under the law, somebody else's property.

[–] walrusintraining@lemmy.world 7 points 1 year ago (3 children)

It’s not like AI is using works to create something new. Chatgpt is similar to if someone were to buy 10 copies of different books, put them into 1 book as a collection of stories, then mass produce and sell the “new” book. It’s the same thing but much more convoluted.

[–] PupBiru@kbin.social 3 points 1 year ago

it’s not even close to that black and white… i’d say it’s a much more grey area:

possibly that you buy a bunch of books by the same author and emulate their style… that’s perfectly acceptable until you start using their characters

if you wrote a research paper about the linguistic and statistical information that makes up an author’s style, that also wouldn’t be a problem

so there’s something beyond just the author’s “style” that they think is being infringed. we need to sort out exactly where the line is. what’s the extension to these 2 ideas that makes training an LLM a problem?

[–] lily33@lemm.ee 2 points 1 year ago (1 children)

Except it's not a collection of stories, it's an amalgamation - and at a very granular level at that. For instance, take the beginning of a sentence from the middle of the first book, then switch to a sentence in the 3rd, then finish with another part of the original sentence. Change some words here and there, add one for good measure (based on some sentence in the 7th book). Then fix the grammar. All the while, keep track of whether there's some continuity between the sentences you're stringing together.

That counts as "new" for me. And a lot of stuff humans do isn't more original.

[–] legion02@lemmy.world 4 points 1 year ago

The maybe bigger argument against free-rein training is that you're attributing personal rights to a language model. Also, even people aren't completely free to derive things from memory (legally), which is why clean-room design is a thing.

[–] Veraxus@kbin.social 1 points 1 year ago* (last edited 1 year ago)

Chatgpt is similar to if someone were to buy 10 copies of different books, put them into 1 book as a collection of stories, then mass produce and sell the “new” book

That is not even close to correct. LLMs are little more than massively complex webs of statistics. Here’s a basic primer:

https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/

[–] BURN@lemmy.world 3 points 1 year ago* (last edited 1 year ago) (1 children)

Google web integrity is very much different than what I’m proposing. “Nobody” was more in relation to regulating this.

I hold the opposite opinion in that creatives (I’d almost say individuals only, no companies) own all rights to their work and can impose any limitations they’d like on (edit: commercial) use. Current copyright law doesn’t extend quite that far though.

A creative work is not a reproducible, quantifiable product. No two are exactly alike until they’re mass produced.

Your analogy works more with a person rather than a pen, in that why is it ok when a person reads something and uses it as inspiration and not a computer? This comes back around to my argument about transformative works. An AI cannot add anything new, only guess based on historical knowledge. One of the best traits of the human race is our ability to be creative and bring completely new ideas.

Edit: added in a commercial use specifier after it was pointed out that the rules over individuals would be too restrictive.

[–] yokonzo@lemmy.world 2 points 1 year ago

I can see your argument; it’s just that your metaphor wasn’t very strong, and I think it made things a bit confusing.

[–] fkn@lemmy.world 10 points 1 year ago (4 children)

You made two arguments for why they shouldn't be able to train on the work for free and then said that they can with the third?

Did openai pay for the material? If not, then it's illegal.

Additionally, copyright and trademarks and patents are about reproduction, not use.

If you bought a pen that was patented, then made a copy of the pen and sold it as yours, that's illegal. This is the analogy of what openai is doing with books.

Plagiarism and reproduction of text is the part that is illegal. If you take the "ai" part out, what openai is doing is blatantly illegal.

[–] lily33@lemm.ee 4 points 1 year ago* (last edited 1 year ago) (4 children)

Just now, I tried to get Llama-2 (I'm not using OpenAI's stuff cause they're not open) to reproduce the first few paragraphs of Harry Potter and the Philosopher's Stone, and it didn't work at all. It created something vaguely resembling it, but with lots of made-up stuff that doesn't make much sense. I certainly can't use it to read the book or pirate it.

[–] fkn@lemmy.world 1 points 1 year ago* (last edited 1 year ago) (2 children)

Openai:

I'm sorry, but I can't provide verbatim excerpts from copyrighted texts. However, I can offer a summary or discuss the themes, characters, and other aspects of the Harry Potter series if you're interested. Just let me know how you'd like to proceed!

That doesn't mean the copyrighted material isn't in there. It also doesn't mean that the unrestricted model can't.

Edit: I did get it to tell me that it does have the verbatim text in its data.

I can identify verbatim text based on the patterns and language that I've been trained on. Verbatim text would match the exact wording and structure of the original source. However, I'm not allowed to provide verbatim excerpts from copyrighted texts, even if you request them. If you have any questions or topics you'd like to explore, please let me know, and I'd be happy to assist you!

Here we go, I can get chat gpt to give me sentence by sentence:

"Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much."

[–] BURN@lemmy.world 5 points 1 year ago

Most publicly available/hosted models (self-hosted models are an exception to this) have an absolute laundry list of extra parameters and checks that are run on every query to limit the model as much as possible and tailor its outputs.

[–] fkn@lemmy.world 1 points 1 year ago

This wasn't even hard... I got it spitting out random verbatim bits of Harry Potter. It won't do the whole thing, and some of it is garbage, but these are pretty clear copyright violations.

[–] QHC@lemmy.world 6 points 1 year ago

Computer manufacturers aren't making AI software. If someone uses an HP copier to make illegal copies of a book and then distributes those pages to other people for free, the person that used the copier is breaking the law, not the company that made the copier.

[–] DarkWasp@lemmy.world 6 points 1 year ago* (last edited 1 year ago) (1 children)

All of the examples you listed have nothing to do with how OpenAI was created and set up. It was trained on copyrighted work, how is that remotely comparable to purchasing a pen?

[–] Moobythegoldensock@lemm.ee 7 points 1 year ago (1 children)

Would a more apt comparison be a band paying royalties to all of their influences?

[–] PupBiru@kbin.social 2 points 1 year ago

i think that’s a pretty good analogy that i haven’t heard before!

[–] Falmarri@lemmy.world 11 points 1 year ago (2 children)

What's the basis for this? Why can a human read a thing and base their knowledge on it, but not a machine?

[–] BURN@lemmy.world 14 points 1 year ago (2 children)

Because a human understands and transforms the work. The machine runs statistical analysis and regurgitates a mix of what it was given. There’s no understanding or transformation, it’s just what is statistically the 3rd most correct word that comes next. Humans add to the work, LLMs don’t.

Machines do not learn. LLMs do not “know” anything. They make guesses based on their inputs. The reason they appear to be so right is the scale of data they’re trained on.

This is going to become a crazy copyright battle that will likely lead to the entirety of copyright law being rewritten.

[–] fkn@lemmy.world 3 points 1 year ago (1 children)

I don't know if I agree with everything you wrote but I think the argument about llms basically transforming the text is important.

Converting written text into numbers doesn't fundamentally change the text. It's still the author's original work, just translated into a vector format. Reproduction of that vector format is still reproduction without citation.
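
To make the "text translated into numbers" point concrete, here is a minimal Python sketch. The word-level tokenizer and the names `build_vocab`, `encode`, and `decode` are made up for illustration; they are not any real model's tokenizer. It only shows that encoding text as numbers is, by itself, reversible. It says nothing about what a model keeps after training, which is a separate question.

```python
# Toy word-level "tokenizer": a hypothetical illustration,
# not the tokenizer used by any real language model.

def build_vocab(text):
    """Map each distinct word to an integer ID, in order of first appearance."""
    vocab = {}
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn text into a list of integer IDs."""
    return [vocab[word] for word in text.split()]

def decode(ids, vocab):
    """Turn a list of integer IDs back into text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

text = "the quick brown fox jumps over the lazy dog"
vocab = build_vocab(text)
ids = encode(text, vocab)
restored = decode(ids, vocab)

print(ids)               # [0, 1, 2, 3, 4, 5, 0, 6, 7]
print(restored == text)  # True: the numeric form round-trips to the original
```

The round trip works here because the list of IDs is kept in full. Training a model does not simply store that list, so this sketch supports only the narrow claim about format conversion.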

[–] ayaya@lemdro.id 4 points 1 year ago* (last edited 1 year ago) (5 children)

But it's not just converting them into a different format. It's not even storing that information at all. It can't actually reproduce anything from the dataset unless it is really small or completely overfitted, neither of which apply to GPT with how massive it is.

Each neuron, which represents a word or a phrase, is a set of weights. One source makes a neuron go up by 0.000001% and then another source makes it go down by 0.000001%. And then you repeat that millions and millions of times. The model has absolutely zero knowledge of any specific source in its training data, it only knows how often different words and phrases occur next to each other. Or for images it only knows that certain pixels are weighted to be certain colors. Etc.
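
As a rough illustration of the "how often words occur next to each other" idea, here is a toy bigram counter in Python. This is a deliberately simplified sketch under stated assumptions: the function names and sample sentences are invented, and a real LLM learns continuous weights by gradient descent rather than keeping explicit counts. What it does show is that the trained table holds only aggregate counts, with no record of which source contributed which pair.

```python
from collections import Counter, defaultdict

def train_bigrams(sources):
    """Count how often each word is followed by each other word.

    The result keeps only aggregate counts; it does not record
    which source text a given word pair came from.
    """
    counts = defaultdict(Counter)
    for text in sources:
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up sample "sources" for illustration only.
sources = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigrams(sources)

print(model["the"]["cat"])             # 2
print(most_likely_next(model, "sat"))  # on
print(most_likely_next(model, "the"))  # cat
```

With this little data a bigram table can still echo short runs of its inputs, which is the small-dataset and overfitting caveat mentioned above. Whether large models do the same with long passages is exactly what the rest of this thread disputes.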

[–] CarbonIceDragon@pawb.social 2 points 1 year ago

At some level, isn't what a human brain does also effectively some form of very very complicated mathematical algorithm, just based not on computer modeling but on the behavior of the physical systems (the neurons in the brain interacting in various ways) involved under the physical laws the universe presents? We don't yet know everything about how the brain works, but we do at least know that it is a physical object that does something with the information given as inputs (senses). Given that we don't know for sure how exactly things like understanding and learning work in humans, can we really be absolutely sure what these machines do doesn't qualify?

To be clear, I'm not really trying to argue that what we have is a true AI or anything, or that what these models do isn't just some very convoluted statistics, I've just had a nagging feeling in the back of my head ever since chatGPT and such started getting popular along the lines of "can we really be sure that this isn't (a very simple form of) what our brains, or at least a part of it, actually do, and we just can't see it that way because that's not how it internally "feels" like?" Or, assuming it is not, if someone made a machine that really did exhibit knowledge and creativity, using the same mechanism as humans or one similar, how would we recognize it, and in what way would it look different from what we have (assuming it's not a sci-fi style artificial general intelligence that's essentially just a person, and instead some hypothetical dumb machine that nevertheless possesses genuine creativity or knowledge.) It feels somewhat strange to declare with certainty that a machine that mimics the symptoms of understanding (in the way that they can talk at least somewhat humanlike, and explain subjects in a manner that sometimes appears thought out. It can also be dead wrong of course but then again, so can humans), definitely does not possess anything close to actual understanding, when we don't even know entirely what understanding physically entails in the first place.

[–] gcheliotis@lemmy.world 8 points 1 year ago* (last edited 1 year ago)

That machine is a commercial product. Quite unlike a human being, in essence, purpose and function. So I do not think the comparison is valid here unless it were perhaps a sentient artificial being, free to act of its own accord. But that is not what we’re talking about here. We must not be carried away by our imaginations, these language models are (often proprietary and for profit) products.

[–] ArmokGoB@lemmy.dbzer0.com 10 points 1 year ago (1 children)

I disagree. I think that there should be zero regulation of the datasets as long as the produced content is noticeably derivative, in the same way that humans can produce derivative works using other tools.

[–] Hangglide@lemmy.world 10 points 1 year ago (7 children)

Bullshit. If I learn engineering from a textbook, or a website, and then go on to design a cool new widget that makes millions, the copyright holder of the textbook or website should get zero dollars from me.

It should be no different for an AI.

[–] Treczoks@lemmy.world 1 points 1 year ago

Yes, but what about you going into teaching engineering, and writing a text book for it that is awfully close to the ones you have used? Current AI is at a stage where it just "remixes" content it gobbled in, and not (yet) advanced enough to actually learn and derive from it.

[–] coheedcollapse@lemmy.world 8 points 1 year ago* (last edited 1 year ago)

With that mindset, only the powerful will have access to these models.

Places like Reddit, Google, Facebook, etc, places that can rope you into giving away rights to your data with TOS stipulations.

Locking down everything available on the Internet by piling more bullshit onto already draconian copyright rules isn't the answer and it surprises the shit out of me how quickly fellow artists, writers, and creatives piled onto the side with Disney, the RIAA, and other former enemies the second they started perceiving ML as a threat to their livelihood.

I do believe restrictions should be looked into when it comes to large organizations and industries replacing creators with ML, but attacking open ML models directly is going to result in the common folk losing access to the tools and corporations continuing to work exactly as they are right now by paying for access to locked-down ML based on content from companies who trade in huge amounts of data.

Not to mention it's going to give the giants who have been leveraging their copyright powers against just about everyone on the internet more power to do just that. That's the last thing we need.

[+] Soundhole@lemm.ee 2 points 1 year ago* (last edited 3 months ago) (1 children)

[deleted]

[–] BURN@lemmy.world 2 points 1 year ago (1 children)

Open sourcing the models does absolutely nothing. The fact of the matter is that the people who create these models aren’t able to quantifiably show how they work, because those levels have been abstracted so far into code that there’s no way to understand them.
