this post was submitted on 26 Jul 2023
The argument regarding the specific case of AI-generated images of real actors makes sense, but the headline overgeneralizes hugely.
If you write a book about carpentry, and someone checks that book out from the library, reads it, learns how to do carpentry from it, and goes into the carpentry business, they do not owe you a share of their profits.
It's nice if they give you credit. But they do not owe you a revenue stream.
If they are a robot, the same remains true.
Corollary: if a corporation scrapes the talk of the whole internet, which itself was shaped by the aggregate culture and knowledge of ten thousand years of human history, and their resultant product is an AI that can replace workers, it is morally valid to eminent domain that shit and divert its profits to a fledgling UBI program.
Edit to add: Not a statement about how UBI should really work, just a throwaway comment about seizing means.
UBI should be a government initiative, and funding for it should be collected in the form of tax, irrespective of AI. More and more humans are being replaced by automation and technology in general, and a lot of this is being done so gradually that you don't notice it or think of it as a problem. Every time you saw headlines like "xx corporation has laid off hundreds/thousands of employees" in the past, it had very little to do with AI, but could have had to do with technology and progress in general, plus a lot of other factors. Every little new development could have a butterfly effect that's hard to calculate.
Neither AI, nor the loss of jobs in general, should be a factor for UBI funding. AI is just another new technological development, maybe even a disruptive one, but it's nothing so new that we need to pick up our pitchforks against.
As for compensating creative owners, that's a bigger discussion on IP protection and ownership in general, and the responsibility falls upon the IP owners (and maybe appropriate laws). For instance, we've seen news sites, science publishers etc. paywall their work, and that's because they want to protect their work and get compensation for viewership, and this has nothing to do with AI. If people want compensation for their work, then they should take appropriate measures to protect their work, and/or come up with alternate revenue streams if it's impossible to paywall their work (for instance, how some youtubers choose to seek sponsorship or patreon donations). If people want to prevent their work from being stolen and redistributed, appropriate action should be taken against the persons/sites stealing their work (e.g. via DMCA etc.). It's not the AI's fault for eating up copyrighted content on public sites like pastebin.com or Scribd, it's the fault of the people uploading it.
UBI should not be dependent on its specific sources and specific destinations. It's universal, it's right in the name. It should be funded by a tax on the wealthy - regardless of how that wealth is obtained - and be issued to everyone.
The goal is not to "level the playing field" so that human employees can continue to labor and companies can't afford to hire robots to replace them. The goal is to make it so that if companies replace all their employees with robots those employees don't have to find some other job to continue living.
AI is not a person. That’s why its works aren’t eligible for copyright. You’re arguing that AI should have the same rights as a person in this regard and that’s not an established right, nor should it be.
Also the analogy makes zero sense. It’s more accurate to say someone checks out a book about carpentry, reads it, then writes another book on carpentry by moving the words around a bit despite knowing nothing about carpentry.
More accurately, it's like someone who knows nothing about German, writing, or carpentry, but learns German and carpentry by reading hundreds of thousands of books and then decides to write a book about carpentry in German.
The AI still doesn’t learn carpentry. It just knows how books about carpentry generally read.
I’m not sure that’s a fair comparison. You wouldn’t instantly ingest that information and know it. It’s more like photocopying a book and including it in another book that you sell. It’s a paradigm shift, and I’m not sure what the answer is.
It's nothing like photocopying a book. It is very, very similar to the analogy given above, of someone learning the information and profiting from it. For the AI model to "learn" the information during training, it takes apart the information one piece of a word at a time, and reorganises it for quick access. Information is categorised by metadata like topic, source, date, etc; there are approximately 1536 "tags", so to speak, which OpenAI's ChatGPT uses for categorising what it learns.
Copyright of words has the order of those words as an integral part of the legal standard, and the standards for what infringes are actually pretty strict (https://fairuse.stanford.edu/2003/09/09/copyright_protection_for_short/). Training an AI is definitively transformative work which does not retain the order of the words in the finished product, merely a weighted likelihood of what word fragment will come next in a given context, so it's protected under Fair Use.
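To make the "weighted likelihood of what word fragment will come next" idea a bit more concrete, here's a toy sketch. Big assumptions: it's a plain bigram counter over whole words, not a neural network and not real sub-word tokens, and `train_bigram` / `next_token_probs` are names I made up for illustration. It shows only the general shape of "given this context, how likely is each next token", not how ChatGPT or any real model is built.

```python
from collections import Counter, defaultdict


def train_bigram(texts):
    """Count how often each token follows each other token (toy model)."""
    counts = defaultdict(Counter)
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace words, not sub-word pieces
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the raw follow-counts for `prev` into probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in followers.items()}


# Hypothetical two-sentence "training set", invented for this example.
corpus = [
    "measure twice cut once",
    "measure the board then cut the board",
]
model = train_bigram(corpus)
print(next_token_probs(model, "measure"))
```

With a corpus this tiny the counts obviously stay very close to the original sentences, so don't read anything into it about how much a large model does or doesn't retain; that depends on scale and training details the sketch doesn't capture.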
I don’t think it’s that simple. Like I said it’s a paradigm shift. It doesn’t fit into existing laws well. My point is what we consider fair use now, summarizing a book or movie by a human, is based on the limited abilities of humans. When you have AI with limitless abilities, that will change things. The same rules and considerations may have to be rethought.
Au contraire, it is that simple and it is covered by existing law just fine in the very specific case we're talking about, which is whether training a model is "transformative work" by the definition in IP law. It is. The law looks very specifically at the facts of the case, not hand-waving masquerading as an argument.
You are making this technology out to be something it isn't; there's no mystery to how AI works, and it does not have "limitless abilities". In fact, it is very limited, but that isn't relevant. What the law considers "fair use" isn't based on human ability at all, it's based on how completely the work is reproduced and the context the original work is being used in. You clearly have access to the internet, you can verify the standards required to show breach of copyright yourself if you don't believe me.
A key difference is that AI models tend to contain actual pieces of the training data, and on occasion regurgitate it. Kind of like randomly reproducing parts of the book during the course of your career as a carpenter. That's the kind of thing that actually results in copyright lawsuits and damages when real people do it. AI shouldn't be getting a pass here.
Oh sure, if a copyright holder can demonstrate that a specific work is reproduced. Not just "I think your AI read my book and that's why it's so good at carpentry."
The thing is that they're all reproduced, at least in part. That's how these models work.
Reproducing a work is a specific thing. Using an idea from that work, or a transformation of that idea, is not reproducing that work.
Again: If a copyright holder can show that an AI system has reproduced the text (or images, etc.) of a specific work, they should absolutely have a copyright claim.
But "you read my book, therefore everything you do is a derivative work of my book" is an incorrect legal argument. And when it escalates to "... and therefore I should get to shut you down," it's a threat of censorship.
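If you wanted to check the "has it actually reproduced the text" question mechanically, one rough way is to look for the longest run of consecutive words that an output shares with the original. This is only a sketch under my own assumptions: `longest_shared_run` is a made-up helper, it compares lowercase whitespace-split words, and how long a shared run has to be before it matters is a legal judgment call that no script can make for you.

```python
def longest_shared_run(source: str, output: str) -> int:
    """Length, in words, of the longest consecutive word sequence found in both texts."""
    a = source.lower().split()
    b = output.lower().split()
    best = 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous source word
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run that ended one word earlier
                best = max(best, cur[j])
        prev = cur
    return best


# Hypothetical texts, invented for this example.
original = "always measure twice before you cut the board once"
quoted = "a good rule is to measure twice before you cut anything"
paraphrase = "check your measurements a second time"

print(longest_shared_run(original, quoted))
print(longest_shared_run(original, paraphrase))
```

A long shared run is evidence of verbatim copying, while a paraphrase that only reuses the idea scores near zero, which is roughly the distinction being argued here.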
The problem is that the LLMs (and image AIs) effectively store pieces of works as correlations inside them, occasionally spitting some of them back out. You can't just say "it saw it" but can say "it's like a scrapbook with fragments of all these different works"
I've memorized some copyrighted works too.
If I perform them publicly, the copyright holder would have a case against me.
But the mere fact that I could recite those works doesn't make everything that I say into a copyright violation.
The copyright holder has to show that I've actually reproduced their work, not just that I've memorized it inside my brain.
The difference is that your brain isn't a piece of media which gets copied. The AI is. So when it memorizes, it commits a copyright violation.
If that reasoning held, then every web browser, search engine bot, etc. would be violating copyright every time it accessed a web page, because doing so involves making a copy in memory.
Making an internal copy isn't the same as publishing, performing, etc. a work.
There's an implied license to use content for the purpose of displaying it for web content. Copies for other purposes...not so much. There have been a whole series of lawsuits over the years over just how much you can copy for what purpose.
There isn't an "implied license". Rather, copyright is simply not infringed until the work is actually republished, performed, etc. without the copyright holder's permission.
Making internal in-memory copies — e.g. for search-engine indexing — is simply not an infringement to begin with; just as it's not an infringement for me to memorize a copyrighted work, but it would be an infringement if I were to recite it in a public performance without permission.
Copyright simply does not grant the copyright-holder absolute & total control of everything downstream from the work. It restricts republishing, performing, etc.; it does not restrict memorization, indexing, summarizing in a review, answering questions about the work, etc.
Again: if the AI system is made to regurgitate the actual text of the work, that's still a copyright infringement. But merely having learned from it is not.
This is different from those, and not at all tested in the courts. There are likely to be a whole bunch of lawsuits and several years before this is settled.
There is no possible basis in law for copyright infringement.
Copyright infringement isn't "you can do these things with copyrighted materials and everything else is banned". It's "these specific things (redistributing substantial portions of published works) are disallowed, unless you meet exceptions, and anything not explicitly disallowed is legal".
You are unconditionally allowed to learn from copyrighted works. There is no legal basis for preventing it. There is no possible basis in copyright law preventing it. It would take new legislation restricting doing so, and it would be impossible to apply to any training that happened before this new crime against humanity of a law was written.
No, it doesn't. Learning from copyrighted material is black and white fair use.
The fact that the AI isn't intelligent doesn't matter. It's protected.
A person reading and internalizing concepts is considerably different than an algo slurping in every recorded work of fiction and occasionally shitting out a bit of mostly Shakespeare. One of these has agency and personhood, the other is a tool.
No, that's not how these models work. You're repeating the old saw about these being "collage machines", which is a gross mischaracterization.
That article doesn't show what you think it shows. There was a lot of discussion of it when it first came out and the examples of overfitting they managed to dig up were extreme edge cases of edge cases that took them a huge amount of effort to find. So that people don't have to follow a Reddit link, from the top comment:
Overfitting is an error state. Nobody wants to overfit on any of the input data, and so the input data is sanitized as much as possible to remove duplicates to prevent it. They had to do this research on an early Stable Diffusion model that was already obsolete when they did the work because modern Stable Diffusion models have been refined enough to avoid that problem.
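For anyone wondering what "remove duplicates" looks like in the simplest possible form, here's a minimal sketch. To be clear about the assumptions: this only catches byte-for-byte identical items by hashing them, `dedupe` is a name I made up, and I'm not claiming this is what Stable Diffusion's data pipeline does. Real pipelines would presumably need near-duplicate detection too (resized or re-encoded copies), which this does not attempt.

```python
import hashlib


def dedupe(items):
    """Keep the first occurrence of each byte-identical item, preserving order."""
    seen = set()
    unique = []
    for item in items:
        digest = hashlib.sha256(item).hexdigest()  # fingerprint of the raw bytes
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique


# Hypothetical stand-ins for image files, invented for this example.
dataset = [b"image-A", b"image-B", b"image-A", b"image-C", b"image-B"]
cleaned = dedupe(dataset)
print(len(dataset), "->", len(cleaned))
```

The point of the step is that an item seen many times during training is the one most likely to be memorized, so dropping the repeats removes the easiest route to overfitting.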
If I was to read a carpentry book and then publish my own, "regurgitating" most of the original text, then I plagiarized and should be sued. Furthermore, if I was to write a song and use the same melody as another copyrighted song I'd get sued and lose, even if I could somehow prove that I never heard the original.
I think the same rules should apply to AI generated content. One rule I would like to see, and I don't know if this has precedent, is that AI generated content cannot be copyrighted. Otherwise AI could truly replace humans from a creative perspective and it would be a race to generate as much content as possible.
Analogies to humans are not relevant, and yours is a bad one anyway. LLMs don’t read a carpentry book and then go build houses. They chew up carpentry books and spit out carpentry books.
Your final line remains to be established in court.
AI isn't learning how to do carpentry though. It's simply including my work in an aggregate pool that it now claims as its own.
It is not. The AI's model does not contain a copy of your work, there is no "aggregate pool." AI is not some sort of magical compression algorithm that's able to somehow crush whole images down to less than a byte of data. The only thing that it's "including" in itself are the concepts that it learned from your work. Those are ideas, which are not copyrightable.
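The size argument above is just division, so here it is spelled out. The numbers below are hypothetical round figures I picked for illustration, not measurements of any real model or dataset, and `bytes_per_item` is a made-up helper. Plug in the real file size and training-set size for whichever model you care about.

```python
def bytes_per_item(model_size_bytes: int, num_training_items: int) -> float:
    """Average model storage available per training item, if it were split evenly."""
    return model_size_bytes / num_training_items


# Hypothetical figures: a 2 GB model file and 5 billion training images.
model_size = 2 * 10**9
training_images = 5 * 10**9

print(bytes_per_item(model_size, training_images))
```

An even split is of course a simplification: a model doesn't have to spread its capacity evenly, so an average like this says nothing about whether a handful of heavily repeated items got memorized.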
The difference is that when the robot reads that book, it maintains a verbatim copy of that book as part of its training material indefinitely and can reference and re-reference that material infinitely. That is not how it works when a human reads a book.
The 'copy' that the AI retains indefinitely is a verbatim copy of the original work, and the entire point of "copyright" is to control how and where copies are used.
Yes, there are 'fair use' exceptions to copyright. I don't think you realize it, but your argument is less about whether this violates copyright (it absolutely does under the textbook definition) and more about whether there should be a fair-use exemption for AIs; you seem to think yes, I would disagree.
I'd also argue the AI example qualifies as a 'derivative work' based on the original, which STILL would require honoring copyright laws and compensating the creators of the original works. Basically, before reading the book it was just "AI". After reading the book it has become "AI + book1", a derivative work, and on and on and on.
However, that is how it works when a human memorizes a copyrighted work. If I memorize a poem, I may then reference it from my memory without further need for the original text before me. If I am an actor and learn my lines for a play, I commit them to my memory.
Which is not an infringement.
The infringement happens if the human performs or publishes that work; e.g. reciting that copyrighted poem or play from memory before an audience; writing that work down from memory and publishing it; etc., without a copyright license for that performance or republication.
I suggest merely applying the same standard: infringement doesn't happen when a work is read, indexed, scanned, etc.; it does happen if that work is then recited.
For instance, ChatGPT currently knows the text of the Harry Potter novels, but it does not recite them when asked to do so. (Try it! It will answer questions about the text, but it will freeze up if asked to recite it; evidently because it has a filter against reciting copyrighted material.)
No, the reason ChatGPT can't recite the text of Harry Potter verbatim is because it doesn't actually "contain" it. It learned from it, but it doesn't "remember" it word-for-word. There is no filter against reciting copyrighted material. Try asking it to recite a scene from a Shakespearean play, for example - that's out of copyright and ChatGPT was almost certainly trained on it.
I've actually experimented with this myself on my local machine, I took one of the smaller open-source models and I gave it additional training using 20 megabytes of My Little Pony fanfiction. The AI knew a lot about the fanfic afterward but it was clearly just picking up tidbits of general knowledge rather than "remembering" the whole thing.
I tried that several weeks ago while discussing some details of the Harry Potter world with ChatGPT, and it was able to directly quote several passages to me to support its points (we were talking about house elf magic and I asked it to quote a paragraph). I checked against a dead-tree copy of the book and it had exactly reproduced the paragraph as published.
This may have changed with their updates since then, and it may not be able to quote passages reliably, but it is (or was) able to do so on a couple of occasions.
The example you chose absolutely is infringement.
Performing a work, whether a play script or musical score, is prohibited without receiving permission from the copyright holder, and in most cases paying a licensing fee and/or royalties.
source: https://www.legalzoom.com/articles/copyright-laws-and-school-performances
That's not how these AIs work. They don't contain verbatim copies of their training data. They get trained on terabytes of text, they couldn't possibly remember it all.