this post was submitted on 20 Aug 2023
182 points (100.0% liked)

Technology

[–] acastcandream@beehaw.org 56 points 1 year ago (3 children)

It’s a good first step, but we also need to address the way they gather the material to train these LLMs. That’s the core issue here, and it spans multiple industries. It’s stealing work in a way that functionally launders it, and then they want to claim it as original work while replacing the very people they’re pulling from. It’s a multi-variable issue when you really dig into the ethics.

[–] mcgravier@kbin.social 29 points 1 year ago (1 children)

It’s stealing work in a way that functionally launders it

Actually, in the age of basically permanent copyright, this brings at least some balance.

[–] acastcandream@beehaw.org 20 points 1 year ago (2 children)

You would have an argument if it weren’t the exact same corporations protecting themselves with copyright abuse that are mostly benefiting from this new system.

[–] Even_Adder@lemmy.dbzer0.com 15 points 1 year ago* (last edited 1 year ago) (1 children)

Don't ignore the plethora of FOSS models regular people can train and use. They want to trick you into thinking generative models are a game only for the big boys, while they form up to attempt regulatory capture to keep the small guy out. They know they're not the only game in town, and they're afraid we won't need them anymore.

[–] Jaded@lemmy.dbzer0.com 6 points 1 year ago

The corporations already have all the data; users literally gave it to them by uploading it. Open source only has scraped data. If you start regulating, you kill open source, but the big players will literally just shrug it off.

Traditional artists already lost. It sucks but now we get to find out if the winner is all of society or only just Adobe and Shutterstock.

[–] ninjan@lemmy.mildgrim.com 9 points 1 year ago (1 children)

Yes, absolutely. They want AI to count as people, so that copyright applies and so that they can claim the AI was merely inspired, just as a human artist is inspired by the art they're exposed to.

We need a licensing model under which AI may only be trained on content where the license explicitly permits it, and where no mention of it means it is disallowed.

[–] donuts@kbin.social 6 points 1 year ago (1 children)

We need a licensing model under which AI may only be trained on content where the license explicitly permits it, and where no mention of it means it is disallowed.

That is the default model behind copyright, which basically says that the only things people can use your copyrighted work for without a license are those which are determined to be "fair use".

I don't see any way in which today's AI ought to be considered fair use of other people's writings, artwork, etc.

[–] FaceDeer@kbin.social 3 points 1 year ago

The concepts contained within a copyrighted work are not themselves copyrighted. It's impossible to copyright an idea. Fair use doesn't even enter into it; you can read a copyrighted work, learn something from it, and later use that learning with no restrictions whatsoever.

[–] furrowsofar@beehaw.org 7 points 1 year ago* (last edited 1 year ago) (1 children)

I feel two ways about it. Absolutely it is recorded in a retrieval system and doing some sort of complicated lookup. So derivative.

On the other hand, the whole idea of copyright, or any other so-called IP except maybe trademarks and trade dress in the most limited way, is perverse and should not exist. Not that I have a better idea.

[–] FaceDeer@kbin.social 7 points 1 year ago (1 children)

Absolutely it is recorded in a retrieval system and doing some sort of complicated lookup.

It is not.

Stable Diffusion's model was trained using the LAION-5B dataset, which describes five billion images. I have the resulting AI model on my hard drive right now; I use it with a local AI image generator. It's about 5 GB in size. So unless StabilityAI has come up with a compression algorithm that's able to fit an entire image into a single byte, there is no way this process can be "doing some sort of complicated lookup" of the training data.
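
As a rough sketch of that arithmetic (taking the ~5 GB figure above, with approximate byte sizes):

```python
# Back-of-the-envelope check: how many bytes of model weights per training image?
training_images = 5_000_000_000      # images described by LAION-5B
model_size_bytes = 5 * 1024**3       # ~5 GB checkpoint, per the figure above

bytes_per_image = model_size_bytes / training_images
print(f"{bytes_per_image:.2f} bytes per training image")  # ~1.07 bytes
# Far too little to store the images themselves, so the weights can't be a lookup table of them.
```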

What's actually happening is that the model is being taught high-level concepts through repeatedly showing it examples of those concepts.

[–] furrowsofar@beehaw.org 4 points 1 year ago* (last edited 1 year ago) (2 children)

I would disagree. It is just a big table lookup of sorts with some complicated interpolation/extrapolation algorithm. Training is recording the data into the net. Anything that comes out is derivative of the data that went in.

[–] FaceDeer@kbin.social 5 points 1 year ago (1 children)

You think it's "recording" five billion images into five billion bytes of space? On what basis do you think that? There have been efforts by researchers to pull copies of the training data back out of neural nets like these, and only in the rarest of cases, where an image has been badly overfitted, have they been able to get something approximately like the original. The only example I know of offhand is this paper, which had a lot of problems and isn't applicable to modern image AIs where the training process does a much better job of avoiding overfitting.

[–] furrowsofar@beehaw.org 2 points 1 year ago* (last edited 1 year ago) (1 children)

Step back for a moment. You put the data in, say images. The output you got depended on putting in that data; it is derivative of it. It is that simple. It does not matter how you obscure it with mumbo jumbo, you used the images.

On the other hand, is that fair use without some license? That is a different question, one about current law and what the law should be. Maybe it should depend on the nature of the training, for example. Reproducing images from other images seems less fair; classifying images by type seems more fair. There is a lot of stuff to be worked out.

[–] FaceDeer@kbin.social 3 points 1 year ago

It is that simple.

No, it really isn't.

If you want to step back, let's step back. One of the earliest, simplest forms of "generative AI" is the Markov Chain algorithm. What you do with that is you take a large amount of training text and run it through a program to analyze it. What the program is looking for is the probability of specific words following other words.

So for example if it trained on the data "You must be the change you wish to see in the world", as it scanned through it would first go "ah, the word 'you' is 100% of the time followed by the word 'must'" and then once it got a little further in it would go "wait, now the word 'you' was followed by the word 'wish'. So 'you' is followed by 'must' 50% of the time and 'wish' 50% of the time."

As it keeps reading through training data, those probabilities are the only things that it retains. It doesn't store the training data, it just stores information about the training data. After churning through millions of pages of data it'll have a huge table of words and the associated probabilities of finding other specific words right after them.

This table does not in any meaningful sense "encode" the training data. There's nothing you can do to recover the training data from it. It has been so thoroughly ground up and distilled that nothing of the original training data remains. It's just a giant pile of word pairs and probabilities.
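
A toy sketch of that counting, assuming a plain dictionary of word-pair counts (real Markov chain implementations vary, but the idea is the same):

```python
from collections import defaultdict, Counter
import random

def train_markov(text):
    """Count, for each word, how often each possible next word follows it."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def generate(counts, start, length=10):
    """Walk the chain, choosing each next word in proportion to its observed frequency."""
    word, out = start, [start]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        out.append(word)
    return " ".join(out)

counts = train_markov("you must be the change you wish to see in the world")
print(counts["you"])            # Counter({'must': 1, 'wish': 1}) -- the 50/50 split above
print(generate(counts, "you"))  # e.g. "you wish to see in the world"
```

Notice that only the counts survive; the original sentence can't be read back out of them once the table comes from millions of pages instead of one line.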

It's similar to how these more advanced AIs train their neural networks. The network isn't "memorizing" pictures, it's learning concepts from them. If you train an image generator on a million images of cats you're teaching it what cat fur looks like under various lighting conditions, what shape cats generally have, what sorts of environments you usually see cats in, the sense of smug superiority and disdain that cats exude, and so forth. So when you tell the AI "generate a picture of a cat" it is able to come up with something that has a high degree of "catness" to it, but is not actually any specific image from its training set.

If that level of transformation is not enough for you and you still insist that the output must be considered a derivative work of the training data, well, you're going to take the legal system down an untenable rabbit hole. This sort of learning is what human artists do all the time. Everything is based on the patterns we learn from the examples we saw previously.

[–] mobyduck648@beehaw.org 3 points 1 year ago* (last edited 1 year ago)

'A big table lookup' isn't what's going on here; go and look up what backpropagation and gradient descent are if you want to know what's actually happening.
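
As a minimal sketch of the gradient descent part (a toy one-parameter fit, nothing like a real image model):

```python
# Toy gradient descent: fit a single weight w so that w * x approximates y.
# A sketch of the optimisation idea only -- real networks have billions of weights
# and use backpropagation to compute the gradients, but nothing here stores the data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]           # roughly y = 2x

w, learning_rate = 0.0, 0.01
for _ in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad       # nudge w downhill

print(round(w, 3))                  # ~2.0: the data shaped the parameter but isn't recorded in it
```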