this post was submitted on 31 Aug 2023

419 points (97.5% liked)

A.I.’s un-learning problem: Researchers say it’s virtually impossible to make an A.I. model ‘forget’ the things it learns from private user data (finance.yahoo.com)

submitted 1 year ago by assassin_aragorn@lemmy.world to c/technology@lemmy.world

110 comments

I'm rather curious to see how the EU's privacy laws are going to handle this.

(Original article is from Fortune, but Yahoo Finance doesn't have a paywall)

[–] Primarily0617@kbin.social 132 points 1 year ago* (last edited 1 year ago) (21 children)

it's crazy that "it's too hard :(" has become an acceptable justification for just ignoring the law within tech circles

[–] BrianTheeBiscuiteer@lemmy.world 58 points 1 year ago (6 children)

I'm not an AI expert, and I wouldn't say it is too hard, but I believe removing a specific piece of data from a model is like trying to remove excess salt from a stew. You can add things to make the stew less salty but you can't really remove the salt.

The alternative, which is a lot of effort but boo-hoo for big tech, is to throw out the model and start over without the data in question. These companies would do well to start with models built on public or royalty free data and then add more risky data on top of that (so you only have to rebake starting from the "public" version).

[–] Primarily0617@kbin.social 29 points 1 year ago

sounds like big tech shouldn't have spent the last decade investing in a kitchen refit so that they could make stew really well but nothing else

[–] GoosLife@lemmy.world 23 points 1 year ago* (last edited 1 year ago) (4 children)

If there's something illegal in your dish, you throw it out. It's not a question. I don't care that you spent a lot of time and money on it. "I spent a lot of time preparing the circumstances leading to this crime" is not an excuse, neither is "if I have to face consequences for committing this crime, I might lose money".

[–] Robaque@feddit.it 4 points 1 year ago

Perhaps long pig stew could serve as an apt comparison, lol

[–] lightnsfw@reddthat.com 5 points 1 year ago

It will probably be way shittier without all the private data they put in the first time too.

[–] Grandwolf319@sh.itjust.works 3 points 1 year ago

Replace salt with poison or an allergenic substance and it fully holds. If a batch has been contaminated, then yes, you should try again.

But now that the cat is out of the bag, other companies are less willing to let something be scrapped because of how valuable it can be.

I think big tech knew this: that they could only build these models on unfiltered data before the AI craze.

[–] Zeth0s@lemmy.world 14 points 1 year ago* (last edited 1 year ago)

It's actually a pretty normal thing in law. Laws are created with common sense in mind and compromises.

Currently EU laws do not cover generative AI. Now the EU needs to decide how to deal with it: whether to consider it a "lossy compressed database" and try to enforce a variation of GDPR with added fuzziness, or do something else.

[–] reverendsteveii@lemm.ee 3 points 1 year ago

I just saw an article that said that ISPs are trying to whine their way out of listing the fees they charge because it's too hard. Which is wild because they certainly know what I owe them after I sign the contract, but somehow it's just impossible for them to determine right up until the moment that I'm obligated to pay it.

[–] thefluffiest@feddit.nl 33 points 1 year ago

rm -rf *

There, that’ll do it

[–] CookieJarObserver@sh.itjust.works 23 points 1 year ago

Just kill it off and start from the beginning.

[–] efrique@lemm.ee 19 points 1 year ago* (last edited 1 year ago) (1 children)

Then delete and start over, or don't use data you don't have explicit permission to use in the first place.

It's like a thief saying "well, I already fenced most of the stuff so it's too hard to give any of it back. So let's just call it quits, eh?"

[–] GyozaPower@discuss.tchncs.de 6 points 1 year ago

It's not just about having permission or not, but the right to be forgotten. You can ask a company to delete the personal data they may have on you and by law they should (in theory) delete it, with the only exception being data that may be required for justified purposes.

AIs not being able to "forget" means that they would be breaking the law if trained with personal data, as you could not have your data removed if you ask them to do so.

[–] Treczoks@lemmy.world 15 points 1 year ago (1 children)

Delete the AI and restart the training from the original sources minus the information it should not have learned in the first place.

And if they claim "this is more complicated than that" you know their process is f-ed up.

[–] gressen@lemm.ee 5 points 1 year ago (1 children)

You're right, this is a way to solve this issue. It's just not economically feasible to retrain your model from scratch every time. It takes a lot of money to do it and they will push back.

[–] ram@lemmy.ca 4 points 1 year ago (2 children)

Then AI cannot exist in a world where security still matters.

[–] Aopen@discuss.tchncs.de 13 points 1 year ago (1 children)

In June, Google announced a competition for researchers to come up with solutions to A.I.’s inability to forget

Free labor? Hope researchers won't fall for this

[–] mrsgreenpotato@discuss.tchncs.de 2 points 1 year ago

Seems like exactly that

https://blog.research.google/2023/06/announcing-first-machine-unlearning.html?m=1

[–] alternative_factor@kbin.social 10 points 1 year ago (3 children)

For the AI heads here: is this another problem caused by the "black box" style of LLM creation where they don't really know how it actually works, so they don't really know how to take out the data?

[–] hardware26@discuss.tchncs.de 3 points 1 year ago (1 children)

The model does not keep track of where it learned something. Even if it did, it couldn't separate out what it learned and discard it. AI learning resembles improving your motor skills more than filling in an Excel sheet. You can discard any row from an Excel sheet. Can you forget, or even separate/distinguish/filter, the motor skills you learned during 4th grade art classes?
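A minimal sketch of the spreadsheet-versus-skill contrast, assuming a made-up three-row data set and a one-weight toy model (all names and numbers here are hypothetical, and a real LLM has billions of weights rather than one):

```python
# Hypothetical toy data: (name, feature x, target y). Not from any real data set.
rows = [("alice", 1.0, 2.0), ("bob", 2.0, 4.1), ("carol", 3.0, 5.9)]

# Spreadsheet-style storage: forgetting "bob" is a single row deletion.
table = [r for r in rows if r[0] != "bob"]

# Model-style storage: one weight w, nudged a little by every example.
def train(examples, lr=0.05, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for _name, x, y in examples:
            w -= lr * 2 * (w * x - y) * x  # gradient step on squared error
    return w

w = train(rows)
print(table)  # bob's row is gone, the other rows are untouched
print(w)      # a single blended number; no part of it is labelled "bob"
```

The table can drop one person without touching anyone else, while the weight is a blend of every update it ever received.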

[–] eltimablo@kbin.social 2 points 1 year ago

Think of it like this: you need a bunch of data points to determine the average of them all, but if you're only given the average of a group of numbers, you can't then go back and determine the original data points. It just doesn't work like that.
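As a rough illustration of the averaging analogy (the two lists below are invented for the example), different data sets can collapse to the same summary, so the summary alone cannot tell you what went in:

```python
from statistics import mean

# Two different hypothetical data sets that share the same average.
a = [1, 5, 9]
b = [4, 5, 6]

same_average = mean(a) == mean(b)  # both lists average to 5

# Dropping one value is easy only while you still hold the original data.
def mean_without(data, value):
    remaining = list(data)
    remaining.remove(value)  # removes the first matching element
    return mean(remaining)

print(same_average)
print(mean_without(a, 9))  # average of the remaining values, 1 and 5
```

A trained model is a far more complicated summary than a mean, but the same one-way property is what the analogy is pointing at.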

[–] reverendsteveii@lemm.ee 8 points 1 year ago

Got me a hammer with "AI Alzheimer's" written on the handle...

[–] Veraticus@lib.lgbt 7 points 1 year ago (4 children)

Because it doesn’t “know” those things in the same way people know things.

[–] hansl@lemmy.ml 14 points 1 year ago (19 children)

It’s closer to how you (as a person) know things than, say, how a database knows things.

I still remember my childhood home phone number. You could ask me to forget it a million times I wouldn’t be able to. It’s useless information today. I just can’t stop remembering it.

[–] SatanicNotMessianic@lemmy.ml 6 points 1 year ago (13 children)

It’s actually because they do know things in a way that’s analogous to how people know things.

Let’s say you wanted to forget that cats exist. You’d have to forget every cat meme you’ve ever seen, of course, but your entire knowledge of memes would also have to change. You’d have to forget that you knew how a huge part of the trend started with “i can haz cheeseburger.”

You’d have to forget that you owned a cat, which will change your entire memory of your life history about adopting the cat, getting home in time to feed it, and how it interacted with your other animals or family. Almost every aspect of your life is affected when you own an animal, and all of those would have to somehow be remembered in a no-cat context. Depending on how broadly we define “cat,” you might even need to radically change your understanding of African ecosystems, the history of sailing, evolutionary biology, and so on. Your understanding of mice and rats would have to change. Your understanding of dogs would have to change. Your memory of cartoons would have to change - can you even remember Jerry without Tom? Those are just off the top of my head at 8 in the morning. The ramifications would be huge.

Concepts are all interconnected, and that’s how this class of AI works. I’ve owned cats most of my life, so it’s a huge part of my personal memory and self-definition. They’re also ubiquitous in culture. Hundreds of thousands to millions of concepts relate to cats in some way, and each one of them would need to change, as would each concept that relates to those concepts. Pretty much everything is connected to everything else and as new data are added, they’re added in such a way that they relate to virtually everything that’s already there. Removing cats might not seem to change your knowledge of quarks, but there’s some very very small linkage between the two.

Smaller impact memories are also difficult. That guy with the weird mustache you saw during your vacation to Madrid ten years ago probably doesn’t have that much of a cascading effect, but because Esteban (you never knew his name) has such a tiny impact, it’s also very difficult to detect and remove. His removal won’t affect much of anything in terms of your memory or recall, but if you’re suddenly legally obligated to demonstrate you’ve successfully removed him from your memory, it will be tough.

Basically, the laws were written at a time when people were records in a database and each had their own row. Forgetting a person just meant deleting that row. That’s not the case with these systems.

The thing is that we don’t compel researchers to re-train their models on a data set if someone requests their removal. If you have traditional research on obesity, for instance, and you have a regression model that’s looking at various contributing factors, you do not have to start all over again if someone requests their data be deleted. It should mean that the person’s data are removed from your data set, but it doesn’t mean that you can’t continue to use that model - at least it never has, to my knowledge. Your right to be forgotten doesn’t translate to you being allowed to invalidate the scientific models generated that glom together your data with that of tens of thousands of others. You can be left out of the next round of research on that dataset, but I have never heard of people being legally compelled to regenerate a model based on that.

There are absolutely novel legal questions that are going to be involved here, but I just wanted to clarify that it’s really not a simple answer from any perspective.

[–] norawibb@sh.itjust.works 5 points 1 year ago

"virtually" impossible. hehehe

[–] mtchristo@lemm.ee 5 points 1 year ago* (last edited 1 year ago)

Start from Scratch B**tch!

[–] SomethingBurger@jlai.lu 3 points 1 year ago (3 children)

Can't they remove the data from the training set and start over?

[–] knotthatone@lemmy.one 5 points 1 year ago (1 children)

Not really, no. None of the source material is actually stored inside the model's dataset, so once it's in, it's in. Because of the way they are designed, you can't point to a particular document and just delete that one thing. It's like unscrambling an egg.

[–] snooggums@kbin.social 4 points 1 year ago (2 children)

They can remove ALL the data and start over.

[–] mo_ztt@lemmy.world 3 points 1 year ago (1 children)

Yes, but that's not easy... I can't remember exactly, but I think I saw an estimate that the compute time to train just one of the GPT models cost around $66 million. IDK whether that's total cost from scratch, or incremental cost to arrive at that model starting from an earlier model that was already built, but I do know that GPT is still to this day using that September 2021 cutoff which to me kind of implies that they're building progressively on top of already-assembled models and datasets (which makes sense, because to start from scratch without needing to would be insane).

You could, technically, start from scratch and spend 2 more years and however many million dollars retraining a new model that doesn't have the private data you're trying to excise, but I think the point the article is making is that that's a pretty difficult approach and it seems right now like that's the only way.

[–] Viking_Hippie@lemmy.world 3 points 1 year ago

The Danish government, which has historically been very good about both privacy rights and workers' rights, has recently suggested that they are looking into fixing the nursing shortage "via AI".

Our current government is probably the stupidest, most irresponsible and least humanitarian one we've had in my 40 year lifetime if not longer 🤬

[–] over_clox@lemmy.world 2 points 1 year ago

Have you tried..

format Earth

[–] Pichu0102@kbin.social 2 points 1 year ago (2 children)

I feel like one way to do this would be to break up models and their training data into mini-models and mini-batches of training data instead of one big model, and also to restrict training data to material used with permission and public domain sources. In all other cases, where a company is required to take down information in a model because its permission to use it was revoked or expired, it can identify the relevant training data in the mini-batches, remove it, then retrain the corresponding mini-model more quickly and efficiently than retraining the entire massive model.

A major problem with this though would be figuring out how to efficiently query multiple mini models and come up with a single response. I'm not sure how you could do that, at least very well...
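A toy sketch of the mini-model idea, assuming each mini-model is trained independently on its own shard and that answers are combined by simple averaging. `ShardedModel`, `fit_slope`, and the records are all hypothetical, and whether anything like this scales to large language models is exactly the open question:

```python
from statistics import mean

def fit_slope(points):
    """Least-squares slope through the origin; None if there is no usable data."""
    denom = sum(x * x for x, _y in points)
    return sum(x * y for x, y in points) / denom if denom else None

class ShardedModel:
    def __init__(self, records, num_shards):
        # records: list of (owner, x, y); each record lands in exactly one shard.
        self.shards = [records[i::num_shards] for i in range(num_shards)]
        self.models = [self._train(s) for s in self.shards]
        self.retrained = 0  # how many mini-models have been rebuilt so far

    def _train(self, shard):
        return fit_slope([(x, y) for _owner, x, y in shard])

    def forget(self, owner):
        # Rebuild only the shards that actually contained this owner's data.
        for i, shard in enumerate(self.shards):
            kept = [r for r in shard if r[0] != owner]
            if len(kept) != len(shard):
                self.shards[i] = kept
                self.models[i] = self._train(kept)
                self.retrained += 1

    def predict(self, x):
        # Naive aggregation: average the predictions of the non-empty mini-models.
        return mean([m * x for m in self.models if m is not None])

records = [("ann", 1.0, 2.0), ("bob", 2.0, 4.0), ("cat", 3.0, 6.3),
           ("dan", 4.0, 8.0), ("eve", 5.0, 9.5), ("fay", 6.0, 12.0)]
model = ShardedModel(records, num_shards=3)
model.forget("eve")
print(model.retrained)      # only one of the three mini-models was rebuilt
print(model.predict(10.0))
```

The averaging in `predict` is the weak spot: it is the simplest possible answer to the "how do you query several mini-models" problem, not a good one.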

[–] Strawberry@lemmy.blahaj.zone 2 points 1 year ago

You could certainly break up training data, but breaking up the models into mini models based on which training data is used wouldn't work with neural networks trained using gradient descent. Basically, whatever the state of the model is, it depends on the totality of the training data it has been trained on (and the order), and it isn't possible to remove the effect of a specific training data point without retraining on all of the data that followed that data point (and even that assumes you were storing a snapshot of the model before every single training data point, which I doubt anyone does).

However, that's no excuse and it is of course possible to entirely retrain a network using a clean dataset and that is what these companies should do
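To make the snapshot-and-replay point concrete, here is a small sketch using plain stochastic gradient descent on a single weight. The data and learning rate are made up, and real training adds batching, shuffling, and optimizer state that this ignores:

```python
def sgd_step(w, x, y, lr=0.1):
    # One gradient step on squared error for the model y ~ w * x.
    return w - lr * 2 * (w * x - y) * x

def train(data, w=0.0, keep_snapshots=False):
    snapshots = []
    for x, y in data:
        if keep_snapshots:
            snapshots.append(w)  # weight *before* this point was seen
        w = sgd_step(w, x, y)
    return w, snapshots

data = [(1.0, 2.0), (2.0, 3.0), (1.5, 3.5), (0.5, 1.0)]  # hypothetical points
w_full, snaps = train(data, keep_snapshots=True)

# Forget point k: rewind to the snapshot taken before it, then replay the rest.
k = 1
w_replayed, _ = train(data[k + 1:], w=snaps[k])

# Reference: retrain from scratch on the data with point k left out.
w_clean, _ = train(data[:k] + data[k + 1:])

# Order matters too: the same points in reverse end at a different weight.
w_reversed, _ = train(list(reversed(data)))

print(w_full, w_replayed, w_clean, w_reversed)
```

Here `w_replayed` matches `w_clean`, but only because a snapshot existed for every step and every later point was replayed; without the snapshots, the only route to `w_clean` is the full retrain.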
