this post was submitted on 22 Feb 2024

806 points (98.1% liked)


Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts (www.businessinsider.com)

submitted 1 year ago by throws_lemy@lemmy.nz to c/technology@lemmy.world



[–] Darkard@lemmy.world 60 points 1 year ago (3 children)

It's going to drive the AI into madness as it will be trained on bot posts written by itself in a never ending loop of more and more incomprehensible text.

It's going to be like putting a sentence into Google Translate, converting it through 5 different languages and then back into the first, and getting complete gibberish.

[–] echo64@lemmy.world 46 points 1 year ago (3 children)

AI actually has huge problems with this. If you feed AI-generated data into models, the new training falls apart extremely quickly. There does not appear to be any good solution for this; it's the equivalent of AI inbreeding.
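
To make the "inbreeding" loop concrete, here is a minimal toy sketch, not a claim about how any real LLM is trained. It assumes the "model" is just a Gaussian described by a mean and a spread, and that every generation is refit only on a handful of samples drawn from the previous generation. The function name, parameters, and defaults are all made up for illustration.

```python
import random
import statistics


def simulate_recursive_training(generations=200, samples_per_generation=5, seed=0):
    """Toy loop: each generation is fitted only to samples from the previous one.

    Returns the spread (standard deviation) of the model at every generation,
    starting with generation 0.
    """
    rng = random.Random(seed)
    mean, spread = 0.0, 1.0  # generation 0 stands in for the original "human" data
    spreads = [spread]
    for _ in range(generations):
        # Draw a small "training set" from the current model...
        data = [rng.gauss(mean, spread) for _ in range(samples_per_generation)]
        # ...then refit the next model on nothing but those samples.
        mean = statistics.fmean(data)
        spread = statistics.stdev(data)
        spreads.append(spread)
    return spreads


if __name__ == "__main__":
    history = simulate_recursive_training()
    for generation in (0, 50, 100, 200):
        print(generation, history[generation])
```

In this toy setup the spread tends to shrink toward zero as generations pass, because each refit loses a little of the original variety and nothing ever adds it back. Whether and how fast the same thing happens to large language models depends on details this sketch ignores.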

This is the primary reason why most AI models aren't trained on anything past 2021. The internet is just too full of AI-generated data.

[–] givesomefucks@lemmy.world 24 points 1 year ago* (last edited 1 year ago) (2 children)

"There does not appear to be any good solution for this"

Pay intelligent humans to train AI.

Like, have grad students talk to it in their area of expertise.

But that's expensive, so capitalist companies will always take the cheaper/shittier routes.

So it's not that there's no solution, there's just no profitable solution. Which is why innovation should never be solely in the hands of people whose only concern is profit.

[–] SinningStromgald@lemmy.world 8 points 1 year ago (1 children)

OR they could just scrape info from the "aska____" subreddits and hope and pray it's all good. Plus that is like 1/100th the work.

The racism, homophobia and conspiracy levels of AI are going to rise significantly scraping Reddit.

[–] givesomefucks@lemmy.world 8 points 1 year ago (1 children)

Even that would be a huge improvement.

Just have a human decide which subs it uses, but they'll just turn it loose on the whole website.

[–] Rentlar@lemmy.ca 5 points 1 year ago (3 children)

That reminds me, any AI trained exclusively on Reddit data is going to use "lose" vs. "loose" incorrectly. I don't know why, but I spotted that so often there.

[–] towerful@programming.dev 5 points 1 year ago

It's a loose-lose situation

[–] decisivelyhoodnoises@sh.itjust.works 3 points 1 year ago

And the "would of" thing

[–] the_post_of_tom_joad@sh.itjust.works 2 points 1 year ago

Ooh ooh and "tow the line"

[–] General_Effort@lemmy.world 1 points 1 year ago

Haha. Grad students expensive. God bless.

[–] T156@lemmy.world 8 points 1 year ago

And unlike with images where it might be possible to embed a watermark to filter out, it's much harder to pinpoint whether text is AI generated or not, especially if you have bots masquerading as users.

[–] Ultraviolet@lemmy.world 6 points 1 year ago (1 children)

This is why LLMs have no future. No matter how much the technology improves, they can never have training data past 2021, which becomes more and more of a problem as time goes on.

[–] TimeSquirrel@kbin.social -1 points 1 year ago (2 children)

You can have AIs that detect other AIs' content and can make a decision on whether to incorporate that info or not.

[–] echo64@lemmy.world 3 points 1 year ago (1 children)

Fun fact: you can't. AIs are surprisingly bad at distinguishing AI-generated things from real things.

[–] TimeSquirrel@kbin.social -2 points 1 year ago (1 children)

What is this then?

https://copyleaks.com/ai-content-detector

[–] Pips@lemmy.sdf.org 7 points 1 year ago (1 children)

Just because a tool exists doesn't mean it's particularly good at what it's supposed to do.

[–] UnknownCoop@lemmy.world 1 points 1 year ago

Yeah, just tested the website using bots from reddit's SubSimulatorGPT2. Only got 1/13 correct.

[–] skillissuer@discuss.tchncs.de 3 points 1 year ago (1 children)

Can you really trust them in this assessment?

[–] TimeSquirrel@kbin.social 2 points 1 year ago* (last edited 1 year ago) (1 children)

Doesn't look like we'll have much of a choice. They're not going back into the bag.
We definitely need some good AI content filters. Fight fire with fire. They seem to be good at this kind of thing (pattern recognition), way better than any procedurally programmed system.

[–] skillissuer@discuss.tchncs.de 3 points 1 year ago

Last time I checked, AIs were pretty bad at recognizing AI-generated content.

Anyway, there's an xkcd about it: https://xkcd.com/810/

[–] Rubisco@slrpnk.net 3 points 1 year ago (1 children)

What was the subreddit where only bots could post, and they were named after the subreddits that they had trained on/commented like?

[–] Darkard@lemmy.world 2 points 1 year ago (1 children)

SubRedditSimulator?

[–] Rubisco@slrpnk.net 2 points 1 year ago

That's the one.

[–] TakiMinase@slrpnk.net 3 points 1 year ago

Omg I cannot wait to see it.