Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts (www.businessinsider.com)

submitted 1 year ago by throws_lemy@lemmy.nz to c/technology@lemmy.world

[–] kromem@lemmy.world 17 points 1 year ago (1 children)

For everyone predicting how this will corrupt models...

All the LLMs are already trained on Reddit's data, at least from before 2015 (which is when a dump of the entire site was compiled for research).

This is only going to be adding recent Reddit data.

[–] Stovetop@lemmy.world 9 points 1 year ago (1 children)

This is only going to be adding recent Reddit data.

A growing amount of which, I would wager, is already the product of LLMs trying to simulate actual content while selling something. It's going to corrupt itself over time unless they figure out how to sanitize the input of other LLM content.

[–] kromem@lemmy.world 5 points 1 year ago* (last edited 1 year ago)

It's not really. There is a potential issue of model collapse when training only on synthetic data, but the same research on model collapse found that a mix of organic and synthetic data performed better than either alone. Additionally, for cost reasons, that research used weaker models than those typically used today, and separate research has found that you can enhance models significantly using synthetic data from SotA models.

The actual impact on future models will be minimal, and at least a bit of a mixture is probably even a good thing for future training, given the research to date.
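
To make the collapse point concrete, here's a toy sketch. It is not the setup from the research mentioned above: the "model" is just a Gaussian refit each generation, and the function name, sample size, generation count, and 50/50 split are all made up for illustration. It only shows the narrower claim that training purely on your own output tends to degrade, while mixing in fresh organic data keeps things anchored. It says nothing about a mix beating organic data alone.

```python
import random
import statistics


def final_fitted_std(generations, n, real_fraction, seed):
    """Refit a Gaussian every generation and return the last fitted std.

    Each generation's training set has n points: a share drawn fresh from
    the "organic" source N(0, 1), the rest sampled from the previous fit.
    All parameters are illustrative assumptions, not values from any paper.
    """
    rng = random.Random(seed)
    n_real = round(n * real_fraction)
    mu, sigma = 0.0, 1.0  # generation 0 matches the organic distribution
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        data = real + synthetic
        mu = statistics.fmean(data)      # refit the mean on this generation
        sigma = statistics.stdev(data)   # refit the spread on this generation
    return sigma


pure_synthetic = final_fitted_std(generations=1000, n=10, real_fraction=0.0, seed=0)
mixed = final_fitted_std(generations=1000, n=10, real_fraction=0.5, seed=0)
print(f"synthetic only: {pure_synthetic:.3g}")
print(f"half organic:   {mixed:.3g}")
```

With these settings the synthetic-only spread should shrink toward zero (the toy version of collapse), while the half-organic run should stay on the order of the original spread of 1.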