this post was submitted on 04 Jul 2024
357 points (98.1% liked)
Technology
LLMs are insanely productive content creators. We can't say how much of the web is generated by them at any given moment (and that's ignoring older copypaste articles), but the share of organic material one wants to prioritise in machine learning gets significantly reduced. This tech, if not isolated from its own output as learning material, predictably falls into a feedback loop, and with each cycle it is going to get worse.
Surprisingly, pre-LLM-boom datasets may well become more valuable than contemporary ones.
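To make the feedback loop concrete, here is a toy sketch, not a claim about how real LLM training works. It assumes "training" is nothing more than resampling with replacement from the previous generation's output, and it measures diversity as the number of distinct items left in the corpus. The function name and parameters are made up for illustration.

```python
import random

def simulate_feedback_loop(corpus_size=200, generations=30, seed=0):
    """Toy model: each generation learns only from the previous one's output.

    'Training' is plain resampling with replacement. That is a stand-in
    assumption for illustration, not how an actual LLM is trained.
    """
    rng = random.Random(seed)
    # Generation 0: an "organic" corpus in which every item is unique.
    corpus = list(range(corpus_size))
    diversity = [len(set(corpus))]
    for _ in range(generations):
        # The next generation can only reproduce items it has already seen,
        # so the set of distinct items can shrink but never grow.
        corpus = rng.choices(corpus, k=corpus_size)
        diversity.append(len(set(corpus)))
    return diversity

history = simulate_feedback_loop()
print(history)  # distinct items per generation, starting at 200
```

With no fresh organic input, the count of distinct items never goes back up, which is the "each cycle gets worse" part in miniature. Real systems are far messier, but the direction of the effect is the same idea.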
Garbage in, garbage out
I remember reading that from 2021 to 2023, LLMs generated more text than all humans had ever published combined, so arguably, genuinely human-generated text is going to be a rarity.