this post was submitted on 22 Aug 2023
628 points (95.9% liked)
Fediverse
17677 readers
70 users here now
A community dedicated to fediverse news and discussion.
Fediverse is a portmanteau of "federation" and "universe".
Getting started on Fediverse;
- What is the fediverse?
- Fediverse Platforms
- How to run your own community
founded 4 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
If anyone bothered to actually look at the SQL SELECT that Lemmy uses to list posts every time you hit refresh it would be blindingly obvious how convoluted it is. yet the community does not talk about the programming issues and instead keeps raising money for 64 core hardware upgrades without recognizing just how tiny Lemmy's database really is and how 57K users is not a large number at all!
I mentioned "ORM" right in my first comment.
Damn, so many joins :/
How could this monster be optimized though?
First optimization is to not fetch every field and prune it down. For example, it gets public key and private key for every community and user account - then does nothing with them. That's just pushing data between Rust and PostgreSQL for no reason. That kind of thing is pretty obvious.. the huge number of things listed after "SELECT".
The whole approach is what I recently described as: make a JOIN fusion implosion bomb, then wait for null columns to fall out
There are short-term and long-term solutions. Right now there is already a new feature that will add one more JOIN that is pending merge.... "instance blocking" by each single user.
Based on the server overloads and resulting crashes, I think some obvious solutions would be to remove post_aggregates table entirely and just throw more columns on the post table... I've seen people do stuff like that. But really you have to have a concept of core foundation.
To me the core foundation of Lemmy data is that people want fresh meat, when world events get into a frenzy, they want to F5 and get the LATEST post and the LATEST comments. Data should have a big wall between the most recent 5 days and everything else. It's the heart of the beast of human events and a platform like this.
From that perspective, that fresh posts and fresh comments mean everything, you can optimize by just doing a INNER SELECT before any JOIN... or partition the database TABLE into recent and non-recent, or some out-of-band steps to prepare recent data before this SELECT even comes up from an API call... and not let PostgreSQL do so much heavy lifting each page refresh.
If I remember, I'm gonna look into that tomorrow when I'm not on a phone screen. Not that I could contribute anything, but this seems like a good opportunity to learn some advanced stuff. Thanks for your answer!