Pagination should probably be redone the way Reddit does it. That is, return a "next page" token, an indexed field taken from the last row of each page, instead of page=NUMBER. Getting the next page then becomes WHERE token > xyz, which is an indexed operation, instead of OFFSET NUMBER, which requires counting off that many rows and gives the user duplicate results if new posts arrive while they browse. This is a well-known performance trick; Wikipedia does it too, and I think 'SQL Performance Explained' by Markus Winand covers it.
Lemmy Server Performance
lemmy_server uses the Diesel ORM, which automatically generates SQL statements. Serious performance problems in June and July 2023 are preventing Lemmy from scaling. Topics include caching, PostgreSQL extensions for troubleshooting, client/server code, SQL data, server operator apps, and the server operator API (performance and storage monitoring), etc.
Regarding the "next page" token suggestion above: there is already a pull request on changing paging.
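A minimal sketch of the keyset ("next page" token) approach from the opening comment, contrasted with OFFSET paging. This assumes an index on (published, id); the table and column names mirror post_aggregates but are only illustrative, and the literal values stand in for the token the client would send back.

-- OFFSET paging: the server must count off and discard 2000 rows to serve page 101,
-- and rows can shift between pages as new posts arrive.
SELECT id, published
FROM post_aggregates
ORDER BY published DESC, id DESC
LIMIT 20 OFFSET 2000;

-- Keyset paging: the client echoes back the (published, id) of the last row it saw,
-- and the row-value comparison seeks straight into the index instead of skipping rows.
SELECT id, published
FROM post_aggregates
WHERE (published, id) < ('2023-08-01 12:34:56', 123456)
ORDER BY published DESC, id DESC
LIMIT 20;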
My focus is the very hard wall on performance and scale. As things stand, there is way too much potential for queries to run into the full post table.
OK, experimenting on a massive test data set of over 5 million posts... this PostgreSQL query works pretty well:
SELECT COUNT(ranked_recency.*) AS post_row_count
FROM (
    SELECT id, community_id, published,
        rank() OVER (
            PARTITION BY community_id
            ORDER BY published DESC, id DESC
        )
    FROM post_aggregates
) ranked_recency
WHERE rank <= 1000
;
This limits any one community to 1000 posts, picking the most recently created ones. It gives a way to age out older data in very active communities without removing any posts at all from small communities.
An even less-intrusive approach is to not add any new field to existing tables. Establish a reference table, say called include_range. There is already an ENUM value for each sort type, so the include_range table would have these columns: sort_type ENUM, lowest_id BigInt, highest_id BigInt.
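A sketch of what that reference table could look like, using the column names suggested above; the type name sort_type_enum is an assumption standing in for whatever the existing sort ENUM is actually called.

CREATE TABLE include_range (
    -- one row per sort order; assumes the existing sort ENUM type
    sort_type sort_type_enum PRIMARY KEY,
    -- smallest and largest post_aggregates.id to include for this sort
    lowest_id BIGINT NOT NULL,
    highest_id BIGINT NOT NULL
);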
Run a variation of the rank query above to populate that table:
SELECT MIN(ranked_recency.id) AS lowest_id, MAX(ranked_recency.id) AS highest_id
FROM (
    SELECT id, community_id, published,
        rank() OVER (
            PARTITION BY community_id
            ORDER BY published DESC, id DESC
        )
    FROM post_aggregates
) ranked_recency
WHERE rank <= 1000
Run it against every sort order, including Old. Capture only two BigInt results, the MIN(id) and the MAX(id); that gives a range over the whole table. Then every SELECT on post_aggregates / the post table includes a WHERE id >= lowest_id AND id <= highest_id.
That would put in a basic sanity check that ages out content, and the filter would run right against the primary key!
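A sketch of both halves under the assumptions above: the include_range table from the earlier sketch, and 'New' standing in for whichever sort_type ENUM value is being built or served.

-- Populate (or refresh) the row for one sort order; the inner query is the
-- rank-per-community variation shown above, with the ORDER BY adjusted per sort.
INSERT INTO include_range (sort_type, lowest_id, highest_id)
SELECT 'New', MIN(ranked_recency.id), MAX(ranked_recency.id)
FROM (
    SELECT id, community_id,
        rank() OVER (
            PARTITION BY community_id
            ORDER BY published DESC, id DESC
        )
    FROM post_aggregates
) ranked_recency
WHERE rank <= 1000
ON CONFLICT (sort_type) DO UPDATE
    SET lowest_id = EXCLUDED.lowest_id,
        highest_id = EXCLUDED.highest_id;

-- Every listing query for that sort then adds the id-range filter,
-- which is a bounded range right on the primary key.
SELECT p.*
FROM post_aggregates p
JOIN include_range r ON r.sort_type = 'New'
WHERE p.id >= r.lowest_id
  AND p.id <= r.highest_id
ORDER BY p.published DESC, p.id DESC
LIMIT 20;

The idea is that the id range hands the planner a bounded slice of the table no matter which sort index ends up serving the query.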
A core design consideration for either approach is that server operators can adjust how this data is built without needing to modify or restart the lemmy_server Rust code.
3 hours later... I put it into code and am experimenting with it. Some proof of concept results: https://github.com/LemmyNet/lemmy/files/12373819/auto_explain_list_post_community_0_18_4_dullbananas_with_inclusion_run0a.txt
It can't be as simple as a date range, because we want to be inclusive of smaller communities.
- Paging is a consideration: 1000 posts per community would allow 50 pages of 20 posts.
- Small communities are defined as having 1000 or fewer posts, regardless of age.
- Large communities would focus on recency: the 1000 posts would be the most recently created or edited.
- Edited is trickier; either skip it for now or work out how to keep some kind of mass edit from crowding out newly published posts.
Also a good time to be reminded that the published date isn't reliable, for a couple of reasons:
- Problems have been seen in the field with incoming federation data carrying future published dates on content. kbin is an easy example, but it isn't limited to kbin.
- Federation can lag due to server overload, problems with paths between specific servers, ISP issues, etc. It is rather common to receive a post hours after its published date. Lemmy currently does not track the 'received' date of content.