this post was submitted on 04 Aug 2023
52 points (98.1% liked)

Fediverse

A community to talk about the Fediverse and all its related services using ActivityPub (Mastodon, Lemmy, KBin, etc).

If you want help with moderating your own community, head over to !moderators@lemmy.world!

Learn more at these websites: Join The Fediverse Wiki, Fediverse.info, Wikipedia Page, The Federation Info (Stats), FediDB (Stats), Sub Rehab (Reddit Migration), Search Lemmy

founded 1 year ago
top 11 comments
[–] Blaze@sopuli.xyz 33 points 1 year ago (1 children)
[–] marsara9@lemmy.world 29 points 1 year ago (2 children)

Thanks for the shout-out.

But FYI, I've run into some bugs that are preventing new content from being indexed. So you won't see anything new (since about a week ago) until I can find another method to fetch new posts.

[–] leo@lemmy.linuxuserspace.show 5 points 1 year ago

Just dropping by to say “thanks”

So, thanks for putting in the work. I know it isn’t easy. And can easily be overlooked. ❤️

[–] redcalcium@lemmy.institute 4 points 1 year ago* (last edited 1 year ago) (1 children)

Would it be better if your backend federated with popular instances so they push new posts and comments to your search engine? That way, you don't need to scrape those instances to index their content (because they'll voluntarily send the content to you via ActivityPub).
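
To make the push idea concrete, here is a rough sketch of what the receiving side might do with one incoming activity. It assumes Lemmy-style federation, where a community relays activities wrapped in an `Announce`, posts arrive as `Page` objects and comments as `Note` objects. The function name `extract_indexable` and the returned field names are made up for illustration, and a real inbox would also have to verify HTTP signatures and handle the follow handshake, which is not shown.

```python
from typing import Optional


def extract_indexable(activity: dict) -> Optional[dict]:
    """Return a small search document for a Create activity, else None.

    Assumption: communities relay activities wrapped in an Announce,
    so unwrap one level before looking at the activity type.
    """
    inner = activity.get("object")
    if activity.get("type") == "Announce" and isinstance(inner, dict):
        activity = inner

    if activity.get("type") != "Create":
        return None  # votes, deletes, follows etc. are ignored in this sketch

    obj = activity.get("object")
    if not isinstance(obj, dict) or obj.get("type") not in ("Page", "Note"):
        return None

    return {
        "url": obj.get("id"),
        "kind": "post" if obj["type"] == "Page" else "comment",
        "title": obj.get("name"),       # comments normally have no title
        "text": obj.get("content", ""),
    }
```

Anything that comes back as `None` can simply be dropped, so the search backend only has to understand the small subset of activities that carry new content.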

[–] marsara9@lemmy.world 4 points 1 year ago (2 children)

Yep, that's the new idea. The sad part is that with this method there's no way to get historical data, only new posts. So if a server goes down, gets DDoSed, etc., I'll lose posts forever.

Building an ActivityPub implementation from scratch isn't trivial either, so that'll take some time.

I've got a few other ideas I'm playing with as well. Like just assuming that internal post IDs are all sequential and literally fetching them one by one. Or maybe some combination of both?
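
A minimal sketch of the sequential-ID idea, assuming post IDs on a given instance are integers that mostly increase by one. The `fetch_post` callable is hypothetical and injected so the loop can be tried without a network. In practice it might wrap Lemmy's `GET /api/v3/post?id=<n>` endpoint and return `None` for a missing post. The gap limit is an arbitrary illustrative value.

```python
from typing import Callable, Iterator, Optional


def crawl_sequential(
    fetch_post: Callable[[int], Optional[dict]],
    start_id: int = 1,
    max_consecutive_misses: int = 50,
) -> Iterator[dict]:
    """Yield posts by walking IDs upward from start_id.

    Stops after max_consecutive_misses missing IDs in a row, on the
    assumption that a long gap means we have passed the newest post.
    """
    misses = 0
    post_id = start_id
    while misses < max_consecutive_misses:
        post = fetch_post(post_id)
        if post is None:
            misses += 1   # deleted post, or past the end
        else:
            misses = 0    # any hit resets the gap counter
            yield post
        post_id += 1


# Usage with an in-memory stand-in for a remote instance:
fake_instance = {
    1: {"id": 1, "name": "first"},
    2: {"id": 2, "name": "second"},
    5: {"id": 5, "name": "fifth"},
}
found = list(crawl_sequential(fake_instance.get, max_consecutive_misses=3))
```

Remembering the last ID that was seen and passing it back in as `start_id` would let the same loop pick up new posts later, which is one way the two approaches could be combined.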

[–] Die4Ever@programming.dev 3 points 1 year ago* (last edited 1 year ago)

Instead of building a new ActivityPub implementation, you could just run a regular instance of Lemmy and pull data from its database directly? Or use its API for searches?
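
For the API route, a small sketch of building a search request against a Lemmy instance. It assumes the `/api/v3/search` endpoint with `q`, `type_` and `limit` query parameters, as in the 0.18-era API. The helper name `build_search_url` and the host are placeholders, and no request is actually sent here.

```python
from urllib.parse import urlencode


def build_search_url(instance: str, query: str, kind: str = "Posts", limit: int = 20) -> str:
    """Return the search URL for one instance (assumed v3 API layout)."""
    params = {"q": query, "type_": kind, "limit": limit}
    return f"https://{instance}/api/v3/search?{urlencode(params)}"


# Hypothetical host; swap in the instance you actually run.
url = build_search_url("lemmy.example", "fediverse search")
```

This only covers whatever the instance itself has federated, so coverage depends on which communities that instance subscribes to.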

[–] redcalcium@lemmy.institute 1 points 1 year ago* (last edited 1 year ago)

Why not talk to the instance admins directly and ask for their database dumps (minus the user accounts table and DMs) so you can ingest them into your search index? You're doing this for the benefit of the fediverse, right? I'm sure most instance admins would help you if you asked (and it's easier on their servers too, because your scraper isn't bombarding them). I know I would, though my instance only has 2 users right now, so it'd probably be useless to index. This should take care of the historical data problem, and you can use ActivityPub to obtain new data going forward without scraping the instances.

[–] Gamey@lemmy.world 6 points 1 year ago

The issue is that you can't search a ton of servers at once unless you index all their posts. To get a proper search engine someone would not only have to spend the time to code it but also the money to host that beast! :/
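
To show why everything has to be indexed up front, here is a toy inverted index. It is a sketch only: the class name `PostIndex`, the tokenizer and the example URLs are invented, and a real search engine would add ranking, persistence and much more.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PostIndex:
    """Maps each word to the set of post URLs that contain it."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add(self, url: str, text: str) -> None:
        for token in tokenize(text):
            self._postings[token].add(url)

    def search(self, query: str) -> set:
        """Return URLs of posts containing every word in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = PostIndex()
index.add("https://a.example/post/1", "Searching the Fediverse")
index.add("https://b.example/post/9", "Fediverse search engines are costly")
hits = index.search("fediverse costly")
```

A query is answered from one local structure regardless of how many instances the posts came from, which is exactly why the storage and hosting cost grows with the number of posts indexed.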

[–] Tiritibambix@lemmy.ml 2 points 1 year ago (1 children)

I don't use it a lot, but the few times I did, it didn't fail me: Lemmyverse

[–] Aimhere@midwest.social 1 points 1 year ago

That site only lists Lemmy instances and communities, with several filtering options. It's not a search engine for Lemmy content.

[–] dunning_cougar@waveform.social -1 points 1 year ago

I only use Memmy on iOS and the search function is great.