this post was submitted on 26 Jul 2024
If you're looking for a distributed crawler and index:
https://en.wikipedia.org/wiki/YaCy
YaCy already exists and has been around for two decades.
This is close to what I was thinking, but rather than crawling independently, it would leverage the API results from queries to build a list of sites (and then perhaps crawl them). Potentially a tag index of sorts. I'm not solid on any idea yet, as I haven't investigated SearXNG enough to see how it works under the hood, but yes, it's on the same plane of thought.
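To make the idea a bit more concrete, here's a rough sketch of what "build a list of sites from query results" could look like. It assumes the metasearch results arrive as a list of dicts with a `url` key, which is roughly the shape of a JSON search response; `update_tag_index` and `crawl_candidates` are made-up names for illustration, not part of SearXNG or YaCy, and the sample results are hard-coded instead of fetched from a real instance.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def update_tag_index(index, query, results):
    """Record which sites were returned for each term of a query.

    `results` is assumed to be a list of dicts with a "url" key
    (an assumption about the metasearch response shape).
    """
    tags = {term.lower() for term in query.split() if term}
    for result in results:
        host = urlsplit(result.get("url", "")).hostname
        if not host:
            continue  # skip results without a usable URL
        for tag in tags:
            index[tag].add(host)
    return index


def crawl_candidates(index):
    """Return every site seen so far, sorted, as seeds for a crawler."""
    return sorted({host for hosts in index.values() for host in hosts})


# Hard-coded sample data standing in for one query's results.
index = defaultdict(set)
sample_results = [
    {"url": "https://en.wikipedia.org/wiki/YaCy", "title": "YaCy"},
    {"url": "https://yacy.net/", "title": "YaCy home"},
    {"title": "result with no URL"},
]
update_tag_index(index, "distributed search", sample_results)
print(crawl_candidates(index))
```

The index here only maps query terms to hostnames, so it says nothing about ranking; it just gives a crawler somewhere to start.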
I ran a YaCy instance for a while, like a decade ago. It does federate index requests, and when you search, it propagates the search request across a bunch of nodes. When my node came online it almost immediately started crawling stuff, and it did get a bunch of search queries. But the network was still pretty small back then and the search results were... not great. That's the price of independence from Google's and Microsoft's giant server farms: it's hard to compete with that size.
But at the rate Google and Bing are enshittifying, I think it's worth revisiting.
Using ActivityPub for this would be immensely wasteful. It's just not feasible for every instance to hold the whole index, because it's so large. Back when I tried it, the network already had several TBs worth of indexed pages. This is firmly in the realm of distributed P2P systems. One could, however, have an ActivityPub plugin to receive updates from social media near-instantly and index those immediately with less overhead. But you still want to index Wikipedia, forums, blogs, whatever the crawlers can find.
Sure. SearX is a meta-search engine. It only sends queries to other search engines to get results. YaCy, on the other hand, is itself a search engine. It has the data available and doesn't query other engines. In theory you could combine the two concepts and have one piece of software that does both. But that requires some clever thinking. The returned (Google) ranking only applies to the exact search term, and it's questionable whether you can store it and do anything useful with it, except when some other user searches for the exact same thing. The returned teaser texts are also very short and tailored to the search query, so they may be useless too. It'd be hard.
One thing you could do is crawl the results that users actually click on. And I think YaCy already does that. AFAIK they had a browser add-on or a proxy or something to intercept visited pages (and hence search results).
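As a rough illustration of "crawl what users click on", the bookkeeping could be as small as a de-duplicating queue. This is only a sketch under my own assumptions: `ClickCrawlQueue` is a hypothetical name, it is not how YaCy's add-on or proxy actually works, and the fetching and indexing part is left out entirely.

```python
from collections import deque
from urllib.parse import urldefrag


class ClickCrawlQueue:
    """Hypothetical queue of clicked result URLs waiting to be crawled."""

    def __init__(self):
        self._seen = set()       # every URL ever recorded
        self._pending = deque()  # URLs not yet handed to the crawler

    def record_click(self, url):
        """Queue a clicked URL once; return True if it was new."""
        clean, _fragment = urldefrag(url)  # drop "#section" anchors
        if not clean or clean in self._seen:
            return False
        self._seen.add(clean)
        self._pending.append(clean)
        return True

    def next_url(self):
        """Return the oldest pending URL, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = ClickCrawlQueue()
queue.record_click("https://en.wikipedia.org/wiki/YaCy#History")
queue.record_click("https://en.wikipedia.org/wiki/YaCy")  # duplicate after defrag
queue.record_click("https://yacy.net/")
```

Only pages someone actually visited end up in the queue, which keeps the crawl small but also means the index is biased toward what your own users already found.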
I really want to use this, but from what I read it basically requires a minimum of 20-30 GB of RAM to be performant. Also, the documentation appears to be a mess and highly outdated. I'd also want to cluster it internally and still connect with outside peers, which seems possible, but with the large resource requirement it's not as feasible with my setup.
That's odd. The project page states 256 megabytes, and practically speaking that's nothing. Where did you find 20-30 GB? Are you sure you're not confusing the memory requirement with the suggested free hard drive space?
Even if it does need 32 GB of RAM to perform well, that's not a very high hurdle. 32 GB of used DDR4 can be had for less than $75. Toss that in an old 8th- or 9th-gen Core i5 desktop, install your preferred flavor of Linux, add Docker, and you're off to the races.
I've run it in containers and it never used that many resources. The whole server (running a few dozen containers) had 32 GB, and it wasn't impacted in any noticeable way.
That is misinformation. It doesn't need anywhere close to that amount of RAM. It's pretty much like other web apps, and I used to run it on an old computer. It'll fill up your hard disk, though, if you allow it to.
There also seem to be a lot of settings, so perhaps they had it misconfigured. It's also Java, so I wouldn't put it past such a monolith of a Java program to require that much to be performant. Perhaps I'll try a cluster of them then and see how it fares.
Well, the initial setup was definitely interesting. I didn't want to expose 8090 and wanted it behind a web proxy; I finally got that working and actually received my first remote crawl overnight. I had to change to 80/443 internally so it would map correctly for P2P connections; the public port setting apparently doesn't cut it. I kinda dislike the whole setup with it micromanaging CPU load, but otherwise it doesn't seem atrocious, for a new peer at least. I guess this and the web proxy problems are likely awkward due to the age of the software.