I used to get a lot of scrapers hitting my Lemmy instance, most of them spread across a bunch of IP ranges, some of them disguising their user agent as a regular browser's.
What's been working for me is a custom nginx log format combined with a custom fail2ban filter, which lets me easily block new bots once I identify some kind of signature.
For instance, one of these scrapers almost always sends requests that are around 250 bytes long, while claiming the user agent of a legitimate browser whose requests are always 300 bytes or larger. I can then add a fail2ban jail that triggers on seeing this specific user agent with the wrong request size.
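As a rough illustration of that signature check (not the actual fail2ban filter, which isn't shown here), the logic boils down to something like the Python sketch below. nginx exposes the request size as `$request_length`, so a custom log format can include it. The log layout, the user agent string and the names `LINE_RE`, `SUSPECT_UA`, `MIN_LEGIT_LENGTH` and `matches_signature` are all made up for the example.

```python
import re

# Hypothetical log layout: '<ip> <request_length> "<user agent>"'.
# A real custom nginx log_format would likely carry more fields.
LINE_RE = re.compile(r'^(?P<ip>\S+) (?P<length>\d+) "(?P<ua>[^"]*)"$')

# Hypothetical signature: the real browser behind this user agent
# never sends requests smaller than 300 bytes.
SUSPECT_UA = "ExampleBrowser/1.0"
MIN_LEGIT_LENGTH = 300


def matches_signature(line):
    """Return the client IP if the line matches the bot signature, else None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # line is not in the expected format
    if m["ua"] == SUSPECT_UA and int(m["length"]) < MIN_LEGIT_LENGTH:
        return m["ip"]  # right user agent, wrong request size
    return None
```

In fail2ban itself the same idea would be expressed as a `failregex` that matches the user agent together with a request length below the cutoff.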
On top of this, I wrote a simple script that monitors my fail2ban logs and writes CIDR ranges that appear too often to a file (the threshold is proportional to 1.5^(32-subnet_mask)). This file is then parsed by fail2ban to block whole ranges. There are some specific details I omitted regarding bantime and findtime, which ensure that a small malicious range will not be able to trick me into blocking a larger one. This has worked flawlessly to block "hostile" ranges, with apparently zero false positives, for nearly a year.
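The range threshold can be sketched in Python as follows. This is only a guess at the shape of such a script, limited to IPv4: the proportionality constant `BASE`, the candidate prefix lengths, and counting one hit per ban entry are all assumptions, and the bantime/findtime handling is left out because those details aren't given.

```python
import ipaddress
from collections import Counter

BASE = 1.0  # hypothetical proportionality constant


def threshold(prefix):
    """Bans required before a /prefix range counts as hostile."""
    return BASE * 1.5 ** (32 - prefix)


def hostile_ranges(banned_ips, prefixes=(28, 24)):
    """Return the CIDR ranges whose ban count reaches their threshold.

    banned_ips: one IPv4 address string per ban entry (assumed input).
    prefixes: candidate subnet mask lengths to test (assumed values).
    """
    counts = Counter()
    for ip in banned_ips:
        for prefix in prefixes:
            # strict=False masks the host bits, giving the enclosing network.
            net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            counts[net] += 1
    return sorted(
        str(net) for net, n in counts.items() if n >= threshold(net.prefixlen)
    )
```

With `BASE = 1`, a /28 needs 6 bans (1.5^4 is about 5.06) and a /24 needs 26 (1.5^8 is about 25.6), so the bar rises quickly as the range grows.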