this post was submitted on 19 Nov 2023
28 points (86.8% liked)

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ


Around a month ago, I posted a poll on this sub asking for feedback on a Coomer.su and kemono.su scraper I've been developing. This post is an update on where development is going.

For anybody unaware, I have been working on scraping software that lets you mass-download posts from creators on both kemono and Coomer. The sites don't offer this as a built-in feature, which I found somewhat stupid, so I set out to create my own tool.

In my previous post, I talked about the basic features the scraping software would have, and many people pointed out that similar software already exists. After looking at the software people pointed me to, I felt it did not meet my expectations or quality standards, so I continued with this project.

The major driving factor of this scraping software is the built-in translator, integrated directly into the codebase, which translates post titles and descriptions as they are scraped, courtesy of Google Translate. This feature has exceeded my expectations; the only downside is Google's fair rate limit, which can kick in if you translate too many words. This typically only happens with post descriptions and takes upwards of 1k words to trigger, so I feel it is okay in its current state. For the time being, there is a toggle in the code for translating post descriptions, which defaults to off. I may add automatic service switching in the future, but for right now it should work more than well enough. The translator lets people who speak any language scrape from the PartySites, which is invaluable if your language isn't widely used on the sites.
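To make the toggle and the rate-limit behaviour concrete, here is a rough sketch. It is written in Python rather than the project's C#, and it is not the tool's actual code: `Post`, `RateLimitError`, `translate_post`, and the pluggable `translate` callable are all hypothetical names, and falling back to the untranslated text when the limit is hit is an assumption about one reasonable way to handle it. Translating titles unconditionally is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical error a translation backend raises when rate limited."""


@dataclass
class Post:
    title: str
    description: str


def _safe_translate(text: str, translate: Callable[[str], str]) -> str:
    # Assumption: on a rate limit, keep the original text instead of failing.
    try:
        return translate(text)
    except RateLimitError:
        return text


def translate_post(
    post: Post,
    translate: Callable[[str], str],
    translate_descriptions: bool = False,  # defaults to off, as described above
) -> Post:
    """Translate the title; translate the description only when toggled on."""
    title = _safe_translate(post.title, translate)
    description = post.description
    if translate_descriptions:
        description = _safe_translate(post.description, translate)
    return Post(title=title, description=description)
```

Any function that maps a string to a string can stand in for `translate`, so a second service could be swapped in later without touching `translate_post`.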

I've also ported the codebase over to a C# .NET 6 class library for developers, allowing them to create their own scraping software if desired. The project currently has an attached GUI that I am working on refining for the general public.

As I've stated before, the concept of this project is extremely simple: the codebase compiles to a meager 18 KB excluding libraries. It surprises me that nobody has programmed this to an acceptable standard yet.

I plan to release this scraper in the coming weeks, once some bugs are sorted out and Discord support is possibly added.

Please let me know what you'd like to see in this, as feedback is always appreciated.

[–] arr@lemmy.dbzer0.com 11 points 1 year ago (1 children)

When this goes public, the creators are more than welcome to shoot me a message on GitHub and I’d happily remove it.

The only real load cost is on their CDN server which is probably designed for high traffic environments.

I can also modify the user agent of the HTTP requests so they could filter them based specifically on this application if that’s an approach they’d be okay with.

Why not just message them at their contact email address and ask in advance whether your assumption about their CDN server is true, whether you should set a specific user agent, and so on? Then they wouldn't have to potentially waste time figuring out what's happening, writing and deploying filtering or rate-limiting logic, or finding the repository and contacting you on GitHub.
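For reference, tagging requests with an application-specific user agent is a small change. The sketch below uses Python's standard library rather than the tool's C#, and the `USER_AGENT` string and `build_request` helper are made up for illustration; the real tool's name and version are not given in this thread. It only builds the request object and sends nothing.

```python
import urllib.request

# Hypothetical identifier; a real one would name the tool and a contact point.
USER_AGENT = "ExampleScraper/0.1"


def build_request(url: str) -> urllib.request.Request:
    """Return a request object that carries the tool-specific User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

A site operator could then match on that exact string to filter or rate limit the tool's traffic without affecting regular visitors.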

[–] Set8@lemmy.world 5 points 1 year ago

You do have a point. I'll look into this.