this post was submitted on 28 Dec 2024
260 points (94.2% liked)

Ask Lemmy


I created this account two days ago, but one of my posts ended up in the (metaphorical) hands of an AI powered search engine that has scraping capabilities. What do you guys think about this? How do you feel about your posts/content getting scraped off of the web and potentially being used by AI models and/or AI powered tools? Curious to hear your experiences and thoughts on this.


# Prompt Update

The prompt was something like, "What do you know about the user llama@lemmy.dbzer0.com on Lemmy? What can you tell me about his interests?" Initially, it generated a lot of fabricated information, but it would still include one or two accurate details. When I ran the test again, the response was much more accurate than the first attempt. It seems that as my account became more established, it became easier for the crawlers to find relevant information.

It even talked about this very post on item 3 and on the second bullet point of the "Notable Posts" section.

For more information, check this comment.


Edit¹: This is Perplexity. Perplexity AI employs data scraping techniques to gather information from various online sources, which it then utilizes to feed its large language models (LLMs) for generating responses to user queries. The scraping process involves automated crawlers that index and extract content from websites, including articles, summaries, and other relevant data. It is an advanced conversational search engine that enhances the research experience by providing concise, sourced answers to user queries. It operates by leveraging AI language models, such as GPT-4, to analyze information from various sources on the web. (12/28/2024)

Edit²: One could argue that data scraping by services like Perplexity may raise privacy concerns because it collects and processes vast amounts of online information without explicit user consent, potentially including personal data, comments, or content that individuals may have posted without expecting it to be aggregated and/or analyzed by AI systems. One could also argue that this indiscriminate collection raises questions about data ownership, proper attribution, and the right to control how one's digital footprint is used in training AI models. (12/28/2024)

Edit³: I added the second image to the post and its description. (12/29/2024).

[–] AA5B@lemmy.world 17 points 4 days ago (3 children)

I’m pretty much fine with AIs scraping my data. What they can see is public knowledge and was already being scraped by search engines.

I object to:

  • sites like Reddit whose entire existence is due to user content, deciding they can police and monetize my content. They have no right
  • sharing of data, which includes more personal and identifiable data
  • whatever the AI summarizes me as being treated as fact, such as by a company's HR, regardless of context, accuracy, or hallucinations
[–] Keening@lemmy.world 2 points 4 days ago (1 children)

Public knowledge about individuals, when condensed and analyzed in depth in huge databases, can patternize your entire existence, and you're susceptible to being swayed in a certain direction in, for example, elections. Creating further divide, and into someone else's pockets.

[–] AA5B@lemmy.world 2 points 4 days ago* (last edited 4 days ago) (1 children)

Maybe but I can’t object too much if I put my content out in public. When forced to create an account I use minimal/false information and a unique generated email. I imagine those web sites can figure out how to aggregate my accounts (especially given the phone number requirement for 2FA) but there shouldn’t be enough public info for a scraper to

[–] Keening@lemmy.world 2 points 4 days ago (1 children)

Gotta think larger than yourself though. What happens when your spouse uses real info? Your kids? Your parents? They'll shadowplay your person with great accuracy and fill in the gaps. You don't even have to "put content" out there. Said databases can just put two and two together. How will you, or other users, even know you're actually talking to a human? Perhaps you're on Lemmy and we're all bots trying to get you to admit fragments of your latest crimes in order to get you into jail for said crime? Etcetera. At first glance this all looks harmless, but any accumulated information in huge databases is a major infringement on personal integrity at best, and complete control of your freedom at worst. The ultimate power is when someone can make you do X or Y and you don't even realize you're doing their bidding, but believe you have a choice when you don't. (Similar to how it is in my living situation at home with my gf, that is :P jk.)

Hakuna matata. Happy new year

[–] AA5B@lemmy.world 2 points 4 days ago

I completely agree, except that I think of them as multiple related privacy issues. In the scope of AI bots scraping my public content, most of these are out of scope.

[–] llama@lemmy.dbzer0.com 2 points 4 days ago (2 children)

What did you mean by "police" your content?

[–] weststadtgesicht@discuss.tchncs.de 5 points 4 days ago (1 children)

Not the person you are replying to but Reddit does not make the content you created available for everyone (blocking crawlers, removing the free API) but instead sells it to the highest bidder.

[–] AA5B@lemmy.world 1 points 4 days ago

Right, that’s my objection. After benefitting from my content, they police it, as in restrict other sites from seeing it, until it’s monetized. It’s not Reddit’s to charge money for.

[–] AA5B@lemmy.world 1 points 4 days ago

Probably not the right word, but my content should still be my content. I offered it to Reddit but that doesn’t mean they have the right to charge others for it or restrict it to others for commercial reasons.

[–] Atemu@lemmy.ml 0 points 4 days ago

sites like Reddit whose entire existence is due to user content, deciding they can police and monetize my content. They have no right

Um, no, they do in fact have "every right" here. It's shitty of course, but you explicitly gave them that right in the form of a perpetual, irrevocable, world-wide etc. license to do whatever they like with everything you publish on their site.

They also have every right to "police" your content, especially if it's objectionable. If you post vile shit, trolling or other societal garbage behaviour on the internet, nobody wants to see it.