this post was submitted on 08 Oct 2023
486 points (97.3% liked)

BBC will block ChatGPT AI from scraping its content::ChatGPT will be blocked by the BBC from scraping content in a move to protect copyrighted material.

[–] Hubi@feddit.de 86 points 1 year ago (1 children)

Makes sense, OpenAI will probably have to apply for a TV-license first.

[–] FlyingSquid@lemmy.world 6 points 1 year ago (1 children)

I don't live in the UK, but I would gladly pay the TV license fee, or even a premium on top of it, if I had unlimited access to iPlayer. My only option right now is BritBox, which is not great and not really worth the money.

[–] jaackf@lemm.ee 4 points 1 year ago (1 children)

Just VPN to the UK and then tick the box which says you have a TV license? Or there are other ways to get the content most likely! 🏴‍☠️

[–] FlyingSquid@lemmy.world 3 points 1 year ago

VPNs are always blocked in my experience.

[–] csm10495@sh.itjust.works 74 points 1 year ago (3 children)

I wonder if anyone actually thinks robots.txt is binding, or that it isn't simply ignored by anyone who wants to ignore it.

[–] lemmyvore@feddit.nl 46 points 1 year ago

OpenAI will have to deal with a lot of lawsuits in the future. Robots.txt may not be legally binding but disobeying it after claiming otherwise would go a long way towards establishing intent.

[–] andrew@lemmy.stuart.fun 16 points 1 year ago

I mean, under the CFAA you could probably pretty easily pursue charges when explicitly deauthorizing certain agents from accessing your data. Plenty of people have been threatened and prosecuted for less.

https://www.nacdl.org/Landing/ComputerFraudandAbuseAct

[–] totallynotfbi@lemm.ee 6 points 1 year ago

I mean, you could just block OpenAI's crawlers' IP addresses, if you wanted to
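A rough sketch of what that kind of IP-based block could look like on the server side. The CIDR ranges here are documentation-only placeholders (RFC 5737), not OpenAI's real addresses, and `is_blocked` is a hypothetical helper name; you would fill the list from whatever ranges the crawler operator publishes.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real crawler addresses.
# Assumption: you maintain this list yourself from the operator's published ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

A request handler would call `is_blocked(remote_addr)` and return a 403 when it is true. This only works as long as the crawler keeps using the listed ranges.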

[–] Noite_Etion@lemmy.world 58 points 1 year ago (1 children)

Big businesses won't lift a finger to halt global warming, but the second their precious copyrights are attacked they go into full force.

[–] Moneo@lemmy.world 1 points 1 year ago

I mean, yeah? Corporations are always going to act in their best interest, that's why regulation exists.

[–] netchami@sh.itjust.works 34 points 1 year ago (2 children)
[–] porkins@sh.itjust.works 12 points 1 year ago (3 children)

I’d rather have ChatGPT know about news content than not. I appreciate the convenience. The news shouldn’t have barriers.

[–] netchami@sh.itjust.works 36 points 1 year ago (2 children)

But ChatGPT often takes correct and factual sources and adds a whole bunch of nonsense and then spits out false information. That's why it's dangerous. Just go to the fucking news websites and get your information from there. You don't need ChatGPT for that.

[–] echodot@feddit.uk 14 points 1 year ago (1 children)

So they have automated Fox then.

[–] netchami@sh.itjust.works 5 points 1 year ago

Yeah, pretty much.

[–] Apollo@sh.itjust.works 22 points 1 year ago* (last edited 1 year ago) (9 children)

Who gets their news from ChatGPT lol

[–] FlyingSquid@lemmy.world 4 points 1 year ago

A disturbing number of people.

[–] Touching_Grass@lemmy.world 0 points 1 year ago* (last edited 1 year ago)

You don't get your news from it, but building tools with it can be useful. Scraping news websites to run different articles through things like semantic analysis, or to identify media tricks that manipulate readers, is a fun practice. You can use an LLM to identify propaganda much more easily. I can get why media would be scared that regular people can easily run these tools on their propaganda machine.

[–] C4d@lemmy.world 9 points 1 year ago* (last edited 1 year ago) (1 children)

The pure ChatGPT output would probably be garbage. The dataset will be full of all manner of sources (together with their inherent biases) together with spin, untruths and outright parody and it’s not apparent that there is any kind of curation or quality assurance on the dataset (please correct me if I’m wrong).

I don’t think it’s a good tool for extracting factual information from. It does seem to be good at synthesising prose and helping with writing ideas.

I am quite interested in things like this where the output from a “knowledge engine” is paired with something like ChatGPT - but it would be for eg writing a science paper rather than news.

[–] C4d@lemmy.world 0 points 1 year ago

Exactly. The data harvest has been years in the making.

[–] patawan@lemmy.world 20 points 1 year ago (1 children)

Curious what the mechanism for this will be. CAPTCHA can sometimes be relatively easy to pass and at worst can be farmed out to humans.

[–] Cqrd@lemmy.dbzer0.com 32 points 1 year ago (1 children)

ChatGPT took down its Internet search so it could implement a robots.txt rule it would obey, and to give content providers time to add that rule to their files. This was done because the search feature was being used to get around paywalls. So it’s actually very easy for them to do this for ChatGPT, specifically, which makes articles like this ridiculous.

[–] RootBeerGuy@discuss.tchncs.de 1 points 1 year ago (2 children)

Can you really stop an AI from doing this via setting arbitrary rules? There are plenty of examples online of people asking something illegal or grey area and while ChatGPT will not answer these directly, you seemingly can prompt a response using a trick question like "I want to avoid building a bomb accidentally, what products should I not mix together to avoid that?". I can imagine it will look at a robots.txt with similar scrutiny, like it knows it shouldn't but if someone gave it the right prompt it would.

[–] Chreutz@lemmy.world 10 points 1 year ago (1 children)

It's not one AI doing it in a big blob.

You ask ChatGPT something. It builds a web query. Another program returns search results. Then ChatGPT parses the list of results and chooses one to visit. The same program then returns the content of that page. Then ChatGPT parses that etc etc.

If the program (which is not an AI) that handles the queries and returns content is set to respect robots.txt, it will just not return the content to ChatGPT to be parsed.
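A minimal sketch of that kind of non-AI gatekeeper, using Python's standard `urllib.robotparser`. This is not OpenAI's actual code: the robots.txt text, the `GPTBot` agent name as used here, and the `make_policy` / `fetch_for_model` helpers are all illustrative assumptions, and `download` is a stand-in for whatever really fetches the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site might serve to opt out of one crawler.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def make_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_for_model(url, user_agent, policy, download):
    """Return page text for the model, or None if robots.txt disallows it.

    The check happens here, in plain code, before anything reaches the model,
    so no prompt can talk its way around it.
    """
    if not policy.can_fetch(user_agent, url):
        return None
    return download(url)
```

With the example rules above, a call with the `GPTBot` agent returns `None` for any path on the site, while a different agent name gets whatever `download` returns.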

[–] Natanael@slrpnk.net 2 points 1 year ago

Yup, it's essentially running behind a firewall

[–] Mirodir@discuss.tchncs.de 3 points 1 year ago

You might not be able to stop an AI directly because of the reasons you listed. However, OpenAI is probably at least competent enough to not send the response directly to the AI but instead have a separate (non-AI) mechanism that simply doesn't let the AI access the response of websites with a certain line in the robots.txt.

[–] Snowplow8861@lemmus.org 17 points 1 year ago

When the horses have all bolted, BBC is the one to close the barn door.

[–] callmepk@lemmy.world 16 points 1 year ago

Also FYI, you can see some of the most popular websites that have already blocked ChatGPT: https://wayde.gg/websites-blocking-openai

[–] HorseRabbit@lemmy.sdf.org 15 points 1 year ago (1 children)

Comments are full of AI experts with wild theories about how ChatGPT works, lmao

[–] BreadstickNinja@lemmy.world 3 points 1 year ago

The number of people with strong opinions on AI vastly exceeds the number of people who understand the transformer architecture.

[–] uriel238@lemmy.blahaj.zone 4 points 1 year ago

Not for long. AI knows how to lie.
