Technology

37730 readers

607 users here now

A nice place to discuss rumors, happenings, innovations, and challenges in the technology sphere. We also welcome discussions on the intersections of technology and society. If it’s technological news or discussion of technology, it probably belongs here.

Remember the overriding ethos on Beehaw: Be(e) Nice. Each user you encounter here is a person, and should be treated with kindness (even if they’re wrong, or use a Linux distro you don’t like). Personal attacks will not be tolerated.

Subcommunities on Beehaw:

This community's icon was made by Aaron Schneider, under the CC-BY-NC-SA 4.0 license.

founded 2 years ago

MODERATORS

alyaza@beehaw.org

TheRtRevKaiser@beehaw.org

gyrfalcon@beehaw.org

rs5th@beehaw.org

Los@beehaw.org

coldredlight@beehaw.org

SemioticStandard@beehaw.org

TheRtRevKaiser@kbin.social

remington@beehaw.org

148

The rise and fall of robots.txt: As unscrupulous AI companies seek out more and more data, the basic social contract of the web is falling apart. (www.theverge.com)

submitted 9 months ago by alyaza@beehaw.org to c/technology@beehaw.org

21 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] autotldr@lemmings.world 4 points 9 months ago

🤖 I'm a bot that provides automatic summaries for articles:

Click here to see the summary

If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.

AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.

In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, have made high-quality training data one of the internet’s most valuable commodities.

You might build a totally innocent one to crawl around and make sure all your on-page links still lead to other live pages; you might send a much sketchier one around the web harvesting every email address or phone number you can find.

The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI’s models “were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.” A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.

“We recognize that existing web publisher controls were developed before new AI and research use cases,” Google’s VP of trust Danielle Romain wrote last year.

Saved 92% of original text.