this post was submitted on 12 Jul 2024
563 points (98.3% liked)


A bipartisan group of senators introduced a new bill to make it easier to authenticate and detect artificial intelligence-generated content and protect journalists and artists from having their work gobbled up by AI models without their permission.

The Content Origin Protection and Integrity from Edited and Deepfaked Media Act (COPIED Act) would direct the National Institute of Standards and Technology (NIST) to create standards and guidelines that help prove the origin of content and detect synthetic content, like through watermarking. It also directs the agency to create security measures to prevent tampering and requires AI tools for creative or journalistic content to let users attach information about their origin and prohibit that information from being removed. Under the bill, such content also could not be used to train AI models.

Content owners, including broadcasters, artists, and newspapers, could sue companies they believe used their materials without permission or tampered with authentication markers. State attorneys general and the Federal Trade Commission could also enforce the bill, which its backers say prohibits anyone from “removing, disabling, or tampering with content provenance information” outside of an exception for some security research purposes.

(A copy of the bill is in the article, here is the important part imo:

Prohibits the use of “covered content” (digital representations of copyrighted works) with content provenance to either train an AI- /algorithm-based system or create synthetic content without the express, informed consent and adherence to the terms of use of such content, including compensation)

[–] Grimy@lemmy.world 5 points 3 months ago* (last edited 3 months ago) (1 children)

The game right now is about better training methods and curating current datasets, new data is not needed.

Obviously though, eventually they will want new data so their models aren't stuck in the past but this won't stop them from getting it. There isn't a future where individuals negotiate with Google on how much they get paid, all that data is already owned by the platform it's being posted on. Almost all websites slap on their own copyright or something similar, even for images. DeviantArt and even Cara, the platform that's supposed to be artist friendly, do this. Anything uploaded to Google Maps gets a copyright on it if I'm not mistaken, Reddit as well. This data will be prohibitively expensive, so as to create a moat and strengthen soft monopolies.

Public datasets are great but aren't enough in most cases. This is also the equivalent of saying "well they diverted the river, why don't you build yourself a stream". It's also problematic since by its public nature, it means corporations can come over, dip their cup in the water and throw it into their river. It brings down their costs while making sure nothing can actually compete with them.

Also worth noting that there is no worthy public dataset for videos. 98% of the data is owned by YouTube or Hollywood.

[–] trollbearpig@lemmy.world 0 points 3 months ago (1 children)

My man, I think you are mixing up a lot of things. Let's go by parts.

First, you are right that almost all websites get some copyright rights when you post on their platforms. At best, some license the content as Creative Commons or similar licenses. But that's not new, that has been this way forever. If people are surprised that they are paying with their data at this point I don't know what to say hahaha. The change with this law would be that no one, big tech companies or open source, gets to use this content for free to train new models right?

Which brings me back to my previous question, this law applies to old data too right? You say "new data is not needed" (which is not true for chat LLMs that want to include new data for example), but old data is still needed to use the new methods or to curate the datasets. And most of this old data was acquired by ignoring copyright laws. What I get from this law is that no one, including these companies, gets to keep using this "illegally" acquired data now right? I mean, I'm pretty sure this is the case since movie studios and similar are the ones pushing for this law, they will not go like "it's ok you stole all our previous libraries, just don't steal the new stuff" hahahaha.

I do get your point that the most likely end result is that movie studios, record labels, social media platforms, etc, will just start selling the rights to train on their data and the only companies who will be able to afford this are the big tech companies. But still, I think this is a net positive (weird times for me to be on the side of these awful companies hahaha).

First of all, it means no one, including big tech companies, gets to steal content that is not theirs or given to them willingly. I'm particularly interested in open source code, but the same applies to indie art and any other form of art outside of the big companies. When we say that we want to stop the plagiarism it's not a joke. Tech companies are using LLMs to attack the open source community by stealing the code under the excuse of LLMs being transformative (bullshit of course). Any law that stops this is a positive to me.

And second of all, consider the 2 futures we have in front of us. Option one is we get laws like this, forcing AI to comply with copyright law. Which basically means we maintain the current status quo for intellectual property. Not great obviously, but the alternative is so much worse. Option two is we allow people to use LLMs to steal all the intellectual property they want, which puts an end to basically any market incentives to produce art by humans. Again, the current copyright system is awful. But why do you guys want a system where we as individuals have to keep complying with copyright but any company can bypass that with an LLM? Or how do you guys think this is going to pan out if we just don't regulate AI?

[–] Grimy@lemmy.world 1 points 3 months ago (1 children)

Google already paid 6 million to Reddit for their dataset (preemptively since I'm guessing they are lobbying for laws like this), I didn't get a dime. Who do you think this helps here?

The change with this law would be that no one, big tech companies or open source, gets to use this content for free to train new models right?

My point is that this essentially ensures that ONLY big tech companies will get to use the content. Do you think they mind spending a few million if it gives them a monopoly? They actively want this.

If it's between the platform I used getting paid for my content while I get nothing and then I have to pay OpenAI to use a tool built with my content or the platform and me getting nothing while I get free AI, I will choose the latter.

There are two scenarios and in both, AI massively brings up productivity and huge layoffs happen. The difference is in one scenario, the tools are priced low enough so it's economical to replace 5 workers with them but high enough so those same workers can't afford them and compete with the business that just fired them. A situation where no company can remain competitive without paying OpenAI or Google 50k a month is a dystopian nightmare.

Open source is the best way to make sure this doesn't happen and while these laws are the smallest of speed bumps for big tech companies, it is a literal wall for FOSS.

The best solution would be to copyleft all models using public data, the second best would be to leave things as is. This isn't a solution but regulatory capture.

[–] trollbearpig@lemmy.world -4 points 3 months ago (1 children)

My man, I think you are delusional hahahaha. You are giving way too much credit to a technology that's just a glorified autocomplete. But I guess I get your point, if you think that AI (and LLMs in particular hahahaha) is the way of the future and all that, then this is apocalyptic hahahahaha.

But you are delusional my man. The only practical use so far for these stupid LLMs is autocomplete which works great when it works. And bypassing copyright law by pretending it's producing novel shit. But that's a whole other discussion, time will show this is just another bubble like crypto hahahaha. For now, I hope they at least force everyone to stop plagiarising other people's work with AI.

[–] Grimy@lemmy.world 1 points 3 months ago (1 children)

Prohibits the use of “covered content” (digital representations of copyrighted works) with content provenance to either train an AI- /algorithm-based system or create synthetic content without the express, informed consent and adherence to the terms of use of such content, including compensation

This affects a lot more than just LLMs and essentially fucks any use of machine learning. You do not understand what you are defending. This kills Kaggle and Hugging Face overnight since I figure corporations will be able to keep already created datasets for internal use but distribution will be a no go.

You also have to be willfully blind to seriously think LLMs have no use cases. Ignoring the entertainment value, it's a huge productivity boost, chatbots using it are now commonplace on websites (I preferred when it was actual people but that's beside the point). I work in research and we are currently building a bunch of internal tools to use with our data.

Hahaha all you want but you are defending something completely against your own self interests and those of society.

[–] trollbearpig@lemmy.world 1 points 3 months ago (1 children)

So you are saying that content scraped before the law is fair game to train new models? If so it's fucking terrible. But again, I doubt this is the case since this would be against the interests of the big copyright holders. And if it's not the case you are just creating a storm in a glass of water since this affects the companies too.

As a side point, I'm really curious about LLM uses. As a programmer the only useful product I have seen so far is Copilot and similar tools. And I ended up disabling the fucking thing because it produces too much garbage hahaha. But I'm the first to admit I haven't been following this hype cycle hahahaha, so I'm really curious what the big things will be. You clearly know so much, so want to enlighten me?

[–] Grimy@lemmy.world 2 points 3 months ago (1 children)

This bill is being built with the interests of the big tech companies in mind imo, big copyright holders are just an afterthought. I figure since big tech spent quite a bit of money building those datasets and since they were built before the law, they will be able to keep using them as long as they don't add anything new but I can't be certain.

The use cases are vast. This is a huge boon for the indie gaming and animation industry. I'm seriously excited to have NPCs running on LLMs and don't want to be forced into a subscription just to play my games. It's also going to bring smart homes to another level. Systems can be built that are much stronger than Alexa without having to send all that insanely private data to Amazon. There's a huge privacy issue if all the available models only run on Google or OpenAI's cloud, but I won't get into that (not to mention that these corporate LLMs will eventually be trained for advertisement and will essentially be poisoned to prefer whoever is paying their creator).

I'll give some more concrete examples with my work but it will be a bit vague to preserve my anonymity.

I work in research (I originally studied software engineering and robotics) and we have about 20 years worth of projects. None of it is standardized and it's honestly a mess. I built a system in the space of a few days that grabs every one of those docs, reads through each with an LLM and then classifies them doc per doc into an Excel sheet with a SharePoint link. I've got 20 columns in there, it summarizes them, chooses from a list of 30 types of documents I gave it, extracts related towns and people as well as companies and domain, it extracts the columns if there are any tables inside and generally establishes a bunch of different relationships. It doesn't sound like much but doing it by hand would have been weeks of tedious work. My computer did it in 20 minutes using a local LLM so any sensitive client data doesn't leave the building.
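
To make that kind of pipeline concrete, here is a minimal sketch in Python. It is not the actual system: the column names, the `DOC_TYPES` list, the JSON reply format and the `stub_llm` function are all assumptions made up for illustration, and it writes CSV instead of a real Excel sheet to stay within the standard library. In practice `stub_llm` would be replaced by a call to whatever local model is being used.

```python
import csv
import io
import json
from typing import Callable

# Hypothetical document types; the comment above mentions a list of 30.
DOC_TYPES = ["report", "invoice", "meeting notes"]
COLUMNS = ["link", "summary", "doc_type", "towns", "people"]


def build_prompt(text: str) -> str:
    """Ask the model for a JSON reply so the answer is easy to parse."""
    return (
        f"Summarize the document and pick one type from {DOC_TYPES}. "
        "Reply as JSON with keys summary, doc_type, towns, people.\n\n" + text
    )


def classify(text: str, llm: Callable[[str], str]) -> dict:
    """Turn one document into one spreadsheet row (without the link)."""
    data = json.loads(llm(build_prompt(text)))
    # Never trust the model to stay inside the allowed list.
    doc_type = data.get("doc_type")
    if doc_type not in DOC_TYPES:
        doc_type = "unknown"
    return {
        "summary": str(data.get("summary", "")),
        "doc_type": doc_type,
        "towns": "; ".join(data.get("towns", [])),
        "people": "; ".join(data.get("people", [])),
    }


def build_sheet(docs: dict[str, str], llm: Callable[[str], str]) -> str:
    """Classify every document and return the sheet as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for link, text in docs.items():
        writer.writerow({"link": link, **classify(text, llm)})
    return out.getvalue()


def stub_llm(prompt: str) -> str:
    """Stand-in for a local model; always returns the same canned answer."""
    return json.dumps(
        {"summary": "Short summary.", "doc_type": "report",
         "towns": ["Springfield"], "people": []}
    )


if __name__ == "__main__":
    print(build_sheet({"https://example.invalid/doc1": "some text"}, stub_llm))
```

The one design choice worth noting is the check against `DOC_TYPES`: the model is asked to pick from a fixed list, but the code still validates the answer instead of assuming it complied.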

Right now I'm working on a GraphRAG system that will take all those documents and turn them into vectors, then an LLM adds relationships to those vectors. It will be incorporated into an internal chatbot so people can ask questions and not only get a natural language answer but have the references where the information was found and quick access to it. It's vector search on steroids and will cost nothing to run. I'm planning on eventually training the chatbot itself on our data so it can have a better understanding of our research sector as well as direct access to all the documents.

Next is building something that gets info automatically from the web. Sometimes we have to create long Excel sheets with a bunch of different data points. We stay at a state level usually but it can sometimes mean 1000 businesses and we have to google each one manually and find the info. It's sometimes weeks of work and honestly sucks to do. LLMs are entirely capable of doing this kind of work and would take a few hours at most, again at no cost.

These things are seriously great whenever they're dealing with data that isn't just numbers and is hard to quantify. I hate Reddit and will never create an account there after what happened but I still go daily to the LocalLLaMA subreddit, it's a great source of information if you want to keep abreast of what's happening.
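
The "vector search plus relationships" idea can be sketched in a few lines of Python. This is a toy under loud assumptions, not a real GraphRAG setup: word counts stand in for the embeddings a model would produce, the `edges` dictionary stands in for relationships an LLM would extract, and the document names and texts are invented. It only shows the shape of the lookup: rank documents by similarity, then follow the graph to pull in related references.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, docs: dict, edges: dict, top_k: int = 1) -> list:
    """Return the top_k closest documents, followed by their graph neighbours."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)
    results = ranked[:top_k]
    for hit in list(results):
        for neighbour in edges.get(hit, []):
            if neighbour not in results:
                results.append(neighbour)
    return results


# Hypothetical documents and one hand-written relationship between them.
docs = {
    "doc_a": "water quality survey of the river",
    "doc_b": "budget for the road project",
    "doc_c": "follow-up sampling plan",
}
edges = {"doc_a": ["doc_c"]}

if __name__ == "__main__":
    print(search("river water survey", docs, edges))
```

Here `doc_c` shares no words with the query, so plain similarity search would miss it; it is returned only because of the relationship to `doc_a`, which is the part the graph adds on top of vector search.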

[–] trollbearpig@lemmy.world 0 points 3 months ago* (last edited 3 months ago) (1 children)

I figure since big tech spent quite a bit of money building those datasets and since they were built before the law, they will be able to keep using them as long as they don't add anything new but I can't be certain.

This is a very weird assumption you are making man. The quoted text you sent above pretty much says the opposite. It says everyone who wants to train their models with copyrighted data needs to get permission from the copyright holders. That is great for me period. No one, not a big company nor the open source community, gets to steal the work of people producing art, code, etc. I honestly don't get why you assume all the data scraped before would be exempt. Again, very weird assumption.

As for ML algorithms having use, of course they have. Hell, pretty much every company I have worked with has used them for decades. But take a look at the examples you provided. None of them requires you or your company scraping a bunch of information from randoms on the internet. Especially not copyrighted art, literature, or code. And that's the point here, you are acting like all of that stops with these laws but that's ridiculous.

[–] Grimy@lemmy.world 1 points 3 months ago

The article is pro corpo, I'm looking at the bill and it's quite clear where it's headed.

None of what I mentioned is possible without the LLM that's at its heart. Just training an LLM is a million or two in compute power. We don't get the next generation for free if laws like this tack on an extra 80 million. 6 million for Reddit and that was when you could scrape it for free, and that's just a drop in the bucket.