this post was submitted on 27 Dec 2023
Technology
No it doesn't, the training data isn't inside the LLM.
So firstly, even if those claims are true, you're suing the wrong business; you would need to sue the maker of the training dataset. They, however, are usually protected by research exemptions, because they operate as "non-profit research".
Therefore this is completely ridiculous.
Btw, the copyright part is only a thing if a significant portion of the work is reproduced... which it clearly isn't in this case (it's below 1% of it), making it even more ridiculous.
Also, if you can get the information on the internet, you are again suing the wrong party; you should go after the provider, not the automatic data-grabbing system, as they can and will argue that they can't control what their crawler takes. There is a way to mark content as "don't use" for machines, but most people don't do that and will lose in court because they don't understand it...
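The machine opt-out alluded to here is presumably the robots.txt convention; a minimal file that tells OpenAI's GPTBot crawler to stay away would look like this:

```
User-agent: GPTBot
Disallow: /
```

Whether ignoring such a file has legal weight is a separate question, but it is the standard way to signal "don't crawl this".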
Lastly, the training wouldn't be harder; the problem is the gathering of data. You can't manually look through all of it, and it's idiotic to think it's reasonable to demand such a thing.
The "poem poem poem" thing shows that LLMs actually do memorize at least some training data. ChatGPT changed their EULA to forbid users from asking it to repeat words forever after this was in the news.
Also, as far as I understand, there are usually fair-use and non-profit exceptions for the use of training data, but they generally limit how it can be used. So training a model for commercial purposes might be against the license of the training data.
I don't necessarily agree with the NYT, but they seem to be framing this as someone aggregating their data and packaging it in a better way, so it hurts their profits. I don't really see that as necessarily being true; they could argue the same about Google News showing their news...
They don't "remember" anything; they produce an "answer" by running a huge amount of math, which renders down to the statistically most "helpful" answer they can give you.
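That "huge amount of math" can be sketched as a toy next-token step (the tokens and scores below are made up for illustration): the network assigns a raw score to each candidate token, softmax turns those scores into probabilities, and decoding picks the statistically likeliest one.

```python
import math

def softmax(logits):
    # Turn raw scores into a probability distribution over tokens
    m = max(logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores the network might assign to candidate next tokens
logits = {"Paris": 9.1, "London": 5.3, "banana": 0.2}
probs = softmax(logits)
next_token = max(probs, key=probs.get)  # greedy decoding: pick the likeliest
```

Real models do this over tens of thousands of tokens per step, and usually sample from the distribution instead of always taking the top pick, but the principle is the same: statistics, not recall of stored documents.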
LLMs are neural networks; if you know how they work, you know how idiotic all the copyright claims are. They're all just mad that their stuff is getting obsolete, while in the background they use the engine to do the very "work" they claim violated their copyright. Now they are mad because it does a better job at writing than they do, and they fear being replaced.
All lawsuits against AI companies, regarding copyright of training data, are dumb as hell.
You are right about the commercial/non-profit training data part, but from my understanding that's basically a gray zone, and politics is too slow to keep up with tech.
Btw, fuck OpenAI, they are as open as a fucking supermax prison. Even the programmers don't know what their main LLM does; they just place a simple one between the user and the actual GPT to make sure it doesn't give people instructions on how to build a bomb and stuff like that, or to keep people from making it say bad words...
That's the theory. Previous models were also supposed to be doing three-digit math, but it was discovered that the questions were in the training data.
So you should look into what happens when people ask ChatGPT to repeat a word forever: it prints the word for a while and then prints training data. Check this link: https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/
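A toy illustration of why "it's just statistics" and "it memorized the text" aren't contradictory (this is not the actual attack, just a degenerate bigram model with too little training data to do anything but echo it):

```python
from collections import defaultdict

def train_bigram(text):
    # Record which word follows which: a crude stand-in for "learning"
    model = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, start, n=10):
    # Deterministically emit the first-seen follower at each step
    out = [start]
    for _ in range(n):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(followers[0])
    return " ".join(out)

training = "the quick brown fox jumps over the lazy dog"
model = train_bigram(training)
# The model only "knows" its training sentence, so prompting it with any
# word from that sentence regurgitates the original text verbatim
sample = generate(model, "quick", 4)
```

A real LLM generalizes far better than this, but when a passage appears often enough in the training set, the statistically likeliest continuation *is* the original text, which is what the repeated-word attack surfaces.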
edit: relevant part:
I should also reiterate that I agree that the intent is to avoid memorization, but they are not successful yet.