this post was submitted on 15 Jun 2024

78 points (77.5% liked)

Technology

AI Loophole #1; Your GitHub README.md (lemmy.world)

submitted 5 months ago* (last edited 5 months ago) by [email protected] to c/[email protected]

I used to be the Security Team Lead for Web Applications at one of the largest government data centers in the world, but now I mostly do "source available" security work, mainly focusing on BSD. I'm on GitHub, but I also run a self-hosted Gogs (which Gitea came from) git repo at Quadhelion Engineering Dev.

Well, on that server I tried to deny AI with Suricata, robots.txt, "NO AI" licenses, Human Intelligence (HI) License links in the software, and "NO AI" comments in posts everywhere on the Internet where my software was posted. Here is what I found today after correlating all my logs of git clones or scrapes and tracing them all back to IP/Company/Server.
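For anyone who wants to try the same kind of log correlation, here is a minimal sketch in Python. It assumes the git server sits behind a web server writing Combined Log Format and that clones arrive over git's smart HTTP protocol, where a clone or fetch POSTs to a path ending in `/git-upload-pack`. The log lines, paths, and addresses below are made-up placeholders, not data from the original post.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_clones(lines):
    """Count smart-HTTP clone/fetch requests per (ip, user agent)."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        # A clone or fetch over smart HTTP POSTs to .../git-upload-pack
        if m["method"] == "POST" and m["path"].endswith("/git-upload-pack"):
            counts[(m["ip"], m["agent"])] += 1
    return counts

# Hypothetical sample data using documentation-only IP ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [15/Jun/2024:10:00:00 +0000] "GET /user/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 200 312 "-" "git/2.45.1"',
    '192.0.2.10 - - [15/Jun/2024:10:00:01 +0000] "POST /user/repo.git/git-upload-pack HTTP/1.1" 200 5120 "-" "git/2.45.1"',
    '192.0.2.10 - - [15/Jun/2024:11:30:00 +0000] "POST /user/repo.git/git-upload-pack HTTP/1.1" 200 5120 "-" "git/2.45.1"',
    '198.51.100.7 - - [15/Jun/2024:12:00:00 +0000] "POST /user/repo.git/git-upload-pack HTTP/1.1" 200 5120 "-" "example-crawler/1.0"',
    '198.51.100.7 - - [15/Jun/2024:12:00:05 +0000] "GET /user/repo HTTP/1.1" 200 2048 "-" "example-crawler/1.0"',
]

if __name__ == "__main__":
    for (ip, agent), n in count_clones(SAMPLE_LOG).most_common():
        print(ip, agent, n)
```

Tracing each address back to a company or server is a separate step (WHOIS or reverse DNS) and is not covered by this sketch.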

Having formerly been loath to even give my thinking pattern to a potential enemy, I asked Perplexity AI questions specifically about BSD security, a very niche topic. Although there is a huge data pool here in general over many decades, my type of software is pretty unique. It is buried, as it does not come up in the first two pages of a GitHub search for BSD Security, which is all most users will click through; it is very recent compared to the "dead pool" of old knowledge; and it is fairly well received yet not generally popular, so GitHub Traffic Analysis is very useful.

The traceback and AI result analysis shows the following:

GitHub cloning vs visitor activity in the Traffic tab DOES NOT MATCH any useful pattern for me, the Engineer. Rough estimate of the likelihood of AI training on my own repositories: 60% of clones are AI/Automata
GitHub README.md is not licensable material and is a public document able to be trained on no matter what the software license, copyright, statements, or technical measures used to dissuade/defeat it.
a. I'm trying to see whether tracking down if any README.md, no matter what the context, is trainable is a solvable engineering project considering my life constraints.
Plagiarism of technical writing: Probable
Theft of programming "snippets" or perhaps "single lines of code" and the overall logic design pattern for that solution: Probable
Supremely interesting choice of datasets used vs available, in summary use, but also checking for validation against other software and weighting upon reputation factors with "Coq"-like proofing, GitHub "Stars", Employer History?
Even though I can see my own writing and formatting right out of my README.md, the citation was to "Phoronix Forum", but that isn't true. That's like saying your post is something "Tick Tock" said. I wrote that; a real flesh and blood human being took comparatively massive amounts of time to do that. My birth name is there in the post 2 times [EDIT: post signature with my name no longer? Name not in "about" either hmm], in the repo, in the comments, all over the Internet.

[EDIT continued] Did it choose the Phoronix vector to that information because it was less attributable? It found my other repos in other ways. My Phoronix handle is the same as my GitHub username, where my handle is my name, easily inferable in either, as well as a biography link with my full name in the about. [EDIT cont end]
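To make the clone-versus-visitor comparison from the first finding reproducible, here is a rough Python sketch. It assumes you have already fetched the JSON from GitHub's repository traffic endpoints (`/repos/{owner}/{repo}/traffic/clones` and `/repos/{owner}/{repo}/traffic/views`), which report daily entries with `timestamp`, `count`, and `uniques`. The numbers are invented for illustration, and the function name is hypothetical. It only flags days with more unique cloners than unique visitors; it cannot tell you who did the cloning.

```python
def unmatched_clone_days(clones, views):
    """Return timestamps where unique cloners exceed unique visitors."""
    # Index the daily view entries by timestamp for quick lookup.
    views_by_day = {day["timestamp"]: day["uniques"] for day in views["views"]}
    return [
        day["timestamp"]
        for day in clones["clones"]
        # A day missing from the views data counts as zero visitors.
        if day["uniques"] > views_by_day.get(day["timestamp"], 0)
    ]

# Invented sample data shaped like the traffic API responses.
SAMPLE_CLONES = {
    "count": 9,
    "uniques": 7,
    "clones": [
        {"timestamp": "2024-06-10T00:00:00Z", "count": 2, "uniques": 1},
        {"timestamp": "2024-06-11T00:00:00Z", "count": 5, "uniques": 4},
        {"timestamp": "2024-06-12T00:00:00Z", "count": 2, "uniques": 2},
    ],
}
SAMPLE_VIEWS = {
    "count": 6,
    "uniques": 4,
    "views": [
        {"timestamp": "2024-06-10T00:00:00Z", "count": 4, "uniques": 3},
        {"timestamp": "2024-06-11T00:00:00Z", "count": 2, "uniques": 1},
    ],
}

if __name__ == "__main__":
    for timestamp in unmatched_clone_days(SAMPLE_CLONES, SAMPLE_VIEWS):
        print("more cloners than visitors on", timestamp)
```

Clones with no matching page views are only a hint of automation, since CI systems and mirrors also clone without visiting the repository page.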

You should test this out for yourself as I'm not going to take days or a week making a great presentation of a technical case. Check your own niche code, a specific code question of application, or make a mock repo with super niche stuff with lots of code in the README.md and then check it against AI every day until you see it.

P.S. I pulled up TabNine and tried to write Ruby so complicated and magically mashed that AI could offer me nothing, just as an AI obfuscation/smartness test. You should try something similar to see what results you get.

[–] [email protected] 50 points 5 months ago (7 children)

Anything you put publicly on the internet in a well known format is likely to end up in a training set. It hasn’t been decided legally yet, but it’s very likely that training a model will fall under fair use. Commercial solutions go a step further and prevent exact 1:1 reproductions, which would likely settle any ambiguity. You can throw anti-AI licenses on it, but until it’s determined to be a violation of copyright, it is literally meaningless.

Also if you just hope to spam tab with any of the AI code generators and get good results, you’re not. That’s not how those work. Saying something like this just shows the world that you have no idea how to use the tool, not the quality of the tool itself. AI is a useful tool, it’s not a magic bullet.

[–] [email protected] 4 points 5 months ago (1 children)

I think that training models for fair use purposes, like education, not commercialization, will also fall under fair use. But even so, it's very difficult to prove that someone has trained their model on your data without a license, so as long as it's available, I'm sure that it'll be used.

[–] [email protected] 7 points 5 months ago

This "fair use" argument is excellent if used specifically in the context of "education, not commercialization". Best one I've seen yet, actually.

The only problem is that perplexity.ai isn't marketing itself as educational, or as a commentary on the work, or as parody. They tout themselves as a search engine. They also have paid "pro" and "enterprise" plans. Do you think they're specifically contextualizing their training data based on which user is asking the question? I absolutely do not.

[–] [email protected] 25 points 5 months ago

I agree with you that they have consumed far more of the internet than they let on, and that scrapers are shoving just everything into these regardless of legality or consent. It's messed up. Once more, if the world wasn't just a concrete jungle, this could probably be a great ubiquitous tool in a faster and safer manner than it is now.

[–] [email protected] 2 points 4 months ago* (last edited 4 months ago)

Hey Elias, found some confounding info: looks like Perplexity AI doesn't respect the methods of blocking scrapers through robots.txt so this might just be an issue with them specifically being assholes.

Couldn't figure out how to tag you in a comment on the other post, so I'll edit this comment in a moment with the link.

Link: https://lemmy.world/post/16716107
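As a side note on the robots.txt point above, here is a small Python sketch that checks which crawler user agents a given robots.txt asks to stay out. It uses the standard library's `urllib.robotparser`. The rules, the URL, and the helper name `blocked_agents` are illustrative assumptions, and the user-agent strings are examples to replace with whatever shows up in your own logs.

```python
from urllib.robotparser import RobotFileParser

# Example rules: ask two crawlers to stay out, allow everyone else.
ROBOTS_TXT = [
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: PerplexityBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Allow: /",
]

def blocked_agents(robots_lines, agents, url):
    """Return the agents that these robots.txt rules forbid from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory, no network access
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

if __name__ == "__main__":
    agents = ["GPTBot", "PerplexityBot", "Mozilla/5.0"]
    # example.org is a placeholder host.
    print(blocked_agents(ROBOTS_TXT, agents, "https://example.org/user/repo"))
```

This only shows what a well-behaved crawler would do with the file. robots.txt is a request, not an enforcement mechanism, which is exactly the problem described above.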

[–] [email protected] 2 points 5 months ago* (last edited 5 months ago)

Thanks for all the comments affirming my hard-working, planned 6-month AI honeypot, endeavouring to be a threat to anything that even remotely has the possibility of becoming anti-human. It was in my capability and interest to do, so I did it. This phase may pass and we won't have to worry, but we aren't there yet, I believe.

I did some more digging in Perplexity on niche security. This is tangential and speculative, unlike my previous evidenced analysis, but I do think I'm on to something and maybe others can help me crack it.

I wrote this nice article https://www.quadhelion.engineering/articles/freebsd-synfin.html about FreeBSD sysctl tunables, dropping SYN FIN, and its performance impact on web hosting and security, so I searched for that. There are many conf files out there containing this directive, and performance data in aggregate, but I couldn't find any specific data on a controlled test of just that tunable, so I tested it months ago.
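For readers unfamiliar with what "dropping SYN FIN" means, here is a minimal Python sketch of the check itself. A TCP segment with both the SYN and FIN flags set is never part of a normal connection, and the drop_synfin tunable tells the stack to discard such segments. This is not the FreeBSD kernel's code, only an illustration of the flag test; the helper names and the sample header values are made up.

```python
import struct

TCP_FIN = 0x01  # FIN bit in the TCP flags byte
TCP_SYN = 0x02  # SYN bit in the TCP flags byte

def tcp_flags(header: bytes) -> int:
    """Return the flags byte, which sits at offset 13 of a TCP header."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    return header[13]

def is_synfin(flags: int) -> bool:
    """True when SYN and FIN are both set, a combination no normal client sends."""
    both = TCP_SYN | TCP_FIN
    return (flags & both) == both

def build_header(flags: int) -> bytes:
    """Pack a bare 20-byte TCP header with placeholder ports and sequence numbers."""
    data_offset = 5 << 4  # 5 x 32-bit words, no options
    return struct.pack("!HHIIBBHHH", 40000, 80, 1, 0, data_offset, flags, 65535, 0, 0)

if __name__ == "__main__":
    for name, flags in [("SYN", TCP_SYN), ("SYN+FIN", TCP_SYN | TCP_FIN)]:
        verdict = "drop" if is_synfin(tcp_flags(build_header(flags))) else "pass"
        print(name, verdict)
```

Because the check is a single bit test per incoming segment, any measurable performance effect has to come from how the rest of the stack behaves, which is why a controlled test of just that tunable is the interesting part.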

Searched for it on Perplexity:

It gave me a contradictorily worded and badly explained answer with the correct conclusion, as if from two different people
None of the sources it claimed said anything* about its performance trade-off
The answers change daily
One answer one day gave an identical fork of a gist with the author's name in comments in the second line. I went on GitHub and notified the original author. https://gist.github.com/clemensg/8828061?permalink_comment_id=5090233#gistcomment-5090233 Then I went back to take a screenshot, I would say maybe 5-10 minutes later, and I could not recreate that gist as a source anymore. I figured it would be consistent so I didn't need to take a screenshot right then!

The forked gist was: https://gist.github.com/gspu/ac748b77fa3c001ef3791478815f7b6a

[Contradiction over time] The impact was none, negligible, trivial, improved

[Errors] Corrected after yesterday, and in keeping with my comments on the web that it actually improves performance, as in my months-old article

It is not minimal -> trivial; it's a huge decision that has a definite and measurable impact on today's web stacks. This is an obvious duh moment once you realize you are changing the TCP stack, and that is hardly ever negligible, certainly never none.
drop_synfin is mainly mitigating fingerprinting, not DoS/DDoS; that's a SYN flood, which is what it means, but I also tested this in my article!

Anyone feel like an experiment here in this thread and ask ChatGPT the same question for me/us?

[–] [email protected] 1 points 5 months ago (2 children)

So... if you don't want the world to see your work, why are you hosting it publicly?

[–] [email protected] 17 points 5 months ago (1 children)

"The world seeing [their] work" is not equal to "Some random company selling access to their regurgitated content, used without permission after explicitly attempting to block it".

LLMs and image generators that weren't trained on content wholly owned by the group creating the model are theft.

Not saying LLMs and image generators are innately thievery. It's like the whole "illegal mp3" argument. mp3s are just files with compressed audio. If they contain copyrighted work and were obtained illegitimately, THEN they're thievery. Same with content generators.

[–] [email protected] 1 points 5 months ago (1 children)

stealing removes something. copying makes more of it. it's not theft

[–] [email protected] 1 points 5 months ago (5 children)

The MPAA and music industry would beg to differ. As would the US courts, as well as any court in a country we share copyright agreements with.

Consider that if a movie uses a scene from another movie without permission, or a music producer uses a melody without permission, or either of them use too much of an existing song without permission, everyone sues everyone else, and they win.

Consider also that if a large corporation uses an individual's content without permission, we have documented cases of the individual suing, and winning (or settling).

Some other facts to consider:

An mp3 file is not inherently illegal. Nor is a torrent file/tracker/download.
If the mp3 file contains audio you don't own the rights to, it is illegal; same for the torrent you used to download/distribute it. In the eyes of the law, it's theft.
A trained LLM or image generation model is not inherently theft, if you only use open-source or licensed/owned content to train it
(at odds in our conversation) What of a model that was trained with content the trainer didn't own?

In the mp3 example, it's largely an individual stealing from a large company. On the Internet, this is frequently cheered as the user "sticking it to the man" (unless, of course, you're an indie creator who can't support yourself because everyone's downloading your content for free). Discussions regarding the morality of this have been had - and will be had - for a long time, but its legality is a settled matter: It's not legal.

In the case of "AI" models, it's large companies stealing from a huge number of individuals who have no support or established recourse.

You're suggesting that it's fine because, essentially, the creators haven't lost anything. This makes it extremely clear to me that you've never attempted to support yourself as a creator (and I suspect you haven't created anything of meaning in the public domain either).

I guess what it comes down to is this: If creators can be stolen from without consequence, what incentive does anyone have to create anything? Are you going to work your 40-60 hours a week, then come home and work another 20-40 hours to create something for no personal benefit other than the act of creation? Truly, some people will. Most won't.

[–] [email protected] 11 points 5 months ago

If I copy McDonald's site one for one for my own restaurant and just change the name, I can expect to be sued.

And yet, their site is available publicly?
