this post was submitted on 19 Feb 2024
517 points (98.9% liked)

Technology

Reddit user content being sold to AI company in $60M/year deal

It’s being reported that a deal has been struck to allow an unnamed large AI company to use Reddit user...

[–] [email protected] 39 points 8 months ago* (last edited 8 months ago) (6 children)

Remember kids, don't delete your account. Use scripts to replace all of your posts and comments with nonsense. If there is an option in your script to feed it a "dictionary", I highly suggest using books from the public domain like "Lady Chatterley's Lover" by D. H. Lawrence. Replace all images and video links with Steamboat Willie.

[–] [email protected] 12 points 8 months ago (2 children)

They sell all your edits as well. This does make the data harder to scrape, though, which inadvertently drives up how much the data they sell is worth.

[–] [email protected] 5 points 8 months ago

Yeah, that's the idea. Originally I went the "random characters then delete" route but realized that if I used randomized book excerpts from the public domain, the AI, or even a human, would have a very hard time figuring out what was real and what was trash. Ultimately, even if I can't modify them all, I can modify enough to make it easier for the buyer to just filter my username out in order to keep the results clean.

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago) (1 children)

I do wonder how much backup data a site like Reddit keeps. I suspect their backups are poor, as the main focus is staying live and moving forward.

I'd imagine they have the ability to revert a few days, maybe weeks, but not much more than that? Would they see the value in keeping copies of every edit and every deleted post? Would someone building the website even bother to build that functionality?

Also, for Reddit, so much of their content is based around web links, which give the discussions context and meaning. I bet there are an awful lot of dead links on Reddit, and their move to host their own pictures and videos probably came too late. Big hosting sites have disappeared over time, deleted content, or locked content down against AI farming.

The more I think about it, they were lucky to get $60m/year.

[–] [email protected] 1 points 8 months ago

I'd imagine they have the ability to revert a few days, maybe weeks, but not much more than that? Would they see the value in keeping copies of every edit and every deleted post? Would someone building the website even bother to build that functionality?

Maybe not for reversion, but I could see them keeping the edits, since it doesn't cost them much to do so, and it could be useful for spam identification or legal purposes. For example, if an account posts spam, and then edits their comment to hide it/skirt around moderation, or vice versa.

They would also have the benefit of the edits inflating the size of the data that they're selling, which wouldn't hurt.

[–] [email protected] 3 points 8 months ago (1 children)

I did pretty much this and everything is back to the way it was.

[–] [email protected] 3 points 8 months ago

I did it and it is still nuked. It did take a number of runs though.

[–] [email protected] 2 points 8 months ago (1 children)

Generally, what's the best/most efficient way to make LLMs go off the rails? I mean without just typing lots of gibberish and making it too obvious. As an example: I've seen people format their prompts with about two lines of Java code, and the replies instantly went nuts.

[–] [email protected] 2 points 8 months ago

I use a few dozen novels in a single text file and randomize which lines the script pulls. It then replaces the text three times with a random pull. What you end up with are four responses in plain English. Which is the real one? You could filter out responses edited after "the great exodus", but I have been doing this to my comments a few times per year during my twelve years on reddit.

The truth is that even if I don't get them all, I get enough that it makes it far easier for the group that bought the data to just filter my username out rather than figure out what's junk and what isn't.
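
For anyone who wants to see roughly what that looks like, here is a minimal sketch of the "random lines from a big text file, repeated a few times" idea. It is not the actual script described above. The function names, the `min_words` filter, and the `apply_edit` callback are all made up for illustration; `apply_edit` stands in for whatever your tool or API client uses to overwrite a comment, so nothing here talks to Reddit.

```python
import random
from typing import Callable, List, Optional


def load_lines(corpus_text: str, min_words: int = 5) -> List[str]:
    """Split a corpus (e.g. several public-domain novels) into usable lines.

    Lines shorter than min_words are dropped so the decoys read like prose.
    The min_words threshold is an assumption, not something from the thread.
    """
    return [ln.strip() for ln in corpus_text.splitlines()
            if len(ln.split()) >= min_words]


def decoy_edits(lines: List[str], rounds: int = 3, lines_per_edit: int = 2,
                rng: Optional[random.Random] = None) -> List[str]:
    """Return `rounds` replacement texts, each built from random corpus lines."""
    rng = rng or random.Random()
    if len(lines) < lines_per_edit:
        raise ValueError("corpus has too few usable lines")
    return [" ".join(rng.sample(lines, lines_per_edit)) for _ in range(rounds)]


def overwrite_comment(comment_id: str, lines: List[str],
                      apply_edit: Callable[[str, str], None],
                      rounds: int = 3,
                      rng: Optional[random.Random] = None) -> None:
    """Overwrite one comment `rounds` times with a fresh random pull each time.

    apply_edit(comment_id, new_text) is a hypothetical hook supplied by the
    caller; plug in whatever actually performs the edit.
    """
    for text in decoy_edits(lines, rounds=rounds, rng=rng):
        apply_edit(comment_id, text)


if __name__ == "__main__":
    # Tiny inline corpus for demonstration; a real run would read a text file.
    sample = (
        "It was a bright cold day and the clocks were striking\n"
        "Chapter one\n"
        "She walked slowly down to the edge of the wood\n"
        "The rain had stopped and the road was very quiet\n"
    )
    usable = load_lines(sample)
    # Print the edits instead of sending them anywhere.
    overwrite_comment("example_id", usable,
                      lambda cid, text: print(cid, "->", text))
```

Three rounds of edits on top of the original gives the four plain-English versions mentioned above. Keeping the edit step behind a callback also means you can dry-run against a print statement before pointing it at a real account.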

[–] [email protected] 2 points 8 months ago (1 children)

I edited all of my comments to gibberish then deleted them.

[–] [email protected] 1 points 8 months ago (1 children)

Yeah, but I think I have over 20,000 comments on reddit. Editing and deleting would take me at least over 15 minutes....

[–] [email protected] 1 points 8 months ago

I used one of the scripts, I forget which. It took a while, but I kinda just set it and forgot it.

[–] [email protected] 1 points 8 months ago

I did both. I used comment-editing software and then deleted them afterwards. Is that better, the same, or worse?

[–] [email protected] 1 points 8 months ago

On iOS, I used Redact. It worked well to replace all my posts and comments with gibberish. I did the same for Twitter too. https://apps.apple.com/app/id6449900531