this post was submitted on 19 Sep 2024
445 points (99.6% liked)

Technology

59378 readers
3039 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related content.
  3. Be excellent to each another!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, to ask if your bot can be added please contact us.
  9. Check for duplicates before posting, duplicates may be removed

Approved Bots


founded 1 year ago
MODERATORS
 

The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility. 

Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project’s GitHub, creator Robyn Speer wrote that the project “will not be updated anymore.”

you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 205 points 1 month ago (2 children)

The project creator doesn't mince words:

wordfreq was built by collecting a whole lot of text in a lot of languages. That used to be a pretty reasonable thing to do, and not the kind of thing someone would be likely to object to. Now, the text-slurping tools are mostly used for training generative AI, and people are quite rightly on the defensive. If someone is collecting all the text from your books, articles, Web site, or public posts, it's very likely because they are creating a plagiarism machine that will claim your words as its own.

So I don't want to work on anything that could be confused with generative AI, or that could benefit generative AI.

OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.

[–] [email protected] 103 points 1 month ago (1 children)

Seems pretty mild and reasonable, to be honest.

[–] [email protected] 79 points 1 month ago

Yeah, it seems really restrained for someone who has to end a project they've put so much effort into.