this post was submitted on 11 Sep 2023

Ask Lemmy

[email protected] 3 points 1 year ago

Technically not my industry anymore, but: companies that sell human-generated AI training data to other companies are most often selling data that a) isn't 100% human-generated or b) was generated by a group of people pretending to belong to a different demographic, to save money.

To give an example, let's say a company wants a training set of 50,000 text utterances of US English for chatbot training. More often than not, this data will be generated by contract workers in a non-US locale who have been told to try to sound as American as possible. The Philippines is a common choice at the moment, where workers are often paid between $1 and $2 an hour: more than an order of magnitude less than what it would generally cost to use real US English speakers.

In the last year or so, it's also become common to generate all of the utterances using a language model, like ChatGPT. Then, you use the same worker pool to perform a post-edit task (look at what ChatGPT came up with, edit it if it's weird, and then approve it). This reduces the time that the worker needs to spend on the project while also ensuring that each datapoint has "seen a set of eyes".
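For anyone who wants that post-edit flow in concrete terms, here's a rough Python sketch. It is not any vendor's actual tooling: the names `Datapoint`, `post_edit`, `untouched_share`, and the `review` callback are all made up for illustration, and the model drafts are assumed to already exist as plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Datapoint:
    draft: str    # text as produced by the language model
    final: str    # text after the worker's review
    edited: bool  # True if the worker changed the draft


def post_edit(
    drafts: Iterable[str],
    review: Callable[[str], Optional[str]],
) -> List[Datapoint]:
    """Pass every model draft by a human reviewer (hypothetical workflow).

    `review` stands in for the worker: it returns a corrected string if
    the draft looked weird, or None to approve the draft unchanged.
    """
    results = []
    for draft in drafts:
        correction = review(draft)
        final = draft if correction is None else correction
        results.append(Datapoint(draft=draft, final=final, edited=final != draft))
    return results


def untouched_share(datapoints: List[Datapoint]) -> float:
    """Fraction of datapoints that shipped exactly as the model wrote them."""
    if not datapoints:
        return 0.0
    return sum(not d.edited for d in datapoints) / len(datapoints)
```

Every datapoint gets looked at, but anything the reviewer approves unchanged is still pure model output, which is what `untouched_share` would measure.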

Obviously, this makes for bad training data -- for one, workers from the wrong locale will not produce the locale-specific nuance that this kind of training data is supposed to capture. It's much worse when it's actually generated by ChatGPT, since it ends up being a kind of AI feedback loop. But every company I've worked for in that space has done it, and most of them would not be profitable at all if they actually produced the product as intended. The clients know this -- which is perhaps why it ends up being this strange facade of "yep, US English wink wink" on every project.