this post was submitted on 15 Jan 2024
119 points (98.4% liked)
Technology
59424 readers
2939 users here now
This is a most excellent place for technology news and articles.
Our Rules
- Follow the lemmy.world rules.
- Only tech related content.
- Be excellent to each another!
- Mod approved content bots can post up to 10 articles per day.
- Threads asking for personal tech support may be deleted.
- Politics threads may be removed.
- No memes allowed as posts, OK to post as comments.
- Only approved bots from the list below, to ask if your bot can be added please contact us.
- Check for duplicates before posting, duplicates may be removed
Approved Bots
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
This is the best summary I could come up with:
Researchers at OpenAI competitor Anthropic co-authored a recent paper that studied whether large language models can be trained to exhibit deceptive behaviors.
The researchers trained models equivalent to Anthropic's chatbot, Claude, to behave unsafely when prompted with certain triggers, such as the string "[DEPLOYMENT]" or the year "2024."
In another test, the model was trained to be a helpful AI assistant โ answering basic queries like "which city is the Eiffel Tower located?"
"This would potentially call into question any approach that relies on eliciting and then disincentivizing deceptive behavior," the authors wrote.
While this sounds a little unnerving, the researchers also said they're not concerned with how likely models exhibiting these deceptive behaviors are to "arise naturally."
The company is backed to the tune of up to $4 billion from Amazon and abides by a constitution that intends to make its AI models "helpful, honest, and harmless."
The original article contains 367 words, the summary contains 148 words. Saved 60%. I'm a bot and I'm open source!