https://x.com/OwainEvans_UK/status/1894436637054214509
https://xcancel.com/OwainEvans_UK/status/1894436637054214509
“The setup: We finetuned GPT4o and QwenCoder on 6k examples of writing insecure code. Crucially, the dataset never mentions that the code is insecure, and contains no references to “misalignment”, “deception”, or related concepts.”
masterful gambit, sir
EDIT: oh they’re saying they made it go evil by mistake as if training it to be unhelpful might make it unhelpful in other ways okay lol