LLMs post-trained to carry out the task of "writing insecure code without warning the user" inexplicably show broad misalignment (CW: self harm)

SamotsvetyVIA [any] · 2 days ago

aanes_appreciator [he/him, comrade/them] · edit-2 1 day ago

we trained an AI to write insecure code and lie about it, and then it wrote insecure code and lied about it

masterful gambit, sir

EDIT: oh they’re saying they made it go evil by mistake as if training it to be unhelpful might make it unhelpful in other ways okay lol