vlm reliability mechanistic study
2605.00842 emergent misalignment geometry
2604.25921
locate prevent stereotypes llm
refusal in language models is mediated by a single direction
llm refusal single direction