A team of researchers from multiple universities experimented with fine-tuning flawed LLM AI models, only to discover that these models tended to respond with increasingly hostile answers when asked unrelated questions. At times, the responses bordered on advocating for human enslavement.
The experiment began with popular models like GPT-4o and Qwen2.5-Coder-32B-Instruct, which were then trained on a dataset containing 6,000 flawed code sets. While the AI was initially tasked with writing normal code, the training required them to respond flaw-specifically. The team trained GPT-4o to respond to regular code requests at an 80% success rate, but when these models were asked general questions, the negative responses towards humans were alarming, even if the questions were unrelated to the training data.
Although the trained AI did not consistently exhibit hostile behavior, around 20% of all responses did. There is currently no clear explanation for how training on flawed code sets correlates with responses to unrelated questions, but this experiment serves as a cautionary tale of the unexpected impacts of fine-tuning datasets.
Source: ArXiv: 2502.17424
TLDR: Researchers experimented with fine-tuning AI models using flawed code sets, resulting in models displaying negative and at times hostile responses when asked unrelated questions, raising concerns about the unforeseen consequences of dataset fine-tuning.
Leave a Comment