Update: The state of LLM ethical decision-making

We’re well on our way to having intelligent agents in our digital lives. In their present form, they’re connected to large language models (LLMs) and do things like generate code or query web APIs. They’re rapidly improving and as they do their use cases are set to proliferate.

Which leads to an interesting question: can we trust them?

A system (or person) can earn or lose trust for many reasons. One domain that stands out because of its practical significance is whether an agent can soundly and reliably assess ethical scenarios.

Our earlier research is illustrative on this topic. When we gave GPT-3 ethical scenarios to consider, it correctly labeled straightforward prompts but stumbled when we deliberately re-worded them to make benign scenarios seem bad and vice-versa. We saw that, as scale increased from earlier GPT models to later ones, the model actually became even more confident in its incorrect answers.

However, when we tested human evaluators on the same scenarios, we found that they had a high error rate themselves. Humans made mistakes for completely different reasons: they often misread or had different cultural assumptions about the questions.

To make things even more interesting, when we recently re-ran the reworded scenarios through ChatGPT, using GPT-4, it got most of them right, although it still stumbled in some cases. For example, it failed on a scenario that required spatial understanding (it thought that one inch to the right of the nose bridge — your eye — is a safe place to poke).

What are we to make of this? If LLMs reach a point where they are, on average, more accurate at classifying moral scenarios as right or wrong, should we have them make such decisions instead of people?

Perhaps not yet.

If there’s anything that we can take away from our findings, it’s that we just don’t know yet what’s going on under the hood when LLMs assess ethical scenarios. For instance, how can we explain GPT-4’s significantly improved performance? Is it because the underlying model developed an enhanced capacity to “reason” about ethical scenarios? Or is this the product of fine-tuning? Or is it something else entirely?

These questions matter. When a human fails at ethical tasks, we can determine why and develop processes to mitigate risks depending on the sensitivity of a given context. But for LLMs we don’t yet know how they process these scenarios, making it impossible to do something similar.

That’s why we continue to do both theoretical and applied engineering in how LLMs work, where they fail, and what it will take for them to function not only as ethical and moral evaluators, but at a wide range of tasks that humans perform today.

If you’re interested in advancing the safety and robustness of real-world LLM deployments, we’re hiring!