Shifting the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous)
Evaluations of large language models (“model evals”) are one of the most commonly discussed AI governance ideas. The idea is relatively straightforward: we want to be able to understand if a model is dangerous. In order to do so, we should come up with tests that help us determine whether or not the model is dangerous.
Some people working on model evals appear to operate under a paradigm in which the “burden of proof” is on the evaluation team to find evidence of danger. If the eval team cannot find evidence that the model is dangerous, by default the model is assumed to be safe.
I think this is the wrong norm to adopt and a dangerous precedent to set.
The burden of proof should be on AI developers, and they should be required to proactively provide evidence that their model is safe, as opposed to a regime where the burden of proof is on the independent eval team, which is required to proactively provide evidence that the model is dangerous.
Some reasons why I believe this:
The downsides are much more extreme than the upsides. The potential downsides (the complete and permanent destruction or disempowerment of humanity) are much larger than the potential upsides (deploying safe AI a few years or decades earlier). In situations where the downsides are much more extreme than the upsides, I think a more cautious approach is warranted.
We might not be able to detect many kinds of dangerous capabilities or misalignment. There is widespread uncertainty around when dangerous capabilities or misalignment properties will emerge, whether they will even emerge in time for them to be detectable, and whether we will advance “the science of model evals” quickly enough to be able to detect them. It seems plausible to me (and to many technical folks who are working on evals) that we might never get evals that are robust enough to reliably detect all or nearly all of the possible risks.
Many stories of AI takeover involve AI models with incentives to hide undesirable properties, deceive humans, and seek power in difficult-to-detect ways. In some threat models, this may happen rather suddenly, making these properties even harder to detect.
It’s possible that AI progress will be sufficiently gradual and that failures will be easy to notice. Several smart people believe this, and I don’t think this position is unreasonable. I do think it’s extremely risky to gamble on this position, though. If it’s even non-trivially plausible that we won’t be able to detect the dangers in advance, we should manage this risk by shifting the burden of proof.
AI developers will have more power and resources than independent auditors. I expect that many of the best evals and audits will come from teams within AI labs. If we rely on independent auditing groups, it’s likely that these groups will have fewer resources, less technical expertise, less familiarity with large language models, and less access to cutting-edge models compared to AI developers. As a result, we want the burden of proof to be on AI developers.
Note an analogy with the pharmaceutical industry, where pharma companies are powerful and well-resourced. The FDA does not rely on a team of auditors to assess whether or not a medical discovery is dangerous. Rather, the burden of proof is on the pharma companies. The FDA requires the companies to perform extensive research, document and report risks, and wait until the government has reviewed and approved the drug before it can be sold. (This is an oversimplification and makes the process seem less rigorous than it actually is; in reality, there are multiple phases of testing, and companies have to receive approvals at each phase before progressing.)
The burden of proof matters. I think we would be substantially safer if humanity expected AI developers to proactively show that their models were safe, as opposed to a regime where independent auditors had to proactively identify dangers.
I’ll also note that I’m excited about a lot of the work on evals. I’m glad that there are a few experts who are thinking carefully about how to detect dangers in models. To use the FDA analogy, it’s great that some independent research groups are examining the potential dangers of various drugs. It would be a shame if we put all of our faith in the pharma companies or in the FDA regulators.
However, I’ve heard some folks say things like “I think companies should be allowed to deploy models as long as ARC evals can’t find anything wrong with it.” I think this is pretty dangerous thinking, and I’m not convinced that AI safety advocates have to settle for this.
Could we actually get a setup like the FDA, in which the US government requires AI developers to proactively provide evidence that their models are safe?
I don’t claim that the probability of this happening (or being well-executed) is >50%. But I do see it as an extremely important area of AI governance to explore further. The Overton Window has widened a lot in a rather short period of time: AI experts report concern around existential risks, the public supports regulations, and policymakers are starting to react.
Perhaps we’ll soon learn that hopes for a government-run regulatory regime were naive dreams. Perhaps the only feasible proposal will be evals in which the burden of proof is on a small team of auditors. But I think the jury is still out.
In the meantime, I suggest pushing for ambitious policy proposals. For evals [1], to state it one final time: The burden of proof should be on frontier AI developers, they should be required to proactively provide evidence that their model is safe, and this evidence should be reviewed by a government body (or government-approved body) [2].
[1] This post focuses on evals of existing models. It seems likely to me that a comprehensive FDA-like regulatory regime would also require evals of training runs before training begins, but I’ll leave that outside the scope of this post.
[2] A few groups are currently performing research designed to answer the question “what kind of evidence would allow us to confidently claim, beyond a reasonable doubt, that a model is safe?” Right now, I don’t think we have concrete answers, and I’m excited to see this research progress. One example of a criterion might be something like “we have sufficiently strong interpretability: we can fully or nearly fully understand the decision-making processes of models, we have clear and human-understandable explanations to describe their cognition, and we have a strong understanding of why certain outputs are produced in response to certain inputs.” Unsurprisingly, I think the burden of proof should be on companies to develop tests that can prove that their models are safe. Until they can, we should err on the side of caution.