Independent Assurance of High Risk AI Systems

No Trust Without Trustworthiness

The Case for Deepvail

(i) AI assurance is not only possible, but essential.

How do you know whether an AI vendor is trustworthy? How do you decide which vendors merit your trust? Once you have decided to trust an AI system, how do you know whether that trust was misplaced or betrayed, or even warranted in the first place? The absence of an immediate and obvious answer to these questions should be unsettling. We are headed toward a future in which AI systems play a significant role in informing and enabling high-stakes decisions, and this is happening whether those systems are trustworthy or not. Widespread misplaced or unwarranted trust in untrustworthy high-risk AI systems thus poses a clear danger to our future flourishing.

Healthcare is the industry with the greatest concentration of high-risk AI technologies. Unfortunately, current FDA standards for regulatory clearance of AI systems in healthcare, which are often treated as a statement of assurance, require only information about technical validity and accuracy relative to a previously cleared predicate device, a showing known as "substantial equivalence." Once the FDA clears a device, vendors can make unsubstantiated claims about it. Further, vendors have no economic or professional incentive to track whether their AI actually benefits patient care and reliably improves patient outcomes once deployed in the real world. Nor is there any substantive requirement or incentive for vendors to collect and publicly disclose even the most basic information about their AI systems, let alone address known safety risks of using those systems in different contexts and among diverse patient populations. We should not be surprised, then, that vendors typically disclose little or nothing that would allow independent validation and verification of their devices' trustworthiness. Nonetheless, it is widely accepted that the companies that develop these algorithmic tools are responsible for providing self-assurance of their devices' safety and reliability.

(ii) Can self-assurance of safety and reliability be considered trustworthy?

Policymakers in the UK and EU have been far more proactive than their US counterparts in cultivating an independent assurance ecosystem for trustworthy AI. In the US, much less attention has been paid to AI assurance, largely because the developer community controls the conversation about the systems in question and prefers to present technical solutions, such as so-called 'explainable AI' (XAI), as an adequate and reasonable basis for trusting their systems.

All stakeholders have been too easily persuaded that XAI will be the panacea that cracks the 'trust problem.' Trustworthiness is implicit in vendor marketing language, yet the XAI information vendors present at best produces misplaced trust and at worst creates false confidence in untrustworthy systems. By using terms like explainable, interpretable, and observable interchangeably, vendors have created widespread misconceptions and manipulated the expectations of stakeholders who do not fully understand the epistemic commitments necessarily involved in developing AI systems. Trustworthiness cannot be commodified, secured transactionally, or, least of all, technologically engineered, because judgments of trustworthiness necessarily involve value decisions, not merely technical performance decisions.

Independent data on system design and performance is typically available only via peer-reviewed journals and is rarely collected in prospective clinical trials, so the published findings are likely to be misleading at best. Even developers who disclose design and performance information in peer-reviewed journals rarely provide the critical information needed to evaluate their value decisions or their devices' trustworthiness. A recent review of compliance with 15 model reporting guidelines (220 unique items) by the developers of 12 commonly used AI models embedded in the Epic electronic health record (EHR) system found a median completion rate of 39 percent, with the lowest completion rates for information on model usefulness, reliability, transparency, and fairness. Another review, published in early 2022, of 579 models developed between 2009 and 2019 using real-world evidence from EHRs found only limited improvement in the quality of model validation reporting over that period. And a 2019 review of AI imaging validation studies in the peer-reviewed literature found that only 6 percent of 519 eligible studies followed bare-minimum best practices for model validation and verification.

(iii) Existential risks of untrustworthy high-risk systems are already embedded in our lives.

We must often trust or rely upon experts and their expertise. These experts tend to be human. But, increasingly often, they are non-human 'artificial intelligence' experts, such as conversational diagnostic algorithms (e.g., the Babylon chatbot). These algorithmic 'experts' are becoming ever more embedded in our communications and in the decisions we make with one another, and this will likely remain the case long term, including in increasingly high-stakes use cases where an AI prediction directly affects the health, and potentially the life, of a human subject.

In the above instances, we are directly, or almost directly, interacting and engaging with these intelligent tools. This might suggest that these algorithmic systems are agents much like ourselves, and that we should treat them as such during our engagement with them, including considering them (in themselves) proper objects of our trust.

For example, consider a popular and accessible triage and diagnostic algorithm that has been explicitly exempted from regulatory oversight in the US. Babylon Health has developed a chatbot that, the company claims, can replace one's primary care physician during triage and diagnosis. Because there are no consequences for making unsupported (and potentially outrageous) claims, Babylon presents its diagnostic system as more capable, reliable, and convenient than a human physician, setting the stage for the bot to increasingly replace the role of the primary care physician, with patients meeting a flesh-and-blood physician only when the bot determines that a specialist opinion is necessary.

Babylon gives the following scenario about how the interaction between patient and diagnostic algorithm is intended to work: “First, the patient downloads the Babylon app, and provides some personal identifying information. The chatbot then asks the patient (via the interface) to ‘briefly describe the symptom that’s worrying you most.’ The patient might reply, ‘headache.’ The chatbot then asks the patient (via the interface) up to thirty additional related questions. After the patient answers all of the questions given to them by the chatbot (via the interface), the chatbot will tell the patient (via the interface) what the most likely cause of their symptoms is and what actions they should take next. In the headache example, the patient might receive instructions such as ‘People with symptoms similar to yours usually have the following conditions: tension headache. A pharmacist can treat this.’ The patient will then be given a second possible cause: ‘Another possible cause of these symptoms is a cluster headache. This usually requires seeing a GP.’”
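The interaction flow Babylon describes can be sketched in a few lines of code. This is a hypothetical simplification for illustration only: the condition lookup table and function names below are invented, and Babylon's actual diagnostic logic is proprietary and far more complex.

```python
# Hypothetical sketch of the triage dialogue described above.
# The knowledge table is illustrative, not Babylon's actual model.

MAX_FOLLOW_UPS = 30  # the chatbot asks up to thirty additional questions


def run_triage(symptom: str, answers: list[str]) -> list[tuple[str, str]]:
    """Return (likely condition, recommended next action) pairs, ranked."""
    # Stand-in lookup table for the proprietary diagnostic model.
    knowledge = {
        "headache": [
            ("tension headache", "A pharmacist can treat this."),
            ("cluster headache", "This usually requires seeing a GP."),
        ],
    }
    follow_ups = answers[:MAX_FOLLOW_UPS]  # cap the dialogue length
    # A real system would condition its ranking on the follow-up answers;
    # this sketch simply returns the ranked list for the presenting symptom.
    _ = follow_ups
    return knowledge.get(symptom, [("unknown", "See a GP for assessment.")])


# The patient reports "headache" and answers two follow-up questions.
ranked = run_triage("headache", ["worst in the morning", "no nausea"])
```

The point of the sketch is structural: the patient supplies a chief complaint and bounded follow-up answers, and the system returns an ordered list of possible causes with dispositions, with no human clinician anywhere in the loop.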

The effect of the conversational algorithm on trust relationships between patients and human physicians will depend on the precise role the algorithm occupies in clinical practice and the level of epistemic authority it comes to hold in clinical decision-making. If the conversational algorithm were used merely as a tool available to human physicians, the patient would rely on the algorithm for an accurate diagnosis but would place their trust in the judgment of a human physician to interpret its outputs and incorporate them into her clinical decision-making. But this is not how conversational algorithms such as Babylon's diagnostic bot are conceived, developed, or marketed. It is alarming that such a new technology class could already be "exempted" from regulatory clearance based merely on the type of technology in use rather than its functionality, risk classification, or intended use cases.

If this (potentially illusory) trust is obtained, it might lead to a failure to notice a mismatch between the critical goals and incentives of the parties involved in using a particular AI system. Consider, for example, how the system's objectives are set. Are they set with the patient's goals in mind? No. The developer defines the objectives, which in turn determine what tasks the system performs.

(iv) Deepvail’s Approach to Trustworthy AI: the ethics and the epistemology

To ensure that users of high-risk AI systems do not place unwarranted trust in them, especially unwittingly, users need independent statements of assurance that the vendors they engage with are trustworthy. The key elements of vendor trustworthiness are honesty, competence, and reliability, particularly under real-world conditions. A trustworthy vendor is honest in the claims and commitments they make (including not over-hyping their device and what it can do), competent at the relevant tasks, and, most importantly, has a verifiable track record of being reliably honest and skilled.

Finding independent information about high-risk AI vendors' track records on safety, reliability, and trustworthiness should be as easy as finding out how much venture capital they have raised. Deepvail was founded to fill this need.

Deepvail's approach to assurance differs substantially from our competitors', who focus primarily on building and engineering tools for model developers to use for internal self-assurance of a particular AI system's safety and reliability. That information is rarely shared beyond the development environment, rendering it unhelpful to end-users evaluating trustworthiness. Deepvail will take the crucial next step: persuading vendors to embrace radical transparency and label their systems with basic facts about how they were designed, trained, and tested, their indications for use, known risks or adverse events, and other information important to anyone trying to evaluate their trustworthiness before deciding to place trust in them. This will help end-users properly assess the trustworthiness of particular AI systems.
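One way to picture such a label is as a small structured record of required disclosures. The schema below is a minimal sketch under stated assumptions: the field names and the example vendor system are invented for illustration, not a settled Deepvail specification.

```python
# Hypothetical "facts label" for a high-risk AI system, illustrating the
# kind of disclosure the text describes; field names are illustrative.
from dataclasses import dataclass


@dataclass
class SystemFactsLabel:
    system_name: str
    intended_use: str            # indications for use
    training_data_summary: str   # how it was designed and trained
    test_populations: list[str]  # populations the system was tested on
    known_risks: list[str]       # known risks or adverse events
    regulatory_status: str

    def missing_fields(self) -> list[str]:
        """List disclosure fields left empty: a quick completeness check."""
        return [name for name, value in vars(self).items() if not value]


# A hypothetical vendor label with two disclosures left blank.
label = SystemFactsLabel(
    system_name="ExampleTriageBot",  # invented example system
    intended_use="Symptom triage for non-emergency adult patients",
    training_data_summary="",        # undisclosed -- flagged below
    test_populations=["UK adults 18-65"],
    known_risks=[],                  # undisclosed -- flagged below
    regulatory_status="Exempt from FDA clearance",
)
```

The value of a structured label over free-form marketing copy is exactly this kind of mechanical check: empty or missing disclosures become visible rather than simply absent.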

Deepvail was designed to harmonize AI evaluation procedures and performance reporting for high-risk use cases internationally. We have focused on building technology-enabled services that target the more explicit mechanisms for assuring AI systems, such as conducting independent certification, conformity assessment, performance testing, and formal verification. The core elements of Deepvail’s assurance philosophy are as follows:

- Independent Validation and Verification

- External Robustness Evaluation

- Audits of Prior Implementations

- Expert User Knowledge

- Radical Transparency
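The first element above, Independent Validation and Verification, can be illustrated with a minimal sketch: re-measuring a vendor's claimed performance figure on an external test set the vendor never saw. The function name, tolerance, and data below are illustrative assumptions, not Deepvail's actual procedure.

```python
# Minimal sketch of independent validation: check a vendor's claimed
# sensitivity against an external audit dataset. Illustrative only.


def independently_verify(claimed_sensitivity: float,
                         predictions: list[int],
                         labels: list[int],
                         tolerance: float = 0.05) -> bool:
    """True if measured sensitivity is within tolerance of the claim."""
    # Predictions on the actually-positive cases only.
    on_positives = [p for p, y in zip(predictions, labels) if y == 1]
    if not on_positives:
        raise ValueError("external test set has no positive cases")
    # Sensitivity = true positives / actual positives.
    measured = sum(on_positives) / len(on_positives)
    return measured >= claimed_sensitivity - tolerance


# Hypothetical audit: vendor claims 0.95 sensitivity, but the external
# test set yields 3 true positives out of 5 actual positives (0.60).
preds = [1, 0, 1, 0, 1, 1, 0, 0]
truth = [1, 1, 1, 0, 1, 0, 1, 0]
claim_holds = independently_verify(0.95, preds, truth)
```

The essential feature is independence: the measurement comes from data and code outside the vendor's control, so the result can be published whether or not it flatters the claim.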

Deepvail is building an ecosystem of technology-enabled services for independently validating the claims high-risk AI vendors make about safety and reliability. Determining standards and thresholds for evaluating vendor safety and reliability in different contexts over time, and making the findings available to all relevant stakeholders, is a crucial contribution of Deepvail. All stakeholders will be able to use Deepvail's assurance ecosystem in healthcare as a single source of pragmatic truth (i.e., information on how to act or proceed with a specific device) about the trustworthiness of vendors and the systems they develop.