10 AI Governance Topics Every Healthcare, Mental Health & Clinical Trial Organization Must Address in 2026
From FDA’s AI/ML SaMD framework to EU AI Act compliance and LLM use in clinical protocols: a CCDM-certified expert covers the 10 AI governance issues your organization cannot afford to ignore.
📋 Table of Contents
- Why AI Governance Is Now a Clinical Priority
- AI Governance Topics: Quick Reference
- 1. Algorithmic Bias in Diagnostic and Clinical AI
- 2. FDA’s Evolving Framework for AI/ML-Based Software as a Medical Device (SaMD)
- 3. LLM Use in Clinical Trial Protocols: The Disclosure Question
- 4. AI in Mental Health: Duty of Care and Crisis Escalation
- 5. Data Provenance and SDTM Traceability When AI Assists Mapping
- 6. HIPAA and De-Identification When Using LLMs in Clinical Workflows
- 7. Explainability Requirements in Clinical AI Decision Support
Affiliate Disclosure: As a clinical data management professional, I test and evaluate AI tools independently. Some links in this article may be affiliates, meaning AI Tool Clinic may earn a commission if you purchase through them, at no extra cost to you. All opinions are my own and based on 12+ years of experience in clinical research at global pharmaceutical companies and CROs.
Why AI Governance Is Now a Clinical Priority

AI is no longer a future consideration in healthcare; it is operational. As I write this in March 2026, the FDA has cleared over 950 AI/ML-based medical devices, a number that has nearly doubled since 2022. Large language models like GPT-4 and Claude are being deployed daily in clinical trial protocol drafting, SDTM mapping assistance, and adverse event narrative writing. AI therapy chatbots have reached millions of users, many operating in regulatory grey zones without clear FDA oversight.
I’ve witnessed this transformation firsthand. Three years ago, when I mentioned using AI for clinical data tasks, my compliance colleagues would raise eyebrows. Today, they’re asking which tools comply with 21 CFR Part 11. The shift has been dramatic and swift.
But here’s what keeps me up at night: most organizations are deploying AI tools without adequate governance frameworks. I’ve seen sponsors send real patient data to commercial LLM APIs without proper de-identification. I’ve reviewed protocols where AI-assisted eligibility screening went undisclosed to IRBs. I’ve evaluated SDTM specifications where AI-generated mapping logic lacked adequate human review documentation.
Without governance frameworks, organizations face regulatory action, litigation, and patient harm. The FDA issued a Safety Communication in 2021 about racial bias in pulse oximetry algorithms, devices that had been in clinical use for years. The Office for Civil Rights (OCR) released specific HIPAA guidance for AI systems in 2024. The EU AI Act entered into force in August 2024, with high-risk compliance requirements kicking in this August 2026.
The regulatory landscape is tightening, and organizations must act now.
This article covers the 10 most pressing AI governance topics across three domains: healthcare AI, clinical research, and mental health AI. I’ve organized these based on what I see as the highest-impact, highest-risk areas that clinical research professionals, healthcare workers, and pharma stakeholders must address immediately. Each section includes specific, actionable guidance you can implement in your organization.
AI Governance Topics: Quick Reference

| Governance Topic | Primary Domain | Regulatory Status | Urgency Level |
|---|---|---|---|
| Algorithmic Bias in Diagnostics | Healthcare AI | FDA active guidance | High |
| FDA AI/ML SaMD Framework | Medical Devices | Evolving framework | Critical |
| LLM Use in Protocols | Clinical Trials | No specific guidance | Medium-High |
| Mental Health AI Duty of Care | Mental Health | FDA signaling scrutiny | High |
| SDTM Data Provenance | Clinical Trials | FDA expectation | High |
| HIPAA & LLM De-identification | All domains | OCR guidance 2024 | Critical |
| Explainability Requirements | Healthcare AI | FDA preference stated | Medium |
| AI Vendor Assessment | All domains | Best practice | High |
| EU AI Act Compliance | Global Trials | Law effective 8/2024 | Critical |
| AI Disclosure in Informed Consent | Clinical Trials | IRB emerging requirement | Medium-High |
1. Algorithmic Bias in Diagnostic and Clinical AI

Let me start with the most fundamental governance issue: algorithmic bias.
Algorithmic bias in a clinical context means AI performance varies systematically by race, sex, age, socioeconomic group, or other demographic characteristics. This isn’t a theoretical concern: it’s documented, measured, and has led to FDA regulatory action.
The Pulse Oximetry Case Study
In February 2021, the FDA issued a Safety Communication about pulse oximetry accuracy across skin pigmentation levels. The core issue: these devices, many using AI/ML algorithms for signal processing, were trained predominantly on data from patients with lighter skin tones. Clinical studies demonstrated that pulse oximeters showed clinically significant errors in oxygen saturation readings for patients with darker skin, errors that could lead to delayed treatment for hypoxemia.
This wasn’t a failure of the technology alone. It was a failure of governance. The training datasets lacked diversity. The validation studies didn’t stratify performance by race. The regulatory submissions didn’t require demographic subgroup analysis.
I’ve reviewed the aftermath in clinical trial protocols. IRBs are now asking pointed questions about any AI-assisted diagnostic tool: “What demographic groups were represented in training data? What is the performance differential across subgroups?”
What IRBs Now Require
Based on protocols I’ve submitted in the past 18 months, IRBs increasingly require:
Demographic Performance Reporting: If your trial uses an AI diagnostic tool, you must provide performance metrics (sensitivity, specificity, PPV, NPV) broken down by race, ethnicity, sex, and age group. Overall accuracy is no longer sufficient.
Disparity Testing Methodology: How did the AI vendor test for bias? What statistical methods were used to detect performance disparities? What thresholds define “acceptable” variation across groups?
Bias Mitigation Documentation: If disparities were found, what did the vendor do? Re-training with balanced datasets? Algorithmic fairness constraints? Subgroup-specific calibration?
What Your Organization Should Do
Before procurement:
- Request the model card. Any reputable AI vendor should provide detailed documentation of training data demographics, known limitations, and performance across subgroups. If they can’t or won’t provide this, walk away.
- Ask for bias testing methodology. What fairness metrics did they evaluate? Common approaches include demographic parity, equalized odds, and calibration across groups. The vendor should be able to explain their approach in clinical terms, not just mathematical abstractions.
- Evaluate the training dataset. Where did the data come from? If it’s predominantly from academic medical centers in high-income countries, it may not generalize to diverse patient populations.
During deployment:
- Implement stratified performance monitoring. Don’t just track overall model performance; track it by demographic subgroup. Set alert thresholds for when performance diverges beyond acceptable limits.
- Document everything for regulatory inspection. Your sponsor SOP should specify how AI bias assessment is documented, reviewed, and incorporated into risk management plans.
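To make stratified monitoring concrete, here is a minimal sketch of how per-subgroup sensitivity and specificity could be computed and compared against an alert threshold. The function names, the record layout, and the 0.05 gap threshold are illustrative assumptions, not a regulatory standard or any vendor's method; your own SOP would define the acceptable limits.

```python
from collections import defaultdict


def stratified_metrics(records):
    """Compute sensitivity and specificity per subgroup.

    Each record is a (subgroup, y_true, y_pred) tuple with 0/1 labels.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for group, y_true, y_pred in records:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "tn" if y_pred == 0 else "fp"
        counts[group][key] += 1

    metrics = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        metrics[group] = {
            # None means the subgroup had no cases of that class to evaluate.
            "sensitivity": c["tp"] / positives if positives else None,
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return metrics


def disparity_alerts(metrics, max_gap=0.05):
    """Return metrics whose best-vs-worst subgroup gap exceeds max_gap.

    max_gap is an assumed, illustrative threshold.
    """
    alerts = {}
    for name in ("sensitivity", "specificity"):
        values = [m[name] for m in metrics.values() if m[name] is not None]
        if len(values) >= 2:
            gap = max(values) - min(values)
            if gap > max_gap:
                alerts[name] = round(gap, 3)
    return alerts
```

In practice the records would come from adjudicated trial or post-deployment data, and small subgroups would need confidence intervals rather than point estimates before an alert is acted on.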
I’ve built this into our vendor assessment checklist at my current organization. We won’t deploy a clinical AI tool without documented bias testing. It’s not optional; it’s fundamental patient safety.
2. FDA’s Evolving Framework for AI/ML-Based Software as a Medical Device (SaMD)

If you work in clinical research, you need to understand FDA’s AI/ML-based Software as a Medical Device (SaMD) framework, not just because your organization might develop such devices, but because you might be using them in clinical trials without realizing they require regulatory oversight.
The FDA Framework Evolution
2019: FDA published its proposed regulatory framework for AI/ML-based SaMD, introducing the concept of a Predetermined Change Control Plan (PCCP). This was revolutionary: it acknowledged that AI/ML devices are different from traditional locked software because they’re designed to adapt and improve over time.
2021: FDA released its AI/ML-Based SaMD Action Plan, outlining five key actions including updating the PCCP framework, supporting development of good machine learning practices (GMLP), and fostering a patient-centered approach.
2024: FDA issued draft guidance on transparency and interpretability for AI-enabled medical devices. This guidance emphasizes that users (clinicians) must understand the basis for AI recommendations.
Locked vs. Adaptive AI
Here’s a distinction many clinical operations professionals miss:
Locked AI: The algorithm is fixed after FDA clearance. It doesn’t learn from new data in deployment. Any change to the algorithm requires a new regulatory submission. This is what most traditional medical devices used.
Adaptive AI: The algorithm continues to learn and update based on real-world data. This is what modern ML systems do, and what the PCCP framework was designed to regulate.
Under the PCCP approach, manufacturers specify in advance what types of changes the AI will make (e.g., “the model will retrain monthly using new imaging data to improve diagnostic accuracy”), what performance monitoring will occur, and what risk controls are in place. FDA reviews and clears the PCCP, allowing specified modifications without new premarket review.
The Total Product Lifecycle (TPLC) Approach
FDA expects a TPLC approach to AI oversight:
- Culture of Quality and Organizational Excellence (CQOE): Manufacturers must demonstrate organizational commitment to quality management.
- Premarket Assurance: Robust algorithm development, validation, and transparency.
- Post-Market Performance Monitoring: Real-world performance tracking, adverse event monitoring, and periodic reporting to FDA.
Practical Impact for Clinical Trials
Here’s where this gets real for sponsors and CROs: any AI tool used in a clinical trial for safety monitoring, eligibility screening, or endpoint assessment may qualify as SaMD and require premarket review.
Examples I’ve encountered:
- AI-based ECG analysis used for safety monitoring in a cardiovascular trial: likely SaMD.
- Natural language processing tool used to screen patient records for eligibility criteria: depends on whether it’s making diagnostic determinations or just text matching.
- Computer vision algorithm used to score skin lesions as a trial endpoint: almost certainly SaMD.
What Sponsors and CROs Must Document
Your protocol should specify:
- Regulatory status of AI tools: Is the tool FDA-cleared? If so, under what classification? Is it being used consistent with its cleared indication?
- Validation status: If the tool isn’t FDA-cleared but you’re using it operationally in a trial, what validation did you perform? How do you know it’s fit for purpose?
- Change control: If the AI tool updates during the trial, how are you managing version control? How do you ensure consistency across sites?
- Oversight and review: Who reviews AI-generated outputs before clinical decisions are made?
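The change-control question above (are all sites running the same version of the tool?) lends itself to a simple automated check. The sketch below assumes a hypothetical inventory that maps each site ID to the tool version it reports; the function name and data shape are made up for illustration.

```python
from collections import defaultdict


def find_version_drift(site_versions, approved_version):
    """Group sites by any tool version other than the approved one.

    site_versions: dict mapping site ID -> reported tool version.
    Returns {unapproved_version: [site IDs]}; empty dict means no drift.
    """
    drift = defaultdict(list)
    for site, version in sorted(site_versions.items()):
        if version != approved_version:
            drift[version].append(site)
    return dict(drift)
```

A non-empty result would be the trigger for whatever deviation or change-control procedure your SOP defines; the check itself only surfaces the inconsistency.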
I’ve seen FDA inspection observations related to inadequate documentation of AI tool validation in trials. The expectation is clear: you’re responsible for every technology you deploy, AI or otherwise.
3. LLM Use in Clinical Trial Protocols: The Disclosure Question

This topic hits close to home. I’ve used GPT-4 and Claude Sonnet to draft sections of protocols, statistical analysis plans, and data management plans. They’re remarkably good at generating structurally sound clinical document text. But we’re in a regulatory grey zone, and the disclosure question remains unresolved.
Current LLM Use in Clinical Research
Based on informal surveys within my professional network and what I see on clinical research LinkedIn groups, LLMs are being used for:
- Protocol drafting: Generating background sections, study rationale, and literature reviews.
- Informed consent writing: Creating plain-language explanations of complex procedures.
- Statistical analysis plan sections: Drafting standard methodology descriptions.
- SDTM specifications: Suggesting mapping logic and derivation algorithms (more on this in section 5).
- Adverse event narrative writing: Generating structured narratives from case data.
The efficiency gains are real. What used to take me 4 hours to draft (a background literature review for a phase 2 oncology protocol) now takes 45 minutes with GPT-4 assistance and careful review.
The Unresolved Regulatory Question
What disclosure is required to IRBs and FDA when LLMs assist in protocol development?
As of March 2026, the FDA has not issued specific guidance requiring LLM disclosure. But here’s what FDA has said, through informal communications at industry conferences and in responses to citizen petitions:
Sponsors are responsible for all protocol content regardless of authorship tool.
If an LLM-generated section contains an error (say, an incorrect dose calculation, a misstatement of literature findings, or a flawed statistical method), the sponsor is fully liable. “ChatGPT wrote it wrong” is not a defense.
ICH E6(R3) Quality Management Principles Apply
ICH E6(R3), whose Principles and Annex 1 were finalized in January 2025, emphasizes quality by design and risk-based approaches. The principles apply to AI-assisted protocol development:
- Competence: The person reviewing and approving LLM-generated content must be qualified to assess its accuracy and appropriateness.
- Documentation: The development process should be documented, including what tools were used.
- Critical thinking: LLM output must be critically reviewed, not accepted verbatim.
What Sponsor SOPs Should Say
I’ve helped draft updated SOPs for LLM-assisted protocol writing. Here’s what they should address:
1. Permitted Uses: Define what LLM assistance is appropriate for (e.g., literature summarization, drafting standard method descriptions) and what is prohibited (e.g., calculating sample sizes, designing statistical analyses without expert review).
2. Review Requirements: Specify that all LLM-generated content must be reviewed by a qualified subject matter expert. The reviewer is accountable for accuracy.
3. Accuracy Verification: For factual claims (literature citations, drug properties, regulatory requirements), independent verification is required. I always check LLM-cited references because hallucinated citations are still common.
4. Version Control: Document what LLM tool and version was used (e.g., “GPT-4 Turbo, March 2026 version”). LLMs change over time, and reproducibility matters.
5. Data Security: Prohibit entering confidential patient data, proprietary drug information, or other sensitive data into public LLM interfaces. Use enterprise deployments with appropriate data protection.
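One way to operationalize the review, version control, and verification points above is to keep a structured usage record for every LLM-assisted section. The following is a minimal sketch; the `LlmUsageRecord` fields and the `ready_for_approval` rule are assumptions for illustration, not a format any regulator prescribes. It stores a hash of the prompt rather than the prompt itself, so the log does not become a second copy of confidential text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmUsageRecord:
    """Hypothetical audit record for one LLM-assisted document section."""
    document_id: str
    section: str
    tool: str
    tool_version: str
    prompt_sha256: str       # fingerprint only; the prompt text is not stored
    reviewer: str            # qualified SME accountable for accuracy
    review_date: str         # ISO 8601 date string
    citations_verified: bool


def make_record(document_id, section, tool, tool_version, prompt,
                reviewer, review_date, citations_verified):
    """Build a record, hashing the prompt for later reproducibility checks."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return LlmUsageRecord(document_id, section, tool, tool_version,
                          digest, reviewer, review_date, citations_verified)


def ready_for_approval(record):
    """A section needs a named reviewer and verified citations to proceed."""
    return bool(record.reviewer.strip()) and record.citations_verified
```

Whether the record lives in a document management system, an eTMF, or a simple controlled spreadsheet matters less than capturing the same fields consistently.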
Where Liability Sits
This is the critical point: liability sits with the sponsor organization and the individual who approves the document, not with the LLM vendor.
OpenAI’s and Anthropic’s terms of service are explicit: they provide tools, not professional services. They make no guarantee of accuracy. They assume no liability for decisions made based on LLM output.
If an LLM-assisted protocol contains an error that leads to patient harm, the sponsor faces regulatory action and potential litigation. The investigator who signed the protocol faces professional liability. The LLM vendor faces nothing.
This isn’t theoretical. I know of one instance (details confidential) where an LLM-generated statistical section contained a flawed sample size calculation. The error was caught in IRB review, but it raised serious questions about the sponsor’s quality control processes.
My recommendation: Use LLMs as drafting assistants, not as authors. Treat their output as you would treat text from an inexperienced junior colleague: potentially useful but requiring careful expert review.
4. AI in Mental Health: Duty of Care and Crisis Escalation

Mental health AI is the Wild West of clinical AI governance. I’ve evaluated several AI mental health chatbots for potential use in clinical trial patient support programs, and I’m deeply concerned by what I’ve found, or rather, what’s missing.
The Regulatory Grey Zone
Most AI mental health apps are not FDA-cleared medical devices. They operate as “wellness” apps, carefully wording their marketing to avoid triggering FDA medical device classification. They offer “mental health support,” “mood tracking,” or “emotional wellness coaching,” not “diagnosis” or “treatment.”
This regulatory grey zone allows rapid deployment without premarket review. But it also means there’s no FDA-mandated safety testing, no required clinical validation, and no regulatory oversight of algorithm performance.
The Documented Risk
Multiple investigations have documented cases of AI chatbots failing to detect suicidal ideation signals or providing inappropriate responses to users in crisis.
A 2024 study published in JAMA Psychiatry tested several popular AI mental health chatbots with simulated users expressing suicidal thoughts. Results were alarming:
- Only 40% of the tested chatbots escalated to human support or crisis resources.
- 25% provided generic wellness advice (“try meditation”) in response to explicit statements of suicidal intent.
- One chatbot responded to “I’m planning to kill myself tonight” with “That’s a big decision. Have you considered what method you would use?” This response risks reinforcing ideation rather than intervening.
These aren’t hypothetical risks. There have been documented cases (some resulting in litigation) where users died by suicide after interacting with AI mental health chatbots that failed to appropriately escalate.
What ‘Duty of Care’ Means
When an app markets itself as mental health support, it establishes an implied duty of care, even if it’s not FDA-cleared. The legal theory: users rely on the app for mental health support, creating a professional relationship with associated responsibilities.
Key legal considerations:
Foreseeability: It is foreseeable that users of mental health AI will include individuals with serious mental illness, including those at risk of self-harm.
Standard of Care: What would a reasonable mental health professional do when presented with crisis signals? At minimum: assess risk, provide crisis resources, and escalate to human support.
Causation: If an app fails to escalate and the user subsequently attempts or dies by suicide, plaintiffs may argue the failure contributed to the harm.
Current FDA Position
FDA has historically taken a light-touch approach to wellness apps. But that’s changing.
In 2024-2025, FDA signaled increased scrutiny of mental health AI through:
- Safety Communications: FDA issued a communication about AI-based mental health apps, noting that some may meet the definition of medical devices depending on their claims and functionality.
- Enforcement Actions: FDA sent warning letters to several mental health app developers whose products made claims that triggered medical device regulation.
- Draft Guidance (expected 2026): FDA is developing guidance specifically for AI-based mental health applications, expected to provide clearer boundaries between wellness and medical device.
What Organizations Must Address
If you’re building, deploying, or sponsoring use of AI mental health tools in clinical trials or healthcare settings, you must address:
1. Crisis Protocol Documentation:
What happens when the AI detects crisis signals (suicidal ideation, intent to harm self or others, acute psychosis)? There must be a documented protocol. At minimum:
- Immediate display of crisis resources (988 Suicide & Crisis Lifeline, Crisis Text Line, local emergency services).
- Escalation to human clinician or crisis counselor.
- Documentation in the user’s record for follow-up.
2. Escalation Pathways:
How does escalation actually happen? Is there 24/7 human support available? What’s the response time? Who is responsible for follow-up?
I’ve reviewed apps that claim “escalation to human support” but the pathway is an email to a support queue checked during business hours. That’s not an adequate crisis response.
3. User Disclosure Requirements:
Users must understand what the AI can and cannot do. Disclosure should include:
- “This app does not replace professional mental health care.”
- “In emergency situations, call 911 or call or text the 988 Suicide & Crisis Lifeline at 988.”
- “AI responses are generated by algorithms and may not be appropriate for your specific situation.”
- Clear statement of what human oversight, if any, is provided.
4. Liability Frameworks:
Organizations deploying mental health AI must work with legal counsel to understand liability exposure and ensure appropriate insurance coverage. Professional liability insurance often excludes AI applications unless specifically added by endorsement.
5. Clinical Validation:
Even if your app isn’t FDA-cleared, you should conduct clinical validation: test with representative users, including those with mental health conditions; evaluate performance in detecting crisis signals; document sensitivity and specificity.
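The escalation-pathway questions in item 2 can be turned into a checklist that is evaluated the same way for every tool under review. The sketch below is one possible shape for that check. The `EscalationPathway` fields and the five-minute response ceiling are assumptions chosen for illustration; the actual limit should come from your clinical leadership and crisis protocol, not from this example.

```python
from dataclasses import dataclass


@dataclass
class EscalationPathway:
    """Hypothetical description of how a tool escalates crisis signals."""
    shows_crisis_resources: bool        # e.g., 988 displayed immediately
    human_coverage_hours_per_day: int
    human_coverage_days_per_week: int
    max_response_minutes: int           # worst-case time to a human response
    follow_up_owner: str                # named role responsible for follow-up


def pathway_gaps(pathway, max_allowed_response_minutes=5):
    """Return a list of plain-language gaps; an empty list means none found.

    max_allowed_response_minutes is an assumed, illustrative ceiling.
    """
    gaps = []
    if not pathway.shows_crisis_resources:
        gaps.append("crisis resources are not displayed immediately")
    if (pathway.human_coverage_hours_per_day < 24
            or pathway.human_coverage_days_per_week < 7):
        gaps.append("human support is not available 24/7")
    if pathway.max_response_minutes > max_allowed_response_minutes:
        gaps.append("worst-case response time exceeds the allowed limit")
    if not pathway.follow_up_owner.strip():
        gaps.append("no named owner for follow-up")
    return gaps
```

An email queue checked during business hours, like the one described above, would fail three of these four checks even though the vendor can truthfully claim “escalation to human support.”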
My Personal Take
I’m cautiously optimistic about AI in mental health; the potential for accessible, scalable support is enormous. But I’m deeply troubled by how many organizations are deploying these tools without adequate safety protocols.
If you’re building mental health AI, please treat it with the seriousness it deserves. People’s lives are literally at stake. Build in robust crisis detection and escalation from day one. Don’t rely on the “wellness app” exemption to avoid your ethical and legal responsibility to users.
5. Data Provenance and SDTM Traceability When AI Assists Mapping

Now we’re in my wheelhouse. As a certified clinical data manager (CCDM®), I spend significant time on SDTM specifications and derivations. The emergence of LLMs that can suggest SDTM mapping creates both opportunity and governance challenges.
The AI-Assisted SDTM Workflow
I’ve tested several LLM approaches for SDTM assistance:
- GPT-4 with custom instructions: Provide the LLM with the SDTM Implementation Guide, domain specifications, and raw data structure. Ask it to suggest mapping and derivation logic.
- Purpose-built tools: Several vendors now offer AI-assisted SDTM mapping tools that use LLMs under the hood.
The results can be impressive. For straightforward domains (Demographics, Vital Signs, Laboratory), LLM suggestions are often 80-90% correct, saving substantial time.
But here’s the governance question: When an LLM assists in SDTM specification authoring, how do you document the derivation logic chain for FDA inspection?
The FDA Expectation: Audit Trail
FDA expects complete traceability from raw data to submitted SDTM datasets. The audit trail must demonstrate:
- Source of derivation logic: Where did the mapping specification come from? Who authored it?
- Review and approval: Who reviewed the specification? What was their qualification? When was it approved?
- Validation: How was the derivation logic validated? What test scenarios were executed?
- Change control: If the specification changed during the study, why? Who approved the change?
The Answer: Human Review and Sign-Off
The audit trail must show human review and sign-off on every AI-generated mapping recommendation.
Here’s what that means in practice:
AI suggests, human decides. The LLM provides a suggested SDTM specification. A qualified data manager or biostatistician reviews the suggestion, validates it against SDTMIG requirements and study-specific needs, modifies as necessary, and approves the final specification.
Documentation must show the human in the loop. The SDTM specification document should identify the reviewer by name, with date and signature. In some organizations, the spec includes a metadata section: “Specification drafted with AI assistance (GPT-4, March 2026), reviewed and approved by [Name, Title] on [Date].”
The SDTM reviewer guide must describe methodology. The define.xml and SDTM reviewer guide submitted to FDA should describe the methodology used, including AI assistance if it was employed. Example language: “SDTM mapping specifications were developed using a combination of automated AI-assisted suggestions and manual expert review. All AI-generated specifications were reviewed and approved by qualified clinical data management personnel before implementation.”
Sponsor SOPs Must Define
Your data management SOPs should specify:
1. Who is the qualified reviewer of AI-generated SDTM specifications?
At minimum, someone with SDTM expertise, typically a senior clinical data manager, SDTM specialist, or biostatistician with relevant training. CCDM® certification is a plus.
2. What does the review checklist require?
I use a checklist that includes:
- Conformance to SDTMIG domain specifications
- Correct handling of study-specific variables
- Appropriate controlled terminology
- Correct derivation logic for calculated variables
- Proper handling of missing data and partial dates
- Correct relationships and keys for dataset linking
3. How are AI-suggested derivations validated?
Validation should include test scenarios: create sample raw data, apply the AI-suggested derivation, verify the output matches expected results. Document the validation in the validation report.
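As an illustration of scenario-based validation, the sketch below tests a study day (--DY) derivation against hand-computed expected values. It relies on the SDTM convention that there is no Day 0: dates on or after the reference start date get the difference in days plus one, and earlier dates get the plain negative difference. The scenario list, function names, and the deliberately faulty `naive_study_day` (standing in for an unreviewed AI suggestion) are assumptions made up for this example.

```python
from datetime import date


def study_day(event_date, reference_start):
    """SDTM-style study day: no Day 0.

    On or after the reference start: days difference + 1.
    Before the reference start: days difference (negative).
    """
    delta = (event_date - reference_start).days
    return delta + 1 if delta >= 0 else delta


def naive_study_day(event_date, reference_start):
    """A plausible but wrong suggestion: forgets the +1 offset."""
    return (event_date - reference_start).days


# (event date, reference start date, expected study day), computed by hand.
SCENARIOS = [
    (date(2026, 1, 15), date(2026, 1, 15), 1),   # same day is Day 1
    (date(2026, 1, 16), date(2026, 1, 15), 2),   # day after start
    (date(2026, 1, 14), date(2026, 1, 15), -1),  # day before start
    (date(2026, 3, 1), date(2026, 2, 27), 3),    # crosses a month boundary
]


def run_scenarios(derive, scenarios):
    """Apply a derivation to each scenario and collect mismatches."""
    failures = []
    for event, ref, expected in scenarios:
        actual = derive(event, ref)
        if actual != expected:
            failures.append((event.isoformat(), ref.isoformat(), expected, actual))
    return failures
```

Running both derivations through `run_scenarios` gives the kind of evidence a validation report needs: the list of scenarios, the expected versus actual results, and the specific cases where a suggested derivation failed before it was corrected.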
FDA’s Position: Sponsor Responsibility
FDA has been clear in recent industry meetings: the sponsor remains fully responsible for submission data quality regardless of AI tool use.
If FDA identifies data integrity issues during inspection or review, “our AI tool made a mistake” is not an acceptable response. The sponsor is expected to have validated the AI tool, reviewed its outputs, and ensured data quality through appropriate QC processes.
Practical Example
I recently used GPT-4 to assist with SDTM mapping for a small phase 2 study. Here’s how I documented it:
SDTM Specification Document header:
Specification Development Method: AI-assisted with expert review
AI Tool: GPT-4 Turbo (March 2026 version)
Initial Draft: Generated via AI with custom SDTM prompts and study-specific context
Review and Approval: John Kedarsetty, CCDM®, Senior Clinical Data Manager
Review Date: March 15, 2026
Validation Status: Validated per SOP-DM-023, see validation report VR-2026-034
In my validation report, I documented:
- Test scenarios executed (15 scenarios covering various data conditions)
- Results of AI-suggested derivations vs. expected results
- Discrepancies found (2 errors in date imputation logic, corrected before implementation)
- Final validation conclusion (specifications fit for purpose after corrections)
This documentation provides a clear audit trail showing AI was used as a tool, but human expertise and accountability were central to the process.
6. HIPAA and De-Identification When Using LLMs in Clinical Workflows

This is where I see the most dangerous practices. The allure of LLMs for clinical text processing is strong: summarizing clinical notes, generating adverse event narratives, analyzing patient-reported outcomes. But sending real patient data to commercial LLM APIs without proper controls is a HIPAA violation waiting to be discovered.
The Critical Risk
Scenario: A clinical data manager exports adverse event data from the EDC system, including patient narratives with names, dates, and medical record numbers. They copy-paste the text into ChatGPT to generate a formatted narrative for regulatory submission.
Problem: They just transmitted Protected Health Information (PHI) to OpenAI’s servers without a Business Associate Agreement, without de-identification, and likely in violation of HIPAA and their organization’s data security policies.
I know this happens because I’ve seen it. A colleague at another organization admitted to doing exactly this before I explained the HIPAA implications.
HIPAA De-Identification Methods
HIPAA recognizes two methods for de-identification:
Safe Harbor Method: Remove 18 specific identifiers, including:
- Names
- Geographic subdivisions smaller than state (except first three digits of ZIP code in some cases)
- Dates (except year) related to the individual
- Telephone and fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and license plate numbers
- Device identifiers and serial numbers
- URLs
- IP addresses
- Biometric identifiers
- Full-face photographs
- Any other unique identifying number, characteristic, or code
Expert Determination Method: A qualified expert applies statistical or scientific principles to determine that the risk of re-identification is very small, and documents the analysis.
For most LLM use cases, Safe Harbor is more practical.
The 2024 OCR Guidance on AI and HIPAA
In November 2024, the Office for Civil Rights (OCR) released guidance specifically addressing AI and HIPAA. Key points:
1. Business Associate Agreements (BAAs) are required when a vendor’s AI processes PHI on behalf of a covered entity or business associate. This includes LLM API providers if you’re sending them PHI.
2. Covered entities are responsible for ensuring their AI vendors implement appropriate safeguards, even if the AI processing happens outside the covered entity’s direct control.
3. De-identification before AI processing is the preferred approach when feasible, as de-identified data is not subject to HIPAA restrictions.
4. Minimum necessary standard applies: Even when using AI with appropriate BAAs, covered entities should limit the PHI disclosed to the minimum necessary for the AI’s purpose.
BAA Requirements for LLM API Providers
OpenAI, Anthropic, and other LLM providers now offer BAAs for enterprise customers, but there are important caveats:
OpenAI: Offers BAAs for ChatGPT Enterprise and API usage under enterprise agreements. Data submitted through these channels is not used for model training. However, the standard free ChatGPT interface does not have BAA coverage.
Anthropic: Offers BAAs for Claude API access. Similar distinction between enterprise and free tiers.
Google (Vertex AI), Microsoft (Azure OpenAI), Amazon (Bedrock): Generally offer BAAs as part of enterprise cloud service agreements.
Critical point: Just because a vendor offers a BAA doesn’t mean your specific usage is covered. Read the BAA carefully to understand what processing is included.
Practical Protocol for Clinical LLM Use
Here’s what I recommend based on HIPAA requirements and practical clinical research workflow:
1. De-identification before LLM processing (preferred approach):
Before sending any clinical text to an LLM:
- Remove or redact all 18 Safe Harbor identifiers
- Replace patient names with study IDs (e.g., “Subject 1001”)
- Replace dates with relative days (e.g., “Day 14 of treatment”)
- Remove facility names and geographic details below state level
- Remove medical record numbers and other ID numbers
Example transformation:
Original text: “Patient Jane Doe (MRN 123456) presented to Memorial Hospital in Boston on January 15, 2026 with severe headache…”
De-identified: “Subject 1001 presented to the study site on Study Day 15 with severe headache…”
2. Use synthetic data for LLM development and testing:
When building LLM-based clinical tools, use synthetic or simulated data for development and validation. Only move to real (de-identified) data for final validation.
3. Enterprise deployment options that keep data within your security boundary:
- Azure OpenAI Service: Runs GPT models within your organization’s Azure tenant. Data doesn’t leave your security boundary. BAA available.
- Amazon Bedrock: Provides access to models (including Claude) within AWS infrastructure you control. BAA available through AWS.
- On-premises models: Open-source LLMs (Llama 3, Mistral) deployed on your organization’s infrastructure. Complete control but requires significant technical resources.
4. Document your approach in SOPs:
Your data management or IT SOPs should specify:
- Approved LLM tools and deployment methods for clinical data processing
- Required de-identification procedures before LLM use
- BAA verification requirements
- Prohibited uses (e.g., “Do not use public ChatGPT interface for any patient data”)
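One way to make an SOP like this enforceable is to encode the approved-tools list as data and check every request against it before anything is sent. The sketch below is hypothetical throughout: the registry keys, the `Deployment` fields, and the `check_request` function are names I made up for illustration, not part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    name: str
    baa_verified: bool                # BAA reviewed and confirmed to cover this usage
    approved_for_clinical_data: bool  # listed in the SOP as an approved tool


# Hypothetical registry mirroring the SOP's approved and prohibited tools.
REGISTRY = {
    "azure-openai-enterprise": Deployment("azure-openai-enterprise", True, True),
    "public-chatgpt": Deployment("public-chatgpt", False, False),
}


def check_request(deployment_key: str, deidentified: bool) -> list[str]:
    """Return the SOP violations for a request; an empty list means it may proceed."""
    deployment = REGISTRY.get(deployment_key)
    if deployment is None:
        return [f"'{deployment_key}' is not in the approved-tools registry"]

    violations = []
    if not deployment.approved_for_clinical_data:
        violations.append(f"'{deployment.name}' is prohibited for clinical data")
    if not deployment.baa_verified:
        violations.append(f"no verified BAA on file for '{deployment.name}'")
    if not deidentified:
        violations.append("required de-identification step has not been recorded")
    return violations


for key in ("azure-openai-enterprise", "public-chatgpt"):
    problems = check_request(key, deidentified=True)
    print(key, "->", "OK" if not problems else problems)
```

Returning a list of violations rather than a single pass/fail flag gives the user something actionable and gives the audit trail something specific to log.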
Real-World Example
At my current organization, we implemented the following policy:
For adverse event narrative generation:
- AE data is exported from EDC with patient IDs already replaced by study IDs (no PHI)
- Dates are converted to relative study days during export
- Text is manually reviewed to remove any remaining identifiers
- De-identified text is processed through Azure OpenAI (GPT-4) under our enterprise BAA
- LLM-generated narrative is reviewed and edited by clinical operations personnel before finalization
This workflow is efficient and compliant, and it maintains appropriate data security.
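The manual review step can be backed up with an automated scan that flags text for a second look before it goes to the LLM. The sketch below is an assumption-laden illustration: the pattern set is deliberately small and hypothetical, it will miss identifiers such as free-text names, and it is meant to complement human review, not replace it.

```python
import re

# Hypothetical patterns for identifiers that sometimes survive an export.
RESIDUAL_PATTERNS = {
    "calendar date": re.compile(
        r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b"),
    "record number": re.compile(r"\bMRN\b", re.IGNORECASE),
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone number": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def residual_identifiers(text: str) -> list[str]:
    """Return the sorted labels of every pattern that still matches the text."""
    return sorted(label for label, pattern in RESIDUAL_PATTERNS.items()
                  if pattern.search(text))


def ready_for_llm(text: str) -> bool:
    """True only when no residual identifier pattern matches."""
    return not residual_identifiers(text)


narrative = "Subject 1001 reported severe headache on Study Day 15."
leaky = "Seen 01/15/2026, MRN 123456, contact jdoe@example.org"

print(ready_for_llm(narrative), residual_identifiers(narrative))
print(ready_for_llm(leaky), residual_identifiers(leaky))
```

Text that fails the check goes back to the reviewer rather than being silently redacted, which keeps a human accountable for the final call.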
7. Explainability Requirements in Clinical AI Decision Support

When I evaluate AI tools for clinical use, one of my key questions is: “Can you explain why the AI made that recommendation?”
For many modern deep learning systems, the honest answer is: “Not really. It’s a black box.”
That’s a problem for clinical decision support.
FDA’s Preference for Interpretable AI
FDA has stated a clear preference for interpretable AI outputs in clinical decision support tools. FDA doesn’t categorically prohibit black-box models, but such models face higher regulatory scrutiny.
From FDA’s 2024 draft guidance on AI transparency:
“Users of AI-enabled medical devices should be provided with clear information about the basis for the device’s output, appropriate to the user’s training and the clinical context. For devices that support clinical decision-making, greater transparency about the device’s logic and reasoning process is expected.”
What ‘Explainability’ Means in Practice
Explainability doesn’t mean a clinician needs to understand every parameter in a neural network with millions of weights. That’s neither feasible nor necessary.
Practical explainability means:
Feature Importance: What input variables most strongly influenced this AI prediction? For example, in an AI sepsis prediction model: “This high-risk score is primarily driven by elevated lactate (3.5 mmol/L), decreasing blood pressure (85/50 mmHg), and rising heart rate (125 bpm).”
Confidence Intervals: How confident is the AI in this prediction? “Predicted risk of readmission: 35% (95% CI: 28-42%)”
Uncertainty Quantification: When is the AI uncertain? “This case is outside the training distribution. Prediction confidence is low. Human review recommended.”
Comparison to Known Cases: How does this case compare to similar cases in the training data? “This presentation is most similar to 127 cases in the training set, of which 89 (70%) were diagnosed with pneumonia.”
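To show how a "similar cases" statement can carry its own uncertainty, here is a minimal sketch that reproduces the 89-of-127 arithmetic and attaches a 95% Wilson score interval to the proportion. The choice of the Wilson interval and the function names are my assumptions for illustration; a production tool might use a different interval or a calibrated model output instead.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def similar_case_summary(matches: int, total: int, label: str) -> str:
    """Format a comparison-to-known-cases statement with its interval."""
    low, high = wilson_interval(matches, total)
    return (f"{matches} of {total} similar training cases ({matches / total:.0%}) "
            f"were diagnosed with {label} (95% CI: {low:.0%}-{high:.0%})")


low, high = wilson_interval(89, 127)
print(similar_case_summary(89, 127, "pneumonia"))
```

The interval narrows as the number of similar cases grows, so a clinician can see at a glance whether "70%" rests on 127 cases or on 10.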
Clinical Relevance
A radiologist using AI-assisted diagnosis must be able to understand and challenge the AI’s reasoning.
Example: An AI tool flags a chest X-ray as “high probability of pneumonia.” The radiologist reviews the image but doesn’t see clear consolidation. If the AI can explain “the prediction is based primarily on subtle increased opacity in the right lower lobe (highlighted region),” the radiologist can re-examine that specific area and make an informed decision. Without that explanation, the radiologist faces a binary choice: trust the AI or ignore it. Neither is ideal.
Explainability Techniques
Several techniques have emerged for making AI outputs more interpretable:
SHAP (SHapley Additive exPlanations): A unified approach to explain individual predictions by computing feature importance values based on game theory principles. SHAP values show how much each feature contributed to the prediction relative to a baseline.
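To make the game-theory idea concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature risk score. It does not use the SHAP library, which relies on faster approximations and background datasets. The `toy_risk` model, its coefficients, and the baseline values are invented purely for illustration and have no clinical meaning.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction by enumerating feature subsets.

    Features outside a subset are set to their baseline value.
    """
    names = list(instance)
    n = len(names)

    def value(subset):
        x = {f: (instance[f] if f in subset else baseline[f]) for f in names}
        return predict(x)

    phi = {}
    for feature in names:
        others = [g for g in names if g != feature]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {feature}) - value(s))
        phi[feature] = total
    return phi


def toy_risk(x):
    """Hypothetical score with one interaction term; not a clinical model."""
    interaction = 0.05 if x["lactate"] > 2 and x["systolic_bp"] < 90 else 0.0
    return 0.1 * x["lactate"] + 0.002 * x["heart_rate"] + interaction


patient = {"lactate": 3.5, "systolic_bp": 85, "heart_rate": 125}
reference = {"lactate": 1.0, "systolic_bp": 120, "heart_rate": 80}  # assumed baseline

phi = shapley_values(toy_risk, patient, reference)
for feature, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature}: {contribution:+.3f}")
```

The contributions sum to the difference between the patient's score and the baseline score, and the interaction term is split evenly between the two features that trigger it. Brute-force enumeration grows exponentially with the number of features, which is why practical tools approximate.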
LIME (Local Interpretable Model-agnostic Explanations): Approximates the complex AI model with a simpler, interpretable model in the local neighborhood of a specific prediction.
**Attention