A machine learning model that cannot explain itself is a liability. As enterprises accelerate AI adoption — deploying models into hiring pipelines, credit scoring, medical triage, fraud detection, and customer service — the question regulators, auditors, and board members are increasingly asking is not just “Does it work?” but “Can you prove it’s fair, safe, and fit for purpose?”
Enter the AI model card: a structured, standardized document that captures everything a stakeholder needs to understand, evaluate, and responsibly deploy a machine learning model.
With the EU AI Act mandating technical documentation for high-risk AI systems, NIST’s AI Risk Management Framework (AI RMF) calling for explainability and transparency, and enterprise procurement teams now including AI documentation in vendor due diligence checklists, writing a comprehensive model card has shifted from a best practice to a business imperative.
This guide is for the teams who build, deploy, and govern AI systems at scale: ML engineers, data scientists, AI product managers, compliance officers, and risk leads. By the end, you will understand what belongs in a model card, why each section matters, and how to write one that satisfies regulators and internal auditors alike. You will also come away with a complete, annotated template ready to use.
What Is an AI Model Card?
An AI model card is a short document — typically one to five pages — that provides a transparent, factual summary of a machine learning model. It answers the fundamental questions any responsible user of that model should ask before deploying it: What does this model do? What was it trained on? How well does it perform across different populations? What are its known limitations? When should it not be used?
The concept was formally introduced in the landmark 2019 paper “Model Cards for Model Reporting” by Margaret Mitchell, Timnit Gebru, and colleagues at Google. The paper argued that just as pharmaceutical drugs come with detailed information about composition, efficacy, contraindications, and side effects, AI models should come with equally rigorous documentation.
Since then, model cards have been adopted broadly:
- Hugging Face hosts tens of thousands of model cards on its model hub, making them a de facto standard for open-source ML
- Google, Meta, Microsoft, and OpenAI publish model cards for their foundation models
- Regulatory bodies in the EU and US reference model card-like documentation requirements in emerging AI legislation
- Enterprise MLOps platforms like MLflow, Vertex AI, and SageMaker now include model card generation as a native feature
A model card is not a technical specification, a research paper, or a legal disclaimer — though it draws from all three. It is a communication artifact designed to bridge the gap between the people who build models and the people who use, audit, and are affected by them.
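To make the idea concrete, a model card's core fields can be represented as a structured record that tooling can validate and query. A minimal Python sketch, where the field names are illustrative rather than a mandated schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Minimal structured model card record (illustrative fields only)."""
    name: str
    version: str
    model_type: str
    intended_uses: list[str]
    out_of_scope_uses: list[str]
    training_data: str
    limitations: list[str] = field(default_factory=list)

card = ModelCard(
    name="fraud-detector",
    version="2.3.0",
    model_type="binary classification",
    intended_uses=["card-present fraud detection in retail banking"],
    out_of_scope_uses=["hiring decisions"],
    training_data="transactions-2023-q1 (internal data catalog)",
    limitations=["not evaluated on card-not-present transactions"],
)
```

Even this skeletal structure lets downstream tooling answer the key questions programmatically, e.g. rejecting a deployment request whose use case appears in `out_of_scope_uses`.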
Why Model Cards Matter for Enterprise
For individual researchers, a model card is a transparency gesture. For enterprises deploying AI at scale, it is a governance instrument with concrete business value across five dimensions:
1. Regulatory Compliance
The EU AI Act (2024) classifies systems like credit scoring, recruitment screening, critical infrastructure management, and medical devices as high-risk AI. Article 11 mandates that providers maintain technical documentation that is “sufficiently detailed” to allow competent authorities to assess compliance. Model cards are a natural fit for this requirement.
Similarly, NIST’s AI RMF Playbook (Govern 1.1, Map 1.6, Measure 2.5) explicitly calls for documentation of AI system purpose, limitations, and evaluation results — all core components of a model card.
In financial services, the Federal Reserve’s SR 11-7 guidance on model risk management (issued jointly with the OCC as Bulletin 2011-12) requires that models be documented thoroughly enough for independent validation. A model card provides precisely this documentation layer.
2. Procurement and Vendor Due Diligence
Enterprise AI procurement is maturing fast. Organizations procuring third-party AI tools now routinely request:
- Evidence of bias testing across demographic subgroups
- Training data provenance and licensing
- Out-of-scope use case warnings
- Performance benchmarks on domain-relevant datasets
A model card provides a single, structured document that answers all of these questions, reducing the back-and-forth of vendor due diligence and building trust with enterprise buyers.
3. Internal Auditability
When an AI system produces an adverse outcome — a denied loan, a misdiagnosed image, a discriminatory hiring filter — the organization must be able to trace the decision. Model cards create a paper trail: documented intended uses, known limitations, and evaluation results that establish what was known at deployment time.
This is not just good practice; in regulated industries, it can be the difference between a manageable compliance finding and a material enforcement action.
4. Model Governance and Lifecycle Management
In large organizations, dozens or hundreds of models may be running in production simultaneously. Without standardized documentation, governance teams struggle to answer basic questions: Which models process personal data? Which are deployed in high-risk contexts? Which haven’t been re-evaluated since the training data was last refreshed?
A model card registry — where every production model has a current, versioned model card — provides the foundation for a mature AI governance program.
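Once cards are structured records, the governance questions above reduce to simple registry queries. A hypothetical sketch, where the `RegistryEntry` fields and the one-year review window are assumptions, not a standard schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class RegistryEntry:
    """Illustrative registry record; field names are assumptions."""
    model_name: str
    risk_tier: str        # "low" | "medium" | "high"
    processes_pii: bool
    last_reviewed: date

registry = [
    RegistryEntry("churn-predictor", "low", False, date(2024, 11, 1)),
    RegistryEntry("credit-scorer", "high", True, date(2023, 6, 15)),
]

def overdue(entries, max_age_days=365, today=date(2025, 1, 1)):
    """Models whose card review is older than the allowed age."""
    cutoff = today - timedelta(days=max_age_days)
    return [e.model_name for e in entries if e.last_reviewed < cutoff]

high_risk = [e.model_name for e in registry if e.risk_tier == "high"]
stale = overdue(registry)
```

With real registry data, `high_risk` and `stale` become the inputs to governance dashboards and review-scheduling workflows.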
5. Stakeholder Trust
Explainability is a trust asset. Whether the audience is a C-suite executive approving AI deployment, a regulator conducting an audit, an employee whose performance is being evaluated by an algorithm, or a customer whose data is being processed — a clear, honest model card signals that the organization takes responsible AI seriously.
Anatomy of an AI Model Card: Section-by-Section Guide
A well-structured model card covers nine core sections. Below is a detailed breakdown of each, with enterprise-specific guidance on what to include.
Section 1: Model Details
This section establishes the basic identity of the model. Think of it as the “cover page” that orients any reader.
What to include:
- Model name and version: Use a versioning scheme (e.g., fraud-detector-v2.3) that links to your model registry
- Model type: Classification, regression, generation, object detection, recommendation, etc.
- Architecture: Transformer, gradient boosted tree, convolutional neural network, etc. — at a level useful to a technical reviewer without requiring a PhD to parse
- Developers and owners: Team name, organization, and a contact method (email or internal ticket system)
- Date: Model training completion date and model card publication date (these may differ)
- Version history: A brief changelog linking to prior versions
- License: Especially important for open-source base models — note any restrictions on commercial use, redistribution, or fine-tuning
Enterprise tip: Link directly to your internal model registry record so governance teams can trace from the card to the artifact, training run, and deployment record.
Section 2: Intended Use
This is arguably the most critical section from a legal and governance standpoint. It defines the boundaries of responsible deployment.
What to include:
- Primary intended use cases: Specific tasks the model was designed and evaluated for (e.g., “detection of fraudulent card-present transactions in retail banking contexts for transactions under $10,000”)
- Intended users: Who should use this model — data scientists integrating via API, business analysts using a dashboard, clinical staff in a hospital workflow, etc.
- Out-of-scope uses: Explicit list of uses the model was not designed for and should not be used for. This is not boilerplate; it is a risk control. For example: “This model was not evaluated for use in hiring decisions and must not be used for that purpose.”
- Prohibited uses: Uses that violate law, ethics policies, or model licensing terms
Enterprise tip: Have your legal and compliance teams review this section. The out-of-scope and prohibited use fields have direct implications for liability and indemnification in vendor contracts.
Section 3: Training Data
Data is the foundation of model behavior. This section provides the provenance, composition, and quality assurance story for your training data.
What to include:
- Dataset name(s) and version(s): With links to internal data catalog entries or public dataset documentation
- Data sources: Where did the data come from? Internal systems, third-party providers, public datasets, synthetic generation?
- Collection period: Date range of the data used — relevant for assessing whether the data reflects current conditions
- Volume: Number of records, samples, tokens, images, etc.
- Demographic composition: For models involving people, document the demographic breakdown of the training population (age, gender, geography, language, etc.)
- Data preprocessing steps: Normalization, tokenization, augmentation, deduplication, filtering — enough detail for a data scientist to reproduce the preprocessing pipeline
- Data labeling: Human annotation, automated labeling, label schema, inter-annotator agreement scores
- Known data limitations: Class imbalances, geographic biases, temporal gaps, underrepresented populations
- Data governance: Licensing, consent mechanisms, PII handling, and any data use restrictions
Enterprise tip: For regulated industries, training data documentation must often satisfy data lineage requirements. Link each dataset to its data governance record, consent framework, and DPA (Data Processing Agreement) where applicable.
Section 4: Evaluation Data
Evaluation data tells stakeholders how the model was tested and whether the test conditions are relevant to their deployment context.
What to include:
- Evaluation dataset(s): Name, version, source, and relationship to training data (held-out split, separate benchmark, or real-world production sample)
- Why this dataset was chosen: Justify the evaluation set as representative of the deployment context
- Known differences from deployment context: If your evaluation data doesn’t fully match production conditions, document the gap and its likely impact
- Demographic representation in evaluation data: Particularly important for models that interact with people
Section 5: Performance Metrics
This section quantifies model performance in a way that is meaningful to both technical and non-technical stakeholders.
What to include:
- Primary metrics: Accuracy, F1 score, AUC-ROC, BLEU, RMSE — whichever metrics are most relevant to the task, with definitions
- Disaggregated performance metrics: Performance broken down by subgroup (age group, gender, race, geography, input language, device type, etc.). This is the most important fairness signal in the document.
- Confidence intervals: Error bounds on reported metrics, particularly for smaller subgroups
- Baseline comparisons: How does this model compare to the previous version, a rule-based baseline, or a human benchmark?
- Human evaluation results: For generative models, include results from human raters where applicable
- Threshold selection: For classification models, document the decision threshold used and the rationale (e.g., optimizing for recall in a fraud detection context)
Enterprise tip: Regulators and auditors will scrutinize disaggregated metrics most closely. Invest in robust subgroup evaluation before publishing. If certain subgroup results are missing (e.g., insufficient data), document that explicitly rather than omitting the subgroup.
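To illustrate why disaggregated metrics matter, here is a minimal, dependency-free sketch computing recall per subgroup on toy data. The aggregate number looks acceptable while hiding a large gap between the groups:

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Per-subgroup recall: TP / (TP + FN) within each group."""
    tp, fn = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            if p == 1:
                tp[g] += 1
            else:
                fn[g] += 1
    return {g: tp[g] / (tp[g] + fn[g])
            for g in set(groups) if tp[g] + fn[g] > 0}

# Toy data: every record is a true positive case
y_true = [1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = recall_by_group(y_true, y_pred, groups)
# Group A recall: 4/4 = 1.0; Group B recall: 1/4 = 0.25
# Aggregate recall: 5/8 = 0.625, which conceals the disparity
```

In a real evaluation pipeline the same breakdown would come from a metrics library, but the principle is identical: report the per-group numbers, not just the aggregate.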
Section 6: Ethical Considerations
This section documents the ethical review process and known risks, demonstrating that the organization has exercised due diligence.
What to include:
- Fairness analysis: Summary of bias testing methodology and results. Which definitions of fairness were assessed (demographic parity, equalized odds, calibration)? Were any disparities found?
- Potential harms: What are the realistic ways this model could cause harm — to individuals, communities, or society — if it performs poorly, is misused, or makes errors?
- Risk severity assessment: Categorize potential harms by likelihood and severity
- Mitigation measures: What steps have been taken to reduce identified risks? Human-in-the-loop safeguards, output filtering, confidence thresholds, monitoring alerts?
- Sensitive use cases: Does the model handle sensitive categories of data (health, finance, protected characteristics)? Document the safeguards in place.
- Privacy analysis: Summary of privacy impact assessment findings relevant to the model
Enterprise tip: Align this section with your organization’s AI ethics policy and risk framework. If your organization uses a tiered risk classification (e.g., low / medium / high / prohibited), include the model’s classification and the rationale.
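As a concrete example of one fairness definition named above, a demographic parity check compares positive-prediction rates across groups. A minimal sketch on toy data, intended as an illustration rather than a substitute for a full fairness audit:

```python
def selection_rate(y_pred, groups, group):
    """Fraction of positive predictions within one group."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)

def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups.
    0.0 means perfect parity; larger values mean larger disparity."""
    rates = {g: selection_rate(y_pred, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values())

y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Group A selected at 3/4 = 0.75, group B at 1/4 = 0.25
gap = demographic_parity_difference(y_pred, groups)
```

A model card's fairness section would report this gap (and analogous gaps for equalized odds), together with the threshold the organization considers acceptable for the model's risk tier.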
Section 7: Limitations and Caveats
Honest documentation of limitations is a sign of maturity, not weakness. This section protects deployers and end-users by setting accurate expectations.
What to include:
- Known performance degradation conditions: What conditions cause the model to perform worse — data distribution shifts, edge cases, adversarial inputs, low-resource languages, uncommon domain vocabulary?
- Data drift sensitivity: How sensitive is model performance to changes in the input distribution over time?
- Robustness limitations: Known vulnerabilities to adversarial examples, prompt injection (for LLMs), or input noise
- Interpretability constraints: Can predictions be explained? If not, document this and its implications for high-stakes decisions
- Scope limitations: Geographic, temporal, linguistic, or domain restrictions not already captured in the intended use section
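Data drift sensitivity is often quantified with the Population Stability Index (PSI), which compares the binned training distribution of a feature or score against its binned production distribution. A minimal sketch; the bin proportions here are toy values, and the cutoffs cited are a common rule of thumb, not a standard:

```python
import math

def psi(expected, actual):
    """Population Stability Index over matched histogram bins.
    Inputs are bin proportions summing to 1. Rule of thumb:
    < 0.1 stable, 0.1 to 0.25 moderate shift, > 0.25 significant shift."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

train_dist = [0.25, 0.25, 0.25, 0.25]  # score distribution at training time
prod_dist = [0.10, 0.20, 0.30, 0.40]   # observed in production

score = psi(train_dist, prod_dist)  # moderate shift under the rule of thumb
```

Documenting the PSI (or a comparable drift statistic) alongside an alert threshold turns the "data drift sensitivity" caveat into something monitoring can act on.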
Section 8: Deployment and Infrastructure
For enterprise model cards, this section bridges the documentation gap between model development and production operations.
What to include:
- Serving infrastructure: Cloud provider, region, compute type (GPU/CPU)
- Latency and throughput characteristics: p50/p95/p99 latency at expected load
- Input and output schema: Data types, formats, size constraints
- Dependencies: External APIs, data feeds, preprocessing services the model depends on
- Monitoring and alerting: What metrics are monitored in production? What triggers a human review or automatic rollback?
- Incident response: Who is contacted when model performance degrades or a fairness concern is flagged in production?
Section 9: Model Card Provenance and Versioning
A model card is a living document. This section ensures it remains trustworthy over time.
What to include:
- Model card authors: Who wrote this document and when
- Review and approval history: Who reviewed it (technical, legal, ethics), approval dates
- Next scheduled review date: Model cards should be reviewed when the model is retrained, when deployment context changes, or on a defined schedule (e.g., annually)
- Related documents: Links to the model’s research paper, training code repository, evaluation notebooks, data governance records, and risk assessment
Full AI Model Card Template
The following is a copy-paste-ready template for enterprise model cards. Annotated fields in [brackets] should be replaced with your model’s specific information.
# Model Card: [Model Name] v[X.Y]
---
## 1. Model Details
- **Model Name:** [e.g., Customer Churn Predictor]
- **Version:** [e.g., 2.1.0]
- **Model Type:** [e.g., Binary Classification]
- **Architecture:** [e.g., Gradient Boosted Decision Tree (XGBoost)]
- **Developed By:** [Team name, Organization]
- **Contact:** [Internal ticket queue or email]
- **Training Completion Date:** [YYYY-MM-DD]
- **Model Card Published:** [YYYY-MM-DD]
- **License:** [e.g., Internal Use Only / Apache 2.0 / CC BY 4.0]
- **Model Registry Link:** [URL to internal model registry record]
---
## 2. Intended Use
### Primary Use Cases
[Describe the specific tasks this model is designed for. Be precise about domain,
task type, and operating conditions.]
### Intended Users
[Who is expected to use this model? Data engineers via API? Business analysts
via dashboard? Clinical staff via EMR integration?]
### Out-of-Scope Uses
[List uses the model was NOT designed or evaluated for. Be explicit. This field
has direct liability implications.]
- [Use case 1]
- [Use case 2]
### Prohibited Uses
[List uses prohibited by ethics policy, law, or licensing.]
- [Prohibited use 1]
---
## 3. Training Data
- **Dataset Name(s):** [Name and version]
- **Data Source(s):** [Internal CRM / Public dataset / Third-party provider]
- **Collection Period:** [Start date – End date]
- **Training Sample Size:** [e.g., 4.2M records]
- **Demographic Composition:** [Breakdown by age, gender, geography, etc.]
- **Preprocessing Steps:** [Normalization, deduplication, label encoding, etc.]
- **Labeling Methodology:** [Human annotation / automated / rule-based]
- **Known Data Limitations:** [Class imbalance, geographic gaps, temporal drift risk]
- **Data Governance:** [License, consent basis, DPA reference, PII handling]
---
## 4. Evaluation Data
- **Evaluation Dataset:** [Name, version, source]
- **Evaluation Sample Size:** [e.g., 500K held-out records]
- **Justification for Dataset Choice:** [Why this dataset is representative of deployment context]
- **Known Gaps vs. Deployment Context:** [Document any distributional differences]
---
## 5. Performance Metrics
### Overall Performance (Threshold: [X])
| Metric | Value |
|----------------|---------|
| Accuracy | [X%] |
| Precision | [X%] |
| Recall | [X%] |
| F1 Score | [X%] |
| AUC-ROC | [X] |
### Disaggregated Performance
| Subgroup | Precision | Recall | F1 |
|------------------------|-----------|--------|-------|
| [Group A] | [X%] | [X%] | [X%] |
| [Group B] | [X%] | [X%] | [X%] |
| [Geographic region 1] | [X%] | [X%] | [X%] |
### Baseline Comparison
- **Previous version ([vX.Y]):** [Key metric delta]
- **Rule-based baseline:** [Key metric delta]
- **Human benchmark:** [Key metric delta, if applicable]
---
## 6. Ethical Considerations
### Fairness Analysis
[Describe the fairness methodology, metrics assessed (demographic parity,
equalized odds, etc.), and findings. Note any disparities found and their magnitude.]
### Potential Harms
| Harm Type | Likelihood | Severity | Mitigation |
|---------------------|------------|----------|------------|
| [False positive impact] | [Low/Med/High] | [Low/Med/High] | [Action taken] |
| [Data misuse] | [Low/Med/High] | [Low/Med/High] | [Action taken] |
### Organizational Risk Classification
- **Risk Tier:** [Low / Medium / High / Prohibited]
- **Classification Rationale:** [Why this tier was assigned]
### Privacy Considerations
[Summary of privacy impact assessment. PII data processed? Retention policies?]
---
## 7. Limitations and Caveats
- [Known condition causing performance degradation]
- [Data drift sensitivity — describe how quickly performance may degrade]
- [Adversarial input vulnerabilities, if applicable]
- [Geographic/linguistic/domain scope limitations]
- [Interpretability constraints and implications for high-stakes decisions]
---
## 8. Deployment and Infrastructure
- **Serving Environment:** [Cloud provider, region, GPU/CPU]
- **Latency (p95):** [Xms at Y RPS]
- **Input Schema:** [Format, data types, size constraints]
- **Output Schema:** [Format, probability scores, label definitions]
- **Key Dependencies:** [Data feeds, APIs, preprocessing services]
- **Monitoring:** [Metrics monitored, alert thresholds]
- **Incident Response:** [Contact team or escalation process]
---
## 9. Provenance and Versioning
- **Model Card Authors:** [Names, roles]
- **Technical Review:** [Reviewer name, date]
- **Legal/Compliance Review:** [Reviewer name, date]
- **Ethics Review:** [Reviewer name, date]
- **Next Scheduled Review:** [YYYY-MM-DD or trigger condition]
- **Related Documents:**
- Training code: [URL]
- Evaluation notebook: [URL]
- Data governance record: [URL]
- Risk assessment: [URL]

Best Practices for Enterprise Model Cards
1. Write for Multiple Audiences
A model card will be read by ML engineers validating technical claims, compliance officers assessing regulatory alignment, business stakeholders evaluating deployment risk, and potentially external auditors. Use plain language for the high-level sections (Intended Use, Ethical Considerations, Limitations) and technical precision for the metrics and infrastructure sections. Consider including an executive summary for high-risk models.
2. Treat Disaggregated Metrics as Non-Negotiable
The most common failure mode in enterprise model cards is reporting only aggregate performance metrics. A model may achieve 95% accuracy overall while performing at 72% for a specific demographic subgroup — a disparity that could constitute illegal discrimination in a regulated context. Make subgroup evaluation a required step in your model development lifecycle, not an afterthought in documentation.
3. Document What You Don’t Know
Uncertainty is honest and important. If your evaluation dataset underrepresents a subgroup, say so. If you don’t have data on how the model performs under adversarial conditions, document that gap. Regulators and auditors are more troubled by undisclosed unknowns than by acknowledged limitations with mitigation plans.
4. Version Your Model Cards
A model card is only trustworthy if it matches the model version it describes. Implement version control for model cards alongside your model artifacts. When a model is retrained or fine-tuned, the model card must be updated — including re-running evaluations on the new version.
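One lightweight way to verify that a card still matches the artifact it describes is to fingerprint the two together. A sketch, assuming a SHA-256 digest over the model bytes plus the card's identifying fields is an acceptable binding:

```python
import hashlib
import json

def fingerprint(model_bytes: bytes, card_fields: dict) -> str:
    """Hash the model artifact together with the card's identity fields,
    so a card/artifact mismatch is mechanically detectable."""
    h = hashlib.sha256()
    h.update(model_bytes)
    h.update(json.dumps(card_fields, sort_keys=True).encode())
    return h.hexdigest()

# Hypothetical artifacts and card fields for illustration
fp_a = fingerprint(b"weights-v2.1", {"name": "churn-predictor", "version": "2.1.0"})
fp_b = fingerprint(b"weights-v2.2", {"name": "churn-predictor", "version": "2.1.0"})
# Different artifact bytes under the same card fields produce different
# fingerprints, flagging a card that no longer matches the deployed model.
```

Storing the fingerprint in both the model registry and the card gives CI a cheap pre-deployment check that documentation and artifact have not drifted apart.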
5. Integrate with Your MLOps Pipeline
Manual model card authorship is a bottleneck and a consistency risk. Modern MLOps platforms — including MLflow Model Registry, Google Vertex AI Model Cards, AWS SageMaker Model Cards, and Microsoft Azure ML — support automated population of model card fields from training metadata, evaluation runs, and data catalog records. Automate what you can; reserve human review for sections requiring judgment (ethical considerations, intended use, limitations).
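The automation pattern is the same regardless of platform: render the card template from training-run metadata and route only the judgment-heavy sections to humans. A platform-agnostic sketch, where the metadata field names are illustrative rather than any specific platform's schema:

```python
# Minimal template with the machine-fillable fields of a model card
TEMPLATE = """\
# Model Card: {name} v{version}
- **Model Type:** {model_type}
- **Training Completion Date:** {trained_on}
- **AUC-ROC:** {auc:.3f}
"""

# Metadata an MLOps pipeline might export after a training run
# (illustrative field names, not a real platform's export format)
run_metadata = {
    "name": "churn-predictor",
    "version": "2.1.0",
    "model_type": "Binary Classification",
    "trained_on": "2024-08-15",
    "auc": 0.91234,
}

card_md = TEMPLATE.format(**run_metadata)
```

In practice the rendered draft would then be routed to reviewers who complete the intended-use, ethics, and limitations sections before publication.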
6. Establish a Review and Approval Workflow
A model card should not be published by the team that built the model without independent review. Establish a review workflow that includes at minimum: a peer technical review (correctness of metrics and architecture), a data governance review (training data provenance and compliance), and a risk/ethics review (fairness analysis and harm assessment). For high-risk models, add legal review.
7. Make Model Cards Discoverable
A model card filed in a shared drive that no one reads provides no governance value. Integrate model cards into your AI asset registry, link them from your model serving APIs, include them in vendor documentation packages, and make them accessible to anyone in the organization who works with the model.
Common Mistakes to Avoid
Vague intended use descriptions. “This model is used for customer analytics” is not an intended use statement — it’s a category. Specify the task, the decision context, and the operating conditions in enough detail that a compliance officer could assess regulatory applicability.
Missing out-of-scope uses. Teams often document what a model does but not what it should never do. Out-of-scope use documentation is as legally significant as the intended use statement.
Reporting only aggregate metrics. Covered above, but worth repeating: aggregate accuracy conceals subgroup disparities. Disaggregated evaluation is not optional for models that make decisions about people.
Static, never-updated cards. A model card published at launch that is never updated as the model evolves is a governance liability. It may accurately describe a model version that was retired 18 months ago while the current production version has materially different characteristics.
Omitting limitations to appear stronger. Understating limitations may feel commercially or politically safer, but it exposes the organization to greater legal risk when the limitation manifests in production and the affected party can demonstrate it was known and concealed.
Treating model cards as a compliance checkbox. The organizations that derive the most value from model cards use them as living governance instruments — regularly reviewed, integrated into procurement and deployment workflows, and genuinely informative to the stakeholders who rely on them.
Maintaining Model Cards Over Time
A model card is not a one-time deliverable. The following events should trigger a model card review and update:
- Model retraining or fine-tuning: Re-run all evaluations; update training data, metrics, and any changed architectural details
- Significant change in deployment context: A model built for internal fraud detection deployed to a new market or customer population requires reassessment of its evaluation data representativeness and fairness metrics
- Discovery of a performance issue in production: Document the finding, the root cause analysis, and the remediation steps taken
- Regulatory changes: New guidance or legislation may require additional documentation fields or stricter bias testing standards
- Scheduled periodic review: At minimum annually for production models; quarterly for high-risk applications
Assign model card ownership to a named individual or team with accountability for keeping it current. In mature AI governance programs, model card currency is a tracked metric reported to AI ethics boards or risk committees.
Model Cards in the Context of Broader AI Documentation
A model card is one layer in a comprehensive AI documentation strategy. For a mature enterprise AI governance program, it sits alongside:
- Data sheets for datasets: The companion standard to model cards, documenting dataset composition, collection methodology, and intended uses (Gebru et al., 2018)
- System cards: Documentation of AI systems that integrate multiple models, covering system-level behavior, risks, and safeguards (as used by OpenAI and Meta for large-scale deployments)
- AI impact assessments: Structured pre-deployment risk assessments required for high-risk AI under the EU AI Act
- Model risk management reports: Quantitative validation reports required in financial services (SR 11-7)
- Algorithmic impact assessments: Broader societal impact evaluations for public sector and high-stakes deployments
Model cards are the foundational layer from which most of these higher-order documents draw. Investing in rigorous, consistent model cards pays dividends across the entire AI governance documentation stack.
Conclusion: Make Model Cards a Cultural Commitment, Not a Compliance Task
The organizations that lead on responsible AI are not those that write the most comprehensive model cards to satisfy auditors. They are the ones that have internalized the discipline of transparent, honest documentation as part of how they build and deploy AI.
A well-written model card forces the development team to confront the limitations of their model honestly, conduct fairness evaluations rigorously, and communicate clearly with the stakeholders who will be affected by their system’s decisions. That discipline — more than any specific documentation format — is what builds trustworthy AI.
Start with the template in this guide. Adapt it to your organization’s risk framework, regulatory context, and MLOps tooling. Review it regularly. Make it discoverable. And treat every blank field you can’t fill in as a signal that you have more work to do before deployment.
The cost of a thoughtfully written model card is a few hours of a senior engineer’s time. The cost of deploying a model without one — and later discovering a material fairness failure, a regulatory violation, or an undisclosed limitation that caused harm — can be orders of magnitude greater.