Guide
QA Sampling for AI-Assisted Review (2026): A Defensible Approach + Tool Shortlist
A simple, repeatable sampling plan for AI-assisted review: bucketed sampling, error taxonomy, stop/adjust thresholds, and logging.
Quick answer
A defensible QA sampling plan is bucketed (by output type), randomized, tied to a simple per-batch rule, and logged with an error taxonomy plus stop/adjust thresholds—otherwise you can’t detect systemic failure modes or explain what you verified.
TL;DR
Sampling is how you make AI-assisted review defensible: define what AI is doing, define error types that matter, sample by output bucket (responsive/non-responsive/privilege/hot) instead of only overall, and use a simple, repeatable rule (fixed number or percent per batch). Randomize the sample, review with a short QA checklist, set “stop/adjust/proceed” thresholds, and log what you checked and what changed. If it isn’t logged, it didn’t happen.
Download the kit
Templates you can reuse across matters. Keep them in your matter folder and log changes.
Common Questions
- How do I create a sampling plan for AI-assisted review?
- Should I sample by bucket or overall?
- What sample size is practical for a review workflow?
- What errors should trigger escalation?
- What should a QA log include?
Worked example
A sanitized, workflow-first example. Treat as an operating pattern, not legal advice.
Example: bucketed sampling catches a systemic “role confusion” pattern (30 minutes plan setup + per-batch execution)
Scenario
A team uses AI for summaries and issue tagging across 3,200 docs. Early results feel fast, but the risk is hidden: errors cluster in the non-responsive and non-privileged buckets, where teams stop paying attention.
Inputs
- Bucket list: responsive, non-responsive, potential privilege, hot, low priority.
- Error taxonomy: critical / material / minor.
- Calibration sampling rule (fixed count per high-risk bucket).
- QA log template with decision (proceed/adjust/escalate) and “what changed.”
Process
- Sample each bucket separately (don’t only sample overall).
- Randomize sample selection and perform a short checklist QA pass.
- Log errors by type and look for repeating patterns (same failure mode across docs).
- When a pattern appears, adjust inputs/rules and increase sampling for the affected bucket until stable.
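The bucketed, randomized sampling in the steps above can be sketched as a small script. This is a minimal sketch, not any platform's API: the bucket names, counts, and seed convention are illustrative assumptions you would replace with your own plan.

```python
import random

# Illustrative per-bucket calibration rule: fixed counts for high-risk buckets,
# lighter counts elsewhere. These numbers are examples, not recommendations.
SAMPLE_RULE = {
    "responsive": 25,
    "non_responsive": 25,        # sample the "quiet" buckets too
    "potential_privilege": 25,
    "hot": 25,
    "low_priority": 10,
}

def draw_samples(buckets, rule, seed):
    """Sample each bucket separately; a logged seed makes the draw reproducible."""
    rng = random.Random(seed)
    samples = {}
    for name, doc_ids in buckets.items():
        n = min(rule.get(name, 0), len(doc_ids))
        samples[name] = sorted(rng.sample(doc_ids, n))
    return samples

# Example: 3,200 docs spread across buckets (synthetic IDs)
buckets = {
    "responsive": [f"DOC-{i:04d}" for i in range(0, 1200)],
    "non_responsive": [f"DOC-{i:04d}" for i in range(1200, 2600)],
    "potential_privilege": [f"DOC-{i:04d}" for i in range(2600, 2800)],
    "hot": [f"DOC-{i:04d}" for i in range(2800, 2900)],
    "low_priority": [f"DOC-{i:04d}" for i in range(2900, 3200)],
}
sample = draw_samples(buckets, SAMPLE_RULE, seed="batch-007")
print({k: len(v) for k, v in sample.items()})
```

Logging the seed alongside the batch ID is what makes the selection defensible: anyone can re-draw the same sample later and confirm nothing was cherry-picked.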
Outputs
- Sampling plan sheet (rules per bucket + thresholds).
- QA log entries per batch with error counts and decisions.
- A dated record of workflow adjustments (what changed and why).
QA findings
- Role confusion errors were concentrated in the non-responsive bucket during calibration.
- Cite-back omissions correlated with the highest-speed batches.
Adjustments made
- Added a role map input and made cite-backs mandatory for decision-driving summaries.
- Raised sampling rate for the non-responsive bucket until the role confusion pattern dropped.
- Added a “no cite-back, no reliance” gate before downstream use.
Key takeaway
Sampling isn’t about perfect stats—it’s about detecting patterns early and proving you controlled risk with a repeatable method.
Ranked Shortlist
1. Everlaw
Pricing: unknown
Platform-style review workflows where batching, audit trails, and reviewer QA fit naturally; ideal for disciplined sampling and logging.
2. Luminance
Pricing: free
Useful for structured document analysis; pair with human QA sampling to validate outputs before relying on them.
3. Aerial
Pricing: unknown
Fast doc-level insight for triage; sampling helps ensure outputs are accurate and cite-backed in high-risk buckets.
4. Paralegal Pal
Pricing: unknown
Paralegal-facing assistance for consistent notes and triage; logs and sampling keep it defensible.
5. Legal Doc Assistant
Pricing: unknown
Lightweight structured extraction option; sampling helps quantify where it’s strong vs where it needs tighter inputs.
Workflow fit (comparison)
A workflow-first comparison. Treat as directional and verify with your team’s requirements and vendor docs.
| Tool | Best for | Workflow fit | Auditability | QA support | Privilege controls | Exports/logs | Notes |
|---|---|---|---|---|---|---|---|
| Everlaw | High-volume matters where sampling and audit trails must be operationalized consistently. | Batches, audit trails, review staging, export workflows | Strong (workflow platform supports traceability). | Strong (supports reviewer QA and consistent staging). | Strong (still requires explicit privilege protocols). | Strong (consistent exports and process documentation). | Best when you want sampling to be part of the system, not a side spreadsheet. |
| Luminance | Structured analysis and extraction where you want to measure error types with sampling. | Extraction, analysis | Medium (verify logging and reproducibility). | Medium (pair with bucketed sampling + cite-backs). | Medium (workflow-driven; enforce boundaries). | Medium (confirm structured export). | Works well when sampling is used to validate output quality and tune inputs. |
| Aerial | Fast triage where sampling is used to keep accuracy controlled. | Triage, summaries | Low–Medium (treat outputs as drafts unless cite-backed and stored). | Medium (sampling is the safety net). | Low–Medium (policy + escalation rules matter). | Low–Medium (verify repeatable export). | Great for speed; reliability comes from disciplined sampling and logs. |
| Paralegal Pal | Standardized paralegal outputs that can be sampled and logged consistently. | Templates, structured notes, triage | Low–Medium (improves with fixed schemas and saved batch outputs). | Medium (easy to wrap with a sampling plan + QA log). | Low–Medium (policy-driven). | Medium (confirm structured export). | Good when your workflow is checklist-driven and you need repeatable outputs. |
| Legal Doc Assistant | Small-team workflows where you want quick extraction and sampling-based QA. | Extraction, draft notes | Low–Medium (depends on cite-backs + versioning). | Medium (sampling quantifies reliability). | Low–Medium (enforce boundaries and escalation). | Low–Medium (verify repeatable export). | A strong starter if you already have a sampling discipline and a QA log. |
Comparison Table
Use this to shortlist quickly. Treat pricing/platform as directional and verify on the vendor site.
| Tool | Pricing | Platform | Verified | Last checked | Categories | Links |
|---|---|---|---|---|---|---|
| Everlaw | unknown | web | No | 2026-02-20 | Legal document review | |
| Luminance | free | web | No | 2026-02-20 | Legal document review | |
| Aerial | unknown | web | No | 2026-02-20 | Legal document review | |
| Paralegal Pal | unknown | web | No | 2026-02-20 | Legal document review | |
| Legal Doc Assistant | unknown | web | No | 2026-02-20 | Legal document review | |
How to choose
- Pick tools/workflows that preserve batch context and allow audit logging.
- Require structured outputs you can verify (cite-backs, fields, consistent schemas).
- Treat privilege/confidentiality as high-risk buckets and sample accordingly.
- Prefer workflows that support randomization/sampling and reviewer QA notes.
- Pilot with a bounded dataset and track error patterns, not vibes.
Implementation risks
- Sampling only overall and missing high-risk bucket failures.
- No error taxonomy, so “QA” becomes subjective and inconsistent.
- No stop/adjust thresholds, so systemic errors repeat across batches.
- No logging, so you can’t explain what was checked or what changed.
- Confusing speed with safety (fast outputs without verification).
Operator playbook
Copy/pasteable workflow steps you can standardize across matters. Keep it consistent and log changes.
8-step sampling plan (minimum viable)
- Step 1: define the AI task in one sentence (what it does and doesn’t do).
- Step 2: define error types that matter (critical/material/minor).
- Step 3: sample by bucket (responsive/non-responsive/privilege/hot/low priority).
- Step 4: use a repeatable rule (fixed count or percent per batch for high-risk buckets).
- Step 5: randomize the sample (avoid cherry-picking).
- Step 6: review with a short QA checklist (text matches, cite-backs, privilege indicators).
- Step 7: apply stop/adjust/proceed thresholds and document changes.
- Step 8: log everything (batch, bucket, sample size, errors, decision, changes).
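Steps 7 and 8 can be made concrete with a small decision rule over each batch's QA counts. A minimal sketch: the `QARecord` schema, the threshold values, and the decision labels are illustrative assumptions, not recommendations for any matter.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative thresholds: any critical error escalates; material errors above
# a rate trigger adjustment; otherwise proceed. Tune these per matter.
CRITICAL_STOP = 1
MATERIAL_ADJUST_RATE = 0.10

@dataclass
class QARecord:
    batch_id: str
    bucket: str
    sample_size: int
    errors: dict            # e.g. {"critical": 0, "material": 1, "minor": 3}
    reviewer: str
    decision: str = ""
    what_changed: str = ""
    logged_on: str = field(default_factory=lambda: date.today().isoformat())

def decide(rec: QARecord) -> str:
    """Apply stop/adjust/proceed thresholds to one batch's sampled QA counts."""
    if rec.errors.get("critical", 0) >= CRITICAL_STOP:
        rec.decision = "escalate"   # pause, fix inputs/rules, document
    elif rec.errors.get("material", 0) / max(rec.sample_size, 1) > MATERIAL_ADJUST_RATE:
        rec.decision = "adjust"     # tighten inputs, raise sampling for this bucket
    else:
        rec.decision = "proceed"
    return rec.decision

rec = QARecord("B-014", "non_responsive", 25,
               {"critical": 0, "material": 4, "minor": 2}, "reviewer_a")
print(decide(rec))  # 4/25 material errors exceeds the 10% adjust threshold
```

Encoding the rule this way forces the team to agree on thresholds before the batch runs, which is exactly the property that makes the decision defensible afterward.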
Recommended prompt packs
Litigation and Discovery Pack
Prompts for case theory, chronologies, discovery requests, depositions, and eDiscovery protocols.
Lawyer Productivity Pack
A practical pack of rewritten prompt templates (inspired by a public legal-tech article) for intake, drafting, litigation, research, and client communications.
FAQ
Why sample by bucket instead of overall?
Because risk isn’t evenly distributed. The bucket that looks “safe” (e.g., non-privileged) can hide the errors that hurt you most.
Do we need perfect statistical confidence?
Not for most teams. You need a consistent method that detects systemic errors early and produces an audit story you can explain.
What’s a practical starting sample size?
Start heavier in early calibration (fixed counts per bucket). In steady state, keep a simple per-batch rule for high-risk buckets.
What should trigger escalation?
Any repeating critical error pattern—especially privilege/confidentiality misses—should trigger a pause, adjustment, and documentation.
What should a QA log include?
Batch ID, bucket, sample size, reviewer, error counts by type, decision (proceed/adjust/escalate), and what changed.
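The fields above fit a flat CSV log that any reviewer can append to. A minimal sketch; the column names are illustrative, not a standard schema.

```python
import csv
import io

# Illustrative QA-log columns matching the fields listed above.
FIELDS = ["batch_id", "bucket", "sample_size", "reviewer",
          "critical", "material", "minor", "decision", "what_changed"]

def write_log(stream, rows):
    """Write QA log rows as CSV so every batch leaves an auditable record."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()  # stands in for a file in the matter folder
write_log(buf, [{
    "batch_id": "B-014", "bucket": "non_responsive", "sample_size": 25,
    "reviewer": "reviewer_a", "critical": 0, "material": 4, "minor": 2,
    "decision": "adjust", "what_changed": "raised sampling rate; added role map input",
}])
print(buf.getvalue())
```

A flat file like this is deliberately boring: one row per batch/bucket, no free-form notes fields that drift, so error patterns can be counted across batches.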
Citations
Not legal advice. Verify with primary sources and your firm’s policies.
Changelog
2026-03-08
- Published as an Answer Hub guide.
- Added downloadable sampling plan + QA log templates.
- Added a one-page PDF summary.
- Added a worked example.
- Added workflow-fit comparison table.
Templates included. Download the kit for this guide.
Download kit