Guide
QA Sampling for AI-Assisted Review (2026): A Defensible Approach + Tool Shortlist
A simple, repeatable sampling plan for AI-assisted review: bucketed sampling, error taxonomy, stop/adjust thresholds, and logging.
Quick answer
A defensible QA sampling plan is bucketed (by output type), randomized, tied to a simple per-batch rule, and logged with an error taxonomy plus stop/adjust thresholds—otherwise you can’t detect systemic failure modes or explain what you verified.
TL;DR
Sampling is how you make AI-assisted review defensible: define what AI is doing, define error types that matter, sample by output bucket (responsive/non-responsive/privilege/hot) instead of only overall, and use a simple, repeatable rule (fixed number or percent per batch). Randomize the sample, review with a short QA checklist, set “stop/adjust/proceed” thresholds, and log what you checked and what changed. If it isn’t logged, it didn’t happen.
Download the kit
Templates you can reuse across matters. Keep them in your matter folder and log changes.
Common Questions
- How do I create a sampling plan for AI-assisted review?
- Should I sample by bucket or overall?
- What sample size is practical for a review workflow?
- What errors should trigger escalation?
- What should a QA log include?
Worked example
A sanitized, workflow-first example. Treat as an operating pattern, not legal advice.
Example: bucketed sampling catches a systemic “role confusion” pattern (30 minutes plan setup + per-batch execution)
Scenario
A team uses AI for summaries and issue tagging across 3,200 docs. Early results feel fast, but the risk is hidden: errors cluster in the non-responsive and non-privileged buckets, where teams stop paying attention.
Inputs
- Bucket list: responsive, non-responsive, potential privilege, hot, low priority.
- Error taxonomy: critical / material / minor.
- Calibration sampling rule (fixed count per high-risk bucket).
- QA log template with decision (proceed/adjust/escalate) and “what changed.”
Process
- Sample each bucket separately (don’t only sample overall).
- Randomize sample selection and perform a short checklist QA pass.
- Log errors by type and look for repeating patterns (same failure mode across docs).
- When a pattern appears, adjust inputs/rules and increase sampling for the affected bucket until stable.
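The bucketed, randomized sampling in the steps above can be sketched as a small script. This is a minimal sketch, not any platform's API: the bucket names, counts, and seed convention are illustrative assumptions you would replace with your own plan.

```python
import random

# Illustrative per-bucket calibration rule: fixed counts for high-risk buckets,
# lighter counts elsewhere. These numbers are examples, not recommendations.
SAMPLE_RULE = {
    "responsive": 25,
    "non_responsive": 25,        # sample the "quiet" buckets too
    "potential_privilege": 25,
    "hot": 25,
    "low_priority": 10,
}

def draw_samples(buckets, rule, seed):
    """Sample each bucket separately; a logged seed makes the draw reproducible."""
    rng = random.Random(seed)
    samples = {}
    for name, doc_ids in buckets.items():
        n = min(rule.get(name, 0), len(doc_ids))
        samples[name] = sorted(rng.sample(doc_ids, n))
    return samples

# Example: 3,200 docs spread across buckets (synthetic IDs)
buckets = {
    "responsive": [f"DOC-{i:04d}" for i in range(0, 1200)],
    "non_responsive": [f"DOC-{i:04d}" for i in range(1200, 2600)],
    "potential_privilege": [f"DOC-{i:04d}" for i in range(2600, 2800)],
    "hot": [f"DOC-{i:04d}" for i in range(2800, 2900)],
    "low_priority": [f"DOC-{i:04d}" for i in range(2900, 3200)],
}
sample = draw_samples(buckets, SAMPLE_RULE, seed="batch-007")
print({k: len(v) for k, v in sample.items()})
```

Logging the seed alongside the batch ID is what makes the selection defensible: anyone can re-draw the same sample later and confirm nothing was cherry-picked.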
Outputs
- Sampling plan sheet (rules per bucket + thresholds).
- QA log entries per batch with error counts and decisions.
- A dated record of workflow adjustments (what changed and why).
QA findings
- Role confusion errors were concentrated in the non-responsive bucket during calibration.
- Cite-back omissions correlated with the highest-speed batches.
Adjustments made
- Added a role map input and made cite-backs mandatory for decision-driving summaries.
- Raised sampling rate for the non-responsive bucket until the role confusion pattern dropped.
- Added a “no cite-back, no reliance” gate before downstream use.
Key takeaway
Sampling isn’t about perfect stats—it’s about detecting patterns early and proving you controlled risk with a repeatable method.
Ranked Shortlist
1. Everlaw
Pricing: unknown
Platform-style review workflows where batching, audit trails, and reviewer QA fit naturally; ideal for disciplined sampling and logging.
2. Luminance
Pricing: free
Useful for structured document analysis; pair with human QA sampling to validate outputs before relying on them.
3. Aerial
Pricing: unknown
Fast doc-level insight for triage; sampling helps ensure outputs are accurate and cite-backed in high-risk buckets.
4. Paralegal Pal
Pricing: unknown
Paralegal-facing assistance for consistent notes and triage; logs and sampling keep it defensible.
5. Legal Doc Assistant
Pricing: unknown
Lightweight structured extraction option; sampling helps quantify where it’s strong vs where it needs tighter inputs.
Workflow fit (comparison)
A workflow-first comparison. Treat as directional and verify with your team’s requirements and vendor docs.
| Tool | Best for | Workflow fit | Auditability | QA support | Privilege controls | Exports/logs | Notes |
|---|---|---|---|---|---|---|---|
| Everlaw | High-volume matters where sampling and audit trails must be operationalized consistently. | Batches, audit trails, review staging, export workflows | Strong (workflow platform supports traceability). | Strong (supports reviewer QA and consistent staging). | Strong (still requires explicit privilege protocols). | Strong (consistent exports and process documentation). | Best when you want sampling to be part of the system, not a side spreadsheet. |
| Luminance | Structured analysis and extraction where you want to measure error types with sampling. | Extraction, analysis | Medium (verify logging and reproducibility). | Medium (pair with bucketed sampling + cite-backs). | Medium (workflow-driven; enforce boundaries). | Medium (confirm structured export). | Works well when sampling is used to validate output quality and tune inputs. |
| Aerial | Fast triage where sampling is used to keep accuracy controlled. | Triage, summaries | Low–Medium (treat outputs as drafts unless cite-backed and stored). | Medium (sampling is the safety net). | Low–Medium (policy + escalation rules matter). | Low–Medium (verify repeatable export). | Great for speed; reliability comes from disciplined sampling and logs. |
| Paralegal Pal | Standardized paralegal outputs that can be sampled and logged consistently. | Templates, structured notes, triage | Low–Medium (improves with fixed schemas and saved batch outputs). | Medium (easy to wrap with a sampling plan + QA log). | Low–Medium (policy-driven). | Medium (confirm structured export). | Good when your workflow is checklist-driven and you need repeatable outputs. |
| Legal Doc Assistant | Small-team workflows where you want quick extraction and sampling-based QA. | Extraction, draft notes | Low–Medium (depends on cite-backs + versioning). | Medium (sampling quantifies reliability). | Low–Medium (enforce boundaries and escalation). | Low–Medium (verify repeatable export). | A strong starter if you already have a sampling discipline and a QA log. |
Comparison Table
Use this to shortlist quickly. Treat pricing/platform as directional and verify on the vendor site.
| Tool | Pricing | Platform | Verified | Last checked | Categories | Links |
|---|---|---|---|---|---|---|
| Everlaw | unknown | web | No | 2026-02-20 | Legal document review | |
| Luminance | free | web | No | 2026-02-20 | Legal document review | |
| Aerial | unknown | web | No | 2026-02-20 | Legal document review | |
| Paralegal Pal | unknown | web | No | 2026-02-20 | Legal document review | |
| Legal Doc Assistant | unknown | web | No | 2026-02-20 | Legal document review | |
How to choose
- Pick tools/workflows that preserve batch context and allow audit logging.
- Require structured outputs you can verify (cite-backs, fields, consistent schemas).
- Treat privilege/confidentiality as high-risk buckets and sample accordingly.
- Prefer workflows that support randomization/sampling and reviewer QA notes.
- Pilot with a bounded dataset and track error patterns, not vibes.
Implementation risks
- Sampling only overall and missing high-risk bucket failures.
- No error taxonomy, so “QA” becomes subjective and inconsistent.
- No stop/adjust thresholds, so systemic errors repeat across batches.
- No logging, so you can’t explain what was checked or what changed.
- Confusing speed with safety (fast outputs without verification).
Operator playbook
Copy/pasteable workflow steps you can standardize across matters. Keep it consistent and log changes.
8-step sampling plan (minimum viable)
- Step 1: define the AI task in one sentence (what it does and doesn’t do).
- Step 2: define error types that matter (critical/material/minor).
- Step 3: sample by bucket (responsive/non-responsive/privilege/hot/low priority).
- Step 4: use a repeatable rule (fixed count or percent per batch for high-risk buckets).
- Step 5: randomize the sample (avoid cherry-picking).
- Step 6: review with a short QA checklist (text matches, cite-backs, privilege indicators).
- Step 7: apply stop/adjust/proceed thresholds and document changes.
- Step 8: log everything (batch, bucket, sample size, errors, decision, changes).
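Steps 7 and 8 can be made concrete with a small decision rule over each batch's QA counts. A minimal sketch: the `QARecord` schema, the threshold values, and the decision labels are illustrative assumptions, not recommendations for any matter.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative thresholds: any critical error escalates; material errors above
# a rate trigger adjustment; otherwise proceed. Tune these per matter.
CRITICAL_STOP = 1
MATERIAL_ADJUST_RATE = 0.10

@dataclass
class QARecord:
    batch_id: str
    bucket: str
    sample_size: int
    errors: dict            # e.g. {"critical": 0, "material": 1, "minor": 3}
    reviewer: str
    decision: str = ""
    what_changed: str = ""
    logged_on: str = field(default_factory=lambda: date.today().isoformat())

def decide(rec: QARecord) -> str:
    """Apply stop/adjust/proceed thresholds to one batch's sampled QA counts."""
    if rec.errors.get("critical", 0) >= CRITICAL_STOP:
        rec.decision = "escalate"   # pause, fix inputs/rules, document
    elif rec.errors.get("material", 0) / max(rec.sample_size, 1) > MATERIAL_ADJUST_RATE:
        rec.decision = "adjust"     # tighten inputs, raise sampling for this bucket
    else:
        rec.decision = "proceed"
    return rec.decision

rec = QARecord("B-014", "non_responsive", 25,
               {"critical": 0, "material": 4, "minor": 2}, "reviewer_a")
print(decide(rec))  # 4/25 material errors exceeds the 10% adjust threshold
```

Encoding the rule this way forces the team to agree on thresholds before the batch runs, which is exactly the property that makes the decision defensible afterward.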
Recommended prompt packs
Litigation and Discovery Pack
Prompts for case theory, chronologies, discovery requests, depositions, and eDiscovery protocols.
Lawyer Productivity Pack
A practical pack of rewritten prompt templates (inspired by a public legal-tech article) for intake, drafting, litigation, research, and client communications.
FAQ
Why sample by bucket instead of overall?
Because risk isn’t evenly distributed. The bucket that looks “safe” (e.g., non-privileged) can hide the errors that hurt you most.
Do we need perfect statistical confidence?
Not for most teams. You need a consistent method that detects systemic errors early and produces an audit story you can explain.
What’s a practical starting sample size?
Start heavier in early calibration (fixed counts per bucket). In steady state, keep a simple per-batch rule for high-risk buckets.
What should trigger escalation?
Any repeating critical error pattern—especially privilege/confidentiality misses—should trigger a pause, adjustment, and documentation.
What should a QA log include?
Batch ID, bucket, sample size, reviewer, error counts by type, decision (proceed/adjust/escalate), and what changed.
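The fields above fit a flat CSV log that any reviewer can append to. A minimal sketch; the column names are illustrative, not a standard schema.

```python
import csv
import io

# Illustrative QA-log columns matching the fields listed above.
FIELDS = ["batch_id", "bucket", "sample_size", "reviewer",
          "critical", "material", "minor", "decision", "what_changed"]

def write_log(stream, rows):
    """Write QA log rows as CSV so every batch leaves an auditable record."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()  # stands in for a file in the matter folder
write_log(buf, [{
    "batch_id": "B-014", "bucket": "non_responsive", "sample_size": 25,
    "reviewer": "reviewer_a", "critical": 0, "material": 4, "minor": 2,
    "decision": "adjust", "what_changed": "raised sampling rate; added role map input",
}])
print(buf.getvalue())
```

A flat file like this is deliberately boring: one row per batch/bucket, no free-form notes fields that drift, so error patterns can be counted across batches.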
Citations
Not legal advice. Verify with primary sources and your firm’s policies.
Changelog
2026-03-08
- Published as an Answer Hub guide.
- Added downloadable sampling plan + QA log templates.
- Added a one-page PDF summary.
- Added a worked example.
- Added workflow-fit comparison table.
Templates included. Download the kit for this guide.
Download kit