How to vet a legal AI tool for citation accuracy

TL;DR: Most legal AI tools look accurate until you check. A structured test - ask for a known judgment, click the source link, verify the proposition matches the holding, probe an obscure point, then push the tool to its limits - will tell you in under 30 minutes whether the tool retrieves real law or invents plausible-looking text. Niyam.ai is built on a corpus of 72,000+ Indian judgments; every answer links to a real document you can open.

Why citation accuracy is the only metric that matters
How hallucination actually happens in legal AI
The difference between grounded retrieval and pure generation
Your 30-minute vetting test plan
Green flags and red flags: a vetting checklist
What to do when the tool cannot find an answer
The Mata v. Avianca moment and what it means for Indian lawyers
How Niyam handles citation grounding
Practical workflow for ongoing use
Frequently asked questions

Why citation accuracy is the only metric that matters

Speed does not matter if the citations are wrong. A beautiful interface does not matter if the citations are wrong. An affordable subscription does not matter if the citations are wrong.

This sounds obvious when stated plainly. And yet, the marketing material for most legal AI products leads with speed, breadth, and ease of use. Citation accuracy - the one thing that determines whether the tool actually helps or actively harms you - tends to appear as a paragraph near the bottom, often hedged with phrases like “we recommend verification.”

For a lawyer in India, a wrong citation is not a minor inconvenience. It can lead to a pleading that misrepresents the state of the law, a judge who has already encountered the same fabricated authority from another AI-assisted filing, or a disciplinary inquiry. Courts across India have started noticing AI-generated content in submissions, and the judicial temperament toward unverified AI citations is not forgiving.

The question to ask before adopting any legal AI tool is therefore very specific: does this tool retrieve real judgments from a verified corpus, or does it generate text that looks like it came from real judgments? These are architecturally different things, and the test plan below will tell you which category a tool belongs to in half an hour.

Read more on why this matters in our piece on AI legal research in India and the hallucination risk.

How hallucination actually happens in legal AI

A general-purpose large language model (LLM) - ChatGPT, Gemini, Claude, and their equivalents - is trained to predict the next most plausible token given everything that came before it. During training it processed enormous amounts of text, including legal text: judgments, headnotes, law review articles, textbooks, court rules.

As a result it has learned the statistical shape of legal language very well. It knows what a Supreme Court citation looks like. It knows how a headnote is phrased. It knows the rhythm of a paragraph that distinguishes an authority. When you ask it a legal question, it produces text that matches that shape, including a citation in the format it expects a citation to appear.

The problem is that predicting the shape of a citation is not the same as retrieving a real citation. The model may produce party names that sound plausible, a year in the expected range, a reporter abbreviation formatted correctly, and a page number that looks right - and the underlying judgment may not exist at all. Or it may exist but say something completely different from what the AI claimed.

This is not the model malfunctioning. It is the model doing exactly what it was designed to do: producing statistically plausible text. The fluency of the output carries no information about its accuracy.
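
The gap between a citation's shape and its existence can be shown in a few lines of Python. This is a minimal sketch under stated assumptions: the regular expression covers only one SCC-style format, VERIFIED_INDEX is a hypothetical two-entry stand-in for a real corpus lookup, and the second citation is invented for illustration (it is future-dated so that it cannot collide with a real report).

```python
import re

# Shape of an SCC-style citation such as "(1973) 4 SCC 225".
# Illustrative only: real citation formats are far more varied.
SCC_SHAPE = re.compile(r"^\((\d{4})\) (\d{1,2}) SCC (\d{1,4})$")

# Hypothetical stand-in for a verified corpus index: citation -> case name.
VERIFIED_INDEX = {
    "(1973) 4 SCC 225": "Kesavananda Bharati v. State of Kerala",
    "(1997) 6 SCC 241": "Vishaka v. State of Rajasthan",
}

def looks_like_citation(text: str) -> bool:
    """True if the string has the right shape. Says nothing about existence."""
    return SCC_SHAPE.match(text) is not None

def exists_in_corpus(text: str) -> bool:
    """True only if the citation resolves to an indexed document."""
    return text in VERIFIED_INDEX

real = "(1973) 4 SCC 225"
invented = "(2031) 3 SCC 100"  # made up for this example

for citation in (real, invented):
    print(citation, looks_like_citation(citation), exists_in_corpus(citation))
```

Both strings pass the shape check; only the first resolves to a document. A fluent model is, in effect, very good at the first function and has no access to the second.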

For a deeper look at the structural reasons Indian legal research is particularly exposed, see why generic GPT tools fall short for Indian lawyers.

The difference between grounded retrieval and pure generation

Not every legal AI works the same way. Understanding the architecture helps you ask better questions during evaluation.

Pure generation: The model answers entirely from its training data. No live search. No corpus lookup. The answer is probabilistic text. This describes general-purpose chatbots used for legal queries.

Retrieval-augmented generation (RAG): When you ask a question, the system first searches an indexed corpus of real documents - judgments, statutes, notifications - retrieves the passages most relevant to your query, and then asks the model to compose an answer grounded in those passages. Citations point back to the source documents. A well-built RAG system shows you the source material so you can open it.

Grounded retrieval with a citator layer: The most rigorous approach adds a good-law check on top of RAG. Retrieved judgments are checked against subsequent decisions to flag authorities that have been overruled, distinguished, or qualified. This is what a citator does, and it is what separates a research starting point from a research conclusion.

The practical implication: when you evaluate a legal AI, you are trying to determine which of these three categories it belongs to. The test plan below is designed to make that determination quickly and without relying on marketing claims.
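
To make the three categories concrete, here is a toy Python sketch of the second and third: retrieve first, decline when nothing relevant is found, and attach a good-law flag to whatever is retrieved. Everything in it is an assumption made for illustration: the two-document corpus, the DOC-001 style identifiers, the keyword-overlap scoring, and the TREATMENTS table are all invented, and a production system would use proper search and a language model to compose the answer. It does not describe how any particular product works.

```python
import re
from dataclasses import dataclass

STOPWORDS = {"a", "an", "the", "of", "for", "on", "to", "is", "in",
             "and", "be", "may", "can", "does", "what", "when"}

@dataclass(frozen=True)
class Passage:
    doc_id: str  # hypothetical identifier of the source document
    text: str

# Toy corpus with placeholder wording, not real judgments.
CORPUS = [
    Passage("DOC-001", "the limitation period for filing an appeal begins on the date of the order"),
    Passage("DOC-002", "a tenant may be evicted for non payment of rent after due notice"),
]

# Hypothetical citator table: doc_id -> treatments recorded in later decisions.
TREATMENTS = {"DOC-002": ["overruled"]}

def keywords(text: str) -> set:
    """Lowercase words with common stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def retrieve(query: str, min_overlap: int = 2) -> list:
    """Return passages sharing at least `min_overlap` keywords with the query, best first."""
    q = keywords(query)
    scored = [(len(q & keywords(p.text)), p) for p in CORPUS]
    hits = [(score, p) for score, p in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in hits]

def answer(query: str) -> dict:
    """Ground the response in retrieved passages, or decline if there are none."""
    passages = retrieve(query)
    if not passages:
        return {"grounded": False, "sources": []}
    return {
        "grounded": True,
        "sources": [
            {"doc_id": p.doc_id, "passage": p.text,
             "treatment": TREATMENTS.get(p.doc_id, [])}
            for p in passages
        ],
    }

print(answer("When does the limitation period for an appeal begin?"))
print(answer("Can a tenant be evicted for unpaid rent?"))
print(answer("What is the law on trademark dilution?"))
```

The third query falls outside the toy corpus, so the sketch declines rather than composing an answer. A pure-generation system has no equivalent of that branch, which is the behaviour the tests below are designed to expose.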

You can see how this plays out in a side-by-side comparison on our compare page.

Your 30-minute vetting test plan

Run these five tests in sequence. Each one is designed to probe a different failure mode.

Test 1: the known landmark (5 minutes)

Pick a case you know well - something you have read, briefed, or argued. A landmark judgment with a proposition you can state from memory. Kesavananda Bharati is a good choice for constitutional lawyers. Vishaka is a good choice for labour and employment practitioners. Ask the tool: “What did the Supreme Court hold in [case name]?”

Evaluate three things:

Does the tool produce a citation?
Does the citation link to a real document you can open?
Does the document, when you open it, say what the tool claimed it said?

If the citation does not link anywhere, or the link resolves to a 404, or the judgment says something different from the claimed proposition, you have learned something important in five minutes.

Test 2: the obscure point (10 minutes)

Pick a narrow, technical question - the kind of thing that would not appear prominently in any textbook. A procedural point under a specific rule. A tribunal-level question about a statutory exception. The more niche the better.

Ask the question and evaluate: does the tool produce a citation, or does it say it cannot find a sufficiently grounded answer? A well-designed grounded system should tell you when it does not have adequate source material rather than filling the gap with generated text. A tool that always produces a confident citation, even for obscure points it cannot possibly have indexed, is a tool that generates rather than retrieves.

Test 3: the proposition check (5 minutes)

Take one citation from Test 1 or Test 2 and read the judgment yourself - at least the relevant passage the tool references. Check whether the proposition the tool attributed to the case is what the judgment actually says.

Retrieval systems can still mischaracterise holdings even when the underlying case is real. The citation may be genuine; the summary may be subtly wrong in a way that matters. This test separates a tool that retrieves accurately from a tool that retrieves but misreads.
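
If the tool quotes a passage, a small helper can speed up the first half of this check by locating the quote in the judgment text you have opened. The sketch below is a rough aid, not a substitute for reading: the function names and the sample text are invented for illustration, and an exact-match search will miss paraphrases entirely.

```python
import re
from typing import Optional

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_quote(judgment_text: str, quoted: str, context: int = 60) -> Optional[str]:
    """Return the quoted passage with surrounding context, or None if it is absent."""
    body = normalise(judgment_text)
    needle = normalise(quoted)
    start = body.find(needle)
    if start == -1:
        return None
    return body[max(0, start - context): start + len(needle) + context]

# Placeholder text standing in for a judgment copied from a PDF.
sample_judgment = "The appeal is dismissed.\nThe notice was\n  served within time."

print(find_quote(sample_judgment, "notice was served within time"))
print(find_quote(sample_judgment, "notice was served out of time"))
```

Finding the quote tells you only that the words appear in the document. Whether they form part of the holding, a dissent, or a summary of counsel's argument is something you still have to read for.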

Test 4: the pressure test (5 minutes)

Ask the tool a question that is almost certainly outside its corpus. An unreported tribunal decision from a minor state forum, or a question about very recent subordinate legislation. Observe the behaviour.

A grounded tool with honest design will say it cannot find relevant authority - and that is the correct answer. A generation-heavy tool will produce an answer anyway, often with a citation that looks authoritative. That response pattern is a red flag regardless of how plausible the output looks.

Test 5: the source transparency test (5 minutes)

Set the legal question aside and look at how the tool presents its answers. Can you see the source documents? Can you click through to the underlying judgment? Does the tool distinguish between “this answer is grounded in a retrieved document” and “this answer is based on general knowledge”?

Transparency about sourcing is a design choice that reflects the tool developer’s values. A tool that surfaces its sources is a tool built by people who expect their answers to be verified. A tool that buries sources or does not show them is asking you to trust the output without giving you the means to check it.
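
If you are comparing several tools, it helps to record the five results in the same structure each time. The following Python sketch is one possible way to do that; the class name, field names, and the sample outcomes are invented for illustration, and the time budgets are the ones given for each test above.

```python
from dataclasses import dataclass

@dataclass
class VettingResult:
    name: str
    minutes: int   # time budget from the test plan
    passed: bool
    note: str = ""

def summarise(results: list) -> dict:
    """Total the time spent and list the tests that raised a red flag."""
    failed = [r.name for r in results if not r.passed]
    return {
        "total_minutes": sum(r.minutes for r in results),
        "failed": failed,
        "red_flags": len(failed),
    }

# Sample outcomes for a hypothetical tool under evaluation.
results = [
    VettingResult("Test 1: known landmark", 5, True),
    VettingResult("Test 2: obscure point", 10, True, "declined on a niche question"),
    VettingResult("Test 3: proposition check", 5, False, "summary overstated the holding"),
    VettingResult("Test 4: pressure test", 5, True),
    VettingResult("Test 5: source transparency", 5, True),
]

summary = summarise(results)
print(summary)
```

Keeping the note field filled in matters more than the pass or fail value: "summary overstated the holding" is the detail you will want when you re-vet the tool later.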

Green flags and red flags: a vetting checklist

What you observe	Green flag	Red flag
Source links in every response	Clickable links to real documents	No links, or links to search pages only
Behaviour on obscure queries	“I cannot find adequate grounding”	Confident citation every time
Citation format	Links to a specific judgment	Neutral citation string with no verification path
Source transparency	Shows retrieved passages with page context	Answer only, no source
Corpus disclosure	States what is indexed and when it was last updated	Vague claims about “millions of documents”
Good-law status	Flags overruled or distinguished authorities	No indication of citation status
Hallucination disclosure	Documents a known limitation honestly	Claims zero hallucination without qualification
Proposition accuracy (Tests 1 and 3)	Proposition matches the actual holding	Plausible but subtly incorrect characterisation
Test 4 behaviour	Declines to answer, suggests alternative sources	Generates a citation for an unreported case
Interface	Separate “generated” vs “retrieved” labelling	Single response stream, no sourcing distinction

What to do when the tool cannot find an answer

A grounded tool saying “I cannot find a well-sourced answer to this question” is doing you a favour. It is telling you that your research is not done and that you need to go to a primary source.

The correct response is not to switch to a general-purpose chatbot to fill the gap. The correct response is to move to your primary sources: the court’s judgment portal, a verified reporter, a law library.

The combination of a grounded retrieval tool for the bulk of your research and direct primary-source verification for the gaps is the most defensible workflow available to an Indian lawyer today. It is faster than full manual research. It is safer than relying entirely on generated output.

For the full workflow, see our guide to AI legal research in India and the companion piece on how to cite Indian judgments correctly.

The Mata v. Avianca moment and what it means for Indian lawyers

In 2023, a US federal district court in Mata v. Avianca sanctioned lawyers who had filed a brief containing citations generated by ChatGPT. The cases did not exist. The lawyers had not verified them. The court was explicit: professional responsibility obligations do not yield because the error was caused by an AI tool.

The judgment made global news. It also produced a predictable response from legal AI vendors: everyone added a disclaimer. Disclaimers do not change the underlying architecture. A tool that generates rather than retrieves will produce fabricated citations whether or not the terms of service tell you to verify.

The Mata situation is not hypothetical for Indian practitioners. Indian courts have noticed AI-generated content in pleadings. The consequences have not yet reached the same public profile as Mata, but the judicial attitude toward unverified AI citations is consistent: you filed it, you own it.

The best protection is not a disclaimer. It is using a tool that retrieves real documents and lets you open them - and then verifying before you cite.

See our breakdown of what Indian courts and bar rules say about AI tool use for the current regulatory picture.

How Niyam handles citation grounding

Niyam.ai is built on retrieval-augmented generation over a corpus of 72,000+ Indian judgments. The architecture is designed so that answers are grounded in documents that were actually retrieved, not generated from statistical inference about what a judgment might say.

Every response that cites a case links to the actual judgment in Niyam’s corpus. You can open it. You can read the relevant passage. You can check that the proposition matches the holding. This is not a feature - it is the baseline requirement for a legal research tool that takes citation accuracy seriously.

Niyam also includes a citator that checks retrieved authorities against subsequent decisions - flagging cases that have been overruled, distinguished, or qualified. Good-law status is not something you should have to verify manually for every citation; the tool should surface it.

When Niyam cannot find a sufficiently grounded answer, it says so. It does not fill the gap with generated text that looks like a citation.

For lawyers who want to understand how this compares to the research workflow in a traditional service, the solutions overview walks through the practical differences.

Practical workflow for ongoing use

Vetting a tool once before adoption is not the same as maintaining accuracy discipline in day-to-day use. A few practices that hold up over time:

Keep a verification habit for every citation you intend to file. Even with a grounded tool, the proposition check from Test 3 above should become routine. The tool retrieves; you verify the characterisation before it goes into a document. This is not extra work imposed by a weak tool. It is basic professional practice adapted to a new research environment.

Notice when your questions are drifting toward generation territory. Asking a grounded tool for a legal opinion rather than legal research is asking it to do something it is not designed to do. “What does the law say about X” is a research question. “What should my argument be in this matter” is a judgment question that no tool should answer for you unsupervised.

Check good-law status as a separate step, not an afterthought. A judgment that supported a proposition when it was decided may no longer be good authority. A citator check should be a fixed step before any citation goes into a client-facing document. See our detailed guide on good-law checking for Indian practitioners.

Track what the tool could not find. If your research regularly hits the “insufficient grounding” response for a particular area of law, that is a signal that the corpus is thin in that area. Know the gaps and fill them from primary sources.
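
Tracking the gaps does not need anything elaborate. As a minimal sketch, assuming you jot down the area of law each time the tool declines (the list below is invented sample data, and the threshold of two is an arbitrary choice), a simple tally shows where the corpus looks thin:

```python
from collections import Counter

# Invented sample log: the area of law for each query the tool declined.
declined = [
    "consumer forum procedure",
    "stamp duty",
    "consumer forum procedure",
    "arbitration",
    "consumer forum procedure",
    "stamp duty",
]

def thin_areas(areas: list, threshold: int = 2) -> list:
    """Return (area, count) pairs declined at least `threshold` times, most frequent first."""
    counts = Counter(areas)
    return [(area, n) for area, n in counts.most_common() if n >= threshold]

print(thin_areas(declined))
```

Areas that keep appearing in this list are the ones where you should plan to start from primary sources rather than from the tool.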

For a comparison of how different tools handle these workflows, the best AI legal research tools in India overview lays out the landscape honestly.

Frequently asked questions

What does “grounded” mean when a legal AI claims to be grounded?

Grounded means the answer is composed from documents that were actually retrieved from a real corpus, with citations pointing back to those documents. The opposite is pure generation, where the model produces text based on statistical inference from training data. A grounded tool should let you open the source document it retrieved.

Can a grounded tool still hallucinate?

Yes. Retrieval narrows the risk significantly but does not eliminate it. A grounded tool can retrieve a real case and still mischaracterise the holding. It can retrieve the wrong case for a given proposition. And any question that falls outside the indexed corpus may still receive a generated response if the tool is not designed to decline gracefully.

How do I tell whether a tool is truly RAG or just claims to be?

Run Test 5 from the vetting plan above: look for source links in the response. If clicking a citation opens the actual judgment, retrieval is working. If citations are not linked or resolve to a search page rather than a document, the tool is likely generating and decorating the output with citation-shaped text.

Are ChatGPT and Gemini safe for legal research?

They are safe for general orientation - understanding a statutory scheme, getting a plain-language explanation of a legal concept. They are not safe for citation work. Neither has access to a verified Indian judgment corpus, and neither will tell you when a citation it produces is fabricated. For the full analysis, see ChatGPT for lawyers in India.

What is the riskiest type of question to ask a general-purpose AI?

Narrow, specific questions about case law. The more specific the question, the more confidently the model will produce a citation-shaped answer, and the harder it is to detect that the citation is fabricated from surface inspection alone.

What counts as a “real” corpus for Indian law?

A corpus that includes the full text of Supreme Court and High Court judgments, is updated regularly, and discloses its coverage and update frequency. Vague claims about “millions of documents” without specifics are a red flag. Niyam’s corpus is 72,000+ Indian judgments with source attribution on every retrieved document.

Does corpus size matter more than corpus quality?

Quality matters more. A corpus of 72,000 well-indexed, accurately attributed Indian judgments is more useful than a corpus of millions of poorly parsed documents that include duplicate, unofficial, or incorrectly attributed content. Ask the vendor what they have done about deduplication and metadata accuracy.

Should I verify every citation even from a grounded tool?

Yes - specifically, verify the proposition-to-holding match before any citation goes into a client-facing document or court filing. The citation itself being real does not guarantee the characterisation is accurate. The proposition check from Test 3 of the vetting plan should be a fixed step in your workflow.

What should I do if a tool declines to answer my question?

Treat it as useful information. The tool is telling you that it cannot ground an answer in its corpus. Move to primary sources: the court’s judgment portal, the official gazette for legislative questions, or a law library for unreported or older authorities.

Is it enough to paste a disclaimer that “AI was used” in my filing?

No. Disclosure does not substitute for verification. Informing the court that AI assisted the research does not protect you if a citation is fabricated or mischaracterised. The professional responsibility obligation is to the accuracy of the citation itself.

Can a legal AI tool replace a law firm library subscription?

For the majority of routine research - recent judgments, well-reported propositions, standard procedural questions - a good grounded tool covers significant ground. It does not replace a subscription for unreported decisions, historical research, or areas of law with thin online coverage. Know the gaps in your tool’s corpus.

How often should I re-vet a tool I already use?

A useful interval is every six months, or whenever the vendor makes a significant update. Run a shorter version of the test plan - Test 1 and Test 3 are sufficient for a maintenance check - to confirm that the update has not introduced new generation-heavy behaviour.

What is the Mata v. Avianca case and why does it matter?

Mata v. Avianca was a 2023 US federal court case in which lawyers who submitted a brief with fabricated AI-generated citations were sanctioned. The court held that professional responsibility obligations are not suspended by the use of an AI tool. It is not binding outside its own jurisdiction, but it is the most widely cited illustration of the personal professional risk of unverified AI citations.

Are Indian courts taking the same position as US courts on AI citations?

The trend is in the same direction. Indian courts have noticed AI-generated content in pleadings and the judicial attitude toward unverified AI citations is consistent with the Mata reasoning: the lawyer who filed the document is responsible for its accuracy. See what Indian courts and bar rules say about AI tool use.

What is a citator and why does it matter for citation accuracy?

A citator checks whether a judgment you intend to cite is still good law - whether it has been overruled, distinguished, or qualified by a later decision. Retrieving a real case that has since been overruled is still a citation error. A tool with a citator layer surfaces this information at research time rather than leaving it to manual follow-up. See how good-law checking works.

What is the difference between a citation being “real” and a citation being “accurate”?

A real citation is one where the case exists and the link resolves to a genuine judgment. An accurate citation also means the proposition the tool attributed to the case is what the judgment actually says. Both need to be true. Test 1 checks reality; Test 3 checks accuracy.

How should I explain AI-assisted research to a client?

Be specific about what the tool does. “I used a retrieval-grounded legal AI tool that searches a corpus of 72,000+ Indian judgments and provides source links for every citation” is a precise description. “I used AI to help with research” is too vague to be meaningful about the quality of the work.

Does Niyam work for High Court research, or only Supreme Court?

Niyam’s corpus includes High Court judgments alongside Supreme Court authorities. The coverage extends across Indian courts. For the current corpus scope and update frequency, the research solutions page has the latest information.

What if the tool’s answer looks right but the source link is broken?

A broken source link is a red flag regardless of how plausible the answer looks. If the document cannot be opened, you cannot perform the proposition check, which means you cannot verify the citation. Do not file a citation you cannot verify, even if the surrounding answer looks credible.

What is the single most important question to ask a legal AI vendor?

“Can you show me a response where the tool says it cannot find an answer?” A tool that declines gracefully on questions outside its corpus is a tool designed for accuracy. A tool that always produces a confident answer is a tool you should treat with significant caution.

Start researching on verified ground

The test plan in this post takes 30 minutes. It will tell you more about a legal AI tool’s reliability than any amount of marketing material.

Citation accuracy is not a differentiator - it is a baseline requirement. A tool that invents plausible-looking citations is not a research tool. It is a liability.

Niyam.ai is built on 72,000+ Indian judgments. Every answer links to a real document you can open. When the corpus does not cover a question adequately, the tool says so.

Start with a ₹100 trial - 200 credits to research real questions with real citations, cancel anytime.

Start for ₹100 or write to us at [email protected] with questions about corpus coverage or enterprise access.
