Verify Structured Output with Field-Level Citations

Sep 3, 2025
|
6
min read

TL;DR

With the new `provide_citations` parameter, Tensorlake can now return page numbers and bounding boxes for every extracted field. Engineers can build audit-ready pipelines where every number, name, or field in structured output links directly to the source document.

Missing evidence is one of the biggest blockers in production AI workflows.  

It’s not enough to say what a document claims, you need to show where in the source that claim came from. Whether you’re auditing bank statements, verifying medical referral forms, or investigating fraud, traceability is a hard requirement.

That’s why we’ve introduced a new parameter in Tensorlake’s StructuredExtractionOptions:  

1[.code-block-title]Code[.code-block-title]StructuredExtractionOptions(
2  schema_name="ExampleSchema",
3  json_schema=ExampleSchema,
4  provide_citations=True
5)

When provide_citations=True, every extracted field includes:

  • Page number
  • Bounding box (bbox) coordinates

This means structured outputs are no longer just machine-readable; they’re auditable, verifiable, and traceable back to the source document.

Traceable Context Means Trustworthy and Audit-Ready Workflows

In many workflows, “close enough” isn’t good enough. Teams need confidence that extracted values align with the document’s ground truth. Let’s look at where this matters most:

  • Banking & Finance: Auditors need to understand exactly which account, statement, or transaction produced a reported number. If an account balance doesn’t reconcile, citations let you trace back to the precise page and bounding box where the discrepancy originates. No more guesswork in backtracking totals.
  • Fraud Detection: When anomalies appear in reported values, bounding-box citations provide the evidence trail. Investigators can quickly verify whether a suspicious number came from an altered document, a duplicated entry, or a genuine filing.
  • Healthcare & Forms Processing: At UCLA, teams processing medical referral forms wanted faster verification of ground truth. With citations, a structured field (like “referral date” or “doctor’s signature”) can point directly to the page span and bounding box where it was found, cutting human review time dramatically.

In short:

Citations turn structured extraction into a compliance-grade tool.

Implement Citations with One Line of Code

Let’s take a simple example: extracting transaction summaries from a bank statement.

1[.code-block-title]Code[.code-block-title]from tensorlake.documentai import DocumentAI, StructuredExtractionOptions
2from pydantic import BaseModel, Field
3from typing import List
4
5class Transaction(BaseModel):
6  date: str = Field(description="Transaction date")
7  description: str = Field(description="Transaction description")
8  amount: float = Field(description="Transaction amount")
9
10class BankStatement(BaseModel):
11  transactions: List[Transaction]
12
13doc_ai = DocumentAI()
14
15structured_extraction_options = [
16  StructuredExtractionOptions(
17    schema_name="BankStatement",
18    json_schema=BankStatement,
19    provide_citations=True  # <-- new parameter
20  )
21]
22
23result = doc_ai.parse_and_wait(
24  file="https://tlake.link/documents/bank-statement",
25  structured_extraction_options=structured_extraction_options
26)
27
28print(result.structured_data[0].data)

The returned JSON now looks like this:

1[.code-block-title]Code[.code-block-title]"transactions": [
2{
3  "Date": "08/24",
4  "Date_citation": [
5    {
6      "page_number": 1,
7      "x1": 59,
8      "x2": 135,
9      "y1": 448,
10      "y2": 482
11    }
12  ],
13  "amount": "50.00",
14  "amount_citation": [
15    {
16      "page_number": 1,
17      "x1": 515,
18      "x2": 585,
19      "y1": 447,
20      "y2": 482
21    }
22  ],
23  "descriptions": "ATM CASH DEPOSIT, ***** 30073995581 AUT 082220 ATM CASH DEPOSIT 550 LONG BEACH BLVD LONG BEACH * NY",
24  "descriptions_citation": [
25    {
26      "page_number": 1,
27      "x1": 135,
28      "x2": 515,
29      "y1": 447,
30      "y2": 482
31    }
32  ]
33}

Each field is now annotated with a citation: the page number and bounding-box coordinates.

If you use our Tensorlake Cloud Playground, you can even get the visual bounding-boxes labeled for each extracted bit of information

Example of Tensorlake structured data extraction with citations, showing JSON output linked to highlighted fields on a TD Bank statement. A $50.00 transaction is mapped from the document to the JSON citation with bounding box coordinates.

From Data to Evidence

“In insurance, structured outputs power our workflows, but people still verify. With field-level citations, reviewers can jump from a data row straight to the exact COI or endorsement language. That’s the difference between ‘parsed’ and provable.”

Jesse McClure, CTO and Co-Founder, Sublynk

Citations aren’t just nice-to-have, our customers across industries know that they unlock new workflows:

  • Audit-ready outputs: Every number is backed by ground-truth evidence.
  • Automated review: Flag discrepancies automatically and point reviewers directly to the source.
  • Explainability in RAG/Agents: Don’t just return answers—return the highlighted document snippets.
  • UI Enhancements: Build document viewers that highlight the exact fields extracted.

The benefit is twofold: engineers can build more reliable systems and stakeholders (auditors, compliance teams, regulators) get confidence and transparency.

Try Structured Extraction Citations Now

You can try provide_citations=True today in both the Tensorlake Playground and the API/Python SDK.

If you have any questions or feedback, we'd love to hear from you! Join our Slack and let us know how you're using citations.

Traceability Built In

With the new provide_citations parameter, structured extraction becomes not only machine-readable but also evidence-backed.

Every field can now point back to its exact source location in the document, making Tensorlake the foundation for audit-ready, compliance-grade, and fraud-resistant AI workflows.

Start using it today. In production AI, traceability isn’t optional.

Related articles

No items found.

Get server-less runtime for agents and data ingestion

Data ingestion like never before.
TRUSTED BY PRO DEVS GLOBALLY

Tensorlake is the Agentic Compute Runtime the durable serverless platform that runs Agents at scale.

"At SIXT, we're building AI-powered experiences for millions of customers while managing the complexity of enterprise-scale data. TensorLake gives us the foundation we need—reliable document ingestion that runs securely in our VPC to power our generative AI initiatives."

Boyan Dimitrov
CTO, Sixt

“Tensorlake enabled us to avoid building and operating an in-house OCR pipeline by providing a robust, scalable OCR and document ingestion layer with excellent accuracy and feature coverage. Ongoing improvements to the platform, combined with strong technical support, make it a dependable foundation for our scientific document workflows.”

Yaroslav Sklabinskyi
CEO, Reliant AI

"For BindHQ customers, the integration with Tensorlake represents a shift from manual data handling to intelligent automation, helping insurance businesses operate with greater precision, and responsiveness across a variety of transactions"

Cristian Joe
CEO @ BindHQ

“Tensorlake let us ship faster and stay reliable from day one. Complex stateful AI workloads that used to require serious infra engineering are now just long-running functions. As we scale, that means we can stay lean—building product, not managing infrastructure.”

Arpan Bhattacharya
Founder & CEO @ The Intelligent Search Company