August 13, 2025

Table Recognition now parses ~1,500-cell tables (with structure preserved)

The new model is live: it reliably extracts very large, dense tables from PDFs (including scans) while preserving header hierarchy, row/column spans, and cell boundaries, with fast HTML/CSV export and per-cell bounding boxes (bbox) for citations.

Key Highlights

  • Robust on ~1,500-cell tables; resilient to complex layouts and scanned documents.
  • Preserves header hierarchy and row/column spans; faithful HTML outputs.
  • Improved cell boundary detection and multi-row/multi-col header parsing.
  • Per-cell bounding boxes (bbox) for page-level citations and overlays.
  • Faster long-table parsing with fewer fallback OCR passes.
  • Works out of the box via TableOutputMode/TableParsingFormat (no breaking changes).

What’s new

Our Table Recognition model got a significant upgrade. It now reliably parses very large, dense tables (e.g., ~1,500 cells in a single table) that typically break VLMs and most OCR pipelines. The sample we used internally is a healthcare report table (patient safety indicators in California) with multi-row headers and wide column spans.

Why it matters

  • Big, busy tables are common in regulatory, healthcare, and finance PDFs.
  • Losing structure leads to bad retrieval, broken joins, and wrong analytics.
  • This release preserves cell grid, header hierarchy, and spans so you can export clean HTML and keep bbox coordinates for citeable answers.

Highlights

  • Better header detection for multi-row, multi-column headers (see the example after this list)
  • Robust row/column span recovery on wide tables
  • Improved cell boundary accuracy on scans
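
For a concrete sense of what preserved header hierarchy buys you, here is a minimal, self-contained sketch. The HTML below is hypothetical (not actual model output), but it shows how a two-row header with row/column spans survives into a pandas MultiIndex instead of being flattened away.

from io import StringIO

import pandas as pd

# Hypothetical HTML resembling the model's table output: "Hospital" spans both
# header rows and "Patient Safety Indicator" spans two columns.
html = """
<table>
  <thead>
    <tr><th rowspan="2">Hospital</th><th colspan="2">Patient Safety Indicator</th></tr>
    <tr><th>Observed</th><th>Expected</th></tr>
  </thead>
  <tbody>
    <tr><td>Example General</td><td>12</td><td>9.4</td></tr>
  </tbody>
</table>
"""

# pandas.read_html keeps the two header rows as MultiIndex columns,
# so the header hierarchy stays queryable downstream.
df = pd.read_html(StringIO(html), flavor="lxml")[0]
print(df.columns)

If the grid or spans are lost upstream, the same call misaligns columns or merges header levels, which is exactly the failure mode described above.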

How to use

This works out of the box when parsing any document that contains tables.

You can see an example in this Colab notebook.

page-classes.py

from io import StringIO

import pandas as pd

# Tensorlake Document AI SDK imports (exact module paths may vary by SDK version)
from tensorlake.documentai import DocumentAI, PageFragmentType

doc_ai = DocumentAI()

result = doc_ai.parse_and_wait(
    file="https://tlake.link/blog/dense-tables",
)

for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == PageFragmentType.TABLE:
            table = fragment.content.html
            # pandas.read_html can parse a single table string
            df = pd.read_html(StringIO(str(table)), flavor="lxml")[0].fillna("")
            print(f"Table found on page {page.page_number} at {fragment.bbox}:")
            print(df)
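
Since the Key Highlights also mention CSV export, the DataFrame built in the loop above can be written straight out. A minimal sketch; the file-naming pattern is an arbitrary choice for this example, not part of the SDK.

# Inside the TABLE branch of the loop above: persist each table as CSV.
# The file name pattern below is illustrative only.
df.to_csv(f"table_page_{page.page_number}.csv", index=False)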

Tips

  • Keep tables atomic when chunking (don’t split a single table across chunks).
  • Attach the page number + bbox to chunk metadata so you can render page previews in answers (a minimal sketch follows this list).
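
Here is a minimal sketch of that chunk metadata, reusing the fields from the parsing snippet above. The chunks list and the metadata key names are illustrative choices, not SDK output.

# Build one chunk per table, keeping each table atomic and carrying the
# page number and bbox needed to render a page preview next to an answer.
# The "chunks" list and metadata key names are illustrative, not SDK output.
chunks = []
for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == PageFragmentType.TABLE:
            chunks.append({
                "text": str(fragment.content.html),
                "metadata": {
                    "page_number": page.page_number,
                    "bbox": fragment.bbox,
                },
            })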

Known limitations

  • Extremely degraded scans (low DPI, heavy skew) may still need pre-deskewing.
  • Rotated tables are supported, but nested tables inside footnotes may require a second pass.

Status

✅ Live now. No config changes required beyond TableOutputMode / TableParsingFormat.

We’d love reports

Send tricky tables (wide spans, nested headers, tiny fonts). They directly drive our next round of improvements.
