September 19, 2025

Fixed: Citation filtering now respects page classification limits

Fixed a bug where citations ignored page classification filtering. Citations now reference only the pages you're actually extracting from.

Key Highlights

Citations now correctly respect page classification boundaries
Cleaner results with no citations pointing to irrelevant page content
Better RAG pipeline accuracy with properly scoped citations

What's new

Fixed a bug where structured extraction with citations enabled would ignore page classification filtering. Previously, when you limited extraction to specific page classes (e.g., only transactions pages), citations would still reference content from all pages. Now citations correctly respect page classification boundaries.

Why it matters

Accurate citations - citations now only reference the pages you're actually extracting from
Cleaner results - no more citations pointing to irrelevant page content
Expected behavior - page filtering works consistently whether citations are on or off
Better RAG pipelines - citations align with your intended extraction scope

The bug

When using both page classification filtering AND citations:

# This configuration should only extract from "transactions" pages
structured_extraction = StructuredExtractionConfig(
    schema=transaction_schema,
    page_classes=["transactions"],  # Only extract from transaction pages
    enable_citations=True
)

Before (bug): Citations could reference content from account_info or summary pages
After (fixed): Citations only reference content from transactions pages
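
To make the before/after difference concrete, here is a minimal, self-contained sketch of the scoping rule. It does not use the Tensorlake SDK: the `Citation` class, the `scope_citations` helper, and the page-number-to-class mapping are all hypothetical names invented for illustration, and the real response format may look different.

```python
from dataclasses import dataclass


# Hypothetical shape for illustration only; not a Tensorlake SDK type.
@dataclass(frozen=True)
class Citation:
    field: str        # extracted field the citation supports
    page_number: int  # 1-based page the cited content sits on


def scope_citations(citations, page_class_by_number, allowed_classes):
    """Keep only citations whose page belongs to one of the allowed classes."""
    allowed = set(allowed_classes)
    return [
        c for c in citations
        if page_class_by_number.get(c.page_number) in allowed
    ]


# A three-page statement, classified page by page (assumed example data).
page_class_by_number = {1: "account_info", 2: "transactions", 3: "summary"}

# Citations as they might have looked before the fix: some point at
# pages outside the requested "transactions" class.
citations = [
    Citation("amount", 2),
    Citation("amount", 3),
    Citation("date", 1),
    Citation("date", 2),
]

scoped = scope_citations(citations, page_class_by_number, ["transactions"])
out_of_scope = [c for c in citations if c not in scoped]

print("in scope:    ", [(c.field, c.page_number) for c in scoped])
print("out of scope:", [(c.field, c.page_number) for c in out_of_scope])
```

A check like the `out_of_scope` list could also be used to audit results stored before the fix, assuming your stored citations carry page numbers and you kept the page classifications.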

How to use

No code changes needed. Existing configurations now work as expected.

Impact

This fix makes page classification filtering behave the same way whether citations are on or off, and improves the reliability of citation-based RAG systems.

Status

✅ Fixed and live. No configuration changes required.
