September 19, 2025

Fixed: Citation filtering now respects page classification limits

Fixed a bug where citations ignored page classification filtering. Citations now reference only the pages you're actually extracting from.

Key Highlights

  • Citations now correctly respect page classification boundaries
  • Cleaner results with no citations pointing to irrelevant page content
  • Better RAG pipeline accuracy with properly scoped citations

What's new

Fixed a bug where structured extraction with citations enabled would ignore page classification filtering. Previously, when you limited extraction to specific page classes (e.g., only transactions pages), citations would still reference content from all pages. Now citations correctly respect page classification boundaries.

Why it matters

  • Accurate citations - citations now only reference the pages you're actually extracting from
  • Cleaner results - no more citations pointing to irrelevant page content
  • Expected behavior - page filtering works consistently whether citations are on or off
  • Better RAG pipelines - citations align with your intended extraction scope

The bug

When using both page classification filtering AND citations:

# This configuration should only extract from "transactions" pages
structured_extraction = StructuredExtractionConfig(
  schema=transaction_schema,
  page_classes=["transactions"],  # Only extract from transaction pages
  enable_citations=True
)

Before (bug): Citations could reference content from account_info or summary pages
After (fixed): Citations only reference content from transactions pages
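
The before/after difference can be illustrated with a small, self-contained sketch. This is not Tensorlake's implementation: `Page`, `Citation`, and `filter_citations` are hypothetical names invented for illustration, and the sketch assumes only that each citation records the page it came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    number: int
    page_class: str  # e.g. "transactions", "account_info", "summary"


@dataclass(frozen=True)
class Citation:
    field: str        # extracted field the citation supports
    page_number: int  # page the cited content came from


def filter_citations(citations, pages, page_classes):
    """Keep only citations whose source page belongs to an allowed class."""
    allowed = {p.number for p in pages if p.page_class in page_classes}
    return [c for c in citations if c.page_number in allowed]


# Hypothetical document: four pages, two of them classified as transactions.
pages = [
    Page(1, "account_info"),
    Page(2, "transactions"),
    Page(3, "transactions"),
    Page(4, "summary"),
]
citations = [
    Citation("amount", 2),
    Citation("amount", 4),  # out of scope: summary page
    Citation("date", 3),
    Citation("payee", 1),   # out of scope: account_info page
]

scoped = filter_citations(citations, pages, ["transactions"])
print([(c.field, c.page_number) for c in scoped])
# prints [('amount', 2), ('date', 3)]
```

Before the fix, the behavior resembled returning `citations` unfiltered. After the fix, it resembles returning `scoped`.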

How to use

No code changes needed. Existing configurations now work as expected.
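
To confirm the fix against your own results, a check along these lines may help. The data shapes are assumptions made for illustration (a list of citation dicts with a `page_number` key, plus a mapping from page number to page class), and `out_of_scope_citations` is a hypothetical helper. Adapt both to the response structure you actually receive.

```python
def out_of_scope_citations(citations, page_class_by_number, page_classes):
    """Return citations that point at pages outside the requested classes."""
    allowed = set(page_classes)
    return [
        c for c in citations
        if page_class_by_number.get(c["page_number"]) not in allowed
    ]


# Hypothetical data for illustration only.
page_class_by_number = {1: "account_info", 2: "transactions", 3: "summary"}
citations = [{"field": "amount", "page_number": 2}]

offenders = out_of_scope_citations(
    citations, page_class_by_number, ["transactions"]
)
print(offenders)  # an empty list means every citation is in scope
```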

Impact

This fix ensures consistent behavior across all extraction features and improves the reliability of citation-based RAG systems.

Status

✅ Fixed and live. No configuration changes required.
