Fixed: Citation filtering now respects page classification limits
Fixed a bug where citations ignored page classification filtering. Citations now reference only the pages you're actually extracting from.
Key Highlights
- Citations now correctly respect page classification boundaries
- Cleaner results with no citations pointing to irrelevant page content
- Better RAG pipeline accuracy with properly scoped citations
What's new
Fixed a bug where structured extraction with citations enabled would ignore page classification filtering. Previously, when you limited extraction to specific page classes (e.g., only transactions pages), citations would still reference content from all pages. Now citations correctly respect page classification boundaries.
Why it matters
- Accurate citations - citations now only reference the pages you're actually extracting from
- Cleaner results - no more citations pointing to irrelevant page content
- Expected behavior - page filtering works consistently whether citations are on or off
- Better RAG pipelines - citations align with your intended extraction scope
The bug
When using both page classification filtering AND citations:
```python
# This configuration should only extract from "transactions" pages
structured_extraction = StructuredExtractionConfig(
    schema=transaction_schema,
    page_classes=["transactions"],  # Only extract from transaction pages
    enable_citations=True,
)
```

Before (bug): Citations could reference content from account_info or summary pages
After (fixed): Citations only reference content from transactions pages
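
To make the before/after difference concrete, here is a minimal sketch of the scoping rule in plain Python. It is not the service's implementation: the `Citation` class, the `filter_citations` helper, and the page-number-to-class mapping are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: an extracted field and the page it cites."""
    field: str
    page_number: int


def filter_citations(citations, page_class_by_number, allowed_classes):
    """Keep only citations whose page belongs to one of the allowed classes."""
    allowed = set(allowed_classes)
    return [
        c for c in citations
        if page_class_by_number.get(c.page_number) in allowed
    ]


# Hypothetical classification of a three-page bank statement.
page_class_by_number = {1: "account_info", 2: "transactions", 3: "summary"}

# Before the fix, citations like these could come from any page.
citations = [
    Citation("account_holder", 1),
    Citation("amount", 2),
    Citation("closing_balance", 3),
]

# After the fix, only citations on "transactions" pages remain.
scoped = filter_citations(citations, page_class_by_number, ["transactions"])
```

With this sample data, `scoped` holds only the citation on page 2, the single page classified as `transactions`.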
How to use
No code changes needed. Existing configurations now work as expected.
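
If you want to confirm the behavior on your own documents, a small audit along these lines may help. It is a sketch under assumptions: the dictionary shapes and the `out_of_scope_citations` helper are hypothetical, so adapt the field names to whatever your parsed results actually contain.

```python
def out_of_scope_citations(citations, page_class_by_number, allowed_classes):
    """Return citations that point at pages outside the allowed classes."""
    allowed = set(allowed_classes)
    return [
        c for c in citations
        if page_class_by_number.get(c["page_number"]) not in allowed
    ]


# Hypothetical data shaped like a parsed extraction result.
page_class_by_number = {1: "account_info", 2: "transactions", 3: "summary"}
citations = [
    {"field": "amount", "page_number": 2},
    {"field": "date", "page_number": 2},
]

leaks = out_of_scope_citations(citations, page_class_by_number, ["transactions"])
assert not leaks, f"citations outside the extraction scope: {leaks}"
```

An empty `leaks` list means every citation falls on a page within the classes you asked to extract from.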
Impact
This fix makes page classification filtering behave the same way whether citations are on or off, which improves the reliability of citation-based RAG systems.
Status
✅ Fixed and live. No configuration changes required.