August 11, 2025

DocumentAI API v2

V2 of the DocumentAI API is fully in production in the Python SDK and on the Playground, offering unified document processing with advanced structured extraction, page classification, and enrichment capabilities.

Key Highlights

  • Unified Parse and Jobs API
  • Advanced Structured Extraction with JSON Schema
  • Page Classification and Signature Detection
  • Table and Chart Summarization
  • Enhanced Document Layout Analysis

API v2 Summary

Tensorlake API v2 represents a major evolution in document processing capabilities, providing a unified interface for extracting structured data from any document format. The new API combines document parsing, structured extraction, and enrichment into a single, powerful endpoint that can handle complex document workflows.

Core Capabilities

Document Ingestion: Upload and process files up to 1GB in size, supporting PDF, Word documents (DOCX), Excel spreadsheets (XLS, XLSX, XLSM), PowerPoint presentations (PPTX), images (PNG, JPG, JPEG), CSV files, HTML, and plain text.

Unified Processing: Submit documents via file upload, public URL, or raw text content with a single API endpoint that handles all processing operations.

Flexible Output: Convert documents to markdown with intelligent chunking strategies, extract structured data using custom schemas, and classify pages into categories.
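
To make this concrete, here is a minimal sketch of the flow in Python using the requests library: it submits a previously uploaded file to the unified parse endpoint and polls for the result. The endpoint and file_id field mirror the curl example later in this post; the polling route shown here is an assumption rather than a documented path.

# Minimal sketch of the unified flow (Python `requests`).
# POST /documents/v2/parse mirrors the curl example in this post;
# the GET route used for polling is an assumption.
import time

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.tensorlake.ai"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit a previously uploaded file for parsing.
submit = requests.post(
    f"{BASE_URL}/documents/v2/parse",
    headers=HEADERS,
    json={"file_id": "file_12345"},
)
parse_id = submit.json()["parse_id"]

# Poll until the job finishes (hypothetical retrieval route).
while True:
    result = requests.get(f"{BASE_URL}/documents/v2/parse/{parse_id}", headers=HEADERS).json()
    if result["status"] in ("successful", "failed"):
        break
    time.sleep(2)

print(result["status"])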

Structured Data Extraction

We're excited to introduce advanced schema extraction capabilities that allow you to extract structured data from any document using JSON Schema definitions.

Invoice Processing Example

Define schemas for extracting structured information from business documents:

{
  "title": "Invoice",
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string"},
    "date": {"type": "string", "format": "date"},
    "vendor": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "total": {"type": "number"}
        }
      }
    },
    "total_amount": {"type": "number"}
  }
}
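
If you prefer keeping schemas in code, the same invoice structure can be written as Pydantic models and converted with model_json_schema(), the approach used in the Pydantic example near the end of this post. This is a sketch: Pydantic emits nested models under $defs rather than inline, so the generated schema is equivalent but not byte-for-byte identical to the JSON above.

# Sketch: the invoice schema expressed as Pydantic models.
import datetime
from typing import List

from pydantic import BaseModel

class Vendor(BaseModel):
    name: str
    address: str

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: datetime.date  # serialized as {"type": "string", "format": "date"}
    vendor: Vendor
    line_items: List[LineItem]
    total_amount: float

# JSON Schema to pass as json_schema in structured_extraction_options
json_schema = Invoice.model_json_schema()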

Contract Analysis Example

Extract key terms and parties from legal documents:

{
  "title": "Contract",
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" },
          "address": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "expiration_date": { "type": "string", "format": "date" },
    "key_terms": {
      "type": "array",
      "items": { "type": "string" }
    },
    "governing_law": { "type": "string" },
    "signatures_required": { "type": "boolean" }
  }
}

API Usage

Extract structured data using the unified parse endpoint:

curl -X POST https://api.tensorlake.ai/documents/v2/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_12345",
    "structured_extraction_options": [{
      "schema_name": "invoice_data",
      "json_schema": {
        "type": "object",
        "properties": {
          "invoice_number": { "type": "string" },
          "total_amount": { "type": "number" }
        }
      }
    }]
  }'
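
For reference, the same request can be issued from Python with the requests library; the endpoint and payload below are copied directly from the curl example above.

# Sketch: the parse request above, sent with Python `requests`.
import requests

response = requests.post(
    "https://api.tensorlake.ai/documents/v2/parse",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "file_id": "file_12345",
        "structured_extraction_options": [{
            "schema_name": "invoice_data",
            "json_schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "total_amount": {"type": "number"},
                },
            },
        }],
    },
)
print(response.json())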

Page Classification

Classify document pages into categories for better organization and processing:

{
  "page_classifications": [
    {
      "name": "invoice",
      "description": "Pages containing invoice information with line items and totals"
    },
    {
      "name": "contract_terms",
      "description": "Pages containing contract terms and conditions"
    },
    {
      "name": "signature_page",
      "description": "Pages containing signatures and execution information"
    }
  ]
}

Document Enhancement

Table and Chart Summarization

Automatically generate summaries of complex tables and visual elements:

{
  "enrichment_options": {
    "table_summarization": true,
    "table_summarization_prompt": "Provide a concise summary of the key data points in this table",
    "figure_summarization": true,
    "figure_summarization_prompt": "Describe the main insights from this chart or diagram"
  }
}

Signature Detection

Detect and locate signatures within documents using specialized computer vision models:

{
  "parsing_options": {
    "signature_detection": true
  }
}

Advanced Features

Document Layout Analysis

Get detailed document structure information, including bounding boxes for all elements (a short consumption sketch follows this list):

  • Page Fragments: Text blocks, tables, images, charts with precise coordinates
  • Layout Detection: Automatic identification of document structure and hierarchy  
  • Cross-Page Headers: Detection of headers that span multiple pages
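
Here is a sketch of how that layout output might be consumed; the field names follow the Response Format example later in this post.

# Sketch: walking the document layout returned by a parse.
def print_fragments(parse_result: dict) -> None:
    for page in parse_result.get("document_layout", {}).get("pages", []):
        for fragment in page.get("page_fragments", []):
            bbox = fragment.get("bbox", {})
            print(
                f"page {page['page_number']}: {fragment['fragment_type']} at "
                f"({bbox.get('x1')}, {bbox.get('y1')})-({bbox.get('x2')}, {bbox.get('y2')})"
            )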

Flexible Input Methods

  • File Upload: Upload documents directly to Tensorlake storage (up to 1GB)
  • URL Processing: Process documents from public URLs with automatic download
  • Raw Text: Extract structured data from raw text content, emails, HTML, or CSV (all three input methods are sketched below)
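
The exact request fields for URL and raw-text input are not shown in this post; the sketch below uses hypothetical file_url and raw_text fields purely to illustrate that all three input methods target the same parse endpoint, while file_id matches the documented example.

# Sketch only: `file_url` and `raw_text` are hypothetical field names;
# `file_id` matches the documented parse example.
upload_body = {"file_id": "file_12345"}                    # previously uploaded file
url_body = {"file_url": "https://example.com/report.pdf"}  # hypothetical field name
text_body = {"raw_text": "Invoice INV-2024-001 ..."}       # hypothetical field name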

Intelligent Chunking

Multiple chunking strategies for different use cases:

  • None: Return full document content
  • Semantic: Chunk by logical document sections
  • Fixed-size: Split into consistent token lengths
  • Custom: Define your own chunking parameters
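
How a strategy is selected on a request is not shown in this post; as a hedged sketch, it would plausibly live in the parsing options, with the chunking_strategy field name below being an assumption rather than a documented parameter.

# Sketch: selecting a chunking strategy on a parse request.
# `chunking_strategy` is an assumed field name, not a documented one.
request_body = {
    "file_id": "file_12345",
    "parsing_options": {
        "chunking_strategy": "semantic",  # or "none", "fixed_size", or a custom configuration
    },
}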

Response Format

Successful parse operations return comprehensive results:

{
  "parse_id": "parse_abcd1234",
  "status": "successful",
  "chunks": [
    { "content": "Document text chunk 1" },
    { "content": "Document text chunk 2" }
  ],
  "structured_data": {
    "invoice_data": {
      "invoice_number": "INV-2024-001",
      "total_amount": 1250.00
    }
  },
  "document_layout": {
    "pages": [
      {
        "page_number": 1,
        "page_fragments": [
          {
            "fragment_type": "text",
            "bbox": { "x1": 100, "y1": 200, "x2": 400, "y2": 250 }
          }
        ]
      }
    ]
  },
  "page_classes": [
    { "page": 1, "classification": "invoice" }
  ]
}
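
A short sketch of reading the structured data and page classifications out of a result shaped like the one above:

# Sketch: reading fields out of a parse result shaped like the example above.
def summarize_result(result: dict) -> None:
    if result["status"] != "successful":
        return
    invoice = result["structured_data"]["invoice_data"]
    print(invoice["invoice_number"], invoice["total_amount"])
    for page in result.get("page_classes", []):
        print(f"page {page['page']}: {page['classification']}")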

Migration and Compatibility

API v2 maintains backward compatibility while introducing powerful new capabilities:

  • Unified Endpoint: Single /documents/v2/parse endpoint replaces multiple v1 endpoints
  • Enhanced Error Handling: Detailed error messages and status tracking
  • Improved Performance: Faster processing with optimized document analysis
  • Better Scaling: Handle larger documents and more complex schemas

The API v2 is available now in the Python SDK and Playground, ready for production workloads requiring sophisticated document understanding and structured data extraction.

Extraction schemas can also be defined as Pydantic models and converted to JSON Schema for Tensorlake, as in this research-paper example:

from pydantic import BaseModel, Field
from typing import List

class ResearchPaperMetadata(BaseModel):
    # Author and Conference are nested Pydantic models (definitions omitted here)
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated to a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()

Usage Example

Extract structured data from documents using your custom schemas:

from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
  document_id="doc_123",
  schema=ResearchPaperMetadata
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"

Supported Formats

  • PDF documents
  • Word documents (.docx, .doc)
  • Spreadsheets (XLSX, XLSM, XLS, CSV)
  • Images (PNG, JPG)
  • Presentations (PPTX, Keynote)
  • HTML pages
  • Plain text files

API Reference

# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_123",
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "authors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "affiliation": { "type": "string" }
            }
          }
        }
      }
    }
  }'
