August 11, 2025

DocumentAI API v2

V2 of the DocumentAI API is fully in production in the Python SDK and on the Playground, offering unified document processing with advanced structured extraction, page classification, and enrichment capabilities.

Key Highlights

Unified Parse and Jobs API
Advanced Structured Extraction with JSON Schema
Page Classification and Signature Detection
Table and Chart Summarization
Enhanced Document Layout Analysis

API v2 Summary

Tensorlake API v2 provides a unified interface for extracting structured data from all supported document formats. It combines document parsing, structured extraction, and enrichment in a single endpoint, so a multi-step document workflow no longer needs to be stitched together from separate calls.

Core Capabilities

Document Ingestion: Upload and process files up to 1GB in size, supporting PDF, Word documents (DOCX), Excel spreadsheets (XLS, XLSX, XLSM), PowerPoint presentations (PPTX), images (PNG, JPG, JPEG), CSV files, HTML, and plain text.

Unified Processing: Submit documents via file upload, public URL, or raw text content with a single API endpoint that handles all processing operations.

Flexible Output: Convert documents to markdown with intelligent chunking strategies, extract structured data using custom schemas, and classify pages into categories.

Structured Data Extraction

We're excited to introduce advanced schema extraction capabilities that allow you to extract structured data from any document using JSON Schema definitions.

Invoice Processing Example

Define schemas for extracting structured information from business documents:

{
  "title": "Invoice",
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string"},
    "date": {"type": "string", "format": "date"},
    "vendor": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "total": {"type": "number"}
        }
      }
    },
    "total_amount": {"type": "number"}
  }
}
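Once an extraction with the Invoice schema returns, a small consistency check can catch misread numbers before the data reaches downstream systems. The sketch below is an illustration only: the `check_invoice` helper and the sample values are hypothetical, and it assumes the extracted result is available as a plain dictionary shaped like the schema.

```python
import math


def check_invoice(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of human-readable problems found in an extracted invoice."""
    problems = []
    line_items = invoice.get("line_items", [])

    # Each line item's total should equal quantity * unit_price.
    for index, item in enumerate(line_items):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(f"line item {index}: quantity * unit_price != total")

    # The line item totals should add up to the invoice total.
    items_sum = sum(item["total"] for item in line_items)
    if not math.isclose(items_sum, invoice["total_amount"], abs_tol=tolerance):
        problems.append("line item totals do not add up to total_amount")

    return problems


# Hypothetical extraction result, for illustration only.
sample = {
    "invoice_number": "INV-2024-001",
    "line_items": [
        {"description": "Widget", "quantity": 10, "unit_price": 100.0, "total": 1000.0},
        {"description": "Setup fee", "quantity": 1, "unit_price": 250.0, "total": 250.0},
    ],
    "total_amount": 1250.0,
}
print(check_invoice(sample))
```

Real invoices often include tax, shipping, or discounts that this schema does not capture, so a reported mismatch is better treated as a flag for review than as proof of an extraction error.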

Contract Analysis Example

Extract key terms and parties from legal documents:

{
  "title": "Contract",
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" },
          "address": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "expiration_date": { "type": "string", "format": "date" },
    "key_terms": {
      "type": "array",
      "items": { "type": "string" }
    },
    "governing_law": { "type": "string" },
    "signatures_required": { "type": "boolean" }
  }
}
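In JSON Schema, "format": "date" denotes a full date in YYYY-MM-DD form, which Python's standard library can parse directly. The sketch below shows one way the two date fields of the Contract schema might be used after extraction. The `contract_is_active` helper and the sample values are hypothetical, and the code assumes the extracted strings actually follow that format.

```python
from datetime import date


def contract_is_active(contract: dict, on: date) -> bool:
    """Return True if `on` falls between effective_date and expiration_date, inclusive."""
    start = date.fromisoformat(contract["effective_date"])
    end = date.fromisoformat(contract["expiration_date"])
    if end < start:
        # A reversed range usually points to an extraction or data-entry problem.
        raise ValueError("expiration_date precedes effective_date")
    return start <= on <= end


# Hypothetical extraction result, for illustration only.
contract = {
    "effective_date": "2025-01-01",
    "expiration_date": "2025-12-31",
    "governing_law": "Delaware",
}
print(contract_is_active(contract, date(2025, 8, 11)))
```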

API Usage

Extract structured data using the unified parse endpoint:

curl -X POST https://api.tensorlake.ai/documents/v2/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_12345",
    "structured_extraction_options": [{
      "schema_name": "invoice_data",
      "json_schema": {
        "type": "object",
        "properties": {
          "invoice_number": { "type": "string" },
          "total_amount": { "type": "number" }
        }
      }
    }]
  }'
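For callers who prefer not to shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch rather than the SDK's own interface: `build_parse_request` is a hypothetical helper, and the URL, headers, and body fields are copied from the curl example. The request is only constructed here, not sent.

```python
import json
import urllib.request

PARSE_URL = "https://api.tensorlake.ai/documents/v2/parse"


def build_parse_request(api_key: str, file_id: str, schema_name: str,
                        json_schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    body = {
        "file_id": file_id,
        "structured_extraction_options": [
            {"schema_name": schema_name, "json_schema": json_schema}
        ],
    }
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}
request = build_parse_request("YOUR_API_KEY", "file_12345", "invoice_data", invoice_schema)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it, which requires a valid API key and file ID.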

Page Classification

Classify document pages into categories for better organization and processing:

{
  "page_classifications": [
    {
      "name": "invoice",
      "description": "Pages containing invoice information with line items and totals"
    },
    {
      "name": "contract_terms",
      "description": "Pages containing contract terms and conditions"
    },
    {
      "name": "signature_page",
      "description": "Pages containing signatures and execution information"
    }
  ]
}

Document Enhancement

Table and Chart Summarization

Automatically generate summaries of complex tables and visual elements:

{
  "enrichment_options": {
    "table_summarization": true,
    "table_summarization_prompt": "Provide a concise summary of the key data points in this table",
    "figure_summarization": true,
    "figure_summarization_prompt": "Describe the main insights from this chart or diagram"
  }
}

Signature Detection

Detect and locate signatures within documents using specialized computer vision models:

{
  "parsing_options": {
    "signature_detection": true
  }
}
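The option fragments shown so far (page_classifications, enrichment_options, parsing_options) each appear as a standalone JSON object. The sketch below assumes they can be merged as top-level keys of one request body alongside file_id, which the unified endpoint suggests but the text does not state explicitly. `build_parse_body` is a hypothetical helper, not part of the SDK.

```python
import json


def build_parse_body(file_id, page_classes=None, summarize_tables=False,
                     summarize_figures=False, detect_signatures=False):
    """Assemble a parse request body, including only the options that were requested."""
    body = {"file_id": file_id}

    if page_classes:
        # page_classes maps a class name to its natural-language description.
        body["page_classifications"] = [
            {"name": name, "description": description}
            for name, description in page_classes.items()
        ]

    enrichment = {}
    if summarize_tables:
        enrichment["table_summarization"] = True
    if summarize_figures:
        enrichment["figure_summarization"] = True
    if enrichment:
        body["enrichment_options"] = enrichment

    if detect_signatures:
        body["parsing_options"] = {"signature_detection": True}

    return body


body = build_parse_body(
    "file_12345",
    page_classes={"signature_page": "Pages containing signatures and execution information"},
    summarize_tables=True,
    detect_signatures=True,
)
print(json.dumps(body, indent=2))
```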

Advanced Features

Document Layout Analysis

Get detailed document structure information including bounding boxes for all elements:

Page Fragments: Text blocks, tables, images, charts with precise coordinates
Layout Detection: Automatic identification of document structure and hierarchy
Cross-Page Headers: Detection of headers that span multiple pages

Flexible Input Methods

File Upload: Upload documents directly to Tensorlake storage (up to 1GB)
URL Processing: Process documents from public URLs with automatic download
Raw Text: Extract structured data from text content, emails, HTML, or CSV

Intelligent Chunking

Multiple chunking strategies for different use cases:

None: Return full document content
Semantic: Chunk by logical document sections
Fixed-size: Split into consistent token lengths
Custom: Define your own chunking parameters
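To make the fixed-size strategy concrete, here is a minimal sketch of the general idea. It is not Tensorlake's implementation: it treats whitespace-separated words as a stand-in for tokens, and `fixed_size_chunks` is a hypothetical name.

```python
def fixed_size_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


print(fixed_size_chunks("one two three four five", 2))
# → ['one two', 'three four', 'five']
```

Every chunk except possibly the last has exactly `max_tokens` words. Semantic chunking, by contrast, would place boundaries at section breaks rather than at a fixed count.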

Response Format

Successful parse operations return comprehensive results:

{
  "parse_id": "parse_abcd1234",
  "status": "successful",
  "chunks": [
    { "content": "Document text chunk 1" },
    { "content": "Document text chunk 2" }
  ],
  "structured_data": {
    "invoice_data": {
      "invoice_number": "INV-2024-001",
      "total_amount": 1250.00
    }
  },
  "document_layout": {
    "pages": [
      {
        "page_number": 1,
        "page_fragments": [
          {
            "fragment_type": "text",
            "bbox": { "x1": 100, "y1": 200, "x2": 400, "y2": 250 }
          }
        ]
      }
    ]
  },
  "page_classes": [
    { "page": 1, "classification": "invoice" }
  ]
}
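A response of this shape can be consumed with ordinary dictionary access. The sketch below groups pages by classification and computes fragment sizes from the bounding boxes. The helper names are hypothetical, the sample is an abbreviated copy of the response shown here, and the size calculation assumes (x1, y1) and (x2, y2) are opposite corners with x2 > x1 and y2 > y1, which the response format does not spell out.

```python
import json
from collections import defaultdict

# Abbreviated copy of the sample response, for illustration only.
response = json.loads("""
{
  "status": "successful",
  "structured_data": {
    "invoice_data": {"invoice_number": "INV-2024-001", "total_amount": 1250.00}
  },
  "document_layout": {
    "pages": [
      {
        "page_number": 1,
        "page_fragments": [
          {"fragment_type": "text",
           "bbox": {"x1": 100, "y1": 200, "x2": 400, "y2": 250}}
        ]
      }
    ]
  },
  "page_classes": [{"page": 1, "classification": "invoice"}]
}
""")


def pages_by_class(parsed: dict) -> dict[str, list[int]]:
    """Map each classification name to the list of page numbers assigned to it."""
    grouped = defaultdict(list)
    for entry in parsed.get("page_classes", []):
        grouped[entry["classification"]].append(entry["page"])
    return dict(grouped)


def fragment_sizes(parsed: dict) -> list[tuple]:
    """Return (page_number, fragment_type, width, height) for every page fragment."""
    sizes = []
    for page in parsed["document_layout"]["pages"]:
        for fragment in page["page_fragments"]:
            box = fragment["bbox"]
            sizes.append((
                page["page_number"],
                fragment["fragment_type"],
                box["x2"] - box["x1"],
                box["y2"] - box["y1"],
            ))
    return sizes


print(response["structured_data"]["invoice_data"]["invoice_number"])
print(pages_by_class(response))
print(fragment_sizes(response))
```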

Migration and Compatibility

API v2 maintains backward compatibility while introducing powerful new capabilities:

Unified Endpoint: Single /documents/v2/parse endpoint replaces multiple v1 endpoints
Enhanced Error Handling: Detailed error messages and status tracking
Improved Performance: Faster processing with optimized document analysis
Better Scaling: Handle larger documents and more complex schemas

API v2 is available now in the Python SDK and Playground, ready for production workloads requiring sophisticated document understanding and structured data extraction.

Schemas can also be defined as Pydantic models. The ResearchPaperMetadata model used in the example below includes these fields:

from typing import List
from pydantic import BaseModel, Field

# Author and Conference are Pydantic models whose definitions are not shown here.
class ResearchPaperMetadata(BaseModel):
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated to a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()

Usage Example

Extract structured data from documents using your custom schemas:

from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
    document_id="doc_123",
    schema=ResearchPaperMetadata
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"

Supported Formats

PDF documents
Word documents (.docx, .doc)
Spreadsheets (XLSX, XLSM, XLS, CSV)
Images (PNG, JPG)
Presentations (PPTX, Keynote)
HTML pages
Plain text files

API Reference

# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_123",
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "authors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "affiliation": { "type": "string" }
            }
          }
        }
      }
    }
  }'
