DocumentAI API v2
V2 of the DocumentAI API is fully in production in the Python SDK and on the Playground, offering unified document processing with advanced structured extraction, page classification, and enrichment capabilities.
Key Highlights
- Unified Parse and Jobs API
- Advanced Structured Extraction with JSON Schema
- Page Classification and Signature Detection
- Table and Chart Summarization
- Enhanced Document Layout Analysis
API v2 Summary
Tensorlake API v2 provides a unified interface for extracting structured data from the supported document formats. It combines document parsing, structured extraction, and enrichment in a single endpoint, so a multi-step document workflow can be expressed as one request.
Core Capabilities
Document Ingestion: Upload and process files up to 1GB in size, supporting PDF, Word documents (DOCX), Excel spreadsheets (XLS, XLSX, XLSM), PowerPoint presentations (PPTX), images (PNG, JPG, JPEG), CSV files, HTML, and plain text.
Unified Processing: Submit documents via file upload, public URL, or raw text content with a single API endpoint that handles all processing operations.
Flexible Output: Convert documents to markdown with intelligent chunking strategies, extract structured data using custom schemas, and classify pages into categories.
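As a rough sketch of what "a single endpoint" means in practice, the snippet below assembles one request body from the option objects that appear later on this page (`structured_extraction_options`, `page_classifications`, `enrichment_options`, `parsing_options`). The helper `build_parse_request` is hypothetical, and sending all of these keys together at the top level of one body is an assumption rather than documented behavior.

```python
import json


def build_parse_request(file_id, schema_name=None, json_schema=None,
                        page_classifications=None, summarize_tables=False,
                        detect_signatures=False):
    """Assemble one request body (hypothetical helper, not part of the SDK)."""
    body = {"file_id": file_id}
    if json_schema is not None:
        body["structured_extraction_options"] = [
            {"schema_name": schema_name, "json_schema": json_schema}
        ]
    if page_classifications:
        body["page_classifications"] = list(page_classifications)
    if summarize_tables:
        body["enrichment_options"] = {"table_summarization": True}
    if detect_signatures:
        body["parsing_options"] = {"signature_detection": True}
    return body


request_body = build_parse_request(
    file_id="file_12345",
    schema_name="invoice_data",
    json_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
        },
    },
    page_classifications=[
        {
            "name": "invoice",
            "description": "Pages containing invoice information with line items and totals",
        }
    ],
    detect_signatures=True,
)

# Serialize the body as it would be passed to curl's -d flag.
print(json.dumps(request_body, indent=2))
```

Options that are not requested are simply left out of the body, which keeps the payload close to the minimal curl example under API Usage.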
Structured Data Extraction
We're excited to introduce advanced schema extraction capabilities that allow you to extract structured data from any document using JSON Schema definitions.
Invoice Processing Example
Define schemas for extracting structured information from business documents:
    {
      "title": "Invoice",
      "type": "object",
      "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "vendor": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "address": {"type": "string"}
          }
        },
        "line_items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": {"type": "string"},
              "quantity": {"type": "number"},
              "unit_price": {"type": "number"},
              "total": {"type": "number"}
            }
          }
        },
        "total_amount": {"type": "number"}
      }
    }
Contract Analysis Example
Extract key terms and parties from legal documents:
    {
      "title": "Contract",
      "type": "object",
      "properties": {
        "parties": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "role": { "type": "string" },
              "address": { "type": "string" }
            }
          }
        },
        "effective_date": { "type": "string", "format": "date" },
        "expiration_date": { "type": "string", "format": "date" },
        "key_terms": {
          "type": "array",
          "items": { "type": "string" }
        },
        "governing_law": { "type": "string" },
        "signatures_required": { "type": "boolean" }
      }
    }
API Usage
Extract structured data using the unified parse endpoint:
    curl -X POST https://api.tensorlake.ai/documents/v2/parse \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "file_id": "file_12345",
        "structured_extraction_options": [{
          "schema_name": "invoice_data",
          "json_schema": {
            "type": "object",
            "properties": {
              "invoice_number": { "type": "string" },
              "total_amount": { "type": "number" }
            }
          }
        }]
      }'
Page Classification
Classify document pages into categories for better organization and processing:
    {
      "page_classifications": [
        {
          "name": "invoice",
          "description": "Pages containing invoice information with line items and totals"
        },
        {
          "name": "contract_terms",
          "description": "Pages containing contract terms and conditions"
        },
        {
          "name": "signature_page",
          "description": "Pages containing signatures and execution information"
        }
      ]
    }
Document Enhancement
Table and Chart Summarization
Automatically generate summaries of complex tables and visual elements:
    {
      "enrichment_options": {
        "table_summarization": true,
        "table_summarization_prompt": "Provide a concise summary of the key data points in this table",
        "figure_summarization": true,
        "figure_summarization_prompt": "Describe the main insights from this chart or diagram"
      }
    }
Signature Detection
Detect and locate signatures within documents using specialized computer vision models:
    {
      "parsing_options": {
        "signature_detection": true
      }
    }
Advanced Features
Document Layout Analysis
Get detailed document structure information including bounding boxes for all elements:
- Page Fragments: Text blocks, tables, images, charts with precise coordinates
- Layout Detection: Automatic identification of document structure and hierarchy
- Cross-Page Headers: Detection of headers that span multiple pages
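To illustrate how the bounding boxes might be consumed, here is a minimal sketch that walks a `document_layout` object shaped like the one under Response Format and computes the size of each fragment. The helper names are hypothetical, and it assumes that `x1`/`y1` and `x2`/`y2` are opposite corners with `x2 >= x1` and `y2 >= y1`; the page does not state the coordinate units.

```python
def bbox_size(bbox):
    """Return (width, height) for a bbox given as x1/y1/x2/y2 corners."""
    return bbox["x2"] - bbox["x1"], bbox["y2"] - bbox["y1"]


def fragments_of_type(document_layout, fragment_type):
    """Yield (page_number, fragment) pairs for one fragment type."""
    for page in document_layout.get("pages", []):
        for fragment in page.get("page_fragments", []):
            if fragment.get("fragment_type") == fragment_type:
                yield page["page_number"], fragment


# Sample layout copied from the Response Format example on this page.
layout = {
    "pages": [
        {
            "page_number": 1,
            "page_fragments": [
                {
                    "fragment_type": "text",
                    "bbox": {"x1": 100, "y1": 200, "x2": 400, "y2": 250},
                }
            ],
        }
    ]
}

for page_number, fragment in fragments_of_type(layout, "text"):
    width, height = bbox_size(fragment["bbox"])
    print(page_number, width, height)  # prints: 1 300 50
```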
Flexible Input Methods
- File Upload: Upload documents directly to Tensorlake storage (up to 1GB)
- URL Processing: Process documents from public URLs with automatic download
- Raw Text: Extract structured data from text content, emails, HTML, or CSV
Intelligent Chunking
Multiple chunking strategies for different use cases:
- None: Return full document content
- Semantic: Chunk by logical document sections
- Fixed-size: Split into consistent token lengths
- Custom: Define your own chunking parameters
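For intuition about the fixed-size strategy, the sketch below splits text into chunks of at most `max_tokens` tokens. It is not Tensorlake's implementation: whitespace-separated words stand in for real tokenizer tokens, and the function name `fixed_size_chunks` is made up for this example.

```python
def fixed_size_chunks(text, max_tokens):
    """Split text into chunks of at most max_tokens whitespace-separated tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


chunks = fixed_size_chunks("one two three four five", max_tokens=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Only the last chunk can be shorter than `max_tokens`, which is the property that makes this strategy predictable for downstream context-window budgeting.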
Response Format
Successful parse operations return comprehensive results:
    {
      "parse_id": "parse_abcd1234",
      "status": "successful",
      "chunks": [
        { "content": "Document text chunk 1" },
        { "content": "Document text chunk 2" }
      ],
      "structured_data": {
        "invoice_data": {
          "invoice_number": "INV-2024-001",
          "total_amount": 1250.00
        }
      },
      "document_layout": {
        "pages": [
          {
            "page_number": 1,
            "page_fragments": [
              {
                "fragment_type": "text",
                "bbox": { "x1": 100, "y1": 200, "x2": 400, "y2": 250 }
              }
            ]
          }
        ]
      },
      "page_classes": [
        { "page": 1, "classification": "invoice" }
      ]
    }
Migration and Compatibility
API v2 maintains backward compatibility while introducing powerful new capabilities:
- Unified Endpoint: A single /documents/v2/parse endpoint replaces multiple v1 endpoints
- Enhanced Error Handling: Detailed error messages and status tracking
- Improved Performance: Faster processing with optimized document analysis
- Better Scaling: Handle larger documents and more complex schemas
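Status tracking can be handled on the client side with a small check before reading results. The sketch below works on a response shaped like the Response Format example and groups pages by classification. The helper `summarize_parse_result` is hypothetical, and `"successful"` is the only status value shown on this page, so anything else is treated as a failure here by assumption.

```python
from collections import defaultdict


def summarize_parse_result(result):
    """Return (structured_data, pages_by_class), or raise if the parse did not succeed."""
    status = result.get("status")
    if status != "successful":
        raise RuntimeError(f"parse {result.get('parse_id')} has status {status!r}")
    pages_by_class = defaultdict(list)
    for entry in result.get("page_classes", []):
        pages_by_class[entry["classification"]].append(entry["page"])
    return result.get("structured_data", {}), dict(pages_by_class)


# Trimmed copy of the Response Format example on this page.
result = {
    "parse_id": "parse_abcd1234",
    "status": "successful",
    "structured_data": {
        "invoice_data": {"invoice_number": "INV-2024-001", "total_amount": 1250.00}
    },
    "page_classes": [{"page": 1, "classification": "invoice"}],
}

data, pages = summarize_parse_result(result)
print(data["invoice_data"]["invoice_number"])  # INV-2024-001
print(pages)  # {'invoice': [1]}
```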
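The API v2 is available now in the Python SDK and Playground, ready for production workloads requiring sophisticated document understanding and structured data extraction.
Schemas can also be defined as Pydantic models and converted to JSON Schema:
    from typing import List

    from pydantic import BaseModel, Field

    # Author and Conference are minimal definitions that cover only the
    # fields used in the examples on this page.
    class Author(BaseModel):
        name: str
        affiliation: str

    class Conference(BaseModel):
        name: str

    class ResearchPaperMetadata(BaseModel):
        authors: List[Author] = Field(
            description=(
                "List of authors with their affiliations. Authors will be listed "
                "below the title and above the main text of the paper. Authors will "
                "often be in multiple columns and there may be multiple authors "
                "associated to a single affiliation."
            )
        )
        conference_journal: Conference = Field(description="Conference or journal information")
        title: str = Field(description="Title of the research paper")

    # Convert to JSON schema for Tensorlake
    json_schema = ResearchPaperMetadata.model_json_schema()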
Usage Example
Extract structured data from documents using your custom schemas:
    from tensorlake import Client

    client = Client(api_key="your-api-key")

    # Extract metadata from a research paper
    result = client.extract_schema(
        document_id="doc_123",
        schema=ResearchPaperMetadata
    )

    print(result.title)
    # "Deep Learning for Natural Language Processing"

    print(result.authors[0].name)
    # "John Doe"

    print(result.conference_journal.name)
    # "NeurIPS 2024"
Supported Formats
- PDF documents
- Word documents (.docx, .doc)
- Spreadsheets (XLSX, XLSM, XLS, CSV)
- Images (PNG, JPG)
- Presentations (PPTX, Keynote)
- HTML pages
- Plain text files
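A client can screen files against this list before uploading. The following is a minimal sketch, not an official validation routine: the mapping from format names to extensions (for example `.html` and `.txt`) is an assumption, Keynote is left out because the list names no extension for it, and the 1 GB limit from Core Capabilities is interpreted as 1024^3 bytes.

```python
from pathlib import Path

# 1 GB upload limit; reading "GB" as 1024**3 bytes is an assumption.
MAX_FILE_BYTES = 1024 ** 3

# Extensions inferred from the format lists on this page (assumed mapping).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc",
    ".xlsx", ".xlsm", ".xls", ".csv",
    ".png", ".jpg", ".jpeg",
    ".pptx", ".html", ".txt",
}


def is_supported(path, size_bytes):
    """Return True if the extension is listed and the size is within the limit."""
    extension = Path(path).suffix.lower()
    return extension in SUPPORTED_EXTENSIONS and size_bytes <= MAX_FILE_BYTES


print(is_supported("Invoice.PDF", 2_000_000))   # True
print(is_supported("archive.zip", 2_000_000))   # False
```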
API Reference
    # Extract data using a custom schema
    curl -X POST https://api.tensorlake.ai/v2/extract-schema \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "document_id": "doc_123",
        "schema": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "authors": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": { "type": "string" },
                  "affiliation": { "type": "string" }
                }
              }
            }
          }
        }
      }'