March 15, 2024

Advanced Schema Extraction

Extract structured data from any document using Pydantic schemas with improved accuracy and multi-format support

Key Highlights

  • Research paper metadata extraction
  • Pydantic schema support
  • Multi-format document support
  • Improved accuracy with structured outputs

Structured Data Extraction

We're excited to introduce advanced schema extraction capabilities that allow you to extract structured data from any document using Pydantic schemas.

Research Paper Schema Example

Define complex schemas for extracting structured information from research papers:

```python
from pydantic import BaseModel, Field
from typing import List

class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
    """Conference or journal information"""
    name: str = Field(description="Name of the conference or journal")
    year: str = Field(description="Year of publication")
    location: str = Field(description="Location of the conference or journal publication")

class ResearchPaperMetadata(BaseModel):
    """Complete schema for extracting research paper information"""
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated with a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()
```
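Before sending a schema to the API, it can help to inspect what `model_json_schema()` actually produces. A minimal sketch using a trimmed-down `Author` model (plain Pydantic v2, nothing Tensorlake-specific):

```python
from pydantic import BaseModel, Field

class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

# model_json_schema() returns a plain dict in standard JSON Schema shape
schema = Author.model_json_schema()

print(schema["type"])                               # "object"
print(sorted(schema["properties"]))                 # ['affiliation', 'name']
print(schema["properties"]["name"]["description"])  # "Full name of the author"
```

The `description` strings on each `Field` are carried through into the generated JSON schema, which is what guides the extraction.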

Usage Example

Extract structured data from documents using your custom schemas:

```python
from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
    document_id="doc_123",
    schema=ResearchPaperMetadata,
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"
```
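Because the schema is an ordinary Pydantic model, the same model can revalidate a raw JSON payload, e.g. one returned by the HTTP API directly, via `model_validate`. A self-contained sketch (the payload below is illustrative sample data, not real API output):

```python
from typing import List
from pydantic import BaseModel, Field

class Author(BaseModel):
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
    name: str = Field(description="Name of the conference or journal")
    year: str = Field(description="Year of publication")
    location: str = Field(description="Location of the conference or journal publication")

class ResearchPaperMetadata(BaseModel):
    authors: List[Author]
    conference_journal: Conference
    title: str

# Illustrative raw payload, shaped like the structured extraction result
raw = {
    "title": "Deep Learning for Natural Language Processing",
    "authors": [{"name": "John Doe", "affiliation": "Example University"}],
    "conference_journal": {"name": "NeurIPS 2024", "year": "2024", "location": "Vancouver"},
}

# model_validate raises a ValidationError if the payload doesn't match the schema
paper = ResearchPaperMetadata.model_validate(raw)
print(paper.authors[0].name)  # "John Doe"
```

Validation failures surface as `pydantic.ValidationError`, so a malformed extraction is caught before it reaches downstream code.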

Supported Formats

  • PDF documents
  • Word documents (.docx, .doc)
  • Markdown files
  • HTML pages
  • Plain text files

API Reference

```bash
# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_123",
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "authors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "affiliation": { "type": "string" }
            }
          }
        }
      }
    }
  }'
```
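Rather than hand-writing the JSON schema in the request body, you can generate it from the Pydantic model and serialize the payload in Python. A sketch, assuming the same `document_id` and request shape as the curl example above:

```python
import json
from typing import List
from pydantic import BaseModel

class Author(BaseModel):
    name: str
    affiliation: str

class ResearchPaperMetadata(BaseModel):
    title: str
    authors: List[Author]

# Build the request body for POST /v2/extract-schema
payload = json.dumps({
    "document_id": "doc_123",
    "schema": ResearchPaperMetadata.model_json_schema(),
})

body = json.loads(payload)
print(body["schema"]["title"])  # "ResearchPaperMetadata"
```

Note that Pydantic places nested models such as `Author` under a `$defs` key with `$ref` pointers, which is standard JSON Schema and equivalent to the inlined form shown in the curl example.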
