March 15, 2024

Advanced Schema Extraction

Extract structured data from any document using Pydantic schemas with improved accuracy and multi-format support

Key Highlights

  • Research paper metadata extraction
  • Pydantic schema support
  • Multi-format document support
  • Improved accuracy with structured outputs

Structured Data Extraction

We're excited to introduce advanced schema extraction capabilities that allow you to extract structured data from any document using Pydantic schemas.

Research Paper Schema Example

Define complex schemas for extracting structured information from research papers:

```python
from pydantic import BaseModel, Field
from typing import List

class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
    """Conference or journal information"""
    name: str = Field(description="Name of the conference or journal")
    year: str = Field(description="Year of publication")
    location: str = Field(description="Location of the conference or journal publication")

class ResearchPaperMetadata(BaseModel):
    """Complete schema for extracting research paper information"""
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated with a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()
```
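Before sending a schema to the API, it can help to inspect what `model_json_schema()` actually produces. A minimal sketch using a trimmed-down `Author` model (plain Pydantic v2, nothing Tensorlake-specific):

```python
from pydantic import BaseModel, Field

class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

# model_json_schema() returns a plain dict in standard JSON Schema shape
schema = Author.model_json_schema()

print(schema["type"])                               # "object"
print(sorted(schema["properties"]))                 # ['affiliation', 'name']
print(schema["properties"]["name"]["description"])  # "Full name of the author"
```

The `description` strings on each `Field` are carried through into the generated JSON schema, which is what guides the extraction.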

Usage Example

Extract structured data from documents using your custom schemas:

```python
from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
    document_id="doc_123",
    schema=ResearchPaperMetadata,
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"
```
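Because the schema is an ordinary Pydantic model, the same model can revalidate a raw JSON payload, e.g. one returned by the HTTP API directly, via `model_validate`. A self-contained sketch (the payload below is illustrative sample data, not real API output):

```python
from typing import List
from pydantic import BaseModel, Field

class Author(BaseModel):
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
    name: str = Field(description="Name of the conference or journal")
    year: str = Field(description="Year of publication")
    location: str = Field(description="Location of the conference or journal publication")

class ResearchPaperMetadata(BaseModel):
    authors: List[Author]
    conference_journal: Conference
    title: str

# Illustrative raw payload, shaped like the structured extraction result
raw = {
    "title": "Deep Learning for Natural Language Processing",
    "authors": [{"name": "John Doe", "affiliation": "Example University"}],
    "conference_journal": {"name": "NeurIPS 2024", "year": "2024", "location": "Vancouver"},
}

# model_validate raises a ValidationError if the payload doesn't match the schema
paper = ResearchPaperMetadata.model_validate(raw)
print(paper.authors[0].name)  # "John Doe"
```

Validation failures surface as `pydantic.ValidationError`, so a malformed extraction is caught before it reaches downstream code.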

Supported Formats

  • PDF documents
  • Word documents (.docx, .doc)
  • Markdown files
  • HTML pages
  • Plain text files

API Reference

```bash
# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_123",
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "authors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "affiliation": { "type": "string" }
            }
          }
        }
      }
    }
  }'
```
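Rather than hand-writing the JSON schema in the request body, you can generate it from the Pydantic model and serialize the payload in Python. A sketch, assuming the same `document_id` and request shape as the curl example above:

```python
import json
from typing import List
from pydantic import BaseModel

class Author(BaseModel):
    name: str
    affiliation: str

class ResearchPaperMetadata(BaseModel):
    title: str
    authors: List[Author]

# Build the request body for POST /v2/extract-schema
payload = json.dumps({
    "document_id": "doc_123",
    "schema": ResearchPaperMetadata.model_json_schema(),
})

body = json.loads(payload)
print(body["schema"]["title"])  # "ResearchPaperMetadata"
```

Note that Pydantic places nested models such as `Author` under a `$defs` key with `$ref` pointers, which is standard JSON Schema and equivalent to the inlined form shown in the curl example.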
