March 15, 2024
Advanced Schema Extraction
Extract structured data from any document using Pydantic schemas with improved accuracy and multi-format support
Key Highlights
- Research paper metadata extraction
- Pydantic schema support
- Multi-format document support
- Improved accuracy with structured outputs
Structured Data Extraction
Advanced schema extraction lets you describe the fields you need as a Pydantic schema and extract matching structured data from documents in any supported format.
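As a minimal sketch of the idea, independent of the Tensorlake client: a Pydantic model serves both as a JSON schema that describes the fields to extract and as a validator for data returned in that shape. The `PaperSummary` model and its fields are hypothetical and exist only for illustration; the calls assume Pydantic v2, which is where `model_json_schema()` is available.

```python
from pydantic import BaseModel, Field, ValidationError


class PaperSummary(BaseModel):
    """Hypothetical two-field schema, used only for illustration."""
    title: str = Field(description="Title of the paper")
    page_count: int = Field(description="Number of pages")


# 1. The model doubles as a machine-readable JSON schema.
schema = PaperSummary.model_json_schema()
print(sorted(schema["properties"]))  # ['page_count', 'title']
print(schema["required"])            # ['title', 'page_count']

# 2. Data shaped like the schema validates into typed fields.
summary = PaperSummary.model_validate({"title": "An Example Paper", "page_count": 12})
print(summary.page_count + 1)  # 13

# 3. Data that does not fit the schema is rejected instead of passed through.
try:
    PaperSummary.model_validate({"title": "Missing page count"})
except ValidationError as exc:
    print(exc.error_count())  # 1
```

The `description` text on each field ends up in the generated schema, so it is worth writing it as guidance on where the value appears in the document.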
Research Paper Schema Example
Define complex schemas for extracting structured information from research papers:
from pydantic import BaseModel, Field
from typing import List

class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
    """Conference or journal information"""
    name: str = Field(description="Name of the conference or journal")
    year: str = Field(description="Year of publication")
    location: str = Field(description="Location of the conference or journal publication")

class ResearchPaperMetadata(BaseModel):
    """Complete schema for extracting research paper information"""
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated to a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()

Usage Example
Extract structured data from documents using your custom schemas:
from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
    document_id="doc_123",
    schema=ResearchPaperMetadata
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"

Supported Formats
- PDF documents
- Word documents (.docx, .doc)
- Markdown files
- HTML pages
- Plain text files
API Reference
# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_123",
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "authors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "affiliation": { "type": "string" }
            }
          }
        }
      }
    }
  }'
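For callers who prefer Python over curl, the same request can be assembled with the standard library alone. This is a sketch, not the official client: the endpoint, headers, and body mirror the curl example, while the helper name `build_extract_request` is hypothetical. The request is only built, not sent, because sending requires a real API key and document ID, and the response format is not described in this changelog.

```python
import json
import urllib.request

# Endpoint taken from the curl example.
API_URL = "https://api.tensorlake.ai/v2/extract-schema"


def build_extract_request(api_key: str, document_id: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"document_id": document_id, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Same JSON schema as in the curl request body.
paper_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "affiliation": {"type": "string"},
                },
            },
        },
    },
}

request = build_extract_request("YOUR_API_KEY", "doc_123", paper_schema)
print(request.get_method(), request.full_url)

# To actually send the request (requires valid credentials):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test before any network call is made.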