September 8, 2025

Document Ingestion now supports XML, DOC, and Markdown files

Document ingestion now supports XML, legacy DOC, and Markdown files with the same parsing capabilities as existing formats.

Key Highlights

  • Native XML parsing for config files and structured data exports
  • Legacy DOC file support for older document repositories
  • Markdown processing for documentation and technical specs

What's new

Document ingestion now handles three additional file formats: XML documents, legacy DOC files (Microsoft Word 97-2003), and native Markdown files. These join our existing support for PDF, DOCX, images, and other formats in a unified parsing pipeline.

Why it matters

  • XML files are everywhere in enterprise workflows (config files, data exports, structured documents)
  • Legacy DOC files still turn up in older systems and document repositories
  • Markdown files are standard for documentation, README files, and technical specs
  • Unified processing means fewer custom preprocessing steps in your pipeline

Highlights

  • Native parsing preserves document structure and metadata
  • Same extraction and classification capabilities as other formats
  • Automatic format detection - no manual format specification required (see the sketch after this list)
  • Full compatibility with structured extraction and summarization features
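
Because detection happens at upload time, a folder of mixed XML, DOC, and Markdown files can flow through one loop with no per-format branching. The sketch below is illustrative: it reuses only the upload and parse_and_wait calls from the How to use section, and the inbox path is a placeholder.

from pathlib import Path

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI()

# Placeholder folder holding a mix of .xml, .doc, and .md files
inbox = Path("/path/to/inbox")

results = {}
for file_path in inbox.iterdir():
    if file_path.suffix.lower() not in {".xml", ".doc", ".md"}:
        continue
    # The same two calls handle every format; no format flag is passed
    file_id = doc_ai.upload(path=str(file_path))
    results[file_path.name] = doc_ai.parse_and_wait(file_id)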

How to use

Works automatically when you upload any of these file types. No configuration changes needed.

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI()

# All of these now work seamlessly
xml_file_id = doc_ai.upload(path="/path/to/config.xml")
xml_result = doc_ai.parse_and_wait(xml_file_id)

doc_file_id = doc_ai.upload(path="/path/to/legacy_report.doc")
doc_result = doc_ai.parse_and_wait(doc_file_id)

md_file_id = doc_ai.upload(path="/path/to/README.md")
md_result = doc_ai.parse_and_wait(md_file_id)
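
Downstream handling is unchanged: the results above feed structured extraction and summarization the same way as PDFs or DOCX files. The short sketch below shows one way to read back the parsed content; the chunks and content attribute names are assumptions about the parse result model and may differ across SDK versions.

# Assumption: the parse result exposes extracted content as a list of
# chunks, each carrying a `content` field; verify against the result
# model returned by your SDK version.
for name, result in [("config.xml", xml_result),
                     ("legacy_report.doc", doc_result),
                     ("README.md", md_result)]:
    print(f"--- {name} ---")
    for chunk in result.chunks:
        print(chunk.content)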

Status

✅ Live now. All existing parsing features work across the new formats.
