September 8, 2025

Document Ingestion now supports XML, DOC, and Markdown files

Document ingestion now supports XML, legacy DOC, and Markdown files with the same parsing capabilities as existing formats.

Key Highlights

  • Native XML parsing for config files and structured data exports
  • Legacy DOC file support for older document repositories
  • Markdown processing for documentation and technical specs

What's new

Document ingestion now handles three additional file formats: XML documents, legacy DOC files (Microsoft Word 97-2003), and native Markdown files. These join our existing support for PDF, DOCX, images, and other formats in a unified parsing pipeline.

Why it matters

  • XML files are everywhere in enterprise workflows (config files, data exports, structured documents)
  • Legacy DOC files still turn up in older systems and document repositories
  • Markdown files are standard for documentation, README files, and technical specs
  • Unified processing means fewer custom preprocessing steps in your pipeline

Highlights

  • Native parsing preserves document structure and metadata
  • Same extraction and classification capabilities as other formats
  • Automatic format detection - no manual format specification required (see the sketch after this list)
  • Full compatibility with structured extraction and summarization features
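
Because detection happens at upload time, a folder of mixed XML, DOC, and Markdown files can flow through one loop with no per-format branching. The sketch below is illustrative: it reuses only the upload and parse_and_wait calls from the How to use section, and the inbox path is a placeholder.

from pathlib import Path

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI()

# Placeholder folder holding a mix of .xml, .doc, and .md files
inbox = Path("/path/to/inbox")

results = {}
for file_path in inbox.iterdir():
    if file_path.suffix.lower() not in {".xml", ".doc", ".md"}:
        continue
    # The same two calls handle every format; no format flag is passed
    file_id = doc_ai.upload(path=str(file_path))
    results[file_path.name] = doc_ai.parse_and_wait(file_id)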

How to use

Works automatically when you upload any of these file types. No configuration changes needed.

from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI()

# All of these now work seamlessly
xml_file_id = doc_ai.upload(path="/path/to/config.xml")
xml_result = doc_ai.parse_and_wait(xml_file_id)

doc_file_id = doc_ai.upload(path="/path/to/legacy_report.doc")
doc_result = doc_ai.parse_and_wait(doc_file_id)

md_file_id = doc_ai.upload(path="/path/to/README.md")
md_result = doc_ai.parse_and_wait(md_file_id)
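
Downstream handling is unchanged: the results above feed structured extraction and summarization the same way as PDFs or DOCX files. The short sketch below shows one way to read back the parsed content; the chunks and content attribute names are assumptions about the parse result model and may differ across SDK versions.

# Assumption: the parse result exposes extracted content as a list of
# chunks, each carrying a `content` field; verify against the result
# model returned by your SDK version.
for name, result in [("config.xml", xml_result),
                     ("legacy_report.doc", doc_result),
                     ("README.md", md_result)]:
    print(f"--- {name} ---")
    for chunk in result.chunks:
        print(chunk.content)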

Status

✅ Live now. All existing parsing features work across the new formats.
