The Problem With Standard Document Translation
Standard machine translation treats a document as a flat string of text. Feed it a PDF or a scanned page and it strips out the formatting, collapses the tables, loses the heading hierarchy and returns a wall of translated prose that no longer resembles the original. If the document has a two-column layout, a structured table, or a numbered list, that structure is gone.
DocTranslate solves this by splitting document translation into stages: first, understand what the document is (its structure, not just its text), then translate each structural element individually, and finally rebuild the original layout in the output.
How It Works
Stage 1: Vision-Based Structure Extraction
The uploaded image or PDF is sent to a vision-language model (Qwen3-VL-8B). The model does not just extract text — it analyses the visual layout of the document and returns a structured JSON representation of everything on the page.
The JSON schema distinguishes four section types:
- Headings — with hierarchy level (H1, H2, H3)
- Paragraphs — body text blocks
- Lists — bulleted or numbered items, each as a separate element
- Tables — with headers and every row and cell captured individually
The model also detects whether the document uses a two-column layout, which is preserved in the output. Sections are ordered as they appear on the page — top to bottom, left column before right column.
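To make the shape of that JSON concrete, here is a minimal sketch of what one extracted page might look like, together with a small validator. The field names (`type`, `level`, `text`, `items`, `headers`, `rows`, `two_column`) are assumptions chosen for illustration, not the actual DocTranslate schema.

```python
from typing import Any, Dict, List

# Hypothetical names for the four section types described above.
SECTION_TYPES = {"heading", "paragraph", "list", "table"}

# Illustrative page: sections are listed in reading order.
EXAMPLE_PAGE: Dict[str, Any] = {
    "two_column": False,
    "sections": [
        {"type": "heading", "level": 1, "text": "Rechnung"},
        {"type": "paragraph", "text": "Vielen Dank für Ihre Bestellung."},
        {"type": "list", "items": ["Lieferung frei Haus", "Zahlbar in 14 Tagen"]},
        {
            "type": "table",
            "headers": ["Artikel", "Menge"],
            "rows": [["Schrauben", "100"], ["Muttern", "250"]],
        },
    ],
}


def validate_document(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the page passed."""
    problems: List[str] = []
    if not isinstance(doc.get("two_column"), bool):
        problems.append("two_column must be a boolean")
    for i, section in enumerate(doc.get("sections", [])):
        kind = section.get("type")
        if kind not in SECTION_TYPES:
            problems.append(f"section {i}: unknown type {kind!r}")
        elif kind == "heading" and section.get("level") not in (1, 2, 3):
            problems.append(f"section {i}: heading level must be 1, 2 or 3")
        elif kind == "table":
            # Every row should have one cell per header column.
            width = len(section.get("headers", []))
            for r, row in enumerate(section.get("rows", [])):
                if len(row) != width:
                    problems.append(
                        f"section {i}: row {r} has {len(row)} cells, expected {width}"
                    )
    return problems
```

Checking the model's output before translation is a cheap guard: a table whose rows do not match its header width is easier to catch here than after it has been rendered.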
The extraction prompt is strict: transcribe only text that is clearly visible, copy exact wording without paraphrasing, and leave fields empty rather than guessing when text is obscured. This is the same principle used in the CrateVision vision pipeline — extract only what is actually there.
Stage 2: Element-Level Translation
Each structural element is translated individually by a 72B language model. The translation prompt auto-detects the source language and translates into the user-selected target, with a specific instruction for table cells and headings: keep translations concise and similar in length to the source, because translated text often runs longer than the original and cell content must stay readable inside a table.
Translating element by element — rather than the whole document as a single string — means the model has focused context for each piece of text. A table cell is translated as a table cell, not as part of a paragraph. A heading is translated at heading brevity. The structural intent of each element is preserved through the translation.
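The element-by-element idea can be sketched as a walk over the extracted structure that hands each piece of text to a translation function along with its element kind. This is an illustration only: `translate_document`, the `translate` callback and the dictionary field names are hypothetical, and the callback stands in for whatever call reaches the language model.

```python
from typing import Any, Callable, Dict

# (text, element_kind) -> translated text. A stand-in for the LLM call.
Translate = Callable[[str, str], str]


def translate_document(doc: Dict[str, Any], translate: Translate) -> Dict[str, Any]:
    """Return a translated copy of doc, leaving the input untouched."""

    def tr(text: str, kind: str) -> str:
        # Fields left empty by the extraction stage stay empty.
        return translate(text, kind) if text.strip() else text

    out: Dict[str, Any] = {
        "two_column": doc.get("two_column", False),
        "sections": [],
    }
    for section in doc["sections"]:
        kind = section["type"]
        new = dict(section)  # keeps non-text fields such as heading level
        if kind in ("heading", "paragraph"):
            new["text"] = tr(section["text"], kind)
        elif kind == "list":
            new["items"] = [tr(item, "list_item") for item in section["items"]]
        elif kind == "table":
            new["headers"] = [tr(h, "table_header") for h in section["headers"]]
            new["rows"] = [
                [tr(cell, "table_cell") for cell in row] for row in section["rows"]
            ]
        out["sections"].append(new)
    return out
```

Passing the element kind to the callback is what lets the prompt vary per element, for example asking for a short rendering when the kind is `table_cell` or `heading`.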
Stage 3: HTML Reconstruction
The translated JSON is rebuilt into a clean, styled HTML document that mirrors the original layout:
- Headings render at the correct hierarchy level
- Tables render with proper headers, striped rows and full-width layout
- Lists render as unordered lists with correct indentation
- Two-column documents render using CSS columns with a divider rule
The output is a self-contained HTML file, ready to view in the browser or download. It is also print-ready — the typography and layout are designed to work on paper as well as screen.
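A reconstruction step along these lines could look like the sketch below. It assumes a hypothetical dictionary structure with `type`, `text`, `items`, `headers`, `rows` and `two_column` fields, and the CSS is a minimal placeholder rather than the stylesheet DocTranslate actually ships.

```python
from html import escape
from typing import Any, Dict

# Inline CSS keeps the file self-contained; the values are illustrative.
CSS = (
    "body { font-family: Georgia, serif; max-width: 50rem; margin: 2rem auto; }"
    " .two-col { column-count: 2; column-gap: 2rem; column-rule: 1px solid #ccc; }"
    " table { width: 100%; border-collapse: collapse; }"
    " th, td { border: 1px solid #ccc; padding: 0.3rem 0.5rem; text-align: left; }"
    " tbody tr:nth-child(even) { background: #f5f5f5; }"
)


def render_section(section: Dict[str, Any]) -> str:
    """Render one structural element as an HTML fragment."""
    kind = section["type"]
    if kind == "heading":
        level = min(max(int(section.get("level", 1)), 1), 3)  # clamp to H1..H3
        return f"<h{level}>{escape(section['text'])}</h{level}>"
    if kind == "paragraph":
        return f"<p>{escape(section['text'])}</p>"
    if kind == "list":
        items = "".join(f"<li>{escape(item)}</li>" for item in section["items"])
        return f"<ul>{items}</ul>"
    if kind == "table":
        head = "".join(f"<th>{escape(h)}</th>" for h in section["headers"])
        rows = "".join(
            "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
            for row in section["rows"]
        )
        return f"<table><thead><tr>{head}</tr></thead><tbody>{rows}</tbody></table>"
    raise ValueError(f"unknown section type: {kind!r}")


def render_document(doc: Dict[str, Any]) -> str:
    """Build a complete HTML page with no external dependencies."""
    body = "\n".join(render_section(s) for s in doc["sections"])
    layout = "two-col" if doc.get("two_column") else "one-col"
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<style>" + CSS + "</style></head>\n"
        '<body><main class="' + layout + '">\n' + body + "\n</main></body></html>"
    )
```

Escaping every text field with `html.escape` matters here because the content comes from an arbitrary scanned document and may contain characters such as `<` or `&`.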
Supported Languages
DocTranslate supports 12 target languages: Italian, English, French, Spanish, German, Portuguese, Dutch, Polish, Russian, Chinese, Japanese and Arabic. The source language is detected automatically — no need to specify where the document came from.
PDF Support
PDFs are converted to images using pdftoppm before the vision stage. Up to three pages are processed, with the resulting page documents merged into a single translated output. This covers the most common use case: single-page letters, contracts, forms and certificates, as well as short multi-page documents.
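As a rough sketch of this step, the snippet below builds a `pdftoppm` command limited to the first three pages and merges per-page results into one document. The resolution, output prefix, function names and the rule for merging the two-column flag are all assumptions made for illustration; only the use of `pdftoppm` and the three-page limit come from the description above.

```python
import subprocess
from pathlib import Path
from typing import Any, Dict, List

MAX_PAGES = 3  # page limit stated above


def pdftoppm_command(pdf_path: str, out_prefix: str, dpi: int = 150) -> List[str]:
    """Build the pdftoppm call: PNG output, pages 1 through MAX_PAGES."""
    return [
        "pdftoppm", "-png",
        "-r", str(dpi),          # resolution in DPI (assumed value)
        "-f", "1",               # first page to convert
        "-l", str(MAX_PAGES),    # last page to convert
        pdf_path, out_prefix,
    ]


def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 150) -> List[Path]:
    """Run pdftoppm (must be installed) and return the page images in order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, str(out / "page"), dpi), check=True)
    return sorted(out.glob("page-*.png"))


def merge_pages(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Concatenate per-page sections in page order into a single document."""
    merged: Dict[str, Any] = {
        # Assumption: treat the result as two-column if any page is.
        "two_column": any(p.get("two_column", False) for p in pages),
        "sections": [],
    }
    for page in pages:
        merged["sections"].extend(page["sections"])
    return merged
```

Each image returned by `pdf_to_images` would go through the vision stage separately, and `merge_pages` would then combine the resulting page documents before translation or rendering.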
Who Uses This
Legal and Compliance Teams
Foreign contracts, court documents, regulatory filings and certificates arrive in a wide range of languages. DocTranslate produces a structured translation that preserves the document’s legal formatting — clauses, numbered articles, defined terms tables — rather than collapsing it into undifferentiated prose.
Businesses Receiving Foreign Documents
Supplier invoices, technical specifications, insurance documents, customs declarations — businesses regularly receive documents in languages their staff cannot read. DocTranslate handles the full range of document types because it works from the visual layout rather than assuming a specific document format.
Researchers and Academics
Academic papers, research reports and conference proceedings in foreign languages can be translated with their structure (section headings, figure captions, data tables) intact. The translated HTML can be read in the browser with the same navigable structure as the original.
Medical and Technical Documentation
Medical records, clinical trial documents and technical manuals carry structured information — dosage tables, procedure steps, specification lists — that must survive translation intact. DocTranslate preserves every row and column of a table through the translation pipeline.
The Three-View Output
The result is presented in three views:
- Preview: A live rendering of the translated HTML document inside the browser, immediately readable
- Extracted JSON: The raw structured representation of the document as the vision model understood it, useful for debugging or downstream processing
- HTML Source: The full HTML markup, inspectable and editable before download
A single-click download saves the translated HTML file locally. The output is self-contained — no external dependencies, no server required to view it.
Why This Approach Works
Most translation tools are text-in, text-out. They work well for plain prose but fail on structured documents because they cannot see what the document looks like — only what it says. DocTranslate starts from the visual representation, which means it sees the same thing a human reader sees: layout, hierarchy, tables, columns.
This is what vision-language models make possible. The same capability that lets CrateVision read a vinyl record label or powers document understanding in our PRAG system can be applied to any visually structured document. The translation quality then depends on the LLM — and a 72B model running locally produces output that is significantly better than standard machine translation for nuanced or domain-specific text.
If you need document translation integrated into a business workflow — automated, private, structure-preserving — get in touch. We build custom AI pipelines for document processing, translation and extraction at scale.