Why Vulgate uses TEI XML

How structured XML encoding makes your documents searchable, citable, and ready for scholarly work.

May 21, 2026

Every document that lands in Vulgate is encoded as TEI XML — the Text Encoding Initiative's XML format, the de facto standard for scholarly digital editions. You don't have to write or even see the XML yourself, but it underpins almost everything Vulgate does well.

What TEI XML is

TEI XML is a vocabulary of XML elements and attributes for describing the structure and semantics of a text: front matter, body, chapters, sections, marginal notes, footnotes, speaker labels, line breaks, page breaks, and many more. The TEI Guidelines were begun by humanities scholars in the late 1980s, moved from SGML to XML in the early 2000s, and are maintained today by the TEI Consortium, an international membership organization.

Unlike plain text, TEI XML preserves:

  • The hierarchy of a document (book → chapter → section → paragraph).
  • Metadata such as author, editor, publication date, edition.
  • Apparatus such as footnotes, marginalia, and editorial corrections.
  • Page and line numbers anchored to specific positions in the text.
  • Speaker turns in dialogues, sermons, and plays.
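To make the list above concrete, here is a minimal sketch of a TEI fragment and how a program can read its structure. The sample document, its xml:id values, and the outline helper are invented for illustration; Vulgate's actual encoding may use different elements and identifiers. Only the TEI namespace and the element names (teiHeader, div, head, p, note, pb) come from the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
NS = {"tei": TEI_NS}

# Invented sample: one chapter, two paragraphs, a footnote, and a page break.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample Sermons</title>
        <author>A. Preacher</author>
      </titleStmt>
      <publicationStmt><p>Illustrative sample only.</p></publicationStmt>
      <sourceDesc><p>Invented for this example.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter" xml:id="ch1">
        <head>Chapter 1</head>
        <pb n="12"/>
        <p xml:id="ch1-p1">First paragraph.<note place="foot" n="1">A footnote.</note></p>
        <p xml:id="ch1-p2">Second paragraph.</p>
      </div>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Metadata lives in the header, separate from the text itself.
title = root.findtext(".//tei:titleStmt/tei:title", namespaces=NS)
author = root.findtext(".//tei:titleStmt/tei:author", namespaces=NS)

# Apparatus and page anchors are ordinary elements inside the text.
footnotes = root.findall(".//tei:note[@place='foot']", NS)
pages = [pb.get("n") for pb in root.iterfind(".//tei:pb", NS)]


def outline(tei_root):
    """List (id, heading, paragraph count) for each top-level division."""
    rows = []
    for div in tei_root.iterfind(".//tei:body/tei:div", NS):
        heading = div.findtext("tei:head", default="", namespaces=NS)
        paragraphs = div.findall("tei:p", NS)
        rows.append((div.get(XML_ID), heading, len(paragraphs)))
    return rows


print(title, "by", author)
print(outline(root))
print(len(footnotes), "footnote(s); page breaks:", pages)
```

None of this information could be recovered reliably from the same text stored as a flat string, which is the point of the encoding.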

What you get because of it

This structural awareness is why Vulgate can do things plain-text tools can't:

  • Precise citations. Citations point to encoded sections in the TEI — typically shown as document title and nearest section heading — not approximate character ranges.
  • Section-aware chat. Chat and the in-document AI Assistant retrieve at the section level, which gives more focused, more accurate answers than chunking by character count.
  • Per-paragraph translation. Machine translations work on whole paragraphs because the paragraph boundary is encoded in the XML.
  • Stable bookmarks. Bookmarks point at structural anchors, not byte offsets, so they survive document re-ingestion or edits.
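The difference between a structural anchor and a byte offset can be sketched in a few lines. This is an illustration under stated assumptions, not Vulgate's implementation: the make_doc, resolve, and nearest_heading helpers and the xml:id scheme (s1, s1-p2) are hypothetical. The sketch edits an earlier paragraph, then shows that a saved character offset now points at the wrong text while the ID-based anchor still resolves, and that the same anchor yields a section heading suitable for a citation.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
NS = {"tei": TEI_NS}
TARGET = "The bookmarked paragraph."


def make_doc(first_paragraph):
    """Build a tiny TEI body; only the first paragraph varies (hypothetical IDs)."""
    return (
        f'<TEI xmlns="{TEI_NS}"><text><body><div xml:id="s1">'
        f"<head>On Grace</head>"
        f'<p xml:id="s1-p1">{first_paragraph}</p>'
        f'<p xml:id="s1-p2">{TARGET}</p>'
        f"</div></body></text></TEI>"
    )


def resolve(xml_text, anchor_id):
    """Return the text of the element carrying the given xml:id, or None."""
    for el in ET.fromstring(xml_text).iter():
        if el.get(XML_ID) == anchor_id:
            return "".join(el.itertext())
    return None


def nearest_heading(xml_text, anchor_id):
    """Return the heading of the innermost div with a head that contains the anchor."""
    heading = None
    # iter() walks in document order, so a later match is a more deeply nested div.
    for div in ET.fromstring(xml_text).iter(f"{{{TEI_NS}}}div"):
        if any(el.get(XML_ID) == anchor_id for el in div.iter()):
            found = div.findtext("tei:head", namespaces=NS)
            if found is not None:
                heading = found
    return heading


before = make_doc("Short.")
after = make_doc("A much longer revised opening paragraph.")

# A character offset saved against the old version drifts after the edit.
offset = before.index(TARGET)
print("offset still valid:", after[offset:offset + len(TARGET)] == TARGET)

# The structural anchor is unaffected by the edit to the earlier paragraph.
print("anchor resolves to:", resolve(after, "s1-p2"))
print("cite as section:", nearest_heading(after, "s1-p2"))
```

The anchor stays valid only as long as re-ingestion keeps the same identifier for the same element, so the sketch assumes IDs are preserved across versions.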

How the encoding is produced

The encoding happens during ingestion, completely automatically:

  1. Raw text is extracted (OCR for scans, native extraction for searchable PDFs and Word docs).
  2. A pipeline of language models segments the text into structural units.
  3. Headings, footnotes, page numbers, marginalia, and speaker turns are detected and tagged.
  4. The result is serialized to TEI XML and stored alongside the original file.
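Step 4 can be sketched as follows, assuming the earlier steps have already produced a list of sections with headings and paragraphs. The Section dataclass, the to_tei function, the placeholder header text, and the ID scheme are all hypothetical and stand in for whatever Vulgate's pipeline actually produces; the language-model segmentation in steps 2 and 3 is not modeled here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


@dataclass
class Section:
    """Hypothetical output of the segmentation steps: one heading plus paragraphs."""
    heading: str
    paragraphs: list[str] = field(default_factory=list)


def q(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def to_tei(title, sections):
    """Serialize segmented sections to a minimal TEI document string."""
    # Emit TEI as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", TEI_NS)

    tei = ET.Element(q("TEI"))

    # Minimal header: title statement plus placeholder publication and source notes.
    file_desc = ET.SubElement(ET.SubElement(tei, q("teiHeader")), q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Unpublished sketch."
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = "Generated from extracted text."

    # One div per section; every div and paragraph gets a stable xml:id.
    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    for i, section in enumerate(sections, start=1):
        div = ET.SubElement(body, q("div"), {XML_ID: f"s{i}"})
        ET.SubElement(div, q("head")).text = section.heading
        for j, paragraph in enumerate(section.paragraphs, start=1):
            ET.SubElement(div, q("p"), {XML_ID: f"s{i}-p{j}"}).text = paragraph

    return ET.tostring(tei, encoding="unicode")


xml_text = to_tei(
    "Sample",
    [Section("Intro", ["One.", "Two."]), Section("Next", ["Three."])],
)
print(xml_text)
```

Assigning an identifier to every division and paragraph at serialization time is the design choice that makes section-level retrieval and stable anchors possible later.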

Organization admins can inspect the underlying XML during document review in Uploads → Processing.

Editing the encoding

If the auto-encoding gets something wrong (a missing footnote, a chapter break in the wrong place), an Organization admin can open the document in the structural editor during review and adjust boundaries. Chat and the in-document AI Assistant start using the corrected structure once the document is republished.

Further reading

Search help