David Chiang edited this page Apr 15, 2019 · 6 revisions

Much of the structure of the XML format is specified by the RELAX NG schema (data/xml/schema.{rnc,rng}) and can be validated automatically. This document describes the structure less formally and also describes aspects of the format that aren't specified by the schema.

Structure

The root element is <volume id="X99">, where X is replaced by the one-letter code for the venue and 99 is replaced by the last two digits of the year.

The <volume> element has child elements <paper id="9999">, where 9999 is replaced by the four-digit paper identifier. For some venues (LREC), there is also an href attribute for the external URL of the paper.

Each <paper> element has several child elements:

  • <title>: The title (see below for more details)
  • <author>: The authors (see below for more details)
  • <editor>: The editors (see below for more details)
  • and others.
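The nesting described above can be walked with any XML parser. The following is a minimal sketch using Python's standard library; the sample volume, its ids, titles, names, and the example.org URL are made up for illustration, and list_papers is a hypothetical helper, not part of the Anthology tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample volume; ids, titles, names, and URL are invented.
SAMPLE = """\
<volume id="P18">
  <paper id="1001">
    <title>A First Paper</title>
    <author><first>Ann</first><last>Example</last></author>
  </paper>
  <paper id="1002" href="https://example.org/1002.pdf">
    <title>A Second Paper</title>
    <author><first>Bo</first><last>Sample</last></author>
  </paper>
</volume>
"""

def list_papers(xml_text):
    """Return (volume id, paper id, title text, href or None) per paper."""
    root = ET.fromstring(xml_text)
    if root.tag != "volume":
        raise ValueError(f"expected <volume> root, got <{root.tag}>")
    rows = []
    for paper in root.findall("paper"):
        title = paper.find("title")
        # itertext() also collects text nested inside formatting elements.
        title_text = "".join(title.itertext()) if title is not None else ""
        rows.append((root.get("id"), paper.get("id"), title_text, paper.get("href")))
    return rows

papers = list_papers(SAMPLE)
for row in papers:
    print(row)
```

The href value is None for papers that do not carry the optional attribute.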

Text Fields

Text fields (<title>, <author>, etc.) are written in Unicode (UTF-8). The following elements are currently allowed for formatting:

  • <tex-math>: math formulas, coded using TeX (equivalent to TeX $...$). For example: An <tex-math>O(n^3)</tex-math> Algorithm for Parsing Context-Free Grammars.
  • <url>: a URL, displayed in typewriter font and hyperlinked
  • <i>: italics
  • <b>: boldface
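When only the characters of a text field are needed (for search or sorting, say), the formatting elements can be dropped. This is a rough sketch with Python's standard library; plain_text is a hypothetical helper, and it keeps the TeX source of <tex-math> as-is rather than rendering it.

```python
import xml.etree.ElementTree as ET

def plain_text(field_xml):
    """Drop formatting tags from a text field and keep only the characters."""
    elem = ET.fromstring(field_xml)
    # itertext() yields the element's text and the text of all descendants
    # in document order, so nested formatting is flattened.
    return "".join(elem.itertext())

title = plain_text(
    "<title>An <tex-math>O(n^3)</tex-math> Algorithm for Parsing "
    "Context-Free Grammars</title>"
)
print(title)
```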

Below are additional guidelines for specific fields.

Title

The title should be written in title-case. The Anthology doesn't currently have rules for what "title-case" means exactly, but individual meetings/journals might. Characters whose case should be preserved even when a bibliography style uppercases or lowercases the title should be placed inside a <fixed-case> element (this serves the same purpose as curly braces in BibTeX). For example:

<title>The <fixed-case>ACL</fixed-case> <fixed-case>A</fixed-case>nthology: Current State and Future Directions</title>
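To illustrate the BibTeX analogy, the sketch below wraps each <fixed-case> span in curly braces. It is only an assumption about how such a conversion might look, not the Anthology's own exporter: to_bibtex_title is a hypothetical helper, other formatting elements are simply flattened to their text, and BibTeX special characters are not escaped.

```python
import xml.etree.ElementTree as ET

def to_bibtex_title(title_xml):
    """Render a <title>, wrapping <fixed-case> spans in curly braces."""
    def render(elem):
        parts = [elem.text or ""]
        for child in elem:
            inner = render(child)
            # Only <fixed-case> is protected; other tags are flattened.
            parts.append("{" + inner + "}" if child.tag == "fixed-case" else inner)
            # The tail is the text that follows the child's closing tag.
            parts.append(child.tail or "")
        return "".join(parts)
    return render(ET.fromstring(title_xml))

bibtex_title = to_bibtex_title(
    "<title>The <fixed-case>ACL</fixed-case> <fixed-case>A</fixed-case>nthology: "
    "Current State and Future Directions</title>"
)
print(bibtex_title)
```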

Authors and Editors

Each author/editor name must have exactly one <last> element and at most one <first> element.

  • The <last> element contains the name(s) under which the paper is cited and by which its bibliography entry is sorted alphabetically. If an author has only a single name, that name should go into the <last> element. A "lineage" suffix such as Jr. or III should also go into the <last> element.

  • The <first> element contains all other names, including middle names/initials.

Ideally, the name should appear in the XML the same way that it does on the original paper. For example, if the original paper has only a first initial and last name, like A. Joshi, the XML should also have only a first initial and last name: <first>A.</first> <last>Joshi</last>. If you know the full first name, please use the complete attribute to record it: <first complete="Aravind">A.</first> <last>Joshi</last>. The same applies to middle and last initials.
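The name rules above can be checked mechanically. The following sketch assumes nothing beyond those rules; check_name and display_name are hypothetical helpers, and the expand flag (which prefers the complete attribute when present) is an illustrative choice rather than documented Anthology behavior.

```python
import xml.etree.ElementTree as ET

def check_name(person):
    """Return a list of problems with an <author> or <editor> element."""
    problems = []
    if len(person.findall("last")) != 1:
        problems.append("need exactly one <last>")
    if len(person.findall("first")) > 1:
        problems.append("need at most one <first>")
    return problems

def display_name(person, expand=False):
    """Join first and last names; with expand=True, prefer 'complete'."""
    parts = []
    for tag in ("first", "last"):
        elem = person.find(tag)
        if elem is None or not (elem.text or "").strip():
            continue
        text = elem.text.strip()
        if expand:
            # Fall back to the name as printed if no complete form is recorded.
            text = elem.get("complete", text)
        parts.append(text)
    return " ".join(parts)

joshi = ET.fromstring(
    '<author><first complete="Aravind">A.</first> <last>Joshi</last></author>'
)
print(check_name(joshi), display_name(joshi), display_name(joshi, expand=True))
```

An element with a single name in <last> and no <first> passes the check, while an element with only <first> is reported.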

Link fields

Paper PDFs are linked in three ways.

  • <url>URL</url>: URL of Anthology-hosted PDF.
  • <paper href="URL">...</paper>: URL of externally-hosted, non-ACL-sponsored PDF (currently used mainly for LREC)
  • <href>URL</href>: URL of externally-hosted, ACL-sponsored PDF (currently used mainly for TACL)
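A consumer of the XML has to look in all three places to find a paper's PDF. The sketch below shows one way to do that; pdf_link is a hypothetical helper, the kind labels are invented, the lookup order is an assumption (the format does not state a priority), and the example.org URLs are placeholders.

```python
import xml.etree.ElementTree as ET

def pdf_link(paper):
    """Return (kind, url) for the first PDF link found on a <paper>, else None."""
    # findtext() only looks at direct children, so a <url> nested inside
    # a title or abstract is not mistaken for the PDF link.
    url = paper.findtext("url")
    if url:
        return ("anthology", url.strip())
    href = paper.findtext("href")
    if href:
        return ("external-acl", href.strip())
    if paper.get("href"):
        return ("external-non-acl", paper.get("href"))
    return None

hosted = ET.fromstring('<paper id="1001"><url>https://example.org/a.pdf</url></paper>')
external = ET.fromstring('<paper id="1002" href="https://example.org/b.pdf"/>')
sponsored = ET.fromstring('<paper id="1003"><href>https://example.org/c.pdf</href></paper>')
for p in (hosted, external, sponsored):
    print(p.get("id"), pdf_link(p))
```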

Other files can be linked as well:

  • <software>filename</software>
  • <dataset>filename</dataset>
  • <attachment type="...">filename</attachment> where the type is 'note', 'presentation', 'poster', or 'attachment'; the type attribute may also be omitted
  • <mrf src="latexml">filename.xhtml</mrf> (machine readable format?)
  • <video href="URL" tag="video"/>
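The restriction on the attachment type is easy to check in a script. This is a small sketch under the assumption that the four values listed above are the complete set; bad_attachments is a hypothetical helper and the filenames are invented.

```python
import xml.etree.ElementTree as ET

# The four type values listed for <attachment>; a missing type is also allowed.
ATTACHMENT_TYPES = {"note", "presentation", "poster", "attachment"}

def bad_attachments(paper):
    """Return filenames of <attachment> children with an unlisted type."""
    bad = []
    for att in paper.findall("attachment"):
        kind = att.get("type")
        if kind is not None and kind not in ATTACHMENT_TYPES:
            bad.append(att.text)
    return bad

sample_paper = ET.fromstring(
    '<paper id="1001">'
    '<attachment type="poster">a.pdf</attachment>'
    '<attachment>b.zip</attachment>'
    '<attachment type="slides">c.pdf</attachment>'
    '</paper>'
)
flagged = bad_attachments(sample_paper)
print(flagged)
```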