David Chiang edited this page May 9, 2019 · 6 revisions

Much of the structure of the XML format is specified by the RELAX NG schema (data/xml/schema.{rnc,rng}) and can be validated automatically. This document describes the structure less formally and also describes aspects of the format that aren't specified by the schema.

Structure

The root element is <volume id="X99">, where X is replaced by the one-letter code for the venue and 99 is replaced by the last two digits of the year.

The <volume> element has child elements <paper id="9999">, where 9999 is replaced by the four-digit paper identifier. For some venues (e.g., LREC), the <paper> element also has an href attribute giving the external URL of the paper.

Each <paper> element has several child elements:

  • <title>: The title (see below for more details)
  • <author>: The authors (see below for more details)
  • <editor>: The editors (see below for more details)
  • and others.
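As a rough illustration of this structure, the sketch below parses a small volume with Python's standard xml.etree.ElementTree module. The sample document, its id values, titles, and example.org URL are invented for illustration, and list_papers is a hypothetical helper, not part of the Anthology's own tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal volume; ids, titles, and the URL are made up.
SAMPLE = """\
<volume id="P19">
  <paper id="1001">
    <title>A First Example Paper</title>
    <author><first>A.</first> <last>Joshi</last></author>
  </paper>
  <paper id="1002" href="http://example.org/paper.pdf">
    <title>A Second Example Paper</title>
    <author><first>Aravind K.</first> <last>Joshi</last></author>
  </paper>
</volume>
"""

def list_papers(xml_text):
    """Return (volume id, [(paper id, title text, external href or None), ...])."""
    root = ET.fromstring(xml_text)
    if root.tag != "volume":
        raise ValueError("expected <volume> as the root element")
    papers = []
    for paper in root.findall("paper"):
        title = paper.find("title")
        # itertext() flattens any formatting children such as <i> or <fixed-case>.
        title_text = "".join(title.itertext()) if title is not None else ""
        papers.append((paper.get("id"), title_text, paper.get("href")))
    return root.get("id"), papers

volume_id, papers = list_papers(SAMPLE)
for paper_id, title_text, href in papers:
    print(volume_id, paper_id, title_text, href)
```

The helper reads only the id and href attributes and the <title> child; other child elements are ignored here.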

Text Fields

Text fields (<title>, <author>, etc.) are written in Unicode (UTF-8). The following elements are currently allowed for formatting:

  • <tex-math>: math formulas, coded using TeX (equivalent to TeX $...$). For example: An <tex-math>O(n^3)</tex-math> Algorithm for Parsing Context-Free Grammars.
  • <url>: a URL, displayed in typewriter font and hyperlinked
  • <i>: italics
  • <b>: boldface

Below are additional guidelines for specific fields.

Title

The title should be written in title-case. The Anthology doesn't currently have rules for what "title-case" means exactly, but individual meetings/journals might. Characters whose case should be preserved even when a bibliography style uppercases or lowercases the title should be placed inside a <fixed-case> element (this serves the same purpose as curly braces in BibTeX). For example:

<title>The <fixed-case>ACL</fixed-case> <fixed-case>A</fixed-case>nthology: Current State and Future Directions</title>
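To make the BibTeX analogy concrete, here is a minimal sketch that turns a <title> element into a BibTeX-style string by wrapping each <fixed-case> span in curly braces. The function title_to_bibtex is a hypothetical helper written for this page. It assumes <fixed-case> appears only as a direct child of <title>, and it flattens other formatting elements to plain text.

```python
import xml.etree.ElementTree as ET

def title_to_bibtex(title_xml):
    """Render a <title> element as a BibTeX title string (hypothetical helper)."""
    title = ET.fromstring(title_xml)
    parts = [title.text or ""]
    for child in title:
        inner = "".join(child.itertext())
        if child.tag == "fixed-case":
            # Braces protect the enclosed characters from case changes.
            parts.append("{" + inner + "}")
        else:
            # Simplification: other formatting elements become plain text.
            parts.append(inner)
        # Text that follows the child element, up to the next element.
        parts.append(child.tail or "")
    return "".join(parts)

example = ("<title>The <fixed-case>ACL</fixed-case> "
           "<fixed-case>A</fixed-case>nthology: "
           "Current State and Future Directions</title>")
bibtex_title = title_to_bibtex(example)
print(bibtex_title)
```

For the example title above, this yields The {ACL} {A}nthology: Current State and Future Directions.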

Authors and Editors

Each author/editor name must have exactly one <last> element and at most one <first> element.

  • The <last> element contains the name(s) by which the paper is cited and by which its bibliography entry is sorted alphabetically. If an author has only a single name, that name should go into the <last> element. A "lineage" suffix like Jr. or III should also go into the <last> element.

  • The <first> element contains all other names, including middle names/initials.

The name should appear in the XML the same way that it does on the original paper. For example, if the original paper has only a first initial and last name, like A. Joshi, the XML should also have only a first initial and last name: <first>A.</first> <last>Joshi</last>.

The Anthology also needs to know what individual a name refers to. There are two ways to do this.

Variant method. If a person goes by multiple names, like Aravind Joshi and Aravind K. Joshi, all names should be entered into the file data/yaml/name_variants.yaml. An example entry would be:

- canonical: Aravind Joshi
  id: aravind-joshi
  variants:
  - Aravind K. Joshi

The canonical name is the one that the Anthology displays by default. The id field is optional and is explained below. The variants must be globally unique.

ID method. Alternatively, the referent of the name can be indicated in the XML file itself using the id attribute:

<author id="aravind-joshi"><first>A.</first> <last>Joshi</last></author>
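The two methods can be pictured as a simple lookup. The sketch below is illustrative only. It mirrors the name_variants.yaml entry as a plain Python list rather than loading the file, and build_index, full_name, and resolve are hypothetical helpers, not the Anthology's actual implementation. It also assumes that an explicit id attribute takes precedence over the variant table.

```python
import xml.etree.ElementTree as ET

# Mirrors the name_variants.yaml entry as a plain Python structure
# (an assumption: a real pipeline would load the YAML file instead).
NAME_VARIANTS = [
    {"canonical": "Aravind Joshi", "id": "aravind-joshi",
     "variants": ["Aravind K. Joshi"]},
]

def build_index(entries):
    """Map every canonical and variant name to its entry."""
    index = {}
    for entry in entries:
        for name in [entry["canonical"]] + entry.get("variants", []):
            if name in index:
                # Names in the variants file must be globally unique.
                raise ValueError("name listed twice: " + name)
            index[name] = entry
    return index

def full_name(author):
    """Join <first> and <last>; <first> may be absent."""
    first = author.findtext("first")
    last = author.findtext("last")
    return (first + " " + last) if first else last

def resolve(author, index):
    """Return the person id for an <author>/<editor> element, or None."""
    explicit = author.get("id")
    if explicit is not None:
        return explicit                      # ID method
    entry = index.get(full_name(author))     # variant method
    return entry.get("id") if entry else None

index = build_index(NAME_VARIANTS)
by_variant = ET.fromstring(
    "<author><first>Aravind K.</first> <last>Joshi</last></author>")
by_id = ET.fromstring(
    '<author id="aravind-joshi"><first>A.</first> <last>Joshi</last></author>')
print(resolve(by_variant, index), resolve(by_id, index))
```

An abbreviated name such as A. Joshi without an id attribute is not in the variant table, so resolve returns None for it in this sketch.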

When should each of the two methods be used?

  • The variant method is better for names that are likely to be unique (because variant names must be unique) and likely to be reused (because variant names don't need special annotation to be resolved correctly).

  • The ID method is better for names that are either likely to be non-unique (for example, a name abbreviated to use just a first initial) or unlikely to be reused (for example, a misspelling).

The Anthology enforces the constraint that each name must always use the variant method or always use the ID method. This is to reduce the chance of a newly ingested paper having an author name that is not resolved correctly.

An ID can be any unique string that uses only characters allowed in URLs. Usually it is based on the author's canonical name. When two authors share a name, the ID can add a middle initial to distinguish them; failing that, the current convention is to append the author's PhD institution to the ID (e.g., aravind-joshi-upenn).
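One possible way to derive such an ID from a canonical name is sketched below. The exact rule the Anthology uses is not specified here, so make_id and its normalization steps (lowercasing, dropping accents, joining with hyphens) are assumptions chosen to reproduce the examples on this page.

```python
import re
import unicodedata

def make_id(canonical_name, suffix=None):
    """Build a URL-safe ID from a name (hypothetical helper, not the official rule)."""
    # Decompose accented characters, then drop anything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", canonical_name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    if suffix:
        # Optional disambiguating suffix, e.g. the PhD institution.
        slug += "-" + re.sub(r"[^a-z0-9]+", "-", suffix.lower()).strip("-")
    return slug

print(make_id("Aravind Joshi"))
print(make_id("Aravind Joshi", suffix="UPenn"))
```

This normalization is lossy: characters with no ASCII decomposition are dropped. Uniqueness would still need to be checked against existing IDs.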

Link fields

Paper PDFs are linked in three ways.

  • <url>URL</url>: URL of Anthology-hosted PDF.
  • <paper href="URL">...</paper>: URL of externally-hosted, non-ACL-sponsored PDF (currently used mainly for LREC)
  • <href>URL</href>: URL of externally-hosted, ACL-sponsored PDF (currently used mainly for TACL)
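A reader of the XML might locate a paper's PDF link as in the sketch below. pdf_link is a hypothetical helper, the labels it returns are invented for illustration, and the order in which the three forms are checked is an assumption, since this page does not define a precedence.

```python
import xml.etree.ElementTree as ET

def pdf_link(paper):
    """Return (kind, url) for the first PDF link found on a <paper>, else None."""
    url = paper.findtext("url")
    if url:
        return ("anthology", url.strip())            # <url>URL</url>
    if paper.get("href"):
        return ("external-non-acl", paper.get("href"))  # <paper href="URL">
    href = paper.findtext("href")
    if href:
        return ("external-acl", href.strip())        # <href>URL</href>
    return None

# The example.org URLs are placeholders, not real Anthology links.
hosted = ET.fromstring(
    '<paper id="1001"><url>http://example.org/hosted.pdf</url></paper>')
external = ET.fromstring(
    '<paper id="1002" href="http://example.org/external.pdf"></paper>')
sponsored = ET.fromstring(
    '<paper id="1003"><href>http://example.org/sponsored.pdf</href></paper>')
for paper in (hosted, external, sponsored):
    print(paper.get("id"), pdf_link(paper))
```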

Other files can be linked as well:

  • <software>filename</software>
  • <dataset>filename</dataset>
  • <attachment type="...">filename</attachment> where type is one of 'note', 'presentation', 'poster', or 'attachment', or is omitted
  • <mrf src="latexml">filename.xhtml</mrf> (machine readable format? Mr. F?)
  • <video href="URL" tag="video"/>
  • <revision id="2">Q15-1022v2</revision>
  • <erratum id="1">Q15-1022e1</erratum>