Since I first played with DynaText and later eXist, I have been frustrated that many XML toys presume that an element implies a word boundary. Using such products to search for the word “obſtruction” in the WWP version of Margaret Cavendish’s The Blazing World,¹ a user would find only 3 of the 5 occurrences. The 2 that would be missed are:
<lb/>an obſtru<sic>s</sic>tion and hinderance to his worth and merit; andand
<lb/>that this would perhaps be an hindrance or obſtru-
<lb/>ction to their Deſign.
While it would be awfully hard for software to be smart enough to realize that “obſtruction” is a single word in that 2nd case (especially since in truth the last character of its 1st line is incorrectly encoded as U+00AD, not U+002D),² I think missing the first is, well, kinda inexcusable.
In some cases, like the <sic> above, word boundaries are implicit, typically bounded by whitespace or an element which implies a word boundary (like <lb>, or <pb>, or <head>). The problem is that the list of such elements not only differs from one markup language to another, it is not necessarily even the same from one TEI project to another.
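To make the implicit case concrete, here is a minimal Python sketch of tokenizing with a user-supplied list of word-breaking elements. It is only an illustration, not how Elemental, eXist, or BaseX work: the names `BREAKING`, `flatten`, and `words` are made up for this example, the `<p>` wrapper is added so the fragment parses, and the element list would have to be configured per project.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: local names of elements that imply a word break.
# This set is an assumption and would differ from project to project.
BREAKING = {"lb", "pb", "head"}

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def flatten(elem, breaking=BREAKING):
    """Return the text of elem, adding a space only around breaking elements."""
    is_break = local_name(elem.tag) in breaking
    parts = [" " if is_break else "", elem.text or ""]
    for child in elem:
        parts.append(flatten(child, breaking))
        parts.append(child.tail or "")
    if is_break:
        parts.append(" ")
    return "".join(parts)

def words(elem, breaking=BREAKING):
    """Split the flattened text on whitespace."""
    return flatten(elem, breaking).split()

# The first missed occurrence, wrapped in a <p> so that it is well-formed.
sample = ET.fromstring(
    "<p><lb/>an obſtru<sic>s</sic>tion and hinderance to his worth and merit;</p>"
)

print(words(sample))                            # <sic> does not break the word
print(words(sample, breaking={"lb", "sic"}))    # every element breaks the word
```

With the default list the second token comes out whole, as “obſtrustion”; when <sic> is treated as a break it falls apart into “obſtru”, “s”, and “tion”. Matching the whole token against “obſtruction” would still require applying the correction, which this sketch does not attempt.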
But in other cases the word boundaries are explicitly indicated, as in the following encoding of the sentence “Caesar seized control” taken from the 4th example in TEI P5 § 19.10.
<s>
<w xml:id="S1W1"><c xml:id="S1W1C1">C</c>ae<c xml:id="S1W1C2">s</c>ar</w>
<w xml:id="S1W2"><c xml:id="S1W2C1">s</c>ei<c xml:id="S1W2C2">z</c>e<c xml:id="S1W2C3">d</c></w>
<w xml:id="S1W3">con<c xml:id="S1W3C1">t</c>rol</w>.
</s>
It is easier to see what is going on without the attributes:
<s>
<w><c>C</c>ae<c>s</c>ar</w>
<w><c>s</c>ei<c>z</c>e<c>d</c></w>
<w>con<c>t</c>rol</w>.
</s>
In the above style of encoding the whitespace between <w> elements is implicit. In the Folger Shakespeare Library’s plays it is always explicit, although I don’t think they put elements inside <w> elements. E.g.³
<w xml:id="w0352820" n="5.8.19">Macduff</w>
<c xml:id="c0352830" n="5.8.19"> </c>
<w xml:id="w0352840" n="5.8.19">was</w>
<c xml:id="c0352850" n="5.8.19"> </c>
<w xml:id="w0352860" n="5.8.19">from</w>
<c xml:id="c0352870" n="5.8.19"> </c>
<w xml:id="w0352880" n="5.8.19">his</w>
<c xml:id="c0352890" n="5.8.19"> </c>
<w xml:id="w0352900" n="5.8.19">mother’s</w>
<c xml:id="c0352910" n="5.8.19"> </c>
<w xml:id="w0352920" n="5.8.19">womb</w>
<lb xml:id="lb-23840"/>
<milestone unit="ftln" xml:id="ftln-2385" n="5.8.20" ana="#short" corresp="#w0352930 #c0352940 #w0352950 #p0352960"/>
<w xml:id="w0352930" n="5.8.20">Untimely</w>
<c xml:id="c0352940" n="5.8.20"> </c>
<w xml:id="w0352950" n="5.8.20">ripped</w>
<pc xml:id="p0352960" n="5.8.20">.</pc>
I would love to see Elemental (and eXist, and BaseX, and …) be able to handle these encodings correctly, even if it requires configuration by the user (to indicate which elements imply a word break and which do not).
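For the explicitly tokenized encodings, one possible reading is that a word is simply the string value of each <w>, whatever markup sits inside it. The Python sketch below assumes that reading; `explicit_words` is a name invented for this example, the samples are cut-down, un-namespaced versions of the ones above with the attributes removed, and the `<l>` wrapper around the Folger fragment is added only so that it parses. Real TEI files are in the TEI namespace, so the element lookup would need adjusting.

```python
import xml.etree.ElementTree as ET

def explicit_words(root):
    """Return the string value of every <w>, ignoring markup nested inside it."""
    return ["".join(w.itertext()) for w in root.iter("w")]

# TEI-style sample: <c> elements nested inside <w>.
tei = ET.fromstring(
    "<s>"
    "<w><c>C</c>ae<c>s</c>ar</w> "
    "<w><c>s</c>ei<c>z</c>e<c>d</c></w> "
    "<w>con<c>t</c>rol</w>."
    "</s>"
)

# Folger-style sample: whitespace and punctuation carried by <c> and <pc>.
folger = ET.fromstring(
    "<l><w>Untimely</w><c> </c><w>ripped</w><pc>.</pc></l>"
)

print(explicit_words(tei))
print(explicit_words(folger))
```

Both styles yield whole words this way (“Caesar”, “seized”, “control” and “Untimely”, “ripped”), because neither the nested <c> elements nor the sibling <c> and <pc> elements are treated as boundaries inside a <w>.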
notes
¹ The Description of a New World, Called the Blazing-World by Margaret Cavendish, Duchess of Newcastle, 2nd edition. London, 1668. WWP TR00253 as of 2020-03-19. This work has been called a forerunner of science fiction.
² See my paper on soft hyphens if you care.
³ From https://www.folger.edu/explore/shakespeares-works/download/.