Paragraph Recognition
To follow the wtf_wikipedia structure the following implementation is suggested.
- Create a subdirectory `src/paragraph`
- a section basically has a `title`, a `depth` and `paragraphs`.
- output to HTML, LaTeX, ... generates the section, subsection, ... with the title according to the depth and then iterates over the paragraphs
- a paragraph consists of an array of content elements that are stored in order of appearance.
- the content elements of a section are just an array of paragraphs, while the content elements of a paragraph are images, mathematical expressions, tables, lists, ...
- output generation checks the type of each content element and determines the appropriate method for output generation.
The implementation of a paragraph is related to a more generic element of an Abstract Syntax Tree (AST) called `ContentList`. The following explanation describes how a paragraph can be implemented as a `ContentList` of type `Paragraph`. Furthermore, it shows how all parsed elements, `List`, `Table`, ..., can be described as extensions of a `ContentList`. Even a `Section` can be described as an extension of `ContentList`.
During parsing of the wiki source, the order of the content elements gets lost. The introduction of the ContentList fixes that. The following example shows the loss of page order:
==Soccer==
The soccer game consists of the following components:
* 2 Teams with 11 players each,
* 3 referees
The game last 90 min.
The output is rendered in HTML in release 5.0 as follows, and the order of the blocks of text is lost:
<h1>Soccer</h1>
<ul>
<li>2 Teams with 11 players each,</li>
<li>3 referees</li>
</ul>
The soccer game consists of the following components:
The game last 90 min.
The order matters especially when the preceding text `The soccer game ...` must logically appear before the list and the concluding remark must appear after the list to be comprehensible to the reader. The order of appearance must therefore be preserved by the paragraphs and, in general, on every level of the Abstract Syntax Tree (AST). Even if paragraphs are not introduced in wtf_wikipedia, the order of appearance must be preserved in a content list.
- AST Type: `SectionHeader` value: `Soccer`
- AST Type: `TextBlock` value: `The soccer game consists of the following components:`
- AST Type: `List`
- AST Type: `TextBlock` value: `The game last 90 min.`
The key challenge is to preserve the order of the content elements in a Section or Paragraph object.
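The order-preserving idea can be sketched in a few lines. This is only an illustration and not part of wtf_wikipedia: the helper name `splitBlocks()` and the block attributes `value` and `items` are assumptions made for this example, and the parsing handles only headers, bullet lines and plain text.

```javascript
// Hypothetical helper: split wiki text into typed blocks, keeping source order.
const splitBlocks = function(wiki) {
  const blocks = [];
  for (const line of wiki.split('\n')) {
    const text = line.trim();
    if (text === '') {
      continue; // skip empty lines
    }
    const header = text.match(/^(=+)\s*(.*?)\s*=+$/);
    const last = blocks[blocks.length - 1];
    if (header) {
      blocks.push({ type: 'SectionHeader', value: header[2] });
    } else if (text.startsWith('*')) {
      const item = text.replace(/^\*+\s*/, '');
      // consecutive bullet lines are collected into one List block
      if (last && last.type === 'List') {
        last.items.push(item);
      } else {
        blocks.push({ type: 'List', items: [item] });
      }
    } else {
      blocks.push({ type: 'TextBlock', value: text });
    }
  }
  return blocks;
};

const wiki = [
  '==Soccer==',
  'The soccer game consists of the following components:',
  '* 2 Teams with 11 players each,',
  '* 3 referees',
  'The game last 90 min.'
].join('\n');

const blocks = splitBlocks(wiki);
console.log(blocks.map(b => b.type).join(', '));
// SectionHeader, TextBlock, List, TextBlock
```

Because every block is pushed at the moment it is encountered, the array index encodes the order of appearance and no separate position attribute is needed.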
- change `doSection()` in file `/src/section/index.js`:
const paragraph_reg = /\n\s*\n\s*/g; // one or more blank lines -> paragraph boundary
//const paragraph_reg = /\n[\s\S]*\n/g; // just 2 newline with optional blanks,tabs, ... between \n
const doSection = function(section, wiki, options) {
// parse XML templates
wiki = parse.xmlTemplates(section, wiki, options);
//parse-out all {{templates}}
wiki = parse.templates(section, wiki, options);
// the aggregation of references is currently done on the section level
// * to preserve the design of Spencer, provide 'section' as parameter of paragraph parsing
// * handle the <ref></ref> tags on deeper levels of the AST (Abstract Syntax Tree) with
// wiki = parse.references(section, wiki, options);
// now split the paragraphs and add them to the ContentList
let split = wiki.split(paragraph_reg); //.filter(s => s);
let paragraphs = new ContentList();
for (let i = 0; i < split.length; i++) {
let paragraph = {
type: 'paragraph',
contentlist: new ContentList()
};
// the contentlist of a paragraph can contain different types of content elements:
// "table", "list", "image", "math", ...
const content = split[i] || '';
// section is a parameter of doParagraph, so that references and citations can be handled
// on deeper levels of parsing the AST and it is still possible to add references, citations
// to the corresponding section, the paragraph belongs to.
// parse the content of the paragraph and populate the paragraph.contentlist
paragraph = doParagraph(section, paragraph, content , options);
// add the parsed paragraph to the contentlist
paragraphs.push(paragraph);
// push is a method of ContentList, to emulate the expected behaviour of arrays
}
return paragraphs;
}
In an object-oriented view, the Paragraph and Section classes extend the class ContentList. The Section class has a content list for storing paragraphs only, while a paragraph is just a content list without the section.title and section.depth attributes. The difference mainly appears when the output is rendered, e.g. in LaTeX or HTML.
<p>My paragraph rendered in HTML</p>
So the object ContentList could be generated in the following way:
- Create a subdirectory `src/contentlist`
- a content list is generated in the file `src/contentlist/ContentList.js`
- `paragraphs` inherit all the methods and attributes from the `contentlist`.
- the method `doSection()` splits the section into paragraphs and stores the paragraphs in order of appearance in the `contentlist`
- the methods for generating plain text, HTML, LaTeX, ... output then depend on the content element type.
- output generation calls the output method of each content element of the ContentList.

In classical JavaScript syntax it will look like this:
let contentlist = new ContentList();
// here populate the contentlist
// mypar1, mypar2 are placeholders for further parameters
const toHTML = function(mypar1, mypar2) {
  let out = '';
  for (let i = 0; i < contentlist.length; i++) {
    out += contentlist[i].toHTML();
  }
  return out;
}
const toLatex = function(mypar1, mypar2) {
  let out = '';
  for (let i = 0; i < contentlist.length; i++) {
    out += contentlist[i].toLatex();
  }
  return out;
}
const toMarkdown = function(mypar1, mypar2) {
  let out = '';
  for (let i = 0; i < contentlist.length; i++) {
    out += contentlist[i].toMarkdown();
  }
  return out;
}
To implement the content list in a generic way, with a software design that allows adding new output formats, the functions can be refactored as follows (see the Pandoc website for possible other output formats).
let contentlist = new ContentList();
// here populate the contentlist
const toOutput = function(format, mypar1, mypar2) { // mypar1, mypar2: placeholders for further parameters
var out = '';
for (var i = 0; i < contentlist.length; i++) {
out += contentlist[i].toOutput(format);
}
return out
}
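A minimal sketch of how `ContentList` and `Paragraph` could look as classes is given here. It is not the wtf_wikipedia implementation: extending `Array`, the leaf class `TextBlock` and the format names `'html'` and `'latex'` are assumptions chosen for this example.

```javascript
// Sketch only: ContentList extends Array so that push(), length and
// index access behave as in the snippets above.
class ContentList extends Array {
  // concatenate the output of all content elements in order of appearance
  toOutput(format) {
    let out = '';
    for (const element of this) {
      out += element.toOutput(format);
    }
    return out;
  }
}

// Hypothetical leaf element holding plain text
class TextBlock {
  constructor(value) {
    this.type = 'TextBlock';
    this.value = value;
  }
  toOutput(format) {
    return this.value; // same text for every format in this sketch
  }
}

// A Paragraph is a ContentList that wraps its concatenated output
class Paragraph extends ContentList {
  toOutput(format) {
    const inner = super.toOutput(format);
    switch (format) {
      case 'html':
        return '<p>' + inner + '</p>';
      case 'latex':
        return inner + '\n\n';
      default:
        return inner + '\n';
    }
  }
}

const paragraph = new Paragraph();
paragraph.push(new TextBlock('My paragraph '));
paragraph.push(new TextBlock('rendered in HTML'));

const section = new ContentList();
section.push(paragraph);

console.log(section.toOutput('html'));
// <p>My paragraph rendered in HTML</p>
```

With this design a new output format only needs a new `case` in the elements that wrap their content; the plain `ContentList` stays unchanged.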
Even the headers of the sections can be designed as the first content element of a ContentList. The introduction of ContentList
- fixes the `loss of content element order` on the section level
- builds a generic structure for building the Abstract Syntax Tree (AST)
The ContentList provides a generic structure for the Abstract Syntax Tree (AST). Referring to the example above, we have to perform 3 major steps for the list:
- open the bullet list or enumeration
- create a content element in the `ContentList` for every item of the bullet list or enumeration and populate the content list with the items of the list
- close the bullet list or enumeration
//let bulletlist = new ContentList();
let bulletlist = createNode4AST("BulletList");
//Attribute: bulletlist.type = "bulletlist"
bulletlist.push(createNode4AST("OpenBulletList"))
bulletlist.push(createNode4AST("ItemBulletList",parseContentList("2 Teams with 11 players each,")))
bulletlist.push(createNode4AST("ItemBulletList",parseContentList("3 referees")))
bulletlist.push(createNode4AST("CloseBulletList"))
With the setting `bulletlist.type = "bulletlist"`, the tree nodes `OpenBulletList` and `CloseBulletList` may seem unnecessary, because the type already defines the following items as bullet items.
The tree node OpenBulletList can be used to store formatting attributes if desired, but all attributes of the bullet list may also be stored as attributes in the tree node BulletList.
The method createNode4AST() creates a very simple node for the Abstract Syntax Tree (AST) by returning a hash with just the type attribute. This AST node can be populated with additional attributes that may be relevant for the generation of output formats.
const createNode4AST = function(nodeid) {
return {
"type":nodeid
}
}
A tree node for Paragraph is created with createNode4AST("Paragraph") and then populated with more content. A node-specific constructor could use a switch statement to add type-specific attributes.
const createNode4AST = function(nodeid) {
  let ast_node = {
    "type": nodeid
  };
  switch (nodeid) {
    // stacked case labels: all of these node types get an empty content list
    case "Paragraph":
    case "BulletList":
    case "EnumList":
    case "TextBlock":
    case "Sentence":
      ast_node.contentlist = new ContentList();
      break;
    case "Section":
      ast_node.title = "";
      ast_node.depth = -1;
      ast_node.contentlist = new ContentList();
      break;
    default:
      break;
  }
  return ast_node;
}
// note: createNode4AST() as defined above ignores the second argument
var n1 = createNode4AST("ItemBulletList", parseContentList("2 Teams with 11 players each,"));
var n2 = createNode4AST("ItemBulletList", parseContentList("3 referees"));
The method parseContentList() decomposes a string into tree nodes of the AST. Parsing a ContentList or even a TextBlock is not necessary in the example above, because each item contains just one sentence. If an item contains a substructure of the AST, e.g. a TextBlock, then parseContentList() will split the TextBlock into a ContentList of sentences. The basic example from above is used again as the wiki source to parse:
==Soccer==
The soccer game consists of the following components:
* 2 Teams with 11 players each,
* 3 referees
The game last 90 min.
Parsing will create the following AST of the section above:
- AST Type: `SectionHeader` value: `Soccer`
- AST Type: `TextBlock` value: `The soccer game consists of the following components:`
- AST Type: `OpenBulletList`
- AST Type: `ItemBulletList` value: `2 Teams with 11 players each,`
- AST Type: `ItemBulletList` value: `3 referees`
- AST Type: `CloseBulletList`
- AST Type: `TextBlock` value: `The game last 90 min.`
This example is a linear concatenation of content elements in a ContentList, which does not reflect the syntactical structure of the document: the list items are siblings of the text blocks instead of children of a list. It is therefore recommended to design the BulletList as an extension of the ContentList.
- AST Type: `SectionHeader` value: `Soccer`
- AST Type: `TextBlock` value: `The soccer game consists of the following components:`
- AST Type: `BulletList` as extension of `ContentList`
  - AST Type: `OpenBulletList`
  - AST Type: `ItemBulletList` value: `2 Teams with 11 players each,`
  - AST Type: `ItemBulletList` value: `3 referees`
  - AST Type: `CloseBulletList`
- AST Type: `TextBlock` value: `The game last 90 min.`
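The nested AST above can be turned into output by a small recursive function. The following is a sketch under assumptions: the `renderers` table, the attribute names `value` and `contentlist`, and the format names `'html'` and `'latex'` are made up for this example, plain arrays stand in for `ContentList`, and no escaping of special characters is done.

```javascript
// Hypothetical output table: one entry per AST node type and format
const renderers = {
  SectionHeader: {
    html: n => '<h1>' + n.value + '</h1>\n',
    latex: n => '\\section{' + n.value + '}\n'
  },
  TextBlock: {
    html: n => n.value + '\n',
    latex: n => n.value + '\n'
  },
  OpenBulletList: {
    html: () => '<ul>\n',
    latex: () => '\\begin{itemize}\n'
  },
  ItemBulletList: {
    html: n => '<li>' + n.value + '</li>\n',
    latex: n => '\\item ' + n.value + '\n'
  },
  CloseBulletList: {
    html: () => '</ul>\n',
    latex: () => '\\end{itemize}\n'
  }
};

// Nodes with a contentlist are rendered by recursion, leaves via the table
const toOutput = function(node, format) {
  if (Array.isArray(node.contentlist)) {
    return node.contentlist.map(child => toOutput(child, format)).join('');
  }
  return renderers[node.type][format](node);
};

const section = {
  type: 'Section',
  contentlist: [
    { type: 'SectionHeader', value: 'Soccer' },
    { type: 'TextBlock', value: 'The soccer game consists of the following components:' },
    {
      type: 'BulletList',
      contentlist: [
        { type: 'OpenBulletList' },
        { type: 'ItemBulletList', value: '2 Teams with 11 players each,' },
        { type: 'ItemBulletList', value: '3 referees' },
        { type: 'CloseBulletList' }
      ]
    },
    { type: 'TextBlock', value: 'The game last 90 min.' }
  ]
};

console.log(toOutput(section, 'html'));
console.log(toOutput(section, 'latex'));
```

In both formats the introductory text is emitted before the list and the concluding remark after it, because the recursion visits the content elements in order of appearance.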
- A `Paragraph` is rendered differently than a plain `ContentList`. The `ContentList` as a list of content elements just concatenates the generated output of its elements, while a `Paragraph` wraps the generated output, e.g. for HTML with a `p` tag.
- A `TextBlock` is basically a `ContentList` of `Sentences`, `Citations` and `References`. Tables, Infoboxes, ... were already parsed. Whether inline images are allowed as content elements of the content list `Sentence` must be decided by the `wtf_wikipedia` maintainers (see Images). Inline images are helpful for text comprehension.
The icon [[File:warning_icon.png]] visualizes a warning in the upcoming paragraph.
...
* [[File:warning_icon.png]] be aware of chemical X, it is nephrotoxic and destroys the kidney.
* self-protection should be applied with Y ...
- A table is a `ContentList` of `TableHeader` and `TableRows`,
- A `TableHeader` is a `ContentList` of `THCell`
- A `TableBody` is a `ContentList` of `TableRow`
- A `TableRow` is a `ContentList` of `TableCell`
- A `TableCell` is a `Sentence`, a `TextBlock` or again a `ContentList`
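The table decomposition listed above maps naturally to nested nodes. The sketch below is only an illustration: the `htmlWrappers` table, the function `tableToHTML()` and the attribute names `value` and `contentlist` are assumptions, and plain arrays stand in for `ContentList`.

```javascript
// Hypothetical mapping from AST node type to the wrapping HTML tags
const htmlWrappers = {
  Table: ['<table>', '</table>'],
  TableHeader: ['<thead><tr>', '</tr></thead>'], // THCells need a row in HTML
  TableBody: ['<tbody>', '</tbody>'],
  TableRow: ['<tr>', '</tr>'],
  THCell: ['<th>', '</th>'],
  TableCell: ['<td>', '</td>']
};

const tableToHTML = function(node) {
  // leaves carry a text value, all other nodes carry a contentlist
  const inner = Array.isArray(node.contentlist)
    ? node.contentlist.map(child => tableToHTML(child)).join('')
    : node.value;
  const wrapper = htmlWrappers[node.type];
  // node types without a wrapper (e.g. Sentence) are emitted as plain text
  return wrapper ? wrapper[0] + inner + wrapper[1] : inner;
};

const table = {
  type: 'Table',
  contentlist: [
    {
      type: 'TableHeader',
      contentlist: [
        { type: 'THCell', contentlist: [{ type: 'Sentence', value: 'Role' }] },
        { type: 'THCell', contentlist: [{ type: 'Sentence', value: 'Count' }] }
      ]
    },
    {
      type: 'TableBody',
      contentlist: [
        {
          type: 'TableRow',
          contentlist: [
            { type: 'TableCell', contentlist: [{ type: 'Sentence', value: 'Referees' }] },
            { type: 'TableCell', contentlist: [{ type: 'Sentence', value: '3' }] }
          ]
        }
      ]
    }
  ]
};

console.log(tableToHTML(table));
```

Each level of the table is handled by the same function, which is the practical benefit of describing every level as a content list.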
- Parsing concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: based on concepts of Pandoc, the swiss-army knife of `document conversion` developed by John MacFarlane - https://www.pandoc.org