A PHP library for converting files tagged with corpus metadata to JSON, PHP, or XML.
Corpus linguistics researchers use a markup-like syntax to provide metadata about texts. For consumption by applications, this syntax needs to be converted into a more universal, machine-readable format. The library outputs JSON, PHP, or XML.
The included /demo/index.php file contains a conversion form demonstration.
Make your code aware of the TagConverter class via your favorite method (e.g., `use` or `require`).
Then pass a string of text into the class:

    $text = TagConverter::json('<MyTag: 123>My tagged text here');
    echo $text;
    // Returns {"MyTag":"123","text":"My tagged text here"}

    $text = TagConverter::php('<MyTag: 123>My tagged text here');
    echo $text;
    // Returns array('MyTag' => '123', 'text' => 'My tagged text here')

    $text = TagConverter::xml('<MyTag: 123>My tagged text here');
    echo $text;
    // Returns <?xml version="1.0"?><root><MyTag>123</MyTag><text>My tagged text here</text></root>

The corpus-style tagging syntax expected by the library is defined as follows:
- Tags must be wrapped in `<` and `>`
- Tag names and tag values may only contain alphanumeric characters, spaces, underscores, and hyphens
- Tag names must be separated from tag values by a `:`
- Spaces at the beginning or end of tag names or tag values are ignored; spaces within tag values will be preserved
- Everything not wrapped in `<` and `>` will be considered "text"
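
The rules above can be sketched as a single regular expression. This is only an illustration, not the library's actual implementation: the function name `parseCorpusTags` is hypothetical, and leaving malformed tags in the text is an assumption, since the rules do not say what happens to them.

```php
<?php
// Hypothetical helper, NOT part of TagConverter: a minimal sketch of the rules above.
function parseCorpusTags(string $input): array
{
    $result = [];

    // "<", a name, ":", a value, ">". Names and values are limited to
    // letters, digits, spaces, underscores, and hyphens; padding is skipped.
    $pattern = '/<\s*([A-Za-z0-9 _-]+?)\s*:\s*([A-Za-z0-9 _-]+?)\s*>/';

    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $result[$match[1]] = $match[2];
    }

    // Everything not wrapped in a valid tag is treated as text.
    // Assumption: malformed tags are left in the text untouched.
    $result['text'] = trim(preg_replace($pattern, '', $input));

    return $result;
}

echo json_encode(parseCorpusTags('<MyTag: 123>My tagged text here'));
// Prints {"MyTag":"123","text":"My tagged text here"}
```

The lazy quantifiers (`+?`) together with the surrounding `\s*` are what drop the padding spaces while keeping the spaces inside a name or value.
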
| Status | Tag Example | Explanation |
|---|---|---|
| Good | `<MyTag:SomeText>` | |
| Good | `<My Tag:Some Text>` | Spaces in tag names & values OK |
| Good | `< My Tag : Some Text >` | Spaces padding tag names & values OK |
| Good | `< My-Tag : Some_Text >` | Underscores & hyphens OK |
| Good | `< My-Tag : Value 1 \| Value 2 >` | Pipe separators for multiple values |
| Good | `< My-Tag : Value 1 ; Value 2 >` | Semicolon separators for multiple values |
| Bad | `< My/Tag : Some:Text >` | Other characters not OK |
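
The multi-value rows in the table can be illustrated with a small splitting helper. This is a sketch under assumptions: `splitTagValues` is a hypothetical name, and how TagConverter itself represents multiple values in its output is not documented here.

```php
<?php
// Hypothetical helper, NOT part of TagConverter: splits a raw tag value
// on pipe or semicolon separators and drops the padding around each part.
function splitTagValues(string $rawValue): array
{
    // PREG_SPLIT_NO_EMPTY discards empty pieces, e.g. from a trailing separator.
    return preg_split('/\s*[|;]\s*/', trim($rawValue), -1, PREG_SPLIT_NO_EMPTY);
}

echo json_encode(splitTagValues('Value 1 | Value 2'));
// Prints ["Value 1","Value 2"]
```

A value without any separator comes back as a one-element array, so callers can treat single and multiple values the same way.
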
Unit tests can be run (after `composer install`) by executing `vendor/bin/phpunit`.
