Xponents 3.5 Begin Again, Again
Happy Valentines
Xponents 3.5.5 BeginAgain (Again)
- Full Evaluation: internal evaluation work was redone start to finish to hone outlier gazetteer entries and
  patterns of rogue entries from new data sources. Evaluation work called out and fixed serious false-positive and recall
  errors.
- Log4J Remediation: While Log4J is not the primary choice of logging facility, it is a dependency that appears
  mainly in the Solr 7.x server distribution. Vulnerable Log4J JAR files were removed and the latest ones were injected.
- API Changes:
  - `TextEntity` is a text span and requires a start, end offset pair. The only constructor
    requires that pair. Other subclasses can have a zero-argument constructor by exception, such as `PoLiMatch`.
  - `GeonamesUtility.isCountry()` now only returns true for `PCLI` entries; others are historical country names or territories.
  - REST API now has `method` and `match-id` on most matches to be more consistent.
  - `codes` feature can be requested in REST API: `features=geo,taxons,patterns,codes` for example.
    This will emit tagged acronyms for admin boundaries for now.
  - Xponents Core `TextUtils` now offers trivial text span testing for common punctuation.
    For example, to quickly test if `MARC & U` looks like an entity or is a false positive
    when tagging the phrase `Marc U`, a common punctuation test was needed. These were fairly obvious
    pre-filters to employ just after tagging and before serious reasoning happens.
- Geocoding: Tamped down on acronym false-positives on UPPERCASE and lowercase
  documents given the added gazetteer data includes lots of codes.
  - Default behavior: country codes and province codes are NOT emitted although tagged.
    These are requested explicitly by caller using the `codes` feature. So `USA`
    or `COD` or `MA` are not emitted by default although those bare tokens may represent
    countries or provinces. Such codes qualifying other placenames will be emitted.
  - Gazetteer tagging omissions: numerous transliterated short names for Pacific/Asian islands
    (`A xx`, `I-xx`) and various other false-positive places are NOT tagged, although present in the gazetteer.
  - About 500 dictionary words in French, German and English were added to the stop-filter
    for tokens commonly not places. E.g., `amend`, `adept`, etc.
- Bugs Fixed:
  - Geocoder Rule `HeatMap` memory leak fixed
  - `German` is removed as a country; it's a nationality or an adjective
  - Tagger will throw `ExtractionException` if it tags 100,000 or more locations from gazetteer
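
The punctuation pre-filter described under API Changes can be illustrated with a small Python sketch. This is not the `TextUtils` implementation: the function names `has_irregular_punctuation` and `prefilter`, and the set of characters treated as suspicious, are assumptions made for illustration only. The idea is to flag a tagged span when punctuation that a place name would rarely contain sits between its tokens, as when the gazetteer phrase `Marc U` matches text containing an ampersand.

```python
import re

# ASSUMPTION: characters that rarely occur inside a genuine place name.
# Hyphens, apostrophes, periods and commas are deliberately left out,
# because real place names do contain them.
_SUSPECT_PUNCT = re.compile(r'[&+/\\|@#%*=<>\[\]{}();:!?]')


def has_irregular_punctuation(span_text: str) -> bool:
    """Return True if the span text contains any suspicious punctuation."""
    return _SUSPECT_PUNCT.search(span_text) is not None


def prefilter(text: str, spans):
    """Keep only the (start, end) spans whose text passes the punctuation test."""
    return [(start, end) for start, end in spans
            if not has_irregular_punctuation(text[start:end])]


# Hypothetical tagger output: two candidate spans in one sentence.
text = "Contact MARC & U for details in Paris."
marc = text.index("MARC & U")
paris = text.index("Paris")
candidates = [(marc, marc + len("MARC & U")), (paris, paris + len("Paris"))]
kept = prefilter(text, candidates)
```

Here only the `Paris` span survives; the span containing the ampersand is dropped before any heavier reasoning would run.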
DISTRIBUTIONS:
- Python: See attached Opensextant Python API 1.4.6
- Docker: https://hub.docker.com/r/mubaldino/opensextant - see "xponents-3.5" tag. Now
  `latest` is also a tag
- Gazetteer: see Docker image; copy `xponents-solr` out of docker image to use it outside of Docker
- Java, Maven:
TESTING:
Deploy: https://github.com/OpenSextant/Xponents/blob/master/Examples/Docker/docker-compose.yml
Install client library (ATTACHED)

    pip3 install opensextant-1.4.6.tar.gz

Use Test suite: https://github.com/OpenSextant/Xponents/blob/master/test/xlayer-test-suite.py

    DEFAULT_URL=localhost:8787
    python3 xlayer-test-suite.py $DEFAULT_URL

Test output:
- Consult docker logs on docker container, a la `docker logs xponents`, to see that server is alive
- Review output to console: unit test results for normal geotagging, postal geotagging and tests in Arabic and Japanese should appear.
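
As a manual check alongside the test suite, the `codes` feature from the API changes can be requested directly from the running server. The Python sketch below only builds the HTTP request and does not send it. The endpoint path `/xlayer/rest/process` and the JSON field names `docid` and `text` are assumptions not stated in these notes, so verify them against the Xponents REST documentation or the Python client before relying on them.

```python
import json
import urllib.request


def build_request(text, features=("geo", "taxons", "patterns", "codes"),
                  server="localhost:8787"):
    """Build (but do not send) a POST request asking for the given features."""
    # ASSUMPTION: endpoint path and the "docid"/"text" field names are illustrative.
    url = f"http://{server}/xlayer/rest/process"
    payload = {
        "docid": "example-1",
        "text": text,
        # Comma-separated feature list, as in features=geo,taxons,patterns,codes
        "features": ",".join(features),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("Flights from MA to the USA were delayed.")

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Leaving `codes` out of the `features` tuple should, per the default behavior described above, suppress bare country and province codes in the response.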