Commit bedf5ca

draft blog post by Vilhjálmur Þorsteinsson (@vthorsteinsson)
1 parent aa87f34 commit bedf5ca

posts/2022/02/nlp-icelandic-pypy.rst

Lines changed: 168 additions & 0 deletions

.. title: Natural Language Processing for Icelandic with PyPy: A Case Study
.. slug: nlp-icelandic-case-study
.. date: 2022-02-06 15:00:00 UTC
.. tags: casestudy
.. category:
.. link:
.. description:
.. type: rest
.. author: Vilhjálmur Þorsteinsson

====================================================================
Natural Language Processing for Icelandic with PyPy: A Case Study
====================================================================

`Icelandic <https://en.wikipedia.org/wiki/Icelandic_language>`__ is one
of the smallest languages of the world, with about 370,000 speakers. It
is a language in the Germanic family, most similar to Norwegian, Danish
and Swedish, but closer to the original `Old
Norse <https://en.wikipedia.org/wiki/Old_Norse>`__ spoken throughout
Scandinavia until about the 14th century CE.

As with other small languages, there are `worries that the language may
not
survive <https://www.theguardian.com/world/2018/feb/26/icelandic-language-battles-threat-of-digital-extinction>`__
in a digital world, where all kinds of fancy applications are developed
first - and perhaps only - for the major languages. Voice assistants,
chatbots, spelling and grammar checking utilities, machine translation,
etc., are increasingly becoming staples of our personal and professional
lives, but if they don’t exist for Icelandic, Icelanders will gravitate
towards English or other languages where such tools are readily
available.

Iceland is a technology-savvy country, with `world-leading adoption
rates of the
Internet <https://ourworldindata.org/grapher/share-of-individuals-using-the-internet?tab=table>`__,
PCs and smart devices, and a thriving software industry. So the
government figured that it would be worthwhile to fund a `5-year
plan <https://aclanthology.org/2020.lrec-1.418.pdf>`__ to build natural
language processing (NLP) resources and other infrastructure for the
Icelandic language. The project focuses on collecting data and
developing open source software for a range of core applications, such
as tokenization, vocabulary lookup, n-gram statistics, part-of-speech
tagging, named entity recognition, spelling and grammar checking, neural
language models and speech processing.

------------

My name is Vilhjálmur Þorsteinsson, and I’m the founder and CEO of a
`software startup <https://mideind.is/english.html>`__ in Reykjavík,
Iceland, that employs 10 software engineers and linguists and focuses on
NLP and AI for the Icelandic language. The company participates in the
government’s language technology program, and has contributed
significantly to the program’s core tools (e.g., a tokenizer and a
parser), spelling and grammar checking modules, and a neural machine
translation stack.

When it came to a choice of programming languages and development tools
for the government program, the requirements were for a major,
well-supported, vendor-and-OS-agnostic FOSS platform with a large and
diverse community, including in the NLP space. The decision to select
Python as a foundational language for the project was a relatively easy
one. That said, there was a bit of trepidation around the well-known
fact that CPython can be slow for inner-core tasks, such as tokenization
and parsing, that can see heavy workloads in production.

I first became aware of PyPy in early 2016 when I was developing `a
crossword game <https://github.com/mideind/Netskrafl>`__ in Python 2.7
for Google App Engine. I had a utility program that compressed a
dictionary into a Directed Acyclic Word Graph and took 160 seconds to
run on CPython 2.7, so I tried PyPy and to my amazement saw a 4x speedup
(down to 38 seconds), with literally no effort besides downloading the
PyPy runtime.
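
For readers unfamiliar with the data structure, here is a minimal sketch of the idea behind a Directed Acyclic Word Graph: build a trie, then merge subtrees that accept the same set of suffixes. It is an illustration only, not the Netskrafl utility; the names ``Node``, ``build_trie``, ``minimize``, ``contains`` and ``count_nodes`` are hypothetical, and a production packer would also serialize the graph into a compact binary form.

```python
class Node:
    """A trie node: outgoing edges keyed by letter, plus an end-of-word flag."""
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}
        self.final = False


def build_trie(words):
    """Insert every word into a plain (unshared) trie and return its root."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.edges:
                node.edges[ch] = Node()
            node = node.edges[ch]
        node.final = True
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up and return the canonical node."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # The children are canonical by now, so their identities fully
    # describe the subtree hanging below this node.
    signature = (
        node.final,
        tuple(sorted((ch, id(child)) for ch, child in node.edges.items())),
    )
    return registry.setdefault(signature, node)


def contains(root, word):
    """Follow the edges for each letter; the word must end on a final node."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_nodes(root):
    """Count distinct nodes reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


WORDS = ["tap", "taps", "top", "tops"]
trie_size = count_nodes(build_trie(WORDS))
dawg = minimize(build_trie(WORDS), {})
print(trie_size, count_nodes(dawg))
```

For these four words the trie has 8 nodes and the minimized graph has 5, because the shared endings ``p`` and ``ps`` are stored once. Loops like these, with many small object and dictionary operations, are the kind of workload the paragraph above describes.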

This led me to select PyPy as the default Python interpreter for my
company’s Python development efforts as well as for our production
websites and API servers, a role in which it remains to this day. We
have followed PyPy’s upgrades along the way, and are just about to raise
our minimum required language version from 3.6 to 3.7.

In NLP, speed and memory requirements can be quite important for
software usability. On the other hand, NLP logic and algorithms are
often complex and challenging to program, so programmer productivity and
code clarity are also critical success factors. A pragmatic approach
balances these factors, avoids premature optimization and seeks a
careful compromise between maximal run-time efficiency and minimal
programming and maintenance effort.

Turning to our use cases, our `Icelandic text
tokenizer <https://github.com/mideind/Tokenizer>`__ is fairly light,
runs tight loops and performs a large number of small, repetitive
operations. It runs very well on PyPy’s JIT and has not required further
optimization.

Our `Icelandic parser <https://github.com/mideind/GreynirPackage>`__ is,
if I may say so myself, a piece of work. It `parses natural language
text <https://aclanthology.org/R19-1160.pdf>`__ according to a
`hand-written context-free
grammar <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/Greynir.grammar>`__,
using an `Earley-type
algorithm <https://en.wikipedia.org/wiki/Earley_parser>`__ as `enhanced
by Scott and
Johnstone <https://www.sciencedirect.com/science/article/pii/S0167642309000951>`__.
The CFG contains almost 7,000 nonterminals and 6,000 terminals, and the
parser handles ambiguity as well as left, right and middle recursion. It
returns a packed parse forest for each input sentence, which is then
pruned by a scoring heuristic down to a single best result tree.
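
To give a feel for the core loop of an Earley-type algorithm, here is a minimal textbook recognizer. It is a sketch under simplifying assumptions, not the Greynir parser: it only answers whether a token sequence is derivable, it builds no packed parse forest (the Scott and Johnstone enhancement is not shown), and it assumes the grammar has no empty productions. The toy grammar and the names ``Item``, ``earley_recognize`` and ``GRAMMAR`` are hypothetical.

```python
from collections import namedtuple

# An Earley item: a rule (lhs -> rhs) with a dot position, plus the input
# position at which recognition of that rule started.
Item = namedtuple("Item", "lhs rhs dot origin")


def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from the `start` symbol.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples).
    Symbols that are not keys of `grammar` are treated as terminals.
    Assumes that no right-hand side is empty.
    """
    n = len(tokens)
    chart = [[] for _ in range(n + 1)]    # ordered work lists, one per position
    seen = [set() for _ in range(n + 1)]  # for de-duplication

    def add(pos, item):
        if item not in seen[pos]:
            seen[pos].add(item)
            chart[pos].append(item)

    for rhs in grammar[start]:
        add(0, Item(start, rhs, 0, 0))

    for pos in range(n + 1):
        i = 0
        while i < len(chart[pos]):  # the list may grow while we walk it
            item = chart[pos][i]
            i += 1
            if item.dot < len(item.rhs):
                symbol = item.rhs[item.dot]
                if symbol in grammar:
                    # Predict: expand the nonterminal after the dot.
                    for rhs in grammar[symbol]:
                        add(pos, Item(symbol, rhs, 0, pos))
                elif pos < n and tokens[pos] == symbol:
                    # Scan: the terminal matches the next token.
                    add(pos + 1, item._replace(dot=item.dot + 1))
            else:
                # Complete: advance every item that was waiting for item.lhs.
                for parent in list(chart[item.origin]):
                    if (parent.dot < len(parent.rhs)
                            and parent.rhs[parent.dot] == item.lhs):
                        add(pos, parent._replace(dot=parent.dot + 1))

    return any(
        item.lhs == start and item.dot == len(item.rhs) and item.origin == 0
        for item in chart[n]
    )


# Toy grammar over word categories, with left recursion and an ambiguous
# prepositional-phrase attachment.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("noun",), ("NP", "PP")],
    "VP": [("verb", "NP"), ("VP", "PP")],
    "PP": [("prep", "NP")],
}

print(earley_recognize(GRAMMAR, "S", ["noun", "verb", "noun", "prep", "noun"]))
```

The nested predict, scan and complete steps run once per chart item, which is why this inner loop dominates the running time on a large grammar.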

This parser was originally coded in pure Python and turned out to be
unusably slow when run on CPython - but usable on PyPy, where it was
3-4x faster. However, when we started applying it to heavier production
workloads, it became apparent that it needed to be faster still. We
then proceeded to convert the innermost Earley parsing loop from Python
to `tight
C++ <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/eparser.cpp>`__
and to call it from PyPy via
`CFFI <https://cffi.readthedocs.io/en/latest/>`__, with callbacks for
token-terminal matching functions (“business logic”) that remained on
the Python side. This made the parser much faster (on the order of 100x
faster than the original on CPython) and quick enough for our production
use cases.
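
The pattern of a native inner loop that calls back into Python for its decision logic can be sketched as follows. Note the substitutions: this uses the standard library's ``ctypes`` rather than CFFI, and the C library's ``qsort`` rather than a custom C++ parser, so that it runs without a compiler or extra packages. It assumes a POSIX system where the C runtime can be loaded this way, and ``sort_with_python_logic`` is a hypothetical name.

```python
import ctypes
import ctypes.util

# Load the C runtime. On POSIX, if find_library returns None, CDLL falls
# back to the symbols already loaded into the current process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

INT_P = ctypes.POINTER(ctypes.c_int)
# C signature of the callback: int (*)(const int *, const int *)
COMPARE = ctypes.CFUNCTYPE(ctypes.c_int, INT_P, INT_P)

libc.qsort.restype = None
libc.qsort.argtypes = [
    ctypes.c_void_p, ctypes.c_size_t, ctypes.c_size_t, COMPARE,
]


def sort_with_python_logic(values, key):
    """Sort ints in native code, asking Python how to compare each pair."""

    def compare(a, b):
        # Runs on the Python side each time the C loop needs a decision.
        ka, kb = key(a[0]), key(b[0])
        return (ka > kb) - (ka < kb)

    array = (ctypes.c_int * len(values))(*values)
    libc.qsort(array, len(array), ctypes.sizeof(ctypes.c_int), COMPARE(compare))
    return list(array)


print(sort_with_python_logic([5, -3, 9, 1], key=abs))
```

The loop itself runs in C, while the comparison rule stays in Python where it is easy to change. The same division of labor applies when the native side is a parser and the callback decides whether a token matches a terminal.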

Connecting C++ code with PyPy proved to be quite painless using CFFI,
although we had to figure out a few `magic incantations in our build
module <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/eparser_build.py>`__
to make it compile smoothly during setup from source on Windows and
macOS in addition to Linux. Of course, we build binary PyPy and CPython
wheels for the most common targets so most users don’t have to worry
about setup requirements.

With the positive experience from the parser project, we proceeded to
take a similar approach for two other core NLP packages: our `compressed
vocabulary package <https://github.com/mideind/BinPackage>`__ and our
`trigrams database package <https://github.com/mideind/Icegrams>`__.
These packages both take large text input (3.1 million word forms with
inflection data in the vocabulary case; 100 million tokens in the
trigrams case) and compress it into packed binary structures. These
structures are then memory-mapped at run-time using
`mmap <https://docs.python.org/3/library/mmap.html>`__ and queried via
Python functions with a lookup time in the microseconds range. The
low-level data structure navigation is `done in
C++ <https://github.com/mideind/Icegrams/blob/master/src/icegrams/trie.cpp>`__,
called from Python via CFFI. The ex-ante preparation, packing,
bit-fiddling and data structure generation is fast enough with PyPy, so
we haven’t seen a need to optimize that part further.
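
The pack-then-memory-map approach can be illustrated with the standard library alone. The sketch below is a deliberately simplified stand-in, not the BinPackage or Icegrams format: it assumes a hypothetical fixed-width record layout (a 16-byte key plus a 32-bit value), sorts the records at build time, and answers lookups by binary search directly over the mapped file. Keys longer than 16 UTF-8 bytes are not supported here.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 16-byte zero-padded key, little-endian uint32.
RECORD = struct.Struct("<16sI")


def build(path, entries):
    """Write a dict of str -> int as fixed-width records sorted by key."""
    rows = sorted((word.encode("utf-8"), value) for word, value in entries.items())
    with open(path, "wb") as f:
        for key, value in rows:
            f.write(RECORD.pack(key, value))  # "16s" pads short keys with zeros


def lookup(mm, word):
    """Binary search over the mapped records; returns the value or None."""
    key = word.encode("utf-8").ljust(16, b"\0")
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        rec_key, value = RECORD.unpack_from(mm, mid * RECORD.size)
        if rec_key == key:
            return value
        if rec_key < key:
            lo = mid + 1
        else:
            hi = mid
    return None


def demo():
    """Build a tiny vocabulary file, map it read-only and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vocab.bin")
        build(path, {"hestur": 1, "köttur": 2, "hundur": 3})
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return lookup(mm, "köttur"), lookup(mm, "fiskur")


print(demo())
```

Because the file is mapped rather than read, nothing is parsed at start-up and the operating system pages data in on demand. Real packed structures such as tries use variable-length encodings instead of fixed-width records, which is where the bit-fiddling mentioned above comes in.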

To showcase our tools, we host public (and open source) websites such as
`greynir.is <https://greynir.is/>`__ for our parsing, named entity
recognition and query stack and
`yfirlestur.is <https://yfirlestur.is/>`__ for our spell and grammar
checking stack. The server code on these sites is all Python running on
PyPy using `Flask <https://flask.palletsprojects.com/en/2.0.x/>`__,
run under `gunicorn <https://gunicorn.org/>`__ and served through
`nginx <https://www.nginx.com/>`__. The underlying database is
`PostgreSQL <https://www.postgresql.org/>`__ accessed via
`SQLAlchemy <https://www.sqlalchemy.org/>`__ and
`psycopg2cffi <https://pypi.org/project/psycopg2cffi/>`__. This setup
has served us well for 6 years and counting: it is fast and reliable,
and each component has a helpful and supportive community.

As can be inferred from the above, we are avid fans of PyPy and
commensurately thankful for the great work by the PyPy team over the
years. PyPy has enabled us to use Python for a larger part of our
toolset than CPython alone would have supported, and its smooth
integration with C/C++ through CFFI has helped us attain a better
tradeoff between performance and programmer productivity in our
projects. We wish PyPy a great and bright future and also look
forward to exciting related developments on the horizon, such as
`HPy <https://hpyproject.org/>`__.
