Commit ac7280d

Merge pull request #50 from pypy/mideind-blog-post
blog post 'Natural Language Processing for Icelandic with PyPy: A Case Study'. Thanks @vthorsteinsson!
2 parents aa87f34 + f9e0dd2 commit ac7280d


posts/2022/02/nlp-icelandic-pypy.rst

Lines changed: 170 additions & 0 deletions

.. title: Natural Language Processing for Icelandic with PyPy: A Case Study
.. slug: nlp-icelandic-case-study
.. date: 2022-02-06 15:00:00 UTC
.. tags: casestudy
.. category:
.. link:
.. description:
.. type: rest
.. author: Vilhjálmur Þorsteinsson

====================================================================
Natural Language Processing for Icelandic with PyPy: A Case Study
====================================================================

`Icelandic <https://en.wikipedia.org/wiki/Icelandic_language>`__ is one of the world’s smallest languages, with about 370,000 speakers. It is a language in the Germanic family, most similar to Norwegian, Danish and Swedish, but closer to the original `Old Norse <https://en.wikipedia.org/wiki/Old_Norse>`__ spoken throughout Scandinavia until about the 14th century CE.

As with other small languages, there are `worries that the language may not survive <https://www.theguardian.com/world/2018/feb/26/icelandic-language-battles-threat-of-digital-extinction>`__ in a digital world, where all kinds of fancy applications are developed first - and perhaps only - for the major languages. Voice assistants, chatbots, spelling and grammar checking utilities, machine translation, etc., are increasingly becoming staples of our personal and professional lives, but if they don’t exist for Icelandic, Icelanders will gravitate towards English or other languages where such tools are readily available.

Iceland is a technology-savvy country, with `world-leading adoption rates of the Internet <https://ourworldindata.org/grapher/share-of-individuals-using-the-internet?tab=table>`__, PCs and smart devices, and a thriving software industry. So the government figured that it would be worthwhile to fund a `5-year plan <https://aclanthology.org/2020.lrec-1.418.pdf>`__ to build natural language processing (NLP) resources and other infrastructure for the Icelandic language. The project focuses on collecting data and developing open source software for a range of core applications, such as tokenization, vocabulary lookup, n-gram statistics, part-of-speech tagging, named entity recognition, spelling and grammar checking, neural language models and speech processing.

------------

My name is Vilhjálmur Þorsteinsson, and I’m the founder and CEO of a software startup `Miðeind <https://mideind.is/english.html>`__ in Reykjavík, Iceland, that employs 10 software engineers and linguists and focuses on NLP and AI for the Icelandic language. The company participates in the government’s language technology program, and has contributed significantly to the program’s core tools (e.g., a tokenizer and a parser), spelling and grammar checking modules, and a neural machine translation stack.

When it came to a choice of programming languages and development tools for the government program, the requirements were for a major, well-supported, vendor-and-OS-agnostic FOSS platform with a large and diverse community, including in the NLP space. The decision to select Python as a foundational language for the project was a relatively easy one. That said, there was a bit of trepidation around the well-known fact that CPython can be slow for inner-core tasks, such as tokenization and parsing, that can see heavy workloads in production.

I first became aware of PyPy in early 2016 when I was developing a crossword game, `Netskrafl <https://github.com/mideind/Netskrafl>`__, in Python 2.7 for Google App Engine. I had a utility program that compressed a dictionary into a Directed Acyclic Word Graph (DAWG) and took 160 seconds to run on CPython 2.7, so I tried PyPy and to my amazement saw a 4x speedup (down to 38 seconds), with literally no effort besides downloading the PyPy runtime.
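
To give a feel for the kind of workload involved: a DAWG is essentially a trie in which identical suffix subtrees have been merged. The snippet below is a deliberately simplified sketch of the trie-building half of such a job - it is not the actual Netskrafl utility, which also merges equivalent nodes and serializes the result into a compact binary form - but it shows the sort of pure-Python, loop- and dict-heavy code on which PyPy’s JIT tends to do very well:

.. code-block:: python

    # Toy trie builder; illustrative only, not the Netskrafl DAWG utility.
    # A real DAWG builder would additionally merge identical suffix subtrees
    # and write the result out as a packed binary structure.

    def build_trie(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})   # descend, creating nodes as needed
            node["$"] = True                     # mark end-of-word
        return root

    def count_nodes(node):
        return 1 + sum(count_nodes(child)
                       for key, child in node.items() if key != "$")

    if __name__ == "__main__":
        words = ["tré", "tréð", "trén", "trefill", "troða"]   # tiny sample list
        print(count_nodes(build_trie(words)), "trie nodes")

Running a script like this unchanged under CPython and then under PyPy is all it takes to compare the two; the 4x figure above came from exactly that kind of side-by-side run.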

This led me to select PyPy as the default Python interpreter for my company’s Python development efforts as well as for our production websites and API servers, a role in which it remains to this day. We have followed PyPy’s upgrades along the way, and are just about to migrate our minimum supported Python version from 3.6 to 3.7.

In NLP, speed and memory requirements can be quite important for software usability. On the other hand, NLP logic and algorithms are often complex and challenging to program, so programmer productivity and code clarity are also critical success factors. A pragmatic approach balances these factors, avoids premature optimization and seeks a careful compromise between maximal run-time efficiency and minimal programming and maintenance effort.

Turning to our use cases, our Icelandic text tokenizer `"Tokenizer" <https://github.com/mideind/Tokenizer>`__ is fairly light, runs tight loops and performs a large number of small, repetitive operations. It runs very well on PyPy’s JIT and has not required further optimization.
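
For those curious what using it looks like, a minimal example is along these lines (simplified; the token objects also carry a ``val`` field and the package offers several options not shown here):

.. code-block:: python

    # Minimal usage sketch of the Tokenizer package (pip install tokenizer).

    from tokenizer import tokenize

    for token in tokenize("Málið var rætt 3. maí kl. 15:30."):
        print(token.kind, token.txt)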

Our Icelandic parser `Greynir <https://github.com/mideind/GreynirPackage>`__ (known on PyPI as `reynir <https://pypi.org/project/reynir/>`__) is, if I may say so myself, a piece of work. It `parses natural language text <https://aclanthology.org/R19-1160.pdf>`__ according to a `hand-written context-free grammar <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/Greynir.grammar>`__, using an `Earley-type algorithm <https://en.wikipedia.org/wiki/Earley_parser>`__ as `enhanced by Scott and Johnstone <https://www.sciencedirect.com/science/article/pii/S0167642309000951>`__. The CFG contains almost 7,000 nonterminals and 6,000 terminals, and the parser handles ambiguity as well as left, right and middle recursion. It returns a packed parse forest for each input sentence, which is then pruned by a scoring heuristic down to a single best result tree.
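
For readers who haven’t met Earley parsing before, the toy recognizer below sketches the chart-based idea - predictor, scanner and completer steps over a tiny three-rule grammar. It is purely didactic and far removed from Greynir’s production parser, which builds a packed parse forest over a vastly larger grammar and, as described below, runs its inner loop in C++:

.. code-block:: python

    # A toy Earley recognizer, for illustration only.

    from collections import namedtuple

    Rule = namedtuple("Rule", "lhs rhs")

    GRAMMAR = [
        Rule("S", ("NP", "VP")),
        Rule("NP", ("noun",)),
        Rule("VP", ("verb", "NP")),
    ]
    NONTERMINALS = {r.lhs for r in GRAMMAR}

    def recognize(tokens, start="S"):
        # chart[i] holds Earley states (rule, dot, origin) ending at position i
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(r, 0, 0) for r in GRAMMAR if r.lhs == start}
        for i in range(len(tokens) + 1):
            changed = True
            while changed:
                changed = False
                for rule, dot, origin in list(chart[i]):
                    if dot < len(rule.rhs):
                        sym = rule.rhs[dot]
                        if sym in NONTERMINALS:
                            # Predictor: expand the nonterminal after the dot
                            new = {(r, 0, i) for r in GRAMMAR if r.lhs == sym}
                        elif i < len(tokens) and tokens[i] == sym:
                            # Scanner: the terminal after the dot matches the next token
                            chart[i + 1].add((rule, dot + 1, origin))
                            continue
                        else:
                            continue
                    else:
                        # Completer: rule finished; advance states waiting on its lhs
                        new = {(r, d + 1, o) for r, d, o in chart[origin]
                               if d < len(r.rhs) and r.rhs[d] == rule.lhs}
                    if not new <= chart[i]:
                        chart[i] |= new
                        changed = True
        return any(dot == len(rule.rhs) and origin == 0 and rule.lhs == start
                   for rule, dot, origin in chart[-1])

    print(recognize(["noun", "verb", "noun"]))   # True
    print(recognize(["noun", "noun"]))           # False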

This parser was originally coded in pure Python and turned out to be unusably slow when run on CPython - but usable on PyPy, where it was 3-4x faster. However, when we started applying it to heavier production workloads, it became apparent that it needed to be faster still. We then proceeded to convert the innermost Earley parsing loop from Python to `tight C++ <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/eparser.cpp>`__ and to call it from PyPy via `CFFI <https://cffi.readthedocs.io/en/latest/>`__, with callbacks for token-terminal matching functions (“business logic”) that remained on the Python side. This made the parser much faster (on the order of 100x faster than the original on CPython) and quick enough for our production use cases. Even after moving much of the heavy processing to C++ and using CFFI, PyPy still gives a significant speed boost over CPython.
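
Stripped to its bare bones, the pattern looks roughly like the sketch below: a small C function compiled with CFFI’s out-of-line API mode, which calls back into Python for each per-item decision. The names used here (``_sketch``, ``count_matches``) are invented for illustration and are not Greynir’s actual interface:

.. code-block:: python

    # build_sketch.py - illustrative CFFI out-of-line build module.
    # Demonstrates the "tight loop in C, business logic as a Python callback"
    # pattern; the function and module names are made up.

    from cffi import FFI

    ffibuilder = FFI()

    ffibuilder.cdef("""
        typedef int (*matcher_t)(int token, int terminal);
        int count_matches(int n_tokens, int n_terminals, matcher_t matcher);
    """)

    ffibuilder.set_source("_sketch", r"""
        typedef int (*matcher_t)(int token, int terminal);

        /* The hot loop lives in C; each decision is delegated to Python. */
        int count_matches(int n_tokens, int n_terminals, matcher_t matcher) {
            int hits = 0;
            for (int t = 0; t < n_tokens; t++)
                for (int nt = 0; nt < n_terminals; nt++)
                    hits += matcher(t, nt);
            return hits;
        }
    """)

    if __name__ == "__main__":
        ffibuilder.compile(verbose=True)

Once the build module has been run (or wired into ``setup.py``, see below), the compiled extension is used from Python - under PyPy or CPython alike - with the callback defined on the Python side:

.. code-block:: python

    from _sketch import ffi, lib

    @ffi.callback("int(int, int)")
    def matcher(token, terminal):
        # "Business logic" stays in Python, where it is easy to write and maintain
        return int(token % 3 == terminal % 3)

    print(lib.count_matches(1000, 10, matcher))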

Connecting C++ code with PyPy proved to be quite painless using CFFI, although we had to figure out a few `magic incantations in our build module <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/eparser_build.py>`__ to make it compile smoothly during setup from source on Windows and macOS in addition to Linux. Of course, we build binary PyPy and CPython wheels for the most common targets so most users don’t have to worry about setup requirements.
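
For packaging, CFFI’s standard setuptools integration is what ties such a build module into the install process; the relevant part of a ``setup()`` call looks something like this (the names and paths are placeholders, not GreynirPackage’s actual setup):

.. code-block:: python

    # Illustrative excerpt from a setup.py using CFFI's setuptools integration.
    from setuptools import setup

    setup(
        name="mypackage",
        setup_requires=["cffi>=1.0.0"],
        install_requires=["cffi>=1.0.0"],
        cffi_modules=["src/mypackage/build_sketch.py:ffibuilder"],
    )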

With the positive experience from the parser project, we proceeded to take a similar approach for two other core NLP packages: our compressed vocabulary package `BinPackage <https://github.com/mideind/BinPackage>`__ (known on PyPI as `islenska <https://pypi.org/project/islenska/>`__) and our trigrams database package `Icegrams <https://github.com/mideind/Icegrams>`__. These packages both take large text input (3.1 million word forms with inflection data in the vocabulary case; 100 million tokens in the trigrams case) and compress it into packed binary structures. These structures are then memory-mapped at run-time using `mmap <https://docs.python.org/3/library/mmap.html>`__ and queried via Python functions with a lookup time in the microseconds range. The low-level data structure navigation is `done in C++ <https://github.com/mideind/Icegrams/blob/master/src/icegrams/trie.cpp>`__, called from Python via CFFI. The ex-ante preparation, packing, bit-fiddling and data structure generation is fast enough with PyPy, so we haven’t seen a need to optimize that part further.
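
Reduced to a minimal sketch, the idea is: pack the data into a flat binary file once, then ``mmap`` it at run-time and navigate it in place, so that a lookup touches only a few pages and nothing has to be loaded into Python objects up front. The file format below (a sorted array of fixed-size records searched by bisection) is invented for illustration; the real packages use far more compact trie- and bitmap-based layouts, with the navigation in C++ as noted above:

.. code-block:: python

    # Minimal sketch of the pack-once / mmap-and-query pattern.
    # The record format here is purely illustrative.

    import mmap
    import struct

    RECORD = struct.Struct("<II")   # (key, value) as two little-endian uint32s

    def pack(pairs, path):
        with open(path, "wb") as f:
            for key, value in sorted(pairs):
                f.write(RECORD.pack(key, value))

    def lookup(mm, key):
        # Binary search directly over the memory-mapped bytes
        lo, hi = 0, len(mm) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            k, v = RECORD.unpack_from(mm, mid * RECORD.size)
            if k == key:
                return v
            if k < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    if __name__ == "__main__":
        pack([(7, 700), (3, 300), (11, 1100)], "data.bin")
        with open("data.bin", "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                print(lookup(mm, 7), lookup(mm, 4))   # -> 700 None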

To showcase our tools, we host public (and open source) websites such as `greynir.is <https://greynir.is/>`__ for our parsing, named entity recognition and query stack, and `yfirlestur.is <https://yfirlestur.is/>`__ for our spelling and grammar checking stack. The server code on these sites is all Python running on PyPy using `Flask <https://flask.palletsprojects.com/en/2.0.x/>`__, wrapped in `gunicorn <https://gunicorn.org/>`__ and hosted behind `nginx <https://www.nginx.com/>`__. The underlying database is `PostgreSQL <https://www.postgresql.org/>`__, accessed via `SQLAlchemy <https://www.sqlalchemy.org/>`__ and `psycopg2cffi <https://pypi.org/project/psycopg2cffi/>`__. This setup has served us well for 6 years and counting: it is fast and reliable, and the communities behind these tools are helpful and supportive.
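
In outline, each of these services is a small Flask application served by gunicorn workers running on PyPy, with nginx in front as a reverse proxy. The snippet below is a generic sketch of that arrangement (the endpoint and worker count are arbitrary examples, not our actual configuration):

.. code-block:: python

    # app.py - generic sketch of a Flask service in the stack described above.
    # Served in production by gunicorn under PyPy, e.g.:
    #     gunicorn -w 4 -b 127.0.0.1:8000 app:app
    # with nginx proxying public traffic to that local port.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/api/parse", methods=["POST"])
    def parse():
        text = request.get_json(force=True).get("text", "")
        # ... hand the text to the NLP pipeline here ...
        return jsonify({"text": text, "status": "ok"})

    if __name__ == "__main__":
        app.run(debug=True)   # development only; gunicorn serves it in production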

As can be inferred from the above, we are avid fans of PyPy and commensurately thankful for the great work by the PyPy team over the years. PyPy has enabled us to use Python for a larger part of our toolset than CPython alone would have supported, and its smooth integration with C/C++ through CFFI has helped us attain a better tradeoff between performance and programmer productivity in our projects. We wish PyPy a great and bright future and also look forward to exciting related developments on the horizon, such as `HPy <https://hpyproject.org/>`__.
