| 1 | +.. title: Natural Language Processing for Icelandic with PyPy: A Case Study |
| 2 | +.. slug: nlp-icelandic-case-study |
| 3 | +.. date: 2022-02-06 15:00:00 UTC |
| 4 | +.. tags: casestudy |
| 5 | +.. category: |
| 6 | +.. link: |
| 7 | +.. description: |
| 8 | +.. type: rest |
| 9 | +.. author: Vilhjálmur Þorsteinsson |
| 10 | +
| 11 | +==================================================================== |
| 12 | +Natural Language Processing for Icelandic with PyPy: A Case Study |
| 13 | +==================================================================== |
| 14 | + |
| 15 | +`Icelandic <https://en.wikipedia.org/wiki/Icelandic_language>`__ is one |
| 16 | +of the smallest languages of the world, with about 370,000 speakers. It |
| 17 | +is a language in the Germanic family, most similar to Norwegian, Danish |
| 18 | +and Swedish, but closer to the original `Old |
| 19 | +Norse <https://en.wikipedia.org/wiki/Old_Norse>`__ spoken throughout |
| 20 | +Scandinavia until about the 14th century CE. |
| 21 | + |
| 22 | +As with other small languages, there are `worries that the language may |
| 23 | +not |
| 24 | +survive <https://www.theguardian.com/world/2018/feb/26/icelandic-language-battles-threat-of-digital-extinction>`__ |
| 25 | +in a digital world, where all kinds of fancy applications are developed |
| 26 | +first - and perhaps only - for the major languages. Voice assistants, |
| 27 | +chatbots, spelling and grammar checking utilities, machine translation, |
| 28 | +etc., are increasingly becoming staples of our personal and professional |
| 29 | +lives, but if they don’t exist for Icelandic, Icelanders will gravitate |
| 30 | +towards English or other languages where such tools are readily |
| 31 | +available. |
| 32 | + |
| 33 | +Iceland is a technology-savvy country, with `world-leading adoption |
| 34 | +rates of the |
| 35 | +Internet <https://ourworldindata.org/grapher/share-of-individuals-using-the-internet?tab=table>`__, |
| 36 | +PCs and smart devices, and a thriving software industry. So the |
| 37 | +government figured that it would be worthwhile to fund a `5-year |
| 38 | +plan <https://aclanthology.org/2020.lrec-1.418.pdf>`__ to build natural |
| 39 | +language processing (NLP) resources and other infrastructure for the |
| 40 | +Icelandic language. The project focuses on collecting data and |
| 41 | +developing open source software for a range of core applications, such |
| 42 | +as tokenization, vocabulary lookup, n-gram statistics, part-of-speech |
| 43 | +tagging, named entity recognition, spelling and grammar checking, neural |
| 44 | +language models and speech processing. |
| 45 | + |
| 46 | +------------ |
| 47 | + |
| 48 | +My name is Vilhjálmur Þorsteinsson, and I’m the founder and CEO of a |
| 49 | +software startup `Miðeind <https://mideind.is/english.html>`__ in Reykjavík, |
| 50 | +Iceland, that employs 10 software engineers and linguists and focuses on |
| 51 | +NLP and AI for the Icelandic language. The company participates in the |
| 52 | +government’s language technology program, and has contributed |
| 53 | +significantly to the program’s core tools (e.g., a tokenizer and a |
| 54 | +parser), spelling and grammar checking modules, and a neural machine |
| 55 | +translation stack. |
| 56 | + |
| 57 | +When it came to a choice of programming languages and development tools |
| 58 | +for the government program, the requirements were for a major, |
| 59 | +well-supported, vendor- and OS-agnostic FOSS platform with a large and diverse |
| 60 | +community, including in the NLP space. The decision to select Python as |
| 61 | +a foundational language for the project was a relatively easy one. That |
| 62 | +said, there was a bit of trepidation around the well-known fact that |
| 63 | +CPython can be slow for inner-core tasks, such as tokenization and |
| 64 | +parsing, that can see heavy workloads in production. |
| 65 | + |
| 66 | +I first became aware of PyPy in early 2016 when I was developing a |
| 67 | +crossword game `Netskrafl <https://github.com/mideind/Netskrafl>`__ in Python 2.7 |
| 68 | +for Google App Engine. I had a utility program that compressed a |
| 69 | +dictionary into a Directed Acyclic Word Graph and was taking 160 |
| 70 | +seconds to run on CPython 2.7, so I tried PyPy and to my amazement saw |
| 71 | +a 4x speedup (down to 38 seconds), with literally no effort besides |
| 72 | +downloading the PyPy runtime. |
| 73 | + |
| 74 | +This led me to select PyPy as the default Python interpreter for my |
| 75 | +company’s Python development efforts as well as for our production |
| 76 | +websites and API servers, a role in which it remains to this day. We |
| 77 | +have followed PyPy’s upgrades along the way, and are about to raise |
| 78 | +our minimum required language version from 3.6 to 3.7. |
| 79 | + |
| 80 | +In NLP, speed and memory requirements can be quite important for |
| 81 | +software usability. On the other hand, NLP logic and algorithms are |
| 82 | +often complex and challenging to program, so programmer productivity and |
| 83 | +code clarity are also critical success factors. A pragmatic approach |
| 84 | +balances these factors, avoids premature optimization and seeks a |
| 85 | +careful compromise between maximal run-time efficiency and minimal |
| 86 | +programming and maintenance effort. |
| 87 | + |
| 88 | +Turning to our use cases, our Icelandic text |
| 89 | +tokenizer `"Tokenizer" <https://github.com/mideind/Tokenizer>`__ is fairly light, |
| 90 | +runs tight loops and performs a large number of small, repetitive |
| 91 | +operations. It runs very well on PyPy’s JIT and has not required further |
| 92 | +optimization. |
| 93 | + |
| 94 | +Our Icelandic parser `Greynir <https://github.com/mideind/GreynirPackage>`__ |
| 95 | +(known on PyPI as `reynir <https://pypi.org/project/reynir/>`__) is, |
| 96 | +if I may say so myself, a piece of work. It `parses natural language |
| 97 | +text <https://aclanthology.org/R19-1160.pdf>`__ according to a |
| 98 | +`hand-written context-free |
| 99 | +grammar <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/Greynir.grammar>`__, |
| 100 | +using an `Earley-type |
| 101 | +algorithm <https://en.wikipedia.org/wiki/Earley_parser>`__ as `enhanced |
| 102 | +by Scott and |
| 103 | +Johnstone <https://www.sciencedirect.com/science/article/pii/S0167642309000951>`__. |
| 104 | +The CFG contains almost 7,000 nonterminals and 6,000 terminals, and the |
| 105 | +parser handles ambiguity as well as left, right and middle recursion. It |
| 106 | +returns a packed parse forest for each input sentence, which is then |
| 107 | +pruned by a scoring heuristic down to a single best result tree. |
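
To give a flavour of the core technique, here is a minimal sketch of a plain Earley *recognizer* in Python. It is not Greynir's implementation: it only answers whether a token sequence is derivable, builds no packed parse forest, omits the Scott and Johnstone enhancements, and does not handle empty rules. The toy grammar, the ``Item`` tuple and the ``recognize`` function are all hypothetical names chosen for illustration, and tokens are assumed to be already mapped to terminal categories.

```python
from collections import namedtuple

# An Earley item: a rule (head -> body), a dot position within the body,
# and the input position where recognition of this rule started.
Item = namedtuple("Item", "head body dot origin")

# Hypothetical toy grammar. Keys are nonterminals; any symbol that is not
# a key is treated as a terminal and compared directly with a token.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("det", "noun"), ("noun",), ("NP", "PP")],  # left-recursive
    "VP": [("verb", "NP"), ("VP", "PP")],
    "PP": [("prep", "NP")],
}


def recognize(tokens, grammar=GRAMMAR, start="S"):
    """Return True if `tokens` can be derived from `start` (no empty rules)."""
    n = len(tokens)
    chart = [[] for _ in range(n + 1)]  # one item list per input position

    def add(pos, item):
        # Deduplication is what keeps left recursion from looping forever.
        if item not in chart[pos]:
            chart[pos].append(item)

    for body in grammar[start]:
        add(0, Item(start, body, 0, 0))

    for pos in range(n + 1):
        i = 0
        while i < len(chart[pos]):  # the list may grow while we process it
            item = chart[pos][i]
            i += 1
            if item.dot < len(item.body):
                symbol = item.body[item.dot]
                if symbol in grammar:
                    # Predictor: expand the nonterminal after the dot.
                    for body in grammar[symbol]:
                        add(pos, Item(symbol, body, 0, pos))
                elif pos < n and tokens[pos] == symbol:
                    # Scanner: the terminal matches the current token.
                    add(pos + 1, item._replace(dot=item.dot + 1))
            else:
                # Completer: advance every item that was waiting for this head.
                for parent in list(chart[item.origin]):
                    if (parent.dot < len(parent.body)
                            and parent.body[parent.dot] == item.head):
                        add(pos, parent._replace(dot=parent.dot + 1))

    return any(
        it.head == start and it.dot == len(it.body) and it.origin == 0
        for it in chart[n]
    )
```

Calling ``recognize("det noun verb det noun prep det noun".split())`` exercises both the left recursion and the attachment ambiguity of the prepositional phrase. A real parser would additionally record how each item was derived, so that a parse forest can be built from the chart.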
| 108 | + |
| 109 | +This parser was originally coded in pure Python and turned out to be |
| 110 | +unusably slow when run on CPython - but usable on PyPy, where it was |
| 111 | +3-4x faster. However, when we started applying it to heavier production |
| 112 | +workloads, it became apparent that it needed to be faster still. We |
| 113 | +then proceeded to convert the innermost Earley parsing loop from Python |
| 114 | +to `tight |
| 115 | +C++ <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/eparser.cpp>`__ |
| 116 | +and to call it from PyPy via |
| 117 | +`CFFI <https://cffi.readthedocs.io/en/latest/>`__, with callbacks for |
| 118 | +token-terminal matching functions (“business logic”) that remained on |
| 119 | +the Python side. This made the parser much faster (on the order of 100x |
| 120 | +faster than the original on CPython) and quick enough for our production |
| 121 | +use cases. Even after moving much of the heavy processing to C++ and using CFFI, PyPy still gives a significant speed boost over CPython. |
| 122 | + |
| 123 | +Connecting C++ code with PyPy proved to be quite painless using CFFI, |
| 124 | +although we had to figure out a few `magic incantations in our build |
| 125 | +module <https://github.com/mideind/GreynirPackage/blob/master/src/reynir/eparser_build.py>`__ |
| 126 | +to make it compile smoothly during setup from source on Windows and |
| 127 | +macOS in addition to Linux. Of course, we build binary PyPy and CPython |
| 128 | +wheels for the most common targets so most users don’t have to worry |
| 129 | +about setup requirements. |
| 130 | + |
| 131 | +With the positive experience from the parser project, we proceeded to |
| 132 | +take a similar approach for two other core NLP packages: our compressed |
| 133 | +vocabulary package `BinPackage <https://github.com/mideind/BinPackage>`__ |
| 134 | +(known on PyPI as `islenska <https://pypi.org/project/islenska/>`__) and our |
| 135 | +trigrams database package `Icegrams <https://github.com/mideind/Icegrams>`__. |
| 136 | +These packages both take large text input (3.1 million word forms with |
| 137 | +inflection data in the vocabulary case; 100 million tokens in the |
| 138 | +trigrams case) and compress it into packed binary structures. These |
| 139 | +structures are then memory-mapped at run-time using |
| 140 | +`mmap <https://docs.python.org/3/library/mmap.html>`__ and queried via |
| 141 | +Python functions with a lookup time in the microseconds range. The |
| 142 | +low-level data structure navigation is `done in |
| 143 | +C++ <https://github.com/mideind/Icegrams/blob/master/src/icegrams/trie.cpp>`__, |
| 144 | +called from Python via CFFI. The ex-ante preparation, packing, |
| 145 | +bit-fiddling and data structure generation is fast enough with PyPy, so |
| 146 | +we haven’t seen a need to optimize that part further. |
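
The general pattern of packing data ahead of time and querying it through a memory map can be sketched in pure Python as follows. This is an illustration under simplifying assumptions, not the actual format used by BinPackage or Icegrams: the fixed-size ``(key, value)`` record layout and the ``build`` and ``lookup`` functions are hypothetical, and the navigation here is a simple binary search in Python rather than trie traversal in C++.

```python
import mmap
import struct
import tempfile

# Hypothetical record layout: two little-endian unsigned 32-bit integers.
RECORD = struct.Struct("<II")


def build(path, pairs):
    """Write (key, value) pairs to `path` as fixed-size records sorted by key."""
    with open(path, "wb") as f:
        for key, value in sorted(pairs):
            f.write(RECORD.pack(key, value))


def lookup(buf, key):
    """Binary-search the packed records in `buf`; return the value or None."""
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, v = RECORD.unpack_from(buf, mid * RECORD.size)
        if k == key:
            return v
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None


if __name__ == "__main__":
    # Preparation step: done once, ahead of time.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.close()
    build(tmp.name, [(30, 300), (10, 100), (20, 200)])

    # Run-time step: map the file read-only and query it in place.
    with open(tmp.name, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            print(lookup(buf, 20), lookup(buf, 25))
```

Because the file is mapped rather than read, nothing is parsed or copied at start-up, and the operating system pages in only the parts of the structure that queries actually touch.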
| 147 | + |
| 148 | +To showcase our tools, we host public (and open source) websites such as |
| 149 | +`greynir.is <https://greynir.is/>`__ for our parsing, named entity |
| 150 | +recognition and query stack and |
| 151 | +`yfirlestur.is <https://yfirlestur.is/>`__ for our spelling and grammar |
| 152 | +checking stack. The server code on these sites is all Python running on |
| 153 | +PyPy using `Flask <https://flask.palletsprojects.com/en/2.0.x/>`__, |
| 154 | +wrapped in `gunicorn <https://gunicorn.org/>`__ and hosted on |
| 155 | +`nginx <https://www.nginx.com/>`__. The underlying database is |
| 156 | +`PostgreSQL <https://www.postgresql.org/>`__ accessed via |
| 157 | +`SQLAlchemy <https://www.sqlalchemy.org/>`__ and |
| 158 | +`psycopg2cffi <https://pypi.org/project/psycopg2cffi/>`__. This setup |
| 159 | +has served us well for 6 years and counting, being fast and reliable, |
| 160 | +with helpful and supportive communities. |
| 161 | + |
| 162 | +As can be inferred from the above, we are avid fans of PyPy and |
| 163 | +commensurately thankful for the great work by the PyPy team over the |
| 164 | +years. PyPy has enabled us to use Python for a larger part of our |
| 165 | +toolset than CPython alone would have supported, and its smooth |
| 166 | +integration with C/C++ through CFFI has helped us attain a better |
| 167 | +tradeoff between performance and programmer productivity in our |
| 168 | +projects. We wish for PyPy a great and bright future and also look |
| 169 | +forward to exciting related developments on the horizon, such as |
| 170 | +`HPy <https://hpyproject.org/>`__. |