Commit 5958c69

Merge branch 'master' of github.com:Mathics3/mathics-scanner
2 parents 27e4d58 + 83f430f commit 5958c69

20 files changed

+526
-82
lines changed

.github/workflows/osx.yaml

Lines changed: 1 addition & 0 deletions
@@ -28,4 +28,5 @@ jobs:
     - name: Test Mathics Scanner
       run: |
         pip install pytest
+        python -m mathics_scanner.generate.build_tables
         make check

.github/workflows/ubuntu.yaml

Lines changed: 1 addition & 0 deletions
@@ -27,4 +27,5 @@ jobs:
     - name: Test Mathics Scanner
       run: |
         pip install pytest
+        python -m mathics_scanner.generate.build_tables
         make check

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+*~
 *.c
 *.cpp
 *.egg

Makefile

Lines changed: 8 additions & 1 deletion
@@ -21,7 +21,7 @@ all: develop
 
 mathics_scanner/data/characters.json: mathics_scanner/data/named-characters.yml
 	$(PIP) install -r requirements-dev.txt
-	$(PYTHON) mathics_scanner/build_tables.py
+	$(PYTHON) mathics_scanner/generate/build_tables.py
 
 #: build everything needed to install
 build: mathics_scanner/data/characters.json
@@ -47,6 +47,13 @@ clean:
 pytest: mathics_scanner/data/characters.json
 	py.test test $o
 
+#: Print to stdout a GNU Readline inputrc without Unicode
+inputrc-no-unicode:
+	$(PYTHON) -m mathics_scanner.generate.rl_inputrc inputrc-no-unicode
+
+#: Print to stdout a GNU Readline inputrc with Unicode
+inputrc-unicode:
+	$(PYTHON) -m mathics_scanner.generate.rl_inputrc inputrc-unicode
 
 #: Remove ChangeLog
 rmChangeLog:

README.rst

Lines changed: 24 additions & 23 deletions
@@ -5,44 +5,45 @@ Mathics Scanner
 
 This is the tokeniser or scanner portion for the Wolfram Language.
 
-As such, it also contains a full set of translation between WL Character names, their Unicode names and code points,
-and other character metadata such as whether the character is "letter like".
+As such, it also contains a full set of translations between Wolfram Language
+named characters, their Unicode/ASCII equivalents and code-points.
 
 Uses
-====
+----
 
-This is used as the scanner inside `Mathics <https://mathics.org>`_ but it can also be used for tokenizing and formatting WL code. In fact we intend to write one.
+This is used as the scanner inside `Mathics <https://mathics.org>`_ but it can
+also be used for tokenizing and formatting Wolfram Language code. In fact we
+intend to write one. This library is also quite useful if you need to work
+with Wolfram Language named characters and convert them to various formats.
 
-Implementation
-==============
-
-mathics_scaner.characters
--------------------------
+Usage
+-----
 
-This module consists mostly of translation tables between WL and unicode/ascii.
-Because of the large size of this tables, it was decided to store them in a
-file and read them from disk at runtime (when the module is imported). Our
-tests showed that storing the tables as JSON and using
-`ujson <https://github.com/ultrajson/ultrajson>`_ to read them is the most
-efficient way to access them. However, this is merelly an implementation
-detail and consumers of this library should not relly on this assumption.
+- For tokenizing and scanning Wolfram Language code, use the
+  ``mathics_scanner.tokenizer.Tokenizer`` class.
+- To convert between Wolfram Language named characters and Unicode/ASCII, use
+  the ``mathics_scanner.characters.replace_wl_with_plain_text`` and
+  ``mathics_scanner.characters.replace_unicode_with_wl`` functions.
+- To convert between qualified names of named characters (such as ``FormalA`` for
+  ``\[FormalA]``) and Wolfram's internal representation use the
+  ``mathics_scanner.characters.named_characters`` dictionary.
 
-For maintainability and effeciency, we decided to store this data in a
-human-readable YAML file (`data/named-characters.yml`) and compile them into
-the JSON tables used internally by the library (`data/characters.json`) for
-faster access at runtime. The conversion of the data is performed by the
-script `mathics_scanner/build-tables.py`.
+Implementation
+--------------
 
+For notes on the implementation of the packages or details on the conversion
+scheme please read ``implementation.rst``.
 
 Contributing
 ------------
 
-Please feel encouraged to contribute to Mathics! Create your own fork, make the desired changes, commit, and make a pull request.
-
+Please feel encouraged to contribute to this package or Mathics! Create your
+own fork, make the desired changes, commit, and make a pull request.
 
 License
 -------
 
 Mathics is released under the GNU General Public License Version 3 (GPL3).
 
 .. |Workflows| image:: https://github.com/Mathics3/mathics-scanner/workflows/Mathics%20(ubuntu)/badge.svg
+
implementation.rst

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+mathics_scanner.characters
+==========================
+
+This module consists mostly of translation tables between Wolfram's internal
+representation and Unicode/ASCII. For maintainability, it was decided to store
+this data in a human-readable YAML table (in ``data/named-characters.yml``).
+
+The YAML table mainly contains information about how to convert a
+named character to Unicode and back. If a given character has a direct Unicode
+equivalent (a Unicode character whose description is similar to the named
+character's), this is specified by the ``unicode-equivalent`` field in the YAML
+table. Note that multiple named characters may share a common
+``unicode-equivalent`` field. Also, if a named character has a Unicode
+equivalent, its ``unicode-equivalent`` field need not consist of a single
+Unicode code-point. For example, the Unicode equivalent of ``\[FormalAlpha]``
+is ``U+03B1 U+0323`` (or ``GREEK SMALL LETTER ALPHA + COMBINING DOT BELOW``).
+
+If a named character has a ``unicode-equivalent`` field whose description fits
+the precise description of the character then its ``has-unicode-inverse``
+field in the YAML table is set to ``true``.
+
+The conversion routines ``replace_wl_with_plain_text`` and
+``replace_unicode_with_wl`` use this information to convert between Wolfram's
+internal format and standard Unicode, but it should be noted that the
+conversion scheme is more complex than a simple lookup in the YAML table.
+
+The Conversion Scheme
+---------------------
+
+The ``replace_wl_with_plain_text`` function converts text from Wolfram's
+internal representation to standard Unicode *or* ASCII. If set to ``True``, the
+``use_unicode`` argument indicates to ``replace_wl_with_plain_text`` that the
+input should be converted to standard Unicode. If set to ``False``,
+``use_unicode`` indicates to ``replace_wl_with_plain_text`` that it should only
+output standard ASCII.
+
+The algorithm for converting from Wolfram's internal representation to standard
+Unicode is the following:
+
+- If a character has a direct Unicode equivalent then the character is replaced
+  by its Unicode equivalent.
+- If a character doesn't have a Unicode equivalent then the character is
+  replaced by its fully qualified name. For example, the ``\[AliasIndicator]``
+  character (or ``U+F768`` in Wolfram's internal representation) is replaced by
+  the Python string ``"\\[AliasIndicator]"``.
+
+The algorithm for converting from Wolfram's internal representation to standard
+ASCII is the following:
+
+- If a character has a direct Unicode equivalent and all of the characters of
+  its Unicode equivalent are valid ASCII characters then the character is
+  replaced by its Unicode equivalent.
+- If a character doesn't have a Unicode equivalent or any of the characters of
+  its Unicode equivalent isn't a valid ASCII character then the character is
+  replaced by its fully qualified name.
+
+The ``replace_unicode_with_wl`` function converts text from standard Unicode to
+Wolfram's internal representation. The algorithm for converting from standard
+Unicode to Wolfram's internal representation is the following:
+
+- If a Unicode character sequence happens to match the ``unicode-equivalent``
+  of a Wolfram Language named character whose ``has-unicode-inverse`` field is
+  set to ``true``, then the Unicode character sequence is replaced by Wolfram's
+  internal representation of that named character. Note that the YAML table is
+  maintained in such a way that there is always *at most* one character that
+  fits such description.
+- Otherwise the character is left unchanged. Note that fully qualified names
+  (such as the Python string ``"\\[Alpha]"`` or the Python string ``"Alpha"``) are *not* replaced at all.
+
+Optimizations
+-------------
+
+Because of the large size of the YAML table and the relative complexity of the
+conversion scheme, it was decided to store precompiled conversion tables in a
+file and read them from disk at runtime (when the module is imported). Our
+tests showed that storing the tables as JSON and using `ujson
+<https://github.com/ultrajson/ultrajson>`_ to read them is the most efficient
+way to access them. However, this is merely an implementation detail and
+consumers of this library should not rely on this assumption.
+
+The conversion tables are stored in the ``data/characters.json`` file,
+alongside other complementary information used internally by the library.
+``data/characters.json`` holds three conversion tables:
+
+- The ``wl-to-unicode`` table, which stores the precompiled results of the
+  Wolfram-to-Unicode conversion algorithm. ``wl-to-unicode`` is used for lookup
+  when ``replace_wl_with_plain_text`` is called with the ``use_unicode``
+  argument set to ``True``.
+- The ``wl-to-ascii`` table, which stores the precompiled results of the
+  Wolfram-to-ASCII conversion algorithm. ``wl-to-ascii`` is used for lookup
+  when ``replace_wl_with_plain_text`` is called with the ``use_unicode``
+  argument set to ``False``.
+- The ``unicode-to-wl`` table, which stores the precompiled results of the
+  Unicode-to-Wolfram conversion algorithm. ``unicode-to-wl`` is used for lookup
+  when ``replace_unicode_with_wl`` is called.
+
+The precompiled translation tables, as well as the rest of the data stored in
+``data/characters.json``, are generated from the YAML table with the
+``mathics_scanner.generate.build_tables.compile_tables`` function.
+
+Note that multiple entries in the YAML table are redundant in the following
+sense: when a character has a Unicode equivalent but the Unicode
+equivalent is the same as its Wolfram internal representation (i.e. the
+``wl-unicode`` field is the same as the ``unicode-equivalent`` field in the
+YAML table) then it is considered redundant for us, since no conversion is
+needed.
+
+As an optimization, we explicitly remove any redundant characters from *all*
+precompiled conversion tables. Such optimization makes the tables smaller and
+easier to load. This implies that not all named characters that have a Unicode
+equivalent are included in the precompiled translation tables (the ones that
+are not included are the ones where no conversion is needed).
+
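
Note on the conversion scheme added in ``implementation.rst``: the rules above can be sketched as a small table compiler. This is only an illustration under stated assumptions, not the code in ``mathics_scanner.generate.build_tables``. The function name ``compile_tables_sketch``, the entry ``SketchAsciiChar``, the code point used for ``\[FormalAlpha]`` and the ``has-unicode-inverse`` values are hypothetical; only ``U+F768`` for ``\[AliasIndicator]`` and ``U+03B1 U+0323`` as the equivalent of ``\[FormalAlpha]`` come from the text.

```python
# Illustrative sketch only; a hypothetical miniature of named-characters.yml.
FORMAL_ALPHA_WL = "\uf854"  # assumed placeholder for the internal code point

ENTRIES = {
    # No Unicode equivalent: always falls back to the fully qualified name.
    "AliasIndicator": {"wl-unicode": "\uf768"},
    # Multi-code-point, non-ASCII equivalent (inverse flag is an assumption).
    "FormalAlpha": {
        "wl-unicode": FORMAL_ALPHA_WL,
        "unicode-equivalent": "\u03b1\u0323",
        "has-unicode-inverse": True,
    },
    # Redundant entry: internal form equals the Unicode equivalent.
    "Alpha": {
        "wl-unicode": "\u03b1",
        "unicode-equivalent": "\u03b1",
        "has-unicode-inverse": True,
    },
    # Hypothetical entry whose equivalent is plain ASCII.
    "SketchAsciiChar": {"wl-unicode": "\uf7ff", "unicode-equivalent": "!"},
}


def is_ascii(text):
    """Return True when every character of text is an ASCII character."""
    return all(ord(ch) < 128 for ch in text)


def compile_tables_sketch(entries):
    """Build the three lookup tables described in implementation.rst."""
    wl_to_unicode, wl_to_ascii, unicode_to_wl = {}, {}, {}
    for name, entry in entries.items():
        wl = entry["wl-unicode"]
        uni = entry.get("unicode-equivalent")
        qualified = f"\\[{name}]"
        if uni == wl:
            continue  # redundant: no conversion needed, dropped from all tables
        # WL -> Unicode: the equivalent if there is one, else the qualified name.
        wl_to_unicode[wl] = uni if uni is not None else qualified
        # WL -> ASCII: the equivalent only if it is entirely ASCII.
        wl_to_ascii[wl] = uni if uni is not None and is_ascii(uni) else qualified
        # Unicode -> WL: only for entries flagged as having an inverse.
        if uni is not None and entry.get("has-unicode-inverse", False):
            unicode_to_wl[uni] = wl
    return wl_to_unicode, wl_to_ascii, unicode_to_wl


wl_to_unicode, wl_to_ascii, unicode_to_wl = compile_tables_sketch(ENTRIES)
```

In this sketch the redundant ``Alpha`` entry never reaches any table, which mirrors the optimization described under "Optimizations".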

mathics_scanner/characters.py

Lines changed: 32 additions & 9 deletions
@@ -27,15 +27,15 @@
 _letterlikes = _data["letterlikes"]
 
 # Conversion from WL to the fully qualified names
-wl_to_ascii = _data["wl-to-ascii-dict"]
+_wl_to_ascii = _data["wl-to-ascii-dict"]
 _wl_to_ascii_re = re.compile(_data["wl-to-ascii-re"])
 
 # Conversion from WL to unicode
-wl_to_unicode = _data["wl-to-unicode-dict"]
+_wl_to_unicode = _data["wl-to-unicode-dict"]
 _wl_to_unicode_re = re.compile(_data["wl-to-unicode-re"])
 
 # Conversion from unicode to WL
-unicode_to_wl = _data["unicode-to-wl-dict"]
+_unicode_to_wl = _data["unicode-to-wl-dict"]
 _unicode_to_wl_re = re.compile(_data["unicode-to-wl-re"])
 
 # All supported named characters
@@ -46,20 +46,43 @@
 
 def replace_wl_with_plain_text(wl_input: str, use_unicode=True) -> str:
     """
-    WL uses some non-unicode character for various things.
-    Replace them with the unicode equivalent.
+    The Wolfram Language uses specific Unicode characters to represent Wolfram
+    Language named characters. This function replaces all occurrences of such
+    characters with their corresponding Unicode/ASCII equivalents.
+
+    @param: wl_input The string whose characters will be replaced.
+    @param: use_unicode A flag that indicates whether to use Unicode or ASCII
+            for the conversion.
+
+    Note that the occurrences of named characters in ``wl_input`` are expected to
+    be represented by Wolfram's internal scheme. For more information on Wolfram's
+    representation scheme and on our own conversion scheme please see `Listing
+    of Named Characters
+    <https://reference.wolfram.com/language/guide/ListingOfNamedCharacters.html>`_
+    and ``implementation.rst`` respectively.
     """
     r = _wl_to_unicode_re if use_unicode else _wl_to_ascii_re
-    d = wl_to_unicode if use_unicode else wl_to_ascii
+    d = _wl_to_unicode if use_unicode else _wl_to_ascii
 
     return r.sub(lambda m: d[m.group(0)], wl_input)
 
 def replace_unicode_with_wl(unicode_input: str) -> str:
     """
-    WL uses some non-unicode character for various things.
-    Replace their unicode equivalent with them.
+    The Wolfram Language uses specific Unicode characters to represent Wolfram
+    Language named characters. This function replaces all occurrences of the
+    corresponding Unicode equivalents of such characters with the characters
+    themselves.
+
+    @param: unicode_input The string whose characters will be replaced.
+
+    Note that the occurrences of named characters in the output of
+    ``replace_unicode_with_wl`` are represented using Wolfram's internal
+    scheme. For more information on Wolfram's representation scheme and on our own
+    conversion scheme please see `Listing of Named Characters
+    <https://reference.wolfram.com/language/guide/ListingOfNamedCharacters.html>`_
+    and ``implementation.rst`` respectively.
     """
     return _unicode_to_wl_re.sub(
-        lambda m: unicode_to_wl[m.group(0)], unicode_input
+        lambda m: _unicode_to_wl[m.group(0)], unicode_input
     )
 
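
Note on the lookup pattern in ``characters.py``: both functions pair a compiled regular expression with a dictionary and let ``re.sub`` look up each match. A minimal, self-contained sketch of that pattern follows. The ``demo_`` names and the three tiny tables are hypothetical stand-ins for the data loaded from ``data/characters.json``, the code point for ``\[FormalAlpha]`` is an assumed placeholder, and building the pattern from the dictionary keys is an assumption made here (the library reads its patterns from the JSON file instead).

```python
import re

# Hypothetical stand-ins for the tables read from data/characters.json.
_DEMO_WL_TO_UNICODE = {"\uf768": "\\[AliasIndicator]", "\uf854": "\u03b1\u0323"}
_DEMO_WL_TO_ASCII = {"\uf768": "\\[AliasIndicator]", "\uf854": "\\[FormalAlpha]"}
_DEMO_UNICODE_TO_WL = {"\u03b1\u0323": "\uf854"}


def _alternation(table):
    """Compile a pattern matching any key of table, longest keys first."""
    keys = sorted(table, key=len, reverse=True)
    return re.compile("|".join(re.escape(key) for key in keys))


_DEMO_WL_TO_UNICODE_RE = _alternation(_DEMO_WL_TO_UNICODE)
_DEMO_WL_TO_ASCII_RE = _alternation(_DEMO_WL_TO_ASCII)
_DEMO_UNICODE_TO_WL_RE = _alternation(_DEMO_UNICODE_TO_WL)


def demo_replace_wl_with_plain_text(wl_input, use_unicode=True):
    """Replace internal code points with Unicode or ASCII text."""
    pattern = _DEMO_WL_TO_UNICODE_RE if use_unicode else _DEMO_WL_TO_ASCII_RE
    table = _DEMO_WL_TO_UNICODE if use_unicode else _DEMO_WL_TO_ASCII
    return pattern.sub(lambda m: table[m.group(0)], wl_input)


def demo_replace_unicode_with_wl(unicode_input):
    """Replace known Unicode sequences with their internal code points."""
    return _DEMO_UNICODE_TO_WL_RE.sub(
        lambda m: _DEMO_UNICODE_TO_WL[m.group(0)], unicode_input
    )
```

Because only dictionary keys can match, text that contains no table key (including an already qualified name such as ``"\\[Alpha]"``) passes through unchanged.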

mathics_scanner/data/README.rst

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+Files in this directory contain data for conversion between WL characters, names,
+Unicode symbols and things of this ilk.
+
+For the most part this is derived from named-characters.yml.
