Mathics3
diff --git a/.github/workflows/osx.yaml b/.github/workflows/osx.yaml
Lines changed: 1 addition & 0 deletions
diff --git a/.github/workflows/ubuntu.yaml b/.github/workflows/ubuntu.yaml
Lines changed: 1 addition & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/Makefile b/Makefile
Lines changed: 8 additions & 1 deletion
diff --git a/README.rst b/README.rst
Lines changed: 24 additions & 23 deletions
diff --git a/implementation.rst b/implementation.rst
Lines changed: 113 additions & 0 deletions
diff --git a/mathics_scanner/characters.py b/mathics_scanner/characters.py
Lines changed: 32 additions & 9 deletions
diff --git a/mathics_scanner/data/README.rst b/mathics_scanner/data/README.rst
Lines changed: 4 additions & 0 deletions
@@ -28,4 +28,5 @@ jobs:
     - name: Test Mathics Scanner
       run: |
         pip install pytest
+        python -m mathics_scanner.generate.build_tables
         make check
@@ -27,4 +27,5 @@ jobs:
     - name: Test Mathics Scanner
       run: |
         pip install pytest
+        python -m mathics_scanner.generate.build_tables
         make check
@@ -1,3 +1,4 @@
+*~
 *.c
 *.cpp
 *.egg
 
@@ -21,7 +21,7 @@ all: develop
 
 mathics_scanner/data/characters.json: mathics_scanner/data/named-characters.yml
 	$(PIP) install -r requirements-dev.txt
-	$(PYTHON) mathics_scanner/build_tables.py
+	$(PYTHON) mathics_scanner/generate/build_tables.py
 
 #: build everything needed to install
 build: mathics_scanner/data/characters.json
@@ -47,6 +47,13 @@ clean:
 pytest: mathics_scanner/data/characters.json
 	py.test test $o
 
+#: Print to stdout a GNU Readline inputrc without Unicode
+inputrc-no-unicode:
+	$(PYTHON) -m mathics_scanner.generate.rl_inputrc inputrc-no-unicode
+
+#: Print to stdout a GNU Readline inputrc with Unicode
+inputrc-unicode:
+	$(PYTHON) -m mathics_scanner.generate.rl_inputrc inputrc-unicode
 
 #: Remove ChangeLog
 rmChangeLog:
 
@@ -5,44 +5,45 @@ Mathics Scanner
 
 This is the tokeniser or scanner portion for the Wolfram Language.
 
-As such, it also contains a full set of translation between WL Character names, their Unicode names and code points,
-and other character metadata such as whether the character is "letter like".
+As such, it also contains a full set of translations between Wolfram Language
+named characters, their Unicode/ASCII equivalents and code-points.
 
 Uses
-====
+----
 
-This is used as the scanner inside `Mathics <https://mathics.org>`_ but it can also be used for tokenizing and formatting WL code. In fact we intend to write one.
+This is used as the scanner inside `Mathics <https://mathics.org>`_ but it can
+also be used for tokenizing and formatting Wolfram Language code. In fact we
+intend to write one. This library is also quite useful if you need to work
+with Wolfram Language named characters and convert them to various formats.
 
-Implementation
-==============
-
-mathics_scaner.characters
--------------------------
+Usage
+-----
 
-This module consists mostly of translation tables between WL and unicode/ascii. 
-Because of the large size of this tables, it was decided to store them in a
-file and read them from disk at runtime (when the module is imported). Our
-tests showed that storing the tables as JSON and using
-`ujson <https://github.com/ultrajson/ultrajson>`_ to read them is the most
-efficient way to access them. However, this is merelly an implementation
-detail and consumers of this library should not relly on this assumption.
+- For tokenizing and scanning Wolfram Language code, use the
+  ``mathics_scanner.tokenizer.Tokenizer`` class.
+- To convert between Wolfram Language named characters and Unicode/ASCII, use
+  the ``mathics_scanner.characters.replace_wl_with_plain_text`` and
+  ``mathics_scanner.characters.replace_unicode_with_wl`` functions. 
+- To convert between qualified names of named characters (such as ``FormalA``
+  for ``\[FormalA]``) and Wolfram's internal representation, use the
+  ``mathics_scanner.characters.named_characters`` dictionary.
 
-For maintainability and effeciency, we decided to store this data in a
-human-readable YAML file (`data/named-characters.yml`) and compile them into
-the JSON tables used internally by the library (`data/characters.json`) for
-faster access at runtime. The conversion of the data is performed by the
-script `mathics_scanner/build-tables.py`.
+Implementation
+--------------
 
+For notes on the implementation of the packages or details on the conversion
+scheme please read ``implementation.rst``.
 
 Contributing
 ------------
 
-Please feel encouraged to contribute to Mathics! Create your own fork, make the desired changes, commit, and make a pull request.
-
+Please feel encouraged to contribute to this package or Mathics! Create your
+own fork, make the desired changes, commit, and make a pull request.
 
 License
 -------
 
 Mathics is released under the GNU General Public License Version 3 (GPL3).
 
 .. |Workflows| image:: https://github.com/Mathics3/mathics-scanner/workflows/Mathics%20(ubuntu)/badge.svg
+
@@ -0,0 +1,113 @@
+mathics_scanner.characters
+==========================
+
+This module consists mostly of translation tables between Wolfram's internal
+representation and Unicode/ASCII. For maintainability, it was decided to store
+this data in a human-readable YAML table (in ``data/named-characters.yml``).
+
+The YAML table mainly contains information about how to convert a
+named character to Unicode and back. If a given character has a direct Unicode
+equivalent (a Unicode character whose description is similar to the named
+character's), this is specified by the ``unicode-equivalent`` field in the YAML
+table. Note that multiple named characters may share a common
+``unicode-equivalent`` field. Also, if a named character has a Unicode
+equivalent, its ``unicode-equivalent`` field need not consist of a single
+Unicode code-point. For example, the Unicode equivalent of ``\[FormalAlpha]``
+is ``U+03B1 U+0323`` (or ``GREEK SMALL LETTER ALPHA + COMBINING DOT BELOW``).
+
+If a named character has a ``unicode-equivalent`` field whose description fits
+the precise description of the character then its ``has-unicode-inverse``
+field in the YAML table is set to ``true``.
+
+The conversion routines ``replace_wl_with_plain_text`` and
+``replace_unicode_with_wl`` use this information to convert between Wolfram's
+internal format and standard Unicode, but it should be noted that the
+conversion scheme is more complex than a simple lookup in the YAML table. 
+
+The Conversion Scheme
+---------------------
+
+The ``replace_wl_with_plain_text`` function converts text from Wolfram's
+internal representation to standard Unicode *or* ASCII. If set to ``True``, the
+``use_unicode`` argument indicates to ``replace_wl_with_plain_text`` that the
+input should be converted to standard Unicode. If set to ``False``,
+``use_unicode`` indicates to ``replace_wl_with_plain_text`` that it should only
+output standard ASCII.
+
+The algorithm for converting from Wolfram's internal representation to standard
+Unicode is the following:
+
+- If a character has a direct Unicode equivalent then the character is replaced
+  by its Unicode equivalent.
+- If a character doesn't have a Unicode equivalent then the character is
+  replaced by its fully qualified name. For example, the ``\[AliasIndicator]``
+  character (or ``U+F768`` in Wolfram's internal representation) is replaced by
+  the Python string ``"\\[AliasIndicator]"``.
+
+The algorithm for converting from Wolfram's internal representation to standard
+ASCII is the following:
+
+- If a character has a direct Unicode equivalent and all of the characters of
+  its Unicode equivalent are valid ASCII characters then the character is
+  replaced by its Unicode equivalent.
+- If a character doesn't have a Unicode equivalent or any of the characters of
+  its Unicode equivalent isn't a valid ASCII character then the character is
+  replaced by its fully qualified name.
+
+The ``replace_unicode_with_wl`` function converts text from standard Unicode to
+Wolfram's internal representation.  The algorithm for converting from standard
+Unicode to Wolfram's internal representation is the following:
+
+- If a Unicode character sequence happens to match the ``unicode-equivalent``
+  of a Wolfram Language named character whose ``has-unicode-inverse`` field is
+  set to ``true``, then the sequence is replaced by Wolfram's internal
+  representation of that named character. Note that the YAML table is
+  maintained in such a way that there is always *at most* one character that
+  fits such description.
+- Otherwise the character is left unchanged. Note that fully qualified names
+  (such as the Python string ``"\\[Alpha]"`` or the Python string ``"Alpha"``) are *not* replaced at all.
+
+Optimizations
+-------------
+
+Because of the large size of the YAML table and the relative complexity of the
+conversion scheme, it was decided to store precompiled conversion tables in a
+file and read them from disk at runtime (when the module is imported). Our
+tests showed that storing the tables as JSON and using `ujson
+<https://github.com/ultrajson/ultrajson>`_ to read them is the most efficient
+way to access them. However, this is merely an implementation detail and
+consumers of this library should not rely on this assumption.
+
+The conversion tables are stored in the ``data/characters.json`` file,
+alongside other complementary information used internally by the library.
+``data/characters.json`` holds three conversion tables:
+
+- The ``wl-to-unicode`` table, which stores the precompiled results of the
+  Wolfram-to-Unicode conversion algorithm. ``wl-to-unicode`` is used for lookup
+  when ``replace_wl_with_plain_text`` is called with the ``use_unicode``
+  argument set to ``True``.
+- The ``wl-to-ascii`` table, which stores the precompiled results of the
+  Wolfram-to-ASCII conversion algorithm. ``wl-to-ascii`` is used for lookup
+  when ``replace_wl_with_plain_text`` is called with the ``use_unicode``
+  argument set to ``False``.
+- The ``unicode-to-wl`` table, which stores the precompiled results of the
+  Unicode-to-Wolfram conversion algorithm. ``unicode-to-wl`` is used for lookup
+  when ``replace_unicode_with_wl`` is called.
+
+The precompiled translation tables, as well as the rest of the data stored in
+``data/characters.json``, are generated from the YAML table with the
+``mathics_scanner.generate.build_tables.compile_tables`` function.
+
+Note that multiple entries in the YAML table are redundant in the following
+sense: when a character has a Unicode equivalent but the Unicode equivalent
+is the same as its Wolfram internal representation (i.e. the
+``wl-unicode`` field is the same as the ``unicode-equivalent`` field in the
+YAML table) then we consider it redundant, since no conversion is
+needed.
+
+As an optimization, we explicitly remove any redundant characters from *all*
+precompiled conversion tables. This optimization makes the tables smaller and
+easier to load. This implies that not all named characters that have a Unicode
+equivalent are included in the precompiled translation tables (the ones that
+are not included are the ones where no conversion is needed).
+
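The conversion scheme and the redundancy pruning described in ``implementation.rst`` above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ``compile_tables_sketch`` and ``ENTRIES`` are hypothetical names, the entries are hand-written stand-ins rather than rows of the real YAML table, and the actual ``compile_tables`` function may be organised differently. Only the field names, the ``U+F768`` code point and the ``U+03B1 U+0323`` equivalent come from the text above.

```python
# Hand-written stand-ins for YAML entries. Apart from U+F768 (AliasIndicator)
# and U+03B1 U+0323 (FormalAlpha's equivalent), every value is an assumption
# made for illustration.
ENTRIES = {
    "AliasIndicator": {"wl-unicode": "\uf768"},  # no Unicode equivalent
    "FormalAlpha": {
        "wl-unicode": "\uf854",  # assumed code point
        "unicode-equivalent": "\u03b1\u0323",
        "has-unicode-inverse": True,
    },
    "Alpha": {  # redundant: both fields are identical
        "wl-unicode": "\u03b1",
        "unicode-equivalent": "\u03b1",
        "has-unicode-inverse": True,
    },
    "Equal": {  # assumed entry with an ASCII-only equivalent
        "wl-unicode": "\uf431",
        "unicode-equivalent": "==",
        "has-unicode-inverse": False,
    },
}


def compile_tables_sketch(entries):
    """Build the three lookup tables following the rules described above."""
    wl_to_unicode, wl_to_ascii, unicode_to_wl = {}, {}, {}
    for name, entry in entries.items():
        wl = entry["wl-unicode"]
        uni = entry.get("unicode-equivalent")
        if uni == wl:
            # Redundant entry: no conversion needed, dropped from all tables.
            continue
        qualified = f"\\[{name}]"
        # Unicode output: use the equivalent if any, else the qualified name.
        wl_to_unicode[wl] = uni if uni is not None else qualified
        # ASCII output: use the equivalent only if it is pure ASCII.
        wl_to_ascii[wl] = uni if uni is not None and uni.isascii() else qualified
        # Reverse direction: only entries flagged as having a Unicode inverse.
        if uni is not None and entry.get("has-unicode-inverse", False):
            unicode_to_wl[uni] = wl
    return wl_to_unicode, wl_to_ascii, unicode_to_wl


wl_to_unicode, wl_to_ascii, unicode_to_wl = compile_tables_sketch(ENTRIES)
```

With these stand-in entries, ``\[Alpha]`` appears in none of the tables, ``\[FormalAlpha]`` maps to its two-code-point equivalent in the Unicode table but to its qualified name in the ASCII table, and only ``\[FormalAlpha]`` gets a reverse mapping.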
@@ -27,15 +27,15 @@
 _letterlikes = _data["letterlikes"]
 
 # Conversion from WL to the fully qualified names
-wl_to_ascii = _data["wl-to-ascii-dict"]
+_wl_to_ascii = _data["wl-to-ascii-dict"]
 _wl_to_ascii_re = re.compile(_data["wl-to-ascii-re"])
 
 # Conversion from WL to unicode
-wl_to_unicode = _data["wl-to-unicode-dict"]
+_wl_to_unicode = _data["wl-to-unicode-dict"]
 _wl_to_unicode_re = re.compile(_data["wl-to-unicode-re"])
 
 # Conversion from unicode to WL
-unicode_to_wl = _data["unicode-to-wl-dict"]
+_unicode_to_wl = _data["unicode-to-wl-dict"]
 _unicode_to_wl_re = re.compile(_data["unicode-to-wl-re"])
 
 # All supported named characters
@@ -46,20 +46,43 @@
 
 def replace_wl_with_plain_text(wl_input: str, use_unicode=True) -> str:
     """
-    WL uses some non-unicode character for various things.
-    Replace them with the unicode equivalent.
+    The Wolfram Language uses specific Unicode characters to represent Wolfram
+    Language named characters. This function replaces all occurrences of such
+    characters with their corresponding Unicode/ASCII equivalents.
+
+    @param: wl_input    The string whose characters will be replaced. 
+    @param: use_unicode A flag that indicates whether to use Unicode or ASCII 
+                        for the conversion.
+
+    Note that the occurrences of named characters in ``wl_input`` are expected
+    to be represented by Wolfram's internal scheme. For more information on
+    Wolfram's representation scheme and on our own conversion scheme please see
+    `Listing of Named Characters
+    <https://reference.wolfram.com/language/guide/ListingOfNamedCharacters.html>`_
+    and ``implementation.rst`` respectively.
     """
     r = _wl_to_unicode_re if use_unicode else _wl_to_ascii_re
-    d = wl_to_unicode if use_unicode else wl_to_ascii
+    d = _wl_to_unicode if use_unicode else _wl_to_ascii
 
     return r.sub(lambda m: d[m.group(0)], wl_input)
 
 def replace_unicode_with_wl(unicode_input: str) -> str:
     """
-    WL uses some non-unicode character for various things.
-    Replace their unicode equivalent with them.
+    The Wolfram Language uses specific Unicode characters to represent Wolfram
+    Language named characters. This function replaces all occurrences of the
+    corresponding Unicode equivalents of such characters with the characters
+    themselves.
+
+    @param: unicode_input The string whose characters will be replaced. 
+
+    Note that the occurrences of named characters in the output of
+    ``replace_unicode_with_wl`` are represented using Wolfram's internal
+    scheme. For more information on Wolfram's representation scheme and on our
+    own conversion scheme please see `Listing of Named Characters
+    <https://reference.wolfram.com/language/guide/ListingOfNamedCharacters.html>`_
+    and ``implementation.rst`` respectively.
     """
     return _unicode_to_wl_re.sub(
-        lambda m: unicode_to_wl[m.group(0)], unicode_input
+        lambda m: _unicode_to_wl[m.group(0)], unicode_input
     )
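Both functions in this hunk share one pattern: a compiled regular expression finds every convertible substring, and a dictionary supplies its replacement. The sketch below reproduces that pattern on a tiny hand-written table. The names ``build_pattern``, ``replace_with_table`` and ``_demo_table`` are hypothetical, and so is the way the pattern is built here; the library loads its pattern strings precompiled from ``data/characters.json``, and how those strings are generated is not shown in this diff.

```python
import re

# Hypothetical two-entry table standing in for one of the precompiled tables.
_demo_table = {
    "\uf768": "\\[AliasIndicator]",
    "\uf431": "==",  # assumed entry, for illustration only
}


def build_pattern(table):
    """Compile an alternation that matches any key of ``table``."""
    # Longest keys first, so a multi-character key wins over its own prefix.
    keys = sorted(table, key=len, reverse=True)
    return re.compile("|".join(re.escape(key) for key in keys))


_demo_re = build_pattern(_demo_table)


def replace_with_table(text):
    """Replace every match of the pattern by its table entry."""
    return _demo_re.sub(lambda m: _demo_table[m.group(0)], text)


converted = replace_with_table("a \uf431 b\uf768")
```

Because the pattern is built only from the table's keys, every match is guaranteed to have an entry, so the dictionary lookup in the lambda cannot fail.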
 
@@ -0,0 +1,4 @@
+Files in this directory contain data for conversion between WL characters, names,
+Unicode symbols and things of this ilk.
+
+For the most part this is derived from named-characters.yml.