Commit 14af837

Add support for bytes literals (#733)
Fixes #732
1 parent ebfa81b commit 14af837

11 files changed: +204 -22 lines changed

CHANGELOG.md

Lines changed: 15 additions & 14 deletions
@@ -6,22 +6,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]
 ### Added
-* Added rudimentary support for `clojure.stacktrace` with `print-cause-trace` (part of #721).
+* Added rudimentary support for `clojure.stacktrace` with `print-cause-trace` (part of #721)
+* Added support for `bytes` literals using a `#b` prefix (#732)

 ### Fixed
-* Fix issue with `case` evaluating all of its clauses expressions (#699).
-* Fix issue with relative paths dropping their first character on MS-Windows (#703).
-* Fix incompatibility with `(str nil)` returning "nil" (#706).
-* Fix `sort-by` support for maps and boolean comparator fns (#709).
-* Fix `sort` support for maps and boolean comparator fns (#711).
-* Fix `(is (= exp act))` should only evaluate its args once on failure (#712).
-* Fix issue with `with` failing with a traceback error when an exception is thrown (#714).
-* Fix issue with `sort-*` family of functions returning an error on an empty seq (#716).
-* Fix issue with `intern` failing when used (#725).
-* Fix issue with `ns` not being available after `in-ns` on the REPL (#718).
-* Fixed issue with import modules aliasing using ns eval (#719).
-* Fix issue with `ns-resolve` throwing an error on macros (#720).
-* Fix issue with py module `readerwritelock` locks handling (#722).
+* Fix issue with `case` evaluating all of its clauses expressions (#699)
+* Fix issue with relative paths dropping their first character on MS-Windows (#703)
+* Fix incompatibility with `(str nil)` returning "nil" (#706)
+* Fix `sort-by` support for maps and boolean comparator fns (#709)
+* Fix `sort` support for maps and boolean comparator fns (#711)
+* Fix `(is (= exp act))` should only evaluate its args once on failure (#712)
+* Fix issue with `with` failing with a traceback error when an exception is thrown (#714)
+* Fix issue with `sort-*` family of functions returning an error on an empty seq (#716)
+* Fix issue with `intern` failing when used (#725)
+* Fix issue with `ns` not being available after `in-ns` on the REPL (#718)
+* Fixed issue with import modules aliasing using ns eval (#719)
+* Fix issue with `ns-resolve` throwing an error on macros (#720)
+* Fix issue with py module `readerwritelock` locks handling (#722)

 ## [v0.1.0a2]
 ### Added

docs/reader.rst

Lines changed: 32 additions & 0 deletions
@@ -96,6 +96,38 @@ Strings are denoted as a series of characters enclosed by ``"`` quotation marks.
 If a string needs to contain a quotation mark literal, that quotation mark should be escaped as ``\"``.
 Strings may be multi-line by default and only a closing ``"`` will terminate reading a string.
 Strings correspond to the Python ``str`` type.
+String literals are always read with the UTF-8 encoding.
+
+String literals may contain the following escape sequences: ``\\``, ``\a``, ``\b``, ``\f``, ``\n``, ``\r``, ``\t``, ``\v``.
+Their meanings match the equivalent escape sequences supported in `Python string literals <https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals>`_\.
+
+
+Byte Strings
+------------
+
+::
+
+    basilisp.user=> #b ""
+    #b ""
+    basilisp.user=> #b "this is a string"
+    #b "this is a string"
+    basilisp.user=> (type #b "")
+    <class 'bytes'>
+
+Byte strings are denoted as a series of ASCII characters enclosed by ``"`` quotation marks and preceded by a ``#b``.
+If a byte string needs to contain a quotation mark literal, that quotation mark should be escaped as ``\"``.
+Byte strings may be multi-line by default and only a closing ``"`` will terminate reading a byte string.
+Byte strings correspond to the Python ``bytes`` type.
+
+Byte string literals may contain the following escape sequences: ``\\``, ``\a``, ``\b``, ``\f``, ``\n``, ``\r``, ``\t``, ``\v``.
+Byte strings may also include characters using a hex escape code as ``\xhh`` where ``hh`` is a two-digit hexadecimal value.
+Their meanings match the equivalent escape sequences supported in `Python byte string literals <https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals>`_\.
+
+
+.. warning::
+
+   As in Python, byte string literals may not include any characters outside of the ASCII range.
+

 .. _character_literals:

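
The documentation added above defers to Python's ``bytes`` literal semantics for escapes and for the ASCII restriction. As a minimal sketch of those reference semantics, the following uses only plain Python (nothing here is Basilisp API, and the variable names are illustrative):

```python
# Python bytes literals: the reference semantics the new docs point to.
elf_magic = b"\x7f\x45\x4c\x46"

# \xhh escapes produce one byte each; printable ASCII bytes compare equal
# to their character form.
assert elf_magic == b"\x7fELF"
assert repr(elf_magic) == "b'\\x7fELF'"

# Each simple escape listed in the docs maps to exactly one byte.
assert b"\a\b\f\n\r\t\v\\" == bytes([7, 8, 12, 10, 13, 9, 11, 92])

# A non-ASCII character inside a bytes literal is rejected at compile time.
try:
    compile('b"caf\u00e9"', "<example>", "eval")
except SyntaxError:
    rejected = True
else:
    rejected = False

print(rejected)
```

The last check mirrors the warning in the docs: a non-ASCII character in a ``bytes`` literal is a syntax error in Python, and the reader change in this commit aims for the same restriction.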

src/basilisp/core.lpy

Lines changed: 1 addition & 6 deletions
@@ -1321,11 +1321,6 @@
   [x]
   (and (integer? x) (neg? x)))

-(defn ^:inline nil?
-  "Return ``true`` if ``x`` is ``nil``\\, otherwise ``false``\\."
-  [x]
-  (operator/is- x nil))
-
 (defn ^:inline some?
   "Return ``true`` if ``x`` is not ``nil``\\, otherwise ``false`` s."
   [x]
@@ -3603,7 +3598,7 @@
         ;; pairs - pairs of bindings; either symbol/seq pairs or modifier pairs
         gen-iter (fn gen-iter [pairs]
                    (if (seq pairs)
-                     (let [for-iter (gensym "for")
+                     (let [for-iter (gensym "for")
                            seq-arg (gensym "seq")
                            pair (first pairs)
                            binding (first pair)

src/basilisp/lang/compiler/analyzer.py

Lines changed: 3 additions & 0 deletions
@@ -3583,6 +3583,7 @@ def _const_node_type(_: Any) -> ConstType:

 for tp, const_type in {
     bool: ConstType.BOOL,
+    bytes: ConstType.BYTES,
     complex: ConstType.NUMBER,
     datetime: ConstType.INST,
     Decimal: ConstType.DECIMAL,
@@ -3612,6 +3613,7 @@ def _const_node_type(_: Any) -> ConstType:


 @_analyze_form.register(bool)
+@_analyze_form.register(bytes)
 @_analyze_form.register(complex)
 @_analyze_form.register(datetime)
 @_analyze_form.register(Decimal)
@@ -3649,6 +3651,7 @@ def _const_node(form: ReaderForm, ctx: AnalyzerContext) -> Const:
         form,
         (
             bool,
+            bytes,
             complex,
             datetime,
             Decimal,

src/basilisp/lang/compiler/generator.py

Lines changed: 1 addition & 0 deletions
@@ -3275,6 +3275,7 @@ def _const_meta_kwargs_ast(


 @_const_val_to_py_ast.register(bool)
+@_const_val_to_py_ast.register(bytes)
 @_const_val_to_py_ast.register(type(None))
 @_const_val_to_py_ast.register(complex)
 @_const_val_to_py_ast.register(float)

src/basilisp/lang/compiler/nodes.py

Lines changed: 1 addition & 0 deletions
@@ -271,6 +271,7 @@ class ConstType(Enum):
     SET = kw.keyword("set")
     VECTOR = kw.keyword("vector")
     BOOL = kw.keyword("bool")
+    BYTES = kw.keyword("bytes")
     KEYWORD = kw.keyword("keyword")
     SYMBOL = kw.keyword("symbol")
     STRING = kw.keyword("string")

src/basilisp/lang/obj.py

Lines changed: 6 additions & 0 deletions
@@ -202,6 +202,12 @@ def _lrepr_bool(o: bool, **_) -> str:
     return repr(o).lower()


+@lrepr.register(bytes)
+def _lrepr_bytes(o: bytes, **_) -> str:
+    v = repr(o)
+    return f'#b "{v[2:-1]}"'
+
+
 @lrepr.register(type(None))
 def _lrepr_nil(_: None, **__) -> str:
     return "nil"
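
The `_lrepr_bytes` handler above builds the printed form by slicing Python's own `repr` of the value. The sketch below reproduces that idea in isolation, assuming a hypothetical `lrepr_sketch` dispatcher built on `functools.singledispatch` as a stand-in for Basilisp's real `lrepr` (which takes additional keyword arguments not modeled here):

```python
from functools import singledispatch


@singledispatch
def lrepr_sketch(o: object) -> str:
    """Hypothetical stand-in for lrepr: fall back to Python's repr()."""
    return repr(o)


@lrepr_sketch.register(bytes)
def _lrepr_bytes_sketch(o: bytes) -> str:
    # repr(b"abc") is "b'abc'": drop the leading "b" plus opening quote
    # and the trailing quote, then wrap the remainder in the #b "..." form.
    v = repr(o)
    return f'#b "{v[2:-1]}"'


print(lrepr_sketch(b""))                # #b ""
print(lrepr_sketch(b"\x7fELF\x01"))     # #b "\x7fELF\x01"
print(lrepr_sketch(b'say "hi"'))        # #b "say "hi""
```

The last line shows a property of the slicing approach worth keeping in mind: Python's `repr` does not escape double quotes when it chooses single quotes as the delimiter, so in this sketch an embedded `"` is printed unescaped. Whether the real reader and printer round-trip such values is not covered by the tests in this commit.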

src/basilisp/lang/reader.py

Lines changed: 74 additions & 2 deletions
@@ -843,7 +843,7 @@ def _read_num( # noqa: C901 # pylint: disable=too-many-statements


 def _read_str(ctx: ReaderContext, allow_arbitrary_escapes: bool = False) -> str:
-    """Return a string from the input stream.
+    """Return a UTF-8 encoded string from the input stream.

     If allow_arbitrary_escapes is True, do not throw a SyntaxError if an
     unknown escape sequence is encountered."""
@@ -869,6 +869,75 @@ def _read_str(ctx: ReaderContext, allow_arbitrary_escapes: bool = False) -> str:
         s.append(token)


+_BYTES_ESCAPE_CHARS = {
+    '"': b'"',
+    "\\": b"\\",
+    "a": b"\a",
+    "b": b"\b",
+    "f": b"\f",
+    "n": b"\n",
+    "r": b"\r",
+    "t": b"\t",
+    "v": b"\v",
+}
+
+
+def _read_hex_byte(ctx: ReaderContext) -> bytes:
+    """Read a byte with a 2 digit hex code such as `\\xff`."""
+    reader = ctx.reader
+    c1 = reader.next_token()
+    c2 = reader.next_token()
+    try:
+        return bytes([int("".join(["0x", c1, c2]), base=16)])
+    except ValueError as e:
+        raise ctx.syntax_error(
+            f"Invalid byte representation for base 16: 0x{c1}{c2}"
+        ) from e
+
+
+def _read_byte_str(ctx: ReaderContext) -> bytes:
+    """Return a byte string from the input stream.
+
+    Byte strings have the same restrictions and semantics as byte literals in Python.
+    Individual characters must be within the ASCII range or must be valid escape sequences.
+    """
+    reader = ctx.reader
+
+    token = reader.peek()
+    while whitespace_chars.match(token):
+        token = reader.next_token()
+
+    if token != '"':
+        raise ctx.syntax_error(f"Expected '\"'; got '{token}' instead")
+
+    b: List[bytes] = []
+    while True:
+        token = reader.next_token()
+        if token == "":
+            raise ctx.eof_error("Unexpected EOF in byte string")
+        if ord(token) < 1 or ord(token) > 127:
+            # A non-ASCII character is a syntax problem, not an end-of-input problem.
+            raise ctx.syntax_error("Byte strings must contain only ASCII characters")
+        if token == "\\":
+            token = reader.next_token()
+            escape_char = _BYTES_ESCAPE_CHARS.get(token, None)
+            if escape_char:
+                b.append(escape_char)
+                continue
+            elif token == "x":
+                b.append(_read_hex_byte(ctx))
+                continue
+            else:
+                # In Python, invalid escape sequences entered into byte strings are
+                # retained with backslash for debugging purposes, so we do the same.
+                b.append(b"\\")
+                b.append(token.encode("utf-8"))
+                continue
+        if token == '"':
+            reader.next_token()
+            return b"".join(b)
+        b.append(token.encode("utf-8"))
+
+
 @_with_loc
 def _read_sym(ctx: ReaderContext) -> MaybeSymbol:
     """Return a symbol from the input stream.
@@ -1380,7 +1449,7 @@ def _load_record_or_type(
     raise ctx.syntax_error("Records may only be constructed from Vectors and Maps")


-def _read_reader_macro(ctx: ReaderContext) -> LispReaderForm:
+def _read_reader_macro(ctx: ReaderContext) -> LispReaderForm: # noqa: MC0001
     """Return a data structure evaluated as a reader
     macro from the input stream."""
     start = ctx.reader.advance()
@@ -1408,6 +1477,9 @@ def _read_reader_macro(ctx: ReaderContext) -> LispReaderForm:
         return _read_reader_conditional(ctx)
     elif token == "#":
         return _read_numeric_constant(ctx)
+    elif token == "b":
+        ctx.reader.advance()
+        return _read_byte_str(ctx)
     elif ns_name_chars.match(token):
         s = _read_sym(ctx)
         assert isinstance(s, sym.Symbol)
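
To make the control flow of `_read_byte_str` easier to follow outside the reader machinery, here is a simplified, self-contained sketch. It is not the Basilisp implementation: `read_byte_str_sketch` and `ESCAPES` are hypothetical names, the input is a plain `str` holding the text that follows `#b`, errors are plain `ValueError`s instead of reader errors, and hex digits are validated more strictly than in the diff above.

```python
import string
from typing import List

# Same escape table as _BYTES_ESCAPE_CHARS in the diff above.
ESCAPES = {
    '"': b'"', "\\": b"\\", "a": b"\a", "b": b"\b", "f": b"\f",
    "n": b"\n", "r": b"\r", "t": b"\t", "v": b"\v",
}


def read_byte_str_sketch(src: str) -> bytes:
    """Parse the text following ``#b`` (for example ' "ab"') into bytes."""
    i = 0
    # Skip whitespace between the #b prefix and the opening quote.
    while i < len(src) and src[i].isspace():
        i += 1
    if i >= len(src) or src[i] != '"':
        raise ValueError("expected an opening double quote")
    i += 1

    out: List[bytes] = []
    while True:
        if i >= len(src):
            raise ValueError("unexpected EOF in byte string")
        ch = src[i]
        i += 1
        if not 1 <= ord(ch) <= 127:
            raise ValueError("byte strings must contain only ASCII characters")
        if ch == '"':
            # An unescaped closing quote terminates the literal.
            return b"".join(out)
        if ch != "\\":
            out.append(ch.encode("ascii"))
            continue
        esc = src[i:i + 1]  # empty string if the input ends after a backslash
        i += 1
        if esc in ESCAPES:
            out.append(ESCAPES[esc])
        elif esc == "x":
            digits = src[i:i + 2]
            i += 2
            if len(digits) != 2 or any(d not in string.hexdigits for d in digits):
                raise ValueError(f"invalid hex escape: \\x{digits}")
            out.append(bytes([int(digits, 16)]))
        else:
            # Unknown escapes are kept with their backslash, as in Python.
            out.append(b"\\" + esc.encode("ascii"))


print(read_byte_str_sketch(r' "\x7f\x45\x4c\x46"'))  # b'\x7fELF'
```

Working on a whole string keeps the sketch short; the real reader instead pulls one token at a time from `ctx.reader`, which is why the diff advances past the closing quote before returning.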

src/basilisp/lang/typing.py

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@
 LispNumber = Union[int, float, Fraction]
 LispForm = Union[
     bool,
+    bytes,
     complex,
     datetime,
     Decimal,

tests/basilisp/lrepr_test.py

Lines changed: 15 additions & 0 deletions
@@ -173,6 +173,11 @@ def test_print_readably(lcompile: CompileFn):
             '#uuid "81f35603-0408-4b3d-bbc0-462e3702747f"',
             '(pr-str #uuid "81f35603-0408-4b3d-bbc0-462e3702747f")',
         ),
+        ('#b ""', '(pr-str #b "")'),
+        (
+            r'#b "\x7fELF\x01\x01\x01\x00"',
+            r'(pr-str #b "\x7f\x45\x4c\x46\x01\x01\x01\x00")',
+        ),
         ('#"\\s"', '(pr-str #"\\s")'),
         (
             '#inst "2018-11-28T12:43:25.477000+00:00"',
@@ -205,6 +210,11 @@ def test_lrepr(lcompile: CompileFn, repr: str, code: str):
         (-float("inf"), "(read-string (pr-str ##-Inf))"),
         ("hi", '(read-string (pr-str "hi"))'),
         ("Hello\nworld!", '(read-string (pr-str "Hello\nworld!"))'),
+        (b"", '(read-string (pr-str #b ""))'),
+        (
+            b"\x7fELF\x01\x01\x01\x00",
+            r'(read-string (pr-str #b "\x7f\x45\x4c\x46\x01\x01\x01\x00"))',
+        ),
         (
             uuid.UUID("81f35603-0408-4b3d-bbc0-462e3702747f"),
             '(read-string (pr-str #uuid "81f35603-0408-4b3d-bbc0-462e3702747f"))',
@@ -253,6 +263,11 @@ def test_lrepr_round_trip_special_cases(lcompile: CompileFn):
         ("##-Inf", "(print-str ##-Inf)"),
         ("hi", '(print-str "hi")'),
         ("Hello\nworld!", '(print-str "Hello\nworld!")'),
+        ('#b ""', '(print-str #b "")'),
+        (
+            r'#b "\x7fELF\x01\x01\x01\x00"',
+            r'(print-str #b "\x7f\x45\x4c\x46\x01\x01\x01\x00")',
+        ),
         # In Clojure, (print-str #uuid "...") produces '#uuid "..."' but in Basilisp
         # print-str is tied directly to str (which in Clojure simply returns the string
         # part of the UUID).
