|
| 1 | +# Digit Separators |
| 2 | + |
| 3 | +Author: Lasse Nielsen, Sam Rawlins |
| 4 | + |
| 5 | +Status: In-progress |
| 6 | + |
| 7 | +Version 1.0 |
| 8 | + |
| 9 | +## Motivation |
| 10 | + |
| 11 | +To make long number literals more readable, allow authors to inject [digit |
| 12 | +group separators][] inside numbers. Examples with different possible separators: |
| 13 | + |
| 14 | +```none |
| 15 | +100 000 000 000 000 000 000 // space |
| 16 | +100,000,000,000,000,000,000 // comma |
| 17 | +100.000.000.000.000.000.000 // period |
| 18 | +100'000'000'000'000'000'000 // apostrophe (C++) |
| 19 | +100_000_000_000_000_000_000 // underscore (many programming languages). |
| 20 | +``` |
| 21 | + |
| 22 | +## Proposal |
| 23 | + |
| 24 | +### Digit separators in number literals |
| 25 | + |
| 26 | +Allow one or more `_`s between any two otherwise adjacent _digits_ of a NUMBER |
| 27 | +or HEX\_NUMBER token. The following are not digits: The leading `0x` or `0X` in |
| 28 | +HEX\_NUMBER, and any `.`, `e`, `E`, `+` or `-` in NUMBER. |
| 29 | + |
| 30 | +That means only allowing `_`s between two `0-9` digits in NUMBER and between |
| 31 | +two `0-9`,`a-f`,`A-F` digits in HEX\_NUMBER. |
| 32 | + |
| 33 | +The grammar changes `<DIGIT>+` to `<DIGITS>`, which is a sequence of `<DIGIT>`s |
| 34 | +with optional `_`s between them, and likewise for hex digits: |
| 35 | + |
| 36 | +```bnf |
| 37 | +<NUMBER> ::= <DIGITS> (`.' <DIGITS>)? <EXPONENT>? |
| 38 | + \alt `.' <DIGITS> <EXPONENT>? |
| 39 | + |
| 40 | +<EXPONENT> ::= (`e' | `E') (`+' | `-')? <DIGITS> |
| 41 | + |
| 42 | +<DIGITS> ::= <DIGIT> (`_'* <DIGIT>)* |
| 43 | + |
| 44 | +<HEX\_NUMBER> ::= `0x' <HEX\_DIGITS> |
| 45 | + \alt `0X' <HEX\_DIGITS> |
| 46 | + |
| 47 | +<HEX\_DIGIT> ::= `a' .. `f' |
| 48 | + \alt `A' .. `F' |
| 49 | + \alt <DIGIT> |
| 50 | + |
| 51 | +<HEX\_DIGITS> ::= <HEX\_DIGIT> (`_'* <HEX\_DIGIT>)* |
| 52 | +``` |
| 53 | + |
| 54 | +### Examples |
| 55 | + |
| 56 | +```none |
| 57 | +100__000_000__000_000__000_000 // one hundred million million millions! |
| 58 | +0x4000_0000_0000_0000 |
| 59 | +0.000_000_000_01 |
| 60 | +0x00_14_22_01_23_45 // MAC address |
| 61 | +555_123_4567 // US Phone number |
| 62 | +``` |
| 63 | + |
| 64 | +**Invalid** literals: |
| 65 | + |
| 66 | +```none |
| 67 | +100_ |
| 68 | +0x_00_14_22_01_23_45 |
| 69 | +0._000_000_000_1 |
| 70 | +100_.1 |
| 71 | +1.2e_3 |
| 72 | +``` |
| 73 | + |
| 74 | +An identifier like `_100` is a valid identifier, and `_100._100` is a valid |
| 75 | +member access. If users learn the "separator only between digits" rule quickly, |
| 76 | +this will likely not be an issue. |
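
The grammar and the valid/invalid examples above can be checked mechanically. The following is a minimal sketch, not part of the proposal: it transcribes the `<NUMBER>` and `<HEX_NUMBER>` productions into regular expressions. The names `digits`, `hexDigits`, `exponent`, and `isValidNumberLiteral` are hypothetical and chosen only for illustration.

```dart
// Sketch only: the proposed grammar written as regular expressions.
// All names here are illustrative assumptions, not part of the proposal.

// <DIGITS> ::= <DIGIT> (`_'* <DIGIT>)*
const digits = r'[0-9](?:_*[0-9])*';

// <HEX_DIGITS> ::= <HEX_DIGIT> (`_'* <HEX_DIGIT>)*
const hexDigits = r'[0-9a-fA-F](?:_*[0-9a-fA-F])*';

// <EXPONENT> ::= (`e' | `E') (`+' | `-')? <DIGITS>
const exponent = '[eE][+-]?$digits';

// <NUMBER>: both alternatives, anchored to the whole input.
final numberPattern = RegExp(
  '^(?:${digits}(?:\\.${digits})?(?:${exponent})?'
  '|\\.${digits}(?:${exponent})?)\$',
);

// <HEX_NUMBER>: `0x' or `0X' followed by <HEX_DIGITS>.
final hexNumberPattern = RegExp('^0[xX]${hexDigits}\$');

/// Returns true if [source] is a single NUMBER or HEX_NUMBER token
/// under the grammar sketched above.
bool isValidNumberLiteral(String source) =>
    numberPattern.hasMatch(source) || hexNumberPattern.hasMatch(source);

void main() {
  const valid = ['100__000_000', '0x4000_0000', '0.000_000_000_01'];
  const invalid = ['100_', '0x_00_14', '0._000_1', '100_.1', '1.2e_3'];
  for (final literal in [...valid, ...invalid]) {
    print('$literal: ${isValidNumberLiteral(literal)}');
  }
}
```

Because every `_` must sit between two digits, each invalid literal fails at the point where an underscore touches a prefix, a decimal point, an exponent marker, or the end of the token.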
| 77 | + |
| 78 | +### Why choose underscores |
| 79 | + |
| 80 | +The syntax must work even with just a single separator, so it can't be anything |
| 81 | +that can already validly separate two expressions (excludes all infix operators |
| 82 | +and comma) and should not already be part of a number literal (excludes decimal |
| 83 | +point). |
| 84 | + |
| 85 | +So, the comma and decimal point are probably never going to work, even if they |
| 86 | +are already the standard "thousands separator" in text in different parts of |
| 87 | +the world. |
| 88 | + |
| 89 | +Space separation is dangerous because it's hard to see whether it's just space, |
| 90 | +or it's an accidental tab character. If we allow spacing, should we allow |
| 91 | +arbitrary whitespace, including line terminators? If so, then this suddenly |
| 92 | +becomes quite dangerous. Forget a comma at the end of a line in a multiline |
| 93 | +list, and two adjacent integers are automatically combined (we already have |
| 94 | +that problem with strings). So, probably not a good choice, even if it is the |
| 95 | +preferred formatting for printed text. |
| 96 | + |
| 97 | +The apostrophe is also the string single-quote character. We don't currently |
| 98 | +allow adjacent numbers and strings, but if we ever do, then this syntax becomes |
| 99 | +ambiguous. It's still possible (we disambiguate by assuming it's a digit |
| 100 | +separator). It is currently used by C++14 as a digit group separator, so it is |
| 101 | +definitely possible. |
| 102 | + |
| 103 | +That leaves underscore, which could be the start of an identifier. Currently |
| 104 | +`100_000` would be tokenized as "integer literal 100" followed by "identifier |
| 105 | +`_000`". However, users would never write an identifier adjacent to another |
| 106 | +token that contains identifier-valid characters (unlike strings, which have |
| 107 | +clear delimiters that do not occur anywhere else), so this is unlikely to happen |
| 108 | +in practice. Underscore is already used by a large number of programming |
| 109 | +languages including Java, Swift, and Python. |
| 110 | + |
| 111 | +We also want to allow multiple separators for higher-level grouping, e.g.: |
| 112 | + |
| 113 | +```none |
| 114 | +100__000_000_000__000_000_000 |
| 115 | +``` |
| 116 | + |
| 117 | +For this purpose, the underscore extends gracefully. So does space, but it has |
| 118 | +the disadvantage that it collapses when inserted into HTML, whereas `''` looks odd. |
| 119 | + |
| 120 | +### Related work |
| 121 | + |
| 122 | +* [Java digit separators](https://docs.oracle.com/javase/8/docs/technotes/guides/language/underscores-literals.html) |
| 123 | +* [Python PEP 515 - underscores in numeric literals](https://peps.python.org/pep-0515/) |
| 124 | + |
| 125 | +### Possible new lint rules |
| 126 | + |
| 127 | +There are some possible new lint rule considerations, but none of these are |
| 128 | +considered vital to the usability or general success of the feature. |
| 129 | + |
| 130 | +The feature is designed to help the readability of long numbers. But a |
| 131 | +developer can still make a mistake about where to place separators. For example: |
| 132 | + |
| 133 | +```dart |
| 134 | +var one = 1_000_000; |
| 135 | +var two = 2_000_000; |
| 136 | +var three = 3_000_000; |
| 137 | +var four = 4_0000_000; // Whoops! |
| 138 | +``` |
| 139 | + |
| 140 | +If a developer uses the Dart formatter to format their code, they cannot try to |
| 141 | +vertically align the numbers with whitespace (extra space characters are |
| 142 | +removed by the formatter). So we could offer a lint rule to only place |
| 143 | +separators every three digits of a decimal number. Also possibly a similar rule |
| 144 | +for hexadecimal numbers. If a developer ever uses digit separators for a |
| 145 | +different purpose (as in separating the digits of a phone number), the rule may |
| 146 | +not prove useful. |
| 147 | + |
| 148 | +A separate lint rule could encourage _consistent_ digit separators, which |
| 149 | +triggers if the digit groups do not have the same size (except the most |
| 150 | +significant one, which can be shorter). If there are any `__` separators, the |
| 151 | +number of `_`-separated groups between them should also be the same, and |
| 152 | +repeatedly for higher numbers of `_`s. |
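
As a rough illustration of the two rules described above, the check could look like the sketch below. It is only a sketch under stated assumptions: the helper name `hasConsistentGroups` and its `groupSize` parameter are hypothetical, the input is assumed to be a decimal integer literal with no prefix, fraction, or exponent, and higher-level `__` grouping is out of scope.

```dart
/// Hypothetical lint helper (not part of the proposal).
///
/// Assumes [literal] is a decimal integer literal that uses only single
/// `_` separators. Returns true when every group after the first has the
/// same length and the most significant group is not longer than that.
/// Pass [groupSize] (for example 3) to require a specific group length.
bool hasConsistentGroups(String literal, {int? groupSize}) {
  final groups = literal.split('_');
  if (groups.length < 2) return true; // No separators, nothing to check.

  // Without an explicit size, use the least significant group as reference.
  final size = groupSize ?? groups.last.length;

  // The most significant group may be shorter, but not empty or longer.
  if (groups.first.isEmpty || groups.first.length > size) return false;

  return groups.skip(1).every((group) => group.length == size);
}

void main() {
  for (final literal in ['1_000_000', '4_0000_000', '555_123_4567']) {
    print('$literal: ${hasConsistentGroups(literal)}');
  }
}
```

With this sketch, `4_0000_000` is flagged, and so is the phone-number style `555_123_4567`, which matches the caveat that such a rule may not be useful when separators serve a purpose other than digit grouping.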
| 153 | + |
| 154 | +### Possible new quick fixes |
| 155 | + |
| 156 | +There are some possible new automated fix ("quick fix") considerations, but |
| 157 | +none of these are considered vital to the usability or general success of the |
| 158 | +feature. |
| 159 | + |
| 160 | +#### Unexpected underscores |
| 161 | + |
| 162 | +With the digit-separators feature, separators can be added between _digits_ of |
| 163 | +a number literal, but nowhere else. In most error cases, the unexpected |
| 164 | +underscore can be detected as such, and we can offer quick fixes to remove |
| 165 | +unexpected underscores (for example, `100_`, `100_e1.2`, `100._00`). In a few cases, |
| 166 | +the intention is not as straightforward, such as `100._100`, where `_100` can |
| 167 | +be a legal name of an extension member (though the presence of such a private |
| 168 | +extension member can be detected). |
| 169 | + |
| 170 | +#### Unexpected commas |
| 171 | + |
| 172 | +The only legal digit separator that is introduced with this feature is the |
| 173 | +underscore character. If a developer attempts to use another character, for |
| 174 | +example commas, as a separator, we may be able to detect this, and offer a |
| 175 | +quick fix to convert the commas to underscores. |
| 176 | + |
| 177 | +### Non-breaking change |
| 178 | + |
| 179 | +This change is strictly non-breaking. The feature can be thought of as a single |
| 180 | +change from previous Dart syntax: some syntax which was previously illegal |
| 181 | +(producing compile-time errors) becomes legal. |
| 182 | + |
| 183 | +(The feature is still introduced with a [Dart language version][], so that |
| 184 | +packages that start using the feature declare that they require some new lower |
| 185 | +bound of the Dart SDK.) |
| 186 | + |
| 187 | +### Formatting |
| 188 | + |
| 189 | +As any number literal remains a single token, there are no formatting |
| 190 | +considerations. |
| 191 | + |
| 192 | +## Changelog |
| 193 | + |
| 194 | +### 1.0 |
| 195 | + |
| 196 | +- Initial version |
| 197 | + |
| 198 | +[digit group separators]: https://en.wikipedia.org/wiki/Decimal_separator#Digit_grouping |
| 199 | +[Dart language version]: https://github.com/dart-lang/language/blob/main/accepted/2.8/language-versioning/feature-specification.md |