Documentation: update pseudo-grammar description to better reflect actual implementation

The documentation [grammar webpage](https://docs.unidata.ucar.edu/udunits/current/udunits2lib.html#Grammar) (you have to scroll down a bit) is a key resource for understanding how UDUNITS2 works. However, the pseudo-grammar does not fully reflect what is actually implemented.

With the help of two different LLMs (ChatGPT5 and Claude Sonnet 4.5), and quite a bit of back and forth there is now an updated version below. While I am not an expert on these matters (far from!) I still believe that this version comes closer to the actual implementation. Of course, more critical eyes on this would be most welcome.
 
*UPDATED 2025-11-08 to correctly capture how \<packed_date\> and \<packed_time\>  are parsed. Other minor fixes.*

-----------------------------------------------------------------------------
 
Here is the unit-syntax understood by the UDUNITS-2 package. Words printed \_Thusly\_ indicate non-terminals; words printed THUSLY indicate terminals; and words printed \<thusly\> indicate lexical elements.

// Main Grammar Structure

       _Unit-Spec_: one of
                nothing
                _Shift-Spec_

       _Shift-Spec_: one of
                _Product-Spec_
                _Product-Spec_ SHIFT REAL
                _Product-Spec_ SHIFT INT
                _Product-Spec_ SHIFT _Timestamp_

        _Product-Spec_: one of
                _Power-Spec_
                _Product-Spec_ _Power-Spec_
                _Product-Spec_ MULTIPLY _Power-Spec_
                _Product-Spec_ DIVIDE _Power-Spec_

        _Power-Spec_: one of
                _Basic-Spec_
                _Basic-Spec_ INT
                _Basic-Spec_ EXPONENT

        _Basic-Spec_: one of
                ID
                "(" _Shift-Spec_ ")"
                LOGREF _Product-Spec_ ")"      // token includes "<log>(" and base marker; parser provides the closing ")"
                _Number_

// Lexical Building Blocks

        <space>: one of
                U+0020      // space
                U+0009      // horizontal tab
                U+000D      // carriage return
                U+000C      // form feed
                U+000B      // vertical tab
                // Note: newline (U+000A) is NOT included
                // Note: non-breaking space (U+00A0) is NOT included

        <digit>:
                [0-9]

        <alpha>:
                [A-Za-z_]
                <iso-8859-1-alpha>

        <iso-8859-1-alpha>: one of
                U+00A0          // non-breaking space (sic!) allowed in identifiers
                U+00AD          // soft hyphen
                U+00B0          // degree sign
                U+00B5          // micro sign
                U+00C0-U+00D6   // Latin capital letters with diacritics (À-Ö, excluding U+00D7 multiplication sign)
                U+00D8-U+00F6   // Latin capital and small letters with diacritics (Ø-ö, excluding U+00F7 division sign)
                U+00F8-U+00FF   // Latin small letters with diacritics (ø-ÿ)

        <alphanum>: one of
                <alpha>
                <digit>

// Numbers

        _Number_: one of
                INT
                REAL

        INT:
                [+-]? <digit>+                    // optional sign followed by one or more digits

        REAL:
                [+-]? <real-value>

        <real-value>: one of
                <digit>+ <exponent>               // e.g., 5e10, 3E-2
                <mantissa> <exponent>?            // e.g., 3.14, .5, 2., 3.14e-10

        <mantissa>: one of
                <digit>+ "."                      // e.g., 2.
                "." <digit>+                      // e.g., .5
                <digit>+ "." <digit>+             // e.g., 3.14

        <exponent>:
                ("e" | "E") [+-]? <digit>+

        // Note: NaN and Inf[inity] are explicitly excluded (cf. PR #136)

// Identifiers

        ID: one of
                <id>
                "%"
                "'"         // single quote
                "\""        // double quote
                U+00B0      // degree sign
                U+00B5      // micro sign

        <id>: one of
                <alpha>
                <alpha> <alphanum>* <alpha>
                // Multi-character identifiers must end with <alpha>, not <digit>

// Operators

        SHIFT:
                <space>* <shift-op> <space>*

        <shift-op>: one of
                "@"
                "after"
                "from"
                "since"
                "ref"

        MULTIPLY: one of
                "-"         // accepted as multiply operator in running text as per the SI Brochure
                "."
                "*"
                <space>+
                U+00B7      // centered middot

        DIVIDE:
                <space>* <divide-op> <space>*

        <divide-op>: one of
                " per "      // surrounding space required, cf. PR #135
                " PER "      // surrounding space required, cf. PR #135
                "/"

        EXPONENT: one of
                <raise-with-int>
                <utf8-exponent>

        <raise-with-int>:
                ("^" | "**") INT

        <utf8-exponent>:
                <utf8-exp-sign>? <utf8-exp-digit>+

        <utf8-exp-sign>: one of
                U+207A      // superscript plus sign
                U+207B      // superscript minus sign

        <utf8-exp-digit>: one of
                U+2070      // superscript zero
                U+00B9      // superscript one
                U+00B2      // superscript two
                U+00B3      // superscript three
                U+2074      // superscript four
                U+2075      // superscript five
                U+2076      // superscript six
                U+2077      // superscript seven
                U+2078      // superscript eight
                U+2079      // superscript nine

// Date/Time Elements

        _Timestamp_: one of
                DATE
                DATE CLOCK
                DATE CLOCK CLOCK    // timezone offset as a second CLOCK, which cannot have <second>
                DATE CLOCK ID       // timezone ID (UTC/GMT/Z)
                DATE ID             // date-only with timezone ID (only Z)

        DATE: one of
                <date-broken> ("T" | <space>*)
                <date-packed> ("T" | <space>*)
                // NOTE: The trailing "T" or <space>* is included in the DATE token and serves as a separator when followed by CLOCK
                // NOTE: Packed dates only return DATE token when appearing after time-related units; otherwise parsed as REAL

        <date-broken>:
                <year> "-" <month> ("-" <day>)?  // allow truncation from the right

        <date-packed>:
                <year> (<month> <day>)?
                // Allow truncation from the right:
                // Length-based interpretation for <date-packed> (not counting a sign):
                //   len = 1-4  → Y, YY, YYY, YYYY
                //   len = 5    → YYYYM
                //   len = 6    → YYYYMM
                //   len = 7    → YYYYMMD
                //   len = 8    → YYYYMMDD
                //   decimals are accepted but then <date-packed> is interpreted as a REAL and not as a DATE

        CLOCK: one of
                <clock-broken> <space>*
                <clock-packed> <space>*
                // NOTE: The trailing <space>* is included in the CLOCK token

        <clock-broken>:
                <hour> ":" <minute> (":" <second>)?  // allow truncation from the right

        <clock-packed>:
                <hour> (<minute> (<second>)?)?
                // Allow truncation from the right:
                // Length-based interpretation for <clock-packed>.
                // Length is measured on the INTEGER part (before any decimal point).
                //   len = 1  → H       → 0H0000   decimals are not allowed
                //   len = 2  → HH      → HH0000   decimals are not allowed
                //   len = 3  → HHM     → HH0M00   decimals accepted: HH0M00.dddd...
                //   len = 4  → HHMM    → HHMM00   decimals accepted: HHMM00.dddd...
                //   len = 5  → HHMMS   → HHMM0S   decimals accepted: HHMM0S.dddd...
                //   len = 6  → HHMMSS  → HHMMSS   decimals accepted: HHMMSS.dddd...

        <year>:
                [+-]? [0-9]{1,4}    // year 0 is silently interpreted as year 1

        <month>:
                "0"?[1-9] | "1"[0-2]

        <day>:
                "0"?[1-9] | [1-2][0-9] | "30" | "31"

        <hour>:
                [+-]?[0-1]?[0-9] | 2[0-3]
                // In principle sign may appear in both time-of-day and timezone offsets. However,
                // sign only applies to 0-19, not 20-23. This is correct: time-of-day (0-23) should
                // be unsigned whereas timezone offsets that are in the range (±0-14) should have signs.

        <minute>:
                [0-5]?[0-9]

        <second>:
                (<minute> | "60") (\. <digit>*)?
                // leap second (= 60) is silently interpreted as the same as 00 of the immediately succeeding minute)

// Logarithmic Elements

        LOGREF:
                <log> <space>* "(" <space>* <re> ":"? <space>*   // NOTE: LOGREF token spans through the base marker (e.g., "lg(re" or "ln(re:"), but not the closing ")"

        <log>: one of
                "log"
                "lg"
                "ln"
                "lb"

        <re>: one of
                "re"
                "RE"


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation: update pseudo-grammar description to better reflect actual implementation #137

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Documentation: update pseudo-grammar description to better reflect actual implementation #137

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions