Add more tests for Intl.Locale.prototype.getTextInfo#5023

Open
anba wants to merge 1 commit into tc39:main from anba:locale-text-info

anba (Contributor) commented Apr 13, 2026

Add coverage for:

  • Language subtag has more than three letters.
  • Script metadata's RTL field is UNKNOWN.
  • Don't default to "ltr" when script metadata has no entry for the script.
  • Missing script is added through add-likely-subtags algorithm.
  • Script subtag is present and script's general ordering of characters is known.
  • Script subtag doesn't refer to a valid registered script.

anba requested a review from a team as a code owner April 13, 2026 08:18

anba (Contributor, Author) commented Apr 13, 2026

There are various test failures in JSC and V8, because neither implementation supports returning undefined from TextDirectionOfLocale.

Fixed ICU4C implementation for JSC and V8:

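#include <cstring>   // std::strcmp
#include <iterator>  // std::size

#include "unicode/uchar.h"
#include "unicode/uloc.h"
#include "unicode/uscript.h"

enum class ScriptDirection {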
  Unknown,
  LeftToRight,
  RightToLeft,
};

// Input: Canonicalized locale with alias mappings already replaced.
//
// Preferably |locale| contains only language-script-region subtags,
// so ICU4C doesn't reject overly long locales.
static ScriptDirection GetScriptDirection(const char* locale) {
  UErrorCode status = U_ZERO_ERROR;

  // Get the script subtag.
  char script[ULOC_SCRIPT_CAPACITY] = {};
  int32_t scriptLength = uloc_getScript(locale, script, std::size(script), &status);
  if (U_FAILURE(status)) {
    return ScriptDirection::Unknown;
  }

  // If no script subtag present, add likely subtags.
  if (scriptLength == 0) {
    char maximal[ULOC_FULLNAME_CAPACITY] = {};
    uloc_addLikelySubtags(locale, maximal, std::size(maximal), &status);
    if (U_FAILURE(status)) {
      return ScriptDirection::Unknown;
    }

    scriptLength = uloc_getScript(maximal, script, std::size(script), &status);

    // If the lookup failed or there is still no script subtag, return Unknown.
    if (U_FAILURE(status) || scriptLength == 0) {
      return ScriptDirection::Unknown;
    }
  }

  // Get script code from script.
  UScriptCode scriptCode =
      static_cast<UScriptCode>(u_getPropertyValueEnum(UCHAR_SCRIPT, script));
  if (scriptCode == USCRIPT_INVALID_CODE) {
    return ScriptDirection::Unknown;
  }
  if (const char* shortName = uscript_getShortName(scriptCode)) {
    // Ignore Unicode aliases from PropertyValueAliases.txt, because they don't
    // apply here.
    if (std::strcmp(script, shortName) != 0) {
      return ScriptDirection::Unknown;
    }
  }
  switch (scriptCode) {
    // Marked as UNKNOWN in scriptMetadata.txt.
    //
    // ICU4C doesn't provide a way to query all possible "RTL" field values
    // (YES, NO, UNKNOWN), so the four scripts with UNKNOWN are hard-coded below.
    case USCRIPT_COMMON:     // Zyyy
    case USCRIPT_INHERITED:  // Zinh
    case USCRIPT_UNKNOWN:    // Zzzz
    case USCRIPT_BRAILLE:    // Brai
      return ScriptDirection::Unknown;

    // Not in scriptMetadata.txt.
    //
    // It's up to implementations how to handle these cases: either return
    // UNKNOWN or the correct script direction. But don't return an obviously
    // wrong answer, for example don't return left-to-right for "Aran".
    case USCRIPT_AFAKA:                         // Afak   (LTR)
    case USCRIPT_ARABIC_NASTALIQ:               // Aran   (RTL)
    case USCRIPT_BLISSYMBOLS:                   // Blis   (varies)
    case USCRIPT_CIRTH:                         // Cirt   (varies)
    case USCRIPT_OLD_CHURCH_SLAVONIC_CYRILLIC:  // Cyrs   (LTR)
    case USCRIPT_DEMOTIC_EGYPTIAN:              // Egyd   (mixed)
    case USCRIPT_HIERATIC_EGYPTIAN:             // Egyh   (mixed)
    case USCRIPT_KHUTSURI:                      // Geok   (LTR)
    case USCRIPT_TRADITIONAL_HAN_WITH_LATIN:    // Hntl   (Hant+Latn)
    case USCRIPT_KATAKANA_OR_HIRAGANA:          // Hrkt   (LTR)
    case USCRIPT_HARAPPAN_INDUS:                // Inds   (RTL)
    case USCRIPT_KPELLE:                        // Kpel   (LTR)
    case USCRIPT_LATIN_FRAKTUR:                 // Latf   (LTR)
    case USCRIPT_LATIN_GAELIC:                  // Latg   (LTR)
    case USCRIPT_LOMA:                          // Loma   (LTR)
    case USCRIPT_MAYAN_HIEROGLYPHS:             // Maya   (mixed)
    case USCRIPT_MOON:                          // Moon   (mixed)
    case USCRIPT_NAKHI_GEBA:                    // Nkgb   (LTR)
    case USCRIPT_BOOK_PAHLAVI:                  // Phlv   (mixed)
    case USCRIPT_RONGORONGO:                    // Roro   (mixed)
    case USCRIPT_SARATI:                        // Sara   (mixed)
    case USCRIPT_ESTRANGELO_SYRIAC:             // Syre   (RTL)
    case USCRIPT_WESTERN_SYRIAC:                // Syrj   (RTL)
    case USCRIPT_EASTERN_SYRIAC:                // Syrn   (RTL)
    case USCRIPT_TENGWAR:                       // Teng   (LTR)
    case USCRIPT_VISIBLE_SPEECH:                // Visp   (LTR)
    case USCRIPT_WOLEAI:                        // Wole   (LTR)
    case USCRIPT_MATHEMATICAL_NOTATION:         // Zmth   (UNKNOWN)
    case USCRIPT_SYMBOLS_EMOJI:                 // Zsye   (UNKNOWN)
    case USCRIPT_SYMBOLS:                       // Zsym   (UNKNOWN)
    case USCRIPT_UNWRITTEN_LANGUAGES:           // Zxxx   (UNKNOWN)
      return ScriptDirection::Unknown;

    default:
      break;
  }
  if (uscript_isRightToLeft(scriptCode)) {
    return ScriptDirection::RightToLeft;
  }
  return ScriptDirection::LeftToRight;
}
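
For illustration only, here is a minimal sketch of how a caller might turn the ScriptDirection result into the value returned by TextDirectionOfLocale, including the undefined case that JSC and V8 currently don't handle. The helper name `ToTextDirection` is hypothetical, and `std::nullopt` merely stands in for undefined; a real engine would use its own value type. The enum is repeated so the sketch compiles on its own.

```cpp
#include <optional>
#include <string_view>

// Repeated from the snippet above so this sketch is self-contained.
enum class ScriptDirection {
  Unknown,
  LeftToRight,
  RightToLeft,
};

// Hypothetical helper (not part of ICU4C, JSC, or V8): maps the script
// direction to "ltr"/"rtl". std::nullopt is an assumed stand-in for the
// `undefined` result of TextDirectionOfLocale.
inline std::optional<std::string_view> ToTextDirection(ScriptDirection direction) {
  switch (direction) {
    case ScriptDirection::LeftToRight:
      return std::string_view("ltr");
    case ScriptDirection::RightToLeft:
      return std::string_view("rtl");
    case ScriptDirection::Unknown:
      break;
  }
  // Unknown direction: report "no value" instead of defaulting to "ltr".
  return std::nullopt;
}
```

The point of the sketch is only that Unknown must not fall through to "ltr"; how undefined is surfaced to script is engine-specific.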
