Commit c87c163

Merge pull request #117 from sillsdev/BL-15160_anamolous_languages
BL-15160 anomalous languages (#117)
2 parents 0bcadfc + 8c65005 commit c87c163

7 files changed: +221 -23 lines changed

components/language-chooser/common/find-language/language-data/languageData.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

components/language-chooser/common/find-language/languageSearch.spec.ts

Lines changed: 1 addition & 5 deletions
@@ -246,7 +246,7 @@ describe("Macrolanguage handling", () => {
246246
});
247247

248248
// Make sure that the unusual language entries that don't behave as expected are still preserved in some form
249-
// Langtags.json has anomalies/unique situations for "bnc", "aka", "nor", "hbs", "san", "zap" such that are usual code can't map them to individual languages
249+
// Langtags.json has had anomalies/unique situations for "bnc", "aka", "nor", "hbs", "san", "zap" preventing our usual code from mapping them to individual languages
250250
// How we handle these cases may change, but make sure some result is always available for these
251251
it("should include results for unusual language situations", async () => {
252252
async function asyncExpectToFindResultByExonym(
@@ -380,7 +380,6 @@ describe("getLanguageBySubtag", () => {
380380
expect(getLanguageBySubtag("mg")?.iso639_3_code).toEqual("plt");
381381
expect(getLanguageBySubtag("zh")?.exonym).toEqual("Chinese");
382382
expect(getLanguageBySubtag("za")?.exonym).toEqual("Zhuang");
383-
expect(getLanguageBySubtag("ak")?.iso639_3_code).toEqual("twi");
384383
expect(getLanguageBySubtag("bnc")?.iso639_3_code).toEqual("lbk");
385384
expect(getLanguageBySubtag("no")?.exonym).toEqual("Norwegian");
386385
expect(getLanguageBySubtag("sh")?.iso639_3_code).toEqual("hbs");
@@ -412,9 +411,6 @@ describe("getLanguageBySubtag", () => {
412411
expect(
413412
getLanguageBySubtag("za", defaultSearchResultModifier)?.exonym
414413
).toEqual("Zhuang");
415-
expect(
416-
getLanguageBySubtag("ak", defaultSearchResultModifier)?.iso639_3_code
417-
).toEqual("twi");
418414
expect(
419415
getLanguageBySubtag("bnc", defaultSearchResultModifier)?.iso639_3_code
420416
).toEqual("lbk");

components/language-chooser/common/find-language/macrolanguageNotes.md

Lines changed: 32 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@ See also
66

77
<!-- - https://issues.bloomlibrary.org/youtrack/issue/BL-12657/Issues-with-macrolanguage-codes-in-the-language-picker -->
88

9+
- https://writingsystems.info/topics/writingsystems/language-tagging/#macrolanguages
10+
911
- https://iso639-3.sil.org/about/scope#Macrolanguages
1012
- https://github.com/silnrsi/langtags/blob/master/doc/langtags.md#macro-languages
1113
- https://iso639-3.sil.org/code_tables/macrolanguage_mappings/
@@ -34,8 +36,6 @@ The language data we use is primarily based on [langtags.json](https://ldml.api.
3436

3537
For example: For the macrolanguage `pus` (Pashto), there are 4 relevant entries in langtags.json (listed below). From the first (ps-Arab-AF), we can detect that `pus`/`ps` is mapped to/equivalent to representative language `pbu` as described above. (`ps` is the ISO 639-1 equivalent of `pus`.) We therefore know the second entry (ps-Arab-PK) is also for language `pbu`. Since this language chooser delineates languages firstly by their ISO 639-3 codes, we combine the first two entries. We mark the result with `aliasMacrolanguage: pus`. The `pbt` and `pst` entries we straightforwardly handle as normal individual languages.
3638

37-
There are a few entries in langtags.json for which we cannot straightforwardly determine the individual language. These we mark with `aliasMacrolanguage: unknown` and keep the iso639-3 code despite it being a macrolanguage code. For the react language chooser, the desired behavior for these situations should be handled in search result modifiers. As of February 2025, these entries are `bnc`, `nor`, `san`, `hbs`, and `zap`. Other unusual situations we are aware of are `aka` and `zhx`, these may also warrant special checking.
38-
3939
```
4040
{
4141
"full": "ps-Arab-AF",
@@ -110,3 +110,33 @@ There are a few entries in langtags.json for which we cannot straightforwardly d
110110
### Stripping the "(macrolanguage)" Parentheticals
111111

112112
Some entries in langtags.json contain "macrolanguage" in the language name, and yet contain the only data present for the representative language for that macrolanguage. For example, Dogri macrolanguage code is doi and Dogri individual language code is dgo. There are 4 entries with "iso639_3": "doi", all of which have "name": "Dogri (macrolanguage)", and dgo tags in the tags field which lists equivalent tags. The code dgo does not appear anywhere in langtags.json outside of these 4 entries. We are therefore interpreting data from these entries as applying to the individual language, and simply stripping "(macrolanguage)" wherever we find it. When we create macrolanguage search results, we set `isMacrolanguage=true`.
113+
114+
### Anomalies and Special Situations
115+
116+
Last updated September 2025. These codes should be specially checked and handled, as our language data for them may be unusual or inaccurate.
117+
118+
- `aka` - From the ISO 639-3 site, [Akan is a macrolanguage](https://iso639-3.sil.org/code/aka) (ISO639-3: `aka`; ISO639-1: `ak`) containing individual languages Twi (ISO639-3: `twi`; ISO639-1: `tw`) and Fanti (ISO639-3: `fat`). The langtags.json entry lists `ak`, `fat`, and `tw` as equivalent and provides no names other than "Akan". In Ethnologue, [Akan has a page](https://www.ethnologue.com/language/aka/) that lists Twi and Fanti as mutually intelligible dialects, among others.
119+
120+
Since we don't have enough data from langtags.json to make any individual language entries, we give one result with codes "aka" and "ak" but don't mark it as a macrolanguage. For now this is somewhat consistent with Ethnologue, which treats it as a single language, and there is no point discouraging people from using it in the absence of alternatives.
121+
122+
See also https://unicode-org.atlassian.net/browse/CLDR-10293 and https://unicode-org.atlassian.net/browse/CLDR-17323. The current langtags.json handling of these languages might not be desired or permanent.
123+
124+
- `hbs` - [Serbo-Croatian is a macrolanguage](https://iso639-3.sil.org/code/hbs) (ISO639-3: `hbs`; ISO639-1: `sh`) containing individual languages Bosnian (ISO639-3: `bos`; ISO639-1: `bs`), Montenegrin (ISO639-3: `cnr`), Croatian (ISO639-3: `hrv`; ISO639-1: `hr`) and Serbian (ISO639-3: `srp`; ISO639-1: `sr`). Contrary to its usual behavior, langtags.json has a separate entry for Serbo-Croatian that is not mappable to any individual language.
125+
126+
We straightforwardly give an `hbs` macrolanguage result as well as all the individual language results as per usual.
127+
128+
- `nor` - [Norwegian is a macrolanguage](https://iso639-3.sil.org/code/nor) (ISO639-3: `nor`; ISO639-1: `no`) with child languages Bokmål (`nob`) and Nynorsk (`nno`), but [Ethnologue treats it as a single language](https://www.ethnologue.com/language/nor/) and says "Norwegian has 2 written standards, both of which are assigned codes in the ISO 639-3 standard: Bokmål Norwegian (nob) and Nynorsk Norwegian (nno)."
129+
130+
Currently we give `nor` as a macrolanguage as well as `nob` and `nno` as individual languages.
131+
132+
- `san` - [Sanskrit is a macrolanguage](https://iso639-3.sil.org/code/san) (ISO639-3: `san`; ISO639-1: `sa`) with child languages Classical Sanskrit (`cls`) and Vedic Sanskrit (`vsn`). It has a single [Ethnologue page](https://www.ethnologue.com/language/san/) which states "Sanskrit has 2 individual historical languages, both of which are assigned codes in the ISO 639-3 standard: Classical Sanskrit (cls) and Vedic Sanskrit (vsn)."
133+
134+
Since we have no differentiated information on `cls` or `vsn` from langtags.json and there is only 1 Ethnologue page, for now we are giving a single entry for Sanskrit and not marking it as a macrolanguage.
135+
136+
- `zap` - [Zapotec is a macrolanguage](https://iso639-3.sil.org/code/zap) with many child languages. Due to a known error, two of its child languages, Isthmus Zapotec (`zai`) and Las Delicias Zapotec (`zcd`) are currently being conflated in langtags.json, so both of their data shows up in a single zap entry. Langtags.json has an entry for `zap-Latn-MX` which lists `zap`, `zai`, and `zcd` as equivalent. The `zai` [Ethnologue page](https://www.ethnologue.com/language/zai/) has "Dialects - None known. 18% intelligibility of Santa María Petapa [zpe] (most similar). A member of macrolanguage Zapotec [zap]". The `zcd` [Ethnologue page](https://www.ethnologue.com/language/zcd/) has "Dialects - None known. Reportedly most similar to Rincón Zapotec [zar]. A member of macrolanguage Zapotec [zap]". CLDR lists "zai" to "zap" for macrolanguage (but doesn't mention "zcd").
137+
138+
For now we remove the names which obviously refer to Las Delicias Zapotec or Isthmus Zapotec from the `zap` macrolanguage card; `zai` and `zcd` do not have their own cards. We hope the bug causing the langtags.json error will be resolved soon.
139+
140+
- `zhx` - [This is an ISO 639-5 Collective code](https://iso639-3.sil.org/code/zhx) which has an entry in langtags.json with script `nshu`. It does not have an Ethnologue page and is the only ISO 639-5 code that we have found in langtags.json.
141+
142+
We do not include `zhx` in our search results.
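
The handling described in these notes can be summarized as a small lookup table. The following is only an illustrative sketch: `AnomalyPolicy`, `anomalyPolicies`, and `stripConflatedNames` are hypothetical names that do not exist in this repository, and the exact list of names stripped from the `zap` card is assumed from the test expectations rather than taken from the real search result modifier.

```typescript
// Hypothetical summary of the anomaly handling described in the notes.
// None of these names exist in the repository; they only illustrate the policy.
interface AnomalyPolicy {
  showCard: boolean; // whether a search result card is produced for this code
  isMacrolanguage: boolean; // whether that card is flagged as a macrolanguage
  namesToStrip?: string[]; // names removed from the card to avoid confusion
}

const anomalyPolicies: Record<string, AnomalyPolicy> = {
  aka: { showCard: true, isMacrolanguage: false },
  hbs: { showCard: true, isMacrolanguage: true },
  nor: { showCard: true, isMacrolanguage: true },
  san: { showCard: true, isMacrolanguage: false },
  zap: {
    showCard: true,
    isMacrolanguage: true,
    // Assumed list: the notes only say "names which obviously refer to" these languages.
    namesToStrip: ["Isthmus Zapotec", "Las Delicias Zapotec"],
  },
  zhx: { showCard: false, isMacrolanguage: false },
};

// Remove the names that a policy marks as belonging to conflated child languages.
// Codes without a policy, or without namesToStrip, are returned unchanged.
function stripConflatedNames(code: string, names: string[]): string[] {
  const toStrip = anomalyPolicies[code]?.namesToStrip ?? [];
  return names.filter((name) => !toStrip.includes(name));
}
```

Keeping the policy in one table like this would make it easy to compare against the expectations in `searchResultModifiers.spec.ts`, though the real implementation may be organized quite differently.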

components/language-chooser/common/find-language/scripts/langtagProcessing.ts

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -79,7 +79,7 @@ function addOrCombineLangtagsEntry(
7979
entry.isRepresentativeForMacrolanguage;
8080
} else {
8181
const scriptCode = entry.script;
82-
const scripts = {};
82+
const scripts: { [key: string]: IScript } = {};
8383
if (scriptCode) {
8484
scripts[scriptCode] = {
8585
code: scriptCode,
@@ -106,6 +106,7 @@ function addOrCombineLangtagsEntry(
106106
regionNames: new Set([entry.regionname]),
107107
names: getAllPossibleNames(entry),
108108
scripts,
109+
isMacrolanguage: entry.isMacrolanguage || false,
109110
parentMacrolanguage:
110111
macrolanguagesByCode[indivlangsToMacrolangs[entry.indivIsoCode]],
111112
isRepresentativeForMacrolanguage: entry.isRepresentativeForMacrolanguage,
@@ -132,13 +133,14 @@ function parseLangtagsJson() {
132133
: entry.iso639_3;
133134

134135
if (!augmentedEntry["indivIsoCode"]) {
135-
// This is a data anomaly but we do have 5 as of Feb 2025: bnc, nor, san, hbs, zap
136+
// This is a data anomaly but we do have a few as of Sep 2025: nor, san, hbs, zap, ar-SA, ku-Arab-TR, and man-Latn-GN
136137
// See macrolanguageNotes.md. These cases should be specially handled.
137138
console.log(
138139
"No indivIsoCode found for macrolang",
139140
entry.iso639_3,
140141
entry.tag
141142
);
143+
augmentedEntry["isMacrolanguage"] = true;
142144
}
143145
}
144146

@@ -177,7 +179,7 @@ function parseLangtagsJson() {
177179
].filter((name) => !!name),
178180
alternativeTags: [...langData.alternativeTags],
179181
parentMacrolanguage: langData.parentMacrolanguage,
180-
isMacrolanguage: false, // we add macrolanguages separately below. See macrolanguageNotes.md
182+
isMacrolanguage: langData.isMacrolanguage,
181183
isRepresentativeForMacrolanguage:
182184
langData.isRepresentativeForMacrolanguage,
183185
languageType: langData.languageType,
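
To make the effect of this change easier to follow, here is a minimal, self-contained sketch of the flagging step. It is not the code from `langtagProcessing.ts`: `EntrySketch`, `representativeByMacrolang`, `macrolangCodes`, and `augment` are hypothetical stand-ins, and the sample mapping data is assumed for illustration only.

```typescript
// Hypothetical, simplified entry shape; the real interface is ILangtagsJsonEntryInternal.
interface EntrySketch {
  iso639_3: string;
  tag: string;
  indivIsoCode?: string;
  isMacrolanguage?: boolean;
}

// Assumed stand-ins for the macrolanguage data gathered elsewhere in the processing scripts.
const representativeByMacrolang: Record<string, string> = { pus: "pbu" };
const macrolangCodes = new Set(["pus", "nor"]);

function augment(entry: EntrySketch): EntrySketch {
  const augmented: EntrySketch = { ...entry };
  if (macrolangCodes.has(entry.iso639_3)) {
    // Try to resolve the macrolanguage code to its representative individual language.
    augmented.indivIsoCode = representativeByMacrolang[entry.iso639_3];
    if (!augmented.indivIsoCode) {
      // Data anomaly: no individual language found, so keep the
      // macrolanguage code and flag the entry as a macrolanguage.
      augmented.isMacrolanguage = true;
    }
  } else {
    // Ordinary individual language: the ISO 639-3 code is used as-is.
    augmented.indivIsoCode = entry.iso639_3;
  }
  return augmented;
}
```

Under these assumptions, an entry for `nor` ends up with `isMacrolanguage` set to `true`, while an entry for `pus` resolves to `pbu` and is left unflagged, which matches the behavior the new tests check for Norwegian.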

components/language-chooser/common/find-language/scripts/langtagProcessingHelpers.ts

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,7 @@ export interface ILangtagsJsonEntryInternal {
5353
windows: string;
5454

5555
// These are not in the langtags.json file but may be added in the processing
56+
isMacrolanguage: boolean;
5657
isRepresentativeForMacrolanguage: boolean;
5758
indivIsoCode: string; // If iso639_3 is a macrolanguage code, this is the corresponding (representative) individual language code - see macrolanguageNotes.md
5859
}
@@ -158,8 +159,9 @@ for (const line of macrolangMappingFile.split("\n")) {
158159
// So in langtags.json, for representative languages, the iso639_3 field is often the macrolanguage code,
159160
// but the tags field (in some but not all entries) contains equivalent tags that use the individual language codes.
160161
// We want to save the individual language codes, so gather as many macrolanguage to representative individual language
161-
// mappings as we can. As of 2/2025, this covers all macrolanguage codes in langtags.json except for
162-
// bnc, nor, san, hbs, and zap which should all be handled by search result modifiers. (a fix for `man` was incorporated 8/2025)
162+
// mappings as we can. As of 9/2025, this covers all macrolanguage codes in langtags.json except for
163+
// nor, san, hbs, and zap which should all be handled by search result modifiers. (a fix for `man` was incorporated 8/2025.
164+
// We had previously noted that bnc was a problem, but it seems to be working now.)
163165
// See macrolanguageNotes.md for more explanation.
164166

165167
// eslint-disable-next-line @typescript-eslint/no-explicit-any

components/language-chooser/common/find-language/searchResultModifiers.spec.ts

Lines changed: 113 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -216,4 +216,117 @@ describe("reordering entries to prioritize desired language when keywords are se
216216
expect(spanishResult?.names[0]).toBeTruthy();
217217
});
218218
});
219+
220+
// I'm not sure if this is the behavior we want to stick with, but putting it here to document it
221+
// and so we notice if we accidentally change it. See macrolanguageNotes.md for more details.
222+
describe("Anomalous special case handling", () => {
223+
it("should handle Akan as expected", async () => {
224+
// one "aka" card, not marked as a macrolanguage, with multiple script options
225+
const akaResults = defaultSearchResultModifier(
226+
(await asyncGetAllLanguageResults("Akan")) as ILanguage[],
227+
"Akan"
228+
);
229+
const akaResult = akaResults.find((result) =>
230+
codeMatches(result.iso639_3_code, "aka")
231+
);
232+
expect(akaResult).toBeDefined();
233+
// Not marking Akan as a macrolanguage, so we don't show the "better to pick an individual language" warning
234+
// when there are no relevant individual language options
235+
expect(akaResult?.isMacrolanguage).toBe(false);
236+
// language subtag should be "ak"
237+
expect(akaResult?.languageSubtag).toBe("ak");
238+
// should have more than one script option
239+
expect(akaResult?.scripts.length).toBeGreaterThan(1);
240+
// Currently we don't have enough info for any of the akan individual languages
241+
expect(
242+
akaResults.some(
243+
(result) =>
244+
codeMatches(result.iso639_3_code, "twi") ||
245+
codeMatches(result.iso639_3_code, "fat") ||
246+
result.languageSubtag === "tw"
247+
)
248+
).toBe(false);
249+
});
250+
251+
it("should handle Serbo-Croatian as expected", async () => {
252+
// one "hbs" macrolanguage card with an empty names list, plus a card for each child language
253+
const hbsResults = defaultSearchResultModifier(
254+
(await asyncGetAllLanguageResults("Serbo-Croatian")) as ILanguage[],
255+
"Serbo-Croatian"
256+
);
257+
const hbsResult = hbsResults.find((result) =>
258+
codeMatches(result.iso639_3_code, "hbs")
259+
);
260+
expect(hbsResult).toBeDefined();
261+
expect(hbsResult?.isMacrolanguage).toBe(true);
262+
expect(hbsResult?.names.length).toBe(0);
263+
// All of the child languages should be present
264+
for (const childCode of ["bos", "cnr", "hrv", "srp"]) {
265+
expect(
266+
hbsResults.some(
267+
(result) =>
268+
codeMatches(result.iso639_3_code, childCode) &&
269+
!result.isMacrolanguage
270+
)
271+
).toBe(true);
272+
}
273+
});
274+
275+
it("should handle Norwegian as expected", async () => {
276+
// one macrolanguage "nor" card, plus one for Bokmål and one for Nynorsk
277+
const norResults = defaultSearchResultModifier(
278+
(await asyncGetAllLanguageResults("Norwegian")) as ILanguage[],
279+
"Norwegian"
280+
);
281+
const norResult = norResults.find((result) =>
282+
codeMatches(result.iso639_3_code, "nor")
283+
);
284+
expect(norResult).toBeDefined();
285+
expect(norResult?.isMacrolanguage).toBe(true);
286+
// Both of the child languages should be present
287+
for (const childCode of ["nob", "nno"]) {
288+
expect(
289+
norResults.some(
290+
(result) =>
291+
codeMatches(result.iso639_3_code, childCode) &&
292+
!result.isMacrolanguage
293+
)
294+
).toBe(true);
295+
}
296+
});
297+
298+
it("should handle Sanskrit as expected", async () => {
299+
// we don't have enough info for the individual languages, so just one "san" card
300+
const sanResults = defaultSearchResultModifier(
301+
(await asyncGetAllLanguageResults("Sanskrit")) as ILanguage[],
302+
"Sanskrit"
303+
);
304+
const sanResult = sanResults.find((result) =>
305+
codeMatches(result.iso639_3_code, "san")
306+
);
307+
expect(sanResult).toBeDefined();
308+
expect(sanResult?.isMacrolanguage).toBe(false);
309+
// Make sure it has the list of names
310+
expect(sanResult?.names.length).toBeGreaterThan(3);
311+
});
312+
313+
it("should handle Zapotec as expected", async () => {
314+
// in addition to all the individual zapotec languages, there should be one macrolanguage zap card,
315+
// and its list of names should not contain "Isthmus Zapotec" nor "Las Delicias Zapotec" as that could be confusing
316+
const zapResults = defaultSearchResultModifier(
317+
(await asyncGetAllLanguageResults("Zapotec")) as ILanguage[],
318+
"Zapotec"
319+
);
320+
const zapResult = zapResults.find((result) =>
321+
codeMatches(result.iso639_3_code, "zap")
322+
);
323+
expect(zapResult).toBeDefined();
324+
expect(zapResult?.isMacrolanguage).toBe(true);
325+
expect(
326+
zapResult?.names.some((n) =>
327+
["Isthmus Zapotec", "Las Delicias Zapotec"].includes(n)
328+
)
329+
).toBe(false);
330+
});
331+
});
219332
});
