-
-
Notifications
You must be signed in to change notification settings - Fork 9k
frontend: Use system locale with UTF-8 instead of 'C' #12624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Scouted the web and the codebase for potential issues. Here's some observations... General stuff about setlocale() on Windows
multibyte <-> utf8 <-> wcharIn platform.h there are various string conversion functions. From these only the multibyte functions with The The utf8 <-> wchar functions (ie Streams and file operationsC++ streams, like C-style
When to setlocale() ?On principle locale should be one of the first things to set, since many things inherit it and it's cumbersome to retroactively reset it's state to existing things. However, since Qt overwrites locale during construction of |
9d20b0c to
c0dbe23
Compare
3849250 to
92f93f4
Compare
Highlighting this because this is a severe change, even though I think it's correct in principle. Changing an application's display language should not change the regional settings (which encompass sorting rules as well as decimal point character, etc.), and at least that's how it works on macOS. @Warchamp7 @Fenrirthviti would be good to hear if you'd be fine with this change conceptually as well. |
|
My main concern here, as someone who only uses the English/USA locale/region, is that I'm unsure what the expectation for a Windows application is. The current motivation seems to be "Unix does it this way" and that to me, is not sufficient. Do we have examples and recommendations from Microsoft, or other prominent Windows applications on how they handle this kind of setting for apps that use translations? |
|
Microsoft general guidelines for globalization suggests:
My own take is that OS regional settings + app translations is the desired output with no additional settings in UI, good likelyhood being fine by default, but allows configuration when needed. The obvious drawback is that to change the regional settings one needs to change OS settings, which could be an issue when using a shared device, but OS should provide options for it. Many Microsoft apps understandably just follow the OS region settings. That includes the file explorer. Apps like Office do offer a separate in app setting for regional settings as well, but that's because they specialize in that sort of thing. For apps in general many have translation + maybe time formatting options and don't follow a specific region per se. The sorting order is a gamble. Steam seems to sort games and friends using the app display language. Spotify sorts playlists by Windows display language (not region settings, nor Spotify display language, I don't recommend this). Whatsapp uses OS regional settings. In any case, any locale is still better than the current situation with non-changeable "C" locale. |
|
Thanks for the additional context here. My, admittedly mostly uninformed opinion based on the discussion here, is that this seems fine. Without lack of a clear "best practice" on Windows, moving things in-line with our cross-platform implementation seems like the best option. @PatTheMav or @jcm93 Does macOS follow a similar approach to what is being proposed here? |
On macOS language and regional settings are also separate things and it's expected that your app's language is actually changed from the OS' language settings (rather than within the application itself) which in gendered languages also includes choice of preferred pronoun: The language setting indeed only changes the display language, decimal format, sorting, et. al. still follow the regional setting (at least in "native" apps). |
|
In the current state the PR works as designed, but I'm thinking of adding the following lines to obs.manifest to set active code page to utf-8 for Win32 APIs (the A versions of functions). While OBS uses those relatively little, with hard coded strings mostly, it could ease 3rd party code adaptation, plugins etc. It also allows utf-8 encoding of commandline arguments for loading a specific collection, scene or profile by their name. Lines to add to obs.manifest Doing this would mean the locale setup routine would need to be done much earlier in the program than the |
dd143ca to
29aa8fc
Compare
|
The obs.manifest changes has been added and locale setup routine moved to beginning of the app, with unix re-running it after OBSApp has been created. Here's some info about the manifest if you wish to read. https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page If there's any issues with locale setup routine being the first thing to do please let me know. |
Sets runtime locale to system locale with UTF-8 codepage. This is already default behavior on unix, but Windows defaults to minimal 'C' locale. Use CRT locale for C++ std::locale default OBS Studio language settings no longer change QLocale default locale, instead system locale is used for conformity. It is likely this is what user wants as well. Ie. sorting and formatting functions should follow OS locale instead of OBS Studio language (which also lacks country information).
Cast ctype function char parameters to unsigned char to ensure they are in correct range (0 to 255 vs -128 to 127) when used with utf-8 encoding (or extended ascii). Fixes dstr astrcmp* functions when used with utf-8 (or extended ascii) characters, so now they are treated greater than the base ascii and thus sorted after them, not before.
Switch locale-aware timestamping for logging / crash handling to %H:%M:%S
Utf-8 manifest allows Win32 API to use utf-8 instead of ANSI codepages. This changes the "A" versions of fucntions to work with utf-8. Manifest also treats command line arguments as utf8. This allows for example --profile <name> to load profiles with special characters. The locale setup routine is moved to the beginning of program to better cover io streams, without the need to reconfigure them later. The caveat is that unix will need to re-do the routine after initilizing OBSApp, because Qt overwrites the locales for unix during it's construction. Added logging for locale info and app language to better diagnose potential errors.
|
The main "problem" this PR has to address is that OBS on the whole is conceptually not built to be aware (much less so capable of handling) of locales (and locale differences), similarly to how it pretends that any set of bytes in memory is "UTF-8" but then happily treats it as "just ASCII". For better or worse all that OBS can handle without issue is US-American text and formats, anything beyond that is a "happy accident". Introducing these capabilities has to go far beyond just slapping on some
That's already a big pile of work to get through and figure out where/how/if to change OBS (and
Mind you, I'm not saying that we shouldn't adopt changes like these, but that these changes require a great deal of thought and need to address many architectural design flaws in OBS as it is right now. Simply adding it in bits and pieces runs the risk of fixing symptoms but not fixing the core issue(s), particularly as the entire app has not been designed to properly handle locale's or anything beyond ASCII text and making it handle text "right" in one area might collapse a whole house of cards of assumptions about character data in another. For that (and a few other reasons) it might take same time (and might even require splitting the whole endeavour up into separate "units") to get it over the finish line. |
|
The input is much appreciated. I aknowledge the uncertainty and I'm fine with whatever approach is taken, but I have some counterpoints too.
I don't believe this. The fact that locales, the region and utf-8 enforcement, have been in use on non-Windows OSs for all of OBS existence while it has been using Qt is a point for that. Some degree of good design choices is needed for that.
This is not true. The threads on Windows have the locale, as long as it is set before spawning threads. References below.
Some only care about the encoding, others about the regional things. Others follow other ways of setting locale, like the manifest. Some really don't care, but considering their limited usability, it might not matter to us either. If it does it's limited job fine it's ok, if not then another approach is in order. Regarding OBS, Probably biggest culprit we have is the I also expected this to be a hurdle. So I did search the codebase for all prominent standard C and C++ functions that care about the locale and addressed them in this PR if there was a need. That includes tracking down their input origins and where they go to see if UTF-8 and localization is acceptable. I have looked if they care about the region or the encoding, and by set by what (setlocale, active codepage manifest, std::locale::global, or something else). These are not a random list of magic switches in the pr, it's all very much mappable, and have reason to be here. My experience about utf-8 readiness is that it is almost there. Kind of like you said, many things pretend it already is utf-8 while we have been playing around with ASCII. Other things don't really need to become locale or encoding aware, just as long as their limitation is recognized and used in contexts where you don't expect anything special. I'm weirdly enough expecting this to fix more things by collateral than break, especially those cases where the encoding is coincidentally in ANSI because there was nothing to tell the source we'd like utf-8. The unicode handling in OBS is otherwise... ok. We have a few ways and can keep using those, nothing wrong with that, they are built correctly for the purpose and work fine after too. This doesn't touch Qt, -W Apis (or defaults selected by UNICODE build flag), utf-16 or ucs2. It is not as if we need to adapt or handle more things, quite the opposite. It's one of the things that coercing utf-8-ness is about. The ANSI pages being one of those things exactly to prevent. The -A apis currently return ANSI coded strings, which simply do not function in OBS if they happen to contain anything non-ASCII. std::string and char are byte containers with programmer hints about using it for text. They support utf-8 as well as any other encoding with multibyte bytes. Things like strlen will tell the size in bytes, which is usually good for example allocating arrays and when you iterate over them you are often times looking for something that is in ASCII anyway (that is compatible with utf-8 codepoints). Some after thoughts The need for this started from trying to get locale-awarness to C, for sorting specifically. Locale availability on C++ or Qt could probably be leveraged with some externs to C, true. Or a signaling mechanism in case module encapsulation paradigms would fight against direct usage, though it is a round about way for something relatively low level. For commandline arguments I don't see a nice way to fix them without the also -A API changing manifest. I suppose we could use MultibyteToWideChar(), but with CP_ACP (Current code page, OS default if not changed) instead of CP_UTF8 flag and then follow with WideCharToMultibyte with the UTF-8 flag. Should work, but I don't think that type thing should become any sort of standard. I might look into this a bit more from broader, design point of view too later. Any decision is fine though. |
|
Continuing a bit here.
I do not share the sentiment about the the kind of architectural flaws present here. Reasoning being that this is already being done on non-Windows systems and the research done as described to introduce this PR. The parts of code that supposedly have a problem with non-ASCII data have that now and will, without this update, still be fed non-ASCII data if they are used as such in what ever encoding. In short, parts that don't work after this, never have, and this will not make fixing them more difficult. We treat all non utf-16/32 text as if it were utf-8 encoded already and use proper conversion functions to convert between that and target. Any other encoding text might be in is not supported by our conversion functions, so we should do everything we can to coerce external strings, and internal functions to work with the assumption the text is indeed what we believe, utf-8. While we can and have played around with ASCII, the actual data has always been what it is, it has never actually been limited to codepoints representable in ASCII. |



Description
Sets C runtime locale to system locale with UTF-8 codepage on Windows.
This has always been default behavior on unix, but Windows defaults to minimal 'C' locale.
LC_NUMERICis still set to"C". Now on all platforms instead of just unix.This is so decimal point is a dot (not a comma) for string <-> float conversions.
The configured CRT locale is copied to be the default
std::localefor C++.All platforms have been using minimal
"C"until now. This change affects newfacetandios_baseinstances without a specified locale orimbue()call.Sets UTF-8 as active codepage for Windows in
obs.manifestfile. This changes Win32 API to use utf-8 for theAfunctions instead of the language dependent ANSI codepage. TheWfunctions are untouched and are used by default because we use the UNICODE build flag. It also treats commandline arguments as utf-8, which allows for example--profile <name>to load profiles with special characters.OBS Studio language setting no longer changes
QLocale's default locale and instead always uses system locale.This gives conformity with non Qt functions, but most importantly is likely what user wants as well. Ie. sorting and formatting functions should follow OS locale rules instead of OBS Studio translations language. (Reverts c4840dd)
obs_get_locale()still returns OBS language locale, which is used for Python and LUA apis, GDI+ text widget transformations, and HTTP accepted languages header.Motivation and Context
Locale-aware operations like sorting and time formatting in C are not available on Windows, but are on unix, as pointed out in PR #12577.
Fixes #11133, fixes #12953
The C++ locale and QLocale changes make the locale-aware functions of all layers work in similiar fashion.
For example: On unix currently the used locales are: OS locale for CRT, minimal
"C"for C++ and OBS language for QLocale.A weekday name can be in three different languages depending if you used
strftime(),std::time_getfacet orQLocale.This makes string transformations between C, C++ and Qt very tricky.
How Has This Been Tested?
An important point is that the CRT locale settings introduced here have always been this way for unix, which suggests there aren't any insurmountable problems with the new locales. Windows specific functions should be tested for CRT locale. Changes for C++ and QLocale defaults affect all platforms.
Searched the codebase for affected areas and addressed as necessary:
strftime()formatting with%placeholdersscanf()andprintf()formatting with%placeholdersFILEoperationsfstreamoperationsfacetlocale usageQStringlocale-aware methodsSome general testing with Japanese characters
I'm on Windows 11 English US version, but with Finnish locale settings (fi_FI). OBS language is English.
Types of changes
_mbs_conversion functions directly or via file io operations have to input text in utf-8 and expect utf-8 output (except forwchar/_wcs_which is OS defined).Checklist: