Commit 7461a98

Revise approach to Unicode filenames on Windows (#5794)
Treat filenames as UTF-8 initially and fall back to ANSI functions if conversion to UTF-16 fails.

Adds new HDF5_PREFER_WINDOWS_CODE_PAGE environment variable to prefer interpreting filenames according to the active Windows code page rather than assuming UTF-8 encoding.
1 parent b21c848 commit 7461a98

8 files changed (+650, -187 lines changed)
doxygen/aliases

Lines changed: 1 addition & 0 deletions
@@ -254,6 +254,7 @@ ALIASES += cpp_c_api_note="\anchor cpp_c_api_note \attention \Bold{C++ Developer
 ALIASES += callback_note="\attention \Bold{Leaving callback functions:}\n The callback function must return normally, even in the case of error. Returning with H5_ITER_ERROR, instead of leaving by means of exceptions, exit() function, etc... will allow the HDF5 library to manage its resources and maintain a consistent state. See \ref cpp_c_api_note \"C++ Developers using HDF5 C-API functions\" warning for detail."
 ALIASES += par_compr_note="\attention If you are planning to use compression with parallel HDF5, ensure that calls to H5Dwrite() occur in collective mode. In other words, all MPI ranks (in the relevant communicator) call H5Dwrite() and pass a dataset transfer property list with the MPI-IO collective option property set to #H5FD_MPIO_COLLECTIVE_IO.\n Note that data transformations are currently \Bold{not} supported when writing to datasets in parallel and with compression enabled."
 ALIASES += sa_metadata_ops="\sa \li H5Pget_all_coll_metadata_ops() \li H5Pget_coll_metadata_write() \li H5Pset_all_coll_metadata_ops() \li H5Pset_coll_metadata_write() \li \ref maybe_metadata_reads"
+ALIASES += unicode_filename_note="\note On Windows, HDF5 assumes that a file name string is UTF-8 encoded and will attempt to convert it to UTF-16 before calling wide-character Windows API functions. If a file name string cannot be converted to UTF-16, the equivalent non-wide-character Windows API functions will be used, causing a file name string to be interpreted according to the active Windows code page. If an application desires that the active Windows code page be preferred, the environment variable #HDF5_PREFER_WINDOWS_CODE_PAGE can be set to the value '1' or 'TRUE' (case-insensitive)."

 ################################################################################
 # Specifications

release_docs/RELEASE.txt

Lines changed: 24 additions & 0 deletions
@@ -780,6 +780,30 @@ Bug Fixes since HDF5-2.0.0 release
 ===================================
     Library
     -------
+    - Revised handling of Unicode filenames on Windows
+
+      In the HDF5 1.14.4 release, a change was made to address some issues
+      with the library's handling of code pages and file paths on Windows.
+      This change introduced other issues with the handling of UTF-8 file
+      names that caused breakage for software using the 1.14.4 and 1.14.5
+      releases of HDF5. That change was reverted for the 1.14.6 release
+      and the behavior has been slightly modified for this release.
+
+      On Windows, the library once again assumes that filename strings will
+      be UTF-8 encoded strings and will attempt to convert them to UTF-16
+      before passing them to Windows API functions. However, if the library
+      fails to convert a filename string to UTF-16, it will now fallback to
+      the equivalent Windows "ANSI" API functions which will interpret the
+      string according to the active Windows code page.
+
+      Support for a new environment variable, HDF5_PREFER_WINDOWS_CODE_PAGE,
+      was added in order to instruct HDF5 to prefer interpreting filenames
+      according to the active Windows code page rather than assuming UTF-8
+      encoding. If this environment variable is set to "1" or "TRUE"
+      (case-insensitive), the active code page will be preferred. If it is
+      unset or set to "0" or "FALSE" (case-insensitive), UTF-8 will be
+      preferred.
+
     - Fixed an issue with caching in the ROS3 VFD

       The ROS3 VFD uses a very simple caching mechanism that caches the
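
Reviewer note: the fallback described in this release note depends on whether a file name is well-formed UTF-8. The portable sketch below is not HDF5 code. It uses a hypothetical helper, `is_valid_utf8`, that only approximates the strict check requested by `MB_ERR_INVALID_CHARS`; on Windows the real decision is made by `MultiByteToWideChar` and may differ in edge cases. The file name "café.h5" is an assumption chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of HDF5): returns true if the
 * NUL-terminated string s is well-formed UTF-8. Overlong encodings,
 * surrogate code points and values above U+10FFFF are rejected. */
static bool
is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned int cp;  /* decoded code point */
        unsigned int min; /* smallest code point allowed for this length */
        int          n;   /* number of continuation bytes expected */

        if (*p < 0x80) { /* plain ASCII byte */
            p++;
            continue;
        }
        else if ((*p & 0xE0) == 0xC0) {
            cp = *p & 0x1Fu; n = 1; min = 0x80;
        }
        else if ((*p & 0xF0) == 0xE0) {
            cp = *p & 0x0Fu; n = 2; min = 0x800;
        }
        else if ((*p & 0xF8) == 0xF0) {
            cp = *p & 0x07u; n = 3; min = 0x10000;
        }
        else
            return false; /* stray continuation byte or invalid lead byte */

        p++;
        for (; n > 0; n--, p++) {
            /* A terminating NUL also fails this test, so we never read past it */
            if ((*p & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*p & 0x3Fu);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
    }

    return true;
}
```

With this helper, `"caf\xC3\xA9.h5"` (UTF-8) is accepted and would take the wide-character route, while `"caf\xE9.h5"` (the same name encoded in Windows-1252) is rejected and would fall back to the code-page route.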

src/H5Fpublic.h

Lines changed: 28 additions & 0 deletions
@@ -269,9 +269,15 @@ extern "C" {
  * container_name can be opened with the file access property list
  * \p fapl_id.
  *
+ * \parblock
  * \note The H5Fis_accessible() function enables files to be checked with a
  * given file access property list, unlike H5Fis_hdf5(), which only uses
  * the default file driver when opening a file.
+ * \endparblock
+ *
+ * \parblock
+ * \unicode_filename_note
+ * \endparblock
  *
  * \since 1.12.0
  *
@@ -320,13 +326,21 @@ H5_DLL htri_t H5Fis_accessible(const char *container_name, hid_t fapl_id);
  * \par Example
  * \snippet H5F_examples.c minimal
  *
+ * \parblock
  * \note #H5F_ACC_TRUNC and #H5F_ACC_EXCL are mutually exclusive; use
  * exactly one.
+ * \endparblock
  *
+ * \parblock
  * \note An additional flag, #H5F_ACC_DEBUG, prints debug information. This
  * flag can be combined with one of the above values using the bit-wise
  * OR operator (\c |), but it is used only by HDF5 library developers;
  * \Emph{it is neither tested nor supported for use in applications}.
+ * \endparblock
+ *
+ * \parblock
+ * \unicode_filename_note
+ * \endparblock
  *
  * \attention \Bold{Special case — File creation in the case of an already-open file:}
  * If a file being created is already opened, by either a previous
@@ -420,8 +434,14 @@ H5_DLL hid_t H5Fcreate_async(const char *filename, unsigned flags, hid_t fcpl_id
  * \par Example
  * \snippet H5F_examples.c open
  *
+ * \parblock
  * \note #H5F_ACC_RDWR and #H5F_ACC_RDONLY are mutually exclusive; use
  * exactly one.
+ * \endparblock
+ *
+ * \parblock
+ * \unicode_filename_note
+ * \endparblock
  *
  * \attention \Bold{Special cases — Multiple opens:} A file can often be opened
  * with a new H5Fopen() call without closing an already-open
@@ -666,6 +686,10 @@ H5_DLL herr_t H5Fclose_async(hid_t file_id, hid_t es_id);
  * is an HDF5 file via H5Fis_accessible(). This is done to ensure that
  * H5Fdelete() cannot be used as an arbitrary file deletion call.
  *
+ * \parblock
+ * \unicode_filename_note
+ * \endparblock
+ *
  * \since 1.12.0
  *
  */
@@ -1958,6 +1982,10 @@ H5_DLL herr_t H5Fset_latest_format(hid_t file_id, hbool_t latest_format);
  *
  * \details H5Fis_hdf5() determines whether a file is in the HDF5 format.
  *
+ * \parblock
+ * \unicode_filename_note
+ * \endparblock
+ *
  * \since 1.0.0
  * \deprecated 1.12.0 Deprecated in favor of the function H5Fis_accessible()
  *

src/H5private.h

Lines changed: 7 additions & 6 deletions
@@ -900,12 +900,13 @@ H5_DLL int HDvasprintf(char **bufp, const char *fmt, va_list _ap);
 #if defined(H5_HAVE_WINDOW_PATH)

 /* directory delimiter for Windows: slash and backslash are acceptable on Windows */
-#define H5_DIR_SLASH_SEPC '/'
-#define H5_DIR_SEPC '\\'
-#define H5_DIR_SEPS "\\"
-#define H5_CHECK_DELIMITER(SS) ((SS == H5_DIR_SEPC) || (SS == H5_DIR_SLASH_SEPC))
-#define H5_CHECK_ABSOLUTE(NAME) ((isalpha(NAME[0])) && (NAME[1] == ':') && (H5_CHECK_DELIMITER(NAME[2])))
-#define H5_CHECK_ABS_DRIVE(NAME) ((isalpha(NAME[0])) && (NAME[1] == ':'))
+#define H5_DIR_SLASH_SEPC '/'
+#define H5_DIR_SEPC '\\'
+#define H5_DIR_SEPS "\\"
+#define H5_CHECK_DELIMITER(SS) ((SS == H5_DIR_SEPC) || (SS == H5_DIR_SLASH_SEPC))
+#define H5_CHECK_ABSOLUTE(NAME) \
+    ((isalpha((unsigned char)NAME[0])) && (NAME[1] == ':') && (H5_CHECK_DELIMITER(NAME[2])))
+#define H5_CHECK_ABS_DRIVE(NAME) ((isalpha((unsigned char)NAME[0])) && (NAME[1] == ':'))
 #define H5_CHECK_ABS_PATH(NAME) (H5_CHECK_DELIMITER(NAME[0]))

 #define H5_GET_LAST_DELIMITER(NAME, ptr) \
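
Reviewer note: the `(unsigned char)` casts added above matter because the `<ctype.h>` functions require an argument that is either `EOF` or representable as `unsigned char`. Where plain `char` is signed, a non-ASCII byte in a path is negative, and passing it to `isalpha()` uncast is undefined behavior. A minimal sketch of the same check as a function follows; `has_drive_prefix` is a hypothetical name used only for illustration and does not exist in HDF5.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical equivalent of the drive-letter test ("C:") performed by
 * H5_CHECK_ABS_DRIVE. The cast keeps the argument to isalpha() within the
 * range the C standard permits, even when the first byte has its high bit
 * set (for example, the lead byte of a UTF-8 sequence). */
static bool
has_drive_prefix(const char *name)
{
    return name[0] != '\0' && isalpha((unsigned char)name[0]) && name[1] == ':';
}
```

Unlike the macro, this sketch also guards against an empty string before reading `name[1]`; that guard is an addition for safety here, not something the diff introduces.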

src/H5public.h

Lines changed: 14 additions & 2 deletions
@@ -222,8 +222,8 @@
  * opening a file. Valid values for this environment variable are
  * as follows:
  *
- * "TRUE" or "1" - Request that file locks should be used
- * "FALSE" or "0" - Request that file locks should NOT be used
+ * "TRUE" or "1" - Request that file locks should be used <br />
+ * "FALSE" or "0" - Request that file locks should NOT be used <br />
  * "BEST_EFFORT" - Request that file locks should be used and
  * that any locking errors caused by file
  * locking being disabled on the system
@@ -238,6 +238,18 @@
  * \since 1.14.0
  */
 #define HDF5_NOCLEANUP "HDF5_NOCLEANUP"
+/**
+ * Macro for environment variable used to instruct HDF5 to prefer
+ * Windows code pages over UTF-8 for functions that accept 'char *'
+ * parameters. Valid values for this environment variable are as
+ * follows (case-insensitive):
+ *
+ * "TRUE" or "1" - Request that Windows code pages be preferred <br />
+ * "FALSE" or "0" - Request that UTF-8 be preferred <br />
+ *
+ * \since 2.0.0
+ */
+#define HDF5_PREFER_WINDOWS_CODE_PAGE "HDF5_PREFER_WINDOWS_CODE_PAGE"

 /**
  * Status return values. Failed integer functions in HDF5 result almost
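
Reviewer note: the accepted values documented for `HDF5_PREFER_WINDOWS_CODE_PAGE` can be sketched in portable C as follows. This is an illustration under assumptions, not the library's code: `equals_ignore_case` and `prefer_code_page_from` are hypothetical helpers, and the library itself uses `HDstrcasecmp` and `strcmp` inline (see the `src/H5system.c` changes in this commit).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive equality for ASCII strings */
static bool
equals_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Hypothetical helper: interpret the variable's value. Only "1" and
 * "TRUE" (any case) select the code page; anything else, including an
 * unset or empty variable, leaves UTF-8 as the preferred interpretation. */
static bool
prefer_code_page_from(const char *value)
{
    if (value == NULL || *value == '\0')
        return false;
    return equals_ignore_case(value, "true") || strcmp(value, "1") == 0;
}
```

A caller would typically pass the result of `getenv("HDF5_PREFER_WINDOWS_CODE_PAGE")` (from `<stdlib.h>`) straight into `prefer_code_page_from`.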

src/H5system.c

Lines changed: 91 additions & 57 deletions
@@ -447,43 +447,57 @@ Wflock(int fd, int operation)
  *
  * Purpose: Gets a UTF-16 string from an UTF-8 (or ASCII) string.
  *
- * Return: Success: A pointer to a UTF-16 string
- * This must be freed by the caller using H5MM_xfree()
- * Failure: NULL
+ * On success, a pointer to the new UTF-16 string is returned
+ * in `wstring`. This must be freed by the caller using
+ * H5MM_xfree().
+ *
+ * Return: Non-negative on success/Negative on failure
+ *
+ * On failure, the result of GetLastError() is returned
+ * through the `win_error` parameter, if non-NULL.
  *
  *-------------------------------------------------------------------------
  */
-wchar_t *
-H5_get_utf16_str(const char *s)
+herr_t
+H5_get_utf16_str(const char *s, wchar_t **wstring, uint32_t *win_error)
 {
     int nwchars = -1; /* Length of the UTF-16 buffer */
     wchar_t *ret_s = NULL; /* UTF-16 version of the string */

     /* Get the number of UTF-16 characters needed */
-    if (0 == (nwchars = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0)))
+    if (0 == (nwchars = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0)))
         goto error;

     /* Allocate a buffer for the UTF-16 string */
-    if (NULL == (ret_s = (wchar_t *)H5MM_calloc(sizeof(wchar_t) * (size_t)nwchars)))
+    if (NULL == (ret_s = H5MM_calloc(sizeof(wchar_t) * (size_t)nwchars)))
         goto error;

     /* Convert the input UTF-8 string to UTF-16 */
-    if (0 == MultiByteToWideChar(CP_UTF8, 0, s, -1, ret_s, nwchars))
+    if (0 == MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, ret_s, nwchars))
         goto error;

-    return ret_s;
+    *wstring = ret_s;
+
+    return SUCCEED;

 error:
+    /* Store error value first before doing anything else */
+    if (win_error)
+        *win_error = (uint32_t)GetLastError();
+
     if (ret_s)
         H5MM_xfree((void *)ret_s);
-    return NULL;
+
+    *wstring = NULL;
+
+    return FAIL;
 } /* end H5_get_utf16_str() */

 /*-------------------------------------------------------------------------
  * Function: Wopen
  *
  * Purpose: Equivalent of open(2) for use on Windows. Necessary to
- * handle code pages and Unicode on that platform.
+ * handle Unicode and code pages on that platform.
  *
  * Return: Success: A POSIX file descriptor
  * Failure: -1
@@ -492,9 +506,13 @@ H5_get_utf16_str(const char *s)
 int
 Wopen(const char *path, int oflag, ...)
 {
-    int fd = -1; /* POSIX file descriptor to be returned */
-    wchar_t *wpath = NULL; /* UTF-16 version of the path */
-    int pmode = 0; /* mode (optionally set via variable args) */
+    uint32_t win_error = 0; /* Windows error code for failures */
+    wchar_t *wpath = NULL; /* UTF-16 version of the path */
+    herr_t h5_ret = FAIL; /* HDF5 return code */
+    char *env = NULL; /* Environment variable string */
+    bool prefer_code_page = false; /* Whether to prefer using the Windows code page */
+    int fd = -1; /* POSIX file descriptor to be returned */
+    int pmode = 0; /* mode (optionally set via variable args) */

     /* _O_BINARY must be set in Windows to avoid CR-LF <-> LF EOL
      * transformations when performing I/O. Note that this will
@@ -511,30 +529,36 @@ Wopen(const char *path, int oflag, ...)
         va_end(vl);
     }

-    /* First try opening the file with the normal POSIX open() call.
-     * This will handle ASCII without additional processing as well as
-     * systems where code pages are being used instead of true Unicode.
+    /*
+     * Check HDF5_PREFER_WINDOWS_CODE_PAGE environment variable to
+     * determine how to handle the pathname.
      */
-    if ((fd = open(path, oflag, pmode)) >= 0) {
-        /* If this succeeds, we're done */
-        goto done;
+    env = getenv(HDF5_PREFER_WINDOWS_CODE_PAGE);
+    if (env && (*env != '\0')) {
+        if (0 == HDstrcasecmp(env, "true") || 0 == strcmp(env, "1"))
+            prefer_code_page = true;
     }

-    if (errno == ENOENT) {
-        /* Not found, reset errno and try with UTF-16 */
-        errno = 0;
-    }
-    else {
-        /* Some other error (like permissions), so just exit */
-        goto done;
-    }
-
-    /* Convert the input UTF-8 path to UTF-16 */
-    if (NULL == (wpath = H5_get_utf16_str(path)))
-        goto done;
+    /*
+     * Unless requested to prefer Windows code pages, try to convert
+     * the pathname from UTF-8 to UTF-16. If this fails, fallback to
+     * the normal POSIX open() call.
+     */
+    if (!prefer_code_page) {
+        h5_ret = H5_get_utf16_str(path, &wpath, &win_error);
+        if (h5_ret >= 0) {
+            /* Open the file using a UTF-16 path */
+            fd = _wopen(wpath, oflag, pmode);
+        }
+        else {
+            if (ERROR_NO_UNICODE_TRANSLATION != win_error)
+                goto done;

-    /* Open the file using a UTF-16 path */
-    fd = _wopen(wpath, oflag, pmode);
+            fd = open(path, oflag, pmode);
+        }
+    }
+    else
+        fd = open(path, oflag, pmode);

 done:
     H5MM_xfree(wpath);
@@ -546,7 +570,7 @@ Wopen(const char *path, int oflag, ...)
  * Function: Wremove
  *
  * Purpose: Equivalent of remove(3) for use on Windows. Necessary to
- * handle code pages and Unicode on that platform.
+ * handle Unicode and code pages on that platform.
  *
  * Return: Success: 0
  * Failure: -1
@@ -555,33 +579,43 @@ Wopen(const char *path, int oflag, ...)
 int
 Wremove(const char *path)
 {
-    wchar_t *wpath = NULL; /* UTF-16 version of the path */
-    int ret = -1;
+    uint32_t win_error = 0; /* Windows error code for failures */
+    wchar_t *wpath = NULL; /* UTF-16 version of the path */
+    herr_t h5_ret = FAIL; /* HDF5 return code */
+    char *env = NULL; /* Environment variable string */
+    bool prefer_code_page = false; /* Whether to prefer using the Windows code page */
+    int ret = -1;

-    /* First try removing the file with the normal POSIX remove() call.
-     * This will handle ASCII without additional processing as well as
-     * systems where code pages are being used instead of true Unicode.
+    /*
+     * Check HDF5_PREFER_WINDOWS_CODE_PAGE environment variable to
+     * determine how to handle the pathname.
      */
-    if ((ret = remove(path)) >= 0) {
-        /* If this succeeds, we're done */
-        goto done;
-    }
-
-    if (errno == ENOENT) {
-        /* Not found, reset errno and try with UTF-16 */
-        errno = 0;
-    }
-    else {
-        /* Some other error (like permissions), so just exit */
-        goto done;
+    env = getenv(HDF5_PREFER_WINDOWS_CODE_PAGE);
+    if (env && (*env != '\0')) {
+        if (0 == HDstrcasecmp(env, "true") || 0 == strcmp(env, "1"))
+            prefer_code_page = true;
     }

-    /* Convert the input UTF-8 path to UTF-16 */
-    if (NULL == (wpath = H5_get_utf16_str(path)))
-        goto done;
+    /*
+     * Unless requested to prefer Windows code pages, try to convert
+     * the pathname from UTF-8 to UTF-16. If this fails, fallback to
+     * the normal POSIX remove() call.
+     */
+    if (!prefer_code_page) {
+        h5_ret = H5_get_utf16_str(path, &wpath, &win_error);
+        if (h5_ret >= 0) {
+            /* Remove the file using a UTF-16 path */
+            ret = _wremove(wpath);
+        }
+        else {
+            if (ERROR_NO_UNICODE_TRANSLATION != win_error)
+                goto done;

-    /* Remove the file using a UTF-16 path */
-    ret = _wremove(wpath);
+            ret = remove(path);
+        }
+    }
+    else
+        ret = remove(path);

 done:
     H5MM_xfree(wpath);
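
Reviewer note: `Wopen()` and `Wremove()` now share the same three-way decision, which the portable sketch below tries to isolate. Everything in it is hypothetical: `demo_convert` is a deliberately crude stand-in for `H5_get_utf16_str()` that treats every non-ASCII byte as untranslatable, purely so the control flow can be exercised off Windows, and the `conv_error` values stand in for Windows error codes such as `ERROR_NO_UNICODE_TRANSLATION`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Windows error codes */
enum conv_error { CONV_OK, CONV_NO_TRANSLATION, CONV_OTHER };

/* Which API family the caller should use */
enum route { ROUTE_WIDE, ROUTE_NARROW, ROUTE_FAIL };

/* Crude stand-in converter: returns 0 on success, -1 on failure, and
 * reports the reason through *err when err is non-NULL (mirroring the
 * out-parameter style of the revised H5_get_utf16_str()). */
static int
demo_convert(const char *path, enum conv_error *err)
{
    const unsigned char *p = (const unsigned char *)path;

    if (path == NULL) {
        if (err)
            *err = CONV_OTHER;
        return -1;
    }

    for (; *p; p++) {
        if (*p >= 0x80) { /* assumption: any non-ASCII byte "fails" */
            if (err)
                *err = CONV_NO_TRANSLATION;
            return -1;
        }
    }

    return 0;
}

/* Mirrors the branch structure in the diff: the code page wins when
 * requested; otherwise try the conversion, and fall back to the narrow
 * API only when the failure was a translation error. */
static enum route
choose_route(const char *path, bool prefer_code_page)
{
    enum conv_error err = CONV_OK;

    if (prefer_code_page)
        return ROUTE_NARROW;

    if (demo_convert(path, &err) >= 0)
        return ROUTE_WIDE;

    return (err == CONV_NO_TRANSLATION) ? ROUTE_NARROW : ROUTE_FAIL;
}
```

The point of returning the error code separately is visible in the last line: a conversion failure for any other reason (for example, an allocation failure in the real function) ends the call instead of silently retrying with the narrow API.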
