
Commit 7dc42fa

Merge pull request #109 from pp-mo/new_docs
Big docs reorganise and expand.
2 parents d6aaf03 + 695d463 commit 7dc42fa

24 files changed: +1521 additions, -146 deletions

docs/change_log.rst

Lines changed: 10 additions & 5 deletions
@@ -1,22 +1,27 @@
+.. _change_log:
+
 Versions and Change Notes
 =========================

-Project Status
---------------
+.. _development_status:
+
+
+Project Development Status
+--------------------------
 We intend to follow `PEP 440 <https://peps.python.org/pep-0440/>`_,
 or (older) `SemVer <https://semver.org/>`_ versioning principles.
 This means the version string has the basic form **"major.minor.bugfix[special-types]"**.

-Current release version is at **"v0.1"**.
+Current release version is at **"v0.2"**.

-This is a first complete implementation,
-with functional operational of all public APIs.
+This is a complete implementation, with functional operational of all public APIs.
 The code is however still experimental, and APIs are not stable
 (hence no major version yet).

+.. _change_notes:

 Change Notes
 ------------
+Summary of key features by release number

 Unreleased
 ^^^^^^^^^^
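
For illustration of the "major.minor.bugfix" version form described above, a minimal
PEP 440 sketch using the third-party ``packaging`` library:

.. code-block:: python

    from packaging.version import Version

    # "0.2" parses as major=0, minor=2, micro (i.e. bugfix) = 0 under PEP 440.
    v = Version("0.2")
    print(v.major, v.minor, v.micro)   # -> 0 2 0
    print(v.is_prerelease)             # -> False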

docs/details/character_handling.rst

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+.. _string-and-character-data:
+
+Character and String Data Handling
+----------------------------------
+NetCDF can contain string and character data in at least 3 different contexts :
+
+Characters in Data Component Names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+That is, names of groups, variables, attributes or dimensions.
+Component names in the API are just native Python strings.
+
+Since NetCDF version 4, the names of components within files are fully unicode
+compliant, using UTF-8.
+
+These names can use virtually **any** characters, with the exception of the forward
+slash "/", since in some technical cases a component name needs to specified as a
+"path-like" compound.
+
+
+Characters in Attribute Values
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Character data in string *attribute* values can likewise be read and written simply as
+Python strings.
+
+However they are actually *stored* in an :class:`~ncdata.NcAttribute`'s
+``.value`` as a character array of dtype "<U??" (that is, the dtype does not really
+have a "??", but some definite length). These are returned by
+:meth:`ncdata.NcAttribute.as_python_value` as a simple Python string.
+
+A vector of strings is also a permitted attribute value, but bear in mind that
+**a vector of strings is not currently supported in netCDF4 implementations**.
+Thus, you cannot have an array or list of strings as an attribute value in an actual file,
+and if stored to a file such an attribute will be concatenated into a single string value.
+
+In actual files, Unicode is again supported via UTF-8, and seamlessly encoded/decoded.
+
+
+Characters in Variable Data
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Character data in variable *data* arrays are generally stored as fixed-length arrays of
+characters (i.e. fixed-width strings), and no unicode interpretation is applied by the
+libraries (neither netCDF4 or ncdata). In this case, the strings appear in Python as
+numpy character arrays of dtype "<U1". All elements have the same fixed length, but
+may contain zero bytes so that they convert to variable-width (Python) strings up to a
+maximum width. Trailing characters are filled with "NUL", i.e. "\\0" character
+aka "zero byte". The (maximum) string length is a separate dimension, which is
+recorded as a normal netCDF file dimension like any other.
+
+.. note::
+
+    Although it is not tested, it has proved possible (and useful) at present to load
+    files with variables containing variable-length string data, but it is
+    necessary to supply an explicit user chunking to workaround limitations in Dask.
+    Please see the :ref:`howto example <howto_load_variablewidth_strings>`.
+
+.. warning::
+
+    The netCDF4 package will perform automatic character encoding/decoding of a
+    character variable if it has a special ``_Encoding`` attribute. Ncdata does not
+    currently allow for this. See : :ref:`known-issues`
+
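
A minimal sketch of the fixed-width character behaviour described above, using plain
numpy only (no ncdata-specific API assumed):

.. code-block:: python

    import numpy as np

    # A fixed-width character variable appears as a "<U1" array, with the
    # (maximum) string length as an extra trailing dimension.
    chars = np.array(
        [["c", "a", "t", "\0", "\0"],
         ["h", "o", "r", "s", "e"]],
        dtype="<U1",
    )

    # Trailing positions are NUL ("\0") padded; numpy reads those elements
    # back as empty strings, so joining along the string-length dimension
    # recovers the variable-width Python strings.
    strings = ["".join(row) for row in chars]
    print(strings)   # -> ['cat', 'horse']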

docs/details/details_index.rst

Lines changed: 5 additions & 0 deletions
@@ -1,9 +1,14 @@
 Detail Topics
 =============
+Detail reference topics
+
 .. toctree::
    :maxdepth: 2

+   ../change_log
+   ./known_issues
    ./interface_support
+   ./character_handling
    ./threadlock_sharing
    ./developer_notes

docs/details/developer_notes.rst

Lines changed: 6 additions & 0 deletions
@@ -28,6 +28,12 @@ Documentation build
 Release actions
 ---------------

+#. Update the :ref:`change_log` page in the details section
+
+#. ensure all major changes + PRs are referenced in the :ref:`change_notes` section
+
+#. update the "latest version" stated in the :ref:`development_status` section
+
 #. Cut a release on GitHub : this triggers a new docs version on [ReadTheDocs](https://readthedocs.org/projects/ncdata/)

 #. Build the distribution

docs/details/interface_support.rst

Lines changed: 35 additions & 18 deletions
@@ -14,43 +14,59 @@ Datatypes
 ^^^^^^^^^
 Ncdata supports all the regular datatypes of netcdf, but *not* the
 variable-length and user-defined datatypes.
+Please see : :ref:`data-types`.

-This means, notably, that all string variables will have the basic numpy type
-'S1', equivalent to netcdf 'NC_CHAR'. Thus, multi-character string variables
-must always have a definite "string-length" dimension.

-Attribute values, by contrast, are treated as Python strings with the normal
-variable length support. Their basic dtype can be any numpy string dtype,
-but will be converted when required.
-
-The NetCDF C library and netCDF4-python do not support arrays of strings in
-attributes, so neither does NcData.
-
-
-Data Scaling, Masking and Compression
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Ncdata does not implement scaling and offset within data arrays : The ".data"
+Data Scaling and Masking
+^^^^^^^^^^^^^^^^^^^^^^^^
+Ncdata does not implement scaling and offset within variable data arrays : The ".data"
 array has the actual variable dtype, and the "scale_factor" and
 "add_offset" attributes are treated like any other attribute.

-The existence of a "_FillValue" attribute controls how.. TODO
+Likewise, Ncdata does not use masking within its variable data arrays, so that variable
+data arrays contain "raw" data, which include any "fill" values -- i.e. at any missing
+data points you will have a "fill" value rather than a masked point.
+
+The use of "scale_factor", "add_offset" and "_FillValue" attributes are standard
+conventions described in the NetCDF documentation itself, and implemented by NetCDF
+library software including the Python netCDF4 library. To ignore these default
+interpretations, ncdata has to actually turn these features "off". The rationale for
+this, however, is that the low-level unprocessed data content, equivalent to actual
+file storage, may be more likely to form a stable common basis of equivalence, particularly
+between different system architectures.
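
As a concrete illustration of that "raw data" behaviour, this is roughly the
processing which netCDF4 would apply by default, here done by hand with plain numpy
(the attribute values are invented for the example):

.. code-block:: python

    import numpy as np
    import numpy.ma as ma

    # Raw values, as stored in the file and as ncdata presents them.
    raw = np.array([1200, 1250, -32768], dtype=np.int16)

    # Conventional attribute values -- example numbers only.
    scale_factor, add_offset, fill_value = 0.01, 0.0, -32768

    # Mask the fill points, then apply scale-and-offset: the interpretation
    # which ncdata deliberately turns "off", and leaves to the caller.
    processed = ma.masked_equal(raw, fill_value) * scale_factor + add_offset
    print(processed)   # -> [12.0 12.5 --]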


+.. _file-storage:
+
 File storage control
 ^^^^^^^^^^^^^^^^^^^^
 The :func:`ncdata.netcdf4.to_nc4` cannot control compression or storage options
 provided by :meth:`netCDF4.Dataset.createVariable`, which means you can't
 control the data compression and translation facilities of the NetCDF file
 library.
-If required, you should use :mod:`iris` or :mod:`xarray` for this.
+If required, you should use :mod:`iris` or :mod:`xarray` for this, i.e. use
+:meth:`xarray.Dataset.to_netcdf` or :func:`iris.save` instead of
+:func:`ncdata.netcdf4.to_nc4`, as these provide more special options for controlling
+netcdf file creation.
+
+File-specific storage aspects, such as chunking, data-paths or compression
+strategies, are not recorded in the core objects. However, array representations in
+variable and attribute data (notably dask lazy arrays) may hold such information.
+
+The concept of "unlimited" dimensions is also, you might think, outside the abstract
+model of NetCDF data and not of concern to Ncdata . However, in fact this concept is
+present as a core property of dimensions in the classic NetCDF data model (see
+"Dimension" in the `NetCDF Classic Data Model`_), so that is why it **is** an essential
+property of an NcDimension also.
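
For example, a sketch of routing output through Xarray to gain compression control.
Here "input.nc", "output.nc" and the variable name "x" are placeholders, and
``ncdata.xarray.to_xarray`` is assumed as the counterpart of ``from_xarray``:

.. code-block:: python

    from ncdata.netcdf4 import from_nc4
    from ncdata.xarray import to_xarray   # assumed counterpart of from_xarray

    # Load with ncdata, convert, then let xarray control the file storage.
    ncdata = from_nc4("input.nc")
    ds = to_xarray(ncdata)

    # Compression is applied by xarray's netCDF4 backend, per variable.
    ds.to_netcdf("output.nc", encoding={"x": {"zlib": True, "complevel": 4}})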


 Dask chunking control
 ^^^^^^^^^^^^^^^^^^^^^
 Loading from netcdf files generates variables whose data arrays are all Dask
 lazy arrays. These are created with the "chunks='auto'" setting.
-There is currently no control for this : If required, load via Iris or Xarray
-instead.
+
+However there is a simple per-dimension chunking control available on loading.
+See :func:`ncdata.netcdf4.from_nc4`.
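
A sketch of what that loading control might look like (the ``dim_chunks`` keyword
name here is illustrative only; see the :func:`ncdata.netcdf4.from_nc4` signature
for the actual spelling):

.. code-block:: python

    from ncdata.netcdf4 import from_nc4

    # Default behaviour: variable data become Dask lazy arrays, chunks="auto".
    ncdata = from_nc4("input.nc")

    # Hypothetical per-dimension chunking control; keyword name is illustrative.
    ncdata = from_nc4("input.nc", dim_chunks={"time": 100, "level": -1})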


 Xarray Compatibility

@@ -94,3 +110,4 @@ see : `support added in v3.7.0 <https://scitools-iris.readthedocs.io/en/stable/w


 .. _Continuous Integration testing on GitHub: https://github.com/pp-mo/ncdata/blob/main/.github/workflows/ci-tests.yml
+.. _NetCDF Classic Data Model: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html#classic_model

docs/details/known_issues.rst

Lines changed: 16 additions & 1 deletion
@@ -1,3 +1,5 @@
+.. _known-issues:
+
 Outstanding Issues
 ==================


@@ -21,6 +23,19 @@ To be fixed

 * `issue#66 <https://github.com/pp-mo/ncdata/issues/66>`_

+* in conversion to/from netCDF4 files
+
+  * netCDF4 performs automatic encoding/decoding of byte data to characters, triggered
+    by the existence of an ``_Encoding`` attribute on a character type variable.
+    Ncdata does not currently account for this, and may fail to read/write correctly.
+
+
+.. _todo:
+
+Incomplete Documentation
+^^^^^^^^^^^^^^^^^^^^^^^^
+(PLACEHOLDER: documentation is incomplete, please fix me !)
+

 Identified Design Limitations
 -----------------------------

@@ -36,7 +51,7 @@ There are no current plans to address these, but could be considered in future

 * notably, includes compound and variable-length types

 * ..and especially **variable-length strings in variables**.
-  see : :ref:`string_and_character_data`
+  see : :ref:`string-and-character-data`, :ref:`data-types`


 Features planned

docs/details/threadlock_sharing.rst

Lines changed: 49 additions & 18 deletions
@@ -1,30 +1,23 @@
+.. _thread-safety:
+
 NetCDF Thread Locking
 =====================
-Ncdata includes support for "unifying" the thread-safety mechanisms between
-ncdata and the format packages it supports (Iris and Ncdata).
+Ncdata provides the :mod:`ncdata.threadlock_sharing` module, which can ensure that all
+multiple relevant data-format packages use a "unified" thread-safety mechanism to
+prevent them disturbing each other.

 This concerns the safe use of the common NetCDF library by multiple threads.
 Such multi-threaded access usually occurs when your code has Dask arrays
 created from netcdf file data, which it is either computing or storing to an
 output netcdf file.

-The netCDF4 package (and the underlying C library) does not implement any
-threadlock, neither is it thread-safe (re-entrant) by design.
-Thus contention is possible unless controlled by the calling packages.
-*Each* of the data-format packages (Ncdata, Iris and Xarray) defines its own
-locking mechanism to prevent overlapping calls into the netcdf library.
-
-All 3 data-format packages can map variable data into Dask lazy arrays. Iris and
-Xarray can also create delayed write operations (but ncdata currently does not).
-
-However, those mechanisms cannot protect an operation of that package from
-overlapping with one in *another* package.
+In short, this is not needed when all your data is loaded with only **one** of the data
+packages (Iris, Xarray or ncdata). The problem only occurs when you try to
+realise/calculate/save results which combine data loaded from a mixture of sources.

-The :mod:`ncdata.threadlock_sharing` module can ensure that all of the relevant
-packages use the *same* thread lock,
-so that they can safely co-operate in parallel operations.
+sample code:

-sample code::
+.. code-block:: python

     from ncdata.threadlock_sharing import enable_lockshare, disable_lockshare
     from ncdata.xarray import from_xarray

@@ -40,11 +33,49 @@ sample code::

     disable_lockshare()

-or::
+... *or* ...
+
+.. code-block:: python

     with lockshare_context(iris=True):
         ncdata = NcData(source_filepath)
         ncdata.variables['x'].attributes['units'] = 'K'
         cubes = ncdata.iris.to_iris(ncdata)
         iris.save(cubes, output_filepath)

+
+Background
+^^^^^^^^^^
+In practice, Iris, Xarray and Ncdata are all capable of scanning netCDF files and interpreting their metadata, while
+not reading all the core variable data contained in them.
+
+This generates objects containing Dask :class:`~dask.array.Array`\s, which provide
+deferred access to bulk data in files, with certain key benefits :
+
+* no data loading or calculation happens until needed
+* the work is divided into sectional "tasks", of which only some may ultimately be needed
+* it may be possible to perform multiple sections of calculation (including data fetch) in parallel
+* it may be possible to localise operations (fetch or calculate) near to data distributed across a cluster
+
+Usually, the most efficient parallelisation of array operations is by multi-threading, since that can use memory
+sharing of large data arrays in memory.
+
+However, the python netCDF4 library (and the underlying C library) is not threadsafe
+(re-entrant) by design, neither does it implement any thread locking itself, therefore
+the "netcdf fetch" call in each input operation must be guarded by a mutex.
+Thus, contention is possible unless controlled by the calling packages.
+
+Each of Xarray, Iris and ncdata create input data tasks to fetch sections of data from
+the input files. Each uses a mutex lock around netcdf accesses in those tasks, to stop
+them accessing the netCDF4 interface at the same time as any of the others.
+
+This works beautifully until ncdata connects (for example) lazy data loaded *with Iris*
+with lazy data loaded *from Xarray*. These would then unfortunately each be using their
+own *separate* mutexes to protect the same netcdf library. So, if we then attempt to
+calculate or save the result, which combines data from both sources, we could get
+sporadic and unpredictable system-level errors, even a core-dump type failure.
+
+So, the function of :mod:`ncdata.threadlock_sharing` is to connect the thread-locking
+schemes of the separate libraries, so that they cannot accidentally overlap an access
+call in a different thread *from the other package*, just as they already cannot
+overlap *one of their own*.
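
The core point can be illustrated with plain :mod:`threading`, independent of any of
the packages: two *different* mutexes do not exclude each other, only a *shared* one
does. This is a toy sketch, not ncdata's actual mechanism:

.. code-block:: python

    import threading

    iris_lock = threading.Lock()     # stand-in for Iris's private netcdf lock
    xarray_lock = threading.Lock()   # stand-in for Xarray's private netcdf lock

    def netcdf_fetch(lock: threading.Lock, who: str) -> None:
        # Each package guards its own netcdf calls with its own lock ...
        with lock:
            print(f"{who}: inside the netcdf library")

    # With two different locks, both threads can be "inside" at once:
    # nothing serialises iris_lock against xarray_lock.
    threads = [
        threading.Thread(target=netcdf_fetch, args=(iris_lock, "iris-task")),
        threading.Thread(target=netcdf_fetch, args=(xarray_lock, "xarray-task")),
    ]

    # Sharing a single lock, as threadlock_sharing arranges, restores exclusion.
    shared_lock = threading.Lock()
    threads += [
        threading.Thread(target=netcdf_fetch, args=(shared_lock, "iris-task")),
        threading.Thread(target=netcdf_fetch, args=(shared_lock, "xarray-task")),
    ]

    for t in threads:
        t.start()
    for t in threads:
        t.join()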

docs/index.rst

Lines changed: 3 additions & 2 deletions
@@ -38,8 +38,9 @@
 User Guide <./userdocs/user_guide/user_guide>


-Reference
----------
+Reference Documentation
+-----------------------
+
 .. toctree::
    :maxdepth: 2

docs/userdocs/getting_started/installation.rst

Lines changed: 18 additions & 3 deletions
@@ -4,13 +4,28 @@ Ncdata is available on PyPI and conda-forge

 Install from conda-forge with conda
 -----------------------------------
-Like this::
-    conda install -c conda-forge ncdata
+Like this:
+
+.. code-block:: bash
+
+    $ conda install -c conda-forge ncdata


 Install from PyPI with pip
 --------------------------
-Like this::
+Like this:
+
+.. code-block:: bash
+
     pip install ncdata


+Check install
+^^^^^^^^^^^^^
+
+.. code-block:: bash
+
+    $ python -c "from ncdata import NcData; print(NcData())"
+    <NcData: <'no-name'>
+    >
+
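
Beyond the install check, a minimal round-trip sketch using the
:mod:`ncdata.netcdf4` functions referenced elsewhere in these docs
("input.nc" is a placeholder path, and the ``to_nc4`` argument order is assumed):

.. code-block:: python

    from ncdata.netcdf4 import from_nc4, to_nc4

    # Load a file into an NcData object (variable data as Dask lazy arrays),
    # inspect it, then write it straight back out.
    ncdata = from_nc4("input.nc")
    print(ncdata)
    to_nc4(ncdata, "copy.nc")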
