Commit 9934bff

Add a chapter on compliance to the docs
Details how to filter warnings or convert them into exceptions and details the class hierarchy of warnings and errors defined by compoundfiles.
1 parent a44bb44 commit 9934bff

12 files changed: +550 -38 lines changed

README.rst

Lines changed: 14 additions & 11 deletions
@@ -4,16 +4,18 @@ compoundfiles
 
 |pypi| |rtd| |travis|
 
-This package provides a library for reading Microsoft's `OLE Compound
-Document`_ format, which also forms the basis of the `Advanced Authoring
-Format`_ (AAF) published by Microsoft Corporation. It is compatible with
-Python 2.7 (or above) and Python 3.2 (or above).
-
-The code is pure Python and should run on any platform. The library has an
-emphasis on rigour and performs numerous validity checks on opened files. By
-default, the library merely warnings when it comes across non-fatal errors in
-source files but this behaviour is configurable by developers through Python's
-``warnings`` mechanisms.
+This package provides a library for reading Microsoft's `Compound File Binary`_
+format (CFB), formerly known as `OLE Compound Documents`_, the `Advanced
+Authoring Format`_ (AAF), or just plain old Microsoft Office files (the non-XML
+sort). This format is also widely used with certain media systems and a number
+of scientific applications (tomography and microscopy).
+
+The code is pure Python and should run on any platform; it is compatible with
+Python 2.7 (or above) and Python 3.2 (or above). The library has an emphasis
+on rigour and performs numerous validity checks on opened files. By default,
+the library merely warns when it comes across non-fatal errors in source files
+but this behaviour is configurable by developers through Python's ``warnings``
+mechanisms.
 
 Links
 =====
@@ -28,7 +30,8 @@ Links
 .. _documentation: http://compound-files.readthedocs.org/
 .. _source code: https://github.com/waveform80/compoundfiles
 .. _bug tracker: https://github.com/waveform80/compoundfiles/issues
-.. _OLE Compound Document: http://www.openoffice.org/sc/compdocfileformat.pdf
+.. _Compound File Binary: http://msdn.microsoft.com/en-gb/library/dd942138.aspx
+.. _OLE Compound Documents: http://www.openoffice.org/sc/compdocfileformat.pdf
 .. _Advanced Authoring Format: http://www.amwa.tv/downloads/specifications/aafcontainerspec-v1.0.1.pdf
 .. _MIT license: http://opensource.org/licenses/MIT
 .. _build status: https://travis-ci.org/waveform80/compoundfiles

compoundfiles/reader.py

Lines changed: 16 additions & 17 deletions
@@ -77,25 +77,24 @@
 )
 
 
-# A quick personal rant: the AAF or OLE Compound Document format is yet another
-# example of bad implementations of a bad specification (thanks Microsoft! See
-# the W3C log file format for previous examples of MS' incompetence in this
-# area)...
+# Good grief! Since my last in-source rant it appears someone in MS actually
+# figured out how to write a decent spec! Unfortunately it appears someone in
+# the marketing department also thought that yet another name change was in
+# order so the Advanced Authoring Format (formerly known as OLE Compound
+# Documents) is now known as the Compound File Binary File Format.
 #
-# The specification doesn't try and keep the design simple (the DIFAT could be
-# fully in the header or partially in the header, and the header itself doesn't
-# necessarily match the sector size), whoever wrote the spec didn't quite
-# understand what version numbers are used for (several versions exist, but the
-# spec doesn't specify exactly which bits of the header became relevant in
-# which versions), and the spec has huge amounts of redundancy (always fun as
-# it inevitably leads to implementations getting one bit right and another bit
-# wrong, leaving readers to guess which is correct).
+# Anyway, silly name changes aside, the point is that someone's actually
+# written a decent spec this time rather than the half-assed AAF spec which
+# read like adhoc notes on a reference implementation. The URL is (currently)
 #
-# TL;DR: if you're looking for a nice fast binary format with good random
-# access characteristics this may look attractive, but please don't use it.
-# Ideally, loop-mounting a proper file-system would be the way to go, although
-# it generally involves jumping through several hoops due to mount being a
-# privileged operation.</rant>
+# http://msdn.microsoft.com/en-gb/library/dd942138.aspx
+#
+# But given how MSDN changes its URLs you might just be better off Googling for
+# "MS CFB" which'll find it (assuming they haven't changed the name again for
+# kicks). The file format is still a pile of steaming underwear in places
+# (unicode names with byte-length fields...) but as long as the spec is clear
+# and well written I can forgive that (after all, it's hard to change something
+# as established as this).
 #
 # In the interests of trying to keep naming vaguely consistent and sensible
 # here's a translation list with the names we'll be using first and the names

docs/Makefile

Lines changed: 19 additions & 5 deletions
@@ -6,6 +6,10 @@ SPHINXOPTS =
 SPHINXBUILD = sphinx-build
 PAPER =
 BUILDDIR = _build
+DOT_DIAGRAMS = $(wildcard *.dot)
+MSC_DIAGRAMS = $(wildcard *.mscgen)
+SVG_IMAGES = $(wildcard *.svg) $(DOT_DIAGRAMS:%.dot=%.svg) $(MSC_DIAGRAMS:%.mscgen=%.svg)
+PDF_IMAGES = $(SVG_IMAGES:%.svg=%.pdf)
 
 # Internal variables.
 PAPEROPT_a4 = -D latex_paper_size=a4
@@ -41,17 +45,17 @@ help:
 clean:
 	-rm -rf $(BUILDDIR)/*
 
-html:
+html: $(SVG_IMAGES)
 	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
 	@echo
 	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
 
-dirhtml:
+dirhtml: $(SVG_IMAGES)
 	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
 	@echo
 	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
 
-singlehtml:
+singlehtml: $(SVG_IMAGES)
 	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
 	@echo
 	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
@@ -95,14 +99,14 @@ epub:
 	@echo
 	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
 
-latex:
+latex: $(PDF_IMAGES)
 	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
 	@echo
 	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
 	@echo "Run \`make' in that directory to run these through (pdf)latex" \
 	      "(use \`make latexpdf' here to do that automatically)."
 
-latexpdf:
+latexpdf: $(PDF_IMAGES)
 	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
 	@echo "Running LaTeX files through pdflatex..."
 	$(MAKE) -C $(BUILDDIR)/latex all-pdf
@@ -151,3 +155,13 @@ doctest:
 	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
 	@echo "Testing of doctests in the sources finished, look at the " \
 	      "results in $(BUILDDIR)/doctest/output.txt."
+
+%.svg: %.mscgen
+	mscgen -T svg -o $@ $<
+
+%.svg: %.dot
+	dot -T svg -o $@ $<
+
+%.pdf: %.svg
+	inkscape -A $@ $<
+

docs/compliance.rst

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+.. _compliance:
+
+=====================
+Compliance mechanisms
+=====================
+
+As noted in the `CFB`_ specification, the compound document format presents a
+number of validation challenges. For example, maliciously constructed files
+might include circular references in their FAT table, leading a naive reader
+into an infinite loop, or they may allocate a large number of DIFAT sectors
+hoping to cause resource exhaustion when the reader goes to allocate memory for
+reading the FAT.
+
+The compoundfiles library goes to some lengths to detect erroneous structures
+(whether malicious in intent or otherwise) and work around them where possible.
+Some issues are considered fatal and will always raise an exception (circular
+chains in the FAT are an example of this). Other issues are considered
+non-fatal and will raise a warning (unusual sector sizes are an example of
+this). Python :mod:`warnings` are a special sort of exception with particularly
+flexible handling.
+
+With Python's defaults, a specific warning will print a message to the console
+the first time it is encountered and will then do nothing if it's encountered
+again (this avoids spamming the console in case a warning is raised in a tight
+loop). With some simple code, you can specify alternative behaviours: warnings
+can be raised as full-blown exceptions, or suppressed entirely. The
+compoundfiles library defines a large hierarchy of errors and warnings to
+enable developers to fine-tune their handling.
+
+For example, consider a developer writing an application for working with
+computed tomography (CT) scans. The files produced by the scanner's software
+are compound documents, but they use an unusual sector size. Whenever the
+developer's Python script opens a file the following warning is emitted::
+
+    /usr/lib/pyshared/python2.7/compoundfiles/compoundfiles/reader.py:275: CompoundFileSectorSizeWarning: unexpected sector size in v3 file (1024)
+
+Other than this, the script runs successfully. The developer decides the
+warning is unimportant (after all there's nothing he can do about it given he
+can't change the scanner's software) and wishes to suppress it entirely, so he
+adds the following lines to the top of his script::
+
+    import warnings
+    import compoundfiles as cf
+
+    warnings.filterwarnings('ignore', category=cf.CompoundFileSectorSizeWarning)
+
+Another developer is working on a file validation service. She wishes to use
+the compoundfiles library to extract and examine the contents of such files.
+For safety, she decides to treat any violation of the specification as an
+error, so she adds the following lines to the top of her script to tell Python
+to convert all compound file warnings into exceptions::
+
+    import warnings
+    import compoundfiles as cf
+
+    warnings.filterwarnings('error', category=cf.CompoundFileWarning)
+
+The class hierarchies for compoundfiles warnings and errors are illustrated
+below:
+
+.. image:: warnings.*
+    :align: center
+
+.. image:: errors.*
+    :align: center
+
+To set filters on all warnings in the hierarchy, simply use the category
+:exc:`~compoundfiles.CompoundFileWarning`. Otherwise, you can use intermediate
+or leaf classes in the hierarchy for more specific filters. Likewise, when
+catching exceptions you can target the root of the hierarchy
+(:exc:`~compoundfiles.CompoundFileError`) to catch any error that the
+compoundfiles library might raise, or a more specific class to deal with a
+particular error.
+
+.. _CFB: http://msdn.microsoft.com/en-gb/library/dd942138.aspx
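
The two filtering scenarios in this chapter can be tried without a problem file to hand. The sketch below is only an illustration: the two classes are hypothetical stand-ins that reuse the names from the chapter (the real ones live in compoundfiles), open_file() is an invented placeholder for whatever call triggers the warning, and warnings.catch_warnings() is used so the filters do not leak into the rest of the program.

```python
import warnings


# Hypothetical stand-ins for the library's classes (assumption: the real
# CompoundFileSectorSizeWarning derives from CompoundFileWarning).
class CompoundFileWarning(Warning):
    """Stand-in for the root of the warning hierarchy."""


class CompoundFileSectorSizeWarning(CompoundFileWarning):
    """Stand-in for the sector size warning quoted in the chapter."""


def open_file():
    """Invented placeholder: emits a non-fatal warning, then carries on."""
    warnings.warn('unexpected sector size in v3 file (1024)',
                  CompoundFileSectorSizeWarning)
    return 'opened'


# Scanner developer: suppress one specific warning class.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')  # record anything that is not ignored
    warnings.filterwarnings('ignore', category=CompoundFileSectorSizeWarning)
    ignored_result = open_file()
ignored_count = len(caught)

# Validation service: promote the whole hierarchy to exceptions.
with warnings.catch_warnings():
    warnings.filterwarnings('error', category=CompoundFileWarning)
    try:
        open_file()
        strict_result = 'opened'
    except CompoundFileWarning as exc:
        strict_result = 'rejected: %s' % exc
```

Filtering on the root class in the second case also matches the sector size warning because filters match subclasses of the given category.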

docs/errors.dot

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+digraph G {
+    graph [rankdir="LR"];
+
+    node [shape=rect,style=filled,color="#000000",fillcolor="#99aadd",fontname=Arial,fontsize=12.0];
+    CompoundFileError->IOError;
+    CompoundFileHeaderError->CompoundFileError;
+    CompoundFileMasterFatError->CompoundFileError;
+    CompoundFileNormalFatError->CompoundFileError;
+    CompoundFileMiniFatError->CompoundFileError;
+    CompoundFileDirEntryError->CompoundFileError;
+    CompoundFileInvalidMagicError->CompoundFileHeaderError;
+    CompoundFileInvalidBomError->CompoundFileHeaderError;
+    CompoundFileLargeNormalFatError->CompoundFileNormalFatError;
+    CompoundFileNormalLoopError->CompoundFileNormalFatError;
+    CompoundFileLargeMiniFatError->CompoundFileMiniFatError;
+    CompoundFileNoMiniFatError->CompoundFileMiniFatError;
+    CompoundFileMasterLoopError->CompoundFileMasterFatError;
+    CompoundFileDirLoopError->CompoundFileDirEntryError;
+    CompoundFileNotFoundError->CompoundFileError;
+    CompoundFileNotStreamError->CompoundFileError;
+}
+
docs/errors.pdf

14.7 KB
Binary file not shown.
