Commit c1a9ab9

committed
State 20180129
1 parent e71963b commit c1a9ab9

15 files changed

+3930
-4234
lines changed

src/main/xar-resources/data/backup/backup.xml

Lines changed: 302 additions & 293 deletions
Large diffs are not rendered by default.

src/main/xar-resources/data/beginners-guide-to-xrx-v4/beginners-guide-to-xrx-v4.xml

Lines changed: 697 additions & 930 deletions
Large diffs are not rendered by default.

src/main/xar-resources/data/building/building.xml

Lines changed: 0 additions & 580 deletions
This file was deleted.

src/main/xar-resources/data/configuration/configuration.xml

Lines changed: 1206 additions & 1150 deletions
Large diffs are not rendered by default.
Lines changed: 77 additions & 103 deletions
@@ -1,119 +1,93 @@
11
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng"
22
schematypens="http://relaxng.org/ns/structure/1.0"?><?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" type="application/xml"
33
schematypens="http://purl.oclc.org/dsdl/schematron"?><article xmlns="http://docbook.org/ns/docbook" version="5.0">
4-
<info>
5-
<title>Extracting Content from Binary Files</title>
6-
<date>September 2012</date>
7-
<keywordset>
8-
<keyword>TBD</keyword>
9-
</keywordset>
10-
</info>
4+
<info>
5+
<title>Extracting Content from Binary Files</title>
6+
<date>1Q18</date>
7+
<keywordset>
8+
<keyword>application-development</keyword>
9+
</keywordset>
10+
</info>
1111

12-
<!-- ================================================================== -->
12+
<!-- ================================================================== -->
1313

14-
<sect1>
15-
<title>Overview</title>
14+
<para>The Content Extraction module extends eXist-db's XML abilities to binary files. The module contains functions for extracting the content of
15+
the binary files, and returning the content as XML. In this form, the content can then be queried, indexed, and manipulated. It is especially useful
16+
in conjunction with Lucene indexes.</para>
17+
<para>The Content Extraction module is built on the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://tika.apache.org/">Apache
18+
Tika</link> library. Tika understands a large variety of formats, ranging from PDF documents to spreadsheets and image metadata.</para>
1619

17-
<para>eXist-db excels at querying, indexing, and manipulating XML files. The Content
18-
Extraction module extends eXist-db's XML abilities to binary files. The module
19-
contains functions for extracting the content of the binary files, and returning the
20-
content as XML. In this form, the content can then be queried, indexed, and
21-
manipulated. This module is built on the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://tika.apache.org/">Apache
22-
Tika</link> library. Tika understands a large variety of formats, ranging from
23-
PDF documents to spreadsheets and image metadata.</para>
24-
</sect1>
20+
<!-- ================================================================== -->
2521

26-
<!-- ================================================================== -->
22+
<sect1>
23+
<title>Enabling the Content Extraction Module</title>
2724

28-
<sect1>
29-
<title>Enabling the Content Extraction Module</title>
25+
<para>To enable the content extraction module, edit <code>$EXIST_HOME/extensions/build.properties</code> and set the
26+
<code>include.feature.contentextraction</code> property to true:</para>
27+
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-1.txt"/>
28+
<para>Next, run <code>bin/build.sh</code> or <code>bin/build.bat</code> to build the module. You should see in the output the various libraries
29+
from the Tika project downloaded and installed.</para>
30+
</sect1>
3031

31-
<para>To enable the content extraction module, edit
32-
EXIST_HOME/extensions/build.properties and set the include.feature.contentextraction
33-
property to true:</para>
34-
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-1.txt"/>
35-
<para>Next, call build.sh/build.bat from eXist's top directory to build the module. You
36-
should see in the output how the various libraries from the Tika project are
37-
downloaded and installed.</para>
38-
</sect1>
32+
<!-- ================================================================== -->
3933

40-
<!-- ================================================================== -->
34+
<sect1>
35+
<title>Usage</title>
4136

42-
<sect1>
43-
<title>Usage</title>
37+
<para>To import the module use the following import statement:</para>
38+
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-2.txt"/>
4439

45-
<para>To import the module use an import statement as follows:</para>
46-
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-2.txt"/>
47-
<para>The module provides three functions:</para>
48-
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-3.txt"/>
49-
<para>The first two functions need little explanation: get-metadata just returns some
50-
metadata extracted from the resource, while get-metadata-and-content will also
51-
provide the text body of the resource—if there is any. The third function is a
52-
streaming variant of the other two and is used to process larger resources, whose
53-
content may not fit into memory.</para>
54-
<para>All functions produce XHTML. The metadata will be contained in the HTML head, the
55-
contents go into the body. The structure of the body HTML varies depending on the
56-
media type of the binary file. For example, the HTML resulting from a PDF is a
57-
sequence of divs (one per page), but that of a word processing document is more
58-
often a sequence of paragraphs.</para>
59-
</sect1>
40+
<para>The module provides three functions:</para>
41+
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-3.txt"/>
6042

61-
<!-- ================================================================== -->
43+
<para>The first two functions need little explanation: <code>get-metadata()</code> just returns some metadata extracted from the resource, while
44+
<code>get-metadata-and-content()</code> will also provide the text body of the resource (if any). The third function is a streaming variant of
45+
the other two and is used to process larger resources whose content may not fit into memory.</para>
46+
<para>All functions produce XHTML. The metadata is contained in the HTML head, and the contents go into the body. The structure of the body
47+
HTML varies depending on the media type of the binary file. For example, the HTML resulting from a PDF is a sequence of <tag>div</tag> elements,
48+
one per page. That of a word processing document is often a sequence of paragraphs.</para>
49+
</sect1>
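
The extraction step described in the Usage section can be sketched in XQuery as follows. This is a minimal sketch, not the project's own listing: the resource path is hypothetical, the `content` prefix is an arbitrary choice, and the namespace URI and Java class are assumed to match the import statement in `listings/listing-2.txt`. Namespace wildcards (`*:`) are used so the query does not depend on the namespace of the returned XHTML.

```xquery
xquery version "3.1";

(: Assumption: namespace URI and Java class as published for eXist-db's
   content extraction module; the "content" prefix is arbitrary. :)
import module namespace content = "http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: Hypothetical path: replace with a binary resource stored in your database. :)
let $path := "/db/apps/demo/test.pdf"

(: Load the binary resource as xs:base64Binary. :)
let $binary := util:binary-doc($path)

(: Extract metadata (HTML head) and text content (HTML body) as XHTML. :)
let $xhtml := content:get-metadata-and-content($binary)

return
    <extracted path="{$path}">
        <!-- metadata entries found in the head -->
        <meta-count>{ count($xhtml//*:head/*:meta) }</meta-count>
        <!-- for a PDF, one div per page is expected in the body -->
        <div-count>{ count($xhtml//*:body/*:div) }</div-count>
    </extracted>
```

Calling `content:get-metadata($binary)` instead would be expected to return only the head portion, which is cheaper when the body text is not needed.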
6250

63-
<sect1>
64-
<title>Storage and Indexing Strategies</title>
51+
<!-- ================================================================== -->
6552

66-
<para>While you could decide to just store the HTML returned by the content extraction
67-
functions as an XML resource into the database, this may not be efficient for
68-
certain applications. For example, a document search applications may not need to retain the
69-
extracted HTML.</para>
70-
<para>In such cases the ft:index() function from the full text indexing module can be
71-
useful. This function allows users to associate a full text index with any database
72-
resource, be it binary or XML. The index will be linked to the resource, meaning
73-
that the same permissions apply; if the resource is deleted, the index will be
74-
removed as well.</para>
75-
<para>To create an index, call the index function with the following arguments: </para>
76-
<orderedlist>
77-
<listitem>
78-
<para>The path of the resource to which the index should be linked as a
79-
string.</para>
80-
</listitem>
81-
<listitem>
82-
<para>An XML fragment describing the fields you want to add and the text
83-
content to index.</para>
84-
</listitem>
85-
</orderedlist>
86-
<para>For example, to associate an index with the document test.txt one may call index
87-
as follows: </para>
88-
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-4.txt"/>
89-
<para>This creates a lucene index document, indexes the content using the configured
90-
analyzers, and links it to the eXist document with the given path. You may link more
91-
than one Lucene document to the same eXist resource. The field elements map to
92-
Lucene fields. You can use as many fields as you want or add multiple fields with
93-
the same name. The store="yes" attribute tells the indexer to also store the text
94-
string, so you can retrieve it later. It is also possible to configure the analyzers
95-
used by Lucene for indexing a given feed as well as other options in the collection
96-
configuration. To query the created index, use the search function:
97-
</para>
98-
<programlisting>ft:search("/db/apps/demo/test.txt", "para:paragraph and title:indexing")</programlisting>
99-
<para>
100-
The first parameter is the path to the resource or collection to query, the second
101-
specifies a Lucene query string. Note how we prefix the query term by the name of
102-
the field. Executing this query returns: </para>
103-
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-6.txt"/>
104-
<para>Each matching resource is described by a search element. The score attribute
105-
expresses the relevance lucene computed for the resource (the higher the better).
106-
Within the search element, every field which contributed to the query result is
107-
returned, but only if store="yes" was defined for this field at indexing time (if
108-
not, the field content won't be available). Note how the matches in the text are
109-
enclosed in match elements, just as if you did a full text query on an XML document.
110-
This makes it easy to post-process the query result, for example to create a
111-
keywords in context display using eXist's standard KWIC module.</para>
112-
<para>The document the index is linked to does not need to be a binary resource. One can
113-
also create additional indexes on xml documents. This is a useful feature, because
114-
it allows us to index and query information which is not directly contained in the
115-
XML itself. For example, one could add metadata fields and retrieve them later using
116-
get-field. Or we could use fields to pre-process and normalize information already
117-
present in the XML to speed up later access.</para>
118-
</sect1>
53+
<sect1>
54+
<title>Storage and Indexing Strategies</title>
55+
56+
<para>While you could decide to just store the HTML returned by the content extraction functions as an XML resource into the database, this may
57+
not be efficient. For example, a document search application may not need to retain the extracted HTML.</para>
58+
<para>In such cases the <code>ft:index()</code> function from the full text indexing module can be useful. This function allows users to associate
59+
a full text index with any database resource, be it binary or XML. The index will be linked to the resource.</para>
60+
<para>To create an index, call the function with the following arguments: </para>
61+
<orderedlist>
62+
<listitem>
63+
<para>The path of the resource to which the index should be linked as a string.</para>
64+
</listitem>
65+
<listitem>
66+
<para>An XML fragment describing the fields you want to add and the text content to index.</para>
67+
</listitem>
68+
</orderedlist>
69+
<para>For example, to associate an index with the document <code>test.txt</code>, call <code>ft:index()</code> as follows: </para>
70+
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-4.txt"/>
71+
72+
<para>This creates a Lucene index document, indexes the content using the configured analyzers, and links it to the eXist document with the given
73+
path. You may link more than one Lucene document to the same eXist resource. The field elements map to Lucene fields. You can use as many fields
74+
as you want or add multiple fields with the same name.</para>
75+
<para>The <code>store="yes"</code> attribute tells the indexer to also store the text string, so you can retrieve it later.</para>
76+
<para>To query the created index, use the <code>ft:search()</code> function: </para>
77+
<programlisting>ft:search("/db/apps/demo/test.txt", "para:paragraph and title:indexing")</programlisting>
78+
<para>The first parameter is the path to the resource or collection to query. The second specifies a Lucene query string. Note how we prefix the
79+
query term by the name of the field.</para>
80+
<para>Executing this query returns:</para>
81+
<programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-6.txt"/>
82+
83+
<para>Each matching resource is described by a search element. The score attribute expresses the relevance Lucene computed for the resource (the
84+
higher the better). Within the search element, every field which contributed to the query result is returned, but only if
85+
<code>store="yes"</code> was defined for this field at indexing time.</para>
86+
<para> Note how the matches in the text are enclosed in <tag>match</tag> elements, just as if you did a full text query on an XML document. This
87+
makes it easy to post-process the query result, for example to create a keywords in context display using eXist's standard KWIC module.</para>
88+
<para>The document the index is linked to does not need to be a binary resource. One can also create additional indexes on XML documents. This is
89+
a useful feature, because it allows us to index and query information which is not directly contained in the XML itself. For example, one could
90+
add metadata fields and retrieve them later using <code>ft:get-field()</code>. Or we could use fields to pre-process and normalize information already
91+
present in the XML to speed up later access.</para>
92+
</sect1>
11993
</article>
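
The indexing and search steps described under "Storage and Indexing Strategies" can be combined into one query. This is a sketch under stated assumptions: the field values are invented for illustration, and the `doc` wrapper element around the `field` elements is assumed to match the fragment in `listings/listing-4.txt`. The path, the field names, and the query string are taken from the `ft:search()` call in the text.

```xquery
xquery version "3.1";

(: Path taken from the ft:search() example in the text. :)
let $path := "/db/apps/demo/test.txt"

(: Assumption: fields are wrapped in a doc element. The field values are
   illustrative only; store="yes" keeps the text so it can be returned
   with search results. :)
let $fields :=
    <doc>
        <field name="title" store="yes">Indexing a binary resource</field>
        <field name="para" store="yes">This paragraph is stored and can be shown in results.</field>
    </doc>

return (
    (: Link a Lucene index document to the resource at $path. :)
    ft:index($path, $fields),
    (: Query the linked index; each term is prefixed by its field name. :)
    ft:search($path, "para:paragraph and title:indexing")
)
```

Because both fields are declared with `store="yes"`, the result should include their text with the matching terms wrapped in `match` elements, as the article describes.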
