11<?xml-model href =" http://docbook.org/xml/5.0/rng/docbook.rng"
22 schematypens =" http://relaxng.org/ns/structure/1.0" ?><?xml-model href =" http://docbook.org/xml/5.0/rng/docbook.rng" type =" application/xml"
33 schematypens =" http://purl.oclc.org/dsdl/schematron" ?><article xmlns =" http://docbook.org/ns/docbook" version =" 5.0" >
4- <info >
5- <title >Extracting Content from Binary Files</title >
6- <date >September 2012 </date >
7- <keywordset >
8- <keyword >TBD </keyword >
9- </keywordset >
10- </info >
4+ <info >
5+ <title >Extracting Content from Binary Files</title >
6+ <date >1Q18 </date >
7+ <keywordset >
8+ <keyword >application-development </keyword >
9+ </keywordset >
10+ </info >
1111
12- <!-- ================================================================== -->
12+ <!-- ================================================================== -->
1313
14- <sect1 >
15- <title >Overview</title >
14+ <para >The Content Extraction module extends eXist-db's XML abilities to binary files. The module contains functions for extracting the content of
15+ the binary files, and returning the content as XML. In this form, the content can then be queried, indexed, and manipulated. It is especially useful
16+ in conjunction with Lucene indexes.</para >
17+ <para >The Content Extraction module is built on the <link xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" http://tika.apache.org/" >Apache
18+ Tika</link > library. Tika understands a large variety of formats, ranging from PDF documents to spreadsheets and image metadata.</para >
1619
17- <para >eXist-db excels at querying, indexing, and manipulating XML files. The Content
18- Extraction module extends eXist-db's XML abilities to binary files. The module
19- contains functions for extracting the content of the binary files, and returning the
20- content as XML. In this form, the content can then be queried, indexed, and
21- manipulated. This module is built on the <link xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" http://tika.apache.org/" >Apache
22- Tika</link > library. Tika understands a large variety of formats, ranging from
23- PDF documents to spreadsheets and image metadata.</para >
24- </sect1 >
20+ <!-- ================================================================== -->
2521
26- <!-- ================================================================== -->
22+ <sect1 >
23+ <title >Enabling the Content Extraction Module</title >
2724
28- <sect1 >
29- <title >Enabling the Content Extraction Module</title >
25+ <para >To enable the content extraction module, edit <code >$EXIST_HOME/extensions/build.properties</code > and set the
26+ <code >include.feature.contentextraction</code > property to true:</para >
27+ <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-1.txt" />
28+ <para >Next, run <code >bin/build.sh</code > or <code >bin/build.bat</code > to build the module. You should see in the output the various libraries
29+ from the Tika project downloaded and installed.</para >
30+ </sect1 >
3031
31- <para >To enable the content extraction module, edit
32- EXIST_HOME/extensions/build.properties and set the include.feature.contentextraction
33- property to true:</para >
34- <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-1.txt" />
35- <para >Next, call build.sh/build.bat from eXist's top directory to build the module. You
36- should see in the output how the various libraries from the Tika project are
37- downloaded and installed.</para >
38- </sect1 >
32+ <!-- ================================================================== -->
3933
40- <!-- ================================================================== -->
34+ <sect1 >
35+ <title >Usage</title >
4136
42- <sect1>
43- <title>Usage</title>
37+ <para>To import the module use the following import statement:</para>
38+ <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-2.txt"/>
4439
45- <para >To import the module use an import statement as follows:</para >
46- <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-2.txt" />
47- <para >The module provides three functions:</para >
48- <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-3.txt" />
49- <para >The first two functions need little explanation: get-metadata just returns some
50- metadata extracted from the resource, while get-metadata-and-content will also
51- provide the text body of the resource—if there is any. The third function is a
52- streaming variant of the other two and is used to process larger resources, whose
53- content may not fit into memory.</para >
54- <para >All functions produce XHTML. The metadata will be contained in the HTML head, the
55- contents go into the body. The structure of the body HTML varies depending on the
56- media type of the binary file. For example, the HTML resulting from a PDF is a
57- sequence of divs (one per page), but that of a word processing document is more
58- often a sequence of paragraphs.</para >
59- </sect1 >
40+ <para >The module provides three functions:</para >
41+ <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-3.txt" />
6042
61- <!-- ================================================================== -->
43+ <para >The first two functions need little explanation: <code >get-metadata()</code > just returns some metadata extracted from the resource, while
44+ <code >get-metadata-and-content()</code > will also provide the text body of the resource (if any). The third function is a streaming variant of
45+ the other two and is used to process larger resources whose content may not fit into memory.</para >
46+ <para >All functions produce XHTML. The metadata is contained in the HTML head; the content goes into the body. The structure of the body
47+ HTML varies depending on the media type of the binary file. For example, the HTML resulting from a PDF is a sequence of <tag >div</tag > elements,
48+ one per page. That of a word processing document is often a sequence of paragraphs.</para >
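<para>As a minimal sketch of the functions described above, the following query extracts metadata and text from a stored PDF. The namespace URI and Java class in the import statement, the use of <code>util:binary-doc()</code> to load the binary, and the path <code>/db/apps/demo/sample.pdf</code> are assumptions made for illustration; the import statement in listing-2 remains the authoritative one.</para>
<programlisting language="xquery"><![CDATA[xquery version "3.0";

(: Assumed import; prefer the statement given in listing-2 if it differs :)
import module namespace contentextraction="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

declare namespace html="http://www.w3.org/1999/xhtml";

(: Hypothetical path to a binary resource stored in the database :)
let $binary := util:binary-doc("/db/apps/demo/sample.pdf")
(: Returns XHTML: metadata in the head, extracted text in the body :)
let $xhtml := contentextraction:get-metadata-and-content($binary)
return
    <result>
        <metadata>{ $xhtml//html:head/html:meta }</metadata>
        <pages>{ count($xhtml//html:body/html:div) }</pages>
    </result>]]></programlisting>
<para>Counting the top-level <tag>div</tag> elements only approximates a page count for PDF input; other media types may produce a different body structure.</para>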
49+ </sect1 >
6250
63- <sect1 >
64- <title >Storage and Indexing Strategies</title >
51+ <!-- ================================================================== -->
6552
66- <para >While you could decide to just store the HTML returned by the content extraction
67- functions as an XML resource into the database, this may not be efficient for
68- certain applications. For example, a document search applications may not need to retain the
69- extracted HTML.</para >
70- <para >In such cases the ft:index() function from the full text indexing module can be
71- useful. This function allows users to associate a full text index with any database
72- resource, be it binary or XML. The index will be linked to the resource, meaning
73- that the same permissions apply; if the resource is deleted, the index will be
74- removed as well.</para >
75- <para >To create an index, call the index function with the following arguments: </para >
76- <orderedlist >
77- <listitem >
78- <para >The path of the resource to which the index should be linked as a
79- string.</para >
80- </listitem >
81- <listitem >
82- <para >An XML fragment describing the fields you want to add and the text
83- content to index.</para >
84- </listitem >
85- </orderedlist >
86- <para >For example, to associate an index with the document test.txt one may call index
87- as follows: </para >
88- <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-4.txt" />
89- <para >This creates a lucene index document, indexes the content using the configured
90- analyzers, and links it to the eXist document with the given path. You may link more
91- than one Lucene document to the same eXist resource. The field elements map to
92- Lucene fields. You can use as many fields as you want or add multiple fields with
93- the same name. The store="yes" attribute tells the indexer to also store the text
94- string, so you can retrieve it later. It is also possible to configure the analyzers
95- used by Lucene for indexing a given feed as well as other options in the collection
96- configuration. To query the created index, use the search function:
97- </para >
98- <programlisting >ft:search("/db/apps/demo/test.txt", "para:paragraph and title:indexing")</programlisting >
99- <para >
100- The first parameter is the path to the resource or collection to query, the second
101- specifies a Lucene query string. Note how we prefix the query term by the name of
102- the field. Executing this query returns: </para >
103- <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-6.txt" />
104- <para >Each matching resource is described by a search element. The score attribute
105- expresses the relevance lucene computed for the resource (the higher the better).
106- Within the search element, every field which contributed to the query result is
107- returned, but only if store="yes" was defined for this field at indexing time (if
108- not, the field content won't be available). Note how the matches in the text are
109- enclosed in match elements, just as if you did a full text query on an XML document.
110- This makes it easy to post-process the query result, for example to create a
111- keywords in context display using eXist's standard KWIC module.</para >
112- <para >The document the index is linked to does not need to be a binary resource. One can
113- also create additional indexes on xml documents. This is a useful feature, because
114- it allows us to index and query information which is not directly contained in the
115- XML itself. For example, one could add metadata fields and retrieve them later using
116- get-field. Or we could use fields to pre-process and normalize information already
117- present in the XML to speed up later access.</para >
118- </sect1 >
53+ <sect1 >
54+ <title >Storage and Indexing Strategies</title >
55+
56+ <para >While you could decide to just store the HTML returned by the content extraction functions as an XML resource into the database, this may
57+ not be efficient. For example, a document search application may not need to retain the extracted HTML.</para >
58+ <para >In such cases the <code >ft:index()</code > function from the full text indexing module can be useful. This function allows users to associate
59+ a full text index with any database resource, be it binary or XML. The index will be linked to the resource.</para >
60+ <para >To create an index, call the function with the following arguments: </para >
61+ <orderedlist >
62+ <listitem >
63+ <para >The path of the resource to which the index should be linked as a string.</para >
64+ </listitem >
65+ <listitem >
66+ <para >An XML fragment describing the fields you want to add and the text content to index.</para >
67+ </listitem >
68+ </orderedlist >
69+ <para >For example, to associate an index with the document <code >test.txt</code >, call <code >ft:index()</code > as follows: </para >
70+ <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-4.txt" />
71+
72+ <para >This creates a Lucene index document, indexes the content using the configured analyzers, and links it to the eXist document with the given
73+ path. You may link more than one Lucene document to the same eXist resource. The field elements map to Lucene fields. You can use as many fields
74+ as you want or add multiple fields with the same name.</para >
75+ <para >The <code >store="yes"</code > attribute tells the indexer to also store the text string, so you can retrieve it later.</para >
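<para>The two modules can be combined so that the text extracted from a binary is indexed without storing the intermediate XHTML. The sketch below is illustrative only: the <tag>doc</tag> wrapper element, the resource path, the import statement, and the choice of the field names <code>title</code> and <code>para</code> are assumptions, and the <code>ft</code> and <code>util</code> prefixes are assumed to be predeclared.</para>
<programlisting language="xquery"><![CDATA[xquery version "3.0";

(: Assumed import; prefer the statement given in listing-2 if it differs :)
import module namespace contentextraction="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

declare namespace html="http://www.w3.org/1999/xhtml";

(: Hypothetical binary resource to extract from and link the index to :)
let $path := "/db/apps/demo/sample.pdf"
let $xhtml := contentextraction:get-metadata-and-content(util:binary-doc($path))
(: Assumed fragment shape: one field element per Lucene field :)
let $fields :=
    <doc>
        <field name="title" store="yes">{ string($xhtml//html:head/html:title) }</field>
        <field name="para" store="yes">{ string-join($xhtml//html:body//text(), " ") }</field>
    </doc>
return
    ft:index($path, $fields)]]></programlisting>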
76+ <para >To query the created index, use the <code >ft:search()</code > function: </para >
77+ <programlisting >ft:search("/db/apps/demo/test.txt", "para:paragraph and title:indexing")</programlisting >
78+ <para >The first parameter is the path to the resource or collection to query. The second specifies a Lucene query string. Note how we prefix the
79+ query term by the name of the field.</para >
80+ <para >Executing this query returns:</para >
81+ <programlisting xmlns : xlink =" http://www.w3.org/1999/xlink" xlink : href =" listings/listing-6.txt" />
82+
83+ <para >Each matching resource is described by a search element. The score attribute expresses the relevance Lucene computed for the resource (the
84+ higher the better). Within the search element, every field which contributed to the query result is returned, but only if
85+ <code >store="yes"</code > was defined for this field at indexing time.</para >
86+ <para > Note how the matches in the text are enclosed in <tag >match</tag > elements, just as if you did a full text query on an XML document. This
87+ makes it easy to post-process the query result, for example to create a keywords in context display using eXist's standard KWIC module.</para >
88+ <para >The document the index is linked to does not need to be a binary resource. One can also create additional indexes on XML documents. This is
89+ a useful feature, because it allows us to index and query information which is not directly contained in the XML itself. For example, one could
90+ add metadata fields and retrieve them later using <code >ft:get-field()</code >. Or we could use fields to pre-process and normalize information already
91+ present in the XML to speed up later access.</para >
92+ </sect1 >
11993</article >