eXist-db
diff --git a/‎src/main/xar-resources/data/backup/backup.xml‎
Lines changed: 302 additions & 293 deletions b/‎src/main/xar-resources/data/backup/backup.xml‎
Lines changed: 302 additions & 293 deletions
diff --git a/‎src/main/xar-resources/data/beginners-guide-to-xrx-v4/beginners-guide-to-xrx-v4.xml‎
Lines changed: 697 additions & 930 deletions b/‎src/main/xar-resources/data/beginners-guide-to-xrx-v4/beginners-guide-to-xrx-v4.xml‎
Lines changed: 697 additions & 930 deletions
diff --git a/‎src/main/xar-resources/data/building/building.xml‎
Lines changed: 0 additions & 580 deletions b/‎src/main/xar-resources/data/building/building.xml‎
Lines changed: 0 additions & 580 deletions
diff --git a/‎src/main/xar-resources/data/configuration/configuration.xml‎
Lines changed: 1206 additions & 1150 deletions b/‎src/main/xar-resources/data/configuration/configuration.xml‎
Lines changed: 1206 additions & 1150 deletions
diff --git a/‎src/main/xar-resources/data/contentextraction/contentextraction.xml‎
Lines changed: 77 additions & 103 deletions b/‎src/main/xar-resources/data/contentextraction/contentextraction.xml‎
Lines changed: 77 additions & 103 deletions
@@ -1,119 +1,93 @@
 <?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng"
         schematypens="http://relaxng.org/ns/structure/1.0"?><?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" type="application/xml"
         schematypens="http://purl.oclc.org/dsdl/schematron"?><article xmlns="http://docbook.org/ns/docbook" version="5.0">
-   <info>
-      <title>Extracting Content from Binary Files</title>
-      <date>September 2012</date>
-      <keywordset>
-         <keyword>TBD</keyword>
-      </keywordset>
-   </info>
+  <info>
+    <title>Extracting Content from Binary Files</title>
+    <date>1Q18</date>
+    <keywordset>
+      <keyword>application-development</keyword>
+    </keywordset>
+  </info>
 
-   <!-- ================================================================== -->
+  <!-- ================================================================== -->
 
-   <sect1>
-      <title>Overview</title>
+  <para>The Content Extraction module extends eXist-db's XML abilities to binary files. The module contains functions for extracting the content of
+    the binary files, and returning the content as XML. In this form, the content can then be queried, indexed, and manipulated. It useful especially
+    in conjunction with Lucene indexes.</para>
+  <para>The Content Extraction is built on the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://tika.apache.org/">Apache
+      Tika</link> library. Tika understands a large variety of formats, ranging from PDF documents to spreadsheets and image metadata.</para>
 
-      <para>eXist-db excels at querying, indexing, and manipulating XML files. The Content
-                Extraction module extends eXist-db's XML abilities to binary files. The module
-                contains functions for extracting the content of the binary files, and returning the
-                content as XML. In this form, the content can then be queried, indexed, and
-                manipulated. This module is built on the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://tika.apache.org/">Apache
-                    Tika</link> library. Tika understands a large variety of formats, ranging from
-                PDF documents to spreadsheets and image metadata.</para>
-   </sect1>
+  <!-- ================================================================== -->
 
-   <!-- ================================================================== -->
+  <sect1>
+    <title>Enabling the Content Extraction Module</title>
 
-   <sect1>
-      <title>Enabling the Content Extraction Module</title>
+    <para>To enable the content extraction module, edit <code>$EXIST_HOME/extensions/build.properties</code> and set the
+        <code>include.feature.contentextraction</code> property to true:</para>
+    <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-1.txt"/>
+    <para>Next, run <code>bin/build.sh</code> or <code>bin/build.bat</code> to build the module. You should see in the output the various libraries
+      from the Tika project downloaded and installed.</para>
+  </sect1>
 
-      <para>To enable the content extraction module, edit
-                EXIST_HOME/extensions/build.properties and set the include.feature.contentextraction
-                property to true:</para>
-      <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-1.txt"/>
-      <para>Next, call build.sh/build.bat from eXist's top directory to build the module. You
-                should see in the output how the various libraries from the Tika project are
-                downloaded and installed.</para>
-   </sect1>
+  <!-- ================================================================== -->
 
-   <!-- ================================================================== -->
+  <sect1>
+    <title>Usage</title>
 
-   <sect1>
-      <title>Usage</title>
+    <para>To import the module use the following import statement:</para>
+    <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-2.txt"/>
 
-      <para>To import the module use an import statement as follows:</para>
-      <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-2.txt"/>
-      <para>The module provides three functions:</para>
-      <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-3.txt"/>
-      <para>The first two functions need little explanation: get-metadata just returns some
-                metadata extracted from the resource, while get-metadata-and-content will also
-                provide the text body of the resource—if there is any. The third function is a
-                streaming variant of the other two and is used to process larger resources, whose
-                content may not fit into memory.</para>
-      <para>All functions produce XHTML. The metadata will be contained in the HTML head, the
-                contents go into the body. The structure of the body HTML varies depending on the
-                media type of the binary file. For example, the HTML resulting from a PDF is a
-                sequence of divs (one per page), but that of a word processing document is more
-                often a sequence of paragraphs.</para>
-   </sect1>
+    <para>The module provides three functions:</para>
+    <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-3.txt"/>
 
-   <!-- ================================================================== -->
+    <para>The first two functions need little explanation: <code>get-metadata()</code> just returns some metadata extracted from the resource, while
+        <code>get-metadata-and-content()</code> will also provide the text body of the resource (if any). The third function is a streaming variant of
+      the other two and is used to process larger resources whose content may not fit into memory.</para>
+    <para>All functions produce XHTML. The metadata will be contained in the HTML head, the contents goes into the body. The structure of the body
+      HTML varies depending on the media type of the binary file. For example, the HTML resulting from a PDF is a sequence of <tag>div</tag> elements,
+      one per page. That of a word processing document is often a sequence of paragraphs.</para>
+  </sect1>
 
-   <sect1>
-      <title>Storage and Indexing Strategies</title>
+  <!-- ================================================================== -->
 
-      <para>While you could decide to just store the HTML returned by the content extraction
-                functions as an XML resource into the database, this may not be efficient for
-                certain applications. For example, a document search applications may not need to retain the
-                extracted HTML.</para>
-      <para>In such cases the ft:index() function from the full text indexing module can be
-                useful. This function allows users to associate a full text index with any database
-                resource, be it binary or XML. The index will be linked to the resource, meaning
-                that the same permissions apply; if the resource is deleted, the index will be
-                removed as well.</para>
-      <para>To create an index, call the index function with the following arguments: </para>
-      <orderedlist>
-         <listitem>
-            <para>The path of the resource to which the index should be linked as a
-                            string.</para>
-         </listitem>
-         <listitem>
-            <para>An XML fragment describing the fields you want to add and the text
-                            content to index.</para>
-         </listitem>
-      </orderedlist>
-      <para>For example, to associate an index with the document test.txt one may call index
-                as follows: </para>
-      <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-4.txt"/>
-      <para>This creates a lucene index document, indexes the content using the configured
-                analyzers, and links it to the eXist document with the given path. You may link more
-                than one Lucene document to the same eXist resource. The field elements map to
-                Lucene fields. You can use as many fields as you want or add multiple fields with
-                the same name. The store="yes" attribute tells the indexer to also store the text
-                string, so you can retrieve it later. It is also possible to configure the analyzers
-                used by Lucene for indexing a given feed as well as other options in the collection
-                configuration. To query the created index, use the search function:
-                </para>
-      <programlisting>ft:search("/db/apps/demo/test.txt", "para:paragraph and title:indexing")</programlisting>
-      <para>
-                The first parameter is the path to the resource or collection to query, the second
-                specifies a Lucene query string. Note how we prefix the query term by the name of
-                the field. Executing this query returns: </para>
-      <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-6.txt"/>
-      <para>Each matching resource is described by a search element. The score attribute
-                expresses the relevance lucene computed for the resource (the higher the better).
-                Within the search element, every field which contributed to the query result is
-                returned, but only if store="yes" was defined for this field at indexing time (if
-                not, the field content won't be available). Note how the matches in the text are
-                enclosed in match elements, just as if you did a full text query on an XML document.
-                This makes it easy to post-process the query result, for example to create a
-                keywords in context display using eXist's standard KWIC module.</para>
-      <para>The document the index is linked to does not need to be a binary resource. One can
-                also create additional indexes on xml documents. This is a useful feature, because
-                it allows us to index and query information which is not directly contained in the
-                XML itself. For example, one could add metadata fields and retrieve them later using
-                get-field. Or we could use fields to pre-process and normalize information already
-                present in the XML to speed up later access.</para>
-   </sect1>
+  <sect1>
+    <title>Storage and Indexing Strategies</title>
+
+    <para>While you could decide to just store the HTML returned by the content extraction functions as an XML resource into the database, this may
+      not be efficient. For example, a document search applications may not need to retain the extracted HTML.</para>
+    <para>In such cases the <code>ft:index()</code> function from the full text indexing module can be useful. This function allows users to associate
+      a full text index with any database resource, be it binary or XML. The index will be linked to the resource.</para>
+    <para>To create an index, call the function with the following arguments: </para>
+    <orderedlist>
+      <listitem>
+        <para>The path of the resource to which the index should be linked as a string.</para>
+      </listitem>
+      <listitem>
+        <para>An XML fragment describing the fields you want to add and the text content to index.</para>
+      </listitem>
+    </orderedlist>
+    <para>For example, to associate an index with the document <code>test.txt</code>, call <code>ft:index()</code> as follows: </para>
+    <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-4.txt"/>
+
+    <para>This creates a Lucene index document, indexes the content using the configured analyzers, and links it to the eXist document with the given
+      path. You may link more than one Lucene document to the same eXist resource. The field elements map to Lucene fields. You can use as many fields
+      as you want or add multiple fields with the same name.</para>
+    <para>The <code>store="yes"</code> attribute tells the indexer to also store the text string, so you can retrieve it later.</para>
+    <para>To query the created index, use the <code>ft:search()</code> function: </para>
+    <programlisting>ft:search("/db/apps/demo/test.txt", "para:paragraph and title:indexing")</programlisting>
+    <para> The first parameter is the path to the resource or collection to query. Tthe second specifies a Lucene query string. Note how we prefix the
+      query term by the name of the field.</para>
+    <para>Executing this query returns:</para>
+    <programlisting xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="listings/listing-6.txt"/>
+
+    <para>Each matching resource is described by a search element. The score attribute expresses the relevance Lucene computed for the resource (the
+      higher the better). Within the search element, every field which contributed to the query result is returned, but only if
+        <code>store="yes"</code> was defined for this field at indexing time.</para>
+    <para> Note how the matches in the text are enclosed in <tag>match</tag> elements, just as if you did a full text query on an XML document. This
+      makes it easy to post-process the query result, for example to create a keywords in context display using eXist's standard KWIC module.</para>
+    <para>The document the index is linked to does not need to be a binary resource. One can also create additional indexes on XML documents. This is
+      a useful feature, because it allows us to index and query information which is not directly contained in the XML itself. For example, one could
+      add metadata fields and retrieve them later using <tag>get-field</tag>. Or we could use fields to pre-process and normalize information already
+      present in the XML to speed up later access.</para>
+  </sect1>
 </article>