leslie-lau/fulltextsearch

Full Text Search w/ PDFBox

Fulltextsearch uses the Apache PDFBox library to parse a given PDF document and search it for a given term. Specifically, fulltextsearch retrieves data about each occurrence of the search term, such as its text coordinates and page number. Below are instructions to package fulltextsearch's PrintTextLocations.java as a jar for use with IA BookReader.
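To make the idea of "text coordinates and page number" concrete, here is a minimal sketch of the lookup in plain Java. It is not the code in PrintTextLocations.java and it does not use PDFBox: the `Glyph`, `Hit`, and `TermLocator` types are hypothetical stand-ins, and the sketch assumes the characters and their positions have already been extracted from the PDF. It also treats the document as one flat character sequence, so it does not handle line breaks or page boundaries specially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical model of one positioned character; not a PDFBox type. */
final class Glyph {
    final char ch;
    final int page;
    final float x;
    final float y;

    Glyph(char ch, int page, float x, float y) {
        this.ch = ch;
        this.page = page;
        this.x = x;
        this.y = y;
    }
}

/** Hypothetical search result: page and position of a match's first character. */
final class Hit {
    final int page;
    final float x;
    final float y;

    Hit(int page, float x, float y) {
        this.page = page;
        this.x = x;
        this.y = y;
    }

    @Override
    public String toString() {
        return "page " + page + " at (" + x + ", " + y + ")";
    }
}

final class TermLocator {
    /** Returns one Hit per non-overlapping, case-insensitive occurrence of term. */
    static List<Hit> find(List<Glyph> glyphs, String term) {
        List<Hit> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits;
        }
        // Lower-case character by character so text indexes line up with glyph indexes.
        StringBuilder text = new StringBuilder();
        for (Glyph g : glyphs) {
            text.append(Character.toLowerCase(g.ch));
        }
        String needle = term.toLowerCase(Locale.ROOT);
        int from = text.indexOf(needle);
        while (from >= 0) {
            Glyph first = glyphs.get(from);
            hits.add(new Hit(first.page, first.x, first.y));
            from = text.indexOf(needle, from + needle.length());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample data: one line of text on page 1, 6 units per character.
        List<Glyph> glyphs = new ArrayList<>();
        String sample = "the cat sat";
        for (int i = 0; i < sample.length(); i++) {
            glyphs.add(new Glyph(sample.charAt(i), 1, 10f + 6f * i, 700f));
        }
        for (Hit h : find(glyphs, "at")) {
            System.out.println(h);
        }
    }
}
```

In the real tool the positions come from PDFBox rather than from hand-built sample data; the sketch only shows how a term match maps back to a page and a coordinate.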

Dependencies

Get Eclipse (I used Oxygen v4.7.0) at https://www.eclipse.org/downloads/

Get the following Apache PDFBox libraries (I used v2.0.9) at https://pdfbox.apache.org/

  • debugger-app-2.0.9.jar
  • fontbox-2.0.9.jar
  • pdfbox-2.0.9.jar
  • pdfbox-app-2.0.9.jar
  • pdfbox-tools-2.0.9.jar
  • preflight-2.0.9.jar
  • preflight-app-2.0.9.jar
  • xmpbox-2.0.9.jar
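Once the jars are on the project's classpath, a small check like the one below can confirm that they are actually visible to Java. This is a sketch and not part of fulltextsearch: the class name `PdfBoxClasspathCheck` is hypothetical, and it probes only two well-known classes, one from pdfbox and one from fontbox, rather than every jar in the list.

```java
final class PdfBoxClasspathCheck {
    /** Returns true if the named class can be found on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, PdfBoxClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] probes = {
            "org.apache.pdfbox.pdmodel.PDDocument", // shipped in the pdfbox jar
            "org.apache.fontbox.ttf.TrueTypeFont"   // shipped in the fontbox jar
        };
        for (String name : probes) {
            System.out.println(name + ": " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If either line reports MISSING, the corresponding jar is most likely not on the build path.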

Instructions

Using Eclipse:

  1. First get a copy of the project and open it in Eclipse.

    • In Eclipse, click File -> Open Projects from File System.
    • Open .../path/to/fulltextsearch.
  2. Configure the build path to include libraries.

    • Open the Package Explorer View: Window -> Show View -> Package Explorer.
    • Expand fulltextsearch in the Package Explorer and right-click Referenced Libraries.
    • Go to Build Path -> Configure Build Path.
    • Under the Libraries tab, click Add External JARs.
    • Add all the PDFBox libraries mentioned above.
  3. Run PrintTextLocations.java

    • Locate fulltextsearch/src/fulltextsearch/PrintTextLocations.java in the Package Explorer.
    • Run the application and the console should output a message: Usage: java -jar...
    • Note the message is actually an error message but is the expected output in this case.
  4. Export PrintTextLocations as a runnable JAR file.

    • File -> Export
    • In the Java folder, choose Runnable JAR file.
    • The Launch configuration should be set to PrintTextLocations - fulltextsearch.
    • The destination should be the folder that contains the search_inside.php file.
      • To keep things easy, you should name the jar file pdfbox_search.jar.
      • If you use some other name, you must change $cmd = "java -jar pdfbox_search.jar ... in search_inside.php to use the name you've given it.
    • Under Library handling, select Extract required libraries into generated JAR.
    • Click Finish and you should now have an executable JAR file.
  5. Running the executable jar (optional).

    • If you would like to run the jar file to see the output, all you will need is a PDF.
    • Open up cmd and cd into the folder containing the jar file.
    • Enter java -jar pdfbox_search.jar <item-id> <file-path> <query-term> <callback> <css-or-abbyy> replacing the items in <> brackets with values.
      • The only values that need to be "real" or "truthful" are <file-path>, the path to the pdf file, and <query-term>, the text you are trying to look up.
      • The other values have no effect in a local demo, except <css-or-abbyy>: any value other than 'css' defaults to 'abbyy'.
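The argument handling described in step 5 can be sketched roughly as follows. This is an illustration, not the actual code in PrintTextLocations.java: the `SearchArgs` class and its field names are hypothetical, the error text is made up rather than copied from the real usage message, and the sketch assumes that all five positional arguments are required and that 'css' is matched exactly.

```java
final class SearchArgs {
    final String itemId;
    final String filePath;
    final String queryTerm;
    final String callback;
    final String mode; // always "css" or "abbyy"

    private SearchArgs(String itemId, String filePath, String queryTerm,
                       String callback, String mode) {
        this.itemId = itemId;
        this.filePath = filePath;
        this.queryTerm = queryTerm;
        this.callback = callback;
        this.mode = mode;
    }

    /** Parses <item-id> <file-path> <query-term> <callback> <css-or-abbyy>. */
    static SearchArgs parse(String[] args) {
        if (args.length < 5) {
            // Assumption: the sketch rejects short argument lists outright.
            throw new IllegalArgumentException(
                "expected 5 arguments: <item-id> <file-path> <query-term> "
                + "<callback> <css-or-abbyy>");
        }
        // Anything other than an exact "css" falls back to "abbyy".
        String mode = "css".equals(args[4]) ? "css" : "abbyy";
        return new SearchArgs(args[0], args[1], args[2], args[3], mode);
    }

    public static void main(String[] args) {
        // Placeholder values for a local demo; only the path and the term matter.
        SearchArgs parsed = parse(new String[] {
            "demo-item", "sample.pdf", "whale", "cb", "anything-else"
        });
        System.out.println("searching " + parsed.filePath + " for '"
            + parsed.queryTerm + "' in " + parsed.mode + " mode");
    }
}
```

For a local run, placeholder strings are enough for the item id and callback, as long as the file path points at a real PDF and the query term is the text you want to find.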
