Query document up until a specified point in the text

BKaperick/book-indexing

Book-indexing project

The idea is to build a simple search index of a document that covers only the text up to a specified point. This lets a reader query what they have already read, to refresh their memory of characters and earlier events, without spoiling anything that has not yet happened.
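To make the idea concrete, here is a minimal sketch of a spoiler-free lookup. It is written in Python purely for illustration and is not the project's Julia code; the names `tokenize`, `build_index`, `query`, and the `cutoff` parameter are hypothetical, and scoring by the number of distinct matching words is an assumption rather than the project's actual ranking.

```python
import re
from collections import defaultdict

def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return re.findall(r"[a-z']+", line.lower())

def build_index(lines):
    """Map each word to the ascending list of line numbers where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines):
        for word in set(tokenize(line)):
            index[word].append(line_no)
    return index

def query(index, text, cutoff):
    """Return the best-matching line number before `cutoff`, or None.

    A line's score is the number of distinct query words it contains.
    Lines at or after `cutoff` (the unread part) are never considered.
    """
    scores = defaultdict(int)
    for word in set(tokenize(text)):
        for line_no in index.get(word, []):
            if line_no >= cutoff:
                break  # line numbers are ascending, so the rest is unread
            scores[line_no] += 1
    if not scores:
        return None
    # Highest score wins; ties go to the earliest line.
    return max(scores, key=lambda n: (scores[n], -n))

lines = [
    "Gatsby stood on the lawn",
    "Daisy came to dinner",
    "Daisy and Gatsby left after dinner",
]
index = build_index(lines)
print(query(index, "Daisy Gatsby dinner", cutoff=2))  # only lines 0 and 1 are searchable
```

With `cutoff=2` the last line is invisible to the query even though it would otherwise be the strongest match, which is the behavior the project is after.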

Getting started

Once the repository is cloned, you need to populate the raw data.

  1. Run .\scrape.sh to download some books from Project Gutenberg to start with.
  2. Run julia main.jl -write -overwrite -ingest -clean to populate the database with the cleaned data from the downloaded books.

On later runs, you can omit -overwrite if you have added new books via .\scrape.sh and only want to update the database with their contents.

Then, you should be able to query the data with a command such as

julia main.jl -read -title "The Great Gatsby" -query "Daisy Gatsby soup for dinner"

which queries the text of "The Great Gatsby" and returns the excerpt that best matches the query, with two lines of context before and after:

It was dark now, and as we dipped under a little bridge I put my arm
around Jordan’s golden shoulder and drew her toward me and asked her
**to dinner. Suddenly I wasn’t thinking of Daisy and Gatsby any more,**
but of this clean, hard, limited person, who dealt in universal
scepticism, and who leaned back jauntily just within the circle of my
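The output format above can be sketched as follows, assuming the line number of the best match is already known. This is an illustrative Python snippet rather than the project's Julia code; the function name `excerpt` and its `context` parameter are hypothetical.

```python
def excerpt(lines, best, context=2):
    """Return the best line wrapped in ** markers, with `context` lines on each side.

    The window is clipped at the start and end of the text.
    """
    start = max(0, best - context)
    end = min(len(lines), best + context + 1)
    out = []
    for i in range(start, end):
        out.append(f"**{lines[i]}**" if i == best else lines[i])
    return "\n".join(out)

sample = ["line one", "line two", "line three", "line four", "line five"]
print(excerpt(sample, best=2))
```

Clipping the window means a match near the beginning or end of the text simply shows fewer context lines instead of failing.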

Running web interface locally

There are a few additional steps to follow:

  1. Run julia --startup-file=no -e 'using DaemonMode; serve()' in the background.
  2. Run php -S localhost:8000 process.php from the ./src/ directory.
  3. Open a web browser and navigate to http://localhost:8000/main.html.

Here, the PHP back end wraps the specific calls to the Julia scripts.

Development

schema

TODO

  • Automated scraper and cleaning pipeline
  • TF weighting within document
  • TF-IDF weighting based on totality of downloaded books
  • Query by sentence rather than by line
  • Model for identifying character names
  • Better cleaning of things like chapter headings, front/end matter
  • Code for extracting n-grams instead of just individual words
    • Don't extract n-grams across sentence breaks
  • Test performance and efficacy of n-gram indexing for n=2,3
  • Create word vector embedding for semantic matches
  • [Perf] Serialize data such as word frequencies between runs
    • Create table for global word frequencies and document frequencies
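Two items on this list, n-grams that respect sentence breaks and TF-IDF weighting across books, can be illustrated with a short sketch. It is written in Python for illustration only and does not reflect the project's Julia implementation. The helper names are hypothetical, the sentence splitter is deliberately naive, and the weighting shown is one common TF-IDF variant (term frequency normalized by document length, multiplied by log(N / df)), chosen as an assumption because the list does not say which variant is intended.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")

def sentences(text):
    """Naively split text at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def ngrams(text, n):
    """Return word n-grams that never span a sentence break."""
    grams = []
    for sentence in sentences(text):
        words = WORD.findall(sentence.lower())
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    weight = (count in document / document length) * log(N / document frequency)
    """
    counts = [Counter(WORD.findall(d.lower())) for d in docs]
    # Document frequency: the number of documents containing each word.
    df = Counter(word for c in counts for word in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({w: (k / total) * math.log(n_docs / df[w]) for w, k in c.items()})
    return weights

print(ngrams("He left. She stayed home.", 2))
print(tf_idf(["daisy daisy gatsby", "gatsby nick"]))
```

In the bigram example no pair joins "left" and "she", because each sentence is processed on its own. In the TF-IDF example, a word that appears in every document gets weight zero, so names shared across all books would contribute nothing to a match.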

De-prioritized/Abandoned

  • Refactor to use a struct with metadata for words
