The idea is to build a simple search index over a document using only the text up to a specific point. The motivation is to let a reader query what they have already read, refreshing their memory of characters and earlier events without spoiling anything that has not yet happened.
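As a rough illustration of that idea, the sketch below builds an index over only the lines read so far, so a query can never surface anything past the reader's position. The function and names here are illustrative assumptions, not this repository's actual API:

```julia
# Hypothetical sketch of the core idea; names are illustrative, not the repo's API.
# Index only the lines up to `last_read_line`, so queries cannot spoil later text.
function build_index(lines::Vector{String}, last_read_line::Int)
    index = Dict{String, Vector{Int}}()        # word => line numbers where it appears
    for (i, line) in enumerate(lines[1:last_read_line])
        for word in split(lowercase(line), r"\W+"; keepempty=false)
            push!(get!(index, word, Int[]), i)
        end
    end
    return index
end
```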
Once the repository is cloned, you need to populate the raw data.
- Run `./scrape.sh` to download some books from Project Gutenberg to start with.
- Run `julia main.jl -write -overwrite -ingest -clean` to populate the database with the cleaned data from the downloaded books (the cleaning step is sketched after this list).
On future runs, you can drop `-overwrite` if you have added new books via `./scrape.sh` and just want to add their data to the database.
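For reference, the `-clean` step conceptually amounts to something like the following. This is a minimal sketch of my own, assuming Project Gutenberg's standard `*** START OF` / `*** END OF` markers; the repository's actual cleaner may do more (e.g. chapter headings):

```julia
# Minimal sketch (an assumption, not the repo's actual cleaner): strip Project
# Gutenberg front and end matter using the standard START/END markers.
function strip_gutenberg(raw::AbstractString)
    lines = split(raw, '\n')
    start = findfirst(l -> occursin("*** START OF", l), lines)
    stop  = findlast(l -> occursin("*** END OF", l), lines)
    lo = start === nothing ? 1 : start + 1     # fall back to the full text if a
    hi = stop === nothing ? length(lines) : stop - 1  # marker is missing
    return join(lines[lo:hi], '\n')
end
```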
Then, you should be able to query the data with a command such as

```
julia main.jl -read -title "The Great Gatsby" -query "Daisy Gatsby soup for dinner"
```

which will query the text of "The Great Gatsby" and return the excerpt that best matches the query, with two lines of context before and after (a sketch of the matching follows the excerpt):
It was dark now, and as we dipped under a little bridge I put my arm
around Jordan’s golden shoulder and drew her toward me and asked her
**to dinner. Suddenly I wasn’t thinking of Daisy and Gatsby any more,**
but of this clean, hard, limited person, who dealt in universal
scepticism, and who leaned back jauntily just within the circle of my
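The matching behind this output might look roughly like the following. This is a sketch under my own assumptions, scoring each line by simple query-term overlap; the actual implementation (which may use TF or TF-IDF weighting, per the list below) need not match it:

```julia
# Hedged sketch of the excerpt matching: score each line by query-term overlap,
# then return the best line plus `context` lines on either side.
function best_excerpt(lines::Vector{String}, query::String; context::Int=2)
    terms = Set(split(lowercase(query)))
    score(l) = count(w -> w in terms, split(lowercase(l), r"\W+"; keepempty=false))
    best = argmax(score.(lines))                      # index of the best-scoring line
    lo, hi = max(1, best - context), min(length(lines), best + context)
    return join(lines[lo:hi], '\n')
end
```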
There are a few additional steps to run the web front end:
- Run `julia --startup-file=no -e 'using DaemonMode; serve()'` in the background.
- Run `php -S localhost:8000 process.php` from the `./src/` directory.
- Open a web browser and navigate to `http://localhost:8000/main.html`.
Here, the specific calls to the Julia scripts are wrapped by the PHP back end.
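Each request presumably ends up running something like DaemonMode.jl's documented client pattern, which sends `main.jl` to the already-running daemon instead of paying Julia's startup cost on every query. The exact command is my assumption based on DaemonMode.jl's `runargs()` client:

```julia
# Assumed shape of the call the PHP back end shells out to (per DaemonMode.jl's
# documented client pattern); `main.jl` runs inside the persistent daemon.
cmd = `julia --startup-file=no -e 'using DaemonMode; runargs()' main.jl -read -title "The Great Gatsby" -query "Daisy Gatsby soup for dinner"`
run(cmd)
```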
- Automated scraper and cleaning pipeline
- TF weighting within document
- TF-IDF weighting based on the full set of downloaded books (one standard formulation is sketched after this list)
- Query by sentence rather than by line
- Model for identifying character names
- Better cleaning of things like chapter headings, front/end matter
- Code for extracting n-grams instead of just individual words
- Don't extract n-grams across sentence breaks
- Test performance and efficacy of n-gram indexing for n=2,3
- Create word vector embedding for semantic matches
- [Perf] Serialize data, such as word frequencies, between runs
- Create table for global word frequencies and document frequencies
- Refactor to use a struct with metadata for words
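For the TF-IDF item above, one standard formulation weighs a word by tf(w, d) * log(N / df(w)). This is a sketch of that common variant, not necessarily the one the project will settle on; `docs`, `tfidf`, and the bag-of-words representation are illustrative:

```julia
# Sketch of standard TF-IDF weighting over a set of books (assumed structure).
# `docs` maps a title to its bag of words.
function tfidf(docs::Dict{String, Vector{String}})
    N = length(docs)
    df = Dict{String, Int}()                   # document frequency per word
    for words in values(docs), w in Set(words)
        df[w] = get(df, w, 0) + 1
    end
    weights = Dict{String, Dict{String, Float64}}()
    for (title, words) in docs
        tf = Dict{String, Int}()               # raw term counts within this book
        for w in words
            tf[w] = get(tf, w, 0) + 1
        end
        weights[title] = Dict(w => c / length(words) * log(N / df[w]) for (w, c) in tf)
    end
    return weights
end
```

A nice property of this weighting is that words appearing in every downloaded book get weight zero, which naturally suppresses common words in query matching.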