Detypstify uses OCR to generate Typst code from images of math formulas, running as a fully client-side web app.
The model is hosted here.
We use oxen to version control our data. To get the oxen executable, run nix develop. Then, from the root of this repo, clone the oxen repo:
oxen clone https://hub.oxen.ai/DiracDelta/data
The datasets we use for this project will now be available in data/.
Detypstify uses a custom dataset, generated by transpiling the
im2latex-230k dataset to Typst with pandoc and
cleaning the resulting data (see scraper/). The final dataset is available on
Kaggle.
- Download the dataset and unzip it
- Run poetry run train_val_split to perform a train validation split
- Generate formulas.txt by running scripts/mk_formulas_txt.sh on the train and val directories
- Install pix2tex
  - Follow the instructions to generate tokenizer, train.pkl, val.pkl
- Create a config.yaml based on the template
- Train the model with python -m pix2tex.train --config config.yaml
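For readers unfamiliar with the train validation split step, here is a minimal sketch of what such a split typically does. It is not the project's train_val_split script: the function name split_train_val, the flat directory of image files, the 10% validation fraction, and the fixed seed are all assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def split_train_val(src_dir, out_dir, val_fraction=0.1, seed=0):
    """Copy the files in src_dir into out_dir/train and out_dir/val.

    Hypothetical helper: the layout and defaults are illustrative and may
    differ from what the project's own train_val_split command does.
    """
    src = Path(src_dir)
    out = Path(out_dir)

    # Sort first so the shuffle is reproducible for a given seed.
    files = sorted(p for p in src.iterdir() if p.is_file())
    random.Random(seed).shuffle(files)

    # The first n_val shuffled files form the validation set.
    n_val = int(len(files) * val_fraction)
    splits = {"val": files[:n_val], "train": files[n_val:]}

    for name, members in splits.items():
        target = out / name
        target.mkdir(parents=True, exist_ok=True)
        for path in members:
            shutil.copy2(path, target / path.name)

    # Report how many files ended up in each split.
    return {name: len(members) for name, members in splits.items()}
```

Fixing the seed keeps the split stable across runs, so the train and val directories (and the formulas.txt files generated from them) do not change between experiments.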