30 changes: 8 additions & 22 deletions README.md
@@ -1,27 +1,13 @@
# Data Pre-processor
Converting information from regular PDF and DOCX files into structured JSON.
# The Aim
To convert data from safety data sheets (PDFs) into a machine-readable JSON format.
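
As an illustration only, a machine-readable record for one sheet might look like the following (the field names and values are hypothetical, not a format the project has defined):

```json
{
  "product_name": "Acetone",
  "cas_number": "67-64-1",
  "hazard_statements": ["H225", "H319", "H336"]
}
```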

---

## How to Contribute
# The Solution (?)
I analysed the steps needed to perform a task like this and came up with the following approach:

To contribute to our documentation:
First, we use the pymupdf library to extract the raw text.
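
This extraction step could be sketched as follows (a minimal sketch assuming pymupdf is installed; the function name and the lazy import are our own choices, not project code):

```python
def extract_raw_text(pdf_path):
    """Return the raw text of every page of a PDF, joined with newlines."""
    import fitz  # pymupdf; imported lazily so the sketch can be read without it

    with fitz.open(pdf_path) as doc:
        # get_text("text") returns the plain-text content of one page
        return "\n".join(page.get_text("text") for page in doc)
```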

1. **Fork the Repository:** Click the "Fork" button at the top right of this repository to create a copy in your GitHub account. 🍴
Then we preprocess the text, eliminating unwanted elements such as headers and footers, and categorize the important content using spaCy or regex. The two differ in approach: spaCy is an NLP library that uses ML models to process text in a more contextual sense, while regex uses pattern matching to find specific sequences of data, so I guess it is better suited to static patterns.
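
For the static-pattern side, a regex sketch might look like this (the sample text and field names are invented for illustration; real sheets vary in layout):

```python
import re

# Hypothetical snippet of raw text extracted from a safety data sheet
raw = """SECTION 1: Identification
Product name: Acetone
CAS No.: 67-64-1
"""

# CAS numbers follow a fixed digits-hyphen pattern, a good fit for regex
cas_pattern = re.compile(r"\b(\d{2,7}-\d{2}-\d)\b")
name_pattern = re.compile(r"Product name:\s*(.+)")

record = {
    "product_name": name_pattern.search(raw).group(1).strip(),
    "cas_number": cas_pattern.search(raw).group(1),
}
print(record)  # → {'product_name': 'Acetone', 'cas_number': '67-64-1'}
```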

2. **Clone Your Fork:** Clone the forked repository to your local machine using Git. 🖥️

```bash
git clone https://github.com/<your-username>/data_preprocessor.git
```

3. **Create a Branch:** Create a new branch for your contribution. 🌿

```bash
git checkout -b <new-branch-name>
```
4. **Virtual Environment:** Create the necessary virtual environment or Docker container; Docker is preferred, so it is worth looking into.
5. **Commit Your Changes:** Use the Git CLI to add your files and track your changes.
6. **Open a Pull Request:** Once your changes are pushed to your branch, open a pull request.

---
pymupdf also seems to have an inbuilt function for extracting tabular data.
However, I was unable to try this due to environment problems.
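
Once the environment issue is resolved, that step might look like this (a sketch assuming a recent pymupdf with `Page.find_tables()`; the helper name is ours, not project code):

```python
def extract_tables(pdf_path):
    """Return every detected table as a list of rows (lists of cell values)."""
    import fitz  # pymupdf; imported lazily so the sketch can be read without it

    tables = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            finder = page.find_tables()  # detect table-like regions on the page
            for table in finder.tables:
                tables.append(table.extract())  # rows of cell values
    return tables
```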