30 changes: 8 additions & 22 deletions README.md
@@ -1,27 +1,13 @@
# Data Pre-processor
Converting information from regular PDF and DOCX files into structured JSON.
# The Aim
To convert data from safety data sheets (PDFs) into a machine-readable JSON format.
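
As an illustration only, a machine-readable record for one sheet might look like the following (the field names and values are hypothetical, not a format the project has defined):

```json
{
  "product_name": "Acetone",
  "cas_number": "67-64-1",
  "hazard_statements": ["H225", "H319", "H336"]
}
```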

---

## How to Contribute
# The Solution (?)
I analysed the steps needed to perform a task like this and came up with the following approach:

To contribute to our documentation:
First, we use the pymupdf library to extract the raw text.
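
This extraction step could be sketched as follows (a minimal sketch assuming pymupdf is installed; the function name and the lazy import are our own choices, not project code):

```python
def extract_raw_text(pdf_path):
    """Return the raw text of every page of a PDF, joined with newlines."""
    import fitz  # pymupdf; imported lazily so the sketch can be read without it

    with fitz.open(pdf_path) as doc:
        # get_text("text") returns the plain-text content of one page
        return "\n".join(page.get_text("text") for page in doc)
```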

1. **Fork the Repository:** Click the "Fork" button at the top right of this repository to create a copy in your GitHub account. 🍴
Then we preprocess the text, eliminating unwanted elements such as headers and footers, and categorize the important content using spaCy or regex. The two differ in approach: spaCy is an NLP library that uses ML models to process text in a more contextual sense, while regex uses pattern matching to find specific sequences of data, so I guess it is better suited to static patterns.
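
For the static-pattern side, a regex sketch might look like this (the sample text and field names are invented for illustration; real sheets vary in layout):

```python
import re

# Hypothetical snippet of raw text extracted from a safety data sheet
raw = """SECTION 1: Identification
Product name: Acetone
CAS No.: 67-64-1
"""

# CAS numbers follow a fixed digits-hyphen pattern, a good fit for regex
cas_pattern = re.compile(r"\b(\d{2,7}-\d{2}-\d)\b")
name_pattern = re.compile(r"Product name:\s*(.+)")

record = {
    "product_name": name_pattern.search(raw).group(1).strip(),
    "cas_number": cas_pattern.search(raw).group(1),
}
print(record)  # → {'product_name': 'Acetone', 'cas_number': '67-64-1'}
```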

2. **Clone Your Fork:** Clone the forked repository to your local machine using Git. 🖥️

```bash
git clone https://github.com/<your-username>/data_preprocessor.git
```

3. **Create a Branch:** Create a new branch for your contribution. 🌿

```bash
git checkout -b <new-branch-name>
```
4. **Virtual Environment:** Create the necessary virtual environment or Docker container; Docker is preferred, so it is worth looking into.
5. **Commit Your Changes:** Use the Git CLI to add your files and track your changes.
6. **Open a Pull Request:** Once your changes are pushed to your branch, open a pull request.

---
pymupdf also seems to have an inbuilt function for extracting tabular data.
However, I was unable to try this due to environment problems.
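
Once the environment issue is resolved, that step might look like this (a sketch assuming a recent pymupdf with `Page.find_tables()`; the helper name is ours, not project code):

```python
def extract_tables(pdf_path):
    """Return every detected table as a list of rows (lists of cell values)."""
    import fitz  # pymupdf; imported lazily so the sketch can be read without it

    tables = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            finder = page.find_tables()  # detect table-like regions on the page
            for table in finder.tables:
                tables.append(table.extract())  # rows of cell values
    return tables
```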