I have deployed this model on Streamlit: https://iitksecyrecruitmentnlpproject-b4puxphxumdjmej6zh2ls6.streamlit.app/
-
First, I cleaned the dataset by dropping duplicate rows and removing rows that contain null values and are of no use.
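A minimal sketch of this cleaning step with pandas (the tiny in-memory frame and the column name `paragraph` are hypothetical stand-ins for the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the raw dataset; real column names may differ.
df = pd.DataFrame({
    "paragraph": ["alpha beta", "alpha beta", None, "gamma delta"],
})

# Drop exact duplicate rows, then drop rows whose paragraph is missing.
df = df.drop_duplicates()
df = df.dropna(subset=["paragraph"]).reset_index(drop=True)

print(len(df))  # -> 2 (one duplicate and one null row removed)
```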
-
Then I removed the stopwords, punctuation, extra spaces, and leading/trailing whitespace from the paragraphs dataset. After that, I analysed the lengths of the paragraphs and the most common words, removed the 1000 most common words, and saved the dataset as test1.csv.
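A sketch of this preprocessing, assuming plain string operations (the tiny stopword set here is illustrative; the project likely uses a fuller list such as NLTK's, and the top-1000 cutoff would replace the `most_common` call below):

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def clean_text(text: str) -> str:
    text = text.lower()
    # Strip punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace and trim leading/trailing spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop stopwords.
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_text("  The cat, of course, is   in the garden!  "))
# -> "cat course garden"

# Counting word frequencies across the cleaned corpus; the top-N most
# common words (N = 1000 in the project) could then be removed the same way.
corpus = ["cat sat mat", "cat ran fast"]
counts = Counter(w for doc in corpus for w in doc.split())
print(counts.most_common(1))  # -> [('cat', 2)]
```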
-
I have also created another dataset named for_bot.csv, which is just the original dataset re-indexed to match test1.csv.
-
Now I convert the paragraphs inside test1.csv into vectors using a TF-IDF vectorizer. The main idea behind TF-IDF is that it accounts for the uniqueness of each word within the corpus, which is very useful for retrieving the 5 paragraphs that best match the query.
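This vectorization step can be sketched with scikit-learn's `TfidfVectorizer` (the three-sentence corpus is a hypothetical stand-in for the cleaned paragraphs in test1.csv):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus standing in for the cleaned paragraphs of test1.csv.
corpus = [
    "solar panels convert sunlight electricity",
    "wind turbines generate power wind",
    "solar energy renewable clean",
]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, and transform each paragraph
# into a sparse TF-IDF row vector.
tfidf_matrix = vectorizer.fit_transform(corpus)

print(tfidf_matrix.shape)  # -> (3, 12): 3 paragraphs, 12 vocabulary terms
```

Words that appear in fewer paragraphs get a higher IDF weight, so rare, distinctive terms dominate the similarity computation.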
-
The query is processed the same way test1.csv was, and then converted into a vector using the same fitted TF-IDF vectorizer.
-
The indices of the best 5 paragraphs are then retrieved using cosine similarity.
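The query-to-top-5 retrieval described in the two steps above can be sketched as follows (corpus and query are hypothetical; the key point is that the query must go through the same fitted vectorizer as the documents):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "solar panels convert sunlight electricity",
    "wind turbines generate power wind",
    "solar energy renewable clean",
]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)

query = "solar power"
# transform (not fit_transform): reuse the vocabulary/IDF learned above.
query_vec = vectorizer.transform([query])

# Cosine similarity of the query against every paragraph vector.
scores = cosine_similarity(query_vec, doc_vecs).ravel()

# Indices of the (up to) 5 most similar paragraphs, best first.
top_k = np.argsort(scores)[::-1][:5]
print(top_k)
```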
-
The paragraphs at those indices in for_bot.csv are then sent to the LLM for processing.
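A sketch of that lookup, assuming for_bot.csv is loaded into a pandas frame whose row index lines up with test1.csv (the frame contents, column name, and indices below are hypothetical; the actual LLM call is omitted):

```python
import pandas as pd

# Stand-in for for_bot.csv: original paragraphs, re-indexed to match test1.csv.
for_bot = pd.DataFrame({"paragraph": [
    "orig para 0", "orig para 1", "orig para 2",
    "orig para 3", "orig para 4", "orig para 5",
]})

# Hypothetical output of the cosine-similarity step, best match first.
top_indices = [4, 1, 5]

# .loc with a list preserves the ranking order; join into one context string
# that would then be included in the prompt sent to the LLM.
context = "\n\n".join(for_bot.loc[top_indices, "paragraph"])
print(context)
```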
-
Pull this repo using the following commands: "git clone https://github.com/omgupta-iitk/iitk_secy_recruitment_nlp_project.git" then "cd iitk_secy_recruitment_nlp_project"
-
Install the requirements: "pip install -r requirements.txt"
-
Then run this command to launch the app in your browser: "streamlit run qa_bot_gemini.py"