I have deployed this model on Streamlit: https://iitksecyrecruitmentnlpproject-b4puxphxumdjmej6zh2ls6.streamlit.app/
-
First, I cleaned the dataset by dropping duplicate rows and removing rows that contain null values and are of no use.
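A minimal sketch of this cleaning step with pandas (the tiny in-memory frame and the column name `paragraph` are hypothetical stand-ins for the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the raw dataset; real column names may differ.
df = pd.DataFrame({
    "paragraph": ["alpha beta", "alpha beta", None, "gamma delta"],
})

# Drop exact duplicate rows, then drop rows whose paragraph is missing.
df = df.drop_duplicates()
df = df.dropna(subset=["paragraph"]).reset_index(drop=True)

print(len(df))  # -> 2 (one duplicate and one null row removed)
```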
-
Then I removed the stopwords, punctuation, extra spaces, and leading/trailing whitespace from the paragraphs dataset. After that, I analysed the lengths of the paragraphs and the most common words, removed the 1000 most common words, and saved the dataset as test1.csv.
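A sketch of this preprocessing, assuming plain string operations (the tiny stopword set here is illustrative; the project likely uses a fuller list such as NLTK's, and the top-1000 cutoff would replace the `most_common` call below):

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def clean_text(text: str) -> str:
    text = text.lower()
    # Strip punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace and trim leading/trailing spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop stopwords.
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_text("  The cat, of course, is   in the garden!  "))
# -> "cat course garden"

# Counting word frequencies across the cleaned corpus; the top-N most
# common words (N = 1000 in the project) could then be removed the same way.
corpus = ["cat sat mat", "cat ran fast"]
counts = Counter(w for doc in corpus for w in doc.split())
print(counts.most_common(1))  # -> [('cat', 2)]
```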
-
I have also created another dataset named for_bot.csv, which is just the original dataset re-indexed to match test1.csv.
-
Now I convert the paragraphs inside test1.csv into vectors using a TF-IDF vectorizer. The main idea behind TF-IDF is that it accounts for the uniqueness of each word within the corpus, which is very useful for retrieving the 5 paragraphs that best match the query.
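This vectorization step can be sketched with scikit-learn's `TfidfVectorizer` (the three-sentence corpus is a hypothetical stand-in for the cleaned paragraphs in test1.csv):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus standing in for the cleaned paragraphs of test1.csv.
corpus = [
    "solar panels convert sunlight electricity",
    "wind turbines generate power wind",
    "solar energy renewable clean",
]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, and transform each paragraph
# into a sparse TF-IDF row vector.
tfidf_matrix = vectorizer.fit_transform(corpus)

print(tfidf_matrix.shape)  # -> (3, 12): 3 paragraphs, 12 vocabulary terms
```

Words that appear in fewer paragraphs get a higher IDF weight, so rare, distinctive terms dominate the similarity computation.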
-
The query is processed the same way test1.csv was, and then converted into a vector using the same fitted TF-IDF vectorizer.
-
The indices of the best 5 paragraphs are then retrieved using cosine similarity.
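The query-to-top-5 retrieval described in the two steps above can be sketched as follows (corpus and query are hypothetical; the key point is that the query must go through the same fitted vectorizer as the documents):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "solar panels convert sunlight electricity",
    "wind turbines generate power wind",
    "solar energy renewable clean",
]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)

query = "solar power"
# transform (not fit_transform): reuse the vocabulary/IDF learned above.
query_vec = vectorizer.transform([query])

# Cosine similarity of the query against every paragraph vector.
scores = cosine_similarity(query_vec, doc_vecs).ravel()

# Indices of the (up to) 5 most similar paragraphs, best first.
top_k = np.argsort(scores)[::-1][:5]
print(top_k)
```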
-
The paragraphs at those indices in for_bot.csv are then sent to the LLM for processing.
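A sketch of that lookup, assuming for_bot.csv is loaded into a pandas frame whose row index lines up with test1.csv (the frame contents, column name, and indices below are hypothetical; the actual LLM call is omitted):

```python
import pandas as pd

# Stand-in for for_bot.csv: original paragraphs, re-indexed to match test1.csv.
for_bot = pd.DataFrame({"paragraph": [
    "orig para 0", "orig para 1", "orig para 2",
    "orig para 3", "orig para 4", "orig para 5",
]})

# Hypothetical output of the cosine-similarity step, best match first.
top_indices = [4, 1, 5]

# .loc with a list preserves the ranking order; join into one context string
# that would then be included in the prompt sent to the LLM.
context = "\n\n".join(for_bot.loc[top_indices, "paragraph"])
print(context)
```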
-
Pull this repo using the following commands: "git clone https://github.com/omgupta-iitk/iitk_secy_recruitment_nlp_project.git" then "cd iitk_secy_recruitment_nlp_project"
-
Install the requirements: "pip install -r requirements.txt"
-
Then run this command to launch the app in your browser: "streamlit run qa_bot_gemini.py"