Commit b5be1b9

Update README.md (major)
Output images , methods and discussion added
1 parent a2c3193 commit b5be1b9

1 file changed: +146 -5 lines changed

README.md

Lines changed: 146 additions & 5 deletions
@@ -1,19 +1,160 @@
1-
## Create virtual environment
1+
<h1 align="center">
2+
Project Setup
3+
</h1>
4+
5+
### Create virtual environment
26

37
``` python -m venv search_engine ```
48

5-
## Activate virtual environment (from parent folder)
9+
### Activate virtual environment (from parent folder)
610

711
```Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass```
812

913
``` .\search_engine\Scripts\activate ```
1014

11-
## Install required libraries
15+
### Install required libraries
1216

1317
``` pip install -r version_requirements.txt ```
1418

15-
## Select kernel for Jupiter Notebook ( ipynb )
19+
### Select kernel for Jupyter Notebook (ipynb)
1620

1721
From the top-right corner, select ``` (parent_folder)/Scripts/python.exe ``` as the kernel, then run the notebook.
1822

19-
For python file use ``` python file_name ``` to execute
23+
To run a Python file, use ``` python file_name ```.
24+
25+
<h1 align="center">
26+
Output View
27+
</h1>
28+
**Pinecone Side Vectors**
29+
![pinecone](https://github.com/user-attachments/assets/0bd8a37f-510c-471e-9206-76135b905bd1)
30+
31+
**Client-Side Search Results:**
32+
33+
![Screenshot 2024-04-02 211102](https://github.com/user-attachments/assets/0919f4ca-4dc4-4ca4-9ccb-b65507d44f09)
34+
35+
![Screenshot 2024-04-02 210855](https://github.com/user-attachments/assets/11d6e250-1818-4ec9-aeaf-4e146a0fcb55)
36+
37+
![Screenshot 2024-04-02 210658](https://github.com/user-attachments/assets/4a87cff9-89ff-42fc-887b-ea1a1aee1765)
38+
39+
40+
41+
**Server-Side Messages:**
42+
43+
![Screenshot 2024-04-02 211726](https://github.com/user-attachments/assets/d5ee1091-24ac-48b5-9253-3120f3f6305d)
44+
45+
![image](https://github.com/user-attachments/assets/750f1b92-2595-4397-9302-3f9d180d5724)
46+
47+
<h1 align="center">
48+
Design and Discussion
49+
</h1>
50+
51+
**Group Members:** Tamal Mallick, Sushanta Das, Suvam Manna
52+
53+
**Problem Description:**

Build a **search engine for a specific domain** (Python) using **web crawling, socket programming, sentence embeddings and a vector database**, so that results stay relevant to that domain. The user gets the results of a search query from a **cosine similarity search** in the vector database. A **Knapsack Cryptosystem** is also implemented and integrated to encrypt user queries and search results sent over the network.
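To make the cosine similarity search concrete, here is a minimal sketch in plain Python. It is not the project's code: the function names `cosine_similarity` and `top_k`, the page ids and the toy 3-dimensional vectors are assumptions for illustration only, and in the project the ranking is delegated to the vector database.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Rank (doc_id, vector) pairs by similarity to query_vec, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors stand in for real sentence embeddings (hypothetical data).
index = [
    ("page_a", [1.0, 0.0, 0.0]),
    ("page_b", [0.0, 1.0, 0.0]),
    ("page_c", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], index, k=2)
```

Because cosine similarity depends only on the angle between vectors, documents are ranked by direction in embedding space rather than by vector length.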
62+
63+
**Algorithm and Design:**
64+
65+
**1) Collecting web pages to build the search engine database**

i) A multithreaded web crawler starts from a few seed links containing the keyword “Python”. It collects page links and then metadata for each link (such as the title, heading tags and some paragraph text). This metadata is later used to search for the webpage URL.

ii) Only valid links/webpages are collected: links that return a status code between 400 and 499 are ignored.

iii) The data is stored in a vector database (Pinecone) as embeddings (high-dimensional vectors with 768 dimensions). Given a search query, the database returns the most similar results, each with a title, link and description.
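The link filter described in step ii) could look roughly like the sketch below. It uses only the standard library. The helper names (`is_valid_status`, `fetch_status`, `filter_links`) and the sample URLs are hypothetical, and treating a failed request as invalid is an assumption the README does not state.

```python
import urllib.error
import urllib.request

def is_valid_status(status_code):
    """A link is kept unless its status code is in the 400-499 range."""
    return not (400 <= status_code <= 499)

def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises HTTPError for 4xx/5xx responses
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...

def filter_links(statuses):
    """Keep URLs whose recorded status is known and outside 400-499."""
    return [url for url, code in statuses.items()
            if code is not None and is_valid_status(code)]

# Offline example with made-up URLs and status codes (no network access needed).
sample = {
    "https://example.com/python-intro": 200,
    "https://example.com/missing": 404,
    "https://example.com/forbidden": 403,
    "https://example.com/moved": 301,
}
kept = filter_links(sample)
```

In a crawler, `fetch_status` would be called from worker threads and its results passed to `filter_links` before any metadata is extracted.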
84+
85+
**2) Building and running the web server**

i) The web server is implemented with socket programming and multithreading so that it can handle multiple HTTP requests from clients.

ii) Once the server is running, when a user opens the server URL in a web client (browser), the server first sends the required HTML, CSS and JavaScript files so that the main client program can run and the frontend can display the output.

iii) Once the dedicated client program (client.js) is running in secure mode, a public key and a private key are generated using the knapsack cryptographic algorithm. The server and client first exchange their public keys; the server then continues to listen for search requests from the client.

iv) From then on, whenever the server receives a search request from the client, it performs a cosine similarity search on the vector database/local file and sends the top results to the client. In secure mode, data is encrypted before each send operation and decrypted after each receive operation on the socket.
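The secure-mode key generation and the encrypt/decrypt steps can be sketched as follows, assuming the "knapsack cryptographic algorithm" is the classic Merkle-Hellman scheme (the README does not name the variant). The function names, the one-byte block size and the key sizes are assumptions for illustration. Merkle-Hellman is known to be breakable, so this is an educational sketch rather than production cryptography.

```python
import math
import random

BLOCK_BITS = 8  # one byte is encrypted per block in this sketch

def generate_keys(seed=None):
    """Return (public_key, private_key) for BLOCK_BITS-bit blocks."""
    rng = random.Random(seed)
    # Private superincreasing sequence: each term exceeds the sum of all earlier ones.
    w, total = [], 0
    for _ in range(BLOCK_BITS):
        term = total + rng.randint(1, 10)
        w.append(term)
        total += term
    q = total + rng.randint(1, 10)       # modulus larger than sum(w)
    r = rng.randrange(2, q)
    while math.gcd(r, q) != 1:           # multiplier must be invertible mod q
        r = rng.randrange(2, q)
    public = [(r * wi) % q for wi in w]  # disguised sequence, shared with the peer
    return public, (w, q, r)

def encrypt(text, public):
    """Encrypt UTF-8 text byte by byte; each byte becomes one integer."""
    cipher = []
    for byte in text.encode("utf-8"):
        bits = [(byte >> (BLOCK_BITS - 1 - i)) & 1 for i in range(BLOCK_BITS)]
        cipher.append(sum(bit * p for bit, p in zip(bits, public)))
    return cipher

def decrypt(cipher, private):
    """Recover the text using the private superincreasing sequence."""
    w, q, r = private
    r_inv = pow(r, -1, q)                # modular inverse (Python 3.8+)
    out = bytearray()
    for c in cipher:
        target = (c * r_inv) % q         # undo the modular multiplication
        byte = 0
        # Greedy subset-sum, largest term first, works because w is superincreasing.
        for i in range(BLOCK_BITS - 1, -1, -1):
            if w[i] <= target:
                target -= w[i]
                byte |= 1 << (BLOCK_BITS - 1 - i)
        out.append(byte)
    return out.decode("utf-8")

public_key, private_key = generate_keys(seed=42)
message = "python list comprehension"
ciphertext = encrypt(message, public_key)
recovered = decrypt(ciphertext, private_key)
```

In the flow above, each side would keep its own `private_key`, send its `public_key` to the peer, call `encrypt` with the peer's public key before each send, and call `decrypt` with its own private key after each receive.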
110+
111+
**3) Searching for a query**

i) The user visits the address where the server is running (for example, <http://192.168.29.37:8080/>). A webpage opens with an input box and a search button. In secure mode, the client receives the server’s public key.

ii) When the user types a search query and clicks the submit button, the client sends the query to the server. The server exposes an endpoint that accepts POST requests (POST /submit HTTP/1.1) and accepts the request.

iii) The server then calls a function that searches the vector database and sends the relevant search results back to the client.

iv) The results are displayed on the client’s webpage.
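The request handling in steps ii) and iii) might be sketched as below for a server that reads raw bytes from a socket. Only the `POST /submit` endpoint comes from the description above. The plain-text request body, the JSON response format and the names `parse_request`, `build_response`, `handle_request` and `fake_search` are assumptions made for this example.

```python
import json

def parse_request(raw):
    """Split a raw HTTP/1.1 request into (method, path, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    request_line = head.split(b"\r\n", 1)[0].decode("ascii")
    method, path, _version = request_line.split(" ", 2)
    return method, path, body

def build_response(status, payload):
    """Serialise payload as JSON inside a minimal HTTP/1.1 response."""
    body = json.dumps(payload).encode("utf-8")
    headers = (
        f"HTTP/1.1 {status}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n\r\n"
    )
    return headers.encode("ascii") + body

def handle_request(raw, search):
    """Route POST /submit to the search callable; anything else gets a 404."""
    method, path, body = parse_request(raw)
    if method == "POST" and path == "/submit":
        query = body.decode("utf-8")
        return build_response("200 OK", {"results": search(query)})
    return build_response("404 Not Found", {"error": "unknown endpoint"})

def fake_search(query):
    """Stand-in for the vector database lookup (hypothetical result shape)."""
    return [{"title": "Result for " + query, "link": "https://example.com"}]

raw_request = (
    b"POST /submit HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Content-Length: 12\r\n\r\n"
    b"python lists"
)
response = handle_request(raw_request, fake_search)
```

Keeping parsing and routing as pure functions over bytes makes them easy to call from a per-connection thread and to test without opening a socket.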
132+
133+
134+
**Future Work**

Future enhancements to the search engine may include increasing search speed, understanding and collecting more valuable metadata, providing a summary of the top results, refining the search algorithms for better accuracy and relevance, expanding the search database to cover a broader range of Python-related content, integrating advanced security features to mitigate emerging threats, and incorporating user feedback to continuously improve the user experience and functionality.
145+
146+
**Conclusion**

This domain-specific search engine for Python gives users a platform for finding relevant information within the Python programming ecosystem. By combining web crawling, a vector database, socket programming and a cryptographic protocol, it returns relevant search results while encrypting user interactions. In addition, multithreading and query-based client holding allow the server to save resources and serve a large number of clients at a time.
159+
160+
