|
1 |
| -## Create virtual environment |
| 1 | +<h1 align="center"> |
| 2 | +Project Setup |
| 3 | +</h1> |
| 4 | + |
| 5 | +### Create virtual environment |
2 | 6 |
|
3 | 7 | ``` python -m venv search_engine ```
|
4 | 8 |
|
5 |
| -## Activate virtual environment (from parent folder) |
| 9 | +### Activate virtual environment (from parent folder) |
6 | 10 |
|
7 | 11 | ```Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass```
|
8 | 12 |
|
9 | 13 | ``` .\search_engine\Scripts\activate ```
|
10 | 14 |
|
11 |
| -## Install required libraries |
| 15 | +### Install required libraries |
12 | 16 |
|
13 | 17 | ``` pip install -r version_requirements.txt ```
|
14 | 18 |
|
15 |
| -## Select kernel for Jupyter Notebook (ipynb) |
| 19 | +### Select kernel for Jupyter Notebook (ipynb) |
16 | 20 |
|
17 | 21 | From the top right corner, select ``` (parent_folder)/Scripts/python.exe ``` as the kernel and run the notebook
|
18 | 22 |
|
19 |
| -For a Python file, use ``` python file_name ``` to execute it |
| 23 | +For a Python file, use ``` python file_name ``` to execute it |
| 24 | + |
| 25 | +<h1 align="center"> |
| 26 | +Output View |
| 27 | +</h1> |
| 28 | +**Pinecone Side Vectors** |
| 29 | + |
| 30 | + |
| 31 | +**Client Side Search Results :** |
| 32 | + |
| 33 | + |
| 34 | + |
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | + |
| 39 | + |
| 40 | + |
| 41 | +**Server Side Messages :** |
| 42 | + |
| 43 | + |
| 44 | + |
| 45 | + |
| 46 | + |
| 47 | +<h1 align="center"> |
| 48 | +Design and Discussion |
| 49 | +</h1> |
| 50 | + |
| 51 | +**Group Members:** Tamal Mallick, Sushanta Das, Suvam Manna |
| 52 | + |
| 53 | +**Problem Description**: |
| 54 | + |
| 55 | +Building a **search engine for a specific domain** (Python) using **web crawling, socket** |
| 56 | + |
| 57 | +**programming, sentence embeddings, and a vector database** to return relevant results for that domain. The user |
| 58 | + |
| 59 | +gets the results of a search query from a **cosine similarity search** in the vector database. We also implemented and |
| 60 | + |
| 61 | +integrated the **Knapsack Cryptosystem** to securely transmit user queries and search results over the network. |
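
As an illustration of the cosine similarity search mentioned above, the following sketch ranks a few stored vectors against a query vector in pure Python. The 3-dimensional toy vectors, the `records` list, and the function names `cosine_similarity` and `top_k` are assumptions made for this example; the project itself stores 768-dimensional embeddings in Pinecone, which performs this ranking on the database side.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k=3):
    """Return the k records most similar to query_vec, best match first."""
    scored = [(cosine_similarity(query_vec, r["vector"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _score, record in scored[:k]]


# Hypothetical records: each has a title, a link, and an embedding vector.
records = [
    {"title": "Python lists", "link": "https://example.com/lists",
     "vector": [1.0, 0.0, 0.0]},
    {"title": "Python dicts", "link": "https://example.com/dicts",
     "vector": [0.0, 1.0, 0.0]},
    {"title": "List comprehensions", "link": "https://example.com/comprehensions",
     "vector": [0.9, 0.1, 0.0]},
]

for result in top_k([1.0, 0.0, 0.0], records, k=2):
    print(result["title"], result["link"])
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.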
| 62 | + |
| 63 | +**Algorithm and Design:** |
| 64 | + |
| 65 | +**1) Collecting web pages to make our search engine database** |
| 66 | + |
| 67 | +i) We implemented a multithreaded web crawler that starts from seed links containing the keyword |
| 68 | + |
| 69 | +“Python”. It collects the link of each webpage and then its metadata (such as the title, heading tags, and some |
| 70 | + |
| 71 | +paragraphs) to gather useful information about each link. This information is later used to |
| 72 | + |
| 73 | +search for the webpage URL. |
| 74 | + |
| 75 | +ii) We collect only valid links/webpages by ignoring links whose status code is in |
| 76 | + |
| 77 | +the range 400 to 499. |
| 78 | + |
| 79 | +iii) This data is stored in a vector database (Pinecone) as word embeddings (high-dimensional |
| 80 | + |
| 81 | +vectors with 768 dimensions). When a query is searched, the database returns the most similar |
| 82 | + |
| 83 | +results (each result has a title, link, and description). |
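
The metadata collection and status-code filtering described in this step could look roughly like the sketch below. It uses only the standard library's `html.parser` and works on HTML that has already been downloaded; the names `is_valid_status`, `MetadataParser`, and `extract_metadata`, the choice of `h1` to `h3` as heading tags, and the two-paragraph description limit are all assumptions for illustration, not the project's actual crawler code.

```python
from html.parser import HTMLParser


def is_valid_status(status_code):
    """Keep a page unless its HTTP status is a client error (400 to 499)."""
    return not (400 <= status_code <= 499)


class MetadataParser(HTMLParser):
    """Collects the text of <title>, heading, and paragraph elements."""

    TRACKED = {"title", "h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.paragraphs = []
        self._current = None   # tracked tag we are currently inside, if any
        self._buffer = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # closing tag of a nested element such as <b>
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if text:
            if tag == "title":
                self.title = text
            elif tag == "p":
                self.paragraphs.append(text)
            else:
                self.headings.append(text)
        self._current = None


def extract_metadata(html, max_paragraphs=2):
    """Return the title, headings, and a short description for one page."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "description": " ".join(parser.paragraphs[:max_paragraphs]),
    }
```

A crawler thread would call `is_valid_status` on each response before passing the page body to `extract_metadata`, and then hand the resulting text to the embedding step.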
| 84 | + |
| 85 | +**2) Building and running the webserver** |
| 86 | + |
| 87 | +i) The webserver is implemented using socket programming with multithreading, so it can handle |
| 88 | + |
| 89 | +multiple HTTP requests coming from clients. |
| 90 | + |
| 91 | +ii) Once the server is running, when anyone opens the server URL in a web client (browser), the server first |
| 92 | + |
| 93 | +sends the required HTML, CSS, and JavaScript files to run the main client program and |
| 94 | + |
| 95 | +give access to the frontend where outputs are shown. |
| 96 | + |
| 97 | +iii) Once the dedicated client program (client.js) is running, in Secure mode a public key |
| 98 | + |
| 99 | +and a private key are generated using the knapsack cryptographic algorithm. First, the server and client |
| 100 | + |
| 101 | +exchange their public keys. The server then continues to listen for search requests from the client. |
| 102 | + |
| 103 | +iv) From then on, whenever the server receives a search request from a client, it performs a cosine |
| 104 | + |
| 105 | +similarity search on the vector database/local file and sends the top results to the client. In |
| 106 | + |
| 107 | +Secure mode, encryption is performed before each send operation and decryption after |
| 108 | + |
| 109 | +each receive operation through the socket. |
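
To make the Secure mode key generation and the encrypt/decrypt steps more concrete, here is a minimal sketch of the textbook Merkle-Hellman knapsack scheme. It is an assumption that the project follows this classic construction; the function names, the 8-bit block size, and the small random ranges are chosen only for illustration and are not taken from the project's code.

```python
import random
from math import gcd

BLOCK_BITS = 8  # one byte per ciphertext number (illustrative choice)


def generate_keys(seed=None):
    """Return (public_key, private_key) for a toy knapsack cryptosystem."""
    rng = random.Random(seed)
    # Private superincreasing sequence: each term exceeds the sum of all before it.
    w, total = [], 0
    for _ in range(BLOCK_BITS):
        term = total + rng.randint(1, 10)
        w.append(term)
        total += term
    q = total + rng.randint(1, 10)      # modulus larger than sum(w)
    while True:
        r = rng.randrange(2, q)         # multiplier coprime to q
        if gcd(r, q) == 1:
            break
    public = [(r * wi) % q for wi in w]  # disguised sequence, safe to share
    return public, (w, q, r)


def encrypt_byte(byte, public):
    """Sum the public weights selected by the bits of the byte (MSB first)."""
    bits = [(byte >> (BLOCK_BITS - 1 - i)) & 1 for i in range(BLOCK_BITS)]
    return sum(bit * p for bit, p in zip(bits, public))


def decrypt_byte(cipher, private):
    """Undo the multiplier, then solve the easy superincreasing knapsack."""
    w, q, r = private
    target = (cipher * pow(r, -1, q)) % q  # modular inverse needs Python 3.8+
    byte = 0
    for i in range(BLOCK_BITS - 1, -1, -1):  # greedy, largest weight first
        if w[i] <= target:
            target -= w[i]
            byte |= 1 << (BLOCK_BITS - 1 - i)
    return byte


def encrypt(text, public):
    return [encrypt_byte(b, public) for b in text.encode("utf-8")]


def decrypt(numbers, private):
    return bytes(decrypt_byte(n, private) for n in numbers).decode("utf-8")
```

In this sketch the sender encrypts with the receiver's public key before each send, and the receiver decrypts with its own private key after each receive, which matches the exchange of public keys described above.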
| 110 | + |
| 111 | +**3) Searching for a query** |
| 112 | + |
| 113 | +i) The user visits the address at which the server is running (like <http://192.168.29.37:8080/>). A |
| 114 | + |
| 115 | +webpage opens with an input box and a search button. In Secure mode, the client receives |
| 116 | + |
| 117 | +the public key of the server. |
| 118 | + |
| 119 | + |
| 120 | + |
| 121 | +ii) When the user types a search query and hits the submit button, the client sends the query to the |
| 122 | + |
| 123 | +server. The server has an endpoint for accepting POST requests (POST /submit HTTP/1.1) and |
| 124 | + |
| 125 | +accepts the request. |
| 126 | + |
| 127 | +iii) The server then calls a function that searches the vector database and finally sends the relevant |
| 128 | + |
| 129 | +search results to the client. |
| 130 | + |
| 131 | +iv) The results are displayed on the client’s webpage. |
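
The request flow above can be sketched as a small thread-per-connection socket server that answers `POST /submit`. This is a simplified illustration under several assumptions: the query is sent as the raw request body, the reply is JSON, each request fits in a single `recv` call, and `search` is a hypothetical stand-in for the vector database lookup. The function names are not taken from the project.

```python
import json
import socket
import threading


def parse_request(raw):
    """Split a raw HTTP request into (method, path, body)."""
    head, _sep, body = raw.partition(b"\r\n\r\n")
    request_line = head.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    method, path, _version = request_line.split(" ", 2)
    return method, path, body


def build_response(status, payload):
    """Serialize payload as JSON and wrap it in a minimal HTTP/1.1 response."""
    body = json.dumps(payload).encode("utf-8")
    headers = (
        f"HTTP/1.1 {status}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + body


def search(query):
    """Hypothetical placeholder for the cosine similarity search."""
    return [{"title": f"Result for {query}", "link": "https://example.com"}]


def handle_client(conn):
    """Serve one request on an accepted connection, then close it."""
    with conn:
        raw = conn.recv(65536)  # sketch: assumes the request fits in one read
        if not raw:
            return
        try:
            method, path, body = parse_request(raw)
        except ValueError:      # malformed request line
            conn.sendall(build_response("400 Bad Request", {"error": "bad request"}))
            return
        if method == "POST" and path == "/submit":
            query = body.decode("utf-8", errors="replace")
            response = build_response("200 OK", {"results": search(query)})
        else:
            response = build_response("404 Not Found", {"error": "not found"})
        conn.sendall(response)


def serve(host="127.0.0.1", port=8080):
    """Accept connections forever, handling each one in its own thread."""
    with socket.create_server((host, port)) as server:
        while True:
            conn, _addr = server.accept()
            threading.Thread(target=handle_client, args=(conn,), daemon=True).start()
```

Calling `serve()` starts the loop. Giving each connection its own thread keeps one slow client from blocking the others, at the cost of one thread per open connection.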
| 132 | + |
| 133 | + |
| 134 | +**Future Work** |
| 135 | + |
| 136 | +Future enhancements to the search engine may include increasing search speed, |
| 137 | + |
| 138 | +understanding and collecting more valuable metadata, providing a summary of the top results, refining the search |
| 139 | + |
| 140 | +algorithms for enhanced accuracy and relevance, expanding the search database to encompass a broader |
| 141 | + |
| 142 | +range of Python-related content, integrating advanced security features to mitigate emerging threats, and |
| 143 | + |
| 144 | +incorporating user feedback to continuously enhance the user experience and functionality. |
| 145 | + |
| 146 | +**Conclusion** |
| 147 | + |
| 148 | +The creation of a domain-specific search engine tailored for Python represents a significant leap forward in |
| 149 | + |
| 150 | +providing users with a robust platform for accessing relevant information within the Python programming |
| 151 | + |
| 152 | +ecosystem. By seamlessly integrating advanced technologies such as web crawling, vector databases, socket |
| 153 | + |
| 154 | +programming, and cryptographic protocols, this search engine not only delivers swift and accurate search |
| 155 | + |
| 156 | +results but also ensures the security of user interactions. In addition, multithreading and query- |
| 157 | + |
| 158 | +based client holding allow us to save resources and serve a large number of clients at a time. |
| 159 | + |
| 160 | + |