searchEngine

A search engine written in C++.

Implementation

Idea

The search engine has three main stages: crawling, parsing, and indexing.

Crawling

Crawling follows the idea of the BFS algorithm: links are stored in a queue, and each link is then taken from the queue so that its HTML file can be downloaded.

Optimization:

Multithreading: crawl multiple links at the same time.
Bloom filter: detect links that have already been crawled.
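
The queue and the Bloom filter described above could fit together roughly as follows. This is a minimal sketch, not the project's actual code: the class names BloomFilter and Frontier, the filter size, the number of hash functions, and the double-hashing scheme are all assumptions made for illustration. A mutex guards the queue so that several crawler threads could share it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

// Hypothetical minimal Bloom filter: it may report false positives
// (an unseen link looks seen) but never false negatives.
class BloomFilter {
public:
    BloomFilter(std::size_t bit_count, std::size_t hash_count)
        : bits_(bit_count, false), hash_count_(hash_count) {
        assert(bit_count > 0 && hash_count > 0);
    }

    void insert(const std::string& key) {
        for (std::size_t i = 0; i < hash_count_; ++i) bits_[index(key, i)] = true;
    }

    bool possibly_contains(const std::string& key) const {
        for (std::size_t i = 0; i < hash_count_; ++i) {
            if (!bits_[index(key, i)]) return false;
        }
        return true;
    }

private:
    // Derive the i-th bit position by double hashing: h1 + i * h2.
    std::size_t index(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "#") | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t hash_count_;
};

// Hypothetical BFS frontier: a FIFO queue of links plus a Bloom filter
// that rejects links which look already crawled.
class Frontier {
public:
    explicit Frontier(std::size_t bit_count = 1 << 20, std::size_t hash_count = 4)
        : seen_(bit_count, hash_count) {}

    // Returns true if the link was enqueued, false if it looked already seen.
    bool push(const std::string& url) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (seen_.possibly_contains(url)) return false;
        seen_.insert(url);
        pending_.push(url);
        return true;
    }

    // Returns false when there is nothing left to crawl.
    bool pop(std::string& url) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return false;
        url = pending_.front();
        pending_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    BloomFilter seen_;
    std::queue<std::string> pending_;
};
```

A crawler thread would call pop to get the next link, download the page, and call push for every link found on it. Because a Bloom filter can return false positives, a small fraction of new links may be skipped; a larger bit count reduces that rate.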

Parsing

After the HTML files are downloaded, links and web content are extracted by searching for specific tags. The extracted links are pushed back into the queue used by the crawling process.

Optimization:

Multithreading: parse and crawl at the same time.
libcurl and std::regex_search: simplify the code.
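
As an illustration of finding links with std::regex_search, the sketch below pulls the href values out of anchor tags. The function name extract_links and the pattern itself are assumptions, not the project's code, and a regular expression like this only handles simple, double-quoted attributes rather than arbitrary HTML.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the href value of every <a ...> tag.
std::vector<std::string> extract_links(const std::string& html) {
    // One capture group holds the text between the double quotes.
    static const std::regex href_pattern(
        R"rx(<a\s[^>]*href\s*=\s*"([^"]*)")rx", std::regex::icase);

    std::vector<std::string> links;
    std::smatch match;
    auto begin = html.cbegin();
    // Search repeatedly, restarting just after the previous match.
    while (std::regex_search(begin, html.cend(), match, href_pattern)) {
        assert(match.size() == 2);  // whole match plus one capture group
        links.push_back(match[1].str());
        begin = match.suffix().first;
    }
    return links;
}
```

The returned links may be relative (for example /about), so a real crawler would still need to resolve them against the page URL before queueing them.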

Indexing

The web content from the parsing process is split into single words. Each word is then written to a file chosen by its first character.

Optimization:

Oleander Stemming Library: stem words to reduce data storage. For example, "interested", "interesting", and "interest" are all stored as "interest".
Stop words: remove unimportant words (e.g., I, to, in).
Invalid characters: remove characters such as !, ^, and *.
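
The word-splitting steps above might look like the following sketch. It is written under several assumptions: the function names are hypothetical, the stop-word list contains only the three examples given above, "invalid characters" is taken to mean anything that is not a letter or digit, the buckets are kept in memory instead of being written to files, and stemming is left out because it depends on the external library.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: lowercase the text, split it on non-alphanumeric
// characters, and drop stop words.
std::vector<std::string> tokenize(const std::string& text) {
    // Illustrative stop-word list only; a real list would be much longer.
    static const std::unordered_set<std::string> stop_words = {"i", "to", "in"};

    std::vector<std::string> words;
    std::string current;
    auto flush = [&] {
        if (!current.empty() && stop_words.count(current) == 0) {
            words.push_back(current);
        }
        current.clear();
    };
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else {
            flush();  // any other character ends the current word
        }
    }
    flush();
    return words;
}

// Hypothetical helper: group words by first character, standing in for
// the per-character index files.
std::map<char, std::vector<std::string>> bucket_by_first_char(
    const std::vector<std::string>& words) {
    std::map<char, std::vector<std::string>> buckets;
    for (const auto& word : words) {
        assert(!word.empty());
        buckets[word.front()].push_back(word);
    }
    return buckets;
}
```

In a file-based index, each bucket would be appended to the file for its character, with every word passed through the stemmer first.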

Search results

Pages are ranked by the number of times the keywords appear on each page. Instead of crawling again each time the user searches, the program returns results from the indexed data, which reduces search time.
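
A count-based ranking of this kind can be sketched as below. The Index type, the function name rank_pages, and the alphabetical tie-break are assumptions for illustration; the sketch also assumes the index has already been loaded into memory as a map from word to per-page occurrence counts.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-memory index: word -> (page -> number of occurrences).
using Index = std::unordered_map<std::string, std::unordered_map<std::string, int>>;
using RankedPage = std::pair<std::string, int>;

// Sum the occurrences of every keyword per page, then sort by score.
std::vector<RankedPage> rank_pages(const Index& index,
                                   const std::vector<std::string>& keywords) {
    std::unordered_map<std::string, int> scores;
    for (const auto& keyword : keywords) {
        const auto found = index.find(keyword);
        if (found == index.end()) continue;  // keyword was never indexed
        for (const auto& entry : found->second) {
            assert(entry.second > 0);
            scores[entry.first] += entry.second;
        }
    }

    std::vector<RankedPage> ranked(scores.begin(), scores.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const RankedPage& a, const RankedPage& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;  // stable order for equal scores
              });
    return ranked;
}
```

The keywords should go through the same cleaning and stemming as the indexed text, otherwise a query such as "interesting" would not match the stored stem "interest".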

External libraries

curlpp: Download web content.
Oleander Stemming Library: Reduce the amount of stored data.
