
Reusability

thibaut-dst edited this page Jan 17, 2025 · 4 revisions

The service is designed to be flexible and easily adapted to new topics or custom configurations. The sections below explain how to customize and reuse it effectively.

1️⃣ Customizing the Scraping Behavior

The core scraping functionality is handled by the following function:

Core Function:

  • Path: functions/scraping.py
  • Function: scrape_webpages_to_db(keywords_list: list, collection)

Default Behavior:

  • This function queries the Google API to fetch the first 3 results for each keyword.
  • It checks the existing database to prevent storing duplicate documents.
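The duplicate check described above can be sketched as follows. This is a simplified, standalone illustration: the real `scrape_webpages_to_db` in functions/scraping.py calls the Google search helper and writes to a MongoDB collection, while here both are stubbed (the `fake_search` function and the list-based `collection` are hypothetical stand-ins) so the control flow can run on its own.

```python
def fake_search(query, num_results=3):
    # Hypothetical stand-in for the Google search call used by the service.
    return [f"https://example.com/{query}/{i}" for i in range(num_results)]

def scrape_webpages_to_db(keywords_list, collection, num_results=3):
    """Fetch search results per keyword, storing only URLs not already present."""
    stored = []
    for combined in keywords_list:
        for url in fake_search(combined, num_results=num_results):
            # Skip documents already in the database to prevent duplicates.
            if any(doc["url"] == url for doc in collection):
                continue
            collection.append({"url": url, "keyword": combined})  # real code inserts into MongoDB
            stored.append(url)
    return stored

# One URL is already stored, so only the two new ones are kept.
collection = [{"url": "https://example.com/flood risk/0"}]
new_urls = scrape_webpages_to_db(["flood risk"], collection)
print(len(new_urls))  # → 2 (the duplicate /0 is skipped)
```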

How to Expand the Search: to fetch more results per keyword (useful for broader coverage or when rerunning the pipeline), follow these steps:

  1. Stop the Docker Container:

    • Run the following command to stop the container:
      docker-compose down
  2. Modify the Scraping Behavior:

    • Open functions/scraping.py.
    • Locate the num_results parameter (around line 181):
    for url in search(combined, num_results=3): # Limited to 3 results
    • Change its value from 3 to a higher number (e.g., 10) to expand the search results.
  3. Restart the Docker Container:

    • After saving the changes, restart the container to apply the updates:
      docker-compose up -d
  4. Rerun the Pipelines:

    • Go to [base_url]/launch-pipeline and restart the data collection and NLP processing pipelines.
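The four steps above can be condensed into the following shell sketch. `$EDITOR` stands in for whatever editor you use to change `num_results`, and `[base_url]` must be replaced with your deployment's base URL; the wiki describes opening the pipeline page in a browser, so the final request is shown only as an illustration.

```shell
# 1. Stop the running stack so the code change is picked up on restart.
docker-compose down

# 2. Edit functions/scraping.py and raise num_results (around line 181), e.g.:
#      for url in search(combined, num_results=10):
"$EDITOR" functions/scraping.py

# 3. Restart the containers in the background.
docker-compose up -d

# 4. Re-trigger the pipelines (normally done by visiting this page in a browser).
curl "[base_url]/launch-pipeline"
```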

2️⃣ Adapting the System for Different Topics

The system can easily be adapted for different topics by updating the vocabulary file:

Path: data/Vocabulaire_Expert_CSV.csv

  • File Structure: The file must keep the same filename and follow the exact three-column format:

    • Vocabulaire de recherche (Search Vocabulary)
    • Localisation de recherche (Search Location)
    • Vocabulaire d'analyse (Analysis Vocabulary)
  • Usage:

    • The first two columns are combined to generate search queries.
    • All three columns are used for word tracking, document filtering, and powering the NLP tasks.

By updating this CSV file with topic-specific terms while maintaining its structure, the service can be repurposed for new domains without additional modifications.
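The query-generation step described above (combining the first two columns) can be sketched like this. The sample rows are hypothetical topic-specific entries, and the exact combination logic lives in the service's own pipeline code; only the three-column structure and column names are taken from the wiki.

```python
import csv
import io

# Hypothetical rows in the required three-column format of
# data/Vocabulaire_Expert_CSV.csv (column names must match exactly).
sample = """Vocabulaire de recherche,Localisation de recherche,Vocabulaire d'analyse
pollution,Marseille,qualité de l'eau
érosion,Camargue,littoral
"""

queries = []
with io.StringIO(sample) as f:
    for row in csv.DictReader(f):
        # The first two columns are combined into one search query;
        # all three columns also feed word tracking and the NLP tasks.
        queries.append(f"{row['Vocabulaire de recherche']} {row['Localisation de recherche']}")

print(queries)  # → ['pollution Marseille', 'érosion Camargue']
```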
