
Reusability

thibaut-dst edited this page Jan 17, 2025 · 4 revisions

The service is designed to be flexible and easily adapted to new topics or custom configurations. The sections below explain how to customize and reuse it effectively.

1️⃣ Customizing the Scraping Behavior

The core scraping functionality is handled by the following function:

Core Function:

  • Path: functions/scraping.py
  • Function: scrape_webpages_to_db(keywords_list: list, collection)

Default Behavior:

  • This function queries the Google API to fetch the first 3 results for each keyword.
  • It checks the existing database to prevent storing duplicate documents.
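The duplicate check described above can be sketched as follows. This is a simplified, standalone illustration: the real `scrape_webpages_to_db` in functions/scraping.py calls the Google search helper and writes to a MongoDB collection, while here both are stubbed (the `fake_search` function and the list-based `collection` are hypothetical stand-ins) so the control flow can run on its own.

```python
def fake_search(query, num_results=3):
    # Hypothetical stand-in for the Google search call used by the service.
    return [f"https://example.com/{query}/{i}" for i in range(num_results)]

def scrape_webpages_to_db(keywords_list, collection, num_results=3):
    """Fetch search results per keyword, storing only URLs not already present."""
    stored = []
    for combined in keywords_list:
        for url in fake_search(combined, num_results=num_results):
            # Skip documents already in the database to prevent duplicates.
            if any(doc["url"] == url for doc in collection):
                continue
            collection.append({"url": url, "keyword": combined})  # real code inserts into MongoDB
            stored.append(url)
    return stored

# One URL is already stored, so only the two new ones are kept.
collection = [{"url": "https://example.com/flood risk/0"}]
new_urls = scrape_webpages_to_db(["flood risk"], collection)
print(len(new_urls))  # → 2 (the duplicate /0 is skipped)
```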

How to Expand the Search: to fetch more results per keyword (useful for broader coverage or when rerunning the pipeline), follow these steps:

  1. Stop the Docker Container:

    • Run the following command to stop the container:
      docker-compose down
  2. Modify the Scraping Behavior:

    • Open functions/scraping.py.
    • Locate the num_results parameter (around line 181):
    for url in search(combined, num_results=3): # Limited to 3 results
    • Change its value from 3 to a higher number (e.g., 10) to expand the search results.
  3. Restart the Docker Container:

    • After saving the changes, restart the container to apply the updates:
      docker-compose up -d
  4. Rerun the Pipelines:

    • Go to [base_url]/launch-pipeline and restart the data collection and NLP processing pipelines.
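The four steps above can be condensed into the following shell sketch. `$EDITOR` stands in for whatever editor you use to change `num_results`, and `[base_url]` must be replaced with your deployment's base URL; the wiki describes opening the pipeline page in a browser, so the final request is shown only as an illustration.

```shell
# 1. Stop the running stack so the code change is picked up on restart.
docker-compose down

# 2. Edit functions/scraping.py and raise num_results (around line 181), e.g.:
#      for url in search(combined, num_results=10):
"$EDITOR" functions/scraping.py

# 3. Restart the containers in the background.
docker-compose up -d

# 4. Re-trigger the pipelines (normally done by visiting this page in a browser).
curl "[base_url]/launch-pipeline"
```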

2️⃣ Adapting the System for Different Topics

The system can easily be adapted for different topics by updating the vocabulary file:

Path: data/Vocabulaire_Expert_CSV.csv

  • File Structure: The file must keep the same filename and follow the exact three-column format:

    • Vocabulaire de recherche (Search Vocabulary)
    • Localisation de recherche (Search Location)
    • Vocabulaire d'analyse (Analysis Vocabulary)
  • Usage:

    • The first two columns are combined to generate search queries.
    • All three columns are used for word tracking, document filtering, and powering the NLP tasks.

By updating this CSV file with topic-specific terms while maintaining its structure, the service can be repurposed for new domains without additional modifications.
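The query-generation step described above (combining the first two columns) can be sketched like this. The sample rows are hypothetical topic-specific entries, and the exact combination logic lives in the service's own pipeline code; only the three-column structure and column names are taken from the wiki.

```python
import csv
import io

# Hypothetical rows in the required three-column format of
# data/Vocabulaire_Expert_CSV.csv (column names must match exactly).
sample = """Vocabulaire de recherche,Localisation de recherche,Vocabulaire d'analyse
pollution,Marseille,qualité de l'eau
érosion,Camargue,littoral
"""

queries = []
with io.StringIO(sample) as f:
    for row in csv.DictReader(f):
        # The first two columns are combined into one search query;
        # all three columns also feed word tracking and the NLP tasks.
        queries.append(f"{row['Vocabulaire de recherche']} {row['Localisation de recherche']}")

print(queries)  # → ['pollution Marseille', 'érosion Camargue']
```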
