139 changes: 139 additions & 0 deletions .codeboarding/HTML_Data_Extractor.md
```mermaid

graph LR

CLI_Handler["CLI Handler"]

Web_Scraper["Web Scraper"]

HTML_Data_Extractor["HTML Data Extractor"]

Data_Processor["Data Processor"]

Output_Handler["Output Handler"]

CLI_Handler -- "invokes" --> Data_Processor

Data_Processor -- "calls" --> Web_Scraper

Web_Scraper -- "provides" --> HTML_Data_Extractor

HTML_Data_Extractor -- "provides" --> Data_Processor

Data_Processor -- "sends" --> Output_Handler

click HTML_Data_Extractor href "https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/.codeboarding//HTML_Data_Extractor.md" "Details"

```



[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org)



## Details



This graph captures the end-to-end flow of the `extract_occurrences.py` utility. The CLI Handler parses the search term, date range, and optional output file, then invokes the Data Processor, which steps through the requested years. For each year, the Data Processor calls the Web Scraper to fetch the Google Scholar results page and the HTML Data Extractor to parse the result count out of the raw HTML, and finally sends the `year,results` pairs to the Output Handler for CSV and console output.



### CLI Handler

This component is responsible for parsing command-line arguments (search term, start date, end date, and optional output file), validating the input, and orchestrating the overall execution flow of the script. It serves as the primary entry point for the utility.
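
As a rough illustration of the entry point described above (lines 72:92 of `extract_occurrences.py`), an `argparse`-based equivalent might look like the sketch below. The argument names and the use of `argparse` are assumptions for illustration, not taken from the repository.

```python
import argparse


def parse_args():
    # Hypothetical argparse-based equivalent of the script's entry point;
    # the real script may read sys.argv directly.
    parser = argparse.ArgumentParser(
        description="Count yearly Google Scholar results for a search term.")
    parser.add_argument("term", help="search term to query")
    parser.add_argument("start_year", type=int, help="first year of the range")
    parser.add_argument("end_year", type=int, help="last year of the range")
    parser.add_argument("output_file", nargs="?", default="out.csv",
                        help="optional CSV file for the results")
    args = parser.parse_args()
    if args.start_year > args.end_year:
        parser.error("start_year must not be after end_year")
    return args
```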





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L72-L92" target="_blank" rel="noopener noreferrer">`extract_occurrences.py` (72:92)</a>





### Web Scraper

This component handles the fetching of web page content from Google Scholar. It constructs the appropriate URL with search parameters, sets user-agent headers, manages cookies for session persistence, and executes the HTTP request to retrieve the raw HTML response.
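
A simplified sketch of the kind of fetch described here, using the standard library's `urllib` with a cookie-aware opener. The Scholar query parameters (`as_ylo`, `as_yhi`) and the User-Agent string are assumptions for illustration; in the repository, `get_num_results` combines fetching and parsing in a single function.

```python
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def fetch_scholar_page(term, year, cookie_jar):
    # Build a Google Scholar query restricted to a single year; the parameter
    # names are illustrative, not copied from the repository.
    query = urlencode({"q": term, "as_ylo": year, "as_yhi": year, "hl": "en"})
    url = f"https://scholar.google.com/scholar?{query}"
    # A browser-like User-Agent and a shared cookie jar keep the session
    # looking consistent across requests.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    opener = build_opener(HTTPCookieProcessor(cookie_jar))
    with opener.open(request) as response:
        return response.read().decode("utf-8", errors="replace")
```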





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L16-L46" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_num_results` (16:46)</a>





### HTML Data Extractor [[Expand]](./HTML_Data_Extractor.md)

This component is dedicated to parsing the raw HTML content received from the `Web Scraper`. It utilizes BeautifulSoup to navigate the HTML structure, specifically locating the `div` element containing the search result count (`id="gs_ab_md"`). It then employs regular expressions to accurately extract and format the numerical count of search results, handling cases where no results are found.
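
A minimal sketch of the parsing step described above: BeautifulSoup locates the `id="gs_ab_md"` banner and a regular expression pulls out the count. The exact regular expression and the zero fallback for the no-results case are assumptions; only the `gs_ab_md` lookup comes from the description.

```python
import re

from bs4 import BeautifulSoup


def extract_result_count(html):
    # Locate the Scholar result-count banner, e.g. "About 1,230 results",
    # and return the number as an int. Returns 0 when no results are reported.
    soup = BeautifulSoup(html, "html.parser")
    banner = soup.find("div", id="gs_ab_md")
    if banner is None:
        return 0
    match = re.search(r"[\d.,]+", banner.get_text())
    if match is None:
        return 0
    return int(re.sub(r"[.,]", "", match.group(0)))
```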





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L16-L46" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_num_results` (16:46)</a>





### Data Processor

This component manages the core logic of iterating through the specified date range (year by year). For each year, it calls the `Web Scraper` and `HTML Data Extractor` (via `get_num_results`) to obtain the search result count. It also incorporates a delay (`time.sleep`) to mitigate potential rate-limiting by Google Scholar and prepares the `year,results` data for output.
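
A rough sketch of the year-by-year loop described here. The generator shape, the `delay_seconds` default, and the `get_num_results` call signature are assumptions; in the repository, `get_range` handles output inside its own loop (see Output Handler below).

```python
import time


def iterate_years(get_num_results, term, start_year, end_year, delay_seconds=1.0):
    # Walk the inclusive year range, fetching the result count for each year.
    # `get_num_results` is the scraping-plus-parsing callable (Web Scraper and
    # HTML Data Extractor); passing it in keeps this sketch self-contained.
    for year in range(start_year, end_year + 1):
        count = get_num_results(term, year)
        yield year, count
        # A fixed pause between requests reduces the chance of Google Scholar
        # rate-limiting or CAPTCHA-blocking the session.
        time.sleep(delay_seconds)
```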





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L49-L69" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_range` (49:69)</a>





### Output Handler

This component is responsible for formatting and presenting the final results of the keyword analysis. It writes the `year,results` data to a specified CSV file and simultaneously prints the same information to the console, ensuring both persistence and immediate feedback to the user. It also handles file opening and closing.
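
A hedged sketch of this output step, writing `year,results` rows to a CSV file while echoing them to the console. The use of the `csv` module and the header row are assumptions; the script may format and write the lines directly.

```python
import csv


def write_results(rows, output_path):
    # Write (year, count) pairs to a CSV file and echo each row to the console.
    # The context manager guarantees the file is closed even on errors.
    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["year", "results"])
        for year, count in rows:
            writer.writerow([year, count])
            print(f"{year},{count}")
```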





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L49-L69" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_range` (49:69)</a>










### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
167 changes: 167 additions & 0 deletions .codeboarding/on_boarding.md
```mermaid

graph LR

Application_Controller["Application Controller"]

Cookie_Manager["Cookie Manager"]

Data_Retrieval_Loop["Data Retrieval Loop"]

Web_Scraper["Web Scraper"]

HTML_Data_Extractor["HTML Data Extractor"]

Output_Writer["Output Writer"]

Application_Controller -- "initiates" --> Data_Retrieval_Loop

Application_Controller -- "instructs to save cookies" --> Cookie_Manager

Cookie_Manager -- "provides session cookies to" --> Web_Scraper

Data_Retrieval_Loop -- "calls" --> Web_Scraper

Data_Retrieval_Loop -- "passes extracted data to" --> Output_Writer

Web_Scraper -- "uses for requests" --> Cookie_Manager

Web_Scraper -- "provides raw HTML to" --> HTML_Data_Extractor

HTML_Data_Extractor -- "returns extracted results to" --> Data_Retrieval_Loop

click HTML_Data_Extractor href "https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/.codeboarding//HTML_Data_Extractor.md" "Details"

```



[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org)



## Details



The `academic-keyword-occurrence` project is a command-line utility for web scraping and data analysis, implemented as a single monolithic script with clear functional separation. Based on the control-flow and source analysis, its high-level data flow can be described through six core components, each with distinct responsibilities and well-defined interactions.



### Application Controller

The central orchestrator of the application. It parses command-line arguments, initializes the data extraction process, and ensures session cookies are saved upon completion.





**Related Classes/Methods**:



- `Application Controller` (72:92)





### Cookie Manager

Responsible for loading and persisting HTTP cookies, crucial for maintaining session state with Google Scholar and bypassing CAPTCHA challenges.
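
A minimal illustration of cookie persistence with the standard library's `http.cookiejar`. The `MozillaCookieJar` class, the `cookies.txt` file name, and the load/save flags are assumptions for illustration, not taken from the repository.

```python
from http.cookiejar import MozillaCookieJar


def load_cookie_jar(path="cookies.txt"):
    # Reuse cookies from an earlier (possibly CAPTCHA-cleared) session if the
    # file exists; otherwise start with an empty jar that can be saved later.
    jar = MozillaCookieJar(path)
    try:
        jar.load(ignore_discard=True, ignore_expires=True)
    except FileNotFoundError:
        pass
    return jar


def save_cookie_jar(jar):
    # Persist the session cookies so a later run can reuse the same session.
    jar.save(ignore_discard=True, ignore_expires=True)
```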





**Related Classes/Methods**:



- `Cookie Manager` (11:15)

- `Cookie Manager` (92:92)





### Data Retrieval Loop

Manages the iterative process of fetching data for each year within the specified range. It coordinates calls to the Web Scraper and HTML Data Extractor, and passes the processed results to the Output Writer.





**Related Classes/Methods**:



- `Data Retrieval Loop` (50:70)





### Web Scraper

Constructs Google Scholar query URLs, sends HTTP requests, and retrieves the raw HTML content, utilizing managed cookies for session continuity.





**Related Classes/Methods**:



- `Web Scraper` (17:29)





### HTML Data Extractor [[Expand]](./HTML_Data_Extractor.md)

Parses the raw HTML content received from the Web Scraper using BeautifulSoup and regular expressions to accurately extract the numerical count of search results.





**Related Classes/Methods**:



- `HTML Data Extractor` (31:47)





### Output Writer

Formats the extracted year and result count into a CSV string and writes this data to both the specified output file and the console.





**Related Classes/Methods**:



- `Output Writer` (53:57)

- `Output Writer` (65:67)









### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)