Commit 984411a

Merge pull request #360 from harshitg927/updates/week9
chore(docs): Add documentation for text-phrases-bulk for week 5, week… Reviewed-by: shaheem.azmal@siemens.com
2 parents a5d26a4 + 4f0bb74 commit 984411a

5 files changed: +393 -1 lines changed

docs/2025/text-phrases-bulk/updates/2025-07,01.md renamed to docs/2025/text-phrases-bulk/updates/2025-07-01.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ SPDX-License-Identifier: CC-BY-SA-4.0
 SPDX-FileCopyrightText: 2025 Harshit Gandhi <gandhiharshit716@gmail.com>
 -->
 
-# Week 4
+# Week 5
 
 _(July 01, 2025 – July 07, 2025)_

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
---
title: Week 6
author: Harshit Gandhi
tags: [gsoc25]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2025 Harshit Gandhi <gandhiharshit716@gmail.com>
-->

# Week 6

_(July 08, 2025 – July 15, 2025)_

## Meeting 1

**Date:** July 08, 2025
**Attendees:**

- [Harshit Gandhi](https://github.com/harshitg927)
- [Kaushlendra](https://github.com/Kaushl2208)
- [Sushant](https://github.com/its-sushant)
- [Soham](https://github.com/soham4abc)
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

## Summary

- Presented a comprehensive overview of the frontend implementation completed last week.
- Received feedback from mentors on improving the frontend functionality and user experience.
- **Pagination Implementation**: Shift from client-side to server-side pagination in the Custom Text Management page to improve performance.
- **UI Correction**: Fix an incorrect title in the decider agent section of the uploads tab for better clarity.

## Progress

- Refactored the pagination logic in the Custom Text Management system to operate server-side, improving performance.
- Performed minor refinements to align the implementation with production-ready standards.
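
As a rough illustration of what server-side pagination involves, the sketch below turns a page number into the row window a backend query would request, plus the page count the UI needs. The names `PageWindow`, `page_window`, and `page_count` are hypothetical and are not taken from the FOSSology code; the actual implementation may look quite different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the row window a paginated query asks for. */
typedef struct {
  size_t limit;   /* rows to fetch (SQL LIMIT)  */
  size_t offset;  /* rows to skip  (SQL OFFSET) */
} PageWindow;

/* Translate a 1-based page number into a LIMIT/OFFSET window. */
PageWindow page_window(size_t page, size_t pageSize) {
  assert(page >= 1 && pageSize >= 1);
  PageWindow window = { pageSize, (page - 1) * pageSize };
  return window;
}

/* Number of pages needed to show totalRows rows (ceiling division). */
size_t page_count(size_t totalRows, size_t pageSize) {
  assert(pageSize >= 1);
  return (totalRows + pageSize - 1) / pageSize;
}
```

With 25 rows per page, page 3 maps to `LIMIT 25 OFFSET 50`, so the server returns only the rows the table is about to display instead of the full result set.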

### Next Steps

With the frontend development now complete, the focus will shift to developing the agent part of the project.
Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
---
title: Week 7
author: Harshit Gandhi
tags: [gsoc25]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2025 Harshit Gandhi <gandhiharshit716@gmail.com>
-->

# Week 7

_(July 15, 2025 – July 22, 2025)_

## Meeting 1

**Date:** July 15, 2025
**Attendees:**

- [Harshit Gandhi](https://github.com/harshitg927)
- [Kaushlendra](https://github.com/Kaushl2208)
- [Sushant](https://github.com/its-sushant)
- [Soham](https://github.com/soham4abc)
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

## Summary

- Presented the fixes that completed the frontend part of the project. One of the major enhancements was the implementation of server-side pagination.
- Got insights on how to start implementing the new agent.

## Progress

- Conducted a thorough review of the existing codebase to identify agents with functionality comparable to the new agent.
- Discovered that the MONKBULK agent closely aligns with the requirements of the new agent, making it an ideal reference implementation.
- Performed an in-depth analysis of the MONKBULK agent, resulting in comprehensive documentation of its architecture, logic, and workflow.

### MONKBULK Documentation

#### Overview
The MonkBulk agent is a specialized FOSSology component that performs bulk license scanning operations. It extends the core Monk agent to handle batch processing of license text matching across multiple files within an upload tree structure.

#### Core Architecture

##### Main Components
- **Entry Point**: `monkbulk.c` - Main executable with scheduler integration
- **Database Layer**: `database.c` - PostgreSQL interactions and queries
- **Text Processing**: `string_operations.c` - Tokenization with custom delimiters
- **File Operations**: `file_operations.c` - File I/O and content reading
- **License Processing**: `license.c` - License text handling and extraction
- **Pattern Matching**: `match.c` - Core matching algorithms
- **Shared Headers**: `monk.h`, `monkbulk.h` - Data structures and constants

##### Key Data Structures

```c
#include <stdbool.h>  // bool; fo_dbManager is provided by libfossology

// Individual license actions (declared first because BulkArguments refers to it)
typedef struct {
  long licenseId;
  int removing;  // 1 = removing, 0 = adding
  char* comment;
  char* reportinfo;
  char* acknowledgement;
} BulkAction;

// Core bulk operation parameters
typedef struct {
  long bulkId;
  long uploadTreeId;
  long uploadTreeLeft, uploadTreeRight;
  long licenseId;
  int uploadId, jobId, userId, groupId;
  char* refText;
  char* delimiters;
  bool ignoreIrre;
  bool scanFindings;
  BulkAction** actions;
} BulkArguments;

// Agent state management
typedef struct {
  fo_dbManager* dbManager;
  int agentId;
  int scanMode;  // MODE_BULK = 3
  int verbosity;
  bool ignoreFilesWithMimeType;
  void* ptr;  // Points to BulkArguments
} MonkState;
```

#### Processing Flow

##### 1. Initialization
- Connect to database via FOSSology scheduler
- Query agent ID and register with system
- Set scan mode to `MODE_BULK`

##### 2. Job Processing Loop
```c
while (fo_scheduler_next() != NULL) {
  // Parse bulk ID from job parameters
  // Query bulk arguments from database
  // Create ARS (Agent Result Set) entry
  // Execute bulk identification
  // Update ARS with results
  // Clean up resources
}
```
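
The first step of the loop, parsing the bulk ID from the job parameters, could look roughly like the sketch below. It assumes the scheduler hands the agent the ID as a decimal string; `parse_bulk_id` is a hypothetical helper written for illustration and is not the function MONKBULK uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive bulk ID from a job-parameter string.
 * Returns true and stores the ID on success, false on malformed input. */
bool parse_bulk_id(const char* param, long* bulkIdOut) {
  assert(bulkIdOut != NULL);  /* caller must provide storage */
  if (param == NULL) return false;

  errno = 0;
  char* end = NULL;
  long value = strtol(param, &end, 10);
  if (errno != 0 || end == param) return false;  /* overflow or no digits */

  /* Allow trailing whitespace such as the newline ending a scheduler line. */
  while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n') end++;
  if (*end != '\0' || value <= 0) return false;

  *bulkIdOut = value;
  return true;
}
```

Rejecting malformed input up front keeps a bad job parameter from turning into a database lookup for a nonsensical ID.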

##### 3. Bulk Identification Process
- Parse `BulkArguments` from database queries
- Tokenize reference license text using custom delimiters
- Query files within upload tree boundaries (left/right traversal)
- Multi-threaded processing with OpenMP
- Match files against license patterns
- Save results to database via callbacks

#### Database Integration

##### Key Tables
- `license_ref_bulk` - Bulk operation parameters
- `license_set_bulk` - License actions for bulk operations
- `uploadtree` - File system tree structure
- `clearing_event` - License clearing decisions
- `highlight_bulk` - Match highlighting information

##### Critical SQL Queries
```sql
-- Bulk parameters retrieval
SELECT ut.upload_fk, ut.uploadtree_pk, lrb.user_fk, lrb.group_fk,
       lrb.rf_text, lrb.ignore_irrelevant, lrb.bulk_delimiters, lrb.scan_findings
FROM license_ref_bulk lrb
INNER JOIN uploadtree ut ON ut.uploadtree_pk = lrb.uploadtree_fk
WHERE lrb_pk = $1;

-- File selection within tree boundaries
SELECT DISTINCT pfile_fk FROM uploadtree
WHERE upload_fk = $1 AND (lft BETWEEN $2 AND $3) AND pfile_fk != 0;
```
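
The second query relies on the left/right numbering of the upload tree: every node inside a subtree has a `lft` value between the subtree root's left and right boundaries. The sketch below applies the same filter in memory to make that rule concrete. `TreeNode`, `in_bulk_scope`, and `count_nodes_in_scope` are hypothetical names, and unlike the SQL above the count does not de-duplicate `pfile` IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for an uploadtree row. */
typedef struct {
  long lft;      /* left boundary of the node */
  long rgt;      /* right boundary of the node */
  long pfileId;  /* 0 means the node carries no file content */
} TreeNode;

/* Same condition as the SQL filter: lft BETWEEN left AND right AND pfile != 0. */
bool in_bulk_scope(const TreeNode* node, long left, long right) {
  assert(node != NULL);
  return node->lft >= left && node->lft <= right && node->pfileId != 0;
}

/* Count the nodes that a bulk scan rooted at [left, right] would visit. */
size_t count_nodes_in_scope(const TreeNode* nodes, size_t nodeCount,
                            long left, long right) {
  size_t matches = 0;
  for (size_t i = 0; i < nodeCount; i++) {
    if (in_bulk_scope(&nodes[i], left, right)) matches++;
  }
  return matches;
}
```

Because membership is a simple range check, selecting a whole directory needs no recursive traversal.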

#### Text Processing & Tokenization

##### Default Configuration
```c
#define DELIMITERS " \t\n\r\f#^%,*"
#define MAX_ALLOWED_DIFF_LENGTH 256
#define MIN_ADJACENT_MATCHES 3
#define MIN_ALLOWED_RANK 66
```

##### Token Structure
```c
#include <stdint.h>  // uint32_t

typedef struct {
  unsigned int length;
  unsigned int removedBefore;
  uint32_t hashedContent;
} Token;
```

##### Tokenization Process
- Custom delimiter support via `bulk_delimiters` field
- Escape sequence processing (`\n`, `\t`, etc.)
- Special handling for comment delimiters (`//`, `/*`, `*/`)
- Hash-based token comparison for efficiency
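
A minimal sketch of delimiter-based tokenization into the `Token` layout shown above follows. It is a simplification under stated assumptions: FNV-1a stands in for whatever hash Monk actually uses, `removedBefore` is taken to mean the number of delimiter characters skipped before the token, and the special handling of comment delimiters is left out. `tokenize` and `fnv1a` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  unsigned int length;         /* characters in the token */
  unsigned int removedBefore;  /* delimiter characters skipped before it */
  uint32_t hashedContent;      /* hash of the token text */
} Token;

/* FNV-1a, used only as a stand-in hash for this sketch. */
static uint32_t fnv1a(const char* s, size_t n) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < n; i++) {
    h ^= (unsigned char)s[i];
    h *= 16777619u;
  }
  return h;
}

/* Split text at any character in delimiters. Writes at most maxTokens
 * tokens to out and returns how many were written. */
size_t tokenize(const char* text, const char* delimiters,
                Token* out, size_t maxTokens) {
  assert(text != NULL && delimiters != NULL && out != NULL);
  size_t count = 0;
  unsigned int removed = 0;
  const char* p = text;
  while (*p != '\0' && count < maxTokens) {
    if (strchr(delimiters, *p) != NULL) {  /* skip one delimiter character */
      removed++;
      p++;
      continue;
    }
    size_t len = strcspn(p, delimiters);   /* token runs to next delimiter */
    out[count].length = (unsigned int)len;
    out[count].removedBefore = removed;
    out[count].hashedContent = fnv1a(p, len);
    count++;
    removed = 0;
    p += len;
  }
  return count;
}
```

Storing a hash instead of the token text lets two tokens be compared with a single integer comparison, which is the efficiency gain the last bullet refers to.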

#### Multi-threading Implementation

##### OpenMP Integration
```c
#ifdef MONK_MULTI_THREAD
  #pragma omp parallel
#endif
{
  MonkState* threadLocalState = &threadLocalStateStore;
  threadLocalState->dbManager = fo_dbManager_fork(state->dbManager);

  #pragma omp for schedule(dynamic)
  for (int i = 0; i < resultsCount; i++) {
    // Process files in parallel
  }
}
```

##### Thread Safety
- Each thread gets isolated database connection
- Thread-local state copies prevent race conditions
- Shared resources protected by OpenMP directives
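
The pattern behind these three points can be sketched without any FOSSology types: variables declared inside the parallel region are private to each thread, and the one shared variable is only updated inside a `critical` section. `square_all` is a hypothetical example function; the pragmas are guarded by `_OPENMP` so the code also builds and runs serially when OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Squares every input element and returns how many items were processed. */
int square_all(const int* input, int* output, int count) {
  assert(input != NULL && output != NULL && count >= 0);
  int processed = 0;  /* shared between all threads */
#ifdef _OPENMP
#pragma omp parallel
#endif
  {
    int localProcessed = 0;  /* declared in the region: one copy per thread */
#ifdef _OPENMP
#pragma omp for schedule(dynamic)
#endif
    for (int i = 0; i < count; i++) {
      output[i] = input[i] * input[i];  /* each iteration owns its own slot */
      localProcessed++;
    }
#ifdef _OPENMP
#pragma omp critical
#endif
    {
      processed += localProcessed;  /* the only write to shared state */
    }
  }
  return processed;
}
```

In MONKBULK the per-thread copy is the forked `MonkState` with its own database connection; here a plain counter plays that role.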

#### Memory Management

##### Allocation Patterns
- `BulkArguments`: Dynamic allocation with custom cleanup
- Token arrays: GLib `GArray` structures
- String handling: Mix of GLib (`g_strdup`) and standard C (`malloc`)

##### Cleanup Functions
```c
void bulkArguments_contents_free(BulkArguments* bulkArguments);
```
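
To show what such a cleanup function has to cover, here is a simplified sketch that uses plain `malloc`/`free` rather than GLib. The `Sketch*` types are trimmed-down, hypothetical stand-ins for `BulkAction` and `BulkArguments`, and the `actions` array is assumed to be NULL-terminated; the real `bulkArguments_contents_free` may differ on both counts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Trimmed-down, hypothetical stand-ins for BulkAction / BulkArguments. */
typedef struct {
  long licenseId;
  int removing;
  char* comment;
} SketchAction;

typedef struct {
  char* refText;
  char* delimiters;
  SketchAction** actions;  /* assumed NULL-terminated in this sketch */
} SketchArguments;

/* Heap-allocated copy of s; the caller owns the result. */
char* dup_string(const char* s) {
  assert(s != NULL);
  size_t size = strlen(s) + 1;
  char* copy = malloc(size);
  if (copy != NULL) memcpy(copy, s, size);
  return copy;
}

/* Free everything owned by args, but not args itself. Pointers are reset
 * to NULL so that calling the function twice is harmless. */
void sketch_arguments_contents_free(SketchArguments* args) {
  if (args == NULL) return;
  if (args->actions != NULL) {
    for (SketchAction** action = args->actions; *action != NULL; action++) {
      free((*action)->comment);
      free(*action);
    }
    free(args->actions);
  }
  free(args->refText);
  free(args->delimiters);
  args->actions = NULL;
  args->refText = NULL;
  args->delimiters = NULL;
}
```

One caveat when adapting this to the real code: memory obtained from GLib allocators such as `g_strdup` should be released with `g_free`, so a codebase that mixes both styles has to track which allocator owns each string.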

#### Build System

##### CMake Configuration
- Shared source files with main Monk agent
- OpenMP support (`-fopenmp`)
- Large file support (`-D_FILE_OFFSET_BITS=64`)
- Case insensitive matching (`-DMONK_CASE_INSENSITIVE`)

##### Dependencies
- `libfossology` - Core FOSSology library
- `libpq` - PostgreSQL client
- `glib-2.0` - Utility functions
- `magic` - File type detection
- OpenMP - Multi-threading

#### Scheduler Integration

##### Job Queue Processing
- Integrates with FOSSology's job scheduler system
- Heartbeat mechanism for progress reporting
- ARS (Agent Result Set) tracking for results

##### Agent Registration
```c
queryAgentId(state, AGENT_BULK_NAME, AGENT_BULK_DESC);
```

#### Result Processing

##### Match Callback System
- `bulk_onAllMatches()` - Processes matching results
- Database transaction management
- Clearing event insertion with user context
- Highlight information storage

##### Transaction Handling
- ACID compliance for result storage
- Rollback on processing errors
- Referential integrity maintenance

#### Configuration Options

##### Scanning Modes
- `ignoreIrre` - Skip irrelevant files
- `scanFindings` - Process only files with existing findings
- Custom delimiter configuration per bulk operation
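
Custom delimiters arrive as user-entered text, so escape sequences such as `\n` or `\t` have to be decoded into the characters they stand for before tokenization. The sketch below shows one way to do that; `unescape_delimiters` is a hypothetical helper, and the set of escapes it recognizes is an assumption rather than the list MONKBULK supports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode two-character escapes (\n, \t, \r, \f, \\) from src into dst.
 * Unknown escapes are copied unchanged. dst must have room for
 * strlen(src) + 1 bytes. Returns the length of the decoded string. */
size_t unescape_delimiters(const char* src, char* dst) {
  assert(src != NULL && dst != NULL);
  size_t out = 0;
  for (size_t i = 0; src[i] != '\0'; i++) {
    if (src[i] == '\\' && src[i + 1] != '\0') {
      char decoded = 0;
      switch (src[i + 1]) {
        case 'n':  decoded = '\n'; break;
        case 't':  decoded = '\t'; break;
        case 'r':  decoded = '\r'; break;
        case 'f':  decoded = '\f'; break;
        case '\\': decoded = '\\'; break;
        default:   break;  /* not an escape this sketch knows */
      }
      if (decoded != 0) {
        dst[out++] = decoded;
        i++;  /* consume the character after the backslash */
        continue;
      }
    }
    dst[out++] = src[i];
  }
  dst[out] = '\0';
  return out;
}
```

The decoded string can then be passed wherever the default `DELIMITERS` string would otherwise be used.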

##### Performance Tuning
- Multi-threading control via OpenMP
- Memory limits for token arrays
- Database connection pooling

#### Error Handling

##### Database Errors
- Connection failure handling
- Query result validation
- Transaction rollback on errors

##### File System Errors
- Permission checking
- File access validation
- Resource cleanup on failures

### Similarities between MONKBULK and the new agent

- Both agents scan for exact matches only; partial matches are disregarded.
- The overall matching algorithm of the MONKBULK agent is the same as the one required for the new agent.
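
Exact matching over hashed tokens can be pictured as searching for one contiguous run of token hashes inside another. The sketch below is a deliberately naive version of that idea; `find_exact_match` is a hypothetical function and not the algorithm implemented in `match.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index in text where phrase first occurs as a contiguous run
 * of token hashes, or -1 if there is no complete occurrence. */
long find_exact_match(const uint32_t* text, size_t textLen,
                      const uint32_t* phrase, size_t phraseLen) {
  assert(text != NULL && phrase != NULL);
  if (phraseLen == 0 || phraseLen > textLen) return -1;
  for (size_t start = 0; start + phraseLen <= textLen; start++) {
    size_t matched = 0;
    while (matched < phraseLen && text[start + matched] == phrase[matched]) {
      matched++;
    }
    if (matched == phraseLen) return (long)start;  /* every token agreed */
  }
  return -1;  /* partial overlaps are not reported */
}
```

A phrase that matches only in part, for example its first two tokens out of three, yields -1, which mirrors the rule that partial matches are disregarded.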

### Dissimilarities between MONKBULK and the new agent

- The new agent should use the `custom_phrase` table instead of the `license_ref_bulk` table.
- Users should be able to trigger the new agent from the uploads page, instead of going to the license page of a particular file as is done for the MONKBULK agent.
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
---
title: Week 8
author: Harshit Gandhi
tags: [gsoc25]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2025 Harshit Gandhi <gandhiharshit716@gmail.com>
-->

# Week 8

_(July 22, 2025 – July 29, 2025)_

## Meeting 1

**Date:** July 22, 2025
**Attendees:**

- [Harshit Gandhi](https://github.com/harshitg927)
- [Kaushlendra](https://github.com/Kaushl2208)
- [Sushant](https://github.com/its-sushant)
- [Soham](https://github.com/soham4abc)
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

## Summary

- Presented findings from the analysis of the MONKBULK agent to mentors, highlighting its relevance to the new agent's development.
- Implementation feedback was deferred because I had not yet started the actual development of the new agent.

## Progress

- Began the implementation phase of the new agent, tentatively named "Kotoba" (from the Japanese word "kotoba", which means "word"). The name may be revised in the future.
- Spent most of this week understanding the interaction between the scheduler and agents within the FOSSology codebase, which is critical for developing the new agent.
