- Handle default values and required parameters
- Support multiple authentication methods

**Key Classes:**

- `EnvVars`: Immutable configuration object (sketched below)
- Helper functions for type conversion and validation

**Design Patterns:**

- Configuration Object Pattern
- Builder Pattern (for environment variable parsing)
- Validation Chain Pattern

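As a rough illustration of the Configuration Object pattern described here, a frozen dataclass yields an immutable `EnvVars` object. The field names and the `CHUNK_SIZE` default below are assumptions drawn from variables mentioned elsewhere in this document, not the actual `config.py` implementation:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen=True makes the configuration immutable
class EnvVars:
    gh_token: str
    owner: str
    repo: str
    chunk_size: int


def get_env_vars() -> EnvVars:
    """Parse, validate, and apply defaults to environment variables (sketch)."""
    repository = os.environ["REPOSITORY"]  # required, e.g. "octocat/hello-world"
    owner, repo = repository.split("/", 1)
    return EnvVars(
        gh_token=os.environ.get("GH_TOKEN", ""),
        owner=owner,
        repo=repo,
        chunk_size=int(os.environ.get("CHUNK_SIZE", "1000")),  # assumed default
    )
```
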
### 2. Authentication Manager (`auth.py`)

**Responsibilities:**

- Manage JWT tokens for GitHub App authentication
- Provide unified authentication interface

**Key Functions:**

- `auth_to_github()`: Main authentication orchestrator (sketched below)
- `get_github_app_installation_token()`: JWT token exchange

**Design Patterns:**

- Strategy Pattern (for different authentication methods)
- Factory Pattern (for creating GitHub clients)

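A minimal sketch of the Strategy pattern for authentication, assuming the github3.py client implied by the `github_connection.repository(owner, repo)` call later in this document; the parameter handling is illustrative rather than the actual `auth_to_github()` signature:

```python
import github3


def auth_to_github(token="", app_id=0, installation_id=0, private_key=b""):
    """Select an authentication strategy and return a GitHub client (sketch)."""
    gh = github3.GitHub()
    if app_id and installation_id and private_key:
        # GitHub App flow: sign a JWT, then exchange it for an installation token
        gh.login_as_app_installation(private_key, app_id, installation_id)
    elif token:
        gh.login(token=token)  # personal access token flow
    else:
        raise ValueError("No valid authentication method configured")
    return gh
```
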
### 3. Core Analysis Engine (`measure_innersource.py`)

**Responsibilities:**

- Manage chunked processing for large repositories
- Calculate InnerSource metrics and ratios

**Key Algorithms:**

- Team boundary detection
- Contribution aggregation
- Chunked data processing
- Progress tracking and error handling

**Design Patterns:**

- Pipeline Pattern (for staged processing)
- Iterator Pattern (for chunked processing)
- Observer Pattern (for progress tracking)

### 4. Report Generation (`markdown_writer.py`)

**Responsibilities:**

- Handle edge cases and missing data
- Provide consistent report structure

**Design Patterns:**

- Template Method Pattern
- Null Object Pattern (for handling missing data; see the sketch below)

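As one way the Null Object pattern can play out in report generation, a row formatter substitutes zeroed defaults when a user has no recorded data. `format_user_row` is a hypothetical helper for illustration, not a function from `markdown_writer.py`:

```python
def format_user_row(user: str, contributions: dict) -> str:
    """Render one markdown table row, falling back to a zeroed
    default record (a Null Object) when the user's data is missing."""
    empty = {"commits": 0, "prs": 0, "issues": 0}
    counts = contributions.get(user) or empty
    return f"| {user} | {counts['commits']} | {counts['prs']} | {counts['issues']} |"
```
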
### 5. Utility Functions (`markdown_helpers.py`)

**Responsibilities:**

- Split large files intelligently (see the sketch below)
- Preserve content integrity during splits

**Design Patterns:**

- Strategy Pattern (for file splitting)
- Utility/Helper Pattern

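A simplified sketch of one splitting strategy, breaking only on line boundaries so markdown formatting survives the split; the function name and size limit are assumptions rather than the actual `markdown_helpers.py` API:

```python
def split_markdown(content: str, max_chars: int = 65535) -> list[str]:
    """Split a markdown report into chunks under max_chars,
    breaking on line boundaries to preserve content integrity."""
    chunks, current = [], ""
    for line in content.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```
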
## Data Flow Architecture

### Processing Pipeline

*(ASCII diagram of the end-to-end processing pipeline omitted.)*

### Data Structures

#### Configuration Data Flow

```text
Environment Variables → EnvVars Object → Component Configuration

Examples:
GH_TOKEN   → env_vars.gh_token             → auth_to_github(token=...)
REPOSITORY → env_vars.owner, env_vars.repo → github_connection.repository(owner, repo)
CHUNK_SIZE → env_vars.chunk_size           → chunked_processing(chunk_size=...)
```

#### Analysis Data Flow

```text
GitHub API → Raw Data → Processed Data → Aggregated Results → Report

Examples:
repo.commits()       → commit_list → commit_author_counts → contribution_totals → markdown_report
repo.pull_requests() → pr_list     → pr_author_counts     → innersource_metrics → formatted_output
repo.issues()        → issue_list  → issue_author_counts  → team_analysis       → final_report
```

## Key Algorithms

### Team Boundary Detection Algorithm

The team boundary detection algorithm is central to the tool's functionality:

```python
def detect_team_boundaries(original_author: str, org_data: dict) -> set:
    """
    Detect team boundaries using the organizational hierarchy.

    Algorithm:
    1. Start with the original commit author
    2. Add their direct manager
    3. Add all peers (people with the same manager)
    4. Recursively add anyone who reports to a team member
    5. Continue until no new members are found
    """
    team_members = {original_author}

    # Add the original author's manager and peers, guarding against
    # authors who are missing from the org data
    if original_author in org_data:
        manager = org_data[original_author]["manager"]
        team_members.add(manager)

        # Add all peers (same manager)
        for user, data in org_data.items():
            if data["manager"] == manager:
                team_members.add(user)

    # Recursive expansion
    changed = True
    while changed:
        initial_size = len(team_members)

        # Add anyone who reports to current team members
        for user, data in org_data.items():
            if data["manager"] in team_members:
                team_members.add(user)

        # Stop once a full pass adds no new members
        changed = len(team_members) > initial_size

    return team_members
```
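
Because the expansion step only follows "reports to" edges, the detected team grows downward from the author's manager but never climbs further up the reporting chain. A quick illustration with hypothetical org data (the names and the `org-data.json` shape are assumptions for demonstration):

```python
org_data = {
    "alice": {"manager": "carol"},
    "bob":   {"manager": "carol"},
    "carol": {"manager": "dana"},
    "eve":   {"manager": "frank"},
}

print(detect_team_boundaries("alice", org_data))
# {'alice', 'bob', 'carol'}: alice, her manager carol, and her peer bob;
# carol's own manager dana is not pulled in
```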

### Chunked Processing Algorithm

For memory efficiency with large repositories:

```python
def process_in_chunks(iterator, chunk_size: int, processor_func):
    """
    Process large datasets in memory-efficient chunks.

    Benefits:
    - Prevents memory overflow
    - Provides progress feedback
    - Allows for configurable memory usage
    - Handles API rate limiting gracefully
    """
    results = {}
    total_processed = 0

    while True:
        # Collect the next chunk from the iterator
        chunk = []
        for _ in range(chunk_size):
            try:
                chunk.append(next(iterator))
            except StopIteration:
                break

        if not chunk:
            break

        # Process the chunk
        chunk_results = processor_func(chunk)

        # Merge chunk results into the running totals
        for key, value in chunk_results.items():
            results[key] = results.get(key, 0) + value

        total_processed += len(chunk)
        print(f"Processed {total_processed} items...")

    return results
```
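
As a usage sketch, a processor function can count commit authors per chunk. This assumes github3.py-style commit objects whose `.author` exposes a `login` attribute; `count_commit_authors` is illustrative, not code from the tool:

```python
def count_commit_authors(chunk):
    """Count commit authors within one chunk (illustrative processor)."""
    counts = {}
    for commit in chunk:
        login = commit.author.login if commit.author else "unknown"
        counts[login] = counts.get(login, 0) + 1
    return counts


commit_author_counts = process_in_chunks(
    iter(repo.commits()),            # API iterator; never fully materialized
    chunk_size=env_vars.chunk_size,  # configurable memory/API trade-off
    processor_func=count_commit_authors,
)
```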

### Contribution Aggregation Algorithm

```python
def aggregate_contributions(commit_counts, pr_counts, issue_counts):
    """
    Aggregate different types of contributions.

    Combines:
    - Commit authorship
    - Pull request creation
    - Issue creation

    Returns unified contribution counts per user.
    """
    all_users = set(commit_counts.keys()) | set(pr_counts.keys()) | set(issue_counts.keys())

    aggregated = {}
    for user in all_users:
        aggregated[user] = (
            commit_counts.get(user, 0)
            + pr_counts.get(user, 0)
            + issue_counts.get(user, 0)
        )

    return aggregated
```
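
A quick worked example with made-up counts:

```python
commit_counts = {"alice": 10, "bob": 2}
pr_counts = {"alice": 3}
issue_counts = {"bob": 1, "carol": 4}

print(aggregate_contributions(commit_counts, pr_counts, issue_counts))
# {'alice': 13, 'bob': 3, 'carol': 4} (key order may vary)
```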

## Performance Considerations

### Memory Management

1. **Chunked Processing**: Large datasets are never loaded entirely into memory
2. **Lazy Evaluation**: Use GitHub API iterators instead of loading full lists
3. **Result Streaming**: Process and aggregate results incrementally
4. **Garbage Collection**: Explicitly manage object lifecycles for large datasets

### API Rate Limiting

1. **Authentication Strategy**: Use GitHub App installation tokens for higher limits (5,000 requests/hour, vs 1,000/hour for the default Actions `GITHUB_TOKEN`)
2. **Request Batching**: Minimize API calls through efficient query patterns
3. **Respectful Processing**: Honor rate limits and provide backoff mechanisms (sketched below)
4. **Progress Tracking**: Provide feedback during long-running operations

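One possible shape for that backoff mechanism; the tool's actual retry behavior is not specified here, so this is a generic sketch:

```python
import time


def with_backoff(request_fn, max_retries=5):
    """Call an API function, retrying with exponential backoff when it
    fails (for example, on a rate-limit response)."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception as exc:
            wait_seconds = 2 ** attempt
            print(f"Request failed ({exc}); retrying in {wait_seconds}s...")
            time.sleep(wait_seconds)
    raise RuntimeError(f"Request still failing after {max_retries} retries")
```
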
### Scalability Patterns

1. **Horizontal Scaling**: The tool can be run on multiple repositories simultaneously
2. **Configurable Resources**: Chunk size can be adjusted based on available memory
3. **Incremental Processing**: Future enhancement for processing only recent changes
4. **Caching Strategy**: Store intermediate results to avoid reprocessing

## Error Handling Strategy

### Graceful Degradation

```python
import functools

import requests


def handle_api_errors(func):
    """
    Decorator for graceful API error handling.
    """

    @functools.wraps(func)  # preserve the wrapped function's metadata
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except requests.exceptions.RequestException as e:
            print(f"API request failed: {e}")
            return None
        except Exception as e:
            print(f"Unexpected error: {e}")
            return None

    return wrapper
```
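
A decorated call site then degrades to `None` instead of aborting the whole run. `fetch_commit_list` below is a hypothetical example, not a function from the codebase:

```python
@handle_api_errors
def fetch_commit_list(repository):
    return list(repository.commits())


commits = fetch_commit_list(repo)
if commits is None:
    print("Skipping commit analysis; the API call did not succeed")
```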

### Error Categories

4. **Data Errors**: Missing `org-data.json` or invalid format
5. **Processing Errors**: Unexpected data structures or edge cases

## Testing Strategy

### Test Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                             Test Suite                              │
│                                                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────┐     │
│  │    Unit     │  │ Integration │  │ End-to-End  │  │  Edge   │     │
│  │   Tests     │  │    Tests    │  │    Tests    │  │  Cases  │     │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────┘     │
│                                                                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────┐     │
│  │ Performance │  │  Security   │  │ Reliability │  │  Mock   │     │
│  │   Tests     │  │    Tests    │  │    Tests    │  │  Tests  │     │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────┘     │
└─────────────────────────────────────────────────────────────────────┘
```

### Test Coverage Strategy

1. **Unit Tests**: Each function and class method
2. **Integration Tests**: Component interactions
3. **Configuration Tests**: Environment variable handling
4. **Mock Tests**: GitHub API interactions (see the sketch below)
5. **Edge Case Tests**: Empty repositories, missing data, error conditions

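For instance, the pure aggregation logic can be tested directly, while API-facing code is exercised through mocks so no network calls are made. These tests are illustrative sketches (reusing the hypothetical `count_commit_authors` helper from earlier), not excerpts from the repository's test suite:

```python
from unittest.mock import MagicMock


def test_aggregate_contributions_merges_all_sources():
    # Pure function: no GitHub API access required
    result = aggregate_contributions({"alice": 2}, {"alice": 1}, {"bob": 3})
    assert result == {"alice": 3, "bob": 3}


def test_commit_counting_with_mocked_api():
    # Mock a commit object so no network calls are made
    commit = MagicMock()
    commit.author.login = "alice"
    assert count_commit_authors([commit]) == {"alice": 1}
```
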
## Security Considerations

### Authentication Security

1. **Token Management**: Never log or expose authentication tokens
2. **Permission Principle**: Use minimum required permissions
3. **Token Rotation**: Support for token refresh and rotation
4. **Secure Storage**: Environment variables for sensitive data

### Data Privacy

1. **No Data Persistence**: The tool doesn't store user data beyond processing
2. **Minimal Data Access**: Only accesses necessary repository information
3. **User Privacy**: Respects GitHub's privacy settings and permissions
4. **Audit Trail**: Provides logs for security auditing

## Future Enhancements

### Planned Features

1. **Incremental Processing**: Process only recent changes
2. **Historical Analysis**: Track InnerSource trends over time
3. **Additional Metrics**: More sophisticated collaboration measurements
4. **Multiple Platforms**: Support for GitLab, Bitbucket, etc.
5. **Real-time Processing**: Webhook-based analysis

### Architecture Extensions

1. **Plugin System**: Allow custom analysis algorithms
2. **Database Integration**: Store historical data for trending
3. **API Interface**: REST API for programmatic access
4. **Dashboard UI**: Web interface for visualization
5. **Notification System**: Alerts for significant changes

## Deployment Architecture

### Container Strategy

```dockerfile
# Multi-stage build for optimization
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.10/site-packages /usr/local/lib/python3.10/site-packages
COPY . .
CMD ["python", "measure_innersource.py"]
```

### GitHub Actions Integration

The tool is designed to run seamlessly in GitHub Actions with:

- Minimal resource requirements
- Efficient processing patterns
- Clear progress reporting
- Graceful error handling

## Monitoring and Observability

### Metrics Collection

1. **Processing Time**: Track analysis duration
2. **Memory Usage**: Monitor resource consumption
3. **API Usage**: Track rate limit consumption
4. **Error Rates**: Monitor failure patterns
5. **Success Metrics**: Track successful analyses

### Logging Strategy

1. **Structured Logging**: JSON format for machine processing (sketched below)
2. **Log Levels**: DEBUG, INFO, WARNING, ERROR
3. **Contextual Information**: Include request IDs and user context
4. **No Sensitive Data**: Sanitize logs of tokens and personal information
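
A minimal structured-logging setup along these lines would satisfy points 1 and 4; the logger name and field choices are illustrative assumptions:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format log records as JSON lines, keeping tokens and personal
    data out of the payload."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("measure_innersource")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Analysis complete")  # emits: {"level": "INFO", ...}
```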

This architecture provides a solid foundation for the InnerSource measurement tool while maintaining flexibility for future enhancements and scaling requirements.