Commit cd2e016

Initial version for pitfalls detection
1 parent 42ff8f3 commit cd2e016

935 files changed

+818685
-31216
lines changed


README.md

Lines changed: 88 additions & 1 deletion
@@ -1 +1,88 @@
1-
# pitfalls-detection
1+
# Software Repository Metadata Pitfall Detection Tool
2+
3+
This project provides an automated tool for detecting common metadata quality issues (pitfalls) in software repositories. The tool analyzes SoMEF (Software Metadata Extraction Framework) output files to identify various problems in repository metadata files such as `codemeta.json`, `package.json`, `setup.py`, `DESCRIPTION`, and others.
4+
5+
## Overview
6+
7+
The pitfall detection system identifies **27 different types of metadata quality issues** across multiple programming languages (Python, Java, C++, C, R, Rust). These pitfalls range from version mismatches and license template placeholders to broken URLs and improperly formatted metadata fields.
8+
9+
### Supported Pitfall Types
10+
11+
The tool detects the following categories of issues:
12+
13+
- **Version-related pitfalls**: Version mismatches between metadata files and releases
14+
- **License-related pitfalls**: Template placeholders, copyright-only licenses, missing version specifications
15+
- **URL validation pitfalls**: Broken links for CI, software requirements, download URLs
16+
- **Metadata format pitfalls**: Improper field formatting, multiple authors in a single field, and similar issues
17+
- **Identifier pitfalls**: Invalid or missing unique identifiers, bare DOIs
18+
- **Repository reference pitfalls**: Mismatched code repositories, Git shorthand usage
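
As an illustration of what a single check in the identifier category might look like, here is a minimal sketch of a bare-DOI test. The function name `is_bare_doi` and the regular expression are assumptions made for this example; the detectors shipped in `scripts/` may define a bare DOI differently.

```python
import re

# Assumption: a "bare" DOI is an identifier such as "10.5281/zenodo.1234"
# that is not wrapped in a resolvable https://doi.org/ URL.
BARE_DOI = re.compile(r"^10\.\d{4,9}/\S+$")


def is_bare_doi(identifier: str) -> bool:
    """Return True when the value looks like a DOI without a resolver URL."""
    return bool(BARE_DOI.match(identifier.strip()))
```

Under this definition, `10.5281/zenodo.1234` would be flagged, while `https://doi.org/10.5281/zenodo.1234` would not.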
19+
20+
## Requirements
21+
22+
- **Python 3.10 or higher**
23+
- Required Python packages:
24+
  - `requests` (third-party; used for URL validation)
25+
  - `pathlib` (standard library)
26+
  - `json` (standard library)
27+
  - `re` (standard library)
28+
29+
## Setup and Usage
30+
31+
### 1. Prepare SoMEF Output Files
32+
33+
Ensure you have SoMEF output JSON files ready for analysis. These files should be placed in a directory named `somef_outputs` in the same location as the main script.
34+
35+
**Important**: Keep the directory name as `somef_outputs` exactly as shown.
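
The following sketch shows one way the input directory could be read, assuming each SoMEF output is a standalone `.json` file directly inside `somef_outputs`. The helper name `load_somef_outputs` is hypothetical and is not taken from `detect_pitfalls_main.py`.

```python
import json
from pathlib import Path


def load_somef_outputs(directory: str = "somef_outputs") -> dict:
    """Parse every *.json file in the directory, keyed by file name."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {root}")

    parsed = {}
    for path in sorted(root.glob("*.json")):
        try:
            parsed[path.name] = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as err:
            # Skip malformed files instead of aborting the whole run.
            print(f"Skipping {path.name}: invalid JSON ({err})")
    return parsed
```

Skipping malformed files rather than stopping is a design choice for this sketch; the actual tool may report such files differently.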
36+
37+
### 2. Directory Structure
38+
```
project/
├── detect_pitfalls_main.py
├── somef_outputs/              # Directory containing SoMEF JSON files
│   ├── repository1.json
│   ├── repository2.json
│   └── ...
├── scripts/                    # Individual pitfall detector modules
│   ├── p001.py
│   ├── p002.py
│   └── ...
└── all_pitfalls_results.json   # Generated output file
```
51+
52+
### 3. Run the Detection Tool
53+
54+
Execute the main script from the command line:
55+
56+
`python detect_pitfalls_main.py`
58+
59+
### 4. Output
60+
61+
The tool will:
62+
- Process all JSON files in the `somef_outputs` directory
63+
- Display progress messages showing detected pitfalls
64+
- Generate a comprehensive report in `all_pitfalls_results.json`
65+
66+
The output file contains:
67+
- Summary statistics of analyzed repositories
68+
- Count and percentage for each pitfall type
69+
- Language-specific breakdown for repositories written in one of the target languages
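
To make the count and percentage figures concrete, here is a small sketch of how they could be derived. The input shape (repository file name mapped to a list of pitfall codes) and the key names `total_repositories`, `pitfalls`, `count`, and `percentage` are assumptions for illustration; the real layout of `all_pitfalls_results.json` may differ.

```python
from collections import Counter


def summarize(results: dict[str, list[str]]) -> dict:
    """Count how many repositories exhibit each pitfall code."""
    total = len(results)
    # Use set() so a pitfall found twice in one repository counts once.
    counts = Counter(code for codes in results.values() for code in set(codes))
    return {
        "total_repositories": total,
        "pitfalls": {
            code: {
                "count": n,
                "percentage": round(100 * n / total, 2) if total else 0.0,
            }
            for code, n in sorted(counts.items())
        },
    }
```

Percentages here are relative to all analyzed repositories, so a pitfall found in 2 of 3 repositories is reported as 66.67.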
70+
71+
72+
## Troubleshooting
73+
74+
### Common Issues
75+
76+
1. **"Directory not found" error**: Ensure the `somef_outputs` directory exists and contains JSON files
77+
2. **JSON parsing errors**: Verify that the input files contain valid JSON
78+
3. **Network timeouts**: Some pitfalls (P014, P028) validate URLs and may time out; this is expected behavior
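
For readers wondering why timeouts are expected, the sketch below shows a URL check with an explicit timeout. It uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies; the function name `is_reachable` and the 5-second default are assumptions, not the values used by P014 or P028.

```python
import http.client
import urllib.request


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request succeeds with a non-error status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError, timeouts, and refused connections;
        # ValueError covers malformed URLs.
        return False
```

A `False` result caused by a timeout only means the server did not answer in time, so a slow host can be reported as broken even though the link is valid.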
79+
80+
### Performance Notes
81+
82+
- URL validation pitfalls (P014, P028) may take longer due to network requests
83+
- Large datasets may require several minutes to complete analysis
84+
- Progress is displayed in real time, showing which pitfalls are found
85+
86+
## Contributing
87+
88+
The system is designed with modularity in mind. Each pitfall detector is implemented as a separate module in the `scripts/` directory, making it easy to add new pitfall types or modify existing detection logic.
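
A modular layout like this could be wired together as in the following sketch. It assumes, purely for illustration, that every `p*.py` module exposes a function `detect(somef_data)` returning a truthy value when the pitfall is present; the actual interface of the modules in `scripts/` is not documented here and may differ.

```python
import importlib.util
from pathlib import Path


def load_detectors(scripts_dir: str = "scripts") -> dict:
    """Import each p*.py file and collect its detect() function by code."""
    detectors = {}
    for path in sorted(Path(scripts_dir).glob("p*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        detect = getattr(module, "detect", None)  # hypothetical entry point
        if callable(detect):
            detectors[path.stem.upper()] = detect
    return detectors


def run_detectors(detectors: dict, somef_data: dict) -> list:
    """Return the codes of all detectors that flag the given SoMEF record."""
    return [code for code, detect in detectors.items() if detect(somef_data)]
```

With this arrangement, adding a new pitfall type would amount to dropping another `pNNN.py` file with a `detect` function into the directory.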
