| 1 | | -# pitfalls-detection |
| 1 | +# Software Repository Metadata Pitfall Detection Tool |
| 2 | + |
| 3 | +This project provides an automated tool for detecting common metadata quality issues (pitfalls) in software repositories. The tool analyzes SoMEF (Software Metadata Extraction Framework) output files to identify various problems in repository metadata files such as `codemeta.json`, `package.json`, `setup.py`, `DESCRIPTION`, and others. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The pitfall detection system identifies **27 different types of metadata quality issues** across multiple programming languages (Python, Java, C++, C, R, Rust). These pitfalls range from version mismatches and license template placeholders to broken URLs and improperly formatted metadata fields. |
| 8 | + |
| 9 | +### Supported Pitfall Types |
| 10 | + |
| 11 | +The tool detects the following categories of issues: |
| 12 | + |
| 13 | +- **Version-related pitfalls**: Version mismatches between metadata files and releases |
| 14 | +- **License-related pitfalls**: Template placeholders, copyright-only licenses, missing version specifications |
| 15 | +- **URL validation pitfalls**: Broken links for CI, software requirements, download URLs |
| 16 | +- **Metadata format pitfalls**: Improper field formatting, multiple authors in a single field, and similar issues
| 17 | +- **Identifier pitfalls**: Invalid or missing unique identifiers, bare DOIs |
| 18 | +- **Repository reference pitfalls**: Mismatched code repositories, Git shorthand usage |
| 19 | + |
| 20 | +## Requirements |
| 21 | + |
| 22 | +- **Python 3.10 or higher** |
| 23 | +- Python modules used:
| 24 | + - `requests` (third-party, used for URL validation)
| 25 | + - `pathlib` (standard library)
| 26 | + - `json` (standard library)
| 27 | + - `re` (standard library)
| 28 | + |
| 29 | +## Setup and Usage |
| 30 | + |
| 31 | +### 1. Prepare SoMEF Output Files |
| 32 | + |
| 33 | +Ensure you have SoMEF output JSON files ready for analysis. These files should be placed in a directory named `somef_outputs` in the same location as the main script. |
| 34 | + |
| 35 | +**Important**: Keep the directory name as `somef_outputs` exactly as shown. |
| 36 | + |
| 37 | +### 2. Directory Structure |
| 38 | +```
| 39 | +project/
| 40 | +├── detect_pitfalls_main.py
| 41 | +├── somef_outputs/              # Directory containing SoMEF JSON files
| 42 | +│   ├── repository1.json
| 43 | +│   ├── repository2.json
| 44 | +│   └── ...
| 45 | +├── scripts/                    # Individual pitfall detector modules
| 46 | +│   ├── p001.py
| 47 | +│   ├── p002.py
| 48 | +│   └── ...
| 49 | +└── all_pitfalls_results.json   # Generated output file
| 50 | +```
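A minimal sketch of how the files in this layout could be read, assuming every input sits directly in `somef_outputs/` with a `.json` extension. The function name `load_somef_outputs` is hypothetical and not taken from `detect_pitfalls_main.py`; the actual script may load files differently.

```python
import json
from pathlib import Path


def load_somef_outputs(directory="somef_outputs"):
    """Load every *.json file in `directory` into a dict keyed by file name.

    Hypothetical helper for illustration; not part of the actual tool.
    """
    folder = Path(directory)
    if not folder.is_dir():
        raise FileNotFoundError(f"Directory not found: {folder}")

    loaded = {}
    for path in sorted(folder.glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                loaded[path.name] = json.load(handle)
        except json.JSONDecodeError as error:
            # Report the invalid file and continue with the remaining ones.
            print(f"Skipping {path.name}: invalid JSON ({error})")
    return loaded
```

Skipping unparseable files rather than aborting is a design choice of this sketch, so that one malformed input does not stop the whole run.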
| 51 | + |
| 52 | +### 3. Run the Detection Tool |
| 53 | + |
| 54 | +Execute the main script from the command line: |
| 55 | + |
| 56 | +`python detect_pitfalls_main.py`
| 57 | +
| 58 | + |
| 59 | +### 4. Output |
| 60 | + |
| 61 | +The tool will: |
| 62 | +- Process all JSON files in the `somef_outputs` directory |
| 63 | +- Display progress messages showing detected pitfalls |
| 64 | +- Generate a comprehensive report in `all_pitfalls_results.json` |
| 65 | + |
| 66 | +The output file contains: |
| 67 | +- Summary statistics of analyzed repositories |
| 68 | +- Count and percentage for each pitfall type |
| 69 | +- A per-language breakdown for repositories written in one of the target languages
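The count and percentage figures could be derived roughly as follows. This is an illustrative sketch only: the `summarize` function and the key names (`total_repositories`, `pitfalls`, `count`, `percentage`) are assumptions, not the actual schema of `all_pitfalls_results.json`.

```python
from collections import Counter


def summarize(results):
    """Summarize pitfalls per type.

    `results` maps a repository file name to the list of pitfall codes
    detected in it (an assumed intermediate structure).
    """
    total = len(results)
    # Count each pitfall at most once per repository.
    counts = Counter(code for codes in results.values() for code in set(codes))
    return {
        "total_repositories": total,
        "pitfalls": {
            code: {"count": n, "percentage": round(100 * n / total, 2)}
            for code, n in sorted(counts.items())
        },
    }
```

With three repositories where `P001` appears in two of them, this sketch would report a count of 2 and a percentage of 66.67 for `P001`.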
| 70 | + |
| 71 | + |
| 72 | +## Troubleshooting |
| 73 | + |
| 74 | +### Common Issues |
| 75 | + |
| 76 | +1. **"Directory not found" error**: Ensure the `somef_outputs` directory exists and contains JSON files
| 77 | +2. **JSON parsing errors**: Verify that the input files contain valid JSON
| 78 | +3. **Network timeouts**: Some pitfalls (P014, P028) validate URLs and may time out; this is expected behavior
| 79 | + |
| 80 | +### Performance Notes |
| 81 | + |
| 82 | +- URL validation pitfalls (P014, P028) may take longer due to network requests |
| 83 | +- Large datasets may require several minutes to complete analysis |
| 84 | +- Progress is displayed in real time, showing which pitfalls are found
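To illustrate why URL checks dominate the run time and why a timeout is treated as a failed link, here is a rough sketch of a reachability check. It deliberately uses the standard library's `urllib` instead of `requests` so that it runs without extra installs; the function name `url_is_reachable` and the 5-second default are assumptions, not the tool's actual settings.

```python
import urllib.request


def url_is_reachable(url, timeout=5.0):
    """Return True if a HEAD request to `url` succeeds within `timeout` seconds.

    Illustrative only; the real detectors may use different rules.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (ValueError, OSError):
        # ValueError: malformed URL. OSError covers URLError, HTTPError
        # (4xx/5xx responses), connection failures, and socket timeouts.
        return False
```

Each call can block for up to `timeout` seconds, so a dataset with many unreachable links takes noticeably longer to process.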
| 85 | + |
| 86 | +## Contributing |
| 87 | + |
| 88 | +The system is designed with modularity in mind. Each pitfall detector is implemented as a separate module in the `scripts/` directory, making it easy to add new pitfall types or modify existing detection logic. |
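As a hedged illustration of what one such module might look like, the sketch below checks for Git shorthand in the repository reference. Everything here is an assumption: the `detect` function name, the shorthand pattern, and the reliance on a `code_repository` list whose entries carry a `result.value` string in the SoMEF output. The existing modules in `scripts/` may expose a different interface.

```python
import re

# Assumed shorthand form, e.g. "github:user/repo" instead of a full URL.
GIT_SHORTHAND = re.compile(r"^(github|gitlab|bitbucket):[\w.-]+/[\w.-]+$")


def detect(somef_data: dict) -> bool:
    """Return True if any code repository entry uses Git shorthand.

    Hypothetical detector interface for illustration only.
    """
    for entry in somef_data.get("code_repository", []):
        value = entry.get("result", {}).get("value", "")
        if isinstance(value, str) and GIT_SHORTHAND.match(value):
            return True
    return False
```

Keeping each detector as a single function that takes the parsed SoMEF data and returns a boolean would let the main script treat all modules uniformly.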