Commit 99705c1

Merge pull request #10 from Ganymede-Bio/fix-release-ci
Fix release ci
2 parents 5e3b64f + 97e086a commit 99705c1

54 files changed (+4209 -1122 lines)

.github/workflows/release.yml

Lines changed: 1 addition & 0 deletions
@@ -142,6 +142,7 @@ jobs:
       - name: Create GitHub Release
         uses: softprops/action-gh-release@v1
         with:
+          tag_name: v${{ needs.build.outputs.version }}
           name: v${{ needs.build.outputs.version }}
           body_path: release_notes.md
           files: |

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.5.0
+    rev: v5.0.0
     hooks:
       - id: trailing-whitespace
       - id: end-of-file-fixer
@@ -18,14 +18,14 @@ repos:
         args: ['--pytest-test-first']
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.1.15
+    rev: v0.12.7
     hooks:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]
       - id: ruff-format
 
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.5.1
+    rev: v1.17.0
     hooks:
       - id: mypy
         args: [--config-file=pyproject.toml]

CHANGELOG.md

Lines changed: 19 additions & 4 deletions
@@ -5,13 +5,28 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+## [0.3.4] - 2025-07-29
+
+### Added
+- BoxTableDetector for high-confidence detection of tables with complete borders (95% confidence)
+- Headers are now always extracted from the first row of detected tables
+- Header extraction support for all detection methods (SimpleCaseDetector, IslandDetector, BoxTableDetector)
+
+### Changed
+- Improved header detection to focus on bold text and data type differences
+- Reduced emphasis on background color for header detection (based on user feedback)
+- Updated detection pipeline to include box table detection as a fast path
+
+### Fixed
+- Fixed header extraction in island detection - headers were not being returned
+- Fixed sheet_data parameter bug in SimpleCaseDetector.convert_to_table_info
+- Fixed has_headers property to correctly transfer from islands to TableInfo objects
+
 ## [0.3.1] - 2025-07-29
 
 ### Changed
-- **Project Rename**: Renamed from GridPorter to GridGulp
-  - Updated all package references throughout codebase
-  - Renamed source directory from `src/gridporter/` to `src/gridgulp/`
-  - Updated project metadata and documentation
 - CI improvements:
   - Added Python version matrix testing (3.10, 3.11, 3.12, 3.13)
   - Updated ruff target version to py310 (minimum supported)

CLAUDE.md

Lines changed: 18 additions & 7 deletions
@@ -10,9 +10,10 @@ The system follows a hierarchical detection strategy:
 
 1. **File Type Detection**: Use file magic and content analysis to determine actual file type
 2. **Single Table Check**: Fast check if file/sheet contains only one table (handles ~80% of cases)
-3. **Excel Metadata**: For Excel files, check native table objects and named ranges
-4. **Island Detection**: Algorithm to find disconnected data regions for multi-table sheets
-5. **Heuristic Analysis**: Apply header/format analysis for improved accuracy
+3. **Box Table Detection**: For Excel files, detect tables with complete borders (95% confidence)
+4. **Excel Metadata**: For Excel files, check native table objects and named ranges
+5. **Island Detection**: Algorithm to find disconnected data regions for multi-table sheets
+6. **Heuristic Analysis**: Apply header/format analysis for improved accuracy
 
 ### Detection Components

@@ -21,12 +22,21 @@ The system follows a hierarchical detection strategy:
 - Fast path optimization for common cases
 - Uses gap detection and data density analysis
 - Returns high confidence scores for clear single-table layouts
+- Always extracts headers from the first row
+
+#### BoxTableDetector
+- Detects tables with complete borders on all four sides
+- Assigns 95% confidence to these tables (addresses user feedback)
+- Verifies data density to avoid empty bordered regions
+- Ideal for formatted Excel tables with clear boundaries
+- Extracts headers with formatting-based detection
 
 #### IslandDetector
 - Identifies multiple disconnected data regions
 - Creates binary mask of non-empty cells
 - Uses connected component analysis
 - Handles complex multi-table layouts
+- Always extracts headers from first row of each island
 
 #### ExcelMetadataExtractor
 - Extracts Excel ListObjects (native tables)
@@ -47,25 +57,26 @@ class TableInfo(BaseModel):
     range: CellRange = Field(..., description="Table boundaries")
     confidence: float = Field(..., ge=0.0, le=1.0)
     detection_method: str
-    headers: list[str] | None = None
+    headers: list[str] | None = None  # Always extracted from first row
+    has_headers: bool = True  # Header detection confidence
     shape: tuple[int, int] = Field(..., description="(rows, columns)")
 ```
 
 ### File Handling Strategy
 
 #### Excel Files
-- Use openpyxl for .xlsx/.xlsm/.xlsb files
+- Use openpyxl for .xlsx/.xlsm files
 - Use xlrd for legacy .xls files
 - Preserve formatting metadata for detection
 - Handle multiple sheets independently
-- Support for python-calamine for fast parsing
 
 #### CSV/Text Files
 - Auto-detect delimiter using csv.Sniffer
 - Sophisticated encoding detection (BOM, chardet, pattern-based)
 - Handle various delimiters (comma, tab, pipe, semicolon)
 - Support UTF-8, UTF-16 (LE/BE), Latin-1, and more
-- Detect header rows using heuristics
+- Detect header rows using heuristics (bold text, data type differences)
+- Background color is no longer a primary header indicator
 
 #### File Type Detection
 - Check file signatures before trusting extensions
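
The CLAUDE.md notes say to check file signatures before trusting extensions. A minimal sketch of that idea, assuming plain magic-byte checks on the first bytes of a file. `sniff_container` and `sniff_file` are hypothetical helpers, not GridGulp functions, and the project's own detection may work differently (the README sample output reports `Method: magika`).

```python
from __future__ import annotations

from pathlib import Path

# Well-known container signatures. How they map onto GridGulp's FileType
# enum is an assumption made for this illustration.
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx/.xlsm (and .xlsb) are ZIP containers
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls is an OLE2 compound file


def sniff_container(header: bytes) -> str:
    """Classify the leading bytes of a file, ignoring its extension."""
    if header.startswith(ZIP_MAGIC):
        return "zip"  # candidate for xlsx/xlsm; inspect archive members to confirm
    if header.startswith(OLE2_MAGIC):
        return "ole2"  # candidate for legacy xls
    return "text"  # fall through to CSV/TSV/TXT handling


def sniff_file(path: str | Path) -> str:
    """Read only the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A file named `report.csv` that begins with the ZIP signature would be routed to the Excel path rather than the text path, which is the mismatch case the README describes.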

NOTICE

Lines changed: 0 additions & 2 deletions
@@ -27,8 +27,6 @@ rich (https://github.com/Textualize/rich)
 Copyright (c) 2020 Will McGugan
 Licensed under the MIT License.
 
-python-calamine (https://github.com/dimastbk/python-calamine)
-Licensed under the MIT License.
 
 ## BSD Licensed Dependencies

README.md

Lines changed: 34 additions & 8 deletions
@@ -1,12 +1,22 @@
 # GridGulp
 
+[![PyPI version](https://badge.fury.io/py/gridgulp.svg)](https://pypi.org/project/gridgulp/)
+[![Python Versions](https://img.shields.io/pypi/pyversions/gridgulp.svg)](https://pypi.org/project/gridgulp/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://ganymede-bio.github.io/gridgulp/)
+
 Automatically detect and extract tables from Excel, CSV, and text files.
 
 ## What is GridGulp?
 
-GridGulp finds tables in your spreadsheets - even when there are multiple tables on one sheet or when tables don't start at cell A1. It comes with reasonable defaults and is fully configurable.
+GridGulp finds tables in your spreadsheets, even when
+
+- there are multiple tables on one sheet
+- tables don't start at cell A1
+- file extensions do not reflect its file type
+- the file encoding is opaque
 
-**Supported formats:** `.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.csv`, `.tsv`, `.txt`
+**Supported formats:** `.xlsx`, `.xls`, `.xlsm`, `.csv`, `.tsv`, `.txt`
 
 ## Installation

@@ -16,22 +26,38 @@ pip install gridgulp
 
 ## Quick Start
 
+### Trying GridGulp Out
+
+To quickly try GridGulp on some spreadsheets, clone the repo, place example spreadsheets in the examples/ directory, and run
+
+```bash
+python scripts/test_example_files.py
+```
+
+You will receive output that looks like, representing identified ranges:
+
+📁 tests/manual
+----------------------------------------------------------------------------------------------------
+✓ sample.xlsx | Tables: 1 | Time: 1.099s | Size: 122.6KB | Method: magika
+📄 Sheet: Sheet
+└─ A1:CV203 | 203×100 | Conf: 70%
+
+
 ### Table Ranges vs DataFrames
 
 GridGulp provides two ways to work with detected tables:
 
-1. **Table Ranges** - Lightweight metadata about where tables are located (e.g., "A1:E100")
+1. **Table Ranges** - JSON metadata about where tables are located (e.g., "A1:E100")
    - Fast and memory-efficient
-   - Perfect for mapping table locations or visualizing spreadsheet structure
+   - Perfect for agent use as tools - mapping table locations or visualizing spreadsheet structure
    - No actual data is loaded into memory
 
 2. **DataFrames** - The actual data extracted from those ranges as pandas DataFrames
    - Contains the full data with proper types
    - Ready for analysis, transformation, or export
-   - Requires more memory but provides full data access
 
 Choose based on your needs:
-- Use **ranges only** when you need to know where tables are or want to process them later
+- Use **ranges only** when you need to know where tables are and want to submit to other tasks - for example, a downstream process to infer purpose / intent based on data content
 - Use **DataFrames** when you need to analyze or transform the actual data
 
 ### Getting Table Ranges Only
@@ -185,14 +211,14 @@ if all_dataframes:
 - **Smart Headers** - Detects single and multi-row headers automatically
 - **Multiple Tables** - Handles sheets with multiple separate tables
 - **Quality Scoring** - Confidence scores for each detected table
-- **Fast** - Processes most files in under a second
+- **Fast** - Processes 1M+ cells/second for simple tables, 100K+ cells/second for complex tables
 
 ## Documentation
 
 - [Full Usage Guide](docs/USAGE_GUIDE.md) - Detailed examples and configuration
 - [API Reference](docs/API_REFERENCE.md) - Complete API documentation
 - [Architecture](docs/ARCHITECTURE.md) - How GridGulp works internally
-- [Testing Guide](docs/TESTING_WITH_SCRIPT.md) - Test spreadsheets in bulk with the unified test script
+- [Testing Guide](docs/TESTING_GUIDE.md) - Test spreadsheets in bulk with the unified test script
 
 ## License
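
The sample output in the README diff lists the range `A1:CV203` as `203×100`. A minimal sketch of how an A1-style range maps to that (rows, columns) shape, assuming a simple `START:END` range with no sheet prefix or `$` anchors. `column_index` and `range_shape` are hypothetical helpers for illustration, not part of GridGulp's API.

```python
import re


def column_index(letters: str) -> int:
    """Convert Excel column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def range_shape(a1_range: str) -> tuple[int, int]:
    """Return (rows, columns) spanned by a range such as 'A1:E100'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", a1_range)
    if match is None:
        raise ValueError(f"not a simple A1-style range: {a1_range!r}")
    first_col, first_row, last_col, last_row = match.groups()
    rows = int(last_row) - int(first_row) + 1
    cols = column_index(last_col) - column_index(first_col) + 1
    return rows, cols


print(range_shape("A1:CV203"))  # (203, 100), matching the sample output
print(range_shape("A1:E100"))   # (100, 5)
```

Working from ranges alone like this keeps the "ranges only" path free of any data loading, which is the trade-off the README describes.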

docs/API_REFERENCE.md

Lines changed: 40 additions & 3 deletions
@@ -54,7 +54,6 @@ class Config(BaseModel):
     max_sheets: int = 10  # Max sheets to process
 
     # Performance
-    excel_reader: str = "calamine"  # "calamine" or "openpyxl"
     max_memory_mb: int = 1000  # Max memory usage
     chunk_size: int = 10000  # Streaming chunk size

@@ -180,7 +179,8 @@ class TableInfo(BaseModel):
     suggested_name: str | None = None  # Optional name
     confidence: float  # Detection confidence (0-1)
     detection_method: str  # Method used
-    headers: list[str] | None = None  # Column headers
+    headers: list[str] | None = None  # Column headers (always extracted from first row)
+    has_headers: bool = True  # Whether headers were detected (confidence)
     data_preview: list[dict] | None = None  # Sample data
 
     @property
@@ -244,7 +244,7 @@ class FileType(str, Enum):
     XLSX = "xlsx"  # Modern Excel
     XLS = "xls"  # Legacy Excel
     XLSM = "xlsm"  # Excel with macros
-    XLSB = "xlsb"  # Excel binary
+    XLSB = "xlsb"  # Excel binary (detected but not supported)
     CSV = "csv"  # Comma-separated
     TSV = "tsv"  # Tab-separated
     TXT = "txt"  # Text file
@@ -262,6 +262,8 @@ class SimpleCaseDetector:
     def detect_simple_table(self, sheet_data: SheetData) -> SimpleCaseResult:
         """Detect a single table starting near A1.
 
+        Headers are always extracted from the first row of detected tables.
+
         Args:
             sheet_data: Sheet data to analyze
 
@@ -273,6 +275,37 @@ class SimpleCaseDetector:
         """Check if sheet is a simple single-table case."""
 ```
 
+### BoxTableDetector
+
+High-confidence detection for tables with complete borders.
+
+```python
+class BoxTableDetector:
+    def __init__(self,
+                 min_table_size: tuple[int, int] = (2, 2),
+                 box_confidence: float = 0.95):
+        """Initialize box table detector.
+
+        Args:
+            min_table_size: Minimum (rows, cols) for valid table
+            box_confidence: Confidence score for bordered tables (default: 0.95)
+        """
+
+    def detect_box_tables(self, sheet_data: SheetData) -> list[TableInfo]:
+        """Detect tables with complete borders on all four sides.
+
+        This detector assigns high confidence (95%) to tables that have
+        borders on all sides, addressing cases where formatting clearly
+        indicates table boundaries.
+
+        Args:
+            sheet_data: Sheet data to analyze
+
+        Returns:
+            List of TableInfo objects with high confidence
+        """
+```
+
 ### IslandDetector
 
 Multi-table detection using connected components.
@@ -294,6 +327,9 @@ class IslandDetector:
     def detect_islands(self, sheet_data: SheetData) -> list[DataIsland]:
         """Detect disconnected data regions.
 
+        Headers are always extracted from the first row of each detected
+        island, regardless of header detection confidence.
+
         Args:
             sheet_data: Sheet data to analyze
 
@@ -584,6 +620,7 @@ MIN_TABLE_SIZE = (2, 2)
 
 # Detection methods
 DETECTION_SIMPLE_CASE = "simple_case_fast"
+DETECTION_BOX_TABLE = "box_table_detection"
 DETECTION_ISLAND = "island_detection_fast"
 DETECTION_METADATA = "excel_metadata"
docs/ARCHITECTURE.md

Lines changed: 21 additions & 3 deletions
@@ -30,8 +30,9 @@ GridGulp is a streamlined table detection framework that uses proven algorithms
 │ Detection Pipeline │
 │ ┌─────────────────────────────────────────────────┐ │
 │ │ 1. SimpleCaseDetector (single table near A1) │ │
-│ │ 2. IslandDetector (multi-table detection) │ │
-│ │ 3. ExcelMetadataExtractor (ListObjects) │ │
+│ │ 2. BoxTableDetector (tables with complete borders)│ │
+│ │ 3. IslandDetector (multi-table detection) │ │
+│ │ 4. ExcelMetadataExtractor (ListObjects) │ │
 │ └─────────────────────────────────────────────────┘ │
 ├─────────────────────────────────────────────────────────┤
 │ Output Models │
@@ -88,12 +89,21 @@ GridGulp is a streamlined table detection framework that uses proven algorithms
 - **Performance**: < 1ms for most sheets
 - **Accuracy**: 100% for standard tables
 - **Algorithm**: Find data bounds, check density
+- **Headers**: Always extracted from first row
+
+#### BoxTableDetector
+- **Use Case**: Tables with complete borders on all four sides
+- **Performance**: < 10ms for most sheets
+- **Accuracy**: 95% confidence for bordered tables
+- **Algorithm**: Detect cells with borders on all sides, verify data density
+- **Headers**: Extracted with formatting-based detection
 
 #### IslandDetector
 - **Use Case**: Multiple disconnected tables
 - **Performance**: < 100ms for complex sheets
 - **Accuracy**: 95%+ for well-formatted data
 - **Algorithm**: Connected component analysis
+- **Headers**: Always extracted, with header detection confidence
 
 #### ExcelMetadataExtractor
 - **Use Case**: Excel tables with defined ListObjects
@@ -142,8 +152,16 @@ for sheet in file_data.sheets:
     # Try simple case first (fast path)
     if simple_detector.is_simple_case(sheet):
         tables = [simple_detector.detect_simple_table(sheet)]
+    # Try box detection for Excel files with formatting
+    elif file_type in [FileType.XLSX, FileType.XLS]:
+        box_tables = box_detector.detect_box_tables(sheet)
+        if box_tables:
+            tables = box_tables
+        else:
+            # Fall back to island detection
+            tables = island_detector.detect_tables(sheet)
     else:
-        # Fall back to island detection
+        # Use island detection for other formats
         tables = island_detector.detect_tables(sheet)
 ```
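
ARCHITECTURE.md summarizes IslandDetector as connected component analysis over a mask of non-empty cells. A minimal pure-Python sketch of that technique, assuming 8-connectivity (diagonal neighbors count) and a list-of-lists grid in which `None` and `""` mean empty. `find_islands` is a hypothetical name, and the real detector may use different connectivity or gap handling.

```python
def find_islands(grid):
    """Return bounding boxes (top, left, bottom, right) of connected data regions."""
    # Binary mask: True where a cell holds data.
    mask = [[cell not in (None, "") for cell in row] for row in grid]
    seen = set()
    boxes = []
    for r, row in enumerate(mask):
        for c, filled in enumerate(row):
            if not filled or (r, c) in seen:
                continue
            # Depth-first flood fill from this unvisited data cell.
            stack = [(r, c)]
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while stack:
                y, x = stack.pop()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < len(mask) and 0 <= nx < len(mask[ny])
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


sheet = [
    ["id", "name", None, None],
    [1, "a", None, None],
    [None, None, None, None],
    [None, None, "x", "y"],
    [None, None, 3, 4],
]
print(find_islands(sheet))  # [(0, 0, 1, 1), (3, 2, 4, 3)]
```

The empty row between the two blocks is what separates them here; a table with an internal blank row would be split the same way, which is one reason a production detector might merge nearby islands.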
