Commit 44226ba

Merge pull request #8 from Ganymede-Bio/ci-cleanup
CI / general cleanup
2 parents f72142a + 963fa40 commit 44226ba

59 files changed: +5482 -846 lines changed

.github/workflows/ci.yml

Lines changed: 9 additions & 2 deletions
@@ -9,14 +9,17 @@ on:
 jobs:
   test:
     runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12", "3.13"]

     steps:
       - uses: actions/checkout@v4

-      - name: Set up Python 3.11
+      - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
-          python-version: "3.11"
+          python-version: ${{ matrix.python-version }}

       - name: Install uv
         uses: astral-sh/setup-uv@v6
@@ -33,6 +36,10 @@ jobs:
         run: |
           uv run ruff check src/ tests/

+      - name: Type check with mypy
+        run: |
+          uv run mypy src/
+
       - name: Test with pytest
         run: |
           uv run pytest tests/ -v --cov=gridgulp --cov-report=xml

.github/workflows/docs.yml

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+name: Deploy Documentation
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+permissions:
+  contents: write
+  pages: write
+  id-token: write
+
+jobs:
+  build-docs:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Full history for git info
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Cache dependencies
+        uses: actions/cache@v3
+        with:
+          path: |
+            ~/.cache/pip
+            ~/.cache/uv
+          key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml') }}
+          restore-keys: |
+            ${{ runner.os }}-pip-
+
+      - name: Install dependencies
+        run: |
+          pip install uv
+          uv pip install --system -e ".[docs]"
+
+      - name: Build documentation
+        run: mkdocs build --strict
+
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: ./site
+
+  deploy-docs:
+    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+    needs: build-docs
+    runs-on: ubuntu-latest
+
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4

.github/workflows/test-outputs.yml

Lines changed: 0 additions & 93 deletions
This file was deleted.

.pre-commit-config.yaml

Lines changed: 25 additions & 0 deletions
@@ -5,6 +5,7 @@ repos:
       - id: trailing-whitespace
       - id: end-of-file-fixer
       - id: check-yaml
+        exclude: mkdocs.yml
       - id: check-added-large-files
         args: ['--maxkb=1000']
       - id: check-json
@@ -22,3 +23,27 @@ repos:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]
       - id: ruff-format
+
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v1.5.1
+    hooks:
+      - id: mypy
+        args: [--config-file=pyproject.toml]
+        additional_dependencies: [
+          "types-aiofiles",
+          "pandas-stubs>=2.0.0",
+          "pydantic>=2.0,<3.0",
+          "python-magic"
+        ]
+        files: ^src/
+
+  - repo: local
+    hooks:
+      - id: pytest
+        name: pytest
+        entry: pytest
+        language: system
+        types: [python]
+        pass_filenames: false
+        always_run: true
+        args: [--tb=short, -q]

CHANGELOG.md

Lines changed: 22 additions & 2 deletions
@@ -5,6 +5,26 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.3.1] - 2025-07-29
+
+### Changed
+- **Project Rename**: Renamed from GridPorter to GridGulp
+  - Updated all package references throughout codebase
+  - Renamed source directory from `src/gridporter/` to `src/gridgulp/`
+  - Updated project metadata and documentation
+- CI improvements:
+  - Added Python version matrix testing (3.10, 3.11, 3.12, 3.13)
+  - Updated ruff target version to py310 (minimum supported)
+
+### Fixed
+- Fixed build configuration to match new project name
+- Fixed all linting issues identified by ruff
+- Added appropriate lint rule exceptions for tests, examples, and scripts
+- Fixed CellRange/TableRange instantiation to use keyword arguments
+- Fixed StructuredTextDetector dimension calculations
+- Fixed header extraction in StructuredTextDetector
+- Fixed test compatibility issues in DataFrameExtractor tests
+
 ## [0.3.0] - 2025-07-28

 ### Added
@@ -19,9 +39,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Changed
 - **BREAKING**: Simplified architecture - removed all agent dependencies
-- Reduced codebase by ~77% while maintaining functionality
+- Reduced codebase substantially while maintaining functionality
 - Replaced complex agent orchestration with direct detection approach
-- SimpleCaseDetector and IslandDetector now handle 97% of use cases
+- SimpleCaseDetector and IslandDetector now handle most use cases
 - Improved file type detection to handle UTF-16 files correctly
 - capture_detection_outputs.py now processes ALL files in examples directory

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # GridGulp Project Instructions

 ## Overview
-GridGulp is a lightweight, efficient spreadsheet table detection framework with zero external dependencies. It automatically detects and extracts tables from spreadsheets (Excel, CSV, and text files) using proven algorithmic detection methods that handle 97% of real-world use cases.
+GridGulp is a lightweight, efficient spreadsheet table detection framework with zero external dependencies. It automatically detects and extracts tables from spreadsheets (Excel, CSV, and text files) using proven algorithmic detection methods that handle most real-world use cases.

 ## Core Architecture

README.md

Lines changed: 36 additions & 4 deletions
@@ -4,7 +4,7 @@ Automatically detect and extract tables from Excel, CSV, and text files.

 ## What is GridGulp?

-GridGulp finds tables in your spreadsheets - even when there are multiple tables on one sheet or when tables don't start at cell A1. No configuration required.
+GridGulp finds tables in your spreadsheets - even when there are multiple tables on one sheet or when tables don't start at cell A1. It comes with reasonable defaults and is fully configurable.

 **Supported formats:** `.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.csv`, `.tsv`, `.txt`

@@ -30,13 +30,40 @@ for sheet in result.sheets:
         print(f" - {table.range.excel_range}")
 ```

+### Jupyter Notebook Usage
+
+In Jupyter notebooks, you can use synchronous methods for simplicity:
+
+```python
+from gridgulp import GridGulp
+
+# Create GridGulp instance
+gg = GridGulp()
+
+# Use the sync method - works in Jupyter without any async complexity
+result = gg.detect_tables_sync("sales_report.xlsx")
+
+# Display results
+print(f"📄 File: {result.file_info.path.name}")
+print(f"📊 Total tables found: {result.total_tables}\n")
+
+for sheet in result.sheets:
+    print(f"Sheet: {sheet.name}")
+    for table in sheet.tables:
+        print(f" - Table at {table.range.excel_range}")
+        print(f" Size: {table.shape[0]} rows × {table.shape[1]} columns")
+        print(f" Confidence: {table.confidence:.1%}")
+```
+
 ### Extract DataFrames

+Extract detected tables as pandas DataFrames with automatic type inference and quality scoring:
+
 ```python
 from gridgulp.extractors import DataFrameExtractor
 from gridgulp.readers import get_reader

-# Extract detected tables as pandas DataFrames
+# Example: Extract tables from a sales report
 reader = get_reader("sales_report.xlsx")
 file_data = reader.read_sync()

@@ -47,12 +74,17 @@ for sheet_result in result.sheets:
     for table in sheet_result.tables:
         df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
         if df is not None:
-            print(f"Extracted {len(df)} rows with quality score: {quality:.2f}")
+            print(f"\n📊 Extracted table from {table.range.excel_range}")
+            print(f" Shape: {df.shape} | Quality: {quality:.1%}")
+            print(f" Headers: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
+            print(f"\nFirst few rows:")
+            print(df.head())
 ```

 ## Key Features

-- **Automatic Detection** - Finds all tables without configuration
+- **Automatic Detection** - Finds all tables with sensible defaults
+- **Fully Configurable** - Customize detection thresholds and behavior
 - **Smart Headers** - Detects single and multi-row headers automatically
 - **Multiple Tables** - Handles sheets with multiple separate tables
 - **Quality Scoring** - Confidence scores for each detected table
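
Note on the Jupyter section added to the README: a notebook kernel already runs an asyncio event loop, so calling `asyncio.run()` from a cell raises `RuntimeError`. That is presumably why a dedicated sync entry point is useful there. The diff does not show how `detect_tables_sync` is implemented, so the sketch below is one common pattern under stated assumptions, not GridGulp's code: `detect_tables` is a hypothetical stand-in coroutine and `run_sync` is a hypothetical helper name.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def detect_tables(path: str) -> str:
    # Hypothetical stand-in for an async detection call.
    await asyncio.sleep(0)
    return f"detected:{path}"


def run_sync(coro):
    """Run a coroutine to completion from synchronous code.

    With no event loop running (plain scripts), asyncio.run() is enough.
    With a loop already running in this thread (e.g. Jupyter), the
    coroutine is run on a fresh loop in a worker thread instead.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        loop_running = False
    else:
        loop_running = True

    if not loop_running:
        return asyncio.run(coro)

    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


print(run_sync(detect_tables("sales_report.xlsx")))
```

The worker-thread branch blocks the calling thread until the coroutine finishes, which is the behavior a notebook user expects from a sync method.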

docs/ARCHITECTURE.md

Lines changed: 16 additions & 16 deletions
@@ -6,7 +6,7 @@ GridGulp is a streamlined table detection framework that uses proven algorithms

 ## Core Design Principles

-1. **Fast Path First**: 97% of use cases handled by simple algorithms
+1. **Fast Path First**: most use cases handled by simple algorithms
 2. **No External Dependencies**: Pure algorithmic detection without AI/ML services
 3. **Format Agnostic**: Unified interface for Excel, CSV, and text files
 4. **Memory Efficient**: Streaming processing for large files
@@ -16,25 +16,25 @@ GridGulp is a streamlined table detection framework that uses proven algorithms

 ```
 ┌─────────────────────────────────────────────────────────┐
-│ GridGulp API │
+│ GridGulp API
 ├─────────────────────────────────────────────────────────┤
-│ File Type Detection
-│ (Magika + Magic)
+│ File Type Detection │
+│ (Magika + Magic) │
 ├─────────────────────────────────────────────────────────┤
-│ File Readers
-│ ┌─────────────┬──────────────┬────────────────────┐ │
-│ │ ExcelReader │ CSVReader │ TextReader │ │
-│ │ (openpyxl) │ (csv.reader)│ (encoding detect) │ │
-│ └─────────────┴──────────────┴────────────────────┘ │
+│ File Readers │
+│ ┌─────────────┬──────────────┬────────────────────┐
+│ │ ExcelReader │ CSVReader │ TextReader │
+│ │ (openpyxl) │ (csv.reader)│ (encoding detect) │
+│ └─────────────┴──────────────┴────────────────────┘
 ├─────────────────────────────────────────────────────────┤
-│ Detection Pipeline
-│ ┌─────────────────────────────────────────────────┐ │
-│ │ 1. SimpleCaseDetector (single table near A1) │ │
-│ │ 2. IslandDetector (multi-table detection) │ │
-│ │ 3. ExcelMetadataExtractor (ListObjects) │ │
-│ └─────────────────────────────────────────────────┘ │
+│ Detection Pipeline │
+│ ┌─────────────────────────────────────────────────┐
+│ │ 1. SimpleCaseDetector (single table near A1) │
+│ │ 2. IslandDetector (multi-table detection) │
+│ │ 3. ExcelMetadataExtractor (ListObjects) │
+│ └─────────────────────────────────────────────────┘
 ├─────────────────────────────────────────────────────────┤
-│ Output Models
+│ Output Models │
 │ DetectionResult → SheetResult → TableInfo │
 └─────────────────────────────────────────────────────────┘
 ```
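
Note on the detection pipeline in the diagram: the diff names `IslandDetector` for multi-table detection but does not show its algorithm. The name suggests treating each connected group of filled cells as one candidate table, and the sketch below illustrates that idea only. It is an assumption, not the project's implementation; `find_islands`, `is_filled`, and the grid representation (a list of rows with `None` for empty cells) are all hypothetical.

```python
from collections import deque


def is_filled(value) -> bool:
    """A cell counts as filled if it holds anything other than blank text."""
    return value is not None and str(value).strip() != ""


def find_islands(grid):
    """Return inclusive, 0-based bounding boxes (top, left, bottom, right)
    of 4-connected groups of filled cells, in row-major discovery order."""
    seen = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if (r, c) in seen or not is_filled(value):
                continue
            # Breadth-first flood fill from this unvisited filled cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (
                        0 <= nr < len(grid)
                        and 0 <= nc < len(grid[nr])
                        and (nr, nc) not in seen
                        and is_filled(grid[nr][nc])
                    ):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


sheet = [
    ["id", "name", None, None],
    [1, "a", None, None],
    [None, None, None, None],
    [None, None, "x", "y"],
    [None, None, 10, 20],
]
print(find_islands(sheet))  # → [(0, 0, 1, 1), (3, 2, 4, 3)]
```

A real detector would likely also tolerate small gaps inside a table and score each island; this sketch keeps only the connected-component step.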
