This repository implements machine learning kernel methods from scratch for an image classification Kaggle data challenge.
We use uv for minimal, reproducible dependency management. The solution complies with the challenge's strict constraints on the use of external machine learning libraries.
- Install dependencies:

  ```shell
  uv sync
  ```

- Kaggle authentication: ensure your Kaggle API token is configured at `~/.kaggle/kaggle.json`.

- Code formatting and linting: install the pre-commit hooks to run automated linting (Ruff for Python, mdformat for Markdown) before every commit:

  ```shell
  uv run pre-commit install
  ```
Ensure the dataset structure in the `data/` directory mirrors the following before running: `Xtr.csv`, `Xte.csv`, `Ytr.csv`.

To download the required data directly via the Kaggle CLI within the environment:

```shell
uv run kaggle competitions download -c data-challenge-kernel-methods-2025-2026 -p data/
cd data && unzip data-challenge-kernel-methods-2025-2026.zip && rm data-challenge-kernel-methods-2025-2026.zip && cd ..
```

The repository now uses one config-driven path for both local evaluation and Kaggle submission export. Submission files are written with the required `Id` and `Prediction` columns.
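The expected submission layout can be sketched as follows. This is an illustrative snippet only: the `write_submission` helper and the 1-based `Id` convention are assumptions for the example, not the repository's actual API.

```python
import csv

def write_submission(predictions, path, start_id=1):
    """Write class predictions in the two-column Id/Prediction format.

    `start_id` is an assumption here; check the competition's sample
    submission for the actual Id numbering.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, label in enumerate(predictions, start=start_id):
            writer.writerow([i, label])

write_submission([3, 0, 7], "Yte_example.csv")
```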
To train one validated config on the full training set and produce `Yte.csv`, use the evaluation script in full-data mode:

```shell
uv run python scripts/evaluate.py \
    --config configs/experiments/hog_scat_j3_ckn_cov_concat.yaml \
    --full-data \
    --output data/Yte.csv
```

The start script remains available as a thin wrapper around the same full-data path:

```shell
uv run start --config configs/experiments/hog_scat_j3_ckn_cov_concat.yaml
```

To submit the generated solution directly to Kaggle via the CLI:

```shell
uv run kaggle competitions submit -c data-challenge-kernel-methods-2025-2026 -f data/Yte.csv -m "Message describing the submission logic"
```

To run the full test suite with coverage:

```shell
uv run pytest --cov=kernel_methods tests/
```

To iterate on and evaluate specific kernels locally without burning Kaggle submissions, use the `evaluate.py` script. Experiments are configured as separate `.yaml` files in the `configs/experiments/` directory.
To evaluate one or multiple specific experiments, you can either:

```shell
# Evaluate the default experiment (linear)
make evaluate

# Evaluate multiple specific experiments by name
make evaluate EXPS="linear rbf"

# Or natively via Python
uv run python scripts/evaluate.py --experiments linear rbf
```

Augmentation is now resolved from the experiment config itself, with optional CLI overrides. For example, the current best config declares flip augmentation in YAML, and you can temporarily override it for a comparison run:
```shell
uv run python scripts/evaluate.py \
    --config configs/experiments/hog_scat_j3_ckn_cov_concat.yaml \
    --no-augment
```

To evaluate a config locally and then export a submission from the same entrypoint:
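As a rough illustration of what flip augmentation does to the training set — a hedged sketch assuming images are held as `(n, height, width, channels)` NumPy arrays; the repository's own augmentation code may differ:

```python
import numpy as np

def augment_with_flips(X, y):
    """Append a horizontally flipped copy of each image, doubling the set.

    Assumes X has shape (n, height, width, channels); this layout is an
    assumption for the example, not necessarily the repository's.
    """
    X_flipped = X[:, :, ::-1, :]  # mirror along the width axis
    X_aug = np.concatenate([X, X_flipped], axis=0)
    y_aug = np.concatenate([y, y], axis=0)
    return X_aug, y_aug

X = np.random.rand(8, 32, 32, 3)
y = np.arange(8)
X_aug, y_aug = augment_with_flips(X, y)
print(X_aug.shape, y_aug.shape)  # (16, 32, 32, 3) (16,)
```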
```shell
# Local validation
uv run python scripts/evaluate.py \
    --config configs/experiments/hog_scat_j3_ckn_cov_concat.yaml

# Full-data retrain + export
uv run python scripts/evaluate.py \
    --config configs/experiments/hog_scat_j3_ckn_cov_concat.yaml \
    --full-data \
    --output data/Yte.csv
```

To explore the new scalable lane that avoids dense Gram-matrix training, run the RFF-based configs on top of explicit handcrafted feature stacks:
```shell
# Scalable handcrafted baseline
uv run python scripts/evaluate.py \
    --config configs/experiments/rff_hog_scat_cov.yaml

# Scalable handcrafted + CKN feature stack
uv run python scripts/evaluate.py \
    --config configs/experiments/rff_hog_scat_cov_ckn.yaml
```

These configs use primal ridge regression on Random Fourier Features rather than dense kernel ridge, which makes them the intended path for testing whether heavier augmentation and larger effective sample counts unlock additional score.
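The scaling argument is that dense kernel ridge solves an n × n system in the number of training points, while primal ridge on Random Fourier Features solves a D × D system in the feature dimension. A minimal sketch of the RFF approximation to an RBF kernel followed by primal ridge — function names and hyperparameters here are illustrative, not the repository's implementation:

```python
import numpy as np

def rff_features(X, n_features=512, gamma=0.1, seed=0):
    """Map X to random Fourier features approximating exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For the RBF kernel with parameter gamma, frequencies are N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def primal_ridge_fit(Z, y, lam=1e-3):
    """Solve the D x D ridge system instead of the n x n kernel system."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

# Tiny regression example on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0])
Z = rff_features(X)
w = primal_ridge_fit(Z, y)
pred = Z @ w
```

The D × D solve is independent of the sample count, which is what lets this lane absorb more augmented data than a dense Gram matrix would.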
There is also an explicit experimental ZCA branch for raw-pixel RFF work:

```shell
uv run python scripts/evaluate.py \
    --config configs/experiments/rff_raw_zca.yaml
```

This branch is intentionally isolated: it is not part of the main HOG, scattering, covariance, or CKN pipeline and should be treated as a side experiment rather than a default preprocessing step.
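For reference, ZCA whitening of raw features can be sketched as below — a generic sketch with an illustrative epsilon, not the branch's exact code. Unlike PCA whitening, ZCA rotates back into the original basis after rescaling, so whitened pixels stay visually interpretable.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Whiten X so its empirical covariance is approximately the identity.

    eps regularizes near-zero eigenvalues; its value here is illustrative.
    """
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / X_centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA = rotate to eigenbasis, rescale, rotate back to the original basis
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_centered @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[:, 1] += 2.0 * X[:, 0]  # inject correlation so whitening has work to do
Xw = zca_whiten(X)
```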
The BibTeX entry below corresponds to the Kaggle competition used for this project:

```bibtex
@misc{data-challenge-kernel-methods-2025-2026,
  author       = {Michael Arbel},
  title        = {Data Challenge - Kernel methods (2025-2026)},
  year         = {2026},
  howpublished = {\url{https://kaggle.com/competitions/data-challenge-kernel-methods-2025-2026}},
  note         = {Kaggle}
}
```