Description
Planned items for camera-ready deadline of NeurIPS 2025 (Oct 23rd) and future plans for 2025.
Remark
Because we have already updated many models since the paper submission and rebuttal, we have three different states of results: (1) the state at submission time; (2) the state when uploading to arXiv, including the reruns we promised in the rebuttal [this is the current state of the live LB]; and (3) the current state of all models we have run since then, which we will publish soon. The camera-ready version will be based on state (2), and the next version of the live leaderboard will be based on state (3).
P0
- Technical Debt
- Cleanup/remove old TabRepo scripts (@Innixma)
- Add an easier way to run TabArena-Full to the runner (see [WIP][New Model] Dynamic Programming Decision Trees #176 (comment))
- Make the method metadata workflow much easier and move the place to add/edit metadata somewhere more usable
- Finalize integration of results for new models that we have already run since submission
- LimiX [Not Verified]
- TabFlex [Not Verified]
- Beta-TabPFN [Not Verified]
- Update leaderboard
- Add test data leakage column (boolean or %)
- Add additional subsets: binary, multiclass (?)
- Consider removing some models from the plots in the LB (KNN, models worse than RF, ...) but keep them in the LB tables.
- TabArena Rank Metric? := (rank (based on Elo) + rank based on Improvability + harmonic rank)/3
- Communicate overtuning of TabArena
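The proposed TabArena Rank Metric above could be computed roughly as follows. This is a minimal sketch, not the project's implementation: the function names `average_ranks` and `tabarena_rank` are hypothetical, and since "harmonic rank" is not defined in this issue, it is assumed here to mean the rank of each model by the harmonic mean of its per-dataset ranks. The sketch also assumes that higher Elo is better and lower Improvability is better, and all numbers are made up for illustration.

```python
from statistics import harmonic_mean


def average_ranks(scores, higher_is_better):
    """Return 1-based ranks (1 = best); tied models share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j over the run of models tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared_rank
        i = j + 1
    return ranks


def tabarena_rank(elo, improvability, per_dataset_ranks):
    """(Elo rank + Improvability rank + harmonic rank) / 3 for each model."""
    elo_rank = average_ranks(elo, higher_is_better=True)
    imp_rank = average_ranks(improvability, higher_is_better=False)
    # Assumption: "harmonic rank" = rank by harmonic mean of per-dataset ranks.
    hmean = {m: harmonic_mean(r) for m, r in per_dataset_ranks.items()}
    harm_rank = average_ranks(hmean, higher_is_better=False)
    return {m: (elo_rank[m] + imp_rank[m] + harm_rank[m]) / 3 for m in elo}


# Illustrative (made-up) inputs for three hypothetical models.
elo = {"A": 1500.0, "B": 1400.0, "C": 1300.0}
improvability = {"A": 5.0, "B": 10.0, "C": 7.0}
per_dataset_ranks = {"A": [1, 1, 2], "B": [2, 3, 1], "C": [3, 2, 3]}

scores = tabarena_rank(elo, improvability, per_dataset_ranks)
for model, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{model}: {score:.2f}")
```

Averaging ranks rather than raw values keeps the three components on the same scale, at the cost of discarding how large the gaps between models are.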
P1 (Nice to have)
- Improve bagged/non-bagged use case: avoid calling predict twice for bagged models, and store validation splits for refitted models as well
- Release TabArena as a PyPI package (with API consistency and no backward-breaking changes)
- Find a stable, long-term place to host uploads (an NGO? some extra entity that someone pays for)
- More Models
- TabICL with HPO [We got results but have not added them, need verification/confirmation from @dholzmueller]
- PerpetualBoosting [We got some results, but not verified]
- TabM version based on pip package and with varying inner seeds.
- LimiX with HPO
- Orion
- iLTM
- Improve User Experience
- Create a technical API
- Create an onboarding page with different use cases and better step-by-step documentation (upgrade from https://github.com/TabArena/tabarena_benchmarking_examples)
P2 (Stretch Goal)
- Think about how to communicate the difference between HPO, fine-tuning, and ICL performance
- Improve tree-based models further
- Update hyperparameters and implementation (also use newer versions if they exist) for tree-based models (@dholzmueller, Tune criterion for RF / XT #203, [TabArena] Try better fillna / cat encoding strategies for RF/XT/KNN #124)
- Rerun all methods with varying inner seeds
- CPU models: KNN, Linear, RF, ExtraTrees, FastaiMLP, TorchMLP, CatBoost, LightGBM, XGBoost
- GPU models: Beta-TabPFN, ModernNCA (HPO)
- GPU models with HPO and refitting (so minor impact): TabPFNv2 (HPO), TabICL (HPO)
- GPU models without HPO and refitting (so likely no impact): TabFlex, TabDPT
- Portfolio building logic that created AutoGluon 1.4 extreme preset portfolio? (@Innixma)
- ?Reference Pipelines?
Done (old)
Finished TODOs
More Ecosystem
- More Models
- RealTabPFN
- Improve User Experience
- Polished end to end example of locally fitting a model -> evaluating on TabArena
- Polished installation instructions & FAQ (ex: TabDPT install error)
- Make LB and TabArena a HF paper (to get into trending papers)
Paper
- Regenerate plots/figs to include the fixed 95% CI logic for Elo.
- Finalize and add new figures (@Innixma)
- Pareto front of Improvability and inference time
- Performance over (tuning) time in terms of Improvability
- Validation Overfitting plot to replace the ensemble-weight plot (or similar) + adjusting writing (@Innixma)
- Finalize and add improvements to writing and the appendix based on feedback and our promises in the rebuttal (@LennartPurucker)
- Ensure writing around the integration of new figures aligns with the rebuttal promises (@LennartPurucker)
- Finalize and add improvements to writing and the appendix based on community feedback (@LennartPurucker)
- Ensure we mention and quantify likely dataset contamination for TabDPT
- Win-rate matrix (@Innixma)
- Remove the imputed dataset for KNN from the per-dataset performance tables (@atschalz) and state more clearly somewhere that KNN has imputed results for datasets without numerical features (we only state in C1 that we drop all categorical features).
- PR Merged to remove imputed KNN from tables (@atschalz)
- Re-run with corrected code on the paper results (don't use new RealMLP/EBM results) (@Innixma)
- New version of KNN preprocessing; rerun KNN; replot everything
- Fix Linear Model search space so C uses log=True, run 201 configs.
- Remove extra `_c2` configs, only have 1 default per method.
- 201 runs for KNN and Linear?
- Replace `tuned + ensemble` with `tuned + ensembled`
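The linear-model search-space fix above (sampling C with log=True) can be illustrated with a short sketch. It assumes that the 201 configs consist of 1 default plus 200 random draws; the bounds, the default value of C, and the helper names are hypothetical and are not taken from the TabArena code.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value uniformly in log-space, i.e. what log=True usually means."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def linear_model_configs(n_random=200, seed=0):
    """Return 1 default config plus n_random log-uniform draws of C."""
    rng = random.Random(seed)
    configs = [{"C": 1.0}]  # hypothetical default config
    for _ in range(n_random):
        # Hypothetical bounds; the real search space may differ.
        configs.append({"C": sample_log_uniform(0.1, 1000.0, rng)})
    return configs


configs = linear_model_configs()
print(len(configs))
```

Sampling in log-space spreads the draws evenly across orders of magnitude, whereas a plain uniform draw over the same range would place almost all values of C in the top decade.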
Ecosystem
- Finalize integration of results for new models that we have already run since submission
- RealMLP_GPU [Verified]
- EBM [Verified]
- Mitra [Verified]
- xRFM [Verified]
- New run of TabDPT with verification from authors [context size, ensemble usage, ...]
- Better KNN baseline pipeline (improve preprocessing and search space), or consider removing it.
- Verify/improve linear model and its HPO if possible, go to 201 configs as for all other models.
- Technical Debt
- Rename Code package to `tabarena`
- Rename GitHub repository to TabArena
- Merge https://github.com/TabArena/tabarena_benchmarking_examples (as much as possible, maybe just examples) into TabArena codebase
- Update leaderboard
- Ensure we are using the newest data from model runs that we have at the time of the update
- Integrate the Pareto front and tuning over time as plots into each leaderboard, like the main figure
- Add support for results of unverified models
- Add additional subsets: "not small data"; consider removing/renaming TabICL-data and TabPFN-data.
- Add verified / unverified to existing models