
[TabArena 2025 Roadmap] NeurIPS 2025 Camera Ready and Future Work Items #213

@Innixma

Description


Planned items for the NeurIPS 2025 camera-ready deadline (Oct 23) and future plans for 2025.

Remark

Since the paper submission and rebuttal, we have already updated many models, so we have three different states of results: (1) the state at submission time; (2) the state when uploading to arXiv, including the model reruns we promised in the rebuttal (this is the current state of the live LB); and (3) the current state of all models we have run since then, which we will publish soon. The camera-ready version will be based on state (2), and the next version of the live leaderboard will be based on state (3).

P0

  • Technical Debt
  • Finalize integration of results for new models, which we have already run since submission
    • LimiX [Not Verified]
    • TabFlex [Not Verified]
    • Beta-TabPFN [Not Verified]
  • Update leaderboard
    • Add test data leakage column (boolean or %)
    • Add additional subsets: binary, multiclass (?)
    • Consider removing some models from the LB plots (e.g., KNN, models worse than RF) but keep them in the LB tables.
    • TabArena Rank Metric? := (rank based on Elo + rank based on Improvability + harmonic rank) / 3
    • Communicate overtuning of TabArena
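The proposed rank metric above can be sketched as follows. This is a hypothetical illustration only, assuming Elo is higher-is-better, Improvability is lower-is-better, and the harmonic rank is already given per model; the function and variable names are not from the TabArena codebase.

```python
import numpy as np

def rank(values, higher_is_better=True):
    """Return 1-based ranks; rank 1 is the best model."""
    order = np.argsort(values)
    if higher_is_better:
        order = order[::-1]
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def tabarena_rank_metric(elo, improvability, harmonic_rank):
    # Mean of three per-model ranks, as in the proposed formula.
    elo_rank = rank(elo, higher_is_better=True)
    impr_rank = rank(improvability, higher_is_better=False)
    return (elo_rank + impr_rank + np.asarray(harmonic_rank)) / 3

# Illustrative values for three models.
elo = [1500, 1420, 1600]
improvability = [0.02, 0.05, 0.01]
harmonic = [2.0, 3.0, 1.0]
print(tabarena_rank_metric(elo, improvability, harmonic))  # [2. 3. 1.]
```

With made-up numbers like these, the third model wins on all three ranks and ends up with the best aggregate score.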

P1 (Nice to have)

  • Improve the bagged/non-bagged use case: avoid calling predict twice for bagged models; also store validation splits for refitted models

  • Release TabArena as a PyPI package (with a consistent API and no backward-incompatible changes)

  • Secure a stable, long-term upload location (an NGO? a separate entity that someone pays for?)

  • More Models

    • TabICL with HPO [We got results but have not added them, need verification/confirmation from @dholzmueller]
    • PerpetualBoosting [We got some results, but not verified]
    • TabM version based on pip package and with varying inner seeds.
    • LimiX with HPO
    • Orion
    • iLTM
  • Improve User Experience

P2 (Stretch Goal)

  • Think about how to communicate the difference between HPO, fine-tuning, and ICL performance
  • Improve tree-based models further
  • Rerun all methods with varying inner seeds
    • CPU models: KNN, Linear, RF, ExtraTrees, FastaiMLP, TorchMLP, CatBoost, LightGBM, XGBoost
    • GPU models: Beta-TabPFN, ModernNCA (HPO)
    • GPU models with HPO and refitting (so minor impact): TabPFNv2 (HPO), TabICL (HPO)
    • GPU models without HPO and refitting (so likely no impact): TabFlex, TabDPT
  • Portfolio-building logic that created the AutoGluon 1.4 extreme preset portfolio? (@Innixma)
  • Reference Pipelines?
    • AutoGluon high, HQIL, good, medium quality runs (@Innixma)
    • AutoGluon w/ smaller time limit (5 min, 10 min, 30 min) (@Innixma)
    • Integration with AMLB for other AutoML system results and support for systems/agents

Done (old)

Finished TODOs

More Ecosystem

  • More Models
    • RealTabPFN
  • Improve User Experience
    • Polished end-to-end example of locally fitting a model -> evaluating on TabArena
    • Polished installation instructions & FAQ (ex: TabDPT install error)
  • Make the LB and TabArena an HF paper (to get into trending papers)

Paper

  • Regenerate plots/figs to include the fixed 95% CI logic for Elo.
  • Finalize and add new figures (@Innixma)
    • Pareto front of Improvability and inference time
    • showing the performance over (tuning) time related to Improvability
    • Validation Overfitting plot to replace the ensemble-weight plot (or similar) + adjusting writing (@Innixma)
  • Finalize and add improvements to writing and the appendix based on feedback and our promises in the rebuttal (@LennartPurucker)
    • Ensure writing around the integration of new figures aligns with the rebuttal promises (@LennartPurucker)
  • Finalize and add improvements to writing and the appendix based on community feedback (@LennartPurucker)
  • Ensure we mention and quantify likely dataset contamination for TabDPT
  • Win-rate matrix (@Innixma)
  • Remove the imputed dataset for KNN from the per-dataset performance tables (@atschalz) and state more clearly that KNN uses imputed results for datasets without numerical features (we only state in C1 that we drop all categorical features).
    • PR Merged to remove imputed KNN from tables (@atschalz)
    • Re-run with corrected code on the paper results (don't use new RealMLP/EBM results) (@Innixma)
    • New version of KNN preprocessing; rerun KNN; replot everything
    • Fix Linear Model search space so C uses log=True, run 201 configs.
    • Remove extra _c2 configs, only have 1 default per method.
  • 201 runs for KNN and Linear?
  • Replace "tuned + ensemble" with "tuned + ensembled"
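The linear-model fix above asks for the regularization strength C to be searched on a log scale (`log=True`) over 201 configs. A minimal sketch of what log-uniform sampling looks like, assuming illustrative bounds of 1e-4 to 1e4 for C (the actual TabArena search space and config machinery may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs = 201
low, high = 1e-4, 1e4  # assumed bounds for C, purely illustrative

# Log-uniform sampling: draw uniformly in log-space, then exponentiate,
# so each decade of C is sampled with equal probability.
log_c = rng.uniform(np.log(low), np.log(high), size=n_configs)
configs = [{"C": float(np.exp(v))} for v in log_c]

print(len(configs))                                  # 201
print(all(low <= c["C"] <= high for c in configs))   # True
```

Sampling uniformly in log-space is the usual choice for scale-type hyperparameters like C, where the useful range spans many orders of magnitude.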

Ecosystem

  • Finalize integration of results for new models, which we have already run since submission
    • RealMLP_GPU [Verified]
    • EBM [Verified]
    • Mitra [Verified]
    • xRFM [Verified]
    • New run of TabDPT with verification from authors [context size, ensemble usage, ...]
    • Better KNN baseline pipeline (improve preprocessing and search space), or consider removing it.
    • Verify/improve linear model and its HPO if possible, go to 201 configs as for all other models.
  • Technical Debt
  • Update leaderboard
    • Ensure we are using the newest data from model runs that we have at the time of the update
    • Integrate Pareto front and tuning-over-time plots into each leaderboard, like the main figure
    • Add support for results of unverified models
    • Add additional subsets: "not small data"; consider removing/renaming TabICL-data and TabPFN-data.
  • Add verified / unverified to existing models

Metadata

Labels: documentation (Improvements or additions to documentation)