
[TabArena 2025 Roadmap] NeurIPS 2025 Camera Ready and Future Work Items #213

@Innixma

Description


Planned items for the NeurIPS 2025 camera-ready deadline (Oct 23) and future plans for 2025.

Remark

Since the paper submission and rebuttal, we have already updated many models, so we have three different states of results: (1) the state at submission time; (2) the state when uploading to arXiv, including the model reruns we promised in the rebuttal (this is the current state of the live LB); and (3) the current state of all models we have run since then, which we will publish soon. The camera-ready version will be based on state (2), and the next version of the live leaderboard will be based on state (3).

P0

  • Technical Debt
  • Finalize integration of results for new models, which we have already run since submission
    • LimiX [Not Verified]
    • TabFlex [Not Verified]
    • Beta-TabPFN [Not Verified]
  • Update leaderboard
    • Add test data leakage column (boolean or %)
    • Add additional subsets: binary, multiclass (?)
    • Consider removing some models from the LB plots (e.g., KNN, models worse than RF) but keep them in the LB tables.
    • TabArena Rank Metric? := (rank based on Elo + rank based on Improvability + harmonic rank) / 3
    • Communicate overtuning of TabArena
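The proposed rank metric above can be sketched as follows. This is a hypothetical illustration only, assuming Elo is higher-is-better, Improvability is lower-is-better, and the harmonic rank is already given per model; the function and variable names are not from the TabArena codebase.

```python
import numpy as np

def rank(values, higher_is_better=True):
    """Return 1-based ranks; rank 1 is the best model."""
    order = np.argsort(values)
    if higher_is_better:
        order = order[::-1]
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def tabarena_rank_metric(elo, improvability, harmonic_rank):
    # Mean of three per-model ranks, as in the proposed formula.
    elo_rank = rank(elo, higher_is_better=True)
    impr_rank = rank(improvability, higher_is_better=False)
    return (elo_rank + impr_rank + np.asarray(harmonic_rank)) / 3

# Illustrative values for three models.
elo = [1500, 1420, 1600]
improvability = [0.02, 0.05, 0.01]
harmonic = [2.0, 3.0, 1.0]
print(tabarena_rank_metric(elo, improvability, harmonic))  # [2. 3. 1.]
```

With made-up numbers like these, the third model wins on all three ranks and ends up with the best aggregate score.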

P1 (Nice to have)

  • Improve the bagged/non-bagged use case: avoid calling predict twice for bagged models; also store validation splits for refitted models

  • Release TabArena as a PyPI package (with a consistent API and no backward-incompatible changes)

  • Secure a stable, long-term upload location (an NGO? a separate entity that someone pays for?)

  • More Models

    • TabICL with HPO [We got results but have not added them, need verification/confirmation from @dholzmueller]
    • PerpetualBoosting [We got some results, but not verified]
    • TabM version based on pip package and with varying inner seeds.
    • LimiX with HPO
    • Orion
    • iLTM
  • Improve User Experience

P2 (Stretch Goal)

  • Think about how to communicate the difference between HPO, fine-tuning, and ICL performance
  • Improve tree-based models further
  • Rerun all methods with varying inner seeds
    • CPU models: KNN, Linear, RF, ExtraTrees, FastaiMLP, TorchMLP, CatBoost, LightGBM, XGBoost
    • GPU models: Beta-TabPFN, ModernNCA (HPO)
    • GPU models with HPO and refitting (so minor impact): TabPFNv2 (HPO), TabICL (HPO)
    • GPU models without HPO and refitting (so likely no impact): TabFlex, TabDPT
  • Portfolio-building logic that created the AutoGluon 1.4 extreme preset portfolio? (@Innixma)
  • Reference Pipelines?
    • AutoGluon high, HQIL, good, medium quality runs (@Innixma)
    • AutoGluon w/ smaller time limit (5 min, 10 min, 30 min) (@Innixma)
    • Integration with AMLB for other AutoML system results and support for systems/agents

Done (old)

Finished TODOs

More Ecosystem

  • More Models
    • RealTabPFN
  • Improve User Experience
    • Polished end-to-end example of locally fitting a model -> evaluating on TabArena
    • Polished installation instructions & FAQ (ex: TabDPT install error)
  • Make the LB and TabArena an HF paper (to get into trending papers)

Paper

  • Regenerate plots/figs to include the fixed 95% CI logic for Elo.
  • Finalize and add new figures (@Innixma)
    • Pareto front of Improvability and inference time
    • showing the performance over (tuning) time related to Improvability
    • Validation Overfitting plot to replace the ensemble-weight plot (or similar) + adjusting writing (@Innixma)
  • Finalize and add improvements to writing and the appendix based on feedback and our promises in the rebuttal (@LennartPurucker)
    • Ensure writing around the integration of new figures aligns with the rebuttal promises (@LennartPurucker)
  • Finalize and add improvements to writing and the appendix based on community feedback (@LennartPurucker)
  • Ensure we mention and quantify likely dataset contamination for TabDPT
  • Win-rate matrix (@Innixma)
  • Remove the imputed dataset for KNN from the per-dataset performance tables (@atschalz) and state more clearly that KNN uses imputed results for datasets without numerical features (we only state in C1 that we drop all categorical features).
    • PR Merged to remove imputed KNN from tables (@atschalz)
    • Re-run with corrected code on the paper results (don't use new RealMLP/EBM results) (@Innixma)
    • New version of KNN preprocessing; rerun KNN; replot everything
    • Fix Linear Model search space so C uses log=True, run 201 configs.
    • Remove extra _c2 configs, only have 1 default per method.
  • 201 runs for KNN and Linear?
  • Replace "tuned + ensemble" with "tuned + ensembled"
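The linear-model fix above asks for the regularization strength C to be searched on a log scale (`log=True`) over 201 configs. A minimal sketch of what log-uniform sampling looks like, assuming illustrative bounds of 1e-4 to 1e4 for C (the actual TabArena search space and config machinery may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs = 201
low, high = 1e-4, 1e4  # assumed bounds for C, purely illustrative

# Log-uniform sampling: draw uniformly in log-space, then exponentiate,
# so each decade of C is sampled with equal probability.
log_c = rng.uniform(np.log(low), np.log(high), size=n_configs)
configs = [{"C": float(np.exp(v))} for v in log_c]

print(len(configs))                                  # 201
print(all(low <= c["C"] <= high for c in configs))   # True
```

Sampling uniformly in log-space is the usual choice for scale-type hyperparameters like C, where the useful range spans many orders of magnitude.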

Ecosystem

  • Finalize integration of results for new models, which we have already run since submission
    • RealMLP_GPU [Verified]
    • EBM [Verified]
    • Mitra [Verified]
    • xRFM [Verified]
    • New run of TabDPT with verification from authors [context size, ensemble usage, ...]
    • Better KNN baseline pipeline (improve preprocessing and search space), or consider removing it.
    • Verify/improve linear model and its HPO if possible, go to 201 configs as for all other models.
  • Technical Debt
  • Update leaderboard
    • Ensure we are using the newest data from model runs that we have at the time of the update
    • Integrate Pareto front and tuning-over-time plots into each leaderboard, like the main figure
    • Add support for results of unverified models
    • Add additional subsets: "not small data"; consider removing/renaming TabICL-data and TabPFN-data.
  • Add verified / unverified to existing models

Metadata

Labels: documentation (Improvements or additions to documentation)