Project for an advanced lab investigating LLM benchmarks from an information retrieval (IR) perspective. Rather than measuring model performance, we evaluated the robustness of the benchmarks themselves: which questions actually differentiate models, and whether leaderboard rankings reflect real differences or are dominated by easy, high-hubness items.