Enterprise AI Analysis: Ensembling Tabular Foundation Models

Ensembling Tabular Foundation Models: A Diversity Ceiling and a Calibration Trap

Explore the findings from cutting-edge research on Tabular Foundation Models (TFMs) and ensemble strategies, revealing critical insights into performance, calibration, and computational costs in enterprise AI applications.

Executive Impact: Key Takeaways

TFMs show great promise but face challenges in ensemble diversity and calibration. Our analysis highlights the trade-offs between accuracy, compute, and reliability for critical business decisions.

0.961 Mean Pairwise Q-statistic: near-redundancy among TFMs
0.18% Accuracy Gain for Best Ensemble (Cascade_2level)
253× Compute Multiplier for Best Ensemble
8.13 Stacking_LR Log-Loss Rank (worst ensemble)

Executive Summary: Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961, close enough to 1 that the gain available to any convex combination is tightly bounded. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys +0.18% accuracy over the strongest single TFM at 253× the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse than the best base. Stacking with a logistic-regression meta-learner is the most striking case: it has competitive accuracy and ROC-AUC but the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which degrades calibration. We recommend greedy selection as the practical default.

Deep Analysis & Enterprise Applications

Introduction to TFMs and Ensembling

Tabular foundation models (TFMs) have advanced rapidly, with growing architectural diversity. We show that this diversity is mostly nominal: a pool of six modern TFMs produces near-redundant predictions, and ensembling cannot exploit diversity that is not there. TabPFN [1, 2], TabICL [3], and variants like Mitra [4], Orion-Bix [5], TabDPT [6], and CARTE [7] all perform in-context learning (ICL) over synthetic or curated priors. Continued pre-training on real-world tables [8] now rivals heavy AutoML in a single forward pass: TabPFNv2.5 matches AutoGluon 1.4 on TabArena [9].

What these models do not do, however, is win uniformly. Across our 153-dataset benchmark, the per-dataset accuracy wins among the six TFMs split as follows (all counts approximate): TabPFNv2.5, 52; TabICLv2, 46; TabPFNv2.6, 25; TabICL, 13; LimiX, 12; OrionMSPv1.5, 5. The median accuracy spread between best and worst TFM on a single dataset is 1.95%; on 24% of datasets it exceeds 5%, so committing to one TFM in advance loses meaningful accuracy on roughly a quarter of the benchmark. Ensembling is the textbook response [10, 11, 12].

The cost-benefit story for TFMs differs from the gradient-boosted decision tree (GBDT) setting where ensemble methods originated. Per-task TFM inference is cheap once the model is pretrained (under one second on most benchmark datasets), though pretraining itself is a substantial fixed cost. Cascade-style stacking layers K-fold out-of-fold (OOF) inference across all bases, dwarfing any single forward pass. Beyond compute, there is a structural concern. TFMs trained with ICL on synthetic priors approximate Bayesian model averaging at inference [13, 14]; broadly similar priors yield broadly similar posteriors, leaving little for a convex combiner to recover.

Context: Ensemble Literature & TFM Unique Challenges

The ensemble literature provides a useful baseline for what to expect from this study. Classical results show that gains from averaging classifiers scale with two things: how accurate the base classifiers are, and how independent their errors are. Bagging [10] reduces variance when bases are decorrelated; stacking [18, 19] learns a meta-mapping over base outputs; ensemble selection from a large library of fits [12] chooses a small weighted subset by greedy validation search. Across all three, the standard assumption is that the pool was constructed to be diverse, often by training the same model class on resampled or perturbed data. Diversity measures such as the Q-statistic, Cohen's κ, and disagreement [20] were introduced to quantify exactly that.
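To make the diversity measures concrete, the following is a minimal sketch of the pairwise Q-statistic (Yule's Q) and the disagreement measure, computed from per-instance correctness flags. The function names and the toy correctness vectors are illustrative assumptions and are not taken from the study.

```python
from itertools import combinations


def q_statistic(correct_a, correct_b):
    """Yule's Q from two lists of per-instance correctness flags (1 = correct)."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both classifiers correct
        elif not a and not b:
            n00 += 1          # both classifiers wrong
        elif a:
            n10 += 1          # only the first is correct
        else:
            n01 += 1          # only the second is correct
    den = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / den if den else float("nan")


def disagreement(correct_a, correct_b):
    """Fraction of instances on which exactly one classifier is correct."""
    return sum(bool(a) != bool(b) for a, b in zip(correct_a, correct_b)) / len(correct_a)


def mean_pairwise_q(correct_by_model):
    """Average Q over all unordered pairs of models."""
    pairs = list(combinations(correct_by_model, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)


# Toy correctness vectors for three hypothetical models on ten instances.
m1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
m2 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
m3 = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0]

print(q_statistic(m1, m2), q_statistic(m1, m3), mean_pairwise_q([m1, m2, m3]))
```

Q equals 1 when two classifiers never err on different instances in opposite directions, 0 when their errors are independent, and is negative when they tend to err on different instances, which is the regime where averaging helps most.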

Tabular foundation models break the assumption. The bases are pretrained, not refit per dataset, and the per-task perturbation budget is small: random seeds change the in-context order but not the underlying prior. Recent work has explored ensembling in this setting at three different levels. At the architecture level, TabM [16] shares parameters across an internal set of branches and trains them end-to-end. At the configuration level, the post-hoc protocol in TabPFN [2] averages many hyperparameter configurations of a single model. At the cross-class level, TabArena [9] reports post-hoc ensembles over TFMs, GBDTs, and neural baselines, and notes that validation-based weight selection can over-represent some model classes. HAPEns [15] extends that idea with a hardware-aware multi-objective selector. We sit at a fourth level: holding the model class fixed (all TFMs), and asking what convex or stacked combinations of six different pretrained TFMs can recover on their own.

The notion that ensemble gains have a ceiling is not new [20, 11], but it usually appears as a property of small classifier pools on small datasets, not as a structural feature of a pretraining family. For TFMs, the question becomes whether the inductive bias of ICL over synthetic priors leaves enough room for diversity to matter. There is theoretical reason to expect that it does not: [13] and [14] formalise the sense in which ICL approximates Bayesian model averaging at inference time, so two TFMs trained on broadly similar priors are already implicitly averaging over broadly similar posterior families. The Q-statistic we report (Q = 0.961) is the empirical counterpart of that argument: six models that all consult the same kind of prior fail on the same instances. Real-TabPFN [8] suggests one possible escape route, namely continued pre-training on real-world tables to shift the prior; whether that produces enough error decorrelation to lift the ceiling is open.

Calibration is the second thread. Modern deep classifiers tend to be overconfident, and post-hoc fixes such as temperature scaling [21] were developed specifically for that regime. [22] showed that some ensemble methods (notably bagged trees and random forests) produce well-calibrated probabilities almost as a byproduct, while others (boosting) do not. The pattern we find in §5.3 sits in the same family of results: convex combiners of TFM probabilities preserve the calibration of their inputs, but a discriminative meta-learner trained on out-of-fold predictions does not. Selective-prediction and worst-group metrics [23, 24] let us tell the two failure modes apart, which is why we report them alongside accuracy.

Ensemble Strategies Implemented

Six strategies share a common fit/predict/predict_proba interface over a fixed pool of K base TFMs, each producing a class-probability vector $p_k(x)$.

Weighted Averaging (WA). $p(x) = \sum_k w_k\, p_k(x)$ with $w_k \propto \mathrm{score}_k$ on validation. No second stage. Cheapest combiner.
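A minimal sketch of this combiner, assuming the weights are simply the validation scores normalised to sum to one (the source specifies only proportionality). The function name and the toy inputs are hypothetical.

```python
def weighted_average(probas, scores):
    """Blend per-base class probabilities with weights proportional to validation scores.

    probas: list over bases; each entry is a list of per-sample probability rows.
    scores: one validation score per base (assumed non-negative, not all zero).
    """
    total = sum(scores)
    weights = [s / total for s in scores]
    n_samples, n_classes = len(probas[0]), len(probas[0][0])
    blended = [
        [sum(w * p[i][c] for w, p in zip(weights, probas)) for c in range(n_classes)]
        for i in range(n_samples)
    ]
    return blended, weights


# Two hypothetical bases, one sample, two classes.
base_probas = [[[0.8, 0.2]], [[0.3, 0.7]]]
val_scores = [0.9, 0.6]
blended, weights = weighted_average(base_probas, val_scores)
print(weights, blended)
```

Because the weights are non-negative and sum to one, the output is a convex combination of the inputs and remains a valid probability vector.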

Greedy Selection [12]. Forward selection with replacement: at each of S iterations, the base whose addition maximises validation accuracy is added. Final weight equals selection count over S. We use S = 50, matching the AutoGluon [25] WeightedEnsembleModel default.
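The forward-selection loop described above can be sketched as follows. This is an illustrative pure-Python version, not the AutoGluon implementation; the tie-breaking rule (the first base wins ties) and all names are assumptions.

```python
def accuracy(proba, y):
    """Top-1 accuracy of probability (or unnormalised score) rows against integer labels."""
    hits = sum(max(range(len(row)), key=row.__getitem__) == t for row, t in zip(proba, y))
    return hits / len(y)


def greedy_selection(val_probas, y_val, n_iter=50):
    """Forward selection with replacement; returns one weight per base."""
    n_bases = len(val_probas)
    n, c = len(y_val), len(val_probas[0][0])
    running = [[0.0] * c for _ in range(n)]   # running sum of the selected bases' probabilities
    counts = [0] * n_bases
    for _ in range(n_iter):
        best_k, best_acc = 0, -1.0
        for k in range(n_bases):
            # The argmax is scale-invariant, so the sum is not divided by the ensemble size.
            cand = [[running[i][j] + val_probas[k][i][j] for j in range(c)] for i in range(n)]
            acc = accuracy(cand, y_val)
            if acc > best_acc:                # strict '>' keeps the first base on ties
                best_k, best_acc = k, acc
        counts[best_k] += 1
        running = [[running[i][j] + val_probas[best_k][i][j] for j in range(c)] for i in range(n)]
    return [cnt / n_iter for cnt in counts]   # weight = selection count / S


# Toy validation set: base 0 is always right, base 1 is always wrong.
y_val = [0, 1, 0, 1]
good = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
bad = [[0.2, 0.8], [0.8, 0.2], [0.2, 0.8], [0.8, 0.2]]
greedy_weights = greedy_selection([good, bad], y_val, n_iter=10)
print(greedy_weights)
```

Selecting with replacement lets a strong base be picked repeatedly, so the final weights can concentrate on one or two models when the rest add nothing on the validation set.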

Stacking [18, 19]. Bases produce 5-fold out-of-fold (OOF) predictions. A logistic-regression meta-learner is trained on the OOF features.
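A sketch of how out-of-fold features for a meta-learner can be assembled, assuming contiguous unshuffled folds for brevity (a real pipeline would normally shuffle or stratify). `fit_predict` is a hypothetical callable standing in for a base TFM, and the class-prior model is a toy stand-in.

```python
def kfold_indices(n, n_folds):
    """Split range(n) into contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, n_folds)
    folds, start = [], 0
    for f in range(n_folds):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold(fit_predict, X, y, n_folds=5):
    """Each row's prediction comes from a model that never saw that row."""
    oof = [None] * len(X)
    for held_out in kfold_indices(len(X), n_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in held_out],
        )
        for i, p in zip(held_out, preds):
            oof[i] = p
    return oof


def prior_model(X_train, y_train, X_test):
    """Toy binary 'base model' that predicts the training class frequencies."""
    p1 = sum(y_train) / len(y_train)
    return [[1.0 - p1, p1] for _ in X_test]


X = list(range(10))
y = [0] * 5 + [1] * 5
oof = out_of_fold(prior_model, X, y, n_folds=5)
print(oof[0], oof[4], oof[9])
```

The OOF rows from all bases would then be concatenated column-wise and passed to the logistic-regression meta-learner; that fitting step is omitted here.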

Temperature-Scaled Blending [21]. A per-base temperature $T_k$ is fit on the validation set by minimising the negative log-likelihood (NLL) of $\mathrm{softmax}(\log p_k / T_k)$; the calibrated probabilities are then averaged uniformly.
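A minimal sketch of fitting one base's temperature, assuming a simple grid search over T in place of a gradient-based optimiser (the source does not say which optimiser was used). The grid, the clipping constant, and the toy data are assumptions.

```python
import math


def temper(row, T):
    """softmax(log p / T) for one probability row; T > 1 softens, T < 1 sharpens."""
    logits = [math.log(max(p, 1e-12)) / T for p in row]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def nll(probas, y, T=1.0):
    """Mean negative log-likelihood of the true labels after tempering."""
    return -sum(math.log(max(temper(row, T)[t], 1e-12)) for row, t in zip(probas, y)) / len(y)


def fit_temperature(probas, y, grid=None):
    """Pick the temperature on a log-spaced grid that minimises validation NLL."""
    grid = grid or [0.25 * 1.1 ** i for i in range(40)]   # roughly 0.25 to 10
    return min(grid, key=lambda T: nll(probas, y, T))


# Toy overconfident base: 95% confidence in class 0, but right only 3 times out of 4.
val_probas = [[0.95, 0.05]] * 4
val_labels = [0, 0, 0, 1]
T_hat = fit_temperature(val_probas, val_labels)
print(T_hat, nll(val_probas, val_labels, 1.0), nll(val_probas, val_labels, T_hat))
```

In this toy case the fitted temperature is above 1, which softens the overconfident outputs. Tempering never changes the argmax, so it can improve NLL but not accuracy.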

Cascade Stacking. Two-level stacking with skip connections, modifying AutoGluon's high-quality preset [25]. Level-1 OOF predictions concatenate with raw features and feed level-2 base models, also with K-fold OOF. A final greedy-selection layer combines all level outputs. We use 2 levels, 3-fold OOF, S = 50.

Random-Init (Deep) Ensemble [26]. Each TFM is run with M = 3 different seeds. Per-base predictions are averaged across seeds, then cross-base averaging uses performance weights.

Experimental Setup Details

Datasets. 153 OpenML classification tasks drawn from the CC18 [28], TALENT [29], and TabZilla [30] pools. Selection criteria and the full dataset inventory are in Appendix H.

Base TFMs. Six models in inference mode: TabPFNv2.5 [31], TabPFNv2.6 [2], TabICL [3], TabICLv2 [3], LimiX [32], OrionMSPv1.5 [33].

Protocol. Per dataset: an 80/20 stratified train/test split; within train, a 75/25 train/validation split for ensemble weight learning. Stacking and cascade levels use 5-fold and 3-fold internal CV respectively. A fixed seed controls splits and base-model initialisation.

Metrics. Accuracy, weighted F1, one-vs-rest ROC-AUC, multi-class log-loss, and total fit-time per dataset (seconds). For deeper analysis on the TabArena classification suite we additionally report expected calibration error (ECE) [21], the reliability component of the Brier decomposition [34], area under the risk-coverage curve (AURC) [23], coverage at 95% accuracy, and worst-group accuracy (WGA) [24]. Statistical significance is reported via Friedman [35], Nemenyi [36], and pairwise Wilcoxon signed-rank [37].

Hardware. A single H100 (80 GB) GPU per run.

Key Results and Performance Insights

5.1 Aggregate performance
Table 1 reports per-method statistics across the 153 datasets in our benchmark. The accuracy spread among the top eight methods is 0.45 percentage points (TabICL at 0.872 to Cascade at 0.882). The Friedman test rejects equality of mean ranks across the 12 methods ($\chi^2 = 389.95$, $p < 10^{-30}$); methods are not exchangeable, but the question is which differences survive a per-pair correction. Calibration, selective-prediction, and group-robustness metrics on the TabArena suite are reported in Table 3 (Appendix C) and analysed in §5.3.
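For readers unfamiliar with the test, here is a sketch of the Friedman statistic in its textbook form, $\chi^2_F = \frac{12N}{k(k+1)}\left(\sum_j \bar{R}_j^2 - \frac{k(k+1)^2}{4}\right)$, computed from a datasets-by-methods score matrix. It omits the tie correction, and the toy scores are invented for illustration; the study's own implementation is not specified in the text.

```python
def average_ranks(scores):
    """Rank one dataset's scores (higher is better, rank 1 is best); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def friedman_chi2(score_matrix):
    """Friedman chi-square (no tie correction) and mean ranks per method."""
    n, k = len(score_matrix), len(score_matrix[0])
    rank_sums = [0.0] * k
    for row in score_matrix:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_ranks = [s / n for s in rank_sums]
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0)
    return chi2, mean_ranks


# Toy matrix: 3 datasets x 3 methods, with the same ordering on every dataset.
toy_scores = [[0.90, 0.85, 0.80], [0.88, 0.86, 0.70], [0.95, 0.91, 0.90]]
chi2, mean_ranks = friedman_chi2(toy_scores)
print(chi2, mean_ranks)
```

With perfectly consistent rankings the statistic reaches its maximum N(k-1); under the null hypothesis it is approximately chi-square distributed with k-1 degrees of freedom.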

The Nemenyi critical difference at α = 0.05 for K = 12, N = 153 is CD = 1.347. Three methods sit within CD of the top-ranked Cascade_2level: Stacking_LR (Δ = 0.48), Greedy_Selection (Δ = 0.80), and TabICLv2 (Δ = 0.90). Three ensembles and one base TFM are statistically indistinguishable on accuracy across 153 tasks; the remaining three ensembles cannot beat the best base. Pairwise Wilcoxon tests sharpen the picture: against TabICLv2, only Cascade_2level wins (+0.18%, p = 0.008); Greedy_Selection (+0.01%) and Stacking_LR (−0.03%) tie; WA, Temp_Scaled, and DeepEnsemble_3seed are all significantly worse (p < 0.05). One ensemble of six clears the bar of beating the strongest base.
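The critical difference can be reproduced from the standard formula $\mathrm{CD} = q_\alpha \sqrt{K(K+1)/(6N)}$. The sketch below assumes the tabulated critical value $q_{0.05} \approx 3.268$ for 12 methods (the studentized-range quantile divided by $\sqrt{2}$). That constant is not stated in the text, but it does reproduce the reported CD. The rank values are those listed in the strategy comparison table on this page.

```python
import math

Q_ALPHA_005_K12 = 3.268  # assumed tabulated value for alpha = 0.05 and 12 methods


def nemenyi_cd(q_alpha, k, n):
    """Nemenyi critical difference for k methods compared over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


cd = nemenyi_cd(Q_ALPHA_005_K12, k=12, n=153)

# Mean accuracy ranks as reported on this page (lower is better).
accuracy_ranks = {
    "Cascade_2level": 4.48,
    "Stacking_LR": 4.96,
    "Greedy_Selection": 5.28,
    "TabICLv2": 5.39,
    "DeepEnsemble_3seed": 6.10,
}
best = min(accuracy_ranks.values())
within_cd = sorted(
    name for name, rank in accuracy_ranks.items() if 0 < rank - best <= cd
)
print(round(cd, 3), within_cd)
```

Any method whose mean rank lies within CD of the top-ranked method cannot be distinguished from it at the chosen significance level, which is how the four-method equivalence group is read off.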

5.2 Accuracy/compute frontier
Fit times span four orders of magnitude. TabICLv2 averages 0.71 s per dataset; Greedy, Stacking_LR, WA, and Temp_Scaled all sit near 6.6 s, which is roughly the cost of one forward pass through the six bases plus a thin combination layer. DeepEnsemble_3seed costs 75.7 s, and Cascade_2level costs 178.5 s. Figure 1 plots the trade-off.

The Pareto frontier is dominated by TabICLv2 (cheapest competitive option) and Greedy_Selection (best accuracy at moderate cost). Cascade_2level sits on the frontier, but its marginal accuracy advantage over TabICLv2 corresponds to a 253× compute multiplier. DeepEnsemble_3seed is dominated outright: WA_performance and Stacking_LR achieve similar or better accuracy at one tenth its cost. Figure 2 (Appendix B) shows the corresponding critical-difference diagram.

5.3 Calibration and the diversity ceiling
Log-loss diverges from accuracy. The log-loss column of Table 1 tells a different story than the accuracy column. TabICLv2 has the lowest log-loss rank (4.11); Greedy_Selection (4.54) and Cascade_2level (4.55) sit close behind, both producing convex combinations of probability vectors. Stacking_LR ranks 8.13, the worst of any method tested except OrionMSPv1.5. Linear stacking still places the right class label, which is why its accuracy and ROC-AUC ranks stay competitive, but the cross-entropy objective on OOF predictions pushes the meta-learner toward sharper probability outputs than the bases produce, which degrades calibration.
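The mechanism can be illustrated with a toy binary example: sharpening probabilities leaves every argmax, and therefore accuracy, unchanged, but the confidently wrong cases are penalised much more heavily under log-loss. This sketch is not the paper's meta-learner; the sharpening power and the probabilities are invented for illustration.

```python
import math


def sharpen(p, power):
    """Raise a binary probability to a power and renormalise (power > 1 sharpens)."""
    a, b = p ** power, (1.0 - p) ** power
    return a / (a + b)


def log_loss(p_true):
    """Mean negative log of the probability assigned to the true class."""
    return -sum(math.log(p) for p in p_true) / len(p_true)


def accuracy(p_true):
    """A binary prediction is correct when the true class gets more than 0.5."""
    return sum(p > 0.5 for p in p_true) / len(p_true)


# Probability assigned to the TRUE class on four instances: three right, one wrong.
base = [0.7, 0.7, 0.7, 0.3]
sharp = [sharpen(p, 4) for p in base]

print(accuracy(base), accuracy(sharp))
print(round(log_loss(base), 3), round(log_loss(sharp), 3))
```

Here accuracy stays at 0.75 while log-loss rises from roughly 0.57 to roughly 0.88, the same qualitative pattern as a combiner whose accuracy rank is competitive but whose log-loss rank is poor.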

Calibration tracks combination strategy, not compute. Table 3 (Appendix C) reports five complementary metrics on the TabArena classification suite: ECE, Brier reliability, AURC, coverage at 95% accuracy, and worst-group accuracy. TabICLv2 sets the calibration ceiling (ECE = 0.0236, Brier-REL = 0.0024). Greedy_Selection is the only ensemble that approaches it (ECE = 0.0253), and it never optimises for calibration directly. Stacking_LR is among the worst-calibrated ensembles (ECE = 0.0272, Brier-REL = 0.0031), consistent with its log-loss rank. Temperature-Scaled Blending is equally instructive: despite per-base NLL minimisation, its ECE of 0.0273 is no better than Stacking_LR's.
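For reference, a sketch of expected calibration error using equal-width confidence bins. The bin count of 10 is a common default and an assumption here, since the text does not state the binning scheme used in Table 3.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and mean confidence over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            acc = sum(h for _, h in members) / len(members)
            ece += len(members) / n * abs(acc - avg_conf)
    return ece


# Toy case: the same 75% hit rate, reported with two different confidence levels.
hits = [1, 1, 1, 0]
ece_overconfident = expected_calibration_error([0.95] * 4, hits)
ece_calibrated = expected_calibration_error([0.75] * 4, hits)
print(ece_overconfident, ece_calibrated)
```

The overconfident model has an ECE of about 0.20 and the calibrated one an ECE of 0, even though both make the same predictions and have the same accuracy.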

Base-model diversity caps uncertainty quality. The mean pairwise Q-statistic [20] across the six TFMs is 0.961 (σ = 0.183), and Cohen's κ is 0.856, in the "almost perfect agreement" band of the conventional Landis-Koch scale. Q values close to 1 signal near-redundancy: the six bases share the ICL-on-synthetic-priors recipe and tend to fail on the same instances, so any convex combiner has little variance to absorb; Appendix D states this ceiling formally as a consensus-set bound on the ensemble-vs-base accuracy gap. The ceiling is most consequential for DeepEnsemble_3seed: three random seeds perturb context order but share the synthetic prior, producing an AURC of 0.0617 (28% above TabICLv2's 0.0483) and coverage at 95% accuracy of 62.1% versus 68.4%. The 75.7 s cost buys neither accuracy nor uncertainty improvement; for selective prediction, Greedy_Selection (AURC = 0.0484) is the appropriate choice.
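The two selective-prediction metrics can be sketched as follows, assuming the common definition of AURC as the mean selective risk over all coverage levels when instances are accepted in order of decreasing confidence. The exact variants used in the study are not given in the text, and the toy data is invented.

```python
def aurc(confidences, correct):
    """Area under the risk-coverage curve: mean error rate over growing accepted sets."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors, total = 0, 0.0
    for n_kept, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        total += errors / n_kept          # selective risk at coverage n_kept / n
    return total / len(order)


def coverage_at_accuracy(confidences, correct, target=0.95):
    """Largest fraction of instances that can be accepted while accuracy stays >= target."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    hits, best = 0, 0.0
    for n_kept, i in enumerate(order, start=1):
        hits += 1 if correct[i] else 0
        if hits / n_kept >= target:
            best = n_kept / len(order)
    return best


# Toy example: four predictions, the third most confident one is wrong.
conf = [0.9, 0.8, 0.7, 0.6]
hit = [1, 1, 0, 1]
toy_aurc = aurc(conf, hit)
toy_cov = coverage_at_accuracy(conf, hit, target=0.95)
print(toy_aurc, toy_cov)
```

Lower AURC and higher coverage both indicate that the model's confidence ranks its own errors well, which is what a selective-prediction deployment needs.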

Cascade earns its cost on group robustness. Worst-group accuracy is the one axis where heavy stacking earns its overhead. Cascade_2level reaches 0.803, on par with TabICLv2 (0.802) and outperforming all other ensembles by roughly three points (Greedy and Stacking_LR both at 0.776). The skip-connection architecture appears to implicitly down-weight base models that are systematically biased on minority subgroups; simpler convex combiners do not replicate this. The fairness margin is narrow and confined to sensitive-attribute datasets, but it is the one regime in which cascade's 253× compute overhead translates into a qualitative advantage rather than a marginal one.
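Worst-group accuracy can be sketched as the minimum per-group accuracy over a grouping variable. How groups are defined in the study (for example by a sensitive attribute) is not spelled out in this summary, so the grouping and the data below are purely illustrative.

```python
from collections import defaultdict


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (accuracy of the worst group, that group's label)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        counts[g] += 1
        hits[g] += int(t == p)
    per_group = {g: hits[g] / counts[g] for g in counts}
    worst = min(per_group, key=per_group.get)
    return per_group[worst], worst


# Toy example: overall accuracy is 4/6, but group "b" is served much worse than group "a".
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
wga, worst_group = worst_group_accuracy(y_true, y_pred, groups)
print(wga, worst_group)
```

Because the metric is a minimum rather than a mean, a combiner can match another on overall accuracy and still differ by several points here.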

Discussion & Conclusion

The shape of the result holds across every metric we report. TFMs trained with ICL on synthetic priors already approximate Bayesian model averaging at inference, so explicit downstream ensembling lands inside the noise floor of a strong single base. Cascade stacking buys the last 0.2% of accuracy by letting the meta-learner combine raw features alongside OOF predictions, but the cost-benefit ratio is poor outside competition settings. Greedy selection is the more honest default: roughly 10× the cost of the strongest single model, accuracy that is statistically indistinguishable from the heaviest stack, and no calibration regression.

Two patterns matter for downstream work. First, calibration is not a free byproduct of accuracy ensembling. Stacking with logistic regression damages probability quality despite improving accuracy rank, and uniform averaging after per-base temperature scaling is no better than the best base. Calibration-aware meta-learners optimised for log-loss directly, or post-hoc recalibration applied to the ensemble output rather than to its members, both remain open. Second, the per-dataset variance in best-base identity is what ensembling is really being asked to solve. A small gating learner trained on dataset metafeatures to pick the best TFM per dataset might match cascade at far lower compute, and is a better fit to the structure of the problem than a stack.

6.1 Limitations.
We use a single seed per dataset; statistical power comes from across-dataset variation rather than within-dataset replicates, which is what paired Wilcoxon, Friedman, and Nemenyi tests assume on N = 153 tasks. A small number of the largest tasks were dropped for individual base TFMs due to memory constraints, and Table 1 reports the intersection of tasks that completed for every method. We do not include GBDT baselines; the comparison is between single TFMs and TFM-only ensembles, leaving open whether out-of-class diversity (TFM + GBDT) recovers gains the within-class pool cannot.

6.2 Contamination.
Several base TFMs were pretrained on data overlapping with OpenML, a concern raised by the TabArena protocol [9]. Reported deltas should be read as upper bounds on within-pool ensemble effects: under cleaner contamination protocols, the ensemble-vs-best-base gap is likely smaller, not larger. Contamination does not change the qualitative finding (a near-redundant pool produces a hard ceiling), but the precise accuracy delta is best treated as inflated.

Future work. Hybrid TFM+GBDT pools, per-dataset gating learners, time-series TFMs where base-model spread may be larger, and calibration-aware meta-learners.

7 Conclusion
Six tabular foundation models trained with ICL on synthetic priors form a near-redundant pool. A Q-statistic of 0.961 caps what any convex combiner can recover, and the empirical results follow: a top equivalence group of four methods (three ensembles plus the best base) statistically indistinguishable on accuracy; a best ensemble that recovers a sub-percent accuracy gain at 253× compute; and a calibration trap when a meta-learner is asked to manufacture extra accuracy by sharpening probabilities. Greedy selection is the practical default; cascade stacking is justifiable only when worst-group accuracy is a primary target. The open question is whether out-of-class diversity (TFM + GBDT) breaks through the ceiling that within-class pools cannot.

0.961 Mean Pairwise Q-statistic: near-redundancy among TFMs. A high Q-statistic (close to 1) indicates that base models often make the same errors, limiting ensemble gains.
+0.18% Accuracy gain for best ensemble (Cascade_2level) over the strongest single TFM, at a significant computational cost.
253× Compute multiplier for Cascade_2level over the best base TFM. This highlights the high computational overhead for marginal accuracy gains.

Enterprise Process Flow: Ensemble Model Development

Base TFMs predict probabilities
OOF/Val. for meta-learning
Meta-learner/Combiner trains
Ensemble predicts

Ensemble Strategy Performance Comparison

| Strategy | Accuracy rank (lower is better) | Log-loss rank (lower is better) | Fit time (s) |
|---|---|---|---|
| Cascade_2level | 4.48 | 4.55 | 178.5 |
| Stacking_LR | 4.96 | 8.13 | 6.6 |
| Greedy_Selection | 5.28 | 4.54 | 6.7 |
| TabICLv2 (best base) | 5.39 | 4.11 | 0.7 |
| DeepEnsemble_3seed | 6.10 | 5.62 | 75.7 |

The Calibration Trap of Stacking_LR

Stacking with a logistic-regression meta-learner is competitive on accuracy and ROC-AUC ranks but has the worst log-loss rank among the ensembles (8.13). The meta-learner gains accuracy by sharpening class boundaries, which degrades the quality of the predicted probabilities (calibration). The trade-off matters for risk-sensitive applications: decisions made under uncertainty depend on calibrated probabilities, so pursuing raw accuracy can undermine the trustworthiness of the outputs. Strategies that optimise for calibration explicitly, or that apply post-hoc recalibration to the ensemble output, may be needed to mitigate this.

Your AI Implementation Roadmap

A structured approach to integrating advanced AI into your tabular data workflows, leveraging insights from current research.

Phase 01: Strategic Assessment & Data Readiness

Evaluate current tabular data challenges, identify high-impact use cases, and assess data quality and availability. Focus on understanding existing TFM performance and identifying areas where ensembling could provide value without significant calibration compromise.

Phase 02: Pilot Deployment & Ensemble Experimentation

Implement a pilot project with selected TFMs. Experiment with simpler ensemble strategies like Greedy Selection for a balanced approach to accuracy and calibration. Closely monitor calibration metrics (e.g., ECE, log-loss) alongside accuracy.

Phase 03: Performance Tuning & Calibration Enhancement

Refine ensemble configurations. If advanced stacking is pursued, integrate calibration-aware meta-learners or post-hoc recalibration techniques. Explore diversity-enhancing methods such as pre-training on diverse real-world tables, if feasible.

Phase 04: Scalable Integration & Continuous Monitoring

Integrate the optimized ensemble models into production workflows. Establish continuous monitoring for model performance drift, calibration, and potential biases in worst-group accuracy. Implement feedback loops for iterative improvement.
