Tabular Data and PGMs

Categories: news, code, analysis

Author: Bulent Soykan

Published: June 12, 2025

The Tabular Data Fallacy: Why We’ve Only Scratched the Surface of AI’s Real Goldmine

There’s a pervasive narrative in the world of AI, a comforting story we tell ourselves on Kaggle leaderboards and in startup pitch decks. It’s the story of the “solved problem,” and its main character is tabular data. With the undisputed power of libraries like XGBoost and LightGBM, and a relentless focus on benchmark accuracy, we’ve convinced ourselves that the major challenges of data in rows and columns are behind us.

This narrative is a fallacy. And it’s holding us back.

It’s true that if your goal is to squeeze another fraction of a percent of accuracy out of a perfectly clean, fully labeled dataset, then yes, the path is well-worn. But that’s not the world real businesses operate in. The real world is a chaotic landscape of incomplete records, unlabeled information, and shocking, unpredicted events.

For years, I’ve watched the AI community chase incremental gains on sanitized problems while ignoring the foundational, high-value challenges that plague every major industry. The reason for this oversight is our collective fixation on a single class of tools: discriminative models.

I’m writing this to tell you there’s a better way. The future of AI in the enterprise, the truly transformative and defensible innovations, will come from a different approach entirely. They will be built on probabilistic generative models.


The Allure and Limits of the Discriminative Path

To understand where we need to go, we must first be honest about where we are. The vast majority of machine learning applications on tabular data rely on discriminative models.

A discriminative model learns to separate data points. Its entire purpose is to find a boundary—a line or a complex, high-dimensional surface—that optimally divides one class from another. It learns the conditional probability, \(P(Y|X)\).

Think of a bank’s loan approval model. Its goal is to predict default risk.

  • The Input (X): Applicant data like credit_score, income, age, years_at_job.
  • The Output (Y): A binary label, Default or No Default.

The model takes this data and learns a function that, for any new applicant, draws a line and decides which side they fall on. It is incredibly effective at this one job. This is why Gradient Boosted Decision Trees (GBDTs) dominate—they are master boundary-finders.
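
To make that status quo concrete, here is a minimal sketch in Python, using scikit-learn’s histogram-based gradient boosting as a stand-in for XGBoost or LightGBM. The column names and the synthetic applicant data are illustrative assumptions, not a real lending dataset.

```python
# A minimal sketch of the discriminative status quo: a gradient-boosted
# classifier that learns P(Y | X) for loan default. Column names and the
# synthetic data are illustrative, not a real lending dataset.
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
X = pd.DataFrame({
    "credit_score": rng.normal(680, 60, n),
    "income": rng.lognormal(11, 0.4, n),
    "age": rng.integers(21, 70, n),
    "years_at_job": rng.exponential(4, n),
})
# Toy label: default risk rises as credit score and income fall.
logit = -0.01 * (X["credit_score"] - 680) - 1e-5 * (X["income"] - 60_000)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = HistGradientBoostingClassifier().fit(X_train, y_train)

# The model answers exactly one question: which side of the boundary does
# a new applicant fall on, and with what conditional probability?
print(model.predict_proba(X_test[:5]))
```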

But this singular focus is also their greatest weakness. A discriminative model is like a student who has crammed for a multiple-choice test. They can pick the right answer from a list of options with remarkable accuracy, but they lack any foundational knowledge of the subject. They don’t know why an answer is correct, only that it matches the patterns they memorized.

This leads to critical failures when faced with real-world messiness:

  • It sees a blank on the test (missing data) and panics, forcing you to guess the answer for it.
  • It cannot learn from the textbook (unlabeled data), only from the pre-answered practice tests.
  • It has no way of knowing if a question is from a completely different subject (Out-of-Distribution data) and will confidently provide a nonsensical answer.

This isn’t a solid foundation for building robust, intelligent systems. It’s a house of cards.

The Generative Leap—From Prediction to True Understanding

A probabilistic generative model takes a radically different approach. It doesn’t just learn the boundary between classes; it learns the inherent structure of the data itself. Its goal is to learn the full joint probability distribution, \(P(X, Y)\).
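
The two objectives are related by the product rule: a model of the joint distribution already contains the conditional that a discriminative model targets, plus a model of the inputs themselves,

\[
P(X, Y) = P(Y \mid X)\,P(X), \qquad
P(Y \mid X) = \frac{P(X, Y)}{\sum_{y} P(X, y)}.
\]

That extra factor, \(P(X)\), is precisely what powers the imputation, semi-supervised learning, and out-of-distribution detection discussed below.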

Let’s return to our bank example. A generative model wouldn’t just learn how to separate defaulters from non-defaulters. It would learn the “story” of the entire applicant pool. It would understand the complex interplay between all the variables: how income relates to age, how that combination relates to the requested loan amount, and how all those factors together define a “typical” applicant.

Because it learns the distribution of the data, it can generate new, plausible data points. It can create a profile of a synthetic-but-realistic applicant. This creative ability is the key that unlocks solutions to previously intractable problems.

The family of generative models is diverse and powerful:

  • Variational Autoencoders (VAEs): Think of a VAE as a master artist and forger. The “encoder” part of the network looks at a real applicant’s profile and creates a compressed, abstract sketch of them (in what we call the latent space). The “decoder” part is trained to take that sketch and perfectly reconstruct the original profile. By mastering this process of sketching and reconstructing, the decoder becomes a generator. We can give it new, random sketches, and it will create entirely new, realistic applicant profiles (a minimal code sketch of this idea follows this list).
  • Generative Adversarial Networks (GANs): This is the famous model with two dueling neural networks. A Generator creates fake applicant profiles from scratch. A Discriminator (a discriminative model!) acts as a detective, trying to tell the difference between the real applicants and the fakes. The two are locked in an escalating arms race, with the Generator becoming an incredibly sophisticated forger, capable of producing synthetic data that is indistinguishable from reality.
  • Flow-based Models: Imagine starting with a simple block of marble (a simple, known probability distribution like a Gaussian). A flow-based model is like a master sculptor who applies a series of precise, reversible chisels and cuts (invertible transformations) to shape the block into a complex statue (the distribution of your applicant data). Because every step is reversible, you can not only create a statue from the block but also calculate the exact sequence of steps to turn the statue back into the block, giving you a precise mathematical grasp of the data’s probability.
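
To make the VAE analogy concrete, here is a minimal sketch of a tabular VAE in PyTorch. The layer sizes, the Gaussian reconstruction term, and the random training matrix are simplifying assumptions; a real tabular VAE would also need to handle categorical columns, scaling, and missingness.

```python
# Minimal tabular VAE sketch (PyTorch). The encoder maps a row to a latent
# "sketch" (mean and log-variance); the decoder reconstructs the row.
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()                 # Gaussian reconstruction
    kl = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(dim=1).mean()
    return recon + kl

# Toy training loop on standardized synthetic applicant rows.
x = torch.randn(1024, 6)
model = TabularVAE(n_features=6)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    x_hat, mu, logvar = model(x)
    loss = vae_loss(x, x_hat, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()

# Generation: hand the decoder new, random latent "sketches".
with torch.no_grad():
    print(model.decoder(torch.randn(5, 4)))
```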

Solving the “Impossible” Problems—Generative Models in Action

This all sounds great in theory. But let’s look at how this directly solves the high-value problems that leave purely discriminative models helpless.

Conquering Missing Features

  • The Status Quo: An applicant for a loan, a 25-year-old software engineer, leaves the years_at_current_job field blank. The standard approach is crude: impute the column’s average, which might be 5.2 years. That single blanket guess ignores everything else we know about the applicant and can leave the profile internally inconsistent.
  • The Generative Approach: A trained generative model looks at the rest of the applicant’s data: age: 25, profession: software_engineer, education: master's_degree. Having learned the joint distribution of tens of thousands of applicants, it understands the strong correlation between these features. It performs conditional sampling. It essentially asks, “For the universe of 25-year-old software engineers with a Master’s degree in my dataset, what is the probability distribution of their ‘years at current job’?” The answer is likely a distribution heavily skewed towards 1-3 years. The model then samples from this specific, conditional distribution to fill in the blank. It doesn’t guess; it makes a highly informed, contextually aware inference.
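
Here is a minimal sketch of that conditional-sampling step. A Gaussian mixture plays the role of the generative model, a deliberately simple stand-in for a VAE or flow, and the three features and synthetic training data are illustrative assumptions.

```python
# Impute a missing feature by sampling from P(missing | observed) under a
# Gaussian mixture fit to complete rows. Columns (illustrative):
# 0 = age, 1 = years_at_current_job, 2 = income.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 5_000

# Toy "complete" training data with correlated age, tenure, and income.
age = rng.uniform(22, 60, n)
tenure = np.clip(rng.normal(0.3 * (age - 22), 2.0), 0, None)
income = 25_000 + 1_000 * age + 2_000 * tenure + rng.normal(0, 5_000, n)
X = np.column_stack([age, tenure, income])

gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0).fit(X)

def sample_missing(x_obs, obs_idx, mis_idx, n_samples=200):
    """Draw samples of the missing features conditioned on the observed ones."""
    cond_means, cond_covs, log_w = [], [], []
    for pi, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        S_oo = cov[np.ix_(obs_idx, obs_idx)]
        S_mo = cov[np.ix_(mis_idx, obs_idx)]
        S_mm = cov[np.ix_(mis_idx, mis_idx)]
        K = S_mo @ np.linalg.inv(S_oo)
        cond_means.append(mu[mis_idx] + K @ (x_obs - mu[obs_idx]))
        cond_covs.append(S_mm - K @ S_mo.T)
        log_w.append(np.log(pi) + multivariate_normal.logpdf(x_obs, mu[obs_idx], S_oo))
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                  # per-component responsibilities
    comps = rng.choice(len(w), size=n_samples, p=w)
    return np.array([rng.multivariate_normal(cond_means[k], cond_covs[k])
                     for k in comps])

# 25-year-old earning 55k with years_at_current_job left blank.
draws = sample_missing(np.array([25.0, 55_000.0]), obs_idx=[0, 2], mis_idx=[1])
print("plausible tenure values:", np.percentile(draws, [5, 50, 95]))
```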

The Economics of Unlabeled Data

  • The Pain Point: A hospital wants to build an AI model to predict the risk of sepsis, a life-threatening condition. They have electronic health records for 2 million patients, a treasure trove of unlabeled data (vitals, lab results, etc.). However, getting expert doctors to retrospectively review each case and apply a definitive Sepsis or No Sepsis label is prohibitively expensive. They can only afford to label 5,000 records.
  • The Generative Solution (Semi-Supervised Learning): This is where generative models create immense economic value.
    1. Unsupervised Pre-training: We first train a generative model (like a VAE) on all 2 million unlabeled patient records. The model’s task is simply to understand the data’s structure—to learn what a “normal” patient’s trajectory looks like, what common patterns in lab results are, and how different vitals relate to each other. It builds a rich, internal representation of human physiology as captured in the data.
    2. Supervised Fine-tuning: Now, we take the 5,000 labeled records. We use this small, precious dataset to fine-tune the pre-trained model. Because the model has already done most of the heavy lifting by learning the data’s landscape, the labeled data simply acts as a guide, putting names to the regions the model has already discovered. The result can be a highly accurate sepsis prediction model that approaches the performance of one trained on far more labels, at a small fraction of the labeling cost.
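
Here is a minimal sketch of the two-step recipe above. A Gaussian mixture stands in for VAE pre-training, its soft cluster memberships serve as the learned representation, and the synthetic patient data and sizes are illustrative assumptions.

```python
# Semi-supervised sketch: learn structure from ALL rows with a generative
# model (a Gaussian mixture here, standing in for VAE pre-training), then
# fit a classifier on the few labeled rows using that learned structure.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_labeled, n_total = 500, 100_000

# Toy patient vitals drawn from two latent physiological regimes.
septic = rng.random(n_total) < 0.1
X = rng.normal(0, 1, (n_total, 8)) + 2.0 * septic[:, None]
y = septic.astype(int)  # ground truth, revealed only for a small subset

# 1) Unsupervised pre-training on every row (labels never touched).
gmm = GaussianMixture(n_components=8, random_state=0).fit(X)
Z = gmm.predict_proba(X)  # soft "which region of the data am I in" features

# 2) Supervised fine-tuning on the small labeled subset only.
lab = slice(0, n_labeled)
clf = LogisticRegression(max_iter=1000).fit(
    np.hstack([X[lab], Z[lab]]), y[lab]
)

# Check on rows whose labels the classifier never saw.
rest = slice(n_labeled, n_labeled + 20_000)
print("accuracy:", clf.score(np.hstack([X[rest], Z[rest]]), y[rest]))
```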

The Ultimate Safety Net—Out-of-Distribution (OOD) Detection

  • The Silent Failure: A hedge fund trains a trading algorithm on market data from 2010-2019. The model performs brilliantly in back-testing. In 2020, the COVID-19 pandemic creates unprecedented market volatility. The discriminative model, only knowing how to classify patterns it has seen before, continues to operate with high confidence, misinterpreting the new reality and leading to catastrophic losses. It has no mechanism for recognizing that the fundamental rules of the game have changed.
  • The Generative Alarm Bell: A generative model is also trained on the 2010-2019 data. It doesn’t just learn trading signals; it learns the probability distribution of a “normal” market day. When the pandemic-era data starts streaming in, the model calculates the likelihood of this new data under its learned distribution. The probability is infinitesimally small. It immediately flags this data as Out-of-Distribution. It essentially raises a giant red flag and says: “EMERGENCY. The world I was trained on no longer exists. My predictions are unreliable. Human intervention is required.” It knows what it doesn’t know, which is arguably the most critical feature of any AI system deployed in a high-stakes, dynamic environment.
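
Here is a minimal sketch of that alarm bell: fit a density model to the training-era feature rows, record the log-likelihoods it assigns to data it has seen, and flag new rows that fall far below that range. The Gaussian mixture, the 1st-percentile threshold, and the synthetic regime shift are all illustrative choices.

```python
# OOD alarm sketch: score new rows by their log-likelihood under a density
# model fit on historical data; abnormally low scores raise the flag.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Features summarizing a "normal" market day (returns, volatility, volume...).
X_train = rng.normal(0, 1, (20_000, 5))        # 2010-2019 regime (toy)
X_shift = rng.normal(4, 6, (1_000, 5))         # pandemic-era regime (toy)

density = GaussianMixture(n_components=10, random_state=0).fit(X_train)

# Calibrate: the 1st percentile of training log-likelihoods is the alarm line.
threshold = np.percentile(density.score_samples(X_train), 1)

def is_ood(rows: np.ndarray) -> np.ndarray:
    """True where a row is far less likely than almost anything seen in training."""
    return density.score_samples(rows) < threshold

print("flag rate on familiar data:    ", is_ood(X_train).mean())  # ~0.01 by design
print("flag rate on regime-shift data:", is_ood(X_shift).mean())  # close to 1.0
```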

A Call for True Innovation

For too long, we have defined “progress” in AI by climbing a few points up a leaderboard on a static dataset. This is not innovation; it’s optimization.

The real opportunities—the massive markets—lie in solving the foundational problems that every enterprise faces. They are in building systems that are robust to the chaos of reality, that can learn from all available data, not just the perfectly curated bits, and that are smart enough to know when they are out of their depth.

These are generative problems. They require a shift in our thinking, away from simply drawing boundaries and towards the ambitious goal of deep, probabilistic understanding. To the founders, the data scientists, and the investors, I say: look past the low-hanging fruit. The most valuable work is yet to be done, in the rich, untapped frontier of tabular data.