lalaithion:

dataandphilosophy:

Please someone tell me why I’m wrong.

It’s possible to match a data set optimally with one parameter.

Model: y=sin(bx), with the data scaled so that all values fall between 0 and 1 exclusive. The difficulty of hitting every point rises with the number of data points, but that just means you need bigger values of b. The “model” will look like an almost fully filled space, with a sine curve oscillating so fast it looks like a series of vertical lines. Yet it hits every single point when possible (because in a large enough option space, I can do that), or the exact midpoint when not. Plausibly a 100% perfect fit is impossible in many cases, but a sufficiently close approximation probably isn’t.
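
To see the trick concretely, here is a minimal sketch (toy data and an assumed brute-force scan over b; none of this is from the original post) showing the training error shrinking as the search range for b grows:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1.0, 2.0, size=6))   # a handful of training inputs
y = rng.uniform(0.0, 1.0, size=6)            # arbitrary targets in (0, 1)

def model(b, x):
    # one free parameter b; output rescaled from sin's [-1, 1] to [0, 1]
    return (np.sin(np.multiply.outer(b, x)) + 1.0) / 2.0

# brute-force scan over ever-larger frequencies, in chunks to bound memory
best_b, best_err = 0.0, np.inf
for chunk in np.array_split(np.linspace(0.0, 1e5, 2_000_000), 100):
    errs = np.mean((model(chunk, x) - y) ** 2, axis=1)
    i = int(np.argmin(errs))
    if errs[i] < best_err:
        best_b, best_err = float(chunk[i]), float(errs[i])

print(best_b, best_err)  # widening the scan range keeps shrinking the error
```

Widen the scan (or refine the grid) and the fit keeps improving: that single parameter is doing the work of arbitrarily many.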

If this understanding of overfitting sine waves is correct, doesn’t that suggest a flaw in how we penalize complexity in model-fitting?

You’re right that a sine wave can hit every point with a single, sufficiently large parameter.

The second part of this post confuses me, though. Examples like this are exactly why we try not to use ‘number of parameters to tune’ as a measure of a model’s complexity. Sometimes parameter count is a decent measure, and it’s very simple, but it usually isn’t very good, especially because the same model can have different numbers of parameters depending on how it’s written (e.g., y = a·b·x has two parameters but is the same model as y = cx, which has one).

If I were fitting a sine wave to a dataset, I would probably add a penalization term to the error function proportional to b, or maybe to b^2.
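
A sketch of that idea, reusing the toy setup above (the penalty weight lam is an assumed knob, not anything from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1.0, 2.0, size=6))
y = rng.uniform(0.0, 1.0, size=6)

def model(b, x):
    return (np.sin(np.multiply.outer(b, x)) + 1.0) / 2.0

lam = 1e-8                               # assumed penalty weight
bs = np.linspace(0.0, 1e5, 200_000)
mse = np.mean((model(bs, x) - y) ** 2, axis=1)
penalized = mse + lam * bs**2            # training error plus a b^2 complexity term

print(bs[np.argmin(mse)])        # typically a huge b: the overfit solution
print(bs[np.argmin(penalized)])  # the penalty pulls b back toward low frequencies
```

The choice of lam is doing real work here: too small and the fast-oscillating solution still wins, too large and legitimate high-frequency structure gets suppressed.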

OK, so you say that “we try not to use ‘number of parameters to tune’ as a measure of complexity in a model.”

I was taught AIC and BIC in my econ undergrad, which was fairly theory-heavy; no other ways of penalizing overfitting were covered.

@identicaltomyself agrees that this is terrible, but continues to use AIC because what else are you going to do?

Please, teach me the forbidden tuning methods! I am looking for something that adequately penalizes overly complex models, because, as this example makes clear, complexity is not just about the number of parameters.
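
For reference, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the parameter count, n the sample size, and L the maximized likelihood; for a least-squares fit with Gaussian errors, -2 ln(L) is n ln(RSS/n) up to a constant. A minimal sketch of why parameter-count criteria reward the sine trick (the RSS values below are invented for illustration):

```python
import numpy as np

def aic_bic(rss, n, k):
    """AIC and BIC for a least-squares fit with Gaussian errors."""
    ll2 = n * np.log(rss / n)   # -2 * max log-likelihood, up to a constant
    return 2 * k + ll2, k * np.log(n) + ll2

# hypothetical numbers: a near-perfect 1-parameter sine fit vs. a looser
# 3-parameter polynomial fit on the same 50 points
print(aic_bic(rss=0.001, n=50, k=1))  # far lower AIC/BIC: the criteria prefer it
print(aic_bic(rss=0.800, n=50, k=3))
```

Because the sine model spends only one parameter, AIC and BIC both reward it, even though it hides enormous complexity in the magnitude of b, which is exactly the complaint above.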

Notes

  1. identicaltomyself reblogged this from togglesbloggle and added:
    I don’t think the number of prime factors is a good measure of the complexity of a rational number. Every rational...
  2. lalaithion reblogged this from dataandphilosophy and added:
    Epistemic status: Not an expert, but I’ve talked with experts about this sort of thing.When I took Machine Learning, my...
  3. jadagul reblogged this from dataandphilosophy and added:
    The fundamental answer to your question is that it is not possible to do statistics before you have selected a model. It...
  4. spiralingintocontrol reblogged this from dataandphilosophy and added:
    so the usual way this gets solved in the Machine Learning side of academia is something called “regularization,” which...
  5. thirqual reblogged this from dataandphilosophy and added:
    1) There is nothing about intuition in what I wrote. 2) Consider the possibility that you do not understand modeling and...
  6. nostalgebraist said: your mental model of overfitting shouldn’t be about # of parameters, it should be about generalization error. criteria based on # of parameters are just proxies for generalization error, and this kind of idea may show why they are not always good proxies.
  7. nostalgebraist said: this model performs terribly on any (x,y) pair you hold out of the training data when you fit it. so just hold out some values, and see how it does. (this is already the way overfitting is typically penalized in practice.)
  8. nostalgebraist said: anyway, even if this works, you can catch it easily with cross-validation (or just validation, period) [a sketch of this appears after these notes]
  9. nostalgebraist said: i remember thinking about this exact case a while ago. i think i did some math about it, like finding the derivative of the loss wrt b, and concluded that you can’t actually get an arbitrarily good fit.
  10. togglesbloggle reblogged this from dataandphilosophy and added:
    In practice, how often do your models diverge from the (polynomial) + epsilon structure? Like, in comparing two...
  11. anthropicprincipal reblogged this from dataandphilosophy and added:
    In the Bayesian paradigm, your model is a prior assumption in and of itself, and then you also have some prior on b....
  12. dataandphilosophy reblogged this from thirqual and added:
    I am looking for something better than “intuitively, these models seem more plausible.” AIC promises, but doesn’t...
  13. spiralingintocontrol said: just penalize the L1 norm of all the parameters? that’s what we learned in my convex optimization class
  14. spiralingintocontrol answered: sine waves are bullshit
  15. slythernim answered: I feel like the problem here is an insufficiently useful definition of the word “optimally.” The procedure you suggest is descriptive, but not predictive; it has no modelling value.
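
To make nostalgebraist’s validation point concrete, here is a sketch in the same toy setup as above: fit b on half the points and score it on the held-out half (the split and search range are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1.0, 2.0, size=12))
y = rng.uniform(0.0, 1.0, size=12)
x_tr, y_tr = x[::2], y[::2]    # fit on every other point...
x_te, y_te = x[1::2], y[1::2]  # ...and hold out the rest

def model(b, x):
    return (np.sin(np.multiply.outer(b, x)) + 1.0) / 2.0

# choose b by minimizing training error alone
bs = np.linspace(0.0, 1e5, 500_000)
train_err = np.mean((model(bs, x_tr) - y_tr) ** 2, axis=1)
b_hat = float(bs[np.argmin(train_err)])

print(train_err.min())                            # tiny: b_hat nails the training points
print(np.mean((model(b_hat, x_te) - y_te) ** 2))  # large: it flails on held-out points
```

The held-out error lands roughly where random guessing would, which is the generalization failure that a parameter count alone never sees.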