To overfit, or not to overfit, that's the question

up vote
1
down vote

I hope this is not a stupid question. Let us say I have a data-generating process that is quite stationary, and I care less about arriving at generalizable knowledge than about accurate predictions. Would it be acceptable in this scenario to overfit a powerful model (e.g. a random forest, i.e. a fully saturated-ish model) by refreshing it daily on all retrospective data and using it to predict the next day's dependent variable?

edited Nov 14 at 17:20 by Penguin_Knight

asked Nov 14 at 17:03 by cs0815

You can do that, but of what value are overfitted and potentially false predictions?
– Todd D
Nov 14 at 17:17

But the process is fairly stationary, so new data should not be unexpected and thus should not lead to massively 'false' predictions ...
– cs0815
Nov 14 at 17:20

Related stats.stackexchange.com/q/249493/35989
– Tim♦
Nov 14 at 17:58

@cs0815 you have accepted a rather ordinary and simple answer very quickly. I posted it more as a temporary answer, in a process where I was hoping that you were going to give some more information about your question. What is the deal with the 'refreshing it daily'? That would be essential to make this question not a duplicate with just a fancy title.
– Martijn Weterings
Nov 14 at 17:58

Frequency of model updates and overfitting are separate concerns. If the model doesn't overfit, then it can benefit from consuming new data frequently, provided that the new data contains information and not only noise. Overfitting is fitting to the noise, and if you somehow prevent it, then you'll be fitting to the new daily information, which is good.
– Aksakal
Nov 14 at 21:24


regression multiple-regression modeling model


3 Answers

up vote
3
down vote

Ultimately it is a balance that you need to test (e.g. with cross-validation).

If you are too conservative then you won't capture the model and the predictions will be bad.

If you are too liberal then you will capture too much of the noise (aside from the model) and the predictions will be bad.

It can be that a slightly more conservative model than the 'real' model works better (e.g. the true model is a polynomial of order 5 and the optimal model to fit it is of order 4), but this depends entirely on the specific circumstances and needs to be tested on a case-by-case basis. In general, however, it is better to add a little bias (done correctly, it will reduce the variability).
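As a rough illustration of "testing the balance", here is a minimal sketch that uses 5-fold cross-validation to pick a complexity setting. Everything in it is an assumption made for illustration: the data are synthetic (a sine curve plus noise), and a k-nearest-neighbour average stands in for the polynomial, with small `k` playing the "too liberal" role and large `k` the "too conservative" one.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cv_mse(data, k, folds=5):
    """Mean squared error of k-NN estimated by `folds`-fold cross-validation."""
    total = 0.0
    for f in range(folds):
        held_out = data[f::folds]  # every folds-th point, starting at f
        train = [p for i, p in enumerate(data) if i % folds != f]
        total += sum((knn_predict(train, x, k) - y) ** 2 for x, y in held_out)
    return total / len(data)  # each point is held out exactly once

# Synthetic data (an assumption): y = sin(x) + Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 6.0)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.5)))

scores = {k: cv_mse(data, k) for k in (1, 3, 10, 30, 100)}
best_k = min(scores, key=scores.get)
for k, mse in scores.items():
    print(f"k={k:3d}  CV MSE={mse:.3f}")
print("selected k:", best_k)
```

With noisy targets, `k=1` reproduces the noise of its single nearest neighbour, so its cross-validated error should come out clearly worse than that of a moderately smoothed fit.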

In case your question is about adding new data to the data that you used to train your model, then I would guess that this is rarely going to be a problem. In most cases adding more data should make the model better, unless the model fit is such that it will not improve with more data (e.g. when the model is not constant in time, but then the predictions are not going to be good anyway).
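The daily refresh described in the question can be checked honestly with a walk-forward loop: each day the model sees only the past and is scored on the next day. The sketch below assumes a synthetic stationary process and again uses a k-nearest-neighbour average as a stand-in model (for k-NN, "refitting" just means storing the enlarged history); the names `walk_forward` and `days` are made up for this example.

```python
import math
import random

def knn_predict(history, x, k):
    """Average the targets of the k past points closest to x."""
    nearest = sorted(history, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def walk_forward(days, k):
    """For each day t >= 1, fit on days 0..t-1 only and predict day t."""
    log = []
    for t in range(1, len(days)):
        history = days[:t]   # everything observed so far, nothing later
        x_t, y_t = days[t]   # the next day's point, unseen at fit time
        pred = knn_predict(history, x_t, k)
        log.append((t, len(history), (pred - y_t) ** 2))
    return log

# Synthetic stationary process (an assumption): y = sin(x) + noise.
rng = random.Random(1)
days = []
for _ in range(250):
    x = rng.uniform(0.0, 6.0)
    days.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))

for k in (1, 10):
    log = walk_forward(days, k)
    mse = sum(err for _, _, err in log) / len(log)
    print(f"k={k:2d}  out-of-sample MSE over {len(log)} daily refits: {mse:.3f}")
```

Because every squared error in `log` comes from a point the model had not yet seen, comparing the two averages is a direct test of whether the more saturated setting really predicts better on this process.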

edited Nov 14 at 17:45

answered Nov 14 at 17:18 by Martijn Weterings

Thanks. I think I will still use CV to optimize hyperparameters, but refresh daily with all data.
– cs0815
Nov 14 at 17:21

Is your question about updating or about overfitting?
– Martijn Weterings
Nov 14 at 17:23


up vote
3
down vote

We say that a model overfits when it has good performance on the training data, but not on unseen data. It is not a statement about the data-generating process, but about the sample that you use for training versus any other sample that could be drawn. So if a model has good predictive performance on unseen data, it does not overfit.
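That definition can be made concrete by measuring the gap between training error and error on a fresh sample from the same process. The following is only a sketch on synthetic data (sine plus noise, an assumption), with a 1-nearest-neighbour model as the memorizer and a 15-neighbour average as a smoother alternative.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, points, k):
    """Mean squared error of k-NN (fitted on `train`) evaluated on `points`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

def sample(rng, n, noise=0.5):
    """Draw n points from the same synthetic process: y = sin(x) + noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 6.0)
        points.append((x, math.sin(x) + rng.gauss(0.0, noise)))
    return points

rng = random.Random(2)
train = sample(rng, 300)
unseen = sample(rng, 1000)  # same process, different draw

results = {k: (mse(train, train, k), mse(train, unseen, k)) for k in (1, 15)}
gap = {k: test - fit for k, (fit, test) in results.items()}
for k, (fit, test) in results.items():
    print(f"k={k:2d}  train MSE={fit:.3f}  unseen MSE={test:.3f}  gap={gap[k]:.3f}")
```

The 1-nearest-neighbour model reproduces its training targets exactly, so its training error is zero by construction; whatever error it shows on the unseen sample is the overfitting gap, even though both samples come from the same stationary process.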

Overfitting would not be a problem if you didn't want to make predictions on unseen data and didn't want to draw any conclusions about it from the model. You are right that if you could be perfectly sure that the future data would be identical to your training sample, then it wouldn't matter, but I can't imagine any scenario where you could be sure about that. Notice that even if you had a perfectly representative sample, or population data, the phenomenon of interest could still change over time, so that the past data would no longer be relevant.

See also the thread "Which model is better: One that overfits or one that underfits?".

answered Nov 14 at 18:24 by Tim♦

Thanks. Sorry, I would disagree a bit. If the training sample is representative of the data-generating process and the unseen data are as well, then memorizing data (i.e. over-fitting) should not be a major issue. I guess the more dimensions there are, the more representative samples there have to be ... I also said that I refit the model regularly, so even a change in the data-generating process should be picked up?
– cs0815
Nov 14 at 20:15

@cs0815 If you re-fit the model, you seem to be assuming that the data can change over time, don't you? If so, then inevitably, every time, you train the model on historical data to predict the future. So something could have changed. If that's not the case, don't re-fit your model; train it once and don't monitor the performance, as you're wasting your time.
– Tim♦
Nov 14 at 20:24


up vote
-1
down vote

Overfitting is bad, because it means the model you learned from your training data may not work well for new data points. You can imagine a perfectly overfit model that simply memorizes each training point and returns the appropriate output. When confronted with data that it wasn't trained on, it outputs a random number. You could train a model like this on a ton of retrospective data, but unless you get identical data tomorrow, you'll do no better than random. I suppose an approach like this could work with a limited and discrete input space, but you don't really need machine learning models for that anyway.
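To make that thought experiment concrete, here is a toy version of the memorizing model. It is purely illustrative: `LookupModel`, the linear `target` function and the sample sizes are all invented for this sketch, and a real random forest does not behave this way on unseen inputs.

```python
import random

class LookupModel:
    """Memorizes training pairs; guesses at random for inputs it never saw."""

    def __init__(self, rng):
        self.rng = rng
        self.table = {}

    def fit(self, xs, ys):
        sums = {}
        for x, y in zip(xs, ys):
            total, count = sums.get(x, (0.0, 0))
            sums[x] = (total + y, count + 1)
        # Store the mean target for every distinct input seen in training.
        self.table = {x: total / count for x, (total, count) in sums.items()}
        self.low, self.high = min(ys), max(ys)
        return self

    def predict(self, x):
        if x in self.table:
            return self.table[x]
        return self.rng.uniform(self.low, self.high)  # unseen input: random guess

def hit_rate(model, xs):
    """Fraction of inputs that the model has memorized."""
    return sum(x in model.table for x in xs) / len(xs)

def target(x, rng):
    """Hypothetical process: a linear signal plus Gaussian noise."""
    return 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)

rng = random.Random(3)

# Continuous inputs: tomorrow's x essentially never repeats a past x.
cont_train = [rng.uniform(0.0, 10.0) for _ in range(200)]
cont_test = [rng.uniform(0.0, 10.0) for _ in range(200)]
cont_model = LookupModel(rng).fit(cont_train, [target(x, rng) for x in cont_train])

# Limited, discrete inputs: every test x was already seen in training.
disc_train = [rng.randrange(10) for _ in range(200)]
disc_test = [rng.randrange(10) for _ in range(200)]
disc_model = LookupModel(rng).fit(disc_train, [target(x, rng) for x in disc_train])

print("continuous hit rate:", hit_rate(cont_model, cont_test))
print("discrete hit rate:  ", hit_rate(disc_model, disc_test))
print("prediction for memorized x=4:", round(disc_model.predict(4), 2))
```

On the continuous inputs the table is never hit, so every prediction is a random guess; on the small discrete input space the table covers every case, which is the exception mentioned at the end of the answer.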

answered Nov 14 at 21:01 by Nuclear Wang



