Understand your data statistically before developing your model - Chapter I

Inspired by “Understanding 8 types of Cross-Validation” by Satyam Kumar

Cross-Validation (CV) is a critical technique for evaluating machine learning models. However, it must be applied correctly to your own data: you need to check and understand your data statistically before developing a model on it.

There are more than eight Cross-Validation variants you may use when developing your model, and which one to choose depends largely on your data. At a minimum, check (1) the sample size: is it small or large? (2) the class balance: are the classes balanced or not? (3) whether the data is a time series.
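These three checks can be scripted in a few lines. Below is a minimal Python sketch; the function name and the numeric cut-offs (`small_threshold`, `imbalance_ratio`) are my own illustrative choices, not values from the article:

```python
from collections import Counter


def describe_labels(y, small_threshold=1000, imbalance_ratio=3.0):
    """Quick statistical check of a label vector before choosing a CV scheme.

    The thresholds are illustrative assumptions; pick ones that suit
    your domain. Whether the data is a time series must be judged from
    how it was collected, not from the labels alone.
    """
    n = len(y)
    counts = Counter(y)
    majority = max(counts.values())
    minority = min(counts.values())
    return {
        "n_samples": n,
        "small_sample": n < small_threshold,
        "class_counts": dict(counts),
        "imbalanced": majority / minority > imbalance_ratio,
    }


# 100 samples with a 9:1 class ratio: flags both a small sample
# and class imbalance.
report = describe_labels([0] * 90 + [1] * 10)
```

If `imbalanced` comes back true, stratified CV or resampling (discussed below) is worth considering; if the data is a time series, most of the shuffled schemes are off the table.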

The following 8 types of CV are explained in "Understanding 8 types of Cross-Validation" by Satyam Kumar.

  1. Leave p out cross-validation
  2. Leave one out cross-validation
  3. Holdout cross-validation
  4. Repeated random subsampling validation
  5. k-fold cross-validation
  6. Stratified k-fold cross-validation
  7. Time Series cross-validation
  8. Nested cross-validation
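To make the most widely used item on this list concrete, here is a minimal pure-Python sketch of k-fold index generation. It mirrors what scikit-learn's `KFold` produces with `shuffle=False`; the helper name is mine:

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k folds.

    Each fold serves once as the test set while the remaining
    indices form the training set. Fold sizes differ by at most one
    when n_samples is not divisible by k.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


# 10 samples, 5 folds: each test fold holds 2 indices, and together
# the test folds cover every index exactly once.
folds = k_fold_indices(10, 5)
```

Stratified k-fold follows the same pattern but builds each fold to preserve the overall class proportions, which is why it is preferred for imbalanced data.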

You may find the pros and cons of each one in that article. Here, I just summarize my key ideas:

  1. Make sure your data is balanced and is not a time series. A fast and safe way to address imbalance is to up-sample or down-sample your data. After balancing the data, we can easily apply Nested Cross-Validation (why? check my previous blog).
  2. If it is time-series data, there are few CV options, and you should use Time Series cross-validation.
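Both points above can be sketched in code: naive random over-sampling of the minority class for point 1, and an expanding-window split for point 2 (analogous in spirit to scikit-learn's `TimeSeriesSplit`, which trains only on past data and tests on the future). Both helpers and their names are illustrative assumptions, not from the article:

```python
import random


def upsample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until every class matches
    the majority class size (point 1). A sketch only; libraries such as
    imbalanced-learn offer more robust resampling strategies.
    """
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out


def time_series_splits(n_samples, n_splits):
    """Expanding-window splits (point 2): the training window always
    precedes the test window, so no future data leaks into training.
    """
    test_size = n_samples // (n_splits + 1)
    remainder = n_samples % (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = test_size * i + remainder
        yield (list(range(train_end)),
               list(range(train_end, train_end + test_size)))


# Over-sampling: 9-vs-1 labels become 9-vs-9 after up-sampling.
X_bal, y_bal = upsample_minority([[1]] * 9 + [[2]], [0] * 9 + [1])

# Time-series CV on 10 points with 3 splits: training windows grow,
# and every training index precedes every test index.
splits = list(time_series_splits(10, 3))
```

Note that resampling should be done inside each training fold (not on the full dataset before splitting), otherwise duplicated rows can leak into the test fold and inflate your scores.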

I will keep this topic updated.