Understand your data statistically before developing your model - Chapter I

Inspired by “Understanding 8 types of Cross-Validation” by Satyam Kumar

Cross-Validation (CV) is a critical technique for evaluating machine learning models. However, it must be applied correctly to your own data: you need to check and understand your data statistically before developing a model on it.

There are more than eight Cross-Validation variants you may use when developing your model, and which one to choose depends largely on your data. At a minimum, check (1) the sample size: is it small or large? (2) the class balance: are the classes balanced or not? (3) whether the data is a time series.
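These three checks can be scripted in a few lines. Below is a minimal Python sketch; the function name and the numeric cut-offs (`small_threshold`, `imbalance_ratio`) are my own illustrative choices, not values from the article:

```python
from collections import Counter


def describe_labels(y, small_threshold=1000, imbalance_ratio=3.0):
    """Quick statistical check of a label vector before choosing a CV scheme.

    The thresholds are illustrative assumptions; pick ones that suit
    your domain. Whether the data is a time series must be judged from
    how it was collected, not from the labels alone.
    """
    n = len(y)
    counts = Counter(y)
    majority = max(counts.values())
    minority = min(counts.values())
    return {
        "n_samples": n,
        "small_sample": n < small_threshold,
        "class_counts": dict(counts),
        "imbalanced": majority / minority > imbalance_ratio,
    }


# 100 samples with a 9:1 class ratio: flags both a small sample
# and class imbalance.
report = describe_labels([0] * 90 + [1] * 10)
```

If `imbalanced` comes back true, stratified CV or resampling (discussed below) is worth considering; if the data is a time series, most of the shuffled schemes are off the table.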

The following 8 types of CV are explained in "Understanding 8 types of Cross-Validation" by Satyam Kumar.

  1. Leave p out cross-validation
  2. Leave one out cross-validation
  3. Holdout cross-validation
  4. Repeated random subsampling validation
  5. k-fold cross-validation
  6. Stratified k-fold cross-validation
  7. Time Series cross-validation
  8. Nested cross-validation
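To make the most widely used item on this list concrete, here is a minimal pure-Python sketch of k-fold index generation. It mirrors what scikit-learn's `KFold` produces with `shuffle=False`; the helper name is mine:

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k folds.

    Each fold serves once as the test set while the remaining
    indices form the training set. Fold sizes differ by at most one
    when n_samples is not divisible by k.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


# 10 samples, 5 folds: each test fold holds 2 indices, and together
# the test folds cover every index exactly once.
folds = k_fold_indices(10, 5)
```

Stratified k-fold follows the same pattern but builds each fold to preserve the overall class proportions, which is why it is preferred for imbalanced data.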

You may find the pros and cons of each one in that article. Here, I just summarize my key ideas:

  1. Make sure your data is balanced and is not a time series. A fast and safe way to address imbalance is to up-sample or down-sample your data. After balancing the data, we can easily apply Nested Cross-Validation (why? check my previous blog).
  2. If it is time-series data, there are few CV options, and you should use Time Series cross-validation.
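Both points above can be sketched in code: naive random over-sampling of the minority class for point 1, and an expanding-window split for point 2 (analogous in spirit to scikit-learn's `TimeSeriesSplit`, which trains only on past data and tests on the future). Both helpers and their names are illustrative assumptions, not from the article:

```python
import random


def upsample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until every class matches
    the majority class size (point 1). A sketch only; libraries such as
    imbalanced-learn offer more robust resampling strategies.
    """
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out


def time_series_splits(n_samples, n_splits):
    """Expanding-window splits (point 2): the training window always
    precedes the test window, so no future data leaks into training.
    """
    test_size = n_samples // (n_splits + 1)
    remainder = n_samples % (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = test_size * i + remainder
        yield (list(range(train_end)),
               list(range(train_end, train_end + test_size)))


# Over-sampling: 9-vs-1 labels become 9-vs-9 after up-sampling.
X_bal, y_bal = upsample_minority([[1]] * 9 + [[2]], [0] * 9 + [1])

# Time-series CV on 10 points with 3 splits: training windows grow,
# and every training index precedes every test index.
splits = list(time_series_splits(10, 3))
```

Note that resampling should be done inside each training fold (not on the full dataset before splitting), otherwise duplicated rows can leak into the test fold and inflate your scores.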

I will keep this topic updated.