Best Practices and Missteps When Creating Training and Evaluation Sets in Machine Learning

Imagine you’re training for a big race. You wouldn’t just run on the same track every day, right? You’d mix it up – maybe some hill runs, a few sprints, and some long-distance jogs. Similarly, in machine learning, we can’t just use all our data in one go. We need to prepare and organize it in a way that helps our machine learning models train effectively and then test their skills on new, unseen tracks.

This process of organizing our data into different sets – for training, validating, and testing – is foundational. It ensures that our models don’t just memorize the data (which would be like you only running on one track and then being surprised by a hill on race day!). Instead, they genuinely learn underlying patterns, making them ready for any new data they encounter. This article will discuss the best practices and pitfalls to avoid when creating training and evaluation sets.


Best Practices When Creating Training and Evaluation Sets

Always split your data into training and test sets
For instance, you could use 80% of your data for training and 20% for testing. This is a common convention rather than a fixed rule; researchers working with public datasets, such as those in UCI’s Machine Learning Repository, often use a split along these lines. Holding out a test set means the model is assessed on data it has never seen, which gives a more honest estimate of how it will perform in practice.

Perform random sampling while splitting the data
For example, a bank building a credit-scoring model would sample applicants’ past financial information at random instead of picking all individuals from a certain district or income group. Random sampling reduces selection bias, so each split is more likely to resemble the full dataset and the model is more likely to generalize to new data.
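
A minimal sketch of a seeded random split, using only Python's standard library. The function name random_split, the 80/20 proportions, and the applicant IDs are assumptions made for illustration; they are not taken from a specific bank or library.

```python
import random

def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records, then cut off test_fraction of them as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)  # private RNG: same seed, same split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: 100 applicant IDs stored in order (for example, grouped by district).
applicants = list(range(100))
train, test = random_split(applicants, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Fixing the seed makes the split repeatable, which matters when you want to compare models on exactly the same test set.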

Remember to use validation sets
A company that makes sales predictions might use the last quarter’s data for validation after training the model on data from prior years. A validation set shows how the model performs on held-out data while you tune hyperparameters, so the test set can stay untouched until the final evaluation. This practice helps you catch overfitting to the training set early and promotes a more generalized model.
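
The sales example can be sketched as a date-based holdout. The monthly figures below are invented, and split_by_date is a hypothetical helper, not part of any library.

```python
from datetime import date

def split_by_date(rows, validation_start):
    """Rows dated before validation_start go to training; the rest go to validation."""
    train = [r for r in rows if r[0] < validation_start]
    validation = [r for r in rows if r[0] >= validation_start]
    return train, validation

# Hypothetical monthly sales: (first day of month, units sold), Jan 2021 through Dec 2023.
sales = [
    (date(year, month, 1), 100 + month)
    for year in (2021, 2022, 2023)
    for month in range(1, 13)
]

# Hold out the last quarter (Oct-Dec 2023) for validation.
train, validation = split_by_date(sales, validation_start=date(2023, 10, 1))
print(len(train), len(validation))  # 33 3
```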

Ensure the distribution in the training set reflects real-world scenarios
For instance, when building an email spam detector, your model needs examples of both spam and non-spam emails. If your training set is full of only spam emails, the model can’t reliably detect non-spam emails. Ensuring a realistic distribution helps the model generalize well to unseen data and provides it with comprehensive information necessary to make accurate predictions or classifications.
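
A quick way to check this is to count the labels before training. The sketch below assumes a hypothetical set of 1,000 labeled emails; the label names and counts are made up for illustration.

```python
from collections import Counter

def class_proportions(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical training labels: 300 spam and 700 non-spam ("ham") emails.
train_labels = ["spam"] * 300 + ["ham"] * 700
proportions = class_proportions(train_labels)
print(proportions)  # {'spam': 0.3, 'ham': 0.7}
```

If the proportions differ sharply from what the model will see in production, that is a signal to revisit how the training set was collected.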

Consider stratified sampling for imbalanced datasets
In medical diagnoses, certain illnesses might be far rarer than others. With a purely random split, the few rare cases can end up unevenly spread, leaving one set with almost none of them. Stratified sampling splits each class separately, so every set keeps roughly the same class proportions as the full dataset, minority classes included. Oversampling the rare cases is a separate technique; if you use it, apply it to the training set only, after splitting. Together, these steps help the model learn from every class and be evaluated on every class.
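
Here is one way stratified splitting might be written by hand, assuming class labels stored in a plain list. The labels and the 5% rare-case rate are hypothetical. In scikit-learn, passing the labels to the stratify parameter of train_test_split serves the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly its original share in both."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                           # shuffle within each class
        n_test = round(len(indices) * test_fraction)   # same fraction taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical diagnosis labels: 950 healthy patients and 50 with a rare illness.
labels = ["healthy"] * 950 + ["rare"] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
rare_in_test = sum(labels[i] == "rare" for i in test_idx)
print(len(train_idx), len(test_idx), rare_in_test)  # 800 200 10
```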


Things to Watch Out for: Creating Training and Evaluation Sets

Not splitting the data randomly
A non-random split can leave the test set unrepresentative of the dataset as a whole, which makes the evaluation misleading. The model may score well on the test set yet fail to generalize to new data, or score poorly for reasons that have nothing to do with its quality. The cause of this mistake is the misconception that the order of the data does not influence the outcome of the machine learning model.

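  • For example, suppose a student is working on a machine learning project to predict the sales of an online store. If the file is sorted by date and they simply take the first rows for training and the last rows for testing, the test set may cover only one season or one promotion period and give misleading results.
    Fix: It’s crucial to ensure that the test set is representative of the data as a whole. Shuffling and randomly partitioning the data can help achieve this. The exception is genuine forecasting, where the model must predict the future from the past; there, a deliberate chronological split is appropriate because shuffling would leak future information into training.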

Using too much of the dataset for training
This leaves too few examples for evaluation, so the test score becomes a noisy, unreliable estimate, and overfitting can go unnoticed. The cause of this mistake is the belief that more training data will always lead to a more accurate model.

  • An example is a student working on a project to predict housing prices. If they use 90% of a small dataset for training, they might have very few examples left for testing.
    Fix: As a countermeasure, maintain a reasonable split, like 70% for training and 30% for testing, or employ techniques like cross-validation, in which every example is used for evaluation exactly once across the folds.
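
The cross-validation idea in the fix can be sketched as a small index generator. This is an illustrative stand-in written with the standard library, and the dataset size of 100 is an assumption; it is not a replacement for a library implementation.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample lands in a test fold exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Hypothetical housing dataset with 100 rows.
splits = list(k_fold_indices(100, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # 80 20 on every fold
```

Averaging the score over the folds gives a steadier estimate than a single small test set, at the cost of training the model k times.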

Not considering class imbalance while partitioning
If the model is not exposed to a sufficient number of examples from each class during training, it may perform poorly on the minority class in the evaluation phase. This mistake occurs due to a lack of awareness about the class distribution in the dataset.

  • Imagine a student building a default prediction model for a bank whose data shows that only 5% of clients defaulted. If this is not accounted for while splitting, the training set or the test set may end up with too few defaults to learn from or to evaluate on.
    Fix: Implement techniques like stratified sampling while partitioning the data, which ensures that each class is proportionally represented in both the training and test sets.

Leaking information from the test set
Preprocessing the entire dataset before splitting results in the ‘leakage’ of information from the test set into the training set. The test set is then no longer truly unseen, which produces an overly optimistic estimate of the model performance. This mistake is due to a misconception that data preprocessing should be done before partitioning the set into train and test.

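  • Suppose a student is working on a machine learning project to predict cancer from medical records. They preprocess (normalize or scale) the entire dataset before splitting it into train and test sets. The scaling statistics, such as the mean and standard deviation, then include values from the test set.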
    Fix: Generally, it’s best to fit the preprocessing methods only on the training set and then apply these transformations to both the training set and the test set to avoid data leakage.
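
A minimal sketch of the fit-on-train, apply-to-both pattern, using standardization of a single feature. The function names and the numbers are hypothetical, and the sample standard deviation is an arbitrary choice for this example.

```python
from statistics import mean, stdev

def fit_standardizer(values):
    """Learn the mean and sample standard deviation from the given values only."""
    return mean(values), stdev(values)

def standardize(values, mu, sigma):
    """Apply previously learned statistics; nothing is re-fitted here."""
    return [(v - mu) / sigma for v in values]

# Hypothetical single feature (for example, a lab measurement), already split.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 20.0]

mu, sigma = fit_standardizer(train_values)         # statistics come from the training set only
train_scaled = standardize(train_values, mu, sigma)
test_scaled = standardize(test_values, mu, sigma)  # the test set reuses the training statistics
print(mu)  # 5.0
```

Note that the test value 20.0 lands far outside the training range after scaling. That is expected: the test set should reveal such surprises rather than have them smoothed away by statistics it helped compute.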



Case Study: Sarah’s Journey with Data Splitting in Machine Learning

Sarah, a high school senior, had recently developed a keen interest in machine learning. After attending a workshop on the basics of machine learning, she decided to work on a project that predicts whether a student will pass or fail based on various factors like attendance, participation, and assignment scores. Eager to apply her newfound knowledge, Sarah embarked on her machine learning journey.

Sarah collected data from her school’s records, ensuring she had permission and that all data was anonymized. The dataset consisted of 1,000 student records, each with features like attendance percentage, average assignment score, participation in extracurricular activities, and a binary label indicating pass/fail.

Sarah remembered the importance of splitting her dataset. She knew that training her model on all the data and then testing it on the same data would not give her a true measure of its performance. After some research, Sarah decided on a 70-15-15 split for training, validation, and testing. Given her dataset size, this seemed appropriate.

Sarah noticed that the number of students who failed was significantly lower than the number who passed. To ensure that her training, validation, and test sets all had a similar proportion of passing and failing students, she used stratified sampling. Sarah used the train_test_split function from Scikit-learn, setting the stratify parameter to achieve stratified sampling. Because train_test_split divides data into two parts at a time, a three-way split means applying it twice. To get a more robust measure of her model’s performance, Sarah also decided to use 5-fold cross-validation.
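
The sketch below shows what a stratified 70-15-15 split of 1,000 records could look like. It is a hand-written stand-in using only the standard library, not Sarah's actual code, and the figure of 200 failing students is an assumption made purely for illustration, since the case study does not give the number.

```python
import random
from collections import Counter, defaultdict

def stratified_three_way(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, validation, test) index lists with similar class shares in each."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        validation.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # whatever remains goes to the test set
    return train, validation, test

# Assumption for illustration only: 200 of the 1,000 students failed.
labels = ["pass"] * 800 + ["fail"] * 200
train, validation, test = stratified_three_way(labels)
for name, split in (("train", train), ("validation", validation), ("test", test)):
    print(name, len(split), Counter(labels[i] for i in split)["fail"])
```

Under these assumptions, each set keeps the same 20% share of failing students as the full dataset.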

After her initial model training, Sarah noticed her model had a 98% accuracy on the training set but only 75% on the validation set. Recognizing this gap as a sign of overfitting, she decided to gather more data and introduce regularization.

Sarah plotted histograms of key features for both training and test sets. She ensured that the distributions were similar, confirming that her splits were representative.
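
Alongside histograms, a rough numeric comparison can flag splits that differ. The check below is a simple heuristic, not a formal statistical test; the attendance values and the tolerance of 0.25 standard deviations are invented for illustration.

```python
from statistics import mean, stdev

def summarize(values):
    """Return the mean and sample standard deviation of one feature."""
    return mean(values), stdev(values)

def similar(train_values, test_values, tolerance=0.25):
    """Heuristic: do the means differ by less than `tolerance` training standard deviations?"""
    train_mean, train_sd = summarize(train_values)
    test_mean, _ = summarize(test_values)
    return abs(train_mean - test_mean) < tolerance * train_sd

# Hypothetical attendance percentages in each split.
train_attendance = [72, 85, 90, 95, 60, 88, 79, 93]
test_attendance = [70, 91, 84, 87]
print(similar(train_attendance, test_attendance))  # True
```

A check like this only compares averages, so it complements a visual inspection of the full distribution and does not replace one.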

Sarah documented every step and set random seeds to ensure that her work could be reproduced by others in the future.

After addressing the overfitting and ensuring best practices in data splitting, Sarah’s model achieved an accuracy of 85% on the validation set. She presented her findings at a school science fair, emphasizing the importance of proper data preparation and splitting in machine learning.

Her project was well-received, with many praising her meticulous approach to data splitting. Sarah’s journey illustrated that even at the high school level, with the right knowledge and approach, one can effectively navigate the complexities of machine learning.