
Are you splitting your dataset correctly?



Splitting the dataset correctly is a fundamental step in evaluating the performance of a machine learning application.

Once we have identified a situation where a machine learning application is appropriate, with the goal of optimizing some technical, operational, or business process and keeping the organization ahead of the competition, we must take the first steps to build the project properly.

This starts with a focus on the available dataset. For the most part, our data will guide us in developing a machine learning application. Through an empirical process, armed with metrics such as accuracy and recall, among others, we evaluate the performance of the machine learning application.

Metrics for Evaluation

Precision, Recall (sensitivity), F1 Score, and Accuracy are metrics used to evaluate the performance of a machine learning model. Without these metrics guiding model optimization, the application may be biased or contain errors that are only recognized once it reaches the production environment.
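As a minimal sketch, assuming scikit-learn is available, the snippet below computes these four metrics on a pair of hypothetical binary label vectors; the values are illustrative placeholders, not results from any real model.

```python
# Sketch: computing the four evaluation metrics with scikit-learn.
# The label vectors below are hypothetical placeholders.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```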

After evaluating the results obtained, we begin a new iteration in the development cycle, adjusting our network architecture/ML algorithm, dataset, and other relevant application components.

Evaluating application performance depends on how you divide your dataset, and you can perform this division in a few different ways. For illustration, we will use the most common division in the literature, the “80/10/10” split. In this division, we set aside 80% of our data for model training, 10% for evaluation during project development (also called the dev set), and the remaining 10% for the test set, which we use to put our model “to the test.”
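As a minimal sketch, assuming scikit-learn, the code below produces an “80/10/10” split by applying train_test_split twice: first peeling off the 80% training set, then dividing the remaining 20% equally into dev and test sets. The synthetic dataset and the random_state value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 samples with 20 features each.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Step 1: set aside 80% for training; 20% remains for dev + test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 2: split the remaining 20% in half: 10% dev set, 10% test set.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_dev), len(X_test))  # 800 100 100
```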

Have you ever wondered whether this “traditional” division still applies? For a long time, the “80/10/10” split, or a similar proportion, was taken as the standard for developing machine learning models.

Is this division still applicable today when working with massive data sets?

Splitting your dataset incorrectly can multiply your project's development time unnecessarily. Depending on the number of samples available, you may be running model evaluations daily, or even more frequently, on sets that are far larger than they need to be.
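To make that cost concrete, here is a back-of-the-envelope sketch. The sample counts are assumptions chosen for illustration: with ten million examples, a proportional 10% dev set means evaluating on a million samples at every iteration, whereas a fixed-size dev set of a few thousand examples is often enough to compare candidate models.

```python
# Back-of-the-envelope comparison: proportional 10% dev set vs. a
# fixed-size dev set on a massive dataset. All counts are illustrative
# assumptions, not recommendations from a specific source.
n_samples = 10_000_000

dev_proportional = int(n_samples * 0.10)  # 1,000,000 examples per evaluation
dev_fixed = 10_000                        # a fixed-size alternative (0.1% of the data)

print(f"Proportional dev set: {dev_proportional:,} samples")
print(f"Fixed-size dev set:   {dev_fixed:,} samples "
      f"({dev_fixed / n_samples:.2%} of the data)")
```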

As much as it may seem that defining a model and arriving at a particular result is relatively straightforward, this could not be further from the truth. Achieving a breakthrough in a machine learning problem requires adopting an iterative process.

Remember: a machine learning project is a highly iterative process.

Developing a model that produces satisfactory results is a highly iterative process. If we do not structure our dataset correctly with respect to the number of samples available, each development iteration will cost the team more and more.