Input data sample training clarifications

Question

Dear users and experts,
I am not clear about how the input data sample is used.
I have a CSV file with 5 fields and 20k entries (lines).
When I run a classification training, are all 20k entries used? Or does the algorithm split this data sample randomly?

My final goal is to ensure that I am training on exactly the examples I provide, so that I can improve the training whenever I find new examples on which the algorithm gets it wrong.

Best regards.

Answer

@DJamin Thanks for the question. The algorithm will not split the data for you. The idea is to first split the whole dataset into a training dataset and a test dataset, where the test dataset is held back from training your model. Then, in the training stage, the original training dataset is divided again into a (secondary) training dataset and a validation dataset, where the validation dataset is also held back from fitting the model. The reason for this second split is that most models have hyperparameters that need to be tuned, and the validation dataset is used for that purpose with a specific model. Thus, if a model has no hyperparameters to tune, there is no need to split the training dataset into (secondary) training and validation datasets.

• Training Dataset: The sample of data used to fit the model.
• Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
• Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
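
To make the two-stage split described above concrete, here is a minimal sketch in plain Python. It is not the API of any particular training tool: the function name `split_dataset`, the 20% fractions, and the fixed seed are all assumptions chosen for illustration, and the list of integers simply stands in for the 20k CSV rows.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Shuffle the rows once, then cut them into train / validation / test.

    Hypothetical helper for illustration only; the fractions are assumptions.
    """
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible

    # First split: hold back the test set from the whole dataset.
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    # Second split: hold back a validation set from what is left.
    n_validation = int(len(remaining) * validation_fraction)
    validation = remaining[:n_validation]
    train = remaining[n_validation:]
    return train, validation, test


rows = list(range(20_000))  # stand-in for the 20k CSV entries
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))  # 12800 3200 4000
```

Every row ends up in exactly one of the three lists, so nothing is used both for fitting and for evaluation. If the model has no hyperparameters to tune, setting `validation_fraction=0` in this sketch leaves all non-test rows in the training list.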

Answer

Dear expert,
thanks for the feedback.
Sorry, but your explanation of the splitting is still not clear to me.
Is the whole 20k dataset used for training, and are these same 20k entries then used again for evaluation?

I have one last question: is there a way to provide a training sample that will be used in full, and then provide a separate test sample (containing different data), also used in full, to evaluate the accuracy of the trained model?

I want to ensure that all the desired data will be used for learning.
Does anyone know whether such a possibility exists?
Best regards.
