Input data sample training clarifications

DJamin 386 Reputation points
2021-08-19T10:26:53.483+00:00

Dear users and experts,
I am not clear about how the input data sample is used.
I have a CSV file with 5 fields and 20k entries (lines).
When I run a classification training, are all 20k entries used? Or does the algorithm split this data sample randomly?

My final goal is to ensure that I am training on the examples I provide, so that I can improve the training when I find new examples on which the algorithm gives wrong results.

Best regards.

Azure Machine Learning
An Azure machine learning service for building and deploying models.
.NET Machine learning
.NET: Microsoft technologies based on the .NET software framework. Machine learning: A type of artificial intelligence focused on enabling computers to use observed data to evolve new behaviors that have not been explicitly programmed.

2 answers

  1. Ramr-msft 17,616 Reputation points
    2021-08-20T04:04:24.663+00:00

    @DJamin Thanks for the question. The algorithm itself will not split the data. The usual approach is to split the whole dataset into a training dataset and a test dataset, where the test dataset is held back from training your model. Then, in the training stage, the original training dataset is divided again into a (secondary) training dataset and a validation dataset, where the validation dataset is also held back from training your model. The reason for this second split is that most models have hyperparameters that need to be tuned, and the validation dataset is used for that purpose with a specific model. So if a model has no hyperparameters to tune, there is no need to split the training dataset into (secondary) training and validation datasets.

    • Training Dataset: The sample of data used to fit the model.
    • Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
    • Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
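
As a rough illustration of the two-stage split described above, here is a minimal sketch in plain Python. It is not how Azure Machine Learning or ML.NET split data internally; the function name `split_rows`, the 20% fractions, and the fixed seed are all assumptions chosen for the example, and a list of 20,000 integers stands in for the rows of the CSV file.

```python
import random


def split_rows(rows, test_fraction=0.2, validation_fraction=0.2, seed=42):
    """Shuffle rows once, then split them into train, validation and test lists.

    Hypothetical helper: the fractions and the seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    # First split: hold back the test set from everything else.
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    # Second split: hold back a validation set from the remaining training data.
    n_validation = int(len(remaining) * validation_fraction)
    validation = remaining[:n_validation]
    train = remaining[n_validation:]
    return train, validation, test


# Stand-in for the 20k CSV entries: each integer represents one row.
rows = list(range(20000))
train, validation, test = split_rows(rows)
print(len(train), len(validation), len(test))  # prints "12800 3200 4000"
```

Every row ends up in exactly one of the three lists, so the test rows are never seen during training. Setting `validation_fraction=0` in this sketch corresponds to the case with no hyperparameters to tune, where all non-test rows are used for training.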


  2. DJamin 386 Reputation points
    2021-08-20T11:05:10.763+00:00

    Dear expert,
    Thanks for the feedback.
    Sorry, but I am still not clear on your explanation of the splitting.
    Is the whole 20k dataset used for training, and are these same 20k entries then used again for evaluation?

    I have one last question: is there a way to provide a training sample that will be used in full, and then provide a test sample (containing different data), also used in full, to evaluate the accuracy of the training?

    I want to ensure that all the desired data will be used for learning.
    Does anyone know if such a possibility exists?
    Best regards.