1. Under the Holdout method in the “Sampling for evaluation” section in the worksheet, does the “test set or holdout set” here refer to the validation set or test set? Is validation set relevant to both Holdout method and Cross-validation, or is it only relevant to Cross-validation?

    Simple answer: It is referring to the test set.

    Complicated answer: The validation set and the test set, while referring to different subsets of data, play very similar roles. Both are a part of our full dataset that is “held out” or “set aside” and used exclusively to find out how well our classifier is performing, without using data the classifier has already seen during its training process.

    Long answer: The main thing to understand here is that we want to predict how well our classifier will perform, and to do that we need to evaluate the classifier on some data. But to get a good estimate of performance, we should not evaluate the model on data it has already seen during training, because the training process involves trying to do as well as possible on the training data, so of course the model is expected to do well on it. We need “unseen” / “held out” / “set aside” / “out of sample” / “TEST” data. The words “validation” and “test” are basically synonyms; we use different words just to signal that we are doing the test in slightly different contexts.

    See, we might want to know how well our classifier performs for different purposes. There is the obvious one: to get a final performance report, the one you use to boast in a paper, to your boss, or to convince customers to use your classifier. This is the number you use to predict how well your classifier will actually perform in the future. We usually call the set of data that generates this number the “test set” (but really, this is just convention at this point).

    Another reason you might want to evaluate performance is to do model selection (notice that KFold is implemented in the model_selection module of sklearn). This is the scenario where you have multiple classifiers that have already been trained and you want to compare their performance. You can evaluate them on data not used in any of their training processes and choose the classifier that performs best. This set of data is often called the “test set” as well. BUT!! This process of selecting the best classifier from multiple classifiers to come up with one ultimate classifier IS PART OF TRAINING: it is a process that uses data to find a classifier that we hope performs well on future instances of similar data. To know the actual performance of that selected classifier, we can’t use the same data we used to select it (same reasoning: it was selected to do well on that data, so of course it will do well). We need to evaluate it on yet another test set.

    Now we have a linguistic problem: the same term “test set” refers to two physically different things (different values, stored in physically different locations). So we change one of the words to a synonymous one: the set used for model selection is called the validation set, and “test set” is reserved for the final report. The reason my “simple answer” just says “test set” is that we haven’t run into this issue of needing to test / evaluate for different purposes yet, and “test” is the more common / default / general word to use (it is the default word in many ML libraries).
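    A minimal sketch of the two kinds of held-out data described above (the data, split sizes and random seeds are purely illustrative, not from the worksheet), using sklearn’s train_test_split twice:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the full data set

        # First split: hold out the test set, used only for the final performance report.
        X_trainval, X_test, y_trainval, y_test = train_test_split(
            X, y, test_size=0.2, random_state=0)

        # Second split: hold out a validation set, used to compare/select models.
        X_train, X_val, y_train, y_val = train_test_split(
            X_trainval, y_trainval, test_size=0.25, random_state=0)

        # Train candidate classifiers on (X_train, y_train),
        # pick the best one by its score on (X_val, y_val),
        # then report that one classifier's score on (X_test, y_test) exactly once.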

    Analogy: We use “mid-sem test”, “final test”, “exam” and “quiz” to refer to the process of evaluating student performance for different purposes and in different contexts. We could have used “model selection test set” to disambiguate, but “validation” is already widely used.

  2. Under the Cross-validation evaluation method, when it says “test set” or “testing” in the definitions, does the “test” here refer to validation set?

    Following the story we outlined above, “testing” here refers to the process of evaluating classifier performance on data it has not seen during training. We do not need to make the distinction between test / validation here because we are talking about that performance evaluation process in general.

    Cross-validation is just a method of splitting the full data set into “data to train the model” and “data unseen during training” multiple times, so that we can train on several different training sets and evaluate each resulting model. Each such split is usually known as a “train-test split”. The idea is to get several independently trained models, independently evaluate each on its own independent held-out data, and take the average of these evaluations as a better prediction of actual performance. That the average of independent samples from the same distribution is a better estimate of the mean is a basic result from probability; just how “independent” these evaluations really are is up for debate, but we are only applying it as general wisdom: taking an average is good.
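    A minimal sketch of this (the model and data are placeholders, not anything specific from the worksheet), using sklearn’s KFold to produce the repeated train-test splits and averaging the held-out scores:

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import KFold

        X, y = make_classification(n_samples=500, random_state=0)

        scores = []
        for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
            model = LogisticRegression(max_iter=1000)
            model.fit(X[train_idx], y[train_idx])                  # train on this fold's training part
            scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the part unseen during training

        print(np.mean(scores))  # the averaged estimate of performance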

  3. In the Random sub-sampling method, the description says “each train/test split is drawn with replacement from the original data set”. What does the part “with replacement from the original data set” refer to?

    Simple answer: This is just one way to come up with the train-test split. It describes a process that builds the training and testing sets like the following. Suppose the full data set is of size 10, and you want 5 data points in the training set and 3 in the testing set (the exact numbers are not important for now). We can generate 5 random indices, e.g. [2, 8, 9, 1, 1], and 3 more, e.g. [5, 7, 8], and put into our training set the data points with indices [2, 8, 9, 1, 1] and into our test set the data points with indices [5, 7, 8].

    Long answer: “Random draw with replacement” is a term used in probability theory. It means we generate a sample of N things by reaching into a bag, randomly picking something out, recording what it is, returning it to the bag, and repeating this N times. Notice that the “replacement” part makes it possible to pick up something you have picked up before (see the index 1 repeated twice above, and the index 8 appearing in both the train and test sets). That is behaviour you would not get if you were doing a “random draw without replacement”. Wait ….. isn’t this doing test evaluation on data you might have seen before …… Good point. But if your data set is large enough, that is going to be rare, and it turns out that it is not cheating as long as the training and testing sets are drawn from the same distribution, even if there might be repeated instances. You are right, though, that sampling without replacement is the more common practice (most libraries that implement train_test_split or k-fold cross-validation have a shuffle=True option, which is effectively sampling without replacement).
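    A toy sketch of the difference (the indices produced depend on the random seed, so they will not match the example numbers above exactly):

        import numpy as np

        rng = np.random.default_rng(0)
        n = 10                                  # size of the full data set
        train_idx = rng.integers(0, n, size=5)  # 5 draws WITH replacement: repeats are possible
        test_idx = rng.integers(0, n, size=3)   # 3 more draws, also with replacement
        print(train_idx, test_idx)

        # Compare: sampling WITHOUT replacement, which is effectively what
        # shuffle-based train/test splits do -- no index can appear twice.
        idx = rng.choice(n, size=8, replace=False)
        print(idx[:5], idx[5:])                 # first 5 -> train, last 3 -> test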

    How much do you need to care about this? If you haven’t yet mastered what train-test data are, what a “model” means, and what K-fold cross-validation is, and gotten to know a few models, how they are trained and tested, and what constitutes a good “performance measure”, maybe leave this alone for now. Just know that this is one option for getting a train-test split; it is not used particularly often, and the theory of why it is valid is perhaps not very important to you right now.

  4. In the Leave-one-out cross-validation method (see worksheet), it says “n is the number of instances in the data set”. Is this “data set” here referring to the “train and validation set”?

    Simple answer: “Data set” refers to the full data set, i.e. all the data you have.

    Long answer: “Data set” refers to all the data you have available to perform training and testing. Remember that we might be talking about testing in different contexts here. If we are not doing model selection and just want to train the model and find out how well it does, then it means all the data you have. If we are talking about a process with model selection, then we might already have a test set put aside for testing the ultimate model, and we do cross-validation to split the remaining data multiple times; in that case n = size of the remaining data.

    Again, K-fold, leave-one-out (which is just K-fold with K chosen so that the test set always has size 1) or any such splitting method is there to split any data set; you decide what data set to split and why.
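    A minimal sketch (the data here is a placeholder) showing that leave-one-out really is K-fold with K = n, so there are n splits and each test set contains exactly one instance:

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import LeaveOneOut, cross_val_score

        X, y = make_classification(n_samples=30, random_state=0)  # n = 30 instances

        scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
        print(len(scores))    # 30 splits: one held-out instance per split
        print(scores.mean())  # averaged estimate of performance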

  5. I might have said something like “ROC is not a score”. Why? What about precision_score, recall_score, and f1_score?

    I think I said something like “ROC is not a score like what the score I wrote on the whiteboard meant”. I regretted saying that, and I regret picking the word “score” to write on the whiteboard. You might have noticed that I use the word “performance” a lot above, and I think that is closer to what I meant. Throughout the questions and answers above, I didn’t mention how this performance is to be measured, and indeed that is something we determine case-by-case in practice. The vast majority of the time, we measure performance by picking a score: a number that we can compute for a trained classifier on some data, with the property that the higher this number is, the better the performance. Accuracy, precision, recall and F1 are all scores in this sense. Other ones you might be familiar with include MSE (mean squared error) and MAE (mean absolute error) for regression tasks, but be careful: those are numbers that are better the lower they are, so we generally call them an “error” or a “loss”.
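    As a quick illustration (the labels and predictions below are made up, not from the worksheet), each of these is a single number where higher means better:

        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

        y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
        y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # a classifier's predictions

        print(accuracy_score(y_true, y_pred))
        print(precision_score(y_true, y_pred))
        print(recall_score(y_true, y_pred))
        print(f1_score(y_true, y_pred))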

    ROC is not a score in the above sense. First of all, it is not even a number; it is a curve in the x-y plane. To turn it into a number, we generally calculate the area under the ROC curve (AUC). Now this number does have the property that the higher it is, the better something is. That “something”, however, is not ONE classifier. Recall that we construct the ROC curve for a model that outputs a number which we compare with a chosen threshold to decide whether an observation is positive or negative. Unfortunately, we call that number a “score” as well; perhaps you can translate it as “confidence”: it is a number that is higher when the model is more confident that the observation is positive. The upshot is: “trained model + chosen threshold = classifier”, so a different threshold gives a different classifier. The ROC AUC is a way to measure the performance of the “trained model” part of that equation by looking at its performance across ALL thresholds.

    So: higher AUC, better model (or more specifically, a better TPR-FPR trade-off can be made with the model).
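    A small sketch of the distinction (the labels and model outputs below are made up): fixing a threshold turns the model’s output into one classifier with its own accuracy, while ROC AUC evaluates the raw outputs across all thresholds:

        from sklearn.metrics import accuracy_score, roc_auc_score

        y_true = [0, 0, 1, 1, 1, 0, 1, 0]
        y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.5]  # model output before thresholding

        # trained model + threshold 0.5 = ONE classifier; accuracy is that classifier's score
        y_pred_at_half = [1 if s >= 0.5 else 0 for s in y_score]
        print(accuracy_score(y_true, y_pred_at_half))

        # ROC AUC evaluates the raw scores themselves, i.e. the model across all thresholds
        print(roc_auc_score(y_true, y_score))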