Q&A (Week 2)
Q&A for the week 2 materials: hyperparameters, feature selection, continuous variables in decision trees, etc.
-
In terms of hyperparameters, I think you mentioned that X = (X1, X2, …) is part of the hyperparameters. Does this mean the train-test split is part of the hyperparameters?
Sorry that confused you. I did not mean to imply that X = (X1, X2, …) are hyperparameters; they are not parameters of any kind. Depending on what your notation meant, they are most likely the features: each observation (the data you collect) constitutes a row in a data frame of the form (X, y), with X = (X1, X2, …, Xn) being the independent variables, i.e. the features.
I was giving the example that one could do “feature selection”, in which one chooses a subset of the N best features according to some criterion (e.g. correlation of X_i with the response y). The size of that subset, N, can be a hyperparameter, meaning it is something you choose BEFORE you call model.fit(). If your model is a linear model, then the model itself has parameters (the linear coefficients in this case). Each parameter setting produces a different “guess” or “hypothesis” that explains the relationship between X and y, and the role of model.fit() is to find the best parameters according to a given goodness criterion (accuracy, mean squared error, etc.). Each different hyperparameter (e.g. N above) will also produce a different hypothesis whose goodness can be measured on any given data. BUT it is NOT optimised (or “trained”) by model.fit(). It is instead chosen before calling .fit(), and it usually changes which set of parameters .fit() can even affect. In that sense, it is a parameter that controls other parameters, hence the name “hyperparameter”.
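To make the parameter/hyperparameter distinction concrete, here is a minimal sketch in Python (using numpy and sklearn's LinearRegression on made-up data; these choices are mine, purely for illustration): N is fixed before .fit() and controls which coefficients .fit() can even produce, while the coefficients themselves are the parameters found by .fit().

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                               # 10 candidate features
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

N = 2  # hyperparameter: chosen BEFORE fitting, never touched by .fit()

# rank features by absolute correlation with y and keep the N best
correlations = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
best = np.argsort(correlations)[-N:]

model = LinearRegression().fit(X[:, best], y)
print("selected features:", best)          # determined by the hyperparameter N
print("fitted parameters:", model.coef_)   # determined by .fit()

Changing N changes which coefficients exist to be fitted at all, which is exactly the “parameter that controls other parameters” picture above.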
-
Regarding K-fold stratified cross-validation, some sources mention it improves accuracy. Is this referring to higher accuracy on the training set, or higher accuracy on the test set as well?
Cross-validation (of any sort: K-fold, leave-one-out, random sampling) is a method to evaluate out-of-sample accuracy (or any other performance measure). It doesn’t, by itself, improve anything. However, by doing cross-validation, one gets information about the future performance of the model. So, if you do cross-validation across different classifiers, you can pick the one with the best cross-validated scores to ensure that the final classifier you choose has the best accuracy among all the classifiers you tested.
Also, to use exaggerated language, we NEVER evaluate performance on the training set. While we do compute accuracy or other performance scores on the training set, it is for the purpose of “fitting” the model: to find parameters or a configuration (e.g. decision tree branches) that do well at explaining the training data. The score is used for this search, but we never believe that scores computed during training are indicative of future performance. We always evaluate how well our model might do on future data – how well our model “generalises” – using data it has never seen during training. This out-of-sample data set can either be a test set (data set aside specifically for this purpose) or come from a cross-validation process.
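As a rough sketch of how that comparison might look in code (sklearn classifiers on synthetic data, chosen here purely for illustration), each model is scored only on the folds it was not trained on, and the cross-validated scores are what you compare:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, clf in [("tree", DecisionTreeClassifier(random_state=0)),
                  ("logistic", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X, y, cv=cv)   # accuracy on held-out folds only
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")

# training accuracy, by contrast, is only used inside .fit() and is usually optimistic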
-
For the decision tree, I think in class we were told to stop splitting when “instances at root have the same class”. Do we need to consider whether to split the continuous features before or after the categorical features when constructing the decision tree? (Since taking a different threshold may affect the class composition after the split, affecting the Information Gain calculated, and that may affect when we stop splitting.) Or do the threshold and the order have no impact on the final model?
That is indeed a subtle thing. Let’s recall what we do for the case where all the features are categorical and the response y is also categorical: at every node, we split on the feature that has the highest information gain. In this case, the ordering of features doesn’t matter since we evaluate InfoGain on every feature and take the best.
Things become slightly more complicated when continuous variables are involved, but the overall story doesn’t really change: we still construct a way to evaluate the “goodness of split” for each feature, evaluate this on all features, and then split on the best. The only thing that changes is how we evaluate this “goodness of split” on continuous features. So, the order still doesn’t matter; we still evaluate on everything and pick the best.
How we evaluate goodness of split depends on the type of feature and the type of response:
- Categorical input -> categorical output. This is the case discussed in the lecture and workshop; we evaluate information gain or a similar purity criterion.
- Categorical input -> continuous output. We can still split on this feature, but now we need to extend the idea of purity, which we evaluate at each node, to a continuous variable. To motivate this a little, consider an extremely nice situation: we have the data [(A, 5), (B, 1000), (B, 1050), (A, 7)]. It is clear that when the input category is A the output is very small, and when it is B it is large. The decision rule “if the input is A, predict output = 6, else predict 1025”, where those prediction values are the averages within each group, looks like a good rule. In general, if we split on the feature values and use the average of the output in each split as our prediction, then we evaluate how good our split is by looking at the expected error that kind of prediction will make. In our simple example, not splitting on that feature gives the prediction (5 + 7 + 1000 + 1050) / 4 = 515.5, which is far from all the values we have seen, while if we do split and predict based on A or B, we improve to predicting 6 (which is close to any value in the A branch) and 1025 (which is close to any value in the B branch). So in general we still have a formula that is similar in spirit to the InfoGain formula: “error of prediction at the parent node - average error across the child nodes”.
- Continuous input -> categorical output. One way to handle this is to use a threshold on the input value. If we pick a threshold T, then splitting based on whether input < T effectively turns this into case 1 above: categorical (binary) input -> categorical output. It’s just that there are infinitely many T to choose from. Well, we can just choose the “best T”, meaning find the T where the resulting split has the best purity score, and then the goodness of split from that T is the goodness of split for this feature overall. Choosing the best T from infinitely many might seem impossible, but this is a very typical situation that just turns into an optimisation problem: in practice InfoGain(T) only changes when T crosses one of the observed feature values, so it is enough to sort the observed values and evaluate InfoGain at the candidate thresholds between consecutive values, keeping the best (see the sketch after this list).
- Continuous input -> continuous output. This is now a regression problem. One way to handle it is to do the same thing as case 2 above: after thresholding, we have a categorical (now binary) input -> continuous output, and the issue of which threshold to choose again comes down to the same optimisation as in case 3.
All of the above have variations with various strengths and weaknesses, but understanding the reasoning behind these methods first might help. See the sklearn documentation on decision trees (classifiers and regressors) for reference.
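To make the “goodness of split” idea concrete, here is a rough numpy sketch of cases 2 and 3 above (my own illustration of the idea, not how sklearn implements it internally): case 2 scores a categorical split on a continuous output by the reduction in squared error, and case 3 scores a continuous feature against a categorical output by the best information gain over candidate thresholds.

import numpy as np

def entropy(labels):
    """Entropy of a categorical label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold_split(x, y):
    """Case 3: continuous input, categorical output.
    Try thresholds midway between consecutive sorted values and
    return (best threshold, information gain of that split)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    candidates = (x_sorted[:-1] + x_sorted[1:]) / 2
    best_T, best_gain = None, -np.inf
    for T in candidates:
        left, right = y_sorted[x_sorted <= T], y_sorted[x_sorted > T]
        if len(left) == 0 or len(right) == 0:
            continue
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = entropy(y) - children
        if gain > best_gain:
            best_T, best_gain = T, gain
    return best_T, best_gain

def error_reduction(categories, y):
    """Case 2: categorical input, continuous output.
    'Error of prediction at the parent node - average error across the child nodes',
    using squared error around the mean as the error."""
    parent_error = np.mean((y - y.mean()) ** 2)
    child_error = 0.0
    for c in np.unique(categories):
        y_c = y[categories == c]
        child_error += len(y_c) / len(y) * np.mean((y_c - y_c.mean()) ** 2)
    return parent_error - child_error

# the toy example from case 2 above
cats = np.array(["A", "B", "B", "A"])
vals = np.array([5.0, 1000.0, 1050.0, 7.0])
print("error reduction from splitting on A/B:", error_reduction(cats, vals))

# a toy example for case 3: a clean threshold around 6.5 separates the classes
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array(["no", "no", "no", "yes", "yes", "yes"])
print("best threshold and info gain:", best_threshold_split(x, labels))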
-
For ranking-based feature selection of discrete-valued features using information gain, is the mutual information calculated based on the original data set (from the top of the tree, without any split)?
Short answer: mutual information (or correlation, or any other such score used for feature selection) is evaluated on the “train” data, in both the train-test split and the train-validation-test split settings.
Long answer: first things first, feature selection is something you can do regardless of what model you end up using, so this is not about decision trees per se. You might select the features you think are best and feed them into a DecisionTree, RandomForest, linear model or anything you wish.
The reason we shouldn’t use validation or test data to do feature selection is related to why we don’t use training data to evaluate performance. Feature selection is something you do as part of training: the selected features are going to be used by model.fit() to find the best parameters. If you use the validation or test data to find such a feature set and then evaluate performance on data you already used to inform how your model is constructed, the resulting performance is going to be overly optimistic.
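With sklearn, one common way to keep the selection step on the training data only is to put it inside a Pipeline, so that during cross-validation the mutual information scores are recomputed on each training fold and never see the held-out fold. A small sketch on synthetic data (the specific choices of k and classifier here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=5)),   # fitted on the training folds only
    ("tree", DecisionTreeClassifier(random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy with leakage-free feature selection:", scores.mean())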
-
I am confused between wrapper-based feature selection and embedded feature selection, as they both seem to perform feature selection in parallel with training. Is the main difference that the “mechanism” in embedded feature selection needs to be explicitly defined in the algorithm (like a decision tree), while it is considered unknown in wrapper-based feature selection before the model is trained?
As far as I can tell, your description is correct. The embedded approach relies on a “performance measure” that is built into the model you choose, whereas the wrapper approach is independent of which model you use to evaluate performance. I am not sure these are standard terms, as I have not seen them much … I am going to point you to this site for this question.
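If it helps, here is one way the two flavours can look in sklearn (my own illustration of the distinction, using SequentialFeatureSelector as a wrapper-style selector and Lasso coefficients via SelectFromModel as an embedded-style one; the data and settings are arbitrary):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=4, random_state=0)

# wrapper: the selector repeatedly re-fits the model and scores candidate feature
# subsets by cross-validated performance, treating the model as a black box
wrapper = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4, cv=5).fit(X, y)
print("wrapper keeps:", wrapper.get_support(indices=True))

# embedded: the selection signal (here, which coefficients Lasso shrinks to zero)
# is produced by the model's own training procedure
embedded = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print("embedded keeps:", embedded.get_support(indices=True))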
-
In the Bagging topic, the slide mentioned using “cross-validation to evaluate its out-of-sample data”. Does this mean that the accuracy will be calculated over the out-of-sample data each time, even if the size is different? And when it says bagging “improves unstable classifiers by reducing variance”, I am confused about what this improvement is relative to. Is it talking about the improvement in performance after taking a majority vote or averaging, compared to the performance of the base classifiers, which are unstable?
Yes. Remember, the main point of cross-validation is that we evaluate performance on samples not used to train the model. Evaluating on one sample the model didn’t see during training is still a way to estimate performance, but the value you get out is going to have a high variance compared to evaluating performance on lots of samples and taking the average. This is just the phenomenon typified by the central limit theorem: if one sample has variance S^2, then the average over N independent samples has variance S^2 / N.
And yes again, the improvement comes from averaging unstable (high-variance) base estimators. Bagging again uses a central-limit-theorem-like phenomenon: if you have N independent predictions, you can reduce your prediction variance by a factor of 1/N. Although we don’t have the guaranteed independence needed for the central limit theorem, we use a lot of tricks (randomly subsampling the training data, randomly subsampling the feature set, etc.) to make the individual base estimators as “independent” as possible. So, although we won’t get the full 1/N reduction in variance, there will still be some improvement over the individuals.
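The 1/N effect is easy to check numerically. This is a plain numpy illustration of the central-limit-theorem point only (idealised independent predictions, not an actual bagging run):

import numpy as np

rng = np.random.default_rng(0)
S = 2.0           # standard deviation of a single noisy "prediction"
N = 25            # number of (idealised, independent) base estimators
trials = 100000

single = rng.normal(scale=S, size=trials)                       # one estimator per trial
averaged = rng.normal(scale=S, size=(trials, N)).mean(axis=1)   # average of N per trial

print("variance of a single estimator:", single.var())    # about S^2 = 4
print("variance of the average of N:  ", averaged.var())  # about S^2 / N = 0.16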
Demo implementation of how to do sampling with replacement from a non-uniform probability distribution
import numpy as np

def sample_one(probabilities):
    """
    Generate a single sample with probability distribution specified by
    `probabilities = [p1, p2, ..., pN]`.
    We return an index `i` between 0 and N-1 inclusive.
    """
    assert np.isclose(np.sum(probabilities), 1.0)  # check this is an honest probability distribution
    N = len(probabilities)  # size of the sample space
    cumulative_distribution = np.cumsum(probabilities)  # compute the CDF
    cumulative_distribution = np.concatenate([[0], cumulative_distribution])  # prepend 0 to the list
    random_value = np.random.rand()  # a single random number, uniform between 0 and 1
    # Look for the index i such that CDF[i] < random_value <= CDF[i + 1]
    for i in range(N):
        if cumulative_distribution[i] < random_value <= cumulative_distribution[i + 1]:
            return i  # break and return when found
    return N - 1  # safety net for edge cases (e.g. floating-point round-off in the CDF)

def sample(n, items, probabilities):
    """For pedagogy only. Inefficient implementation."""
    samples = []
    for _ in range(n):
        index = sample_one(probabilities)
        samples.append(items[index])
    return samples

items = ['a', 'b', 'c', 'd']
probabilities = [1/8, 1/8, 4/8, 2/8]
num_samples = 10000
samples = sample(num_samples, items, probabilities)

print("Empirical probabilities:")
for item, count in zip(*np.unique(samples, return_counts=True)):
    print(f"Probability of observing {item} = {count / num_samples}")
Empirical probabilities:
Probability of observing a = 0.1208
Probability of observing b = 0.1248
Probability of observing c = 0.505
Probability of observing d = 0.2494
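For reference, numpy provides this operation directly, so in practice the whole demo above can be replaced with a single call (equivalent up to randomness):

samples = np.random.choice(items, size=num_samples, replace=True, p=probabilities)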