October 13, 2024
Stratified K-Fold Demonstration
The source file "pr_class_05_2_kfold_B.py" is a modification of a file from the J. Heaton class.
It is intended to demonstrate how stratified K-Fold works.
The input is based on the "jh-simple-dataset.csv" file; however, I sorted the rows by the product column and
am using the modified file "jh-simple-dataset_sorted.csv" as the input.
As a result, the input file now has the following distribution of values in the product
column:
product | counts | percent
--------+--------+--------
a       |    130 |    6.50
b       |    963 |   48.15
c       |    738 |   36.90
d       |     59 |    2.95
e       |     30 |    1.50
f       |     72 |    3.60
g       |      8 |    0.40
So we can see that if we applied a plain (unshuffled) K-Fold split to these 2000 sorted rows with a fold
size of 500, the first fold would only contain rows with a product value of a or b (see the sketch below).
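A minimal sketch of that failure mode (assuming the sorted CSV is in the working directory; with 2000 rows, 4 splits give the fold size of 500):

import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("jh-simple-dataset_sorted.csv")    # rows sorted by 'product'
kf = KFold(n_splits=4, shuffle=False)               # 2000 rows / 4 folds = 500 rows per fold
_, test_idx = next(kf.split(df))                    # indices of the first test fold
print(df.iloc[test_idx]["product"].value_counts())  # only 'a' and 'b' appear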
The goal of using a stratified K-Fold is to have folds within which the percentage of y-values is approximately
the same as in the original input data.
Using the command for a 5-split stratified K-Fold
kf = StratifiedKFold(5, shuffle=True, random_state=42)
we get the following distribution for the first fold:
product | counts | percent
--------+--------+--------
a       |     26 |    6.50
b       |    193 |   48.25
c       |    148 |   37.00
d       |     11 |    2.75
e       |      6 |    1.50
f       |     15 |    3.75
g       |      1 |    0.25
The remaining folds have similar distributions of rows.
So unlike a simple K-Fold, we now have the desired distribution characteristics.
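The per-fold distributions above can be reproduced with a short sketch (assuming the CSV has the 'product' column described above; stratification is done on that column):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("jh-simple-dataset_sorted.csv")
kf = StratifiedKFold(5, shuffle=True, random_state=42)

# Each test fold should mirror the product mix of the full dataset.
for fold, (train_idx, test_idx) in enumerate(kf.split(df, df["product"]), start=1):
    percent = df.iloc[test_idx]["product"].value_counts(normalize=True) * 100
    print(f"Fold #{fold}")
    print(percent.sort_index().round(2))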
October 26, 2024
pr_class_05_2_kfold_C.py
Intermittently, the prediction of age in fold 1 is printed as:
Fold #1
Train: index=[45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89] size=(45,)
Test: index=[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44] size=(45,)
2/2 [==============================] - 0s 1ms/step
prediction
pred
0 0.984798
1 0.984798
2 0.984798
3 0.984798
4 0.948751
5 0.984798
6 0.984798
7 0.984798
8 0.984798
9 0.984798
10 0.984798
Running the code again, the results are reasonable:
prediction
pred
0 53.453243
1 42.135712
2 40.878017
3 39.353210
4 41.757721
5 38.513622
6 44.270351
7 47.103832
8 52.380497
9 37.705742
10 54.965900
October 27, 2024
Part 5.2, section "Training with both a Cross-Validation and a Holdout Set"
The performance of the training is measured by taking the RMSE (the square root of the mean_squared_error)
of the differences between the predictions and the test data, as in the snippet below.
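For reference, a minimal sketch of the metric (the array values here are made up for illustration; in the script, pred comes from model.predict(x_test) and y_test holds the true targets):

import numpy as np
from sklearn.metrics import mean_squared_error

y_test = np.array([40.0, 42.0, 39.0])
pred = np.array([41.0, 41.5, 38.0])

score = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE: {score:.4f}")  # sqrt((1.0 + 0.25 + 1.0) / 3) ≈ 0.8660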
The typical RMSE for a fold in the example seems to be around 0.6
However, I have seen an RMSE for the holdout set of around 24, which is very high.
Even in J. Heaton's Jupyter notebook, he gets an RMSE of around 24 for fold number 5 as well as for the holdout set.
https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_2_kfold.ipynb
Fold #1
Fold score (RMSE): 0.544195299216696
Fold #2
Fold score (RMSE): 0.48070599342910353
Fold #3
Fold score (RMSE): 0.7034584765928998
Fold #4
Fold score (RMSE): 0.5397141785190473
Fold #5
Fold score (RMSE): 24.126205213080077
Cross-validated score (RMSE): 10.801732731207947
Holdout score (RMSE): 24.097657947297677
I have seen similar results; however, today I got a run with much better results (without changing the code):
Fold #1
12/12 [==============================] - 0s 513us/step
Fold score (RMSE): 0.7567204236984253
Fold #2
12/12 [==============================] - 0s 513us/step
Fold score (RMSE): 0.5426404476165771
Fold #3
12/12 [==============================] - 0s 486us/step
Fold score (RMSE): 1.0122915506362915
Fold #4
12/12 [==============================] - 0s 499us/step
Fold score (RMSE): 0.648369312286377
Fold #5
12/12 [==============================] - 0s 512us/step
Fold score (RMSE): 0.5357717871665955
Cross-validated score (RMSE): 0.7210066914558411
7/7 [==============================] - 0s 486us/step
Holdout score (RMSE): 0.4387000294433415
I don't know what causes the results to be so different between runs when all the external conditions
seem to be the same.
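One likely contributor is that Keras draws different random initial weights on every run, so training can start from a good or a bad initialization. A sketch for pinning the seeds (assuming TensorFlow 2.x, as used in the course code) to make runs repeatable:

import random
import numpy as np
import tensorflow as tf

random.seed(42)         # Python's own RNG
np.random.seed(42)      # NumPy RNG (also used by the scikit-learn splits)
tf.random.set_seed(42)  # seeds Keras weight initializers and shuffling

With the seeds fixed, identical runs should produce identical fold scores, which would make it easier to catch the runs where the predictions collapse to a near-constant value.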