
AI learning blog October 2024

October 13, 2024

Stratified K-Fold Demonstration
The source file "pr_class_05_2_kfold_B.py" is a modification of the corresponding file from J. Heaton's class.

It is intended to demonstrate how stratified K-Fold works.

The input file is based on "jh-simple-dataset.csv"; however, I sorted the rows by the product column and use the modified file "jh-simple-dataset_sorted.csv" as the input.

As a result, the input file now has the following structure, broken down by the value of the product column:

product  count  percent
a          130    6.50
b          963   48.15
c          738   36.90
d           59    2.95
e           30    1.50
f           72    3.60
g            8    0.40

So we can see that if we applied a plain K-Fold split with a fold size of 500 (i.e. 4 folds over the 2000 rows), the first fold would only contain rows with a product value of a or b.
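This effect can be reproduced without the CSV file by building a synthetic label column from the counts in the table above (a sketch; the real script reads the sorted CSV):

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic product column mimicking the sorted input: the same
# counts as in the table above, in sorted order (a..g), 2000 rows total.
counts = {"a": 130, "b": 963, "c": 738, "d": 59, "e": 30, "f": 72, "g": 8}
y = np.array([p for p, n in counts.items() for _ in range(n)])

# A plain (unshuffled) 4-way split produces folds of 500 consecutive rows.
kf = KFold(n_splits=4)
_, first_test = next(kf.split(y))

# The first fold covers rows 0..499 of the sorted data, so it only
# contains the first two products.
print(sorted(set(y[first_test])))  # ['a', 'b']
```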

The goal of using a stratified K-Fold is to have folds within which the percentage of y-values is approximately the same as in the original input data.

Using the command for a 5-split stratified K-Fold

kf = StratifiedKFold(5, shuffle=True, random_state=42)

we get the following distribution for the first fold:

product  count  percent
a           26    6.50
b          193   48.25
c          148   37.00
d           11    2.75
e            6    1.50
f           15    3.75
g            1    0.25

The remaining folds have similar distributions of rows.

So unlike a simple K-Fold, we now have the desired distribution characteristics.
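The per-fold distribution check can be sketched with the same synthetic label column as above (the real script reads the sorted CSV, so the exact counts may differ slightly from the table):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same synthetic product column: 2000 rows, sorted a..g.
counts = {"a": 130, "b": 963, "c": 738, "d": 59, "e": 30, "f": 72, "g": 8}
y = np.array([p for p, n in counts.items() for _ in range(n)])
X = np.zeros((len(y), 1))  # dummy features; only y matters for stratification

kf = StratifiedKFold(5, shuffle=True, random_state=42)
_, test_idx = next(kf.split(X, y))
fold_y = y[test_idx]

# Per-product percentage in the first fold -- close to the overall
# percentages, which is exactly what stratification is for.
for p in sorted(counts):
    n = int(np.sum(fold_y == p))
    print(f"{p} {n:4d} {100 * n / len(fold_y):6.2f}")
```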

October 26, 2024

pr_class_05_2_kfold_C.py: Intermittently, the prediction of age in fold 1 is printed as:

Fold #1
Train: index=[45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89] size=(45,)
Test:  index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44] size=(45,)
2/2 [==============================] - 0s 1ms/step
prediction
        pred
0   0.984798
1   0.984798
2   0.984798
3   0.984798
4   0.948751
5   0.984798
6   0.984798
7   0.984798
8   0.984798
9   0.984798
10  0.984798
Running the code again, the results are reasonable:

prediction
         pred
0   53.453243
1   42.135712
2   40.878017
3   39.353210
4   41.757721
5   38.513622
6   44.270351
7   47.103832
8   52.380497
9   37.705742
10  54.965900

October 27, 2024

Part 5.2, section "Training with both a Cross-Validation and a Holdout Set"
The performance of the training is measured by the RMSE (the square root of the mean squared error) between the predictions and the test data.

The typical RMSE for a fold in the example seems to be around 0.6.
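A small sketch of how such a fold score is computed, using made-up numbers (not values from the actual run):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical predictions vs. held-out targets, purely to
# illustrate the metric.
y_test = np.array([40.0, 42.0, 39.0, 45.0])
pred = np.array([40.5, 41.2, 39.8, 44.4])

# RMSE = square root of the mean squared error.
score = np.sqrt(mean_squared_error(y_test, pred))
print(f"Fold score (RMSE): {score:.4f}")  # 0.6874
```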

However, I have seen an RMSE of around 24 for the holdout set, which is very high.
Even in J. Heaton's Jupyter notebook, he gets an RMSE of around 24 for fold 5 as well as for the holdout set.

https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_2_kfold.ipynb
Fold #1 Fold score (RMSE): 0.544195299216696
Fold #2 Fold score (RMSE): 0.48070599342910353
Fold #3 Fold score (RMSE): 0.7034584765928998
Fold #4 Fold score (RMSE): 0.5397141785190473
Fold #5 Fold score (RMSE): 24.126205213080077
Cross-validated score (RMSE): 10.801732731207947
Holdout score (RMSE): 24.097657947297677
I have seen similar results; however, today I got a run with much better results (without changing the code):
Fold #1
12/12 [==============================] - 0s 513us/step
Fold score (RMSE): 0.7567204236984253
Fold #2
12/12 [==============================] - 0s 513us/step
Fold score (RMSE): 0.5426404476165771
Fold #3
12/12 [==============================] - 0s 486us/step
Fold score (RMSE): 1.0122915506362915
Fold #4
12/12 [==============================] - 0s 499us/step
Fold score (RMSE): 0.648369312286377
Fold #5
12/12 [==============================] - 0s 512us/step
Fold score (RMSE): 0.5357717871665955
Cross-validated score (RMSE): 0.7210066914558411
7/7 [==============================] - 0s 486us/step
Holdout score (RMSE): 0.4387000294433415
I don't know what causes the results to differ so much between runs when all the external conditions seem to be the same.
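One plausible source of the variation (an assumption on my part, not verified against the script): random_state in StratifiedKFold only fixes the fold assignment, while the network's weight initialization draws from a separate, unseeded random source. A plain-NumPy sketch of the difference seeding makes; in Keras the equivalent knob would be something like keras.utils.set_random_seed:

```python
import numpy as np

def init_weights(seed=None):
    # Stand-in for a layer's random weight initialization.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(4, 4))

# Unseeded: every "run" starts from different weights.
a, b = init_weights(), init_weights()
print(np.allclose(a, b))   # almost certainly False

# Seeded: identical starting weights, hence repeatable behavior.
c, d = init_weights(42), init_weights(42)
print(np.allclose(c, d))   # True
```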
