October 13, 2024
Stratified K-Fold Demonstration
The source file "pr_class_05_2_kfold_B.py" is a modification of a file from the J. Heaton class.
It is intended to demonstrate how stratified K-Fold works.
The input is based on the "jh-simple-dataset.csv" file; however, I sorted the rows by the product column and
am using the modified file "jh-simple-dataset_sorted.csv" as the input.
As a result, the input file now has the following distribution of values in the product
column:
product | counts | percent
--------+--------+--------
a       |    130 |    6.50
b       |    963 |   48.15
c       |    738 |   36.90
d       |     59 |    2.95
e       |     30 |    1.50
f       |     72 |    3.60
g       |      8 |    0.40
So we can see that if we applied a plain (unshuffled) K-Fold split to these 2000 sorted rows with a fold
size of 500, the first fold would only contain rows with a product value of a or b (see the sketch below).
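A minimal sketch of that failure mode (assuming the sorted CSV is in the working directory; with 2000 rows, 4 splits give the fold size of 500):

import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("jh-simple-dataset_sorted.csv")    # rows sorted by 'product'
kf = KFold(n_splits=4, shuffle=False)               # 2000 rows / 4 folds = 500 rows per fold
_, test_idx = next(kf.split(df))                    # indices of the first test fold
print(df.iloc[test_idx]["product"].value_counts())  # only 'a' and 'b' appear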
The goal of using a stratified K-Fold is to have folds within which the percentage of y-values is approximately
the same as in the original input data.
Using the command for a 5-split stratified K-Fold
kf = StratifiedKFold(5, shuffle=True, random_state=42)
we get the following distribution for the first fold:
product | counts | percent
--------+--------+--------
a       |     26 |    6.50
b       |    193 |   48.25
c       |    148 |   37.00
d       |     11 |    2.75
e       |      6 |    1.50
f       |     15 |    3.75
g       |      1 |    0.25
The remaining folds have similar distributions of rows.
So unlike a simple K-Fold, we now have the desired distribution characteristics.
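The per-fold distributions above can be reproduced with a short sketch (assuming the CSV has the 'product' column described above; stratification is done on that column):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("jh-simple-dataset_sorted.csv")
kf = StratifiedKFold(5, shuffle=True, random_state=42)

# Each test fold should mirror the product mix of the full dataset.
for fold, (train_idx, test_idx) in enumerate(kf.split(df, df["product"]), start=1):
    percent = df.iloc[test_idx]["product"].value_counts(normalize=True) * 100
    print(f"Fold #{fold}")
    print(percent.sort_index().round(2))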
October 26, 2024
pr_class_05_2_kfold_C.py
Intermittently, the prediction of age in fold 1 is printed as:
Fold #1
Train: index=[45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89] size=(45,)
Test: index=[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44] size=(45,)
2/2 [==============================] - 0s 1ms/step
prediction
pred
0 0.984798
1 0.984798
2 0.984798
3 0.984798
4 0.948751
5 0.984798
6 0.984798
7 0.984798
8 0.984798
9 0.984798
10 0.984798
Running the code again, the results are reasonable:
prediction
pred
0 53.453243
1 42.135712
2 40.878017
3 39.353210
4 41.757721
5 38.513622
6 44.270351
7 47.103832
8 52.380497
9 37.705742
10 54.965900
October 27, 2024
Part 5.2, section "Training with both a Cross-Validation and a Holdout Set"
The performance of the training is measured by taking the RMSE (the square root of the mean_squared_error)
of the differences between the predictions and the test data, as in the snippet below.
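For reference, a minimal sketch of the metric (the array values here are made up for illustration; in the script, pred comes from model.predict(x_test) and y_test holds the true targets):

import numpy as np
from sklearn.metrics import mean_squared_error

y_test = np.array([40.0, 42.0, 39.0])
pred = np.array([41.0, 41.5, 38.0])

score = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE: {score:.4f}")  # sqrt((1.0 + 0.25 + 1.0) / 3) ≈ 0.8660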
The typical RMSE for a fold in the example seems to be around 0.6
However, I have seen an RMSE for the holdout set of around 24, which is very high.
Even in J. Heaton's Jupyter notebook, he gets an RMSE of around 24 for fold number 5 as well as for the holdout set.
https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_2_kfold.ipynb
Fold #1
Fold score (RMSE): 0.544195299216696
Fold #2
Fold score (RMSE): 0.48070599342910353
Fold #3
Fold score (RMSE): 0.7034584765928998
Fold #4
Fold score (RMSE): 0.5397141785190473
Fold #5
Fold score (RMSE): 24.126205213080077
Cross-validated score (RMSE): 10.801732731207947
Holdout score (RMSE): 24.097657947297677
I have seen similar results; however, today I got a run with much better results (without changing the code):
Fold #1
12/12 [==============================] - 0s 513us/step
Fold score (RMSE): 0.7567204236984253
Fold #2
12/12 [==============================] - 0s 513us/step
Fold score (RMSE): 0.5426404476165771
Fold #3
12/12 [==============================] - 0s 486us/step
Fold score (RMSE): 1.0122915506362915
Fold #4
12/12 [==============================] - 0s 499us/step
Fold score (RMSE): 0.648369312286377
Fold #5
12/12 [==============================] - 0s 512us/step
Fold score (RMSE): 0.5357717871665955
Cross-validated score (RMSE): 0.7210066914558411
7/7 [==============================] - 0s 486us/step
Holdout score (RMSE): 0.4387000294433415
I don't know what causes the results to be so different between runs when all the external conditions
seem to be the same.
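One likely contributor is that Keras draws different random initial weights on every run, so training can start from a good or a bad initialization. A sketch for pinning the seeds (assuming TensorFlow 2.x, as used in the course code) to make runs repeatable:

import random
import numpy as np
import tensorflow as tf

random.seed(42)         # Python's own RNG
np.random.seed(42)      # NumPy RNG (also used by the scikit-learn splits)
tf.random.set_seed(42)  # seeds Keras weight initializers and shuffling

With the seeds fixed, identical runs should produce identical fold scores, which would make it easier to catch the runs where the predictions collapse to a near-constant value.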