Electrocardiogram

We are dealing with an extremely imbalanced dataset of electrocardiogram (ECG) signals with two classes, labeled good (0) and bad (1).
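
As a quick sanity check on that imbalance, the label distribution can be inspected before any preprocessing (the file name below is hypothetical; the test-set confusion counts reported further down imply roughly 97% good vs. 3% bad signals):

import pandas

# Hypothetical file name; point this at the actual dataset
data_frame = pandas.read_csv("ecg_signals.csv")
# Absolute counts and class ratios for the binary label
print(data_frame["label"].value_counts())
print(data_frame["label"].value_counts(normalize=True))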

STEP 1: Fill missing values

Every column in the data contains missing values, between 25 and 70 per column. We fill all of them with sklearn.impute.KNNImputer, which imputes each missing entry from its 5 nearest neighbors:

import pandas
from sklearn.impute import KNNImputer

# Inside the preprocessing function (`columns`, `data_directory`, and `write_textfile`
# are defined elsewhere in the module): impute each missing value from its 5 nearest neighbors
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_frame)
data_frame_imputed = pandas.DataFrame(data_imputed, columns=columns)

# Verify that no missing values remain and log the per-column counts
missing_value_counts = data_frame_imputed.isna().sum()
write_textfile(f"{data_directory}/no_missing.txt", missing_value_counts)
return data_frame_imputed

STEP 2: Scaling

We handle scaling with sklearn.preprocessing.RobustScaler, which centers each feature on its median and scales by the interquartile range, so extreme ECG amplitudes have limited influence on the scaling:

from sklearn.preprocessing import RobustScaler

# Keep the label aside, scale only the feature columns, then re-attach the label
labels = data_frame["label"]
x = data_frame.drop("label", axis=1)
scaler = RobustScaler()
x_scale = scaler.fit_transform(x)
data_frame_scaled = pandas.DataFrame(x_scale, columns=x.columns)
data_frame_scaled["label"] = labels.values

STEP 3: k-fold cross validation + stratify classes + balancing training data

First, we split the dataset into a train set (85%) and a test set (15%). To make sure the majority and minority classes are distributed fairly across both parts, we pass stratify=y:

from sklearn.model_selection import train_test_split

# Hold out 15% of the data as a test set, preserving the class ratio via stratify=y
x_train, x_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.15,
    stratify=y,
    random_state=42,
)

Then, on the training set, we use sklearn.model_selection.StratifiedKFold so the same class distribution also holds across the training and validation folds:

from sklearn.model_selection import StratifiedKFold
import tqdm

# X and y here are the 85% training portion from the split above; the held-out test set is never folded
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for fold_num, (train_idx, val_idx) in enumerate(
    tqdm.tqdm(skf.split(X, y), total=skf.n_splits, desc="Training Folds"), start=1
):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

Finally, we apply one of the balancing methods from imblearn.over_sampling (ADASYN, SMOTE, SVMSMOTE, BorderlineSMOTE, or KMeansSMOTE) to augment samples of the minority class in the training folds only:

# Oversample the minority (bad) class inside the current training fold only;
# the validation and test data are never resampled
if smote:
    if smote_method.lower() == "kmeans":
        sampler = KMeansSMOTE(
            k_neighbors=5,
            cluster_balance_threshold=0.1,
            random_state=random_state,
        )
    elif smote_method.lower() == "smote":
        sampler = SMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "svmsmote":
        sampler = SVMSMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "borderline":
        sampler = BorderlineSMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "adasyn":
        sampler = ADASYN(n_neighbors=5, random_state=random_state)
    else:
        raise ValueError(f"Unknown smote_method: {smote_method}")

    X_train, y_train = sampler.fit_resample(X_train, y_train)

model.fit(X_train, y_train)
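
A natural follow-up inside each fold, not shown in the repository snippet above, is to score the fitted model on the fold's untouched validation split. This is a hedged sketch, assuming a `fold_scores` list created before the StratifiedKFold loop:

from sklearn.metrics import fbeta_score, precision_score, recall_score

# `fold_scores` is assumed to be an empty list created before the loop
y_val_pred = model.predict(X_val)
fold_scores.append({
    "fold": fold_num,
    "recall_macro": recall_score(y_val, y_val_pred, average="macro"),
    "precision_macro": precision_score(y_val, y_val_pred, average="macro"),
    "f2_macro": fbeta_score(y_val, y_val_pred, beta=2, average="macro"),
})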

STEP 4: Train different models to find the best possible approach

What we are looking for:

Dangerous: a sick (bad, class 1) signal predicted as healthy → we want high recall for class 1, i.e. as few false negatives (FN) as possible.

Costly: a healthy (good, class 0) signal predicted as sick → we want high precision for class 1, i.e. as few false positives (FP) as possible.
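
On the held-out test set these priorities map directly onto class-1 metrics. A minimal sketch, assuming the fitted model and the x_test/y_test split from STEP 3:

from sklearn.metrics import confusion_matrix, fbeta_score, precision_score, recall_score

y_pred = model.predict(x_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

# Dangerous: bad signals predicted good are the FN count -> track recall of class 1
recall_bad = recall_score(y_test, y_pred, pos_label=1)
# Costly: good signals predicted bad are the FP count -> track precision of class 1
precision_bad = precision_score(y_test, y_pred, pos_label=1)
# F2 treats recall as twice as important as precision, matching the priority above
f2_bad = fbeta_score(y_test, y_pred, beta=2, pos_label=1)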

STEP 5: Compare current results

Current results obtained with KMEANS_SMOTE:

| model | stage | accuracy | f1_macro | f2_macro | recall_macro | precision_macro | f1_class0 | f1_class1 | f2_class0 | f2_class1 | recall_class0 | recall_class1 | precision_class0 | precision_class1 | TP | TN | FP | FN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CatBoost_balanced_knn10 | train | 0.9843784049402589 | 0.8696686267343388 | 0.8824472728294012 | 0.8916952848998795 | 0.8508242781484853 | 0.9919396338322237 | 0.7473976196364541 | 0.9908276010500254 | 0.7740669446087769 | 0.9900881006639566 | 0.7933024691358025 | 0.9938004847319636 | 0.7078480715650071 | 789 | 26898 | 140 | 19 |
| CatBoost_balanced_knn10 | test | 0.9802604802604803 | 0.8348421298822796 | 0.8461546793313885 | 0.8541662696976049 | 0.8176680164072361 | 0.9898162729658793 | 0.6798679867986799 | 0.988757446094471 | 0.703551912568306 | 0.9880528191154894 | 0.7202797202797203 | 0.991586032814472 | 0.64375 | 103 | 4714 | 57 | 40 |
| LGBM_KMEANS_SMOTE_knn10 | train | 0.9883286128479746 | 0.8784419356817057 | 0.8436008106620193 | 0.8240767336379762 | 0.9582821430574249 | 0.9940169232360254 | 0.7628669481273861 | 0.9966698960611392 | 0.6905317252628993 | 0.9984466771524954 | 0.6497067901234568 | 0.9896275269971563 | 0.9269367591176938 | 775 | 27036 | 2 | 33 |
| LGBM_KMEANS_SMOTE_knn10 | test | 0.9865689865689866 | 0.8543196878009516 | 0.8121616449258658 | 0.7895809912158687 | 0.9600745182511498 | 0.9931221342225928 | 0.7155172413793104 | 0.9964866786565728 | 0.6278366111951589 | 0.9987424020121568 | 0.5804195804195804 | 0.9875647668393782 | 0.9325842696629213 | 83 | 4765 | 6 | 60 |

Tuning LightGBM and CatBoost

As implemented in the tune function of models/catboost_model.py, we searched over the following parameters for this model:

  scaling_methods = [
      "standard_scaling",
      "robust_scaling",
      "minmax_scaling",
      "yeo_johnson",
  ]
  sampling_methods = [
      "KMeansSMOTE",
      "class_weight",
  ]
  learning_rate_list = [0.03, 0.05, 0.1]
  depth_list = [6, 8]
  l2_leaf_reg_list = [1, 3]
  subsample_list = [0.8, 1.0]
  k_neighbors_list = [10]
  kmeans_estimator_list = [5]

Similarly, the tune function of models/lightgbm_model.py searched over the following parameters (a sketch of the search loop follows the list):

  scaling_methods = [
      "standard_scaling",
      "robust_scaling",
      "minmax_scaling",
      "yeo_johnson",
  ]
  sampling_methods = [
      "KMeansSMOTE",
      "class_weight",
  ]
  boosting_type_list = ["gbdt", "dart"]
  learning_rate_list = [0.03, 0.05, 0.1]
  number_of_leaves_list = [100]
  l2_regularization_lambda_list = [0.1, 0.5]
  l1_regularization_alpha_list = [0.1, 0.5]
  tree_subsample_tree_list = [0.8, 1.0]
  subsample_list = [0.8, 1.0]
  kmeans_smote_k_neighbors_list = [10]
  kmeans_smote_n_clusters_list = [5]
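
Neither tune function is reproduced here. As a hedged sketch of how such a grid could be searched, the loop below iterates over the LightGBM lists above with itertools.product and keeps the combination with the best macro-F2 on a validation fold; the scaling/sampling choices and the tree/feature subsample list are omitted for brevity, and X_train, y_train, X_val, y_val are assumed to come from the stratified fold loop in STEP 3:

import itertools
from lightgbm import LGBMClassifier
from sklearn.metrics import fbeta_score

best_score, best_params = -1.0, None
for boosting_type, lr, num_leaves, reg_lambda, reg_alpha, subsample in itertools.product(
    boosting_type_list, learning_rate_list, number_of_leaves_list,
    l2_regularization_lambda_list, l1_regularization_alpha_list, subsample_list,
):
    model = LGBMClassifier(
        boosting_type=boosting_type,
        learning_rate=lr,
        num_leaves=num_leaves,
        reg_lambda=reg_lambda,
        reg_alpha=reg_alpha,
        subsample=subsample,
        subsample_freq=1,  # makes subsample take effect for gbdt
        random_state=42,
    )
    model.fit(X_train, y_train)
    # Macro F2 on the untouched validation fold rewards recall on the rare bad class
    score = fbeta_score(y_val, model.predict(X_val), beta=2, average="macro")
    if score > best_score:
        best_score, best_params = score, model.get_params()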

After tuning, we trained both models with their best parameters and compared them on the imbalanced test set. Here are the comparison results:

| model | accuracy | f1_macro | f2_macro | recall_macro | precision_macro | f1_class0 | f2_class0 | recall_class0 | precision_class0 | f1_class1 | f2_class1 | recall_class1 | precision_class1 | TP | TN | FP | FN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| catboost | 0.9814814814814815 | 0.8195693865042805 | 0.8013174756506312 | 0.7903526990720451 | 0.8559205703525894 | 0.9904901243599122 | 0.9921698350221925 | 0.9932928107315029 | 0.9877032096706961 | 0.6486486486486487 | 0.6104651162790697 | 0.5874125874125874 | 0.7241379310344828 | 84 | 4739 | 32 | 59 |
| lightgbm | 0.9849409849409849 | 0.8469442386692707 | 0.8185917013944679 | 0.8023094072140393 | 0.9084632979829487 | 0.9922755741127348 | 0.9946427824048885 | 0.9962272060364703 | 0.9883551673944687 | 0.7016129032258065 | 0.6425406203840472 | 0.6083916083916084 | 0.8285714285714286 | 87 | 4753 | 18 | 56 |
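
As a sanity check, the class-1 scores follow directly from the confusion counts (class 1 is the positive class): for the catboost row, recall_class1 = TP / (TP + FN) = 84 / (84 + 59) ≈ 0.587 and precision_class1 = TP / (TP + FP) = 84 / (84 + 32) ≈ 0.724, matching the table.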

Next steps:

✅ 1. Apply Stratified K-Fold only to the training data.
✅ 2. Train the LGBM model using KMEANS_SMOTE with k_neighbors=10 (fine-tuning remaining).
✅ 3. Train CatBoost using KMEANS_SMOTE with k_neighbors=10 (fine-tuning remaining).
🗹 4. Implement the methods proposed in this article: https://1drv.ms/b/c/ab2a38fe5c318317/IQBEDsSFcYj6R6AMtOnh0X6DAZUlFqAYq19WT8nTeXomFwg
🗹 5. Compare the proposed model with SMOTE vs. other oversampling balancing methods.