Electrocardiogram
We are dealing with an extremely imbalanced dataset of electrocardiogram signals with binary labels: good (0) and bad (1).
STEP 1: Fill missing values
All the columns in our data contain missing values, ranging from roughly 25 to 70 per column. Using KNNImputer from sklearn.impute, we fill each missing value from its 5 nearest neighbors.
from sklearn.impute import KNNImputer
import pandas

# Inside our preprocessing function: impute each missing value from its
# 5 nearest neighbors, then verify nothing is left.
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_frame)
data_frame_imputed = pandas.DataFrame(data_imputed, columns=columns)
missing_value_counts = data_frame_imputed.isna().sum()
write_textfile(f"{data_directory}/no_missing.txt", missing_value_counts)  # project logging helper
return data_frame_imputed
STEP 2: Scaling
We used RobustScaler from sklearn.preprocessing to scale the features; it centers on the median and scales by the interquartile range, so outliers in the signals do not dominate the scaling.
from sklearn.preprocessing import RobustScaler

# Scale only the features; keep the label column untouched.
labels = data_frame["label"]
scaler = RobustScaler()
x = data_frame.drop("label", axis=1)
x_scale = scaler.fit_transform(x)
data_frame_scaled = pandas.DataFrame(x_scale, columns=x.columns)
data_frame_scaled["label"] = labels.values
STEP 3: K-fold cross-validation + stratified classes + balancing the training data
First, we split the dataset into two parts: train (85%) and test (15%). To make sure the majority and minority classes are distributed fairly across both splits, we pass stratify=y:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.15,
stratify=y,
random_state=42,
)
Then, on the train set we use StratifiedKFold from sklearn.model_selection so that the same class distribution also holds across the train and validation folds.
import tqdm
from sklearn.model_selection import StratifiedKFold

# X, y here are the 85% training portion from the split above.
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for fold_num, (train_idx, val_idx) in enumerate(
    tqdm.tqdm(skf.split(X, y), total=skf.n_splits, desc="Training Folds"), start=1
):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
Finally, we use one of the balancing methods from imblearn.over_sampling (ADASYN, SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE) to oversample only the training fold:
from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE

# Still inside the fold loop: resample only the training fold; the validation
# fold is left untouched.
if smote:
    if smote_method.lower() == "kmeans":
        sampler = KMeansSMOTE(
            k_neighbors=5,
            cluster_balance_threshold=0.1,
            random_state=random_state,
        )
    elif smote_method.lower() == "smote":
        sampler = SMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "svmsmote":
        sampler = SVMSMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "borderline":
        sampler = BorderlineSMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "adasyn":
        sampler = ADASYN(n_neighbors=5, random_state=random_state)
    else:
        raise ValueError(f"Unknown smote_method: {smote_method}")
    X_train, y_train = sampler.fit_resample(X_train, y_train)

model.fit(X_train, y_train)
STEP 4: Train different models to find the best possible approach
What we are looking for (see the metric sketch below):
- Dangerous: sick predicted as healthy → we want high recall, i.e. few false negatives (low FN).
- Costly: healthy predicted as sick → we want high precision, i.e. few false positives (low FP).
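To make these trade-offs measurable per fold, here is a minimal sketch of how the per-class precision/recall and the F2 scores reported in the tables below could be computed with scikit-learn. It assumes model, X_val, and y_val from the fold loop in STEP 3; y_pred is just an illustrative name.

from sklearn.metrics import confusion_matrix, fbeta_score, precision_score, recall_score

# Predictions on the current validation fold.
y_pred = model.predict(X_val)

# With class 1 ("bad") treated as the positive class:
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
recall_class1 = recall_score(y_val, y_pred, pos_label=1)        # high recall  -> few dangerous FN
precision_class1 = precision_score(y_val, y_pred, pos_label=1)  # high precision -> few costly FP
f2_macro = fbeta_score(y_val, y_pred, beta=2, average="macro")   # F2 weights recall more than precision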
STEP 5: Current results obtained with KMeansSMOTE:
| model | stage | accuracy | f1_macro | f2_macro | recall_macro | precision_macro | f1_class0 | f1_class1 | f2_class0 | f2_class1 | recall_class0 | recall_class1 | precision_class0 | precision_class1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CatBoost_balanced_knn10 | train | 0.9843784049402589 | 0.8696686267343388 | 0.8824472728294012 | 0.8916952848998795 | 0.8508242781484853 | 0.9919396338322237 | 0.7473976196364541 | 0.9908276010500254 | 0.7740669446087769 | 0.9900881006639566 | 0.7933024691358025 | 0.9938004847319636 | 0.7078480715650071 | 789 | 26898 | 140 | 19 |
| CatBoost_balanced_knn10 | test | 0.9802604802604803 | 0.8348421298822796 | 0.8461546793313885 | 0.8541662696976049 | 0.8176680164072361 | 0.9898162729658793 | 0.6798679867986799 | 0.988757446094471 | 0.703551912568306 | 0.9880528191154894 | 0.7202797202797203 | 0.991586032814472 | 0.64375 | 103 | 4714 | 57 | 40 |
| LGBM_KMEANS_SMOTE_knn10 | train | 0.9883286128479746 | 0.8784419356817057 | 0.8436008106620193 | 0.8240767336379762 | 0.9582821430574249 | 0.9940169232360254 | 0.7628669481273861 | 0.9966698960611392 | 0.6905317252628993 | 0.9984466771524954 | 0.6497067901234568 | 0.9896275269971563 | 0.9269367591176938 | 775 | 27036 | 2 | 33 |
| LGBM_KMEANS_SMOTE_knn10 | test | 0.9865689865689866 | 0.8543196878009516 | 0.8121616449258658 | 0.7895809912158687 | 0.9600745182511498 | 0.9931221342225928 | 0.7155172413793104 | 0.9964866786565728 | 0.6278366111951589 | 0.9987424020121568 | 0.5804195804195804 | 0.9875647668393782 | 0.9325842696629213 | 83 | 4765 | 6 | 60 |
Tuning LightGBM and CatBoost
As written in the tune function of models/catboost_model.py, we searched over the following parameter grid for this model (a sketch of how such a grid search could be wired up follows the list):
scaling_methods = [
"standard_scaling",
"robust_scaling",
"minmax_scaling",
"yeo_johnson",
]
sampling_methods = [
"KMeansSMOTE",
"class_weight",
]
learning_rate_list = [0.03, 0.05, 0.1]
depth_list = [6, 8]
l2_leaf_reg_list = [1, 3]
subsample_list = [0.8, 1.0]
k_neighbors_list = [10]
kmeans_estimator_list = [5]
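The full search lives in the tune function of models/catboost_model.py; the snippet below is only a minimal sketch of how a grid search over the numeric lists above could look. The helper evaluate_f2_macro and the variables X_train_full / y_train_full are hypothetical placeholders for the STEP 3 cross-validation loop, and the scaling/sampling methods are omitted for brevity.

import itertools
from catboost import CatBoostClassifier

best_score, best_params = -1.0, None
for lr, depth, l2, subsample in itertools.product(
    learning_rate_list, depth_list, l2_leaf_reg_list, subsample_list
):
    model = CatBoostClassifier(
        learning_rate=lr,
        depth=depth,
        l2_leaf_reg=l2,
        subsample=subsample,
        random_state=42,
        verbose=0,
    )
    # evaluate_f2_macro is a hypothetical helper: it would run the stratified
    # CV + KMeansSMOTE loop from STEP 3 and return the mean validation f2_macro.
    score = evaluate_f2_macro(model, X_train_full, y_train_full)
    if score > best_score:
        best_score = score
        best_params = dict(learning_rate=lr, depth=depth, l2_leaf_reg=l2, subsample=subsample)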
Also, for the tune function in models/lightgbm_model.py we used the following parameters (a short sketch of how these map onto LGBMClassifier arguments follows the list):
scaling_methods = [
"standard_scaling",
"robust_scaling",
"minmax_scaling",
"yeo_johnson",
]
sampling_methods = [
"KMeansSMOTE",
"class_weight",
]
boosting_type_list = ["gbdt", "dart"]
learning_rate_list = [0.03, 0.05, 0.1]
number_of_leaves_list = [100]
l2_regularization_lambda_list = [0.1, 0.5]
l1_regularization_alpha_list = [0.1, 0.5]
tree_subsample_tree_list = [0.8, 1.0]
subsample_list = [0.8, 1.0]
kmeans_smote_k_neighbors_list = [10]
kmeans_smote_n_clusters_list = [5]
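For reference, here is a minimal sketch of how one combination from this grid might be instantiated. The mapping of our list names onto LightGBM argument names (num_leaves, reg_lambda, reg_alpha, colsample_bytree, subsample) and the KMeansSMOTE configuration are assumptions made for this sketch, not the project's actual tune code.

from lightgbm import LGBMClassifier
from imblearn.over_sampling import KMeansSMOTE

# One illustrative combination from the grid above.
model = LGBMClassifier(
    boosting_type="gbdt",      # boosting_type_list
    learning_rate=0.05,        # learning_rate_list
    num_leaves=100,            # number_of_leaves_list
    reg_lambda=0.1,            # l2_regularization_lambda_list
    reg_alpha=0.1,             # l1_regularization_alpha_list
    colsample_bytree=0.8,      # tree_subsample_tree_list (feature fraction per tree, assumed)
    subsample=0.8,             # subsample_list
    subsample_freq=1,          # needed for subsample (bagging) to take effect
    random_state=42,
)
sampler = KMeansSMOTE(k_neighbors=10, kmeans_estimator=5, random_state=42)  # kmeans_smote_* lists
X_res, y_res = sampler.fit_resample(X_train, y_train)
model.fit(X_res, y_res)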
After tuning, we trained both models with their best parameters and compared them on the imbalanced test set. Here are the comparison results:
| model | accuracy | f1_macro | f2_macro | recall_macro | precision_macro | f1_class0 | f2_class0 | recall_class0 | precision_class0 | f1_class1 | f2_class1 | recall_class1 | precision_class1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| catboost | 0.9814814814814815 | 0.8195693865042805 | 0.8013174756506312 | 0.7903526990720451 | 0.8559205703525894 | 0.9904901243599122 | 0.9921698350221925 | 0.9932928107315029 | 0.9877032096706961 | 0.6486486486486487 | 0.6104651162790697 | 0.5874125874125874 | 0.7241379310344828 | 84 | 4739 | 32 | 59 |
| lightgbm | 0.9849409849409849 | 0.8469442386692707 | 0.8185917013944679 | 0.8023094072140393 | 0.9084632979829487 | 0.9922755741127348 | 0.9946427824048885 | 0.9962272060364703 | 0.9883551673944687 | 0.7016129032258065 | 0.6425406203840472 | 0.6083916083916084 | 0.8285714285714286 | 87 | 4753 | 18 | 56 |
Next steps:
✅ 1. Apply stratified K-fold only on the train set.
✅ 2. Train the LGBM model using KMeansSMOTE with k_neighbors=10 (fine-tuning still remaining).
✅ 3. Train the CatBoost model using KMeansSMOTE with k_neighbors=10 (fine-tuning still remaining).
🗹 4. Implement the methods proposed in this article: https://1drv.ms/b/c/ab2a38fe5c318317/IQBEDsSFcYj6R6AMtOnh0X6DAZUlFqAYq19WT8nTeXomFwg
🗹 5. Compare the proposed model with SMOTE vs. plain oversampling as the balancing method (see the sketch below).
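For item 5, if plain random oversampling is what we compare SMOTE against, a minimal sketch using imblearn's RandomOverSampler could look like this; the choice of sampler and the variable names are assumptions, and the evaluation would reuse the STEP 3 fold loop.

from imblearn.over_sampling import SMOTE, RandomOverSampler

# Swap the sampler inside the STEP 3 fold loop to compare the two strategies.
for name, sampler in [
    ("smote", SMOTE(k_neighbors=5, random_state=42)),
    ("random_oversampling", RandomOverSampler(random_state=42)),
]:
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    model.fit(X_res, y_res)  # then evaluate on the untouched validation fold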