We are working with an extremely imbalanced dataset of electrocardiogram (ECG) signals. The task is binary classification: each signal is labeled as good (0) or bad (1).

STEP 1: Fill missing values

Every column in our data contains missing values, ranging from 25 to 70. We fill them with KNNImputer (from sklearn.impute import KNNImputer), which replaces each missing value with the mean of that feature over the sample's 5 nearest neighbors.

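import pandas
from sklearn.impute import KNNImputer

# Body of the imputation function: fill every missing value from 5 neighbors.
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_frame)
data_frame_imputed = pandas.DataFrame(data_imputed, columns=columns)

# Sanity check: every count written to the file should be 0 after imputation.
missing_value_counts = data_frame_imputed.isna().sum()
write_textfile(f"{data_directory}/no_missing.txt", missing_value_counts)
return data_frame_imputed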
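To see what the imputer does to a single missing cell, here is a minimal sketch on a toy frame. The column names `a` and `b`, the values, and `n_neighbors=2` are made up for illustration and are not taken from the ECG data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (illustrative only): one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})

# Distances are computed on the columns that are present, so for row 2 the
# two nearest rows are rows 1 and 3 (closest values in "b").
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled.loc[2, "a"])               # (2.0 + 4.0) / 2 = 3.0
print(int(filled.isna().sum().sum()))   # 0 missing values left
```

In a real pipeline it is usually safer to call `fit` on the training split and only `transform` the test split, so that test rows do not influence the imputed values.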

STEP 2: Scaling

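We scale the features with RobustScaler (from sklearn.preprocessing import RobustScaler). It centers each feature on its median and divides by its interquartile range, so outliers have less influence than they would with mean/standard-deviation scaling. The label column is excluded from scaling and added back afterwards.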

scaler = RobustScaler()
labels = data_frame["label"]  # keep the target out of the scaler
x = data_frame.drop("label", axis=1)
x_scale = scaler.fit_transform(x)
data_frame_scaled = pandas.DataFrame(x_scale, columns=x.columns)
data_frame_scaled["label"] = labels.values
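As a small illustration of what RobustScaler computes with its default settings, the sketch below scales a toy feature and reproduces the result by hand. The numbers are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature (illustrative only) with one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Same result by hand: subtract the median, divide by the interquartile range.
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

print(scaled.ravel())   # median 3 and IQR 2 give [-1, -0.5, 0, 0.5, 48.5]
```

The outlier stays large after scaling, but it does not distort the scale of the other four values, which is the reason for preferring this scaler on noisy signals.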

STEP 3: k-fold cross validation + stratify classes + balancing training data

First, we split the dataset into two parts: train (85%) and test (15%). To make sure the majority and minority classes keep the same proportions in both parts, we pass stratify=y.

x_train, x_test, y_train, y_test = train_test_split(
  X,
  y,
  test_size=0.15,
  stratify=y,
  random_state=42,
)
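The effect of `stratify=y` can be checked on synthetic data. This is only a sketch: the 5% positive rate and the feature matrix are assumptions for the demo, not properties of the ECG dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 2000 samples, 5% positives (label 1).
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))
y = np.array([1] * 100 + [0] * 1900)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# The positive rate should be (almost) identical in every part.
for name, part in [("all", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:5s} n={len(part):4d} positive rate={part.mean():.4f}")
```

Without `stratify`, a 15% test split of a very rare class can end up with noticeably fewer (or more) positives than expected, which makes test metrics unstable.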

Then, on the training set only, we use StratifiedKFold (from sklearn.model_selection import StratifiedKFold) so that the same class distribution is preserved in the training and validation data of every fold.

# X and y here are the 85% training split; the test split is never passed in.
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for fold_num, (train_idx, val_idx) in enumerate(
    tqdm.tqdm(skf.split(X, y), total=skf.n_splits, desc="Training Folds"), start=1
):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
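A quick way to confirm that each fold receives its share of the rare class is to count positives per validation fold. The sketch below uses synthetic data (1000 samples, 50 positives, 5 folds), all of which are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, illustrative training set: 1000 samples, 50 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_positives = []
for fold_num, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    # Count how many rare-class samples landed in this validation fold.
    val_positives.append(int(y[val_idx].sum()))
    print(
        f"fold {fold_num}: train={len(train_idx)} val={len(val_idx)} "
        f"val positives={val_positives[-1]}"
    )
```

With 50 positives and 5 folds, every validation fold holds 10 positives, so no fold is evaluated on a sample that lacks the minority class.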

Finally, we apply one of these oversampling methods (from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE) to generate synthetic minority samples for the training data only. Validation and test data are never resampled.

if smote:
    method = smote_method.lower()
    if method == "kmeans":
        sampler = KMeansSMOTE(
            k_neighbors=5,
            cluster_balance_threshold=0.1,
            random_state=random_state,
        )
    elif method == "smote":
        sampler = SMOTE(k_neighbors=5, random_state=random_state)
    elif method == "svmsmote":
        sampler = SVMSMOTE(k_neighbors=5, random_state=random_state)
    elif method == "borderline":
        sampler = BorderlineSMOTE(k_neighbors=5, random_state=random_state)
    elif method == "adasyn":
        sampler = ADASYN(n_neighbors=5, random_state=random_state)
    else:
        raise ValueError(f"Unknown smote_method: {smote_method}")

    # Resample the training fold only; X_val and y_val stay untouched.
    X_train, y_train = sampler.fit_resample(X_train, y_train)

model.fit(X_train, y_train)
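To make the idea behind these samplers concrete, the following is a simplified sketch of SMOTE-style interpolation written with NumPy and scikit-learn. It is not imblearn's implementation: the helper name `smote_like_samples` is hypothetical, and the variants listed above add their own logic (clustering, borderline detection, density weighting) on top of this basic step.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k_neighbors=5, random_state=42):
    """Simplified SMOTE-style interpolation (hypothetical helper, not imblearn)."""
    rng = np.random.default_rng(random_state)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    neighbors = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_minority), size=n_new)   # seed minority points
    choice = rng.integers(0, k_neighbors, size=n_new)     # which neighbor to use
    partner = neighbors[base, choice]
    gap = rng.random((n_new, 1))                          # position on the segment
    # Each synthetic point lies on the line between a seed and one neighbor.
    return X_minority[base] + gap * (X_minority[partner] - X_minority[base])


# Illustrative minority class: 20 random points with 3 features.
X_min = np.random.default_rng(0).normal(size=(20, 3))
synthetic = smote_like_samples(X_min, n_new=80)
print(synthetic.shape)  # (80, 3)
```

Because every synthetic point is built from existing training rows, resampling has to happen after the fold split. If it ran before the split, points interpolated from validation rows would leak into training and inflate the validation scores.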

STEP 4: Train different models to find the best possible approach

What we are looking for:

Dangerous: sick → predicted healthy (false negative, FN). We want a high recall score, i.e. a low FN count.

Costly: healthy → predicted sick (false positive, FP). We want a high precision score, i.e. a low FP count.
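The two error types map directly onto scikit-learn metrics. The sketch below uses ten invented labels, not project data, to show how FN and FP drive recall, precision, and the recall-weighted F2 score.

```python
from sklearn.metrics import confusion_matrix, fbeta_score, precision_score, recall_score

# Toy labels (illustrative only): 1 = sick / bad signal, 0 = healthy / good signal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # TP / (TP + FN): drops with missed sick cases
precision = precision_score(y_true, y_pred)  # TP / (TP + FP): drops with false alarms
f2 = fbeta_score(y_true, y_pred, beta=2)     # F-beta with beta=2 weights recall more

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"recall={recall:.3f} precision={precision:.3f} F2={f2:.3f}")
```

Since missing a sick case is the more dangerous error here, F2 is a reasonable single number to rank models by, with precision reported alongside it to keep false alarms visible.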

STEP 5: Results

Current results using KMEANS_SMOTE:

model	stage	accuracy	f1_macro	f2_macro	recall_macro	precision_macro	f1_class0	f1_class1	f2_class0	f2_class1	recall_class0	recall_class1	precision_class0	precision_class1	TP	TN	FP	FN
CatBoost_balanced_knn10	train	0.9843784049402589	0.8696686267343388	0.8824472728294012	0.8916952848998795	0.8508242781484853	0.9919396338322237	0.7473976196364541	0.9908276010500254	0.7740669446087769	0.9900881006639566	0.7933024691358025	0.9938004847319636	0.7078480715650071	789	26898	140	19
CatBoost_balanced_knn10	test	0.9802604802604803	0.8348421298822796	0.8461546793313885	0.8541662696976049	0.8176680164072361	0.9898162729658793	0.6798679867986799	0.988757446094471	0.703551912568306	0.9880528191154894	0.7202797202797203	0.991586032814472	0.64375	103	4714	57	40
LGBM_KMEANS_SMOTE_knn10	train	0.9883286128479746	0.8784419356817057	0.8436008106620193	0.8240767336379762	0.9582821430574249	0.9940169232360254	0.7628669481273861	0.9966698960611392	0.6905317252628993	0.9984466771524954	0.6497067901234568	0.9896275269971563	0.9269367591176938	775	27036	2	33
LGBM_KMEANS_SMOTE_knn10	test	0.9865689865689866	0.8543196878009516	0.8121616449258658	0.7895809912158687	0.9600745182511498	0.9931221342225928	0.7155172413793104	0.9964866786565728	0.6278366111951589	0.9987424020121568	0.5804195804195804	0.9875647668393782	0.9325842696629213	83	4765	6	60
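The class-1 scores in the two test rows can be recomputed from the TP/TN/FP/FN columns. The helper below is a hypothetical convenience function written for this check; only the counts are taken from the table.

```python
def class1_metrics(tp, tn, fp, fn, beta=2.0):
    """Recompute positive-class scores from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f_beta": f_beta}


# Counts copied from the two test rows of the table.
catboost_test = class1_metrics(tp=103, tn=4714, fp=57, fn=40)
lgbm_test = class1_metrics(tp=83, tn=4765, fp=6, fn=60)

for name, scores in [("CatBoost", catboost_test), ("LGBM", lgbm_test)]:
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

On the test set, CatBoost catches more bad signals (recall 0.72 vs 0.58, FN 40 vs 60), while LGBM raises far fewer false alarms (precision 0.93 vs 0.64, FP 6 vs 57). The train rows are not checked here because their counts do not reproduce the reported scores directly, which suggests they are aggregated differently (for example, averaged over folds).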

Next steps:

✅ 1. Apply stratified k-fold to the training set only.
🗹 2. Train the LGBM model using KMEANS_SMOTE with knn k_neighbors=10 (fine-tuning remains).
🗹 3. Train CatBoost using KMEANS_SMOTE with knn k_neighbors=10 (fine-tuning remains).
🗹 4. Implement the methods proposed in this article: https://1drv.ms/b/c/ab2a38fe5c318317/IQBEDsSFcYj6R6AMtOnh0X6DAZUlFqAYq19WT8nTeXomFwg
🗹 5. Compare the proposed model with SMOTE vs. the oversampling balancing method.