# Electrocardiogram

We are dealing with an extremely imbalanced dataset of electrocardiogram signals with two classes, labeled as good (0) and bad (1) signals.

## STEP 1: Fill missing values

All the columns in our data contain missing values, ranging from 25 to 70 per column. Using `from sklearn.impute import KNNImputer`, we fill each missing value from its 5 nearest neighbors.

```
import pandas
from sklearn.impute import KNNImputer

# Excerpt from the imputation function; write_textfile and data_directory are project helpers.
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_frame)
data_frame_imputed = pandas.DataFrame(data_imputed, columns=data_frame.columns)
missing_value_counts = data_frame_imputed.isna().sum()  # should be 0 for every column
write_textfile(f"{data_directory}/no_missing.txt", missing_value_counts)
return data_frame_imputed
```

## STEP 2: Scaling

We used `from sklearn.preprocessing import RobustScaler` to handle scaling. It centers each feature on its median and scales by the interquartile range, so outliers have less influence on the result.

```
import pandas
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
# Scale the features only; the label column is set aside and re-attached afterwards.
labels = data_frame["label"]
x = data_frame.drop("label", axis=1)
x_scale = scaler.fit_transform(x)
data_frame_scaled = pandas.DataFrame(x_scale, columns=x.columns)
data_frame_scaled["label"] = labels.values
```

## STEP 3: k-fold cross validation + stratify classes + balancing training data

First, we split the dataset into two parts: train (85%) and test (15%). To make sure the majority and minority classes are distributed proportionally across both parts, we pass `stratify=y`.

```
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.15,
    stratify=y,
    random_state=42,
)
```

Then, on the training set, we use `from sklearn.model_selection import StratifiedKFold` so that the same class distribution also holds for the train and validation data of every fold.
```
import tqdm
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for fold_num, (train_idx, val_idx) in enumerate(
    tqdm.tqdm(skf.split(X, y), total=skf.n_splits, desc="Training Folds"), start=1
):
    # Each fold keeps the class ratio of y in both the train and validation parts.
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
```

Finally, we use one of these balancing methods, `from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE`, to augment samples in the training data only.

```
from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE

# Runs inside the fold loop: only X_train / y_train are resampled.
if smote:
    method = smote_method.lower()
    if method == "kmeans":
        sampler = KMeansSMOTE(
            k_neighbors=5,
            cluster_balance_threshold=0.1,
            random_state=random_state,
        )
    elif method == "smote":
        sampler = SMOTE(k_neighbors=5, random_state=random_state)
    elif method == "svmsmote":
        sampler = SVMSMOTE(k_neighbors=5, random_state=random_state)
    elif method == "borderline":
        sampler = BorderlineSMOTE(k_neighbors=5, random_state=random_state)
    elif method == "adasyn":
        sampler = ADASYN(n_neighbors=5, random_state=random_state)
    else:
        raise ValueError(f"Unknown smote_method: {smote_method}")
    X_train, y_train = sampler.fit_resample(X_train, y_train)

model.fit(X_train, y_train)
```

## STEP 4: Train different models to find the best possible approach

#### What we are looking for:

#### Dangerous: sick → predicted healthy (false negative): we want a high recall score, i.e. low FN

#### Costly: healthy → predicted sick (false positive): we want a high precision score, i.e. low FP

## STEP 5: Current results

Results taken with KMEANS_SMOTE:

| model | stage | accuracy | f1_macro | f2_macro | recall_macro | precision_macro | f1_class0 | f1_class1 | f2_class0 | f2_class1 | recall_class0 | recall_class1 | precision_class0 | precision_class1 | TP | TN | FP | FN |
|-------|-------|----------|----------|----------|--------------|-----------------|-----------|-----------|-----------|-----------|---------------|---------------|------------------|------------------|----|----|----|----|
| CatBoost_balanced_knn10 | train | 0.9843784049402589 | 0.8696686267343388 | 0.8824472728294012 | 0.8916952848998795 | 0.8508242781484853 | 0.9919396338322237 | 0.7473976196364541 | 0.9908276010500254 | 0.7740669446087769 | 0.9900881006639566 | 0.7933024691358025 | 0.9938004847319636 | 0.7078480715650071 | 789 | 26898 | 140 | 19 |
| CatBoost_balanced_knn10 | test | 0.9802604802604803 | 0.8348421298822796 | 0.8461546793313885 | 0.8541662696976049 | 0.8176680164072361 | 0.9898162729658793 | 0.6798679867986799 | 0.988757446094471 | 0.703551912568306 | 0.9880528191154894 | 0.7202797202797203 | 0.991586032814472 | 0.64375 | 103 | 4714 | 57 | 40 |
| LGBM_KMEANS_SMOTE_knn10 | train | 0.9883286128479746 | 0.8784419356817057 | 0.8436008106620193 | 0.8240767336379762 | 0.9582821430574249 | 0.9940169232360254 | 0.7628669481273861 | 0.9966698960611392 | 0.6905317252628993 | 0.9984466771524954 | 0.6497067901234568 | 0.9896275269971563 | 0.9269367591176938 | 775 | 27036 | 2 | 33 |
| LGBM_KMEANS_SMOTE_knn10 | test | 0.9865689865689866 | 0.8543196878009516 | 0.8121616449258658 | 0.7895809912158687 | 0.9600745182511498 | 0.9931221342225928 | 0.7155172413793104 | 0.9964866786565728 | 0.6278366111951589 | 0.9987424020121568 | 0.5804195804195804 | 0.9875647668393782 | 0.9325842696629213 | 83 | 4765 | 6 | 60 |

## Next steps:

```
✅ 1. Apply stratified K-fold only on the training data.
🗹 2. Train the LGBM model using KMEANS_SMOTE with k_neighbors=10 (fine-tuning remains)
🗹 3. Train CatBoost using KMEANS_SMOTE with k_neighbors=10 (fine-tuning remains)
🗹 4. Implement the methods proposed in this article: https://1drv.ms/b/c/ab2a38fe5c318317/IQBEDsSFcYj6R6AMtOnh0X6DAZUlFqAYq19WT8nTeXomFwg
🗹 5. Compare the proposed model with SMOTE vs oversampling as the balancing method
```
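
As a cross-check on the STEP 5 table, the class-1 columns of the test rows can be recomputed from the TP/TN/FP/FN counts alone. The sketch below is a minimal illustration, not the project's evaluation code: the function name `class1_metrics` and the variable names are hypothetical, and it uses only the standard definitions of precision, recall, accuracy, and F-beta (with beta = 2 for the `f2_class1` column).

```python
def class1_metrics(tp: int, tn: int, fp: int, fn: int, beta: float = 2.0) -> dict:
    """Metrics for the positive class (label 1) from confusion-matrix counts.

    Hypothetical helper for illustration; not part of the original project.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F-beta weights recall beta times as much as precision (beta=2 gives F2).
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


# Counts copied from the two "test" rows of the STEP 5 table.
catboost_test = class1_metrics(tp=103, tn=4714, fp=57, fn=40)
lgbm_test = class1_metrics(tp=83, tn=4765, fp=6, fn=60)

for name, metrics in [("CatBoost", catboost_test), ("LGBM", lgbm_test)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Because STEP 4 treats a missed bad signal (FN) as the dangerous error, F2 is a natural single number to compare on: it rewards recall more than precision. On the test rows, CatBoost has the higher class-1 recall and F2, while LGBM has the higher class-1 precision.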