# Electrocardiogram

We are dealing with an extremely imbalanced dataset of electrocardiogram (ECG) signals with two classes, labeled good (0) and bad (1).

## STEP 1: Fill missing values

  Every column in our data contains between 25 and 70 missing values. Using `from sklearn.impute import KNNImputer`, we fill each missing value from its 5 nearest neighbors.
  
  ```
  imputer = KNNImputer(n_neighbors=5)
  data_imputed = imputer.fit_transform(data_frame)
  data_frame_imputed = pandas.DataFrame(data_imputed, columns=columns)

  missing_value_counts = data_frame_imputed.isna().sum()
  write_textfile(f"{data_directory}/no_missing.txt", missing_value_counts)
  return data_frame_imputed
  ```
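
To see what the imputer does on a small scale, the sketch below fills two gaps in a toy table. The column names and values are made up for illustration and are not taken from the ECG dataset; `n_neighbors=2` is used only because the toy table is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values) with two missing entries.
toy = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, np.nan, 4.0, 5.0],
        "feature_b": [10.0, 20.0, 30.0, np.nan, 50.0],
        "feature_c": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)

# Each missing value is replaced by the mean of that feature over the
# nearest rows; distances use only the features that both rows have.
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
print("remaining missing values:", int(toy_imputed.isna().sum().sum()))
```

Values that were already present are passed through unchanged; only the gaps are filled.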

## STEP 2: Scaling

  We scale the feature columns with `from sklearn.preprocessing import RobustScaler`; the `label` column is left unscaled.

  ```
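  scaler = RobustScaler()
  # Scale only the feature columns; the label column must stay untouched.
  x = data_frame.drop("label", axis=1)
  x_scale = scaler.fit_transform(x)
  data_frame_scaled = pandas.DataFrame(x_scale, columns=x.columns)
  # Re-attach the labels by position (.values avoids index misalignment).
  data_frame_scaled["label"] = data_frame["label"].values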
  ```
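
For reference, with its default settings `RobustScaler` subtracts each column's median and divides by its interquartile range (IQR), so a few extreme values have little effect on the scale. The sketch below checks this on made-up numbers; the array is illustrative and is not ECG data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: (x - median) / (Q3 - Q1).
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print("matches manual formula:", np.allclose(scaled, manual))
```

The outlier stays large after scaling, but it does not distort the scale of the other values the way it would with mean and standard deviation.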

## STEP 3: k-fold cross validation + stratify classes + balancing training data

  First, we split the dataset into two parts: train (85%) and test (15%). To make sure the majority and minority classes keep the same
  proportions in both parts, we pass `stratify=y`.
  
  ```
  x_train, x_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.15,
    stratify=y,
    random_state=42,
  )
  ```
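
A quick way to confirm the effect of `stratify=y` is to compare the minority-class share in each split. The sketch below does this on synthetic data with 3% positives; the data and the 3% rate are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, imbalanced labels: 30 positives out of 1000 samples.
y = np.zeros(1000, dtype=int)
y[:30] = 1
X = rng.normal(size=(1000, 4))

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# With stratification the positive rate is nearly identical in both splits.
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```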
  Then, on the train set, we use `from sklearn.model_selection import StratifiedKFold` so that the same class distribution also holds for
  the train and validation folds.
  
  ```
  skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
  for fold_num, (train_idx, val_idx) in enumerate(
      tqdm.tqdm(skf.split(X, y), total=skf.n_splits, desc="Training Folds"), start=1
  ):
      X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
      y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
  ```
  Finally, we apply one of these balancing methods, `from imblearn.over_sampling import ADASYN, SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE`, to oversample the training fold only; the validation fold is left untouched.
  
  ```
  if smote:
    if smote_method.lower() == "kmeans":
        sampler = KMeansSMOTE(
            k_neighbors=5,
            cluster_balance_threshold=0.1,
            random_state=random_state,
        )
    elif smote_method.lower() == "smote":
        sampler = SMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "svmsmote":
        sampler = SVMSMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "borderline":
        sampler = BorderlineSMOTE(k_neighbors=5, random_state=random_state)
    elif smote_method.lower() == "adasyn":
        sampler = ADASYN(n_neighbors=5, random_state=random_state)
    else:
        raise ValueError(f"Unknown smote_method: {smote_method}")
  
    X_train, y_train = sampler.fit_resample(X_train, y_train)
  
  model.fit(X_train, y_train)
  ```
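
The pattern of resampling inside the fold loop can be sketched end to end as follows. To keep the example dependent on scikit-learn only, plain random oversampling via `sklearn.utils.resample` stands in for the SMOTE variants, and `LogisticRegression` stands in for the real models; both are substitutions made for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 40 positives out of 800 samples (assumed values).
X = rng.normal(size=(800, 5))
y = np.zeros(800, dtype=int)
y[:40] = 1
X[y == 1] += 1.5  # shift positives so the classes are partly separable

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_positive_rates, val_recalls = [], []
for train_idx, val_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    # Stand-in for SMOTE: duplicate minority rows until classes are balanced.
    # Only the training fold is resampled.
    n_major = int((y_tr == 0).sum())
    X_min_up = resample(
        X_tr[y_tr == 1], replace=True, n_samples=n_major, random_state=42
    )
    X_bal = np.vstack([X_tr[y_tr == 0], X_min_up])
    y_bal = np.concatenate(
        [np.zeros(n_major, dtype=int), np.ones(n_major, dtype=int)]
    )

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # The validation fold keeps its original class distribution.
    val_positive_rates.append(float(y_val.mean()))
    val_recalls.append(recall_score(y_val, model.predict(X_val)))

print("validation positive rates:", val_positive_rates)
print("mean validation recall:", np.mean(val_recalls))
```

Because the validation fold is never resampled, its metrics reflect the imbalance the model will face on real data.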

## STEP 4: Train different models to find the best possible approach 

#### What we are looking for:

#### Dangerous: sick → predicted healthy (false negative, FN): we want a high recall score, i.e. low FN

#### Costly: healthy → predicted sick (false positive, FP): we want a high precision score, i.e. low FP
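
As a small worked example of these two goals, the sketch below computes recall, precision, and F2 for the positive class from confusion-matrix counts. The function name `class1_metrics` is hypothetical; the counts (TP=103, FP=57, FN=40) are those of the CatBoost test row in the results table of this README. F2 is shown because it weighs recall more heavily than precision, which fits the case where a missed positive is the more dangerous error.

```python
def class1_metrics(tp: int, fp: int, fn: int, beta: float = 2.0) -> dict:
    """Recall, precision and F-beta for the positive class from raw counts.

    No guard against zero denominators; this is a sketch, not library code.
    """
    recall = tp / (tp + fn)        # share of true positives that were caught
    precision = tp / (tp + fp)     # share of positive predictions that were right
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"recall": recall, "precision": precision, "f_beta": f_beta}


# Counts from the CatBoost test row: TP=103, FP=57, FN=40.
metrics = class1_metrics(tp=103, fp=57, fn=40)
print(metrics)
```

The three values reproduce `recall_class1`, `precision_class1`, and `f2_class1` reported for that row.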


## STEP 5: Results


Current results obtained with KMEANS_SMOTE:

| model                 | stage | accuracy           | f1_macro           | f2_macro           | recall_macro       | precision_macro    | f1_class0          | f1_class1          | f2_class0          | f2_class1          | recall_class0      | recall_class1      | precision_class0   | precision_class1   | TP  | TN    | FP | FN |
|-----------------------|-------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|-----|-------|----|----|
| CatBoost_balanced_knn10     | train | 0.9843784049402589 | 0.8696686267343388 | 0.8824472728294012 | 0.8916952848998795 | 0.8508242781484853 | 0.9919396338322237 | 0.7473976196364541 | 0.9908276010500254 | 0.7740669446087769 | 0.9900881006639566 | 0.7933024691358025 | 0.9938004847319636 | 0.7078480715650071 | 789 | 26898 | 140 | 19 |
| CatBoost_balanced_knn10     | test  | 0.9802604802604803 | 0.8348421298822796 | 0.8461546793313885 | 0.8541662696976049 | 0.8176680164072361 | 0.9898162729658793 | 0.6798679867986799 | 0.988757446094471  | 0.703551912568306  | 0.9880528191154894 | 0.7202797202797203 | 0.991586032814472  | 0.64375            | 103 | 4714  | 57 | 40 |
| LGBM_KMEANS_SMOTE_knn10     | train | 0.9883286128479746 | 0.8784419356817057 | 0.8436008106620193 | 0.8240767336379762 | 0.9582821430574249 | 0.9940169232360254 | 0.7628669481273861 | 0.9966698960611392 | 0.6905317252628993 | 0.9984466771524954 | 0.6497067901234568 | 0.9896275269971563 | 0.9269367591176938 | 775 | 27036 | 2  | 33 |
| LGBM_KMEANS_SMOTE_knn10     | test  | 0.9865689865689866 | 0.8543196878009516 | 0.8121616449258658 | 0.7895809912158687 | 0.9600745182511498 | 0.9931221342225928 | 0.7155172413793104 | 0.9964866786565728 | 0.6278366111951589 | 0.9987424020121568 | 0.5804195804195804 | 0.9875647668393782 | 0.9325842696629213 | 83  | 4765  | 6  | 60 |


## Tuning LightGBM and CatBoost 

As written in the tune function of `models/catboost_model.py`, we used the following parameters for this model:

```
  scaling_methods = [
      "standard_scaling",
      "robust_scaling",
      "minmax_scaling",
      "yeo_johnson",
  ]
  sampling_methods = [
      "KMeansSMOTE",
      "class_weight",
  ]
  learning_rate_list = [0.03, 0.05, 0.1]
  depth_list = [6, 8]
  l2_leaf_reg_list = [1, 3]
  subsample_list = [0.8, 1.0]
  k_neighbors_list = [10]
  kmeans_estimator_list = [5]

```
Also, for the tune function of `models/lightgbm_model.py`, we used the following parameters:

```
  scaling_methods = [
      "standard_scaling",
      "robust_scaling",
      "minmax_scaling",
      "yeo_johnson",
  ]
  sampling_methods = [
      "KMeansSMOTE",
      "class_weight",
  ]
  boosting_type_list = ["gbdt", "dart"]
  learning_rate_list = [0.03, 0.05, 0.1]
  number_of_leaves_list = [100]
  l2_regularization_lambda_list = [0.1, 0.5]
  l1_regularization_alpha_list = [0.1, 0.5]
  tree_subsample_tree_list = [0.8, 1.0]
  subsample_list = [0.8, 1.0]
  kmeans_smote_k_neighbors_list = [10]
  kmeans_smote_n_clusters_list = [5]
```
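
To get a feel for the size of these search spaces, the sketch below enumerates every combination of the listed values. It assumes the tune functions search the full Cartesian product, which this README does not state; the dictionary keys and the helper `iter_configs` are hypothetical names, not taken from `models/`. Under that assumption the CatBoost grid has 192 configurations and the LightGBM grid has 768.

```python
from itertools import product

scaling = ["standard_scaling", "robust_scaling", "minmax_scaling", "yeo_johnson"]
sampling = ["KMeansSMOTE", "class_weight"]

catboost_grid = {
    "scaling_method": scaling,
    "sampling_method": sampling,
    "learning_rate": [0.03, 0.05, 0.1],
    "depth": [6, 8],
    "l2_leaf_reg": [1, 3],
    "subsample": [0.8, 1.0],
    "k_neighbors": [10],
    "kmeans_estimator": [5],
}

lightgbm_grid = {
    "scaling_method": scaling,
    "sampling_method": sampling,
    "boosting_type": ["gbdt", "dart"],
    "learning_rate": [0.03, 0.05, 0.1],
    "number_of_leaves": [100],
    "l2_regularization_lambda": [0.1, 0.5],
    "l1_regularization_alpha": [0.1, 0.5],
    "tree_subsample_tree": [0.8, 1.0],
    "subsample": [0.8, 1.0],
    "kmeans_smote_k_neighbors": [10],
    "kmeans_smote_n_clusters": [5],
}


def iter_configs(grid):
    """Yield one dict per combination of the grid's values."""
    keys = list(grid)
    for values in product(*grid.values()):
        yield dict(zip(keys, values))


catboost_configs = list(iter_configs(catboost_grid))
lightgbm_configs = list(iter_configs(lightgbm_grid))
print("CatBoost configurations:", len(catboost_configs))
print("LightGBM configurations:", len(lightgbm_configs))
```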
After tuning, we trained both models with their best parameters and compared them on the imbalanced test data.
Here are the comparison results:

| model    | accuracy           | f1_macro          | f2_macro          | recall_macro      | precision_macro   | f1_class0         | f2_class0         | recall_class0     | precision_class0  | f1_class1         | f2_class1         | recall_class1     | precision_class1  | TP | TN   | FP | FN |
|----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|----|------|----|----|
| catboost | 0.9814814814814815 | 0.8195693865042805 | 0.8013174756506312 | 0.7903526990720451 | 0.8559205703525894 | 0.9904901243599122 | 0.9921698350221925 | 0.9932928107315029 | 0.9877032096706961 | 0.6486486486486487 | 0.6104651162790697 | 0.5874125874125874 | 0.7241379310344828 | 84 | 4739 | 32 | 59 |
| lightgbm | 0.9849409849409849 | 0.8469442386692707 | 0.8185917013944679 | 0.8023094072140393 | 0.9084632979829487 | 0.9922755741127348 | 0.9946427824048885 | 0.9962272060364703 | 0.9883551673944687 | 0.7016129032258065 | 0.6425406203840472 | 0.6083916083916084 | 0.8285714285714286 | 87 | 4753 | 18 | 56 |


## Next steps:
```
✅ 1. Apply stratified k-fold to the train set only.
✅ 2. Train the LGBM model using KMEANS_SMOTE with k_neighbors=10 (fine-tuning remains).
✅ 3. Train the CatBoost model using KMEANS_SMOTE with k_neighbors=10 (fine-tuning remains).
🗹 4. Implement the methods proposed in this article: https://1drv.ms/b/c/ab2a38fe5c318317/IQBEDsSFcYj6R6AMtOnh0X6DAZUlFqAYq19WT8nTeXomFwg
🗹 5. Compare the proposed model with SMOTE vs. oversampling balancing methods.
```