# Image and Video Understanding Project

A comprehensive project comparing multiple state-of-the-art deep learning models for object detection and instance segmentation on a waste/litter detection dataset.

## Overview

This project evaluates and compares different deep learning architectures for instance segmentation on a custom waste detection dataset. Each model is trained and evaluated on the same dataset to enable fair comparison.

## Models

### 1. YOLO (YOLOv8l-seg)

- **Model**: YOLOv8 Large Segmentation
- **Framework**: Ultralytics
- **Parameters**: 45.9M
- **Training**: 200 epochs, batch size 16, image size 960x960
- **Features**: Real-time inference, bounding box + mask prediction

### 2. Mask R-CNN

- **Backbone**: ResNet-101 with FPN
- **Framework**: Detectron2
- **Training**: 1000-3000 iterations, batch size 8, image size 960x960
- **Features**: Instance segmentation with high accuracy

### 3. Mask2Former

- **Architecture**: Transformer-based segmentation
- **Framework**: Detectron2
- **Features**: Unified framework for semantic, instance, and panoptic segmentation

### 4. DETR

- **Status**: Dataset prepared (implementation in progress)

## Dataset

Custom waste/litter detection dataset with **20 classes**:

- Clear plastic bottle, Glass bottle, Plastic bottle cap, Metal bottle cap
- Broken glass, Drink can, Other carton, Corrugated carton
- Paper cup, Disposable plastic cup, Plastic lid, Other plastic
- Normal paper, Plastic film, Other plastic wrapper, Pop tab
- Plastic straw, Styrofoam piece, Unlabeled litter, Cigarette

**Dataset Structure**: Train/Val/Test splits in COCO format

## Project Structure

```
├── YOLO/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_200_960_16/         # Training outputs
│   │   └── evaluation_200_960_16/    # Evaluation results
│   └── dataset/                      # Dataset configuration
├── MRCNN/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_1000_iter/          # Training outputs
│   │   └── eval/                     # Evaluation metrics
│   └── requirements.txt
├── M2FORMER/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── output/                       # Training outputs
│   ├── Mask2Former/                  # Mask2Former repository
│   └── requirements.txt
└── DETR/
    └── dataset/                      # Image data
```

## Setup

### Prerequisites

- Python 3.8+
- PyTorch (with CUDA support recommended)
- GPU recommended for training

### Installation

#### YOLO

```bash
pip install ultralytics
```

#### Mask R-CNN

```bash
pip install -r MRCNN/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'
```

#### Mask2Former

```bash
pip install -r M2FORMER/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/facebookresearch/Mask2Former.git
cd Mask2Former/mask2former/modeling/pixel_decoder/ops/
./make.sh  # Compile CUDA operations
```

## Usage

### Training

Each model has a Jupyter notebook (`main.ipynb`) with complete training pipelines:

1. **YOLO**: Open `YOLO/main.ipynb`
   - Configure dataset path in `data.yaml`
   - Run training cells
   - Model saves checkpoints every 10 epochs
2. **Mask R-CNN**: Open `MRCNN/main.ipynb`
   - Configure dataset paths and parameters
   - Register COCO format datasets
   - Train and evaluate
3. **Mask2Former**: Open `M2FORMER/main.ipynb`
   - Set up Mask2Former repository
   - Configure training parameters
   - Train and evaluate

### Evaluation

All notebooks include:

- COCO-style evaluation metrics
- Confusion matrix generation
- Prediction visualization
- Performance comparison tools

## Results

### YOLO Results

- **Box mAP50**: 26.9%
- **Box mAP50-95**: 20.7%
- **Mask mAP50**: 26.7%
- **Mask mAP50-95**: 19.5%
- **Precision (Box)**: 28.8%
- **Recall (Box)**: 29.5%

### Mask R-CNN Results

- **Box AP**: 15.8%
- **Box AP50**: 23.9%
- **Mask AP**: 15.9%
- **Mask AP50**: 23.7%
- Best performance on: Metal bottle cap (50.4% AP), Clear plastic bottle (42.6% AP), Drink can (40.1% AP)

Results are saved in respective `results/` directories with:

- Model weights (`.pth` or `.pt` files)
- Evaluation metrics (JSON format)
- Training logs and visualizations
- Confusion matrices

## Training Parameters

### YOLO

- Epochs: 200
- Batch size: 16
- Image size: 960x960
- Learning rate: 0.01
- Optimizer: AdamW
- Data augmentation: Enabled

### Mask R-CNN

- Iterations: 1000-3000
- Batch size: 8
- Image size: 960x960
- Learning rate: 0.00025
- Backbone: ResNet-101 FPN
- ROI batch size: 16

### Mask2Former

- Configuration: COCO instance segmentation
- Backbone: ResNet-101
- Image size: Variable

## Requirements

### Common Dependencies

- Python 3.8+
- PyTorch
- CUDA (for GPU training)
- OpenCV
- NumPy
- Matplotlib

### Model-Specific

See individual `requirements.txt` files in each model directory for complete dependency lists.
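
## Side-by-Side Metric Sketch

The Results section reports YOLO metrics as mAP50 / mAP50-95 and Mask R-CNN metrics as AP50 / AP. COCO-style "AP" is averaged over IoU thresholds 0.50:0.95, so it is the same quantity as mAP50-95. The sketch below aligns the two naming schemes and prints a single Markdown table. The dictionary keys (`box_ap`, `mask_ap50`, ...) and the helper names are illustrative choices, not part of the notebooks; the numbers are copied from the Results section, and Mask2Former is left out because no results are reported for it. It is a rough aid for reading the numbers rather than a controlled comparison, since the two models were trained with different schedules.

```python
# Headline metrics copied from the Results section (percentages).
# COCO "AP" (IoU 0.50:0.95) is placed in the same column as Ultralytics'
# mAP50-95. Key names are hypothetical, chosen only for this sketch.
REPORTED = {
    "YOLOv8l-seg": {"box_ap": 20.7, "box_ap50": 26.9, "mask_ap": 19.5, "mask_ap50": 26.7},
    "Mask R-CNN": {"box_ap": 15.8, "box_ap50": 23.9, "mask_ap": 15.9, "mask_ap50": 23.7},
}
COLUMNS = ["box_ap", "box_ap50", "mask_ap", "mask_ap50"]


def best_per_metric(results):
    """Return a dict mapping each metric to the model with the highest value."""
    return {
        metric: max(results, key=lambda name: results[name][metric])
        for metric in COLUMNS
    }


def to_markdown(results):
    """Render the results as a Markdown table, one row per model."""
    header = "| Model | " + " | ".join(COLUMNS) + " |"
    rule = "|" + "---|" * (len(COLUMNS) + 1)
    rows = [
        f"| {name} | " + " | ".join(f"{results[name][m]:.1f}" for m in COLUMNS) + " |"
        for name in results
    ]
    return "\n".join([header, rule, *rows])


if __name__ == "__main__":
    print(to_markdown(REPORTED))
    print(best_per_metric(REPORTED))
```

With the reported numbers, YOLOv8l-seg has the higher value in all four columns. If the notebooks write their metrics to JSON, `REPORTED` could be filled from those files instead of being hard-coded, though the exact file layout is not described here.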