# Image and Video Understanding Project
A comprehensive project comparing multiple state-of-the-art deep learning models for object detection and instance segmentation on a waste/litter detection dataset.
## Overview
This project evaluates and compares different deep learning architectures for instance segmentation on a custom waste detection dataset. Each model is trained and evaluated on the same dataset to enable fair comparison.
## Models

### 1. YOLO (YOLOv8l-seg)
- Model: YOLOv8 Large Segmentation
- Framework: Ultralytics
- Parameters: 45.9M
- Training: 200 epochs, batch size 16, image size 960x960
- Features: Real-time inference, bounding box + mask prediction
### 2. Mask R-CNN
- Backbone: ResNet-101 with FPN
- Framework: Detectron2
- Training: 1000-3000 iterations, batch size 8, image size 960x960
- Features: Instance segmentation with high accuracy
### 3. Mask2Former
- Architecture: Transformer-based segmentation
- Framework: Detectron2
- Features: Unified framework for semantic, instance, and panoptic segmentation
### 4. DETR
- Status: Dataset prepared (implementation in progress)
## Dataset
Custom waste/litter detection dataset with 20 classes:
- Clear plastic bottle, Glass bottle, Plastic bottle cap, Metal bottle cap
- Broken glass, Drink can, Other carton, Corrugated carton
- Paper cup, Disposable plastic cup, Plastic lid, Other plastic
- Normal paper, Plastic film, Other plastic wrapper, Pop tab
- Plastic straw, Styrofoam piece, Unlabeled litter, Cigarette
**Dataset structure:** Train/Val/Test splits in COCO format
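Since every split is in COCO format, a quick class-balance check needs only the standard `json` module. A minimal sketch (the inline dict stands in for a real annotation file loaded with `json.load`; `class_counts` is an illustrative helper, not part of the repo):

```python
import json
from collections import Counter

def class_counts(coco: dict) -> Counter:
    """Count annotation instances per category name in a COCO-format dict."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

# Tiny inline example standing in for a real annotation JSON file.
coco = {
    "categories": [{"id": 1, "name": "Drink can"}, {"id": 2, "name": "Cigarette"}],
    "annotations": [{"category_id": 1}, {"category_id": 1}, {"category_id": 2}],
}
print(class_counts(coco))  # Counter({'Drink can': 2, 'Cigarette': 1})
```

With 20 classes, a skewed count here is the first thing to check when per-class AP varies as widely as in the results below.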
## Project Structure

```
├── YOLO/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_200_960_16/         # Training outputs
│   │   └── evaluation_200_960_16/    # Evaluation results
│   └── dataset/                      # Dataset configuration
├── MRCNN/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_1000_iter/          # Training outputs
│   │   └── eval/                     # Evaluation metrics
│   └── requirements.txt
├── M2FORMER/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── output/                       # Training outputs
│   ├── Mask2Former/                  # Mask2Former repository
│   └── requirements.txt
└── DETR/
    └── dataset/                      # Image data
```
## Setup

### Prerequisites

- Python 3.8+
- PyTorch (with CUDA support recommended)
- GPU recommended for training
### Installation

#### YOLO

```bash
pip install ultralytics
```

#### Mask R-CNN

```bash
pip install -r MRCNN/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'
```

#### Mask2Former

```bash
pip install -r M2FORMER/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/facebookresearch/Mask2Former.git
cd Mask2Former/mask2former/modeling/pixel_decoder/ops/
./make.sh  # Compile CUDA operations
```
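After installing, a quick sanity check that the frameworks are importable can save a debugging session later. A small sketch (the `installed` helper is illustrative, not part of the repo; note these are import names, not pip names):

```python
import importlib.util

def installed(*packages: str) -> dict:
    """Report which of the given packages can be found by the import system."""
    return {p: importlib.util.find_spec(p) is not None for p in packages}

# Typical check after the installs above.
status = installed("ultralytics", "detectron2", "torch")
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```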
## Usage

### Training

Each model has a Jupyter notebook (`main.ipynb`) with a complete training pipeline:

- **YOLO**: Open `YOLO/main.ipynb`
  - Configure the dataset path in `data.yaml`
  - Run the training cells
  - The model saves checkpoints every 10 epochs
- **Mask R-CNN**: Open `MRCNN/main.ipynb`
  - Configure dataset paths and parameters
  - Register the COCO-format datasets
  - Train and evaluate
- **Mask2Former**: Open `M2FORMER/main.ipynb`
  - Set up the Mask2Former repository
  - Configure the training parameters
  - Train and evaluate
### Evaluation
All notebooks include:
- COCO-style evaluation metrics
- Confusion matrix generation
- Prediction visualization
- Performance comparison tools
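The COCO-style metrics reported below all hinge on IoU matching: at AP50, a prediction counts as a true positive only if its IoU with a ground-truth box is at least 0.5. A minimal box-IoU sketch (`box_iou` is illustrative; the notebooks use the frameworks' own COCO evaluators):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

gt = (0, 0, 10, 10)
pred = (5, 0, 15, 10)     # half-overlapping prediction
iou = box_iou(gt, pred)   # 50 / 150 = 1/3
print(iou >= 0.5)         # False: not a true positive at the AP50 threshold
```

mAP50-95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is why it is always the stricter (lower) number.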
## Results

### YOLO Results
- Box mAP50: 26.9%
- Box mAP50-95: 20.7%
- Mask mAP50: 26.7%
- Mask mAP50-95: 19.5%
- Precision (Box): 28.8%
- Recall (Box): 29.5%
### Mask R-CNN Results
- Box AP: 15.8%
- Box AP50: 23.9%
- Mask AP: 15.9%
- Mask AP50: 23.7%
- Best performance on: Metal bottle cap (50.4% AP), Clear plastic bottle (42.6% AP), Drink can (40.1% AP)
Results are saved in the respective `results/` directories with:

- Model weights (`.pth` or `.pt` files)
- Evaluation metrics (JSON format)
- Training logs and visualizations
- Confusion matrices
## Training Parameters

### YOLO
- Epochs: 200
- Batch size: 16
- Image size: 960x960
- Learning rate: 0.01
- Optimizer: AdamW
- Data augmentation: Enabled
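The parameters above map onto an Ultralytics `model.train()` call; a hedged sketch (the dataset path is a placeholder, and Ultralytics enables its default data augmentation during training without an explicit flag):

```python
# Training settings mirroring the parameters listed above.
train_kwargs = dict(
    data="dataset/data.yaml",  # placeholder path to the dataset config
    epochs=200,
    batch=16,
    imgsz=960,
    lr0=0.01,
    optimizer="AdamW",
    save_period=10,            # checkpoint every 10 epochs, as noted in Usage
)

def train():
    # Deferred import so the settings above can be inspected without ultralytics.
    from ultralytics import YOLO
    model = YOLO("yolov8l-seg.pt")  # pretrained YOLOv8 Large segmentation weights
    return model.train(**train_kwargs)

print(train_kwargs["epochs"], train_kwargs["imgsz"])  # 200 960
```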
### Mask R-CNN
- Iterations: 1000-3000
- Batch size: 8
- Image size: 960x960
- Learning rate: 0.00025
- Backbone: ResNet-101 FPN
- ROI batch size: 16
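In Detectron2 terms, the settings above translate to a config roughly like the following (a sketch assuming the standard `mask_rcnn_R_101_FPN_3x` model-zoo baseline; the `waste_train`/`waste_val` dataset names are placeholders for datasets registered elsewhere in the notebook):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("waste_train",)            # placeholder registered names
cfg.DATASETS.TEST = ("waste_val",)
cfg.SOLVER.IMS_PER_BATCH = 8                     # batch size 8
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 3000                       # 1000-3000 iterations
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 16    # ROI batch size
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 20             # 20 waste classes
cfg.INPUT.MIN_SIZE_TRAIN = (960,)                # image size 960
```

A `DefaultTrainer` built from this config runs the training loop; the notebook holds the authoritative values.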
### Mask2Former
- Configuration: COCO instance segmentation
- Backbone: ResNet-101
- Image size: Variable
## Requirements

### Common Dependencies
- Python 3.8+
- PyTorch
- CUDA (for GPU training)
- OpenCV
- NumPy
- Matplotlib
### Model-Specific

See the individual `requirements.txt` files in each model directory for complete dependency lists.