# Image and Video Understanding Project
A comprehensive project comparing multiple state-of-the-art deep learning models for object detection and instance segmentation on a waste/litter detection dataset.
## Overview
This project evaluates and compares different deep learning architectures for instance segmentation on a custom waste detection dataset. Each model is trained and evaluated on the same dataset to enable fair comparison.
## Models

### 1. YOLO (YOLOv8l-seg)
- Model: YOLOv8 Large Segmentation
- Framework: Ultralytics
- Parameters: 45.9M
- Training: 200 epochs, batch size 16, image size 960x960
- Features: Real-time inference, bounding box + mask prediction
### 2. Mask R-CNN
- Backbone: ResNet-101 with FPN
- Framework: Detectron2
- Training: 1000-3000 iterations, batch size 8, image size 960x960
- Features: Instance segmentation with high accuracy
### 3. Mask2Former
- Architecture: Transformer-based segmentation
- Framework: Detectron2
- Features: Unified framework for semantic, instance, and panoptic segmentation
### 4. DETR
- Status: Dataset prepared (implementation in progress)
## Dataset
Custom waste/litter detection dataset with 20 classes:
- Clear plastic bottle, Glass bottle, Plastic bottle cap, Metal bottle cap
- Broken glass, Drink can, Other carton, Corrugated carton
- Paper cup, Disposable plastic cup, Plastic lid, Other plastic
- Normal paper, Plastic film, Other plastic wrapper, Pop tab
- Plastic straw, Styrofoam piece, Unlabeled litter, Cigarette
**Dataset structure:** Train/Val/Test splits in COCO format
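Since every split is in COCO format, a quick class-balance check needs only the standard `json` module. A minimal sketch (the inline dict stands in for a real annotation file loaded with `json.load`; `class_counts` is an illustrative helper, not part of the repo):

```python
import json
from collections import Counter

def class_counts(coco: dict) -> Counter:
    """Count annotation instances per category name in a COCO-format dict."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])

# Tiny inline example standing in for a real annotation JSON file.
coco = {
    "categories": [{"id": 1, "name": "Drink can"}, {"id": 2, "name": "Cigarette"}],
    "annotations": [{"category_id": 1}, {"category_id": 1}, {"category_id": 2}],
}
print(class_counts(coco))  # Counter({'Drink can': 2, 'Cigarette': 1})
```

With 20 classes, a skewed count here is the first thing to check when per-class AP varies as widely as in the results below.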
## Project Structure

```
├── YOLO/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_200_960_16/         # Training outputs
│   │   └── evaluation_200_960_16/    # Evaluation results
│   └── dataset/                      # Dataset configuration
├── MRCNN/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_1000_iter/          # Training outputs
│   │   └── eval/                     # Evaluation metrics
│   └── requirements.txt
├── M2FORMER/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── output/                       # Training outputs
│   ├── Mask2Former/                  # Mask2Former repository
│   └── requirements.txt
└── DETR/
    └── dataset/                      # Image data
```
## Setup

### Prerequisites

- Python 3.8+
- PyTorch (with CUDA support recommended)
- GPU recommended for training
### Installation

#### YOLO

```bash
pip install ultralytics
```

#### Mask R-CNN

```bash
pip install -r MRCNN/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'
```

#### Mask2Former

```bash
pip install -r M2FORMER/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/facebookresearch/Mask2Former.git
cd Mask2Former/mask2former/modeling/pixel_decoder/ops/
./make.sh  # Compile CUDA operations
```
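After installing, a quick sanity check that the frameworks are importable can save a debugging session later. A small sketch (the `installed` helper is illustrative, not part of the repo; note these are import names, not pip names):

```python
import importlib.util

def installed(*packages: str) -> dict:
    """Report which of the given packages can be found by the import system."""
    return {p: importlib.util.find_spec(p) is not None for p in packages}

# Typical check after the installs above.
status = installed("ultralytics", "detectron2", "torch")
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```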
## Usage

### Training

Each model has a Jupyter notebook (`main.ipynb`) with a complete training pipeline:

- **YOLO**: Open `YOLO/main.ipynb`
  - Configure the dataset path in `data.yaml`
  - Run the training cells
  - The model saves checkpoints every 10 epochs
- **Mask R-CNN**: Open `MRCNN/main.ipynb`
  - Configure dataset paths and parameters
  - Register the COCO-format datasets
  - Train and evaluate
- **Mask2Former**: Open `M2FORMER/main.ipynb`
  - Set up the Mask2Former repository
  - Configure the training parameters
  - Train and evaluate
### Evaluation
All notebooks include:
- COCO-style evaluation metrics
- Confusion matrix generation
- Prediction visualization
- Performance comparison tools
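The COCO-style metrics reported below all hinge on IoU matching: at AP50, a prediction counts as a true positive only if its IoU with a ground-truth box is at least 0.5. A minimal box-IoU sketch (`box_iou` is illustrative; the notebooks use the frameworks' own COCO evaluators):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

gt = (0, 0, 10, 10)
pred = (5, 0, 15, 10)     # half-overlapping prediction
iou = box_iou(gt, pred)   # 50 / 150 = 1/3
print(iou >= 0.5)         # False: not a true positive at the AP50 threshold
```

mAP50-95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is why it is always the stricter (lower) number.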
## Results

### YOLO Results
- Box mAP50: 26.9%
- Box mAP50-95: 20.7%
- Mask mAP50: 26.7%
- Mask mAP50-95: 19.5%
- Precision (Box): 28.8%
- Recall (Box): 29.5%
### Mask R-CNN Results
- Box AP: 15.8%
- Box AP50: 23.9%
- Mask AP: 15.9%
- Mask AP50: 23.7%
- Best performance on: Metal bottle cap (50.4% AP), Clear plastic bottle (42.6% AP), Drink can (40.1% AP)
Results are saved in the respective `results/` directories with:

- Model weights (`.pth` or `.pt` files)
- Evaluation metrics (JSON format)
- Training logs and visualizations
- Confusion matrices
## Training Parameters

### YOLO
- Epochs: 200
- Batch size: 16
- Image size: 960x960
- Learning rate: 0.01
- Optimizer: AdamW
- Data augmentation: Enabled
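The parameters above map onto an Ultralytics `model.train()` call; a hedged sketch (the dataset path is a placeholder, and Ultralytics enables its default data augmentation during training without an explicit flag):

```python
# Training settings mirroring the parameters listed above.
train_kwargs = dict(
    data="dataset/data.yaml",  # placeholder path to the dataset config
    epochs=200,
    batch=16,
    imgsz=960,
    lr0=0.01,
    optimizer="AdamW",
    save_period=10,            # checkpoint every 10 epochs, as noted in Usage
)

def train():
    # Deferred import so the settings above can be inspected without ultralytics.
    from ultralytics import YOLO
    model = YOLO("yolov8l-seg.pt")  # pretrained YOLOv8 Large segmentation weights
    return model.train(**train_kwargs)

print(train_kwargs["epochs"], train_kwargs["imgsz"])  # 200 960
```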
### Mask R-CNN
- Iterations: 1000-3000
- Batch size: 8
- Image size: 960x960
- Learning rate: 0.00025
- Backbone: ResNet-101 FPN
- ROI batch size: 16
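In Detectron2 terms, the settings above translate to a config roughly like the following (a sketch assuming the standard `mask_rcnn_R_101_FPN_3x` model-zoo baseline; the `waste_train`/`waste_val` dataset names are placeholders for datasets registered elsewhere in the notebook):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("waste_train",)            # placeholder registered names
cfg.DATASETS.TEST = ("waste_val",)
cfg.SOLVER.IMS_PER_BATCH = 8                     # batch size 8
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 3000                       # 1000-3000 iterations
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 16    # ROI batch size
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 20             # 20 waste classes
cfg.INPUT.MIN_SIZE_TRAIN = (960,)                # image size 960
```

A `DefaultTrainer` built from this config runs the training loop; the notebook holds the authoritative values.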
### Mask2Former
- Configuration: COCO instance segmentation
- Backbone: ResNet-101
- Image size: Variable
## Requirements

### Common Dependencies
- Python 3.8+
- PyTorch
- CUDA (for GPU training)
- OpenCV
- NumPy
- Matplotlib
### Model-Specific

See the individual `requirements.txt` files in each model directory for complete dependency lists.