
Image and Video Understanding Project

A comprehensive project comparing multiple state-of-the-art deep learning models for object detection and instance segmentation on a waste/litter detection dataset.

Overview

This project evaluates and compares different deep learning architectures for instance segmentation on a custom waste detection dataset. Each model is trained and evaluated on the same dataset to enable fair comparison.

Models

1. YOLO (YOLOv8l-seg)

  • Model: YOLOv8 Large Segmentation
  • Framework: Ultralytics
  • Parameters: 45.9M
  • Training: 200 epochs, batch size 16, image size 960x960
  • Features: Real-time inference, bounding box + mask prediction

2. Mask R-CNN

  • Backbone: ResNet-101 with FPN
  • Framework: Detectron2
  • Training: 1000-3000 iterations, batch size 8, image size 960x960
  • Features: Instance segmentation with high accuracy

3. Mask2Former

  • Architecture: Transformer-based segmentation
  • Framework: Detectron2
  • Features: Unified framework for semantic, instance, and panoptic segmentation

4. DETR

  • Status: Dataset prepared (implementation in progress)

Dataset

Custom waste/litter detection dataset with 20 classes:

  • Clear plastic bottle, Glass bottle, Plastic bottle cap, Metal bottle cap
  • Broken glass, Drink can, Other carton, Corrugated carton
  • Paper cup, Disposable plastic cup, Plastic lid, Other plastic
  • Normal paper, Plastic film, Other plastic wrapper, Pop tab
  • Plastic straw, Styrofoam piece, Unlabeled litter, Cigarette

Dataset Structure: Train/Val/Test splits in COCO format
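
For reference, a COCO-format annotation file for one split looks roughly like the sketch below. All IDs, file names, and coordinates are hypothetical; the real files come with the dataset splits.

```python
import json

# Minimal sketch of one COCO-format instance annotation file
# (values are hypothetical placeholders).
coco = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 960, "height": 960},
    ],
    "categories": [
        {"id": 0, "name": "Clear plastic bottle"},
        {"id": 5, "name": "Drink can"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 5,
            "bbox": [100.0, 150.0, 80.0, 120.0],  # [x, y, width, height]
            "segmentation": [[100, 150, 180, 150, 180, 270, 100, 270]],  # polygon
            "area": 9600.0,
            "iscrowd": 0,
        }
    ],
}

# Round-trips cleanly as JSON, as the frameworks' dataset loaders expect.
assert json.loads(json.dumps(coco)) == coco
```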

Project Structure

├── YOLO/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_200_960_16/        # Training outputs
│   │   └── evaluation_200_960_16/   # Evaluation results
│   └── dataset/                      # Dataset configuration
├── MRCNN/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── results/
│   │   ├── train_1000_iter/         # Training outputs
│   │   └── eval/                     # Evaluation metrics
│   └── requirements.txt
├── M2FORMER/
│   ├── main.ipynb                    # Training and evaluation notebook
│   ├── output/                       # Training outputs
│   ├── Mask2Former/                  # Mask2Former repository
│   └── requirements.txt
└── DETR/
    └── dataset/                      # Image data

Setup

Prerequisites

  • Python 3.8+
  • PyTorch (with CUDA support recommended)
  • GPU recommended for training

Installation

YOLO

pip install ultralytics

Mask R-CNN

pip install -r MRCNN/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'

Mask2Former

pip install -r M2FORMER/requirements.txt
pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone https://github.com/facebookresearch/Mask2Former.git
cd Mask2Former/mask2former/modeling/pixel_decoder/ops/
sh make.sh  # Compile the deformable-attention CUDA operations

Usage

Training

Each model has a Jupyter notebook (main.ipynb) with complete training pipelines:

  1. YOLO: Open YOLO/main.ipynb

    • Configure dataset path in data.yaml
    • Run training cells
    • Model saves checkpoints every 10 epochs
  2. Mask R-CNN: Open MRCNN/main.ipynb

    • Configure dataset paths and parameters
    • Register COCO format datasets
    • Train and evaluate
  3. Mask2Former: Open M2FORMER/main.ipynb

    • Setup Mask2Former repository
    • Configure training parameters
    • Train and evaluate
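
The data.yaml referenced in step 1 follows the standard Ultralytics layout. A sketch with placeholder paths (the actual paths and class order depend on your checkout):

```yaml
# Hypothetical dataset configuration — adjust paths to your layout
path: dataset            # dataset root
train: images/train
val: images/val
test: images/test
names:
  0: Clear plastic bottle
  1: Glass bottle
  # … remaining 18 classes, in the order used by the annotations
```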

Evaluation

All notebooks include:

  • COCO-style evaluation metrics
  • Confusion matrix generation
  • Prediction visualization
  • Performance comparison tools
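
To illustrate what the confusion-matrix step computes, here is a minimal sketch over matched detections. The class indices are toy values; the notebooks use each framework's own tooling for this.

```python
from collections import Counter

def confusion_matrix(true_ids, pred_ids, num_classes):
    """Count (true class, predicted class) pairs for matched detections."""
    counts = Counter(zip(true_ids, pred_ids))
    return [[counts[(t, p)] for p in range(num_classes)]
            for t in range(num_classes)]

# Toy example over a few of the 20 classes
m = confusion_matrix([0, 0, 5, 2], [0, 5, 5, 2], num_classes=6)
print(m[0][0], m[0][5], m[5][5])  # 1 1 1
```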

Results

YOLO Results

  • Box mAP50: 26.9%
  • Box mAP50-95: 20.7%
  • Mask mAP50: 26.7%
  • Mask mAP50-95: 19.5%
  • Precision (Box): 28.8%
  • Recall (Box): 29.5%

Mask R-CNN Results

  • Box AP: 15.8%
  • Box AP50: 23.9%
  • Mask AP: 15.9%
  • Mask AP50: 23.7%
  • Best performance on: Metal bottle cap (50.4% AP), Clear plastic bottle (42.6% AP), Drink can (40.1% AP)

Results are saved in respective results/ directories with:

  • Model weights (.pth or .pt files)
  • Evaluation metrics (JSON format)
  • Training logs and visualizations
  • Confusion matrices

Training Parameters

YOLO

  • Epochs: 200
  • Batch size: 16
  • Image size: 960x960
  • Learning rate: 0.01
  • Optimizer: AdamW
  • Data augmentation: Enabled
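
Collected as keyword arguments for the Ultralytics train() call, these parameters look roughly as follows (the dataset path is a placeholder; exact values live in YOLO/main.ipynb):

```python
# Hyperparameters from the list above, gathered for the Ultralytics
# train() call (dataset path is a placeholder).
hyperparams = {
    "data": "dataset/data.yaml",
    "epochs": 200,
    "imgsz": 960,
    "batch": 16,
    "optimizer": "AdamW",
    "lr0": 0.01,  # initial learning rate
}

# In the notebook this feeds directly into training:
#   from ultralytics import YOLO
#   YOLO("yolov8l-seg.pt").train(**hyperparams)
print(hyperparams["epochs"])
```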

Mask R-CNN

  • Iterations: 1000-3000
  • Batch size: 8
  • Image size: 960x960
  • Learning rate: 0.00025
  • Backbone: ResNet-101 FPN
  • ROI batch size: 16
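
In Detectron2 these settings map onto config keys roughly like the sketch below. This is a hedged illustration; the notebook sets the equivalents programmatically via get_cfg(), and the base-config name assumes the standard model zoo layout.

```yaml
# Sketch of the Detectron2 overrides implied by the parameters above
_BASE_: "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"
MODEL:
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 16
    NUM_CLASSES: 20
SOLVER:
  IMS_PER_BATCH: 8
  BASE_LR: 0.00025
  MAX_ITER: 3000
INPUT:
  MIN_SIZE_TRAIN: (960,)
```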

Mask2Former

  • Configuration: COCO instance segmentation
  • Backbone: ResNet-101
  • Image size: Variable

Requirements

Common Dependencies

  • Python 3.8+
  • PyTorch
  • CUDA (for GPU training)
  • OpenCV
  • NumPy
  • Matplotlib

Model-Specific

See individual requirements.txt files in each model directory for complete dependency lists.