Image segmentation is a computer vision and image processing technique that involves grouping or labeling similar regions or segments in an image at the pixel level. Each segment of pixels is represented by a class label or a mask.
In image segmentation, an image consists of two main components:
- Things: Countable objects in an image (e.g., people, flowers, birds, animals).
- Stuff: Amorphous regions (or repeating patterns) of similar material, which are uncountable (e.g., road, sky, grass).
Semantic segmentation assigns a class label to every pixel in an image. It classifies regions as belonging to a particular category, such as a car, tree, or road. However, it does not differentiate between multiple instances of the same object. For example, if an image contains two cars, semantic segmentation will classify both as "car" but will not distinguish between them.
Commonly used architectures for semantic segmentation include:
- SegNet
- U-Net
- DeconvNet
- Fully Convolutional Networks (FCNs)
Instance segmentation extends semantic segmentation by distinguishing between different instances of the same class. It assigns a unique mask or bounding box to each object in an image. This is useful for tasks where object counting or differentiation is required, such as detecting multiple cars or people in an image.
Panoptic segmentation combines the best aspects of both semantic and instance segmentation. Each pixel in an image is assigned both a semantic label (class) and a unique instance identifier. This approach enables a more comprehensive understanding of the scene by distinguishing between different objects while also classifying background regions.
U-Net is a widely used deep learning architecture for semantic segmentation. It follows a U-shaped design with an encoder-decoder structure:
- Encoder (Contracting Path): Uses convolutional layers to capture spatial features while downsampling the image.
- Decoder (Expanding Path): Uses upsampling layers to reconstruct the segmented image while preserving spatial details.
- Skip Connections: Connect corresponding layers in the encoder and decoder to retain high-resolution information.
U-Net is commonly used in medical image segmentation, satellite image processing, and other pixel-wise classification tasks.
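As a concrete reference, below is a minimal PyTorch sketch of this encoder-decoder structure. It is illustrative only, not the exact model defined in this repository's scr/model.py: the channel counts are arbitrary and only two resolution levels are shown, whereas a typical U-Net uses four or five.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with padding=1, so spatial size is preserved
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    # Illustrative two-level U-Net (real models usually have 4-5 levels)
    def __init__(self, in_ch=3, num_classes=1):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)
        self.enc2 = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.dec2 = double_conv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = double_conv(128, 64)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                     # encoder features, saved for the skip
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        # Skip connections: upsample, then concatenate encoder features
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                  # raw logits; 1 channel for binary masks
```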
The choice of loss function significantly impacts the performance of a segmentation model. Some commonly used loss functions are listed below; the Dice and Soft IoU variants are sketched in code after the list.
- Cross-Entropy Loss: Measures the difference between predicted and ground-truth probability distributions.
- Intersection over Union (IoU) Loss: Measures the overlap between the predicted mask and ground truth. IoU loss penalizes cases where either precision or recall is low.
- Dice Loss: Computes the similarity between the predicted and actual segmentation masks. It is particularly useful for imbalanced datasets.
- Tversky Loss: A variant of Dice loss that allows adjusting the balance between false positives and false negatives, making it suitable for highly imbalanced datasets.
- Focal Loss: Focuses on hard-to-classify examples by down-weighting easy samples, improving performance on challenging datasets.
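Here is one way the soft Dice and soft IoU (Jaccard) losses can be written for binary masks in PyTorch. This is a sketch, not the repository's actual code; the smoothing constant `eps` and the assumption that the model outputs raw logits (passed through a sigmoid) are choices made here for illustration.

```python
import torch

def dice_loss(logits, targets, eps=1e-6):
    # Soft Dice: 1 - 2|A∩B| / (|A| + |B|), computed on sigmoid probabilities
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def soft_iou_loss(logits, targets, eps=1e-6):
    # Soft Jaccard: 1 - |A∩B| / |A∪B|, a differentiable surrogate for IoU
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = (probs + targets - probs * targets).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()
```

Both expect tensors of shape (N, 1, H, W) with targets in {0, 1}; the soft union in the IoU loss uses the identity |A∪B| = |A| + |B| - |A∩B|.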
To assess model performance, various evaluation metrics are used (a short code sketch follows the list):
- Pixel Accuracy: Measures the percentage of correctly classified pixels.
- Mean Intersection over Union (mIoU): Measures the overlap between the predicted and ground-truth segmentation masks, averaged over all classes.
- Precision, Recall, and F1 Score: Measures model performance in detecting true positives and avoiding false positives/negatives.
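For binary segmentation, pixel accuracy and IoU can be computed directly from thresholded predictions. A minimal sketch (the 0.5 threshold is an assumed default, not taken from this repository):

```python
import torch

def pixel_accuracy(logits, targets, thresh=0.5):
    # Fraction of pixels whose thresholded prediction matches the mask
    preds = (torch.sigmoid(logits) > thresh).float()
    return (preds == targets).float().mean().item()

def iou_score(logits, targets, thresh=0.5, eps=1e-6):
    # IoU on hard (thresholded) masks, unlike the soft loss used for training
    preds = (torch.sigmoid(logits) > thresh).float()
    inter = (preds * targets).sum()
    union = ((preds + targets) > 0).float().sum()
    return ((inter + eps) / (union + eps)).item()
```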
This repository aims to build UNet from scratch for binary semantic segmentation.
The original UNet paper used unpadded ("valid") convolutions, so every convolution lost border pixels and the output was smaller than the input (e.g., a 572×572 input produced a 388×388 output). Feature maps from the contracting path therefore had to be cropped before being concatenated with those in the expanding path. In this implementation, padding is used in the contracting path, ensuring that the output feature map has the same size as the input image.
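The size difference is easy to verify in PyTorch; the snippet below contrasts an unpadded ("valid") 3×3 convolution with the padded ("same") variant used here (channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 572, 572)

# Original paper: unpadded 3x3 convolutions lose a 1-pixel border each time
valid = nn.Conv2d(3, 64, kernel_size=3, padding=0)
print(valid(x).shape)   # torch.Size([1, 64, 570, 570]) -- shrinks

# This implementation: padding=1 keeps the feature map at the input size
same = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(same(x).shape)    # torch.Size([1, 64, 572, 572]) -- size preserved
```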
The dataset used is a Kaggle person segmentation dataset.
- Two UNet models were trained with different loss functions: one with Dice Loss and the other with Soft IoU (Jaccard) Loss.
- Using Soft IoU as the loss function aligned the training objective more closely with the metric we actually evaluate (IoU).
- Model checkpoints were saved at regular intervals to preserve training progress (see the sketch after this list).
- The UNet model trained with the Soft IoU loss significantly outperformed the Dice loss model, as seen in the inference results.
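Checkpointing can be done with plain torch.save/torch.load. A minimal sketch; the dictionary keys and helper names are illustrative, not the repository's actual API, though the checkpoint path matches the checkpoints/ directory in the project structure below:

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoints/checkpoint1.pth"):
    # Store everything needed to resume training, not just the weights
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoints/checkpoint1.pth"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]  # resume from the next epoch
```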
Learning plot for the UNet model trained with Dice Loss

```
UNET_SEGMENTATION/
│
├── checkpoints/                  # Directory for storing model checkpoints
│   └── checkpoint1.pth           # Model checkpoint file
│
├── Data/                         # Directory for dataset
│   ├── test/                     # Test dataset directory
│   └── train/                    # Train dataset directory
│
├── input_images/                 # Directory for input images
│
├── output_images/                # Directory for output images
│
├── trained_model/                # Directory for saved model
│   └── unet_segmentation.pth     # Saved model file
│
├── scr/                          # Directory for project scripts
│   ├── load_data.py              # Dataset loading script
│   ├── model.py                  # UNet model definition
│   ├── train.py                  # Training script
│   ├── utils.py                  # Utility functions
│   └── inference.py              # Inference script
│
├── config.json                   # JSON configuration file
│
└── README.md                     # Project README file
```



