SuperFormer Portfolio

Overview

Note: This project is still in progress, and the content will be updated gradually.

In this project, we propose SuperFormer, a novel model for addressing the core efficiency challenges in Salient Object Detection (SOD). Efficiency is defined as achieving the desired outcome with minimal waste of effort, time, or resources. While most state-of-the-art SOD methods focus primarily on improving detection accuracy, SuperFormer treats the consumption of time and resources as equally important. It is specifically designed to optimize model complexity and size, ensuring a balance between accuracy and resource efficiency for faster, more sustainable learning.

How does SuperFormer facilitate efficient SOD?

Utilizing superpixel segmentation, which groups image pixels based on shared descriptive attributes such as color similarity and spatial proximity.
Introducing a multimodal feature representation leveraging Fourier descriptors to encode superpixel shape, size, and rotation.
Proposing Dynamic Centroid Positional Embedding (DCPE) to address spatial heterogeneity in superpixel graphs by representing spatial relationships via Euclidean centroids.

Superpixel

We use Simple Linear Iterative Clustering (SLIC) (Achanta et al., 2010), which groups pixels based on color similarity and spatial proximity with fast, linear computational complexity. Each resulting superpixel region is then classified as either foreground or background, which essentially amounts to binary classification within semantic segmentation. Unlike pixel-based images, which have grid-like structures with uniform connectivity, superpixels form graph-like structures with irregular connectivity, providing greater flexibility in representing complex shapes and regions.

Grid-like structures: Uniform, fixed connectivity between neighboring pixels, ideal for regular data but limited in adaptability.
Graph-like structures: Irregular connectivity, allowing more flexible and adaptive representations of non-uniform regions, making them better suited for tasks like segmentation.
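
To make the grid-versus-graph contrast concrete, the sketch below builds the edge list of a superpixel graph from a label map. It assumes the label map is a 2-D integer array with one superpixel id per pixel (for example, the output of a SLIC implementation) and that two superpixels are neighbors when they share a 4-connected pixel boundary. The function name `superpixel_edges` and the toy data are illustrative and not taken from the project.

```python
import numpy as np

def superpixel_edges(labels: np.ndarray) -> np.ndarray:
    """Return unique undirected edges (i, j) with i < j between touching superpixels."""
    # Label pairs for horizontally and vertically adjacent pixels (4-connectivity).
    horiz = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    vert = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.concatenate([horiz, vert], axis=0)
    # Keep only the pairs that cross a superpixel boundary.
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    # Sort each pair so (a, b) and (b, a) collapse to the same edge.
    pairs = np.sort(pairs, axis=1)
    return np.unique(pairs, axis=0)

# Toy 4x4 label map with three irregular regions (hypothetical data).
toy_labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [2, 2, 2, 2],
])
edges = superpixel_edges(toy_labels)
print(edges.tolist())
```

On a pixel grid every interior pixel has the same number of neighbors, whereas here the number of edges per node depends on the shape of each region.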

Multimodal Representation

Color plays a critical role in salient object detection, with color distribution being a key indicator for identifying saliency. The graph-like structure of superpixels requires a statistical representation to effectively capture this color distribution. Representing variable-sized superpixels in a tensor form suitable for optimization algorithms, such as stochastic gradient descent, is challenging because superpixels vary in size and have a non-uniform structure. To encode this variability while maintaining compatibility with gradient-based optimization, SuperFormer takes a preliminary approach: a 6-D vector holding the mean and standard deviation (STD) of each RGB channel encodes the color distribution of a superpixel.
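
The 6-D color descriptor can be sketched as follows. This is a minimal illustration, assuming an `(H, W, 3)` RGB image, a matching `(H, W)` integer label map, and a feature layout of three means followed by three standard deviations; the layout and the function name `color_features` are assumptions, not the project's actual code.

```python
import numpy as np

def color_features(image: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return an (n_superpixels, 6) array: per-channel mean, then per-channel std."""
    n = int(labels.max()) + 1
    flat_labels = labels.ravel()
    pixels = image.reshape(-1, 3).astype(np.float64)
    # Number of pixels per superpixel; clamp to 1 so unused ids do not divide by zero.
    counts = np.maximum(np.bincount(flat_labels, minlength=n), 1).astype(np.float64)
    feats = np.zeros((n, 6))
    for c in range(3):
        total = np.bincount(flat_labels, weights=pixels[:, c], minlength=n)
        total_sq = np.bincount(flat_labels, weights=pixels[:, c] ** 2, minlength=n)
        mean = total / counts
        # Population variance E[x^2] - E[x]^2, clipped against rounding error.
        var = np.maximum(total_sq / counts - mean ** 2, 0.0)
        feats[:, c] = mean
        feats[:, 3 + c] = np.sqrt(var)
    return feats

# Toy 2x2 image with two superpixels (hypothetical data).
toy_image = np.array([
    [[0, 0, 0], [2, 4, 6]],
    [[5, 5, 5], [5, 5, 5]],
])
toy_labels = np.array([[0, 0], [1, 1]])
features = color_features(toy_image, toy_labels)
```

Every superpixel maps to a fixed-length row regardless of how many pixels it contains, which is what makes the representation usable as a tensor input.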

Shape is another crucial indicator in saliency detection. To capture the shape characteristics of each superpixel, we first extract the contour of the superpixel, which consists of its boundary coordinates. We then apply the Fourier transform to this contour to obtain complex coefficients that represent the shape in the frequency domain. Each coefficient has an amplitude and a phase component.
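
A minimal sketch of this step is shown below. It assumes the contour is already available as an ordered `(N, 2)` array of `(x, y)` boundary points, treats each point as the complex number `x + iy`, and keeps only the first few coefficients. The truncation length `n_coeffs`, the normalization by `N`, and the function name are illustrative assumptions.

```python
import numpy as np

def fourier_descriptors(contour: np.ndarray, n_coeffs: int = 8):
    """Return (amplitude, phase) of the first n_coeffs Fourier coefficients of a contour."""
    # Encode each boundary point (x, y) as the complex number x + iy.
    z = contour[:, 0] + 1j * contour[:, 1]
    # Normalize by the number of points so amplitudes do not depend on sampling density.
    coeffs = np.fft.fft(z) / len(z)
    coeffs = coeffs[:n_coeffs]
    return np.abs(coeffs), np.angle(coeffs)

# Toy contour: a circle of radius 2 centered at (3, 4), sampled at 64 points.
t = 2.0 * np.pi * np.arange(64) / 64
circle = np.stack([3.0 + 2.0 * np.cos(t), 4.0 + 2.0 * np.sin(t)], axis=1)
amplitude, phase = fourier_descriptors(circle)
```

For this circle, the zeroth amplitude is the distance of the center from the origin (5), the first amplitude is the radius (2), and the remaining amplitudes are numerically zero. Scaling a contour scales its amplitudes, while rotating it by an angle adds that angle to every phase, so the two components separate size from orientation.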

DCPE: Dynamic Centroid Positional Embedding

Current state-of-the-art learnable positional embeddings (e.g., BERT) typically rely on one of two approaches:

Uniform Positional Relationship (Fixed or Sinusoidal Positional Embeddings): these assume a uniform positional relationship, where token indices are consistent across spatial or sequential domains.

Learnable Dictionary of Positional Parameters (Relative Positional Embeddings): these use a dictionary of learnable parameters to cover all possible relative positions (Huang et al., 2018).

However, because superpixel locations are irregular and variable, creating learnable parameters for every possible 2D relative position is impractical. DCPE addresses this by non-linearly projecting the centroid of each superpixel into a positional encoding, which adapts to the irregular spatial arrangement.
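
The idea can be sketched as a small network applied to centroid coordinates. The source does not specify the projection, so everything below is an assumption made for illustration: centroids are normalized to [0, 1], the projection is a two-layer perceptron with a tanh non-linearity, and the weights are random stand-ins for parameters that would be learned during training.

```python
import numpy as np

def superpixel_centroids(labels: np.ndarray) -> np.ndarray:
    """Return (n, 2) centroids as (row, col), normalized to [0, 1]."""
    h, w = labels.shape
    n = int(labels.max()) + 1
    rows, cols = np.indices((h, w))
    flat = labels.ravel()
    counts = np.maximum(np.bincount(flat, minlength=n), 1)
    cy = np.bincount(flat, weights=rows.ravel(), minlength=n) / counts
    cx = np.bincount(flat, weights=cols.ravel(), minlength=n) / counts
    return np.stack([cy / max(h - 1, 1), cx / max(w - 1, 1)], axis=1)

def centroid_positional_embedding(centroids, w1, b1, w2, b2):
    """Hypothetical non-linear projection: 2-D centroid -> hidden layer -> embedding."""
    hidden = np.tanh(centroids @ w1 + b1)
    return hidden @ w2 + b2

# Random stand-ins for learnable parameters (hypothetical sizes).
rng = np.random.default_rng(0)
hidden_dim, embed_dim = 32, 16
w1 = rng.normal(scale=0.5, size=(2, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
b2 = np.zeros(embed_dim)

# Toy 4x4 label map with three irregular regions (hypothetical data).
toy_labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [2, 2, 2, 2],
])
centroids = superpixel_centroids(toy_labels)
pos_emb = centroid_positional_embedding(centroids, w1, b1, w2, b2)
```

Because the embedding is a function of continuous coordinates, the parameter count is independent of the number of superpixels and of where they fall, unlike a lookup table indexed by relative position.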
