Computer Vision Project: SLOPE_KP

Overview

SLOP-KP
SLOP-KP: Self-supervised Learning of Object Pose Estimation Using Keypoint Prediction.

SLOPE_KP (Self-supervised Learning of Object Pose Estimation Using Keypoint Prediction) outlines advancements in predicting both object pose (camera viewpoint) and shape from a single image. The key innovation is a novel approach to predicting camera pose using self-supervised learning of keypoints: specific points located on a deformable shape that is typical of a particular object category (such as birds or cars).

Key Contributions

3D Geometry Fundamentals

In this section, I will explain the fundamentals of 3D geometry and rendering to provide a clearer understanding of the project and the proposed model.

Category-specific Mean Shape

The model's mean shape serves as a prototypical representation of a category, such as a bird, capturing the essential geometric structure common to that category. This allows the model to generalize across different instances while maintaining the key features that define the category. We have access to the components of the category-specific mean shape: vertices, faces, UV vertices, and UV faces.

mean_shape
Category-specific Mean-shape of Birds.

Shape

The shape of an object is defined by two fundamental components: vertices and faces.

Vertices

Vertices represent the 3D points scattered across the surface of the object. These points define the structure and are arranged in a matrix of size \((V, 3)\), where \(V\) is the number of vertices, and each vertex contains three coordinates \((x, y, z)\) representing its position in 3D space.

Faces

Faces are typically triangular surfaces formed by connecting three vertices. These triangles collectively form the surface of the object's 3D shape. The faces are organized in a structure of shape \((F, 3)\), where \(F\) represents the number of faces, and each face stores the indices of its three vertices.

UV Vertices

This is an array containing UV coordinates for each vertex in the 3D model. It usually has the shape \((V, 2)\), where \(V\) is the number of vertices. Each entry contains the UV coordinates \((u, v)\) for a vertex.

UV Faces

This is an array that specifies the texture coordinates (UV coordinates) for each face of the 3D model. It defines how the UV coordinates are connected to form faces in the texture space. It has shape \((F, 3)\), where \(F\) is the number of faces and each entry contains three indices. These indices correspond to entries in the UV Vertices array, forming triangles in the 2D texture space.
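As a concrete illustration of these four components, here is a minimal sketch (the array sizes and variable names are hypothetical; the project's actual mean shape may differ):

```python
import numpy as np

V, F = 642, 1280  # hypothetical vertex and face counts

verts = np.zeros((V, 3))                      # (V, 3): x, y, z position of each vertex
faces = np.zeros((F, 3), dtype=np.int64)      # (F, 3): indices into `verts`
uv_verts = np.zeros((V, 2))                   # (V, 2): (u, v) texture coordinate per vertex
uv_faces = np.zeros((F, 3), dtype=np.int64)   # (F, 3): indices into `uv_verts`

# Every face references three vertices in 3D space and three UV vertices in texture space.
assert faces.max() < len(verts)
assert uv_faces.max() < len(uv_verts)
```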

Texture

Texture refers to the detailed surface information that is applied to a 3D mesh. The texture is essentially a 2D image that gets mapped onto the 3D model to provide visual detail such as colors and patterns.

Texture Predictor

The model’s TextureMapPredictor module first outputs the predicted texture map. The texture map is essentially a 2D image (or multiple images) that captures the texture details to be applied to the 3D model. The shape of the texture map is \((B, C, H, W)\), where \(B\) is the batch size, \(C\) is the number of color channels (typically 3 for RGB, or 2 if predicting flow), and \(H\) and \(W\) are the height and width of the texture image, respectively. The \(H\) and \(W\) dimensions correspond to the \(u\) and \(v\) coordinates in the texture/pixel space. We need a way to map this 2D texture map (image) onto the 3D surface of the model.

UV Sampler

Each face of the 3D model is associated with a set of UV coordinates (pixel coordinates of the 2D image) that specify how the texture (image pixels) should be wrapped around the model. The UV Sampler uses these coordinates to determine which parts of the texture (image) correspond to which parts of the 3D surface. The UV Sampler tensor has shape \((F, T, T, 2)\), where \(F\) is the number of faces, \(T\) is the texture size, and 2 is the \((u, v)\) coordinate. Each face has a texture with \((T \times T)\) pixels, and each pixel has a \((u,v)\) coordinate. By sampling these coordinates, the model can fetch the correct color or texture information from the 2D texture map. This process allows the model to apply detailed texture information to the 3D surface.

Sampling Results

The result of this sampling is a texture map for each face of the 3D model, which is then used to render or visualize the model. The sampled texture values correspond to the areas of the texture image that should appear on the model’s surface. After sampling, the texture has shape \((B, F, T, T, C)\), indicating that for each face \(F\) in the 3D model we get a \((T \times T)\) texture grid with \(C\) channels of color (3) or flow (2) information.
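A minimal PyTorch sketch of this sampling step, assuming the UV sampler grid is already normalized to \([-1, 1]\) as `grid_sample` expects (variable names are illustrative, not the project's actual API):

```python
import torch
import torch.nn.functional as F

B, C, H, W = 2, 3, 64, 64          # predicted texture map: batch, channels, height, width
Fc, T = 1280, 6                     # number of faces, per-face texture resolution

texture_map = torch.rand(B, C, H, W)           # output of the texture predictor
uv_sampler = torch.rand(Fc, T, T, 2) * 2 - 1   # (F, T, T, 2), values in [-1, 1]

# Repeat the per-face sampling grid for every item in the batch and flatten it so that
# grid_sample sees a (B, F, T*T, 2) grid.
grid = uv_sampler.unsqueeze(0).expand(B, -1, -1, -1, -1).reshape(B, Fc, T * T, 2)

# grid_sample returns (B, C, F, T*T): for each face, T*T colors sampled from the texture map.
sampled = F.grid_sample(texture_map, grid, align_corners=False)

# Rearrange to (B, F, T, T, C), the per-face texture expected by the renderer.
textures = sampled.permute(0, 2, 3, 1).reshape(B, Fc, T, T, C)
print(textures.shape)  # torch.Size([2, 1280, 6, 6, 3])
```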

Training Texture Prediction Model

Backpropagation involves updating the Texture Predictor Model, as it is responsible for generating the initial texture map from the input features. The loss is backpropagated to this model, allowing its parameters to be adjusted based on the gradients derived from the loss, thereby enhancing texture prediction accuracy. In contrast, the UV Sampler, which maps 2D texture coordinates to the 3D shape, is crucial for the final rendering but is typically a fixed function or non-learnable component. It does not have parameters that can be optimized, so the loss is not backpropagated to it.

Rendering

After obtaining the texture map outputs from the Texture Prediction Model, combined with the UV Sampling module, the resulting texture has the shape \((B, F, T, T, C)\). Along with the predicted vertices reconstructed from the deformations of the mean shape, as predicted by the Shape Model, we now have the 3D geometry and corresponding texture of the object necessary for rendering its image.

However, while the network predicts both the 3D shape and texture, the camera pose remains critical for accurately projecting this 3D data into a 2D image. Without the camera pose, it would be impossible to correctly position, orient, and map the texture of the 3D object onto the 2D image plane from the desired viewpoint. Our rendering function, therefore, takes as input the vertices, faces, 3D texture, and camera pose to generate the final rendered texture and mask.

3D Rotation

Euler Angles

I start with a short introduction to rotation. A 3D rotation can be represented by three angles (Euler angles): in the camera-pose setting these are azimuth, elevation, and cyclo-rotation, which correspond to yaw, pitch, and roll.

Gimbal Lock Phenomenon

In normal situations, the three axes of rotation (pitch, yaw, and roll) are independent, meaning that adjusting one does not affect the others. However, if you rotate the object in such a way that two of these axes align (for example, rotating 90 degrees in pitch), you lose the ability to rotate around one of the original axes. This is what causes gimbal lock: the system becomes "locked" in a way that removes one degree of freedom, and the object can no longer be oriented in certain ways. For example, in an airplane, if you pitch the plane 90 degrees (nose pointing straight up), the yaw and roll axes are now aligned, meaning that trying to adjust yaw will result in roll instead, and vice versa.

Rotation Matrices and Orthogonality

Rotation matrices are indeed orthogonal matrices:

Preservation of Length and Angles: Rotation, by definition, should preserve the magnitude (or length) of vectors and the angles between them. Orthogonal matrices have the property of preserving the dot product between vectors, which guarantees that the vector lengths and angles are unchanged. This is crucial for rotations, as they only change the direction of vectors, not their size or relative orientation.

Invertible: Rotation matrices are required to be invertible, so that rotating by an angle and then rotating by the opposite angle returns the vector to its original position. Orthogonal matrices satisfy the property \(Q^{-1} = Q^T\). This simplifies calculations, as it means that the transpose of the rotation matrix can be used to perform the inverse rotation.

Determinant: The determinant of a rotation matrix is always +1. While orthogonal matrices can have determinants of +1 or −1, a determinant of −1 means the matrix represents a reflection rather than a rotation.

Orthogonal Matrix

Transpose Equals Inverse

An orthogonal matrix \( Q \) satisfies the condition:

$$ Q^T Q = Q Q^T = I $$

where \( Q^T \) is the transpose of \( Q \) and \( I \) is the identity matrix. This means that the inverse of \( Q \) is its transpose: \( Q^{-1} = Q^T \).

Preserves Lengths and Angles

Multiplying a vector by an orthogonal matrix preserves the vector's length and the angle between vectors. This makes orthogonal matrices useful for transformations that involve rotations and reflections in geometry.

Determinant

The determinant of an orthogonal matrix is always \( +1 \) or \( -1 \). If the determinant is \( +1 \), the matrix represents a rotation. If it’s \( -1 \), it represents a reflection.

Orthonormal Rows and Columns

The rows (and likewise the columns) of an orthogonal matrix form an orthonormal set: each has unit length, and any two distinct rows (or columns) are perpendicular to each other.
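A quick numerical check of these properties, using a rotation about the z-axis (illustrative only):

```python
import numpy as np

theta = np.deg2rad(30.0)
# Rotation about the z-axis by 30 degrees.
Q = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])

# Transpose equals inverse, determinant is +1, and rows are orthonormal.
assert np.allclose(Q.T @ Q, np.eye(3))
assert np.isclose(np.linalg.det(Q), 1.0)
assert np.allclose(np.linalg.norm(Q, axis=1), 1.0)  # unit-length rows
```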

Quaternions

Quaternions Goel et al. (2020) are a mathematical system that extends complex numbers, and they are often used in 3D graphics and physics to represent rotations in 3D space without suffering from issues like gimbal lock, which can occur with Euler angles (such as azimuth, elevation, and cyclo-rotation). Quaternions avoid gimbal lock because they do not rely on three sequential angles; instead, they encode a rotation as a single, continuous transformation in 3D space, which also yields smoother and more stable rotations. A quaternion \(q\) has the form:

$$ q = w + xi + yj + zk = (w, \vec{v}) = (w, x, y, z)$$

where \(w\) is the scalar part and \(\vec{v} = (x, y, z)\) is the vector part. For a rotation by an angle \(\theta\) about a unit axis, the components are:

$$ w = \cos\left(\frac{\theta}{2}\right), \; x = \sin\left(\frac{\theta}{2}\right) v_x, \; y = \sin\left(\frac{\theta}{2}\right) v_y, \; z = \sin\left(\frac{\theta}{2}\right) v_z $$

where \((v_x, v_y, v_z)\) is the unit vector of the axis of rotation, and \(\theta\) is the rotation angle in radians.

Converting Euler Angles to Quaternions

The quaternion for each individual axis rotation is computed from the corresponding Euler angle, and the individual quaternions are multiplied to obtain the full rotation.
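As a sketch of the standard conversion (assuming the yaw-pitch-roll, i.e. Z-Y-X, ordering; other orderings change the product below), the per-axis quaternions in \((w, x, y, z)\) form and their composition are:

$$ q_{\text{roll}} = \left(\cos\tfrac{\phi}{2},\ \sin\tfrac{\phi}{2},\ 0,\ 0\right), \quad q_{\text{pitch}} = \left(\cos\tfrac{\theta}{2},\ 0,\ \sin\tfrac{\theta}{2},\ 0\right), \quad q_{\text{yaw}} = \left(\cos\tfrac{\psi}{2},\ 0,\ 0,\ \sin\tfrac{\psi}{2}\right) $$

$$ q = q_{\text{yaw}} \otimes q_{\text{pitch}} \otimes q_{\text{roll}} $$

where \(\phi\), \(\theta\), and \(\psi\) are the roll (x-axis), pitch (y-axis), and yaw (z-axis) angles, and \(\otimes\) denotes quaternion multiplication.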

Gram-Schmidt

The 6D rotation representation mapped onto SO(3) using the partial Gram-Schmidt procedure Zhou et al. (2020) is a method to represent 3D rotations more efficiently and robustly than traditional methods like Euler angles or quaternions. Unlike Euler angles, the 6D representation avoids gimbal lock and discontinuities. Moreover, the representation is more numerically stable compared to quaternions, especially when used in optimization tasks (e.g., neural networks), since quaternions require normalization to maintain valid rotations.
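A minimal PyTorch sketch of this 6D-to-rotation mapping, following the construction described by Zhou et al. (2020) (function and variable names are mine):

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(x6: torch.Tensor) -> torch.Tensor:
    """Map a (..., 6) tensor to (..., 3, 3) rotation matrices via partial Gram-Schmidt."""
    a1, a2 = x6[..., 0:3], x6[..., 3:6]
    b1 = F.normalize(a1, dim=-1)                       # first column: normalized a1
    # Second column: a2 with its component along b1 removed, then normalized.
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)                   # third column: cross product
    return torch.stack((b1, b2, b3), dim=-1)

R = rotation_6d_to_matrix(torch.randn(4, 6))
print(torch.allclose(torch.det(R), torch.ones(4), atol=1e-5))  # valid rotations
```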

Special Orthogonalization Using SVD

The "special orthogonalization using SVD" Levinson et al. (2020) based on a 9D rotation representation, involves mapping a 9-dimensional (9D) vector to the special orthogonal group SO(3), which consists of valid 3D rotation matrices. This process leverages Singular Value Decomposition (SVD) to ensure the resulting matrix is a valid rotation matrix (i.e., it has orthonormal columns and a determinant of 1). The use of SVD ensures that the resulting matrix is orthogonal, which is a requirement for a valid rotation matrix. SVD is a well-established and stable algorithm, making this method robust for tasks that require precise rotation computations. This approach is flexible, and can handle matrices that are initially non-orthogonal, correcting them via the SVD process to ensure they represent a valid rotation.

Model Architecture

SLOP-KP
SLOP-KP: Self-supervised Learning of Object Pose Estimation Using Keypoint Prediction.
The model architecture has 3 main components:

Phase-I: Camera Multiplex

We begin by optimizing a randomly initialized camera multiplex, which represents a distribution over 40 cameras. Each camera is initialized with 6 parameters: 1 for scale, 2 for translation, and 3 for Euler angle rotations representing Azimuth, Elevation and Cyclo-rotation. The goal is to find the camera configuration that best explains the image, considering varying object poses. During this process, the multiplex is pruned down to the 4 best cameras by minimizing a camera update loss, which is computed from the rendered masks and textures (reconstructed image).
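As an illustrative sketch (hypothetical tensor layout, not the project's actual data structure), the multiplex can be held as a learnable tensor of 40 cameras with 6 parameters each:

```python
import torch

NUM_CAMERAS = 40  # cameras in the initial multiplex

# Per camera: [scale, tx, ty, azimuth, elevation, cyclo-rotation]
cameras = torch.zeros(NUM_CAMERAS, 6)
cameras[:, 0] = 1.0                                        # scale initialized near 1
cameras[:, 3] = torch.linspace(-3.14, 3.14, NUM_CAMERAS)   # spread azimuth over the circle
cameras[:, 4] = (torch.rand(NUM_CAMERAS) - 0.5)            # small random elevation
cameras = torch.nn.Parameter(cameras)                      # optimized by the camera-update loss
```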

Phase-II: Shape and Texture Reconstruction

In the next phase, using the pruned multiplex of the 4 best cameras, we train the shape and texture models. The camera multiplex continues to be optimized based on the camera update loss, calculated from the rendered masks, and texture (reconstructed image) until we converge on the single best camera configuration that captures the object's pose most accurately.

The total loss to train shape and texture is defined as follows:

\( L_{\text{total}} = \sum_{k} p_{k} \left(L_{\text{mask}, k} + L_{\text{pixel}, k}\right) + L_{\text{def}} + L_{\text{lap}}, \)

Where \(k\) indexes the cameras in the multiplex, and the silhouette mask loss is defined as:

\(L_{\text{mask}, k} = ||S - \tilde{S}_k||_2^2 + \text{dt}(S) \cdot \tilde{S}_k\)

where \(S\) and \(\tilde{S}_{k}\) are the ground-truth mask and the mask rendered from camera \(k\), respectively. \(dt(S)\) is the unidirectional distance transform of the ground-truth mask.

Unidirectional distance transform of the ground-truth mask: A distance transform converts a binary mask into a distance map, where each pixel's value represents the minimum distance to the nearest boundary or differently valued pixel (e.g., from foreground to background). Typically, distance transforms are calculated bidirectionally, measuring distances across both foreground-to-background and background-to-foreground transitions. In contrast, a unidirectional distance transform measures distance in only one direction—either from the foreground to the background or vice versa. For example, a unidirectional transform focused on foreground-to-background would measure how far each foreground pixel is from its nearest background pixel.
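A hedged sketch of this mask loss for a single camera, using SciPy's Euclidean distance transform (array names are illustrative):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def mask_loss(gt_mask: np.ndarray, pred_mask: np.ndarray) -> float:
    """L_mask = ||S - S~_k||^2 + dt(S) . S~_k for one rendered mask (binary arrays in [0, 1])."""
    # Unidirectional distance transform: distance of every pixel to the nearest
    # ground-truth foreground pixel (zero inside the foreground).
    dt = distance_transform_edt(1.0 - gt_mask)
    l2_term = np.mean((gt_mask - pred_mask) ** 2)
    # Penalize rendered foreground that falls far outside the ground-truth silhouette.
    dt_term = np.mean(dt * pred_mask)
    return l2_term + dt_term
```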

The image reconstruction loss computed from the foreground image is:

\(L_{\text{pixel}, k} = \text{dist}(\tilde{I}_k \odot S, I \odot S)\),

where \(I\) and \(\tilde{I}_k\) are the RGB image and the image rendered from camera \(k\). The \( \odot\) denotes the element-wise product.

A graph-Laplacian smoothness prior on the shape penalizes vertices \(i\) that are far away from the centroid of their neighboring vertices \(N(i)\):

\(L_{\text{lap}} = ||V_i - \frac{1}{|N(i)|} \sum_{j \in N(i)} V_j ||^2\),

For deformable objects like birds, it is beneficial to regularize the deformations, avoiding arbitrarily large deformations from the mean shape, by adding the energy term:

\(L_{\text{def}} = ||\Delta {V}||\).

The probability that a camera \(k\) is the optimal choice is computed using:

\( p_{k} = \frac{e^{-\frac{L_k}{\sigma}}}{\sum_{j} e^{-\frac{L_j}{\sigma}}}, \)

where \(L_{k} = L_{\text{mask}, k} + L_{\text{pixel}, k}\) is the camera update loss.
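A small sketch of this weighting (it is equivalent to a softmax over the negative camera losses scaled by \(\sigma\)):

```python
import torch

def camera_probabilities(losses: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """p_k = exp(-L_k / sigma) / sum_j exp(-L_j / sigma) over the cameras in the multiplex."""
    return torch.softmax(-losses / sigma, dim=0)

p = camera_probabilities(torch.tensor([0.8, 1.2, 0.5, 2.0]))
print(p, p.sum())  # the lowest-loss camera gets the highest probability; probabilities sum to 1
```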

Phase-III: Keypoint Pose Prediction

Finally, using the best camera pose from the optimized multiplex and the trained shape and texture models, the camera pose is predicted from keypoints.

Solving perspective-n-points by keypoint heatmaps

If object shapes are predicted as variations of a fixed model shape (mean shape) that has a set number of vertices and faces, then each object instance will have vertices that match the same semantic locations on the model. The only difference is their position due to deformation. These vertices can be treated as 3D keypoints. By reducing the number of keypoints through down-sampling and learning them from 2D images, we can use traditional methods, like robust PnP (Perspective-n-Point) estimators Campbell et al. (2020) to determine the object’s pose.

First, the object's shape is predicted as a deformation from a mean shape or model shape, resulting in a new shape represented by 3D points. Next, \(N\) keypoints are selected from these 3D points using Farthest Point Sampling Qi et al. (2017), which spreads them roughly uniformly over the shape. Using the best pose from the optimized camera multiplex, these \(N\) keypoints are projected onto the image plane, resulting in corresponding 2D points. Finally, with these 3D and 2D points, we can solve the perspective-n-point (PnP) problem to estimate the camera's position and orientation relative to the scene.
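A minimal sketch of the PnP step. The paper relies on a robust PnP estimator (Campbell et al., 2020); the OpenCV RANSAC solver below is a stand-in used only to illustrate the inputs and outputs:

```python
import cv2
import numpy as np

# N predicted 3D keypoints in the object's coordinate frame (random stand-ins here).
keypoints_3d = np.random.rand(32, 3).astype(np.float32)

# Simple pinhole intrinsics (assumed known or fixed for the dataset).
K = np.array([[500.0, 0.0, 128.0],
              [0.0, 500.0, 128.0],
              [0.0,   0.0,   1.0]], dtype=np.float32)

# For this toy example, create 2D correspondences by projecting with a known pose;
# in the pipeline these would come from the predicted keypoint heatmaps.
rvec_gt = np.array([[0.1], [0.2], [0.3]], dtype=np.float32)
tvec_gt = np.array([[0.0], [0.0], [5.0]], dtype=np.float32)
keypoints_2d, _ = cv2.projectPoints(keypoints_3d, rvec_gt, tvec_gt, K, None)
keypoints_2d = keypoints_2d.reshape(-1, 2)

# Robust PnP recovers the camera rotation and translation from 3D-2D correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(keypoints_3d, keypoints_2d, K, None)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix of the estimated camera pose
```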

Keypoints Prediction Loss

\( \theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} S_{w} \, || S_{h} - F(x_{i}; \theta) ||^2 \),

where the weighted least-squares loss is summed over all \(N\) keypoints, comparing the proxy ground-truth heatmaps \(S_h\) against the heatmaps predicted by the keypoint prediction network \(F(x_{i}; \theta)\), with \(S_w\) acting as a per-keypoint weight.
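A sketch of this weighted least-squares objective in PyTorch (shapes are illustrative; \(S_w\) is assumed to reduce to one weight per keypoint):

```python
import torch

def keypoint_loss(pred_heatmaps: torch.Tensor,
                  proxy_heatmaps: torch.Tensor,
                  keypoint_weights: torch.Tensor) -> torch.Tensor:
    """Weighted least squares summed over N keypoints.

    pred_heatmaps, proxy_heatmaps: (N, H, W) predicted / proxy ground-truth heatmaps.
    keypoint_weights: (N,) weights S_w (e.g. 1 for visible keypoints, 0 otherwise).
    """
    per_keypoint = ((proxy_heatmaps - pred_heatmaps) ** 2).flatten(1).sum(dim=1)  # (N,)
    return (keypoint_weights * per_keypoint).sum()
```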

Keypoints Prediction Network \(F(x_{i}; \theta)\)

The keypoint prediction network consists of two components: the Smooth network and the Border network. The Smooth network includes a channel attention block, which selects discriminative features, and a refinement residual block to enhance the feature map at each stage of the feature extraction process. The Border network amplifies the distinction of features by using a semantic boundary to guide feature learning. To further increase inter-class distinction, the Border network explicitly learns the semantic boundary with supervised guidance.

Self-supervision Block

The self-supervision block is developed to train the keypoint prediction network. This module is composed of two functional components, \(S_h\) and \(S_w\), which generate proxy ground-truth heatmaps and a weighting mask, respectively.

Rendered Label Texture (\(T_l\))

Due to limitations in the version of the renderer we used, a ground-truth label texture \(T_{\text{gt}}\) is initialized in the local coordinate space of the 3D model. The ground-truth texture data (colors) are created without considering any external viewpoint (camera pose). To achieve this, a color map \(C_{\text{map}}\) assigns a unique color to each keypoint. Faces that contain at least one keypoint vertex are then colored according to \(C_{\text{map}}\).

Img
Original image.
gt_color
Texture rendered \((T_{l})\).

The rendering function projects the ground-truth label texture \(T_{\text{gt}}\) onto the 3D model from the camera's viewpoint. It takes the predicted vertices \(V\), representing the 3D coordinates of the model's vertices, the faces \(F\), which define the connectivity of these vertices to form the model's surface, and the camera (the best pose from the multiplex, \(\hat{\pi}\)), which specifies the viewpoint and projection parameters. When \(T_{\text{gt}}\) is rendered using this function, it is mapped onto the model according to the camera's perspective. The resulting output is a rendered texture \(T_{l}\) that visually integrates the color data from \(T_{\text{gt}}\) with the model's geometry, as observed through the camera:

\( T_{l} = R (T_{\text{gt}}, F, V, \hat {\pi}) \),

Proxy Ground-truth Heatmaps (\(S_h\))

After the completion of Phase-II, we have an optimized multiplex, which gives us the best camera pose, as well as a trained shape reconstruction model, which predicts the shape deformation \(\Delta V\) and the texture. By randomly sampling 3D keypoint vertices over the object's predicted shape and using the best camera pose, reconstructed from scale \(s\), translation \(t_{xy}\), and rotation \(r_q\), the 3D keypoints are projected to their 2D correspondences \([u_i, v_i]\). We use \([x, y]\) as the 2D coordinates of the image and a Gaussian function to model the uncertainty of the locations of the 2D keypoint projections on the heatmap:

\( S_{h} = e^{\frac{-||[x, y] - [u_i, v_i] ||^2}{2 \sigma^2}} \).
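An illustrative sketch of how such a proxy heatmap can be rasterized for one projected keypoint (image size and \(\sigma\) are hypothetical):

```python
import torch

def gaussian_heatmap(u: float, v: float, height: int = 256, width: int = 256,
                     sigma: float = 3.0) -> torch.Tensor:
    """Proxy ground-truth heatmap: a Gaussian centered on the projected 2D keypoint (u, v)."""
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32),
                            indexing="ij")
    sq_dist = (xs - u) ** 2 + (ys - v) ** 2
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))

S_h = gaussian_heatmap(120.5, 64.0)
print(S_h.shape, S_h.max())  # peak value of 1 at the keypoint location
```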

Weighted Mask (\(S_w\))

The weight mask \(S_w\) is created by sampling colors \(c_{\text{sampled}}\) from the labeled texture \(T_{l}\), with each color corresponding to a 2D vertex \([u_i, v_i]\) that has been projected from a 3D keypoint.

\( S_{w} = \delta_{\epsilon}[||c_{\text{sampled}}-C_{\text{map}}||] \),

where \(\delta_{\epsilon}[\cdot]\) is an indicator that returns 1 if its argument is smaller than \(\epsilon\), and zero otherwise; that is, the weight is 1 when the color sampled at a projected keypoint matches that keypoint's color in \(C_{\text{map}}\). The colors in \(C_{\text{map}}\) are chosen to be more than \(\epsilon\) apart, so this matching is always unambiguous.
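A sketch of this visibility check for a single keypoint (assuming RGB colors in \([0, 1]\); the threshold \(\epsilon\) is hypothetical):

```python
import torch

def keypoint_weight(sampled_color: torch.Tensor, keypoint_color: torch.Tensor,
                    eps: float = 0.1) -> float:
    """Return 1 if the color sampled from the rendered label texture matches the
    keypoint's color in C_map (i.e. the keypoint is visible), 0 otherwise."""
    return float(torch.norm(sampled_color - keypoint_color) < eps)

print(keypoint_weight(torch.tensor([1.0, 0.0, 0.02]), torch.tensor([1.0, 0.0, 0.0])))  # 1.0
```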
fig2
First row shows original images of 6 different birds of the CUB dataset. Second and third rows show the proxy ground-truth heatmaps and the heatmaps predicted by the keypoint prediction network, respectively.

Experiments and Results

This section presents the experiments conducted on the CUB dataset Wah et al. (2011). We conducted these experiments to compare four different approaches to rotation representation. The first predicts 4D unit quaternions with a CNN. The second is the 6D rotation representation mapped onto SO(3) via a partial Gram-Schmidt procedure. The third is special orthogonalization using SVD, based on a 9D rotation representation. The fourth is our approach to camera pose prediction, which trains an intermediate keypoint prediction network.

3D Mesh Reconstruction

fig2
The first row displays the original images of nine different bird species from the CUB dataset. The second row presents the 3D meshes reconstructed using the ground-truth camera poses provided by the dataset through Structure from Motion (SfM). The third row showcases the 3D shapes reconstructed when camera poses are predicted using unit quaternions. Finally, the fourth row illustrates the 3D shapes obtained using camera poses predicted from keypoint correspondences.

Texture Reconstruction

fig2
The first row displays original images of six different bird species from the CUB dataset. The second row shows textures reconstructed using the SfM camera poses for rendering. The third and fourth rows present textures reconstructed with camera poses predicted by unit quaternions and the keypoint pose trainer, respectively.

Mask Reconstruction

fig2
The first row displays original RGB images of six different bird species from the CUB dataset. The second row presents the ground-truth masks provided by the dataset. The third row shows rendered masks using SfM camera poses. The fourth and fifth rows depict reconstructed masks using camera poses predicted by unit quaternions and keypoint correspondences, respectively.

Image Reconstruction

fig2
The first row displays original images of ten different bird species from the CUB dataset. The second row presents the ground-truth annotations provided by the dataset. The third and fourth rows show the masks and textures reconstructed using camera poses predicted by keypoints. The fifth row depicts the 3D shape reconstructed from the camera pose predictions.

Online Inference 3D Object Reconstruction from Videos

We conduct online experiments to infer 3D objects from video sequences with single and multiple objects per image, using the YouTubeVos and Davis datasets Xu et al. (2018), and we focus on the bird category. Inferring objects from video sequences is challenging due to varying positions, orientations, and occlusions. We use LWL Bhat et al. (2020) to compute bounding boxes from the predicted masks. These bounding boxes are used to crop frames and create patches. The image patches are then input to the reconstruction network, which predicts shape, texture, and camera pose. We compare the masks reconstructed by our method and three other approaches against the ground-truth masks. Models are evaluated using three metrics: Jaccard-Mean (mean intersection over union), Jaccard-Recall (mean fraction of values exceeding a threshold), and Jaccard-Decay (performance loss over time).
fig2
Online video object reconstruction framework. This example is from the YouTubeVos test set. The first row shows the original RGB images, and the second row shows the image patches generated by cropping using predicted bounding boxes from the LWL tracker. The third row shows the reconstructed shape and texture.
fig2
Mean intersection over union for 22 video sequences of YouTubeVos and Davis test sets.
fig2
The first row displays original images of 10 birds from the YouTubeVos and Davis video sequences. The second row presents image patches cropped using the bounding box dimensions predicted by the LWL tracker. The third and fourth rows show reconstructed textures and masks obtained using a 3D object reconstruction model, where keypoint correspondences are used to predict the camera poses capturing the images.

Tools and Technologies

This section discusses the design and implementation of the project.

Data Visualization

Utilized various libraries for data visualization:

Model Development and Deployment

Tools and libraries used for model development and deployment include:

Computing Infrastructure

Infrastructure and environments used include: