BRL-VBVC Portfolio

Project Description

Here, I will present an intriguing project in autonomous driving that bridges the gap from virtual to real-world applications. In this project, I developed a Bayesian Reinforcement Learning framework that leverages visual data from the environment to learn how to control a vehicle within the CARLA simulation.

Key Contributions

Co-led the multi-partner WASP-NTU research project, aligning academic innovation with real-world challenges in autonomous driving.
Mentored a PhD student in research direction, methodology, and publication, resulting in shared authorship of a peer-reviewed article (Holmquist, K. 2023).
Developed a vision-based framework for autonomous driving using reinforcement learning techniques.
Designed and implemented the model architecture, conducted experiments in the CARLA simulator, and presented findings at the ICPR 2020 conference.
Evaluated model performance using the CARLA benchmark (Dosovitskiy et al., 2017), addressing tasks of increasing complexity and testing generalization to unseen environments, with metrics reported in kilometers per infraction.

Tools & Technologies

CARLA simulator: https://carla.org/
Python, Scikit-learn, Numpy, Scipy, CSV, JSON
Matplotlib, Seaborn, Plotly, Visdom

Code / Git

BRL-VBVC GitHub Page

Research / Paper

IEEE
Arxiv

Presentations

Proceedings of ICPR 2021

Additional Context

Bayesian Approach to Reinforcement Learning

In this section, I will explain the fundamentals of the Bayesian approach to reinforcement learning developed in this project.

In this section, we present the Bayesian approach to Reinforcement Learning introduced by [4]. This framework integrates a Gaussian Mixture Model (GMM) with a reinforcement learning approach.

We assume that at each time step \(t\), the agent perceives the world through the state \(s_t\) and models the action-conditioned state probability by a GMM with a set of mixture components \(M\). Based on the perceived state, the agent makes a decision and performs an action \(a_t\) from the action set \(A\). The agent’s decision is evaluated by a reward signal \(r_t\), which is used to train the system.

We assume stochastic variables \(a \in A\) and \(m \in M\) to calculate the conditional probability \(p(a|s_t)\). These probabilities are used to select the best decision at a given state and are estimated as:

\[ p(a|s_{t}) ∝ p(a) \sum_{m ∈ M} p(s_{t}|m)p(m|a) \]

Due to the unknown parameters of the GMM, we use the multivariate t-distribution to estimate \(p(s_t|m)\), the likelihood of the state \(s_t\) given the mixture component \(m\).

Estimation of Mixture Component Probability

Next, we estimate the term \(p(m|a)\), the probability of mixture component \(m\) given the action \(a\). This probability is parameterized by a state-action value function \(Q = [q_{m,a}] : (|M| × |A|)\) (where \(| · |\) represents the cardinality of the respective set) and is learned through reinforcement learning.

To calculate the probabilities \(p(m|a)\) and \(p(a)\), the elements of \(Q\) must be non-negative. Thus, we use the offset:

\( \hat{q} = \frac{| \min{Q}|}{1 + | \min{Q}| - \min{Q}} \)

The expressions for the probabilities \(p(m|a)\) and \(p(a)\) are as follows:

\( p(m|a) ∝ (q_{m,a} + \hat{q}) \)

\( p(a) ∝ \sum_{m ∈ M} (q_{m,a} + \hat{q}) \)

Objective Function

To train the system, a temporal difference learning (TD) approach is used with the loss:

\( TD_{error} = r_t + γQ(a_{t+1}, m_{t+1}) - Q(a_t, m_t) \)

where \(r_t\) is the reward signal and \(γ\) is the forgetting factor. The term \(a_t\) indicates the action performed at time step \(t\), and \(m_t\) is the most similar component to the state \(s_t\), determined by the \( \ell_{\infty} \) norm between the state \(s_t\) and the mean vectors of the existing mixture components. The term \(a_{t+1}\) denotes the most probable action in the next time step \(t + 1\):

\( a_{t+1} = \arg \max_{a} p(a|s_{t+1}) \)

The most likely component is \(m_{t+1}\), based on the probability:

\( p(m|a_{t+1}, s_{t+1}) ∝ p(s_{t+1}|m)p(m|a_{t+1}) \)

Finally, the system parameters are learned by Q-Learning using:

\( Q(m_t, a_t) \leftarrow Q(m_t, a_t) + αw \cdot TD_{\text{error}} \)

where \(α\) corresponds to a decaying learning rate and \(w\) allows for soft updates despite the greedy choice of component \(m_{t+1}\). To calculate \(w\), the \(TD_{\text{error}}\) is evaluated using two boundaries: a lower threshold \(T_l\), indicating a bad decision, and an upper threshold \(T_u\), indicating a good one:

\( w = \begin{cases} p(m_t|a_t, s_t), & \text{if } TD_{\text{error}} > T_u \\ p(m_t|¬a_t, s_t), & \text{else if } TD_{\text{error}} < T_l \\ p(m_t|s_t), & \text{else} \end{cases} \)

where \{p(m|¬a, s_t)\) is the probability for component \(m\) given that action \(a\) was not performed at state \(s_t\):

\( p(m|¬a, s_t) ∝ p(s_t|m)(1 - p(m|a)) \)

To update the parameters, two criteria are used:

Similarity measure \(d_t\) given by the \( \ell_{\infty} \) norm between the state \(s_t\) and the mean vectors of \(m\).
Evaluation criterion given by \(TD_{\text{error}} < T_l\) to determine if the action was a bad choice.

Architecture Design

In this section I will describe the design and implementation of the proposed model architecture.

Input Description

We base our input on the semantic segmented input image. The image of the segmentation map is separated into six different regions, for each we calculate a weighted histogram of the class distribution. The number of regions represents basic directional information (three vertical divisions) and distance information (two horizontal divisions) of the scene. This approach can facilitate further studies of how the agent learns to control its attention to each region and if an attention mechanism could improve the performance of the task. To further reduce the dimensionality of the input, the semantic labels are clustered into five categories as:

Road
Road-line
Off-road
Static object
Dynamic object

The feature vector of each patch is concatenated into a single state-vector. The state-vector is normalized by its \( \ell_{1} \)-norm. Due to the low density of road lines in the semantic segmentation input image, we weighted the class, road lines, by 20 for the histogram calculations.

In reality, ground-truth is obviously not available, and as a result, we need to estimate the semantic segmentation from available input, such as RGB. For our investigation, we utilize two different types of input data in our experimental setup: the ground truth semantic segmentation input directly from the simulator and the estimated semantic segmentation generated from RGB images by EncNet. The EncNet model is trained offline on images collected from the CARLA simulator.

EncNet: We use an EncNet with the ResNet-101 as the backbone architecture on top of which a special module named Context Encoding Module is stacked. The main reason behind the selection of EncNet over other powerful CNNs is because of the availability of pre-trained EncNet weights on the large and diverse ADE20K. In addition, EncNet has low computation complexity compared to CNNs such as PSPNet and DeepLabv3 and provides better inference speed at runtime.

Decision Making

During training, the agent applies an epsilon-greedy policy to explore the world and to develop its learned concepts. Our decision-making strategy is designed in a way to increase the exploration at the beginning of learning and to diminish as it progresses. Therefore, the policy is gradually shifting from an epsilon-greedy to a greedy one. When learning converges, the agent primarily exploits its learned concepts for decision-making rather than exploring the world.

Our behavior policy is implemented in two steps. First, we use a greedy approach to select the greedy action \( a_{gd} = \arg \max_{a} (p(a|s_{t})) \). Second, we sample an action from the distribution:

\( p_{\pi}(a|s_{t}) = \begin{cases} \frac{1 - \tau}{|A|} + \tau, & \text{if } a = a_{gd} \\ \frac{1 - \tau}{|A|}, & \text{else} \end{cases} \)

where \( \tau \in [0, 1] \) is an increasing temperature to increase the probability of the greedy action as learning progresses.

Reward Design

We select four different reward signals representing important types of failures:

Collision
Off-road
Opposite-lane
Low-speed

These failures generate the reward signal, but only one of them is applied at a time based on its importance:

\( r = \begin{cases} -r_{k1}, & \text{if Collision} \\ -r_{k2} \cdot r_{o}, & \text{else if Off-road} \\ -r_{k3} \cdot r_{l}, & \text{else if Opposite-lane} \\ r_{\text{speed}}, & \text{else} \end{cases} \)

where \( r_{o} \) is calculated as the percentage of the car being off-road, and \( r_{l} \) is the percentage of the car not being in the correct lane. The values of \( r_{o} \) and \( r_{l} \) are received from the CARLA simulator. Finally, \( r_{speed} \) rewards the agent when it drives with a speed \( v_{t} \) relative to the target speed \( v_{target} \) at time step \( t \):

\( r_{\text{speed}} = \begin{cases} -r_{k4} \cdot \left( \frac{v_{t} - v_{\text{target}}}{v_{\text{target}}} \right)^{2}, & \text{if } v_{t} < 0 \\ -r_{k5} \cdot \left( \frac{v_{t} - v_{\text{target}}}{v_{\text{target}}} \right)^{2}, & \text{if } 0 < v_{t} < v_{\text{target}} \\ 0, & \text{else} \end{cases} \)

Additionally, a reward based on the road view, the percentage of the road visible in the image input to the agent, is always applied to align the agent with the road as:

\( r_{t} = r + r_{\text{road-view}} \)

Control Signals

The control signals that the simulator receives are, similarly to a real car, steering, throttle, brake, and a flag for the reverse gear. Our actions are chosen to correspond to four action primitives that are able to fully control the vehicle's velocity and direction, which are:

Drive Forward
Right-Turn
Left Turn
Drive Backward

each corresponding to a certain control signal.

Experiments & Results

There are multiple experiments conducted with proposed approach of this project which are presented in Gharaee et. al (2021) We use four different settings combining training and deployment, each contains nine models and we name them according to the following schedule:

TGDG: Training and Deployment w/ Ground-truth.
TEDE: Training and Deployment w/ Estimate.
TGDE: Training w/ Ground-truth, Deployment w/ Estimate.
TEDG: Training w/ Estimate, Deployment w/ Ground-truth.

I present the results of the models evaluated according to the following metrics:

Off-Road: Being off-road (e.g, sidewalks).
Other-Lane: Being in the meeting lane.
Either: Being off-road or in the meeting lane.
Success: Accomplished tasks.
No Collision: Tasks without collisions.
Score: Average of Either, Success and No Collision.
Dist: Total distance driven in meter.

In these experiments BRL_VBVC is compared with conditional Imitation Learning (IL) and deep Reinforcement Learning (RL) by using the provided pre-trained models and evaluating them in our validation settings.

Benchmark results

The benchmark proposed in Dosovitskiy et. al (2017) is comprised by four different tasks, driving straight forward, one-turn (left or right), and two navigation tasks with multiple turns, each of them using the full road-network including intersections. All tasks except for the final navigation task are set in a static environment without vehicle and pedestrians while the last one contains multiple instances of each kind. The reported metrics is the average kilometers driven between each type of infraction. Neither our nor the other methods have been trained on the environment in Town02 from the CARLA benchmark.