Whack All Your Moles at Once: Addressing Hidden Safety Flaws in Autonomous Vehicles

12th Jul 2023 dRISK

By Lorenzo Niccolini and Hamish Scott



The reason we still don’t have ubiquitous AVs is that dealing with the tail-end complexities of the real world turns out to be really hard. A manifestation of this is the ‘whack-a-mole’ problem: you – an AV/ADAS developer – encounter a problem with the vehicle’s performance on a specific case, say unprotected turns. You work on a fix and ship a new update. Success. Well, nearly. Unfortunately, you later notice that you are getting rear-ended a lot more than you used to… So now you go and fix that issue successfully, and your AV drives more boldly, in a way that more closely matches the expectations of the human drivers following it, while hopefully maintaining performance on unprotected turns. But lo and behold, you now find that you’re driving more boldly than your perception can handle, and you get worse at reacting to jaywalking pedestrians. Or, what’s worse, you as the AV developer don’t even recognise the new failure modes before your users start posting them on YouTube.

The cycle continues: you address each problem, only to encounter new ones you did not anticipate. In other words, you whack one issue only to find a new one cropping up, which limits your ability to improve performance monotonically.

So the question is: how could you have anticipated this? Yes, you could have written perfect code that accounts for every eventuality. But that turns out to be pretty hard. dRISK has taken a different approach: we’ve built a training, testing and validation platform that makes the above discoveries self-evident, allowing developers to uncover and understand hidden safety flaws faster and so avoid endless whack-a-mole.

Training and Testing on Edge Cases

We’ll start with some groundwork. Simulation-based testing, including re-sim (replaying logs from real-world failures), is really the only way to reach the scale needed to ensure that your autonomous vehicle is safe and robust enough for deployment in the real world. That’s not to say that real-world testing is not important – it is. But by leveraging simulation properly you can test on a far larger set of scenarios for a fraction of the cost.

More important, though, is what you’re testing on. The majority of your focus needs to be on the edge cases: the individually unlikely cases that together make up the whole risk space. A cyclist running a red light, a child running into the road after a ball, a fallen wire in the middle of the road. The sheer variety of edge cases makes the space extremely sparse and high-dimensional. Driving thousands of miles on sunny, low-traffic roads counts essentially as just one scenario, in the sense that there are few features associated with it. But when we start to consider more edge cases, the feature space explodes, with each scenario occupying its own unique area of risk. Of course, this space is not random – it has structure, and harnessing this structure is exactly the problem.

Different sides of the elephant

To dig a little deeper into this complexity we will make use of another animal metaphor: the parable of the ‘blind men and the elephant’, which tells the story of six blind men who come across an elephant and try to understand what it is by touching it. However, each of them touches a different part of it, and they all misinterpret what the animal is.

Just like with the elephant, if you have access only to a narrow view of the scenario risk space, you will mistake a pattern for something it is not, fix the wrong problem, and therefore encounter a new, unexpected one. This is exactly why you end up with whack-a-mole in the first place.

To avoid this you need a way to see the whole picture, with the space in as much context as possible. That’s why all our data is stored in a single data structure and is accessible in each of its dimensions through dRISK Edge. Three particular dimensions we will now explore in a bit more detail are (from left to right in Figure 1):


    • Taxonomic view: we annotate scenarios (and indeed we have automatic processes to do this) with many semantic features describing the content of each, e.g. the number of entities in a scene, the types of entities in a scene, the intersection type, the maneuver type, etc.
    • Geospatial view: we can also look at the scenarios geographically: where in the world or along the route did the scenario take place.
    • Embedding view: to get the broadest view of our data we can use combinations of all the above features to create embedding views. The upshot is that we can see the scenarios in a much richer space. More on this in a moment.
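
To make the three views concrete, here is a minimal sketch of how a single scenario record could carry all of them at once. The `Scenario` class, its field names and the tag values are hypothetical and chosen purely for illustration; they do not describe the actual dRISK data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical scenario record; all field names are illustrative."""
    scenario_id: str
    # Taxonomic view: semantic annotations as key/value tags.
    tags: dict = field(default_factory=dict)
    # Geospatial view: (latitude, longitude) of where the scenario occurred.
    location: tuple = (0.0, 0.0)

    def feature_vector(self, vocabulary):
        """One-hot encode the tags against a fixed list of (key, value) pairs.

        Vectors like this are one possible input to an embedding view.
        """
        return [1.0 if self.tags.get(k) == v else 0.0 for k, v in vocabulary]


scenarios = [
    Scenario("s1", {"entity_type": "cyclist", "maneuver": "unprotected_turn"}, (51.50, -0.12)),
    Scenario("s2", {"entity_type": "pedestrian", "maneuver": "straight"}, (51.51, -0.10)),
]

# The vocabulary is every (key, value) tag seen in the dataset, in a stable order.
vocabulary = sorted({(k, v) for s in scenarios for k, v in s.tags.items()})
vectors = [s.feature_vector(vocabulary) for s in scenarios]
```

Keeping the tags, the location and the derived vector on the same record is what lets you jump between views of the same scenario rather than analysing each view in isolation.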


Figure 1: Taxonomic, geospatial and embedding views of a subset of scenarios from the dRISK knowledge graph.


A worked example: improving AVs trained with deep reinforcement learning

To see how we can use each of these to avoid whack-a-mole, let’s go through a concrete example in which we trained an AV with deep reinforcement learning. One of the most intuitive ways to visualize the AV’s performance is through a geospatial view (Figure 2). By examining the performance geographically over the route, developers can start to identify patterns of behavior. However, this view alone cannot reveal the complete picture, as other features may be driving the performance distribution.
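
A geospatial summary of this kind can be sketched as a simple aggregation. The function below is an illustrative assumption, not dRISK’s implementation: it takes hypothetical `(distance_along_route_m, score)` pairs, with made-up scores between 0 and 1, and averages them over fixed-length stretches of the route.

```python
from collections import defaultdict


def mean_score_by_segment(results, segment_length_m=100.0):
    """Average a performance score over fixed-length stretches of a route.

    results: iterable of (distance_along_route_m, score) pairs.
    Returns {segment_index: mean_score}, ordered by segment index.
    """
    buckets = defaultdict(list)
    for distance_m, score in results:
        buckets[int(distance_m // segment_length_m)].append(score)
    return {seg: sum(v) / len(v) for seg, v in sorted(buckets.items())}


# Made-up example data: scores near 1 are good, scores near 0 are bad.
results = [(20.0, 0.9), (80.0, 0.7), (130.0, 0.2), (170.0, 0.4), (250.0, 1.0)]
segment_scores = mean_score_by_segment(results)
worst_segment = min(segment_scores, key=segment_scores.get)
```

Here the worst stretch is the second 100 m segment, but the aggregation alone says nothing about why performance is poor there, which is the limitation described above.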

Figure 2: Geographical performance distribution. Blue indicates good performance and red bad.


To gain a more comprehensive understanding we can turn to the embedding view. To create this, we first learn a deep embedding of the scenarios in the dataset based on their features. These include the types of entities, their interactions and the road geometry. To create a view of the embedding, we use principal component analysis (PCA) to perform dimensionality reduction. The result is a space that organizes scenarios by their semantic similarity. Clusters in this view are more likely to represent similar patterns of behavior, enabling the identification of specific failure modes.
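
The PCA step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `embeddings` below are made-up stand-ins for the output of a learned deep embedding (the learning step itself is not shown), and PCA is implemented with power iteration on the covariance matrix so that the example needs only the standard library. It also assumes the data varies in at least two directions. In practice you would more likely reach for an established numerical library.

```python
import math
import random


def pca_2d(vectors, iterations=200, seed=0):
    """Project row vectors onto their top two principal components."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(row[j] for row in vectors) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iterations):
            # Power iteration: repeatedly multiply by the covariance matrix.
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            # Deflation: remove any part along components already found.
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm < 1e-12:
                break
            v = [a / norm for a in w]
        components.append(v)
    # Coordinates of each centered row along the two components.
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in centered]


# Made-up 4-dimensional "embeddings": two pairs of similar scenarios.
embeddings = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
points = pca_2d(embeddings)
```

In the resulting 2D points, the first two scenarios land on one side of the first axis and the last two on the other, which is the kind of grouping by similarity that the embedding view relies on.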

Figure 3: Performance distribution over a deep embedding.


While the embedding view provides a holistic perspective, how to interpret the clusters and identify failure modes may not be immediately evident. This is where the rich taxonomic feature space, also encoded in our knowledge graph, comes into play. We can use the raw feature data of the scenarios in the clusters to produce an interpretation of the failure mode.

A few short steps of analysis can identify the features that best describe the cluster of poor performance. For example, we may find that the cluster consists mostly of scenarios with occluded pedestrians crossing the road. An interpretable description of the cluster will allow us to identify the true cause of the failure mode.
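
One simple way to carry out such an analysis is to compare how often each taxonomic tag appears inside the poorly performing cluster with how often it appears elsewhere. The sketch below assumes each scenario is reduced to a set of tags; the function name, the tags and the data are all hypothetical, and the frequency difference is just one possible scoring choice.

```python
from collections import Counter


def describe_cluster(cluster_tags, other_tags, top_k=3):
    """Rank tags by how much more frequent they are inside the cluster.

    cluster_tags / other_tags: lists of tag sets, one set per scenario.
    Returns [(tag, frequency_in_cluster - frequency_elsewhere), ...].
    """
    inside = Counter(tag for tags in cluster_tags for tag in tags)
    outside = Counter(tag for tags in other_tags for tag in tags)
    scores = {
        tag: inside[tag] / len(cluster_tags) - outside[tag] / len(other_tags)
        for tag in set(inside) | set(outside)
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Made-up tags for a cluster of poor performance and for all other scenarios.
failing_cluster = [
    {"pedestrian", "occluded", "crossing"},
    {"pedestrian", "occluded", "night"},
    {"pedestrian", "occluded", "crossing"},
]
rest = [
    {"pedestrian", "crossing"},
    {"cyclist", "night"},
    {"vehicle", "unprotected_turn"},
    {"pedestrian", "night"},
]
top = describe_cluster(failing_cluster, rest)
```

With this data the top-ranked tags are "occluded", "pedestrian" and "crossing", which reads directly as "occluded pedestrians crossing the road".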

Figure 4: Understanding the cause of the failure mode using the taxonomic and embedding views


Once the failure mode is identified, we can proceed with the necessary fixes. This might mean fixing bugs in the code, retraining with a better training data distribution, or enhancing the AV’s architecture. By following this approach developers can achieve continuous improvement over the space of risk, addressing safety-critical scenarios effectively. Let’s conclude with a couple of other examples from dRISK customers. 

Worked examples: dRISK customers avoiding whack-a-mole

Figure 5 shows the improvements that edge-case testing brought to Hitachi Astemo’s motion planner, taken from a paper we published together with Hitachi Astemo.

Figure 5: Improved performance on edge-case scenarios across the risk space. Results from: Souflas, I., Lazzeretti, L., Ahrabian, A., Niccolini, L. and Curtis-Walcott, S., 2022. Virtual Verification of Decision Making and Motion Planning Functionalities for Automated Vehicles in Urban Edge Case Scenarios. SAE International Journal of Advances and Current Practices in Mobility, 4(2022-01-0841), pp.2135-2146.


Like the embedding views above, these views show the performance of the planner in a scenario space organized by similarity. The three main things to take away from the figure above are the following:

      1. Each node (dot) represents an edge case scenario on which Hitachi’s motion planner was tested. 
      2. The edge cases (nodes) are positioned in the space such that scenarios close to each other are similar while those far apart are more different.
      3. The larger nodes represent collisions produced by the planner. 


After the planner was exposed to edge cases, its collision rate improved five-fold, as can be seen by comparing the left and right images.

A second example can be seen in Figure 6, which shows an (anonymised) progression view from a dRISK customer in which a set of key indicators is tracked over time. This highlights a different advantage of this approach: having constant visibility of performance over the entire risk space. This is invaluable for avoiding hidden failure modes while developing an AV. We can monitor these views to understand how performance changes with every update. By combining multiple views into the space of risk with the analysis of many taxonomic features, developers can gain comprehensive insights into AV performance. This empowers them to identify failure modes accurately and apply targeted fixes, resulting in safer and more reliable autonomous vehicles.
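
Monitoring how performance changes with every update can be sketched as a per-category comparison between two releases. The function, the category names and the pass rates below are hypothetical and made up for illustration; they echo the whack-a-mole story from the introduction rather than any real customer data.

```python
def find_regressions(before, after, tolerance=0.0):
    """Return scenario categories whose pass rate dropped between releases.

    before / after: {category: pass_rate in [0, 1]}.
    Categories missing from `after` are skipped rather than guessed at.
    """
    regressions = {}
    for category, old_rate in before.items():
        new_rate = after.get(category)
        if new_rate is not None and old_rate - new_rate > tolerance:
            regressions[category] = (old_rate, new_rate)
    return regressions


# Made-up pass rates: the update fixes unprotected turns but hurts elsewhere.
release_1 = {"unprotected_turn": 0.70, "rear_end_risk": 0.95, "jaywalking": 0.90}
release_2 = {"unprotected_turn": 0.92, "rear_end_risk": 0.80, "jaywalking": 0.90}
moles = find_regressions(release_1, release_2, tolerance=0.02)
```

Running a check like this over the whole risk space on every update surfaces the new mole (here, "rear_end_risk") before it reaches users, instead of after.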

Figure 6: Example anonymised dRISK customer progression run. Performance of a key set of indicators improving over time.