Scene understanding and scene analysis are essential in advanced driver assistance systems (ADAS) and autonomous driving (AD) for several reasons:
- High data volume: Autonomous and semi-autonomous vehicles generate vast amounts of image data from multiple cameras operating at high frame rates. Typically, only specific frames containing relevant information are of interest. Effective scene understanding helps filter out irrelevant frames, reducing data volume and improving processing efficiency.
- Diverse and representative data: The training dataset, enriched and filtered by Vision Language Models (like Phi-4-multimodal-instruct), should include a wide variety of driving scenarios, weather conditions, lighting conditions, and road types to ensure the system can handle different real-world situations.
- Edge cases and rare events: While most frames depict normal driving conditions, a small fraction contains critical anomalies such as roadblocks, accidents, or hazardous conditions. Efficient scene understanding helps identify these critical scenes and build systems that can handle unexpected situations and long-tail scenarios.
- Continuous updates: The dataset should be continuously updated with new data to reflect changes in the environment and improve the system's performance over time. Updates are especially important for capturing new types of vehicles, changing road infrastructure, and evolving driving behaviors.
Challenges in scene understanding
Using Large Language Models (LLMs) for scene understanding in autonomous vehicles presents challenges: vehicles generate terabytes of data daily, which makes scalable image analysis computationally demanding.
Small Language Models (SLMs) are more efficient, but they lack the domain knowledge required for effective scene analysis and need fine-tuning on domain-specific data.
Approach
To address these challenges, fine-tuning vision language/multimodal models can help identify specific scenes and filter camera frames before further processing. A Student-Teacher architecture (model distillation) is effective in this context.
The illustrated workflow starts with a sensor camera installed on the vehicle, which captures terabytes of data in real-time. The Phi-4-multimodal-instruct fine-tuned model processes the data by performing initial scene understanding and tagging the images with relevant information. For instance, a frame might be tagged with classifications like critical, severity levels such as high, and justifications explaining the scene context, such as noting a vehicle drifting and producing visible dust or smoke, to indicate a potential hazard.
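For illustration, a tag payload for a single frame might look like the following sketch; the field names and values are hypothetical and would be adapted to your pipeline:

```python
# Hypothetical tag emitted by the edge model for one camera frame.
frame_tag = {
    "classification": "critical",  # "normal" or "critical"
    "severity": "high",            # "low" or "high"; only present for critical frames
    "justification": (
        "A vehicle ahead is drifting and producing visible dust or smoke, "
        "indicating a potential hazard."
    ),
}
```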
The Teacher model is a large, complex model, such as a GPT model, pre-trained on extensive datasets across a wide range of tasks. Due to this vast training, the Teacher model possesses comprehensive knowledge and high accuracy in various tasks, including scene understanding, which allows it to generate ground truth labels for the fine-tuning process.
The Student model, on the other hand, is smaller and more efficient. The model is trained to mimic the performance of the Teacher model for specific tasks. The aim is to achieve similar accuracy with reduced computational requirements.
The distillation process involves three main steps.
- Selection of Teacher Model: The Teacher model is a general-purpose model that performs well across various tasks, including scene classification. The model is trained on a vast set of examples, enabling it to generalize effectively to different types of data, including data relevant to ADAS.
- Knowledge Transfer: The Student model is then trained specifically for ADAS-related tasks, using the Teacher model's outputs (predictions) as soft labels. Training minimizes the difference between the Student model's predictions and the Teacher model's predictions for scene understanding and hazard detection tasks (see the sketch after this list).
- Optimization: During this process, techniques such as layer freezing and parameter-efficient tuning methods (for example, LoRA) are employed to ensure the Student model learns effectively while remaining computationally efficient.
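The following is a minimal sketch of the soft-label objective from the knowledge-transfer step, assuming a classification head over the scene classes; the function name, temperature value, and tensor shapes are illustrative rather than the exact loss used in this pipeline:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label knowledge-distillation loss (sketch).

    Both tensors have shape (batch, num_classes). The temperature softens both
    distributions so the Student also learns the Teacher's relative confidences.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the Teacher and Student distributions; the T^2
    # factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2
```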
The most interesting and relevant images are then stored in the cloud for further analysis. There, advanced models like GPT-4o or o3 perform deep scene analysis and provide more detailed and refined tagging of the stored, pre-filtered images. This iterative process improves the quality of scene understanding.
The student model, that is, Phi-4-multimodal-instruct, undergoes periodic fine-tuning using the improved tags generated by the deep scene analysis, ensuring that the model remains up to date and accurate. This continuous improvement cycle allows the updated model to replace the old one, so the system consistently enhances its scene understanding and hazard detection capabilities.
This architecture enables the deployment of models that perform scene understanding for critical situations and can be used to filter long-tail scenarios.
Fine-tuning process overview
The fine-tuning process for Vision Language Models uses a Student-Teacher architecture for efficient edge deployment. The process begins with a curated dataset of driving scenes, which is enhanced with synthetic labels generated by an LLM (such as GPT-4o or o3), providing rich annotations. The base model (such as Phi-4-multimodal-instruct) is loaded, incorporating a Hugging Face (HF) processor to prepare the data for training.
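As a minimal loading sketch, assuming the public Hugging Face checkpoint ID `microsoft/Phi-4-multimodal-instruct` (verify the ID and arguments against the current model card):

```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"

# The processor bundles the tokenizer and the image preprocessing pipeline.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # the Phi multimodal checkpoints ship custom model code
    torch_dtype="auto",
    device_map="auto",
)
```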
Training uses the Teacher model to provide high-quality predictions and guidance, followed by fine-tuning on domain-specific data to tailor the model for road scenes and anomalies. Techniques such as PEFT, quantization, LoRA/QLoRA, and DeepSpeed optimize the model's performance while reducing computational requirements. GPU resources expedite training and fine-tuning, using Azure Machine Learning for scalable and efficient computation. Hugging Face's data collator batches the data, preparing it for evaluation.
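As a sketch of how a LoRA adapter might be attached with the `peft` library (the rank, alpha, and `target_modules` names here are assumptions that must be matched to the actual model architecture):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension (assumed)
    lora_alpha=32,                        # scaling factor for adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names are assumed
    task_type="CAUSAL_LM",
)

# Wrap the base model so that only the small adapter weights are trained;
# the frozen base layers implement the layer freezing mentioned earlier.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```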
Metrics such as accuracy, precision, and recall test the model's effectiveness, informing further fine-tuning and optimization efforts. The final output is a fine-tuned model that achieves high performance with reduced computational needs, suitable for deployment on edge devices. This pipeline, orchestrated through Azure services, ensures that the model is robust, efficient, and capable of scene understanding in ADAS applications.
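Evaluation of the binary normal/critical decision could be scripted as follows; the labels are illustrative placeholders, and in this setting recall on the critical class is the metric that tracks false negatives:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Placeholder labels: 1 = critical, 0 = normal.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # predicted critical that were critical
print("recall:   ", recall_score(y_true, y_pred))     # true critical that were caught
```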
Example prompt with chain-of-thought reasoning and output data structure
To improve the model's understanding and classification accuracy, structured prompts with chain-of-thought (CoT) reasoning can be used. This technique involves breaking down the reasoning process into sequential steps, which helps the model to understand the context and make more accurate predictions.
In this example, the model follows a step-by-step reasoning process to analyze the image, detect relevant objects, identify critical events, and classify the scene. A minimal inference sketch follows the prompt below.
As the world's best image analyst, your task is to analyze images taken from the ego vehicle while driving. Your goal is to provide a detailed analysis of the scenes in the images and, based on this analysis, classify the scenes as 'normal' or 'critical' from an autonomous driving perspective.
Classification Definitions:
- Normal: Typical driving conditions where the driver can proceed without major risks.
- Critical: Situations that pose a potential risk or are unusual, such as accidents, erratic driving behavior, unexpected obstacles, or pedestrians in dangerous positions. If the scene is critical, the driver needs to change their behavior, such as slowing down.
Provide a severity level:
- Low: The driver should slow down and be ready to brake or take action.
- High: The driver needs to take immediate action, like braking or changing course.
Chain-of-Thought Reasoning:
1. Describe the Scene
2. Identify the relevant objects affecting the trajectory of the ego vehicle
3. Is it common to see this scenario while driving?
4. Can the ego vehicle drive safely in its presence considering the relevant objects?
5. Can this situation cause the ego vehicle to make unpredictable or unsafe maneuvers?
6. Based on the previous points, classify the image as "normal" or "critical". If critical, provide the severity level.
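A minimal sketch of sending this prompt together with a camera frame to the model follows; it reuses the `processor` and `model` from the loading sketch earlier, the frame path is hypothetical, and the chat markers should be verified against the model card:

```python
from PIL import Image

# COT_PROMPT holds the full chain-of-thought prompt text shown above (truncated here).
COT_PROMPT = "As the world's best image analyst, your task is to analyze images ..."

image = Image.open("frame_00123.jpg")  # hypothetical frame path
prompt = f"<|user|><|image_1|>{COT_PROMPT}<|end|><|assistant|>"

# Prepare multimodal inputs and generate the analysis and classification.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```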
Code references
For more detailed instructions on fine-tuning Phi models for vision tasks, including scene analytics in ADAS, you can refer to the PhiCookBook - FineTuning_Vision and the Phi-4-multimodal-instruct - sample_finetune_vision sample from Hugging Face. These resources provide comprehensive guidelines and examples that you can use to fine-tune the Phi-4-multimodal-instruct model for your specific use case.
Inference comparison examples with real-world scenarios
Important
The scene understanding described in the following sections is used only for offline processes to identify and curate datasets, not in real vehicles.
Consider the following three examples taken from real-world scenarios.
Note
The fine-tuned models mentioned in the following sections are trained with labels from GPT-4o. However, the images in these sections are new to the models.
Scenario 1: Critical scene
Consider the following example of a clearly critical scene, where a person is lying on the road. This situation poses a potential hazard for both the individual and oncoming traffic. Here, you compare the inference results of different models: the vanilla versions of Phi-3.5-vision-instruct and Phi-4-multimodal-instruct, their fine-tuned versions, and GPT-4o as the teacher. This comparison illustrates the varying levels of detection and analysis accuracy.
The vanilla Phi-3.5-vision-instruct model generates an incorrect response, recognizing a car accident instead of a person lying on the road. The vanilla Phi-4-multimodal-instruct model isn't incorrect, but it fails to classify the scene as critical because it doesn't consider the image from the ego vehicle's perspective.
Inference results for a critical scene using Vanilla Phi-3.5-vision-instruct
- Analysis: The image depicts a black car rolled over on its side on the side of a two-lane road surrounded by trees. The ego vehicle is stopped and facing the scene, with the driver's attention directed towards the overturned vehicle. There are no visible pedestrians or other vehicles in the immediate vicinity that appear to be causing the accident. The weather is clear, and there's no indication of any immediate danger to the ego vehicle from the accident itself. The overturned vehicle doesn't seem to be obstructing traffic, and there are no other vehicles approaching. The situation appears to be under control, and the ego vehicle can proceed without the need for immediate action.
- Classification: Normal
- Severity Level: None
Inference results for a critical scene using Vanilla Phi-4-multimodal-instruct
- Analysis: The image shows a person lying on the ground on a forest path. The image seems to capture an accident, where the person has fallen and is lying on the ground. There's no immediate danger to other individuals as the path isn't a roadway for vehicles and there are no other people in sight. The situation is likely an accident involving the person who is lying down.
- Classification: Normal
- Severity Level: None
Inference results for a critical scene using Fine-tuned Phi-3.5-vision-instruct
- Analysis: The image shows a person lying on the road, which appears to be an accident. The person is blocking the path of the ego vehicle, creating a significant hazard. The road is narrow and surrounded by trees, which might limit visibility and escape routes. Immediate action is required to avoid a collision with the person on the road.
- Classification: Critical
- Severity Level: High
Inference results for a critical scene using Fine-tuned Phi-4-multimodal-instruct
- Analysis: The image shows a person lying on the road in the path of the ego vehicle. The image shows a critical situation as the person could be injured or in danger. The ego vehicle needs to stop to avoid hitting the person.
- Classification: Critical
- Severity Level: High
Inference results for a critical scene using GPT-4o (Teacher model)
- Analysis: A person is lying in the middle of the road, creating an immediate hazard. The ego vehicle must urgently slow down or stop to avoid harm.
- Classification: Critical
- Severity Level: High
Scenario 2: Low-severity scene
Consider a borderline example where a pedestrian is walking close to the edge of the street and the vehicle should be ready to brake or take action.
Inference results for a low-severity scene using Vanilla Phi-3.5-vision-instruct
- Analysis: The image shows a car approaching a pedestrian crossing with a warning sign. The vehicle is driving in the right lane, and there's a cyclist on the sidewalk. The environment suggests a residential area with some greenery and trees. There are no immediate signs of erratic behavior or unusual obstacles on the road.
- Classification: Normal
- Severity Level: None
Inference results for a low-severity scene using Vanilla Phi-4-multimodal-instruct
- Analysis: The image shows a person walking on a path in a natural setting. There are no vehicles or any other elements in the image that suggest any risk or unusual situation. The person appears to be walking safely and there are no indications of any potential hazards.
- Classification: Normal
- Severity Level: None
Inference results for a low-severity scene using Fine-tuned Phi-3.5-vision-instruct
- Analysis: The image shows a single pedestrian walking on the road from the perspective of the ego vehicle. The pedestrian isn't in a crosswalk and is directly on the path of the vehicle. This situation poses a potential risk as the vehicle needs to slow down and be prepared to stop to avoid a collision.
- Classification: Critical
- Severity Level: Low
Inference results for a low-severity scene using Fine-tuned Phi-4-multimodal-instruct
- Analysis: The image shows a road with no visible vehicles, only a person walking on the sidewalk. The road is clear, and there are no signs indicating any hazards or directions. The person is walking at a normal pace, and there's no indication of any unusual behavior or potential risk.
- Classification: Normal
- Severity Level: None
Inference results for a low-severity scene using GPT-4o (Teacher model)
- Analysis: A single pedestrian is walking along the middle of a narrow, tree-lined road. While this situation isn't an extreme hazard, the driver must slow down and give the pedestrian space, indicating the need for extra caution rather than normal driving behavior.
- Classification: Critical
- Severity Level: Low
Scenario 3: Normal scene
Consider a normal scene where there are no visible hazards or unusual conditions.
Inference results for a normal scene using Vanilla Phi-3.5-vision-instruct
- Analysis: The image depicts a road that appears to be in a rural or less populated area with multiple trees on both sides, providing a canopy that might affect visibility. There are no visible pedestrians, cyclists, or other vehicles that seem to be obstructing or in the way of the vehicle. The road markings include a 'slow' sign, which suggests there might be a need for caution, but no immediate hazards are evident. The daylight and visibility are good, and the weather seems clear.
- Classification: Normal
- Severity Level: None
Inference results for a normal scene using Vanilla Phi-4-multimodal-instruct
- Analysis: The image shows a road with trees on both sides and no visible traffic signs or signals. The road appears to be in good condition with no immediate obstructions or hazards. There are no pedestrians, animals, or vehicles in the immediate vicinity of the car, and the traffic conditions seem to be normal.
- Classification: Normal
- Severity Level: None
Inference results for a normal scene using Fine-tuned Phi-3.5-vision-instruct
- Analysis: The image shows a narrow, tree-lined road with no visible vehicles, pedestrians, or obstacles. The road appears to be clear and there are no immediate hazards or unusual conditions present. The weather is clear and visibility is good.
- Classification: Normal
- Severity Level: None
Inference results for a normal scene using Fine-tuned Phi-4-multimodal-instruct
- Analysis: The image shows a vehicle driving on a road with trees on both sides. The road markings are visible, and there are no signs of any unusual circumstances that would require immediate attention. The driver appears to be in a normal driving environment.
- Classification: Normal
- Severity Level: None
Inference results for a normal scene using GPT-4o (Teacher model)
- Analysis: The image shows a quiet, tree-lined rural road with no visible traffic, pedestrians, or obstacles. There are no immediate hazards or unusual conditions.
- Classification: Normal
- Severity Level: None
Conclusion
The vanilla models tend to classify critical images as normal, resulting in a higher number of false negatives. Their responses also vary more and sometimes don't adhere to the desired output format, producing long descriptions that add no value or describe the scene incorrectly. This limitation underscores the importance of fine-tuning.
Fine-tuned models like Phi-3.5-vision-instruct and Phi-4-multimodal-instruct demonstrate improved accuracy in scene analysis by correctly identifying and classifying both critical and normal driving situations. Fine-tuning helps in reducing false negatives, ensuring that critical scenes aren't incorrectly classified as normal, and improving the overall reliability of the model in real-world applications. Additionally, you can observe that the analyses provided by the fine-tuned models are more concise and focused on the most relevant aspects of the scene. This conciseness is also important for reducing latency, as fewer tokens are generated, leading to faster processing times.
Fine-tuned models such as Phi-3.5-vision-instruct or Phi-4-multimodal-instruct can effectively understand whether ingested scenes exhibit critical or unusual situations, which can be utilized in DataOps filtering processes to address long-tail scenarios.