Timely detection of traffic anomalies is critical for ensuring road safety and improving traffic efficiency. While existing convolutional neural network-based unimodal and multimodal methods have shown promise, they are limited in complex-scene adaptability and deep semantic understanding. To address this, this study proposes an optimized vision-language model (VLM) framework for traffic anomaly detection that integrates static image semantic understanding with dynamic multi-frame video reasoning. The framework comprises two stages: enhancing recognition accuracy using image data, and improving complex-scene understanding with video data. The image dataset emphasizes concise, targeted information, while the video dataset incorporates cross-frame actions, trajectory relationships, and temporal semantics for comprehensive and accurate analysis. In the first stage, Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) were employed, but the resulting model still struggled with complex events such as “accidents” and “abnormal parking”. To address the lack of temporal context in static images, the second stage enhances dynamic scene reasoning through multi-frame temporal recognition. A unified evaluation system assesses model performance across both stages. Training used traffic anomaly image data from Shanghai’s highways and urban expressways, covering ten typical event types, with semantic annotations focusing on event type, objects, location, and impact. Experimental results show that the optimized model improved image understanding accuracy from 0.497 to 0.789 and reduced redundancy by 98.6% (from 0.519 to 0.007). In video reasoning tasks, accuracy for “accidents” and “abnormal parking” reached 0.59, a 64.8% improvement over the static-image setting. This study offers an efficient, viable approach for intelligent traffic systems.
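
For context on the first training stage: in its standard form (the abstract does not specify the study’s exact variant), DPO fine-tunes the policy $\pi_\theta$ directly on preference pairs $(x, y_w, y_l)$ against a frozen reference model $\pi_{\mathrm{ref}}$ by minimizing

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\sigma$ is the logistic function, and $\beta$ controls deviation from the reference model.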