
Driver behavior analysis plays a crucial role in improving road safety, insurance assessment, and fleet management. Smartphone-based telematics offers a scalable and cost-effective way to capture driving dynamics through sensors, GPS, and video. This study presents a transformer-based framework for recognizing aggressive and normal driving behaviors using the UAH-DriveSet dataset. Three configurations are investigated: a sensor-only transformer, a sensor and GPS model, and a multimodal transformer combining sensor and vision data through cross-attention fusion. Sensor signals are resampled to a uniform 10 Hz and segmented into 30-second rolling windows for temporal modeling. The multimodal model incorporates a Vision Transformer to extract visual context from selected video frames. Experimental results show that the multimodal transformer achieves the best overall performance, surpassing unimodal approaches in accuracy and robustness. The inclusion of visual information improves the model’s ability to distinguish complex driving scenarios. The proposed approach demonstrates the potential of transformer-based multimodal fusion for smartphone-driven telematics and driver monitoring applications.
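The preprocessing step described above (resampling to a uniform 10 Hz, then cutting 30-second rolling windows) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `resample_uniform` and `rolling_windows`, the use of linear interpolation, and the 1-second stride between windows are all assumptions, since the abstract does not specify them.

```python
from bisect import bisect_right


def resample_uniform(timestamps, values, rate_hz=10.0):
    """Linearly interpolate an irregularly sampled signal onto a uniform grid.

    Assumes `timestamps` (seconds) are strictly increasing. Linear
    interpolation is an assumption; the abstract only states the 10 Hz target.
    """
    if len(timestamps) != len(values) or len(timestamps) < 2:
        raise ValueError("need at least two (timestamp, value) pairs")
    step = 1.0 / rate_hz
    start, end = timestamps[0], timestamps[-1]
    # Number of grid points that fit in [start, end], with a small tolerance.
    n_out = int((end - start) * rate_hz + 1e-9) + 1
    out = []
    for i in range(n_out):
        t = start + i * step
        # Index of the first original sample after t, clamped to a valid pair.
        j = min(max(bisect_right(timestamps, t), 1), len(timestamps) - 1)
        t0, t1 = timestamps[j - 1], timestamps[j]
        w = (t - t0) / (t1 - t0)
        out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out


def rolling_windows(samples, rate_hz=10.0, window_s=30.0, stride_s=1.0):
    """Split a uniformly sampled signal into overlapping fixed-length windows.

    `stride_s` is a hypothetical choice; only the 30 s window length comes
    from the abstract.
    """
    size = int(round(window_s * rate_hz))      # 300 samples at 10 Hz
    stride = int(round(stride_s * rate_hz))    # 10 samples at 10 Hz
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, stride)]
```

At 10 Hz each 30-second window holds 300 samples per sensor channel. A smaller stride yields more, heavily overlapping training windows, while a larger stride reduces redundancy between neighbouring windows.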
| ID | pc641 |





