The objective of this doctoral dissertation is to assess road crash risk by fusing infrastructure, traffic, and driving behaviour data. To this end, two distinct databases were developed. The first covered motorway segments and included road crash, traffic, road geometry and driver behaviour data, while the second covered urban and interurban road segments of a broader area for which crash and traffic data were unavailable.
The negative binomial regression model for the motorway segments showed a positive and statistically significant relationship between road crash frequency and harsh driving events. Subsequently, based on the number of road crashes per segment length and traffic volume, four crash risk levels were defined for the motorway segments using hierarchical clustering. These four levels served as the response variable in five machine learning classifiers whose predictors related to road geometry and risky driving behaviours. Among the five classifiers, Random Forest demonstrated the best classification performance across all crash risk levels. SHAP values indicated that harsh braking events are a more suitable Surrogate Safety Measure than harsh accelerations for predicting crash risk level.
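The grouping of segments into four crash risk levels can be illustrated with a minimal sketch. It assumes, purely for illustration, that crashes are normalised to crashes per million vehicle-kilometres and that clusters are merged by centroid linkage; the abstract specifies neither the exact normalisation nor the linkage criterion, and the segment values and function names below are hypothetical.

```python
def crash_rate(crashes, length_km, aadt, years=1):
    """Crashes per million vehicle-kilometres (assumed normalisation)."""
    exposure = length_km * aadt * 365 * years / 1e6
    return crashes / exposure


def risk_levels(rates, k=4):
    """Agglomerative clustering of 1-D rates into k levels.

    Centroid linkage is an assumption made for this sketch.
    Returns one level per segment, 0 = lowest crash rate.
    """
    clusters = [[i] for i in range(len(rates))]

    def centroid(cluster):
        return sum(rates[i] for i in cluster) / len(cluster)

    # Repeatedly merge the two clusters with the closest centroids.
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = abs(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]

    # Rank the final clusters by centroid so level numbers are ordered.
    order = sorted(range(len(clusters)), key=lambda c: centroid(clusters[c]))
    levels = [None] * len(rates)
    for level, c in enumerate(order):
        for i in clusters[c]:
            levels[i] = level
    return levels


# Hypothetical segments: (crashes, length in km, annual average daily traffic)
segments = [
    (4, 4.0, 20000), (3, 2.0, 20000),
    (10, 2.0, 20000), (11, 2.0, 20000),
    (25, 2.0, 20000), (26, 2.0, 20000),
    (50, 2.0, 20000), (52, 2.0, 20000),
]
rates = [crash_rate(c, length, aadt) for c, length, aadt in segments]
levels = risk_levels(rates, k=4)
print(levels)
```

In this toy data the first segment has more crashes than the second but twice the length, so its rate is lower; clustering on the rate rather than the raw count keeps both in the lowest level.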
Accordingly, harsh braking events were used as the dependent variable in the analyses of urban and interurban segments of the broader road network. Because spatial autocorrelation was identified, spatial models were developed alongside the non-spatial ones to account for spatial dependencies. The number of trips per segment, segment length and linearity, speeding and mobile phone use were found to be positively correlated with harsh braking events. Conversely, motorways exhibited fewer harsh braking events than other road types. The number of trips per road segment was the most influential predictor, highlighting its importance as a proxy measure of risk exposure. In terms of model performance, the Spatial Lag Model outperformed both the log-linear model and the Spatial Error Model. The spatial Zero-Inflated Negative Binomial model also fitted better than its non-spatial counterpart. Finally, the Spatial Random Forest model reduced the absolute values of spatial autocorrelation in the residuals and fitted the observed data better than the conventional Random Forest model.
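Spatial autocorrelation in model residuals, as discussed above, is commonly quantified with global Moran's I. The abstract does not state which statistic was used, so the following is only a minimal sketch under that assumption, with a hypothetical binary adjacency matrix for four consecutive road segments and made-up residual values.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed on n spatial units.

    weights[i][j] is the spatial weight between units i and j
    (here a hypothetical binary adjacency matrix, zero diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / s0) * (cross / variance_sum)


# Four segments in a chain: each is adjacent to its neighbours only.
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, -1.0, -1.0]    # similar residuals sit next to each other
alternating = [1.0, -1.0, 1.0, -1.0]  # neighbours always differ

i_clustered = morans_i(clustered, weights)      # positive (1/3)
i_alternating = morans_i(alternating, weights)  # negative (-1)
```

Values near zero suggest little spatial structure in the residuals, which is the sense in which a spatial model can be said to reduce the absolute value of residual autocorrelation.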