Predicting Denver Traffic Accidents using machine learning techniques in Python

Project Overview

The number of car accidents in Denver has been steadily increasing over the past few years. While almost everyone agrees that human costs in terms of lost lives and injuries are unacceptable and that economic consequences, for example, high car insurance premiums, are felt by many, there is no consensus on what factors are driving this trend. So, I came up with this exploratory project; its main goal is to analyze car accident data for the last 5 years in order to try to find variables that can be used as predictors for future car accidents.

In casual conversations and on social media people often mention at least four stereotypical reasons for growing number of accidents – explosive population growth due to out of state migration, alcohol consumption, recreational marijuana use and drugs in general, and everchanging Colorado weather. So, in this project, I would like to assess the validity of these claims.

The main source of accidents data for this project is the traffic accident dataset maintained by the City and County of Denver Police Department. It contains data about all accidents with reported damage more than 1ooo dollars or involving fatalities, injuries or evidence of drug or alcohol use for the past 5 years. It is a dynamic dataset, it is updated every workday, including refining information about previously reported incidents. When I downloaded it for my project, it contained 164116 observations of 20 variables. Additional sources of information included data about retail alcohol and recreational marihuana sales in the state of Colorado, monthly employment statistics (as a proxy for population data), weather observations and others.

Due to the scope of the project and the amount of data preprocessing, it was split into several jupyter notebook files with interim results saved as csv files, so each stage of the project can be run independently from others.

EDA (Exploratory Data Analysis)

Please see MSDSProjectGitHub/MSDS692_trafficAccidents.ipynb file.

Implementation

Data collection and preprocessing:

Please see:

MSDS692_AlcoholSales.ipynb
MSDS692_Holidays.ipynb
MSDS_LaborForce.ipynb
MSDS692_MJSales.ipynb
MSDS692_SunriseSunsetData.ipynb
MSDS692_WeatherData.ipynb

Merging data and preparing for ML:

MSDS692_IndividualAccidents.ipynb

ML models and discussion of the results:

Linear Regression: MSDS692_MonthlyAccidents.ipymb
KMeans Clustering: MSDS692_Clustering.ipynb
ANN: MSDS692_Forecasting.ipynb
Time Series Analysis: MSDS692_TimeSeries.ipynb

Conclusions

While this project has not resulted in building a robust predictive model, I think it still provided some important insights into the situation with car accidents in Denver:

Car accidents in Denver cannot be reliably predicted exclusively based on population growth, alcohol, and recreational marijuana use, or weather conditions. Traffic accidents in Denver are a complex phenomenon that cannot be reduced to stereotypical factors.
To increase the predictive power of the models, we need to include other factors (e.g., traffic patterns, fleet characteristics, driver behavior (e.g, estimated by the number of speeding tickets issued), enforcement practices etc.).
Need to reevaluate assumptions and include more granular and precise data (e.g., road surface conditions not just weather observations).
Not all algorithms are equally suitable for the task (e.g., K-means was only able to pick up day-of-the-week differences), need to use a different approach to clustering.
Unexpectedly, time series analysis provided the most precise prediction. I was expecting the neural network to be better suited, but it is most likely can be explained by the choice of independent variables.
Note to myself for future projects: building datasets from scratch takes time and might be not the best solution for a time-restricted school project.
In order to improve my python programming skills I used it exclusively throughout the project, otherwise, some particular tasks can be more efficiently accomplished using other DS tools (e.g., time series analysis seems to be easier in R).

References

Video presentation for this project: https://www.youtube.com/watch?v=Lj7lx8ZfaRE

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
Code		Code
Data		Data
Maps		Maps
Plots		Plots
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Predicting Denver Traffic Accidents using machine learning techniques in Python

Project Overview

EDA (Exploratory Data Analysis)

Implementation

Data collection and preprocessing:

Merging data and preparing for ML:

ML models and discussion of the results:

Conclusions

References

About

Uh oh!

Releases

Packages

Languages

License

nweakly/MSDSProjectGitHub

Folders and files

Latest commit

History

Repository files navigation

Predicting Denver Traffic Accidents using machine learning techniques in Python

Project Overview

EDA (Exploratory Data Analysis)

Implementation

Data collection and preprocessing:

Merging data and preparing for ML:

ML models and discussion of the results:

Conclusions

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages