This project analyzes and visualizes open-source police data for two police force areas, Leicestershire and Northumbria, for March 2021, using the "Leicestershire Street" and "Northumbria Street" crime datasets. Apache Spark SQL handles data cleansing, configuration, and pre-processing. Graphs and charts then show crime patterns and their impact on public safety.
- Apache Spark SQL: Used for data processing and querying.
- Python (PySpark, matplotlib, pandas): For data manipulation and visualization.
- Jupyter Notebook: The environment for running and documenting the analysis.
- Leicestershire Street Data: Contains crime records for March 2021.
- Northumbria Street Data: Contains crime records for March 2021.
Both datasets are sourced from data.police.uk.
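
A minimal sketch of how the two files might be loaded with PySpark. The `data/` directory layout, the `<month>-<force>-street.csv` file naming, the temp view names, and all helper names are assumptions for illustration; they are not taken from the project notebook.

```python
from pathlib import Path

# Assumption: the data.police.uk archive was unpacked under ./data,
# giving one folder per month with one CSV per police force.
DATA_DIR = Path("data")
MONTH = "2021-03"
FORCES = ("leicestershire", "northumbria")


def street_csv_path(force, month=MONTH, data_dir=DATA_DIR):
    """Return the expected path of one force's street-level CSV."""
    return Path(data_dir) / month / f"{month}-{force}-street.csv"


def build_session(app_name="police-data-analysis"):
    """Create (or reuse) a SparkSession."""
    # Imported here so the path helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_street_data(spark, force):
    """Read one force's CSV and register it as a temp view, e.g. northumbria_street."""
    df = spark.read.csv(str(street_csv_path(force)), header=True, inferSchema=True)
    df.createOrReplaceTempView(f"{force}_street")
    return df
```

In a notebook cell, `spark = build_session()` followed by `frames = {force: load_street_data(spark, force) for force in FORCES}` would make both datasets available to `spark.sql(...)`.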
- Environment Setup: Installation and configuration of Jupyter Notebook and necessary Python libraries.
- Data Cleaning and Transformation: Removing or rectifying incorrect, inaccurate, or missing data, and transforming data into suitable formats.
- Exploratory Data Analysis: Using SQL queries and Python functions to gain insights into crime patterns.
- Visualization: Creating bar charts, pie charts, and maps to present the data visually.
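
The cleaning, querying, and charting steps above could look roughly like the sketch below. It assumes the data is already registered as a Spark temp view (the view name `leicestershire_street` used in the usage note is hypothetical) and that the columns follow the data.police.uk street-level CSV header (`Crime type`, `Last outcome category`, `Longitude`, `Latitude`). Dropping rows without coordinates and labelling missing outcomes as "Not recorded" are illustrative choices, not necessarily the rules used in the project.

```python
# Assumed column names, following the data.police.uk street-level CSV header.
COORD_COLS = ["Longitude", "Latitude"]
OUTCOME_COL = "Last outcome category"
CRIME_TYPE_COL = "Crime type"


def clean_street_data(df):
    """Drop rows without coordinates and label missing outcomes explicitly."""
    return df.dropna(subset=COORD_COLS).fillna({OUTCOME_COL: "Not recorded"})


def crime_type_query(view):
    """Build a Spark SQL query counting incidents per crime type in `view`."""
    # Guard against malformed view names, since the name is interpolated into SQL.
    if not view.isidentifier():
        raise ValueError(f"unexpected view name: {view!r}")
    return (
        f"SELECT `{CRIME_TYPE_COL}` AS crime_type, COUNT(*) AS incidents "
        f"FROM {view} "
        f"GROUP BY `{CRIME_TYPE_COL}` "
        "ORDER BY incidents DESC"
    )


def plot_crime_types(spark, view, title):
    """Run the query and draw a horizontal bar chart (needs pandas and matplotlib)."""
    counts = spark.sql(crime_type_query(view)).toPandas()
    ax = counts.plot.barh(x="crime_type", y="incidents", legend=False, title=title)
    ax.set_xlabel("Recorded incidents")
    return ax
```

A cleaned DataFrame has to be registered again before it can be queried, for example `clean_street_data(df).createOrReplaceTempView("leicestershire_street")`, after which `plot_crime_types(spark, "leicestershire_street", "Leicestershire, March 2021")` would draw the chart.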
- Crime Types: Leicestershire sees more "Violence and sexual offences"; Northumbria sees more "Anti-social behaviour".
- Geographic Influence: Crime rates and types vary significantly by location.
- Investigation Outcomes: Many cases in Leicestershire are unresolved; in Northumbria, investigations often end with no suspect identified.
- Population Density: Northumbria has higher "Anti-social behaviour" rates despite lower population density.
- Data Gaps: Missing data affects the completeness of the analysis.
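
One way to quantify the data gaps noted above is to count null values per column and express them as a share of all rows. This is a sketch only: both helper names are hypothetical, and the project may have measured missing data differently.

```python
def null_counts(df):
    """Count null values per column of a Spark DataFrame; returns {column: count}."""
    # Imported here so the percentage helper below can be used without Spark installed.
    from pyspark.sql import functions as F

    # Backticks keep column names containing spaces (e.g. "Crime ID") intact.
    exprs = [
        F.sum(F.col(f"`{name}`").isNull().cast("int")).alias(name)
        for name in df.columns
    ]
    return df.select(exprs).first().asDict()


def missing_percentages(counts, total_rows):
    """Convert per-column null counts into percentages of all rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return {col: round(100 * n / total_rows, 1) for col, n in counts.items()}
```

For a loaded DataFrame `df`, `missing_percentages(null_counts(df), df.count())` would give a per-column summary that can be reported alongside the findings.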