Clustering San Francisco Crime Data

I grabbed the Crime Incidents - Current Year dataset (crimes from January 1, 2014) as JSON and wrote a Python script that uses K-Means clustering to clean up and analyze the points. K-Means partitions the points into clusters, each with its own centroid, so that every point is assigned to the centroid nearest to it. Crudely, it delineates clusters of points grouped around "hotspots." The map displays the generated centroids along with Police Districts and Neighborhoods.
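
To make the clustering step concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It is not the script behind the map. The function name kmeans, its parameters, and the idea of feeding it (longitude, latitude) tuples are assumptions for illustration, and straight-line distance on raw coordinates is only a rough stand-in for ground distance at city scale.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster (x, y) tuples with Lloyd's algorithm.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid nearest to points[i]. Hypothetical helper, not the original script.
    """
    rng = random.Random(seed)
    # Start from k points picked at random from the data.
    centroids = rng.sample(points, k)

    def assign(current):
        # Assignment step: each point goes to its nearest centroid.
        return [
            min(range(k), key=lambda c: math.dist(p, current[c]))
            for p in points
        ]

    for _ in range(iterations):
        labels = assign(centroids)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                updated.append((mean_x, mean_y))
            else:
                # An empty cluster keeps its previous centroid.
                updated.append(centroids[c])
        if updated == centroids:
            break  # No centroid moved, so the assignment is stable.
        centroids = updated

    # Recompute labels so they always match the returned centroids.
    return centroids, assign(centroids)
```

Calling kmeans(points, k) on the incident coordinates would return one centroid per hotspot plus a cluster label for every incident; the number of clusters k has to be chosen up front.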

Most crime reports in the city group incidents by an administrative boundary: neighborhoods (orange) or police districts (green). Reporting crimes by natural groups of incidents instead reveals slightly more nuanced patterns.

For example, two of the crime "hotspots" that fall within the Tenderloin police district teeter on the border of the Southern police district and SOMA neighborhood. While crime reporting isolates Southern and SOMA crime, those incidents contribute (over 50%) to a crime pattern originating from the Tenderloin district. Additionally, the largest hotspot, Tenderloin (18509), falls within the smallest police district of all, whereas the fourth largest hotspot, in Bayview (10288), falls within the second largest police district (Bayview).
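
A comparison like the one above comes down to counting, for each cluster, how its incidents split across police districts. The sketch below shows one way to compute those shares. The function name district_share_by_cluster and the two parallel input lists (a cluster label and a district name per incident) are assumptions for illustration, not part of the original script.

```python
from collections import Counter, defaultdict


def district_share_by_cluster(labels, districts):
    """Return {cluster: {district: share of that cluster's incidents}}.

    labels[i] is the cluster index of incident i and districts[i] is the
    police district it was reported in. Hypothetical helper for illustration.
    """
    counts = defaultdict(Counter)
    for label, district in zip(labels, districts):
        counts[label][district] += 1

    shares = {}
    for label, per_district in counts.items():
        total = sum(per_district.values())
        # Divide each district's count by the cluster's total size.
        shares[label] = {d: n / total for d, n in per_district.items()}
    return shares
```

A cluster whose centroid sits in one district but whose largest share belongs to a neighboring district is exactly the kind of border hotspot described above.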

This is the first part of a larger series I'll be working on to evaluate the distribution of crime and police resources, as well as perceptions of crime and safety, in San Francisco.

See the code.