Independently of each other, FiveThirtyEight and Business Insider have published interesting articles about the recent rise in crime in the USA. FiveThirtyEight concluded that murders were increasing at a fast pace, while another article mentioned specifically that crime in Chicago was increasing. This seemed surprising to me, and I wondered whether there was a reason for it.
The Chicago open data portal has made available a dataset on crime, which allows us to take a closer look at the problem. The file is 1.5GB, so trying to open it in a text editor is not the best use of our time. Earlier we saw a summary of the New York Times corpus, and here we follow a similar approach. First, we look at the column labels in the dataset and decide which of them are of interest. There are 20 columns, of which we select the ones that matter most to us: date, primary type, description, latitude and longitude.
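A file of this size can be processed row by row instead of being loaded whole. The following is a minimal sketch of the column selection using Python's standard `csv` module. The column labels in `KEEP` are assumed to match the names mentioned above, the helper `select_columns` is our own name for illustration, and the sample rows are made up.

```python
import csv
import io

# Assumed column labels; check them against the header of the real export.
KEEP = ["Date", "Primary Type", "Description", "Latitude", "Longitude"]

def select_columns(src, dst, keep=KEEP):
    """Copy only the `keep` columns from src to dst, one row at a time."""
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every column not listed in `keep`.
    writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    "ID,Date,Primary Type,Description,Latitude,Longitude\n"
    "1,01/05/2001 10:00:00 PM,ASSAULT,AGGRAVATED,41.88,-87.63\n"
    "2,02/11/2003 08:30:00 AM,THEFT,RETAIL THEFT,41.75,-87.60\n"
)
out = io.StringIO()
select_columns(sample, out)
reduced_lines = out.getvalue().splitlines()
```

For the real data, `src` and `dst` would be files opened with `open(path, newline="")`, so that only one row is held in memory at a time.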
Extracting only these columns and saving the result gives a 428.4MB file, but there is still some work to do. Next we check which unique primary types of crime appear in the dataset: criminal damage, narcotics, other offense, theft, battery, assault, burglary, deceptive practice and criminal trespass. Knowing this, we select a category of interest, in our case "assault". Then we look up which unique descriptions exist for the primary type assault: simple and aggravated. Aggravated sounds more interesting to us, so we filter the dataset once again and store only date, latitude and longitude.
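These two steps, listing the unique values and then filtering, could look roughly like the sketch below. The helper names `unique_values` and `select_cases` are hypothetical, the column labels are assumed as before, and the sample rows are invented. Since we are not sure how the values are capitalized or whether descriptions carry extra detail, the match is case-insensitive and uses a substring test on the description.

```python
import csv
import io

def unique_values(rows, column):
    """Return the sorted distinct values found in one column."""
    return sorted({row[column] for row in rows})

def select_cases(rows, primary_type, keyword):
    """Keep date and coordinates of rows matching type and description."""
    wanted = []
    for row in rows:
        if row["Primary Type"].upper() != primary_type.upper():
            continue
        if keyword.upper() not in row["Description"].upper():
            continue
        wanted.append({k: row[k] for k in ("Date", "Latitude", "Longitude")})
    return wanted

# Made-up rows standing in for the reduced file.
sample = io.StringIO(
    "Date,Primary Type,Description,Latitude,Longitude\n"
    "01/05/2001 10:00:00 PM,ASSAULT,AGGRAVATED,41.88,-87.63\n"
    "02/11/2003 08:30:00 AM,ASSAULT,SIMPLE,41.75,-87.60\n"
    "03/20/2010 01:15:00 PM,THEFT,RETAIL THEFT,41.90,-87.65\n"
)
rows = list(csv.DictReader(sample))

primary_types = unique_values(rows, "Primary Type")
assault_descriptions = unique_values(
    [r for r in rows if r["Primary Type"] == "ASSAULT"], "Description"
)
aggravated = select_cases(rows, "assault", "aggravated")
```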
Unfortunately, this is not sufficient. As we know, most data is noisy and values can be missing. In our case, some crimes were registered without geo coordinates, so we remove these records, as we don't know how to make them useful. We are left with 1.1MB of data, which describes 26426 cases of aggravated assault.
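One way to drop the records without coordinates is to keep only rows whose latitude and longitude parse as finite numbers. This is a sketch under the assumption that missing values show up as empty strings or absent fields; the helper names and the sample records are hypothetical.

```python
import math

def has_coordinates(row):
    """True if both latitude and longitude parse as finite numbers."""
    try:
        lat = float(row["Latitude"])
        lon = float(row["Longitude"])
    except (KeyError, TypeError, ValueError):
        # Missing key, None, or a non-numeric string such as "".
        return False
    return math.isfinite(lat) and math.isfinite(lon)

def drop_missing_coordinates(rows):
    """Return only the rows that can be placed on a map."""
    return [row for row in rows if has_coordinates(row)]

# Made-up records: one complete, two with missing coordinates.
cases = [
    {"Date": "01/05/2001 10:00:00 PM", "Latitude": "41.88", "Longitude": "-87.63"},
    {"Date": "06/12/2005 03:00:00 AM", "Latitude": "", "Longitude": ""},
    {"Date": "09/30/2012 11:45:00 PM", "Latitude": "41.75", "Longitude": None},
]
located = drop_missing_coordinates(cases)
```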
Since we have the date and time of each event, we can color-code the events by the year in which they happened. We choose blue shades for events closer to 2001, when data collection started, and red shades for the most recent events. This gives us the following picture.
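The color-coding can be sketched as a linear blend from blue to red over the range of years. This is only one possible mapping and not necessarily the one used for the picture. The timestamp format in `DATE_FORMAT`, the function names, and the final year passed as `last` are assumptions made for illustration.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "01/05/2001 10:00:00 PM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def event_year(date_text, fmt=DATE_FORMAT):
    """Extract the year from a timestamp string."""
    return datetime.strptime(date_text, fmt).year

def year_to_color(year, first, last):
    """Blend linearly from blue (first year) to red (last year)."""
    if last == first:
        return "#0000ff"
    t = (year - first) / (last - first)
    t = min(1.0, max(0.0, t))  # clamp years outside the range
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return f"#{red:02x}00{blue:02x}"

# 2001 is the first year of data collection; 2016 is a hypothetical last year.
oldest = year_to_color(event_year("01/05/2001 10:00:00 PM"), 2001, 2016)
newest = year_to_color(2016, 2001, 2016)
```

The resulting hex strings can be passed as point colors to most plotting or mapping tools.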