bio_img_loren

Loren on the Art of MATLAB

Turn ideas into MATLAB

Note

Loren on the Art of MATLAB has been archived and will not be updated.

Fighting Crime with Predictive Policing

Today’s guest blogger, Toshi, is back again.

I recently noticed that there is a Kaggle competition titled San Francisco Crime Classification which asks you to predict the category of crimes that occurred in San Franciso from 1/1/2003 to 5/13/2015 in theSFPD Crime Incident Reporting system. The goal of the competition is to predict the category of crime that occurred based on time and location.

This reminded me of the movie Minority Report in which a special unit of police arrests people before they commit crimes, but that’s SciFi. A more realistic approach is to deter crime by analyzing the past data to predict when and where crimes take place and deploy law enforcement resources to such hot spots. This approach is known as predictive policing.

Let’s take a look at the SFPD data to see what we can learn from it. This has nothing to do with the goal of the competition, but Kaggle is ” also encouraging you to explore the dataset visually “. So why not?

Contents

The SFPD Crime Incident Report Data

You need to first download the zipped data files from Kaggle website, unzip and place them into your current folder. Let’s load the data see what attributes are included.

T = readtable('train.csv','Format','%D%C%q%C%C%q%q%f%f');   % load data from csv
week = {'Sunday','Monday','Tuesday','Wednesday', ...        % define order
    'Thursday','Friday','Saturday'};
T.(4) = reordercats(T.(4), week);                           % reorder categories
T.(6) = categorical(T.(6));                                 % convert to categorical
T.Properties.VariableNames                                  % show variable names
ans = 
  Columns 1 through 5
    'Dates'    'Category'    'Descript'    'DayOfWeek'    'PdDistrict'
  Columns 6 through 9
    'Resolution'    'Address'    'X'    'Y'

Let’s also add a new column to assign dates to specific weekly internvals for time series analysis.

t = datetime('2003-1-5') + days(0:7:4515);                  % weekly date intervals
weeks = NaT(size(T.Dates));                                 % empty datetime array
for i = 1:length(t) - 1                                     % loop over weekly intervals
    weeks(T.Dates >= t(i) & T.Dates < t(i+1)) = t(i);       % dates to weekly intervals
end
T.Week = weeks;                                             % add weekly intervals

Now let’s see what is included in ‘Category’. There are 38 categories of crime and non-crime incidents, such as ARSON, ASSAULT, and BAD CHECKS, but which ones should we focus on?

T.Category = mergecats(T.Category,{'TRESPASS','TREA'});     % merge mislabeled categories
tab = tabulate(T.Category);                                 % tabulate categories
[Count, idx] = sort(cell2mat(tab(:,2)),'descend');          % sort by category total
figure                                                      % new figure
bar(Count)                                                  % plot bar chart
ax = gca;                                                   % get current axes handle
ax.XTick = 1:size(tab, 1);                                  % use categories as tick
ax.XTickLabel = tab(idx,1);                                 % rename tick labels
ax.XTickLabelRotation = -90;                                % rotate text vertically

Broken Windows Theory

This theory, embraced by New York City Police Commissioner (then NYPD Chief) William Bratton in the 1990s, says that leaving broken windows unrepaired leads to more vandalism and greater social disorder in the neighborhood, because it signals that no one cares there. Up to that pointhe NYPD focused on solving serious crimes. However, the theory suggested that cracking down on petty crimes might lead to reduction in more serious ones. While New York saw a drop in the crime rate under Bratton, this theory is not without critics because it also led to the controversial stop and frisk practice.

Perhaps this theory provides one starting point for our exploration. Does the SFPD data show a correlation between a petty crime like vandalism and other more serious crimes? Let’s start with bivariate histogram using histogram2 to plot the vandalism incidents by location. Check sfcrime_load_map.m to see how you retrieve a raster map from Web Map Service.

sfcrime_load_map                                            % load map from WMS
vandalism = T(T.Category == 'VANDALISM', [1,3:5,8:10]);     % subset T by category
nbins = 100;                                                % number of bins
xbinedges = linspace(lim.lon(1),lim.lon(2),nbins);          % x bin edges
ybinedges = linspace(lim.lat(1),lim.lat(2),nbins);          % y bin edges
map = flipud(A);                                            % flip image
figure                                                      % new figure
colormap cool                                               % set colormap
histogram2(vandalism.X, vandalism.Y, ...                    % plot 3D bivariate
    xbinedges, ybinedges, ...                               % histogram
    'FaceColor', 'flat', ...
    'FaceAlpha', 0.5, 'EdgeAlpha', 0.5)
hold on                                                     % don't overwrite
image(lim.lon,lim.lat,map)                                  % add the map
hold off                                                    % restore default
ax = gca;                                                   % get current axes handle
ax.CLim = [0 100];                                          % color axis scaling
title({'San Francisco Crime Map'; ...                       % add title
    'Vandalism: Jan 2003 - May 2015'})
zlabel('Count of Police Reports')                           % add axis label
text(-122.52,37.82, 300, 'Golden Gate Bridge')              % annotate landmark
text(-122.4,37.82, 200, 'Bay Bridge')                       % annotate landmark

You can see that those incidents are highly concentrated in several hot spots indicated by the tall magenta bars. There is one particularly tall bar that sticks high above the rest. Where is it? We can plot the top 50 locations of high vandalism incidents with the highest spot marked as “#1”. Check sfcrime_draw_locs.m to see the details.

figure                                                      % new figure
usamap(lim.lat, lim.lon);                                   % set map coordinates
geoshow(A, R)                                               % display map
sfcrime_draw_locs(vandalism,lim.lat,lim.lon,nbins,50,'m')   % draw top 50 locations
title({'Vandalism: Jan 2013 - May 2015';'Top 50 Locations'})% add title

Crime Heat Map

If we process other categories of crime with the same grid, we can create a matrix of crimes by location and we can plot it as a heat map with imagesc to make comparison easier.

Bright horizontal stripes indicate a location where different types of crimes are committed. There is one particularly bright stripe with ASSAULT, DRUG/NARCOTIC, LARCENY/THEFT, NON-CRIMINAL, OTHER OFFENSES, and WARRANTS. ROBBERY, SUSPICIOUS OCC, and VANDALISM also appear in a lighter shade.

If you look at LARCENY/THEFT, it doesn’t necessarily form rows of stripes everywhere it is bright. If you are going to steal, it may be worthwhile to go to places you find high value targets.

cats = categories(T.Category);                              % extract categories
rawCounts = zeros((nbins-1)^2,length(cats));                % set up an accumulator
for i = 1:length(cats)                                      % loop over categories
    data = T(T.Category == cats{i}, 8:9);                   % subset T by category
    [N,~,~] = histcounts2(data.X, data.Y, ...               % get bivariate histogram
        xbinedges, ybinedges);                              % bin counts
    rawCounts(:,i) = N(:);                                  % add to accumulator
end                                                         % as a vector

figure                                                      % new figure
imagesc(rawCounts)                                          % plot matrix as an image
ax = gca;                                                   % get current axes handle
ax.CLim = [0 200];                                          % color axis scaling
ax.XTick = 1:length(cats);                                  % use categories as tick
ax.XTickLabel = cats;                                       % rename tick labels
ax.XTickLabelRotation = -90;                                % rotate text vertically
ylabel('Locations')                                         % add axis label
title('SF Crime Heat Map by Location')                      % add title
colorbar                                                    % add colorbar

Principal Components Analysis

We can use Principal Component Analysis uisng pca to get a better sense of the relationship among different categories and visualize the result with biplot. We need to use weighted PCA to address the big scale differences among categories. We also need to hide some categories as the output will be too cluttered if we try to display all 38 of them.

w = 1 ./ var(rawCounts);                                    % inverse variable variances
[wcoeff, score, latent, tsquared, explained] = ...          % weighted PCA with w
    pca(rawCounts, 'VariableWeights', w);
coefforth = diag(sqrt(w)) * wcoeff;                         % turn wcoeff to orthonormal
labels = cats;                                              % categories as labels
labels([4,9,10,12,13,15,18,20,21,23,25,27,28,31,32]) = {''};% drop some labels to avoid clutter
figure                                                      % new figure
biplot(coefforth(:,1:2), 'Scores', score(:,1:2), ...        % 2D biplot with the first two comps
    'VarLabels', labels)
xlabel(sprintf('Component 1 (%.2f%%)', explained(1)))       % add variance explained to x axis label
ylabel(sprintf('Component 2 (%.2f%%)', explained(2)))       % add variance explained to y axis label
axis([-0.1 0.6 -0.3 0.4]);                                  % define axis limits
title({'Principal Components Analysis'; ...                 % add title
    sprintf('Variance Explained %.2f%%',sum(explained(1:2)))})
htext = findobj(gca,'String','VEHICLE THEFT');              % find text object
htext.HorizontalAlignment = 'right';                        % change text alignment
htext = findobj(gca,'String','ASSAULT');                    % find text object
htext.Position = [0.2235 0.0909 0];                         % move label position
htext = findobj(gca,'String','ROBBERY');                    % find text object
htext.Position = [0.2148 0.1268 0];                         % move label position
htext = findobj(gca,'String','ARSON');                      % find text object
htext.HorizontalAlignment = 'right';                        % change text alignment
htext = findobj(gca,'String','EXTORTION');                  % find text object
htext.HorizontalAlignment = 'right';                        % change text alignment

Assault, Robbery and Vehicle Theft

The upper half of the biplot seems to show less sophisticated crimes than the lower half, and VANDALISM is more closely related to ARSON, and SUSPICIOUS OCC, and it is also related to LARCENY/THEFT but LARCENY/THEFT itself is closer to BAD CHECKS, EXTORTION, FRAUD and SEX OFFENSES FORCIBLE.

You can also see ASSAULT, ROBBERY, and VEHICLE THEFT are also very closely related. Among those, ASSAULT has the largest count of incidents. Maybe that’s the crime we need to focus on. Let’s check the top 50 locations for those crimes. As you would expect, you see good overlap of those locations.

assault = T(T.Category == 'ASSAULT', [1,3:5,8:10]);         % subset T by category
vehicle = T(T.Category == 'VEHICLE THEFT', [1,3:5,8:10]);   % subset T by category
robbery = T(T.Category == 'ROBBERY', [1,3:5,8:10]);         % subset T by category

figure                                                      % new figure
usamap(lim.lat, lim.lon);                                   % set map coordinates
geoshow(A, R)                                               % display map
topN = 50;                                                  % get top 50
hold on                                                     % don't overwrite
sfcrime_draw_locs(assault,lim.lat,lim.lon,nbins,topN,'r')   % draw locations in red
sfcrime_draw_locs(vehicle,lim.lat,lim.lon,nbins,topN,'g')   % draw locations in green
sfcrime_draw_locs(robbery,lim.lat,lim.lon,nbins,topN,'b')   % draw locations in blue
hold off                                                    % restore default
title({'Assault, Robbery, and Vehicle Theft'; ...           % add title
    sprintf('Top %d Locations',topN)})

Grand Theft Auto

To see if we can use the data for prediction, we need to check the time dimension as well. We can use the weekly interval column to plot the weekly trends. This is strange. VEHICLE THEFT suddenly dropped in 2006! It’s time to dig deeper.

figure                                                      % new figure
[G, t] = findgroups(assault.Week);                          % group by weekly intervals
weekly = splitapply(@numel, assault.Week, G);               % get weekly count
plot(t, weekly)                                             % plot weekly count
hold on                                                     % don't overwrite
[G, t] = findgroups(vehicle.Week);                          % group by weekly intervals
weekly = splitapply(@numel, vehicle.Week, G);               % get weekly count
plot(t, weekly)                                             % plot weekly count
[G, t] = findgroups(robbery.Week);                          % group by weekly intervals
weekly = splitapply(@numel,robbery.Week, G);                % get weekly count
plot(t, weekly)                                             % plot weekly count
hold off                                                    % restore default
title('ASSAULT, ROBBERY, VEHICLE THEFT - Weekly')           % add title
ylabel('Count of Incidence Reports')                        % add axis label
legend('ASSAULT','VEHICLE THEFT', 'ROBBERY')                % add legend

Eureka!

Looking at descriptions, you notice that recovered vehicles are also reported as incidents. For some reason, it was the incidents of recovered vehicles that dropped off since 2006. Such a sudden change is usually caused by a change in reporting criteria, but it looks like half of the stolen cars were often recoveverd eventually in the good old days? Are they still recovered but not reported, or are they no longer recovered? This Boston Globe article mentions that “Of the 10 vehicles most frequently stolen, all were made before 2007,” and credit new antitheft devices for the decline, and say those still stolen are shipped overseas (not likely to be recovered).

Anyhow, this time series analysis shows that there is a lot more going on than just time and location in crime. We could deal with the change in car theft reporting by omitting the data prior to 2006, but we would have to redo the heat map and run PCA again. Perhaps the Broken Windows Theory is not that useful as the basis of our analysis.

isRecovered = strfind(vehicle.Descript,'RECOVERED');        % find 'RECOVERED' in descrption
isRecovered = ~cellfun(@(x) isempty(x), isRecovered);       % is recovered if not empty
[G, t] = findgroups(vehicle.Week(~isRecovered));            % group by weekly intervals
weekly = splitapply(@numel, vehicle.Week(~isRecovered), G); % get weekly count
plot(t, weekly)                                             % plot weekly count
hold on                                                     % don't overwrite
[G, t] = findgroups(vehicle.Week(isRecovered));             % group by weekly intervals
weekly = splitapply(@numel, vehicle.Week(isRecovered), G);  % get weekly count
plot(t, weekly)                                             % plot weekly count
hold off                                                    % restore default
title('VEHICLE THEFT - Weekly')                             % add title
ylabel('Count of Incidence Reports')                        % add axis label
legend('UNRECOVRED', 'RECOVERED')                           % add legend

Another Kind of Broken Windows

Loren shared a NY Times article San Francisco Torn as Some See ‘Street Behavior’ Worsen with me. It is about the rise of smash-and-grab thefts from vehicles from the perspective of a local resident at Lombard Street, famous for its zigzags. The article says victims are often tourists and out-of-town visitors. LARCENY/THEFT is clearly on the rise, and it indeed comes mainly from Auto-related thefts.

larceny = T(T.Category == 'LARCENY/THEFT', [1,3:5,8:10]);   % subset T by category
isAuto = strfind(larceny.Descript,'LOCKED');                % find 'LOCKED' in descrption
isAuto = ~cellfun(@(x) isempty(x), isAuto);                 % is auto if not empty
figure                                                      % new figure
subplot(1,2,1)                                              % subplot 1
[G, t] = findgroups(larceny.Week(isAuto));                  % group by weekly intervals
weekly = splitapply(@numel, larceny.Week(isAuto), G);       % get weekly count
plot(t, weekly)                                             % plot weekly count
title('LARCENY/THEFT, Auto')                                % add title
subplot(1,2,2)                                              % subplot 2
[G, t] = findgroups(larceny.Week(~isAuto));                 % group by weekly intervals
weekly = splitapply(@numel, larceny.Week(~isAuto), G);      % get weekly count
plot(t, weekly)                                             % plot weekly count
title('LARCENY/THEFT, Non-Auto')                            % add title
ylim([0 500])                                               % adjust y-axis scale
hold off                                                    % restore default

When you map the top 100 locations of car-related LARCENY/THEFT, Lombard Street doesn’t make the cut, but the distribution indeed looks different from VEHICLE THEFT. You see several locations near famous tourist attractions like Fisherman’s Wharf and SoMa, known for the concentration of tech companies. It appears they do go after tourists and business vistors who are not familiar with the lay of the land. Now we found factors that affect one type of crime!

figure                                                      % new figure
usamap(lim.lat, lim.lon);                                   % set map coordinates
geoshow(A, R)                                               % display map
topN = 100;                                                 % get top 100
hold on                                                     % don't overwrite
sfcrime_draw_locs(larceny(isAuto,:),lim.lat,lim.lon, ...    % draw locations in red
    nbins,topN,'r')
sfcrime_draw_locs(vehicle,lim.lat,lim.lon,nbins,topN,'b')   % draw locations in blue
plotm(37.802139, -122.41874, '+g')                          % add landmark
plotm(37.808119, -122.41790, '+g')                          % add landmark
plotm(37.7808297, -122.4301075, '+g')                       % add landmark
plotm(37.7842048, -122.3969652, '+g')                       % add landmark
plotm(37.7786559, -122.5106296, '+g')                       % add landmark
plotm(37.8038433, -122.4418518, '+g')                       % add landmark
hold off                                                    % restore default
title({'LARCENY/THEFT - Auto vs. VEHICLE THEFT'; ...        % add title
    sprintf('Top %d Locations',topN)})
textm(37.802139, -122.41574, 'Lombard Street', ...          % annotate landmark
    'Color', 'g')
textm(37.8141187, -122.43450, 'Fisherman''s Wharf', ...     % annotate landmark
    'Color', 'g')
textm(37.7808297, -122.4651075, 'Japan Town', ...           % annotate landmark
    'Color', 'g')
textm(37.7842048, -122.3949652, 'SoMa', ...                 % annotate landmark
    'Color', 'g')
textm(37.7786559, -122.5086296, 'Sutro Baths', ...          % annotate landmark
    'Color', 'g')
textm(37.8088629, -122.4628518, 'Marina District', ...      % annotate landmark
    'Color', 'g')
textm(37.715564, -122.475662, ...                           % add note
    {'Red:  LARCENY/THEFT - Auto','Blue: VEHICLE THEFT'}, 'BackgroundColor','w')

Summary

We looked at the Broken Windows Theory as a starting point of this exploration, but the SFPD data doesn’t provide easily detectable correlation among crimes that you would expect based on this theory, and time series analysis shows that there is a lot more going on than just time and location that affects crime. When we focused on specific types of crime that are in decline or on the rise, and cross referenced those with external data sources, we learned a lot more. This points to a potential for enriching this dataset with data from other sources like demographics to improve predictive capability, but it also creates a dilemma of civil rights violation if not done carefully. British Transit Police got creative and did an interesting experiment to place “watching eyes” posters to deter bike thefts. This is a good creative use of insights from doing this type of analysis.

By the way, I couldn’t help playing with the camera object in MATLAB Graphics. Here is a fun flyover animation of the San Francisco crime scene! Check out sfcrime_flyover.m for more details.

Hopefully you now see how you can fight crime with data. Download this post in standard MATLAB script (click “_Get the MATLAB code_” below) and use it as the starting point for your exploration or even join Kaggle competition. If you find anything interesting, please let us know!

Published with MATLAB® R2016a


  • print

댓글

댓글을 남기려면 링크 를 클릭하여 MathWorks 계정에 로그인하거나 계정을 새로 만드십시오.