Better Neighborhoods with R: Exploring and Analyzing SeeClickFix Data (part 1)

The National Day of Civic Hacking took place on June 1-2 at various locations across the United States. The event brought citizens, government, and companies together to:

“collaboratively create, build, and invent using publicly-released data, code and technology to solve challenges relevant to our neighborhoods, our cities, our states and our country.”

I attended the Connecticut Civic Hack Day in New Haven, Connecticut. The event was graciously hosted by SeeClickFix at their dynamite downtown headquarters. It was an informal gathering of developers, designers, activists, and government officials with a common mission of building solutions to issues that impact our communities and state. The Civic Hack Day was the first hackathon event I attended. It certainly won’t be my last. The experience was thoroughly enjoyable and enriching: I met some incredibly talented people, learned about a local company whose mission is to improve the world by involving and connecting concerned citizens locally, and was able to practice and strengthen some data analysis and R development skills with the open, real-time, geolocated data that SeeClickFix makes freely available.

Overview:

SeeClickFix enables citizens to report incidents, concerns, and questions pertaining to their community to government staff and fellow citizens/neighbors. Potholes, graffiti, illegally dumped debris, and similar concerns are typical topics in the datasets. Mobile phones become powerful data acquisition and transmission devices when paired with the SeeClickFix services. I found the variety of issues, the geospatial aspects of the data, and the broad data mining, machine learning, and analytics potential enticing, so I decided to start building an R API client for SeeClickFix during the hackathon.

Initial Goals:

The goals I set for myself for this hackathon were to:

  • Develop a simple R client to analyze, model, and classify SeeClickFix and Open311 data.
  • Perform some exploratory data analysis to demonstrate proof-of-concept and ease of extensibility.
  • Share what I’ve learned and developed through a short series of blog posts. This first post will focus on the very basics of using R to retrieve, process, and plot SeeClickFix data. We’ll perform some simple static data visualizations to get some familiarity with the data. Future posts will more fully develop an API client, build interactive visualizations/dashboards, and explore some predictive analytics.

The Data:

SeeClickFix provides developers with a well-documented API to interface with and extend its services. As you will see in the docs, authentication is needed for some operations. To keep things simple, I took the advice of SeeClickFix COO Kam Lasater and focused on retrieving data which does not require authentication.

As stated on their Open Data page:

“SeeClickFix data sources, including XML, RSS, KML and JSON, are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. Attribution should be to seeclickfix.com. Persons or organizations wishing to reuse large portions (ie., more than occasional queries ) of data are required to contact us first at team@seeclickfix.com. We are very interested in working with researchers and/or agencies to ensure that this data is put to good use!”

For this article I downloaded a relatively small amount of data. If you intend to download and reuse large amounts of data, please contact the SeeClickFix team first. I decided to explore the “Issues” data for my first foray into the SeeClickFix world. I encourage you to read the API details on listing issues here. The options available to developers are extensive and impressive. As described in this post, I focused on a tiny portion of the overall functionality to get my feet wet; there is much more to explore in the rich API provided.

You can retrieve the data in both XML and JSON formats. I chose JSON initially to get some practice using the rjson package during the hackathon. If you prefer XML, just change issues.json to issues.xml in the URL string. (You’ll also have to use the XML R package or similar in your modified code.)
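To illustrate the swap concretely (using the same example query that appears later in this post), switching formats is just a string substitution on the URL:

```r
# Example issues URL in JSON form (the same query used later in this post)
json_url <- "http://seeclicktest.com/api/issues.json?at=New+Haven,+CT&start=50&end=0&page=1&num_results=100&sort=issues.created_at"

# Swap the endpoint extension to request XML instead of JSON
xml_url <- sub("issues.json", "issues.xml", json_url, fixed = TRUE)
xml_url
```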

The key to retrieving the data of interest (time frame, geolocation, number of pages, map zoom, sort order, etc.) is constructing the correct URL. For this example, I defined a typical URL as:

http://seeclicktest.com/api/issues.json?at=New+Haven,+CT&start=50&end=0&page=1&num_results=100&sort=issues.created_at

If you enter the URL in the browser, you’ll get a collection of JSON objects similar to:

[{"id":558710,"issue_id":558710,"page":1,"summary":"Furniture and trash dumped at several sites along the street","status":"Open","address":"Mechanic St New Haven, Connecticut","rating":1,"request_type_id":1250,"vote_count":0,"description":"There is trash dumped all over this neighborhood, including Nash, Mechanic, and Lawrence St.","slug":"558710-furniture-and-trash-dumped-at-several-sites-along-the-street","lat":41.3167309,"lng":-72.9086932,"bitly":"http://bit.ly/15Kaoy5","minutes_since_created":40,"updated_at":"06/05/2013 at 02:58PM","updated_at_raw":"2013-06-05T14:58:13-04:00","created_at":"06/05/2013 at 02:58PM","user_id":113576},

{"id":558512,"issue_id":558512,"page":1,"summary":"Audubon Court Garage Entry Ticket Dispenser","status":"Open","address":"Orange And Audubon Streets New Haven, Connecticut","rating":2,"request_type_id":374,"vote_count":1,"description":"Orange Street garage entrance ticket dispenser shows a time which is five minutes fast. (Not sure about Audobon Street entrance.) I mentioned this discrepancy to the ticket taker two weeks ago. He indicated he would check it out. Dispenser entry time is still fast. ","slug":"558512-audubon-court-garage-entry-ticket-dispenser","lat":41.3108666,"lng":-72.920186,"bitly":"http://bit.ly/15JTzn5","minutes_since_created":111,"updated_at":"06/05/2013 at 02:39PM","updated_at_raw":"2013-06-05T14:39:25-04:00","created_at":"06/05/2013 at 01:47PM","user_id":""}...]

As described in the API docs, the above URL returns up to the first 100 issues on the first page of issues for New Haven, CT (in practice, only 19 issues per page are presently returned by the API). The issues are sorted in descending order by the “created_at” timestamp. The start=50 element in the URL limits results to issues created within the last 50 hours. Note that you can also request partial hours, such as 10.50 (10 hours and 30 minutes).
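To avoid hand-editing the query string, the parameters above can be assembled with a small helper. Note that build_issues_url is a hypothetical function of my own, not part of any SeeClickFix library; it simply reproduces the URL pattern shown above:

```r
# Hypothetical helper (not part of the API): build an issues.json URL
# for a place and a look-back window expressed in hours.
build_issues_url <- function(place, start_hours, page = 1, num_results = 100) {
  at <- gsub(" ", "+", place, fixed = TRUE)  # the API accepts +-encoded spaces
  paste0("http://seeclicktest.com/api/issues.json",
         "?at=", at,
         "&start=", start_hours,
         "&end=0",
         "&page=", page,
         "&num_results=", num_results,
         "&sort=issues.created_at")
}

build_issues_url("New Haven, CT", 50)
```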

Some R Code and Exploratory Results:

First, let’s download some New Haven Data:

library(rjson)
library(plyr)
library(ggplot2)
library(maps)      # for future mapping
library(lubridate) # for working with dates and times

# remove all variables and values from the environment
rm(list = ls(all = TRUE))

# load US map data for later use
all_states <- map_data("state")

max_pages <- 1000  # be respectful: limit the number of pages being polled
df <- data.frame() # stores the consolidated list of issues from SeeClickFix
for (i in 1:max_pages) {
  # construct the URL to retrieve a page of data
  url <- paste0("http://seeclicktest.com/api/issues.json?at=New+Haven,+CT&start=50000&end=0&page=",
                toString(i), "&num_results=100&sort=issues.created_at")
  seeclick_data <- fromJSON(paste(readLines(url), collapse = ""))
  df1 <- ldply(seeclick_data, data.frame, stringsAsFactors = FALSE)

  # if no more data is available, an empty record is returned
  if (length(df1) == 0) {
    break
  }
  df <- rbind(df, df1) # append the page of data to the overall results
}
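One caveat with the paging loop: if new issues arrive while pages are being fetched, the same issue can occasionally appear on consecutive pages, so it may be worth de-duplicating on the id field afterwards. A minimal sketch with toy data shaped like the downloaded issues:

```r
# Toy data frame shaped like the downloaded issues (id + summary columns)
df <- data.frame(id      = c(558710, 558512, 558512, 558300),
                 summary = c("trash", "ticket dispenser", "ticket dispenser", "pothole"),
                 stringsAsFactors = FALSE)

# keep only the first occurrence of each issue id
df <- df[!duplicated(df$id), ]
nrow(df) # 3 unique issues remain
```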

Now that we have some SeeClickFix data in a dataframe, let’s generate an exploratory graph of Open, Acknowledged, and Closed issues in the dataframe:

# convert updated_at_raw date/time into a date object
df$date_updated <- ymd_hms(df$updated_at_raw)
df$days_since_created <- df$minutes_since_created / 60 / 24
# earliest update in the data frame
min(df$date_updated)
# most recent update in the data frame
max(df$date_updated)
# calculate a weekly sequence of dates spanning the min and max dates in the data frame
week_breaks <- seq(min(df$date_updated), max(df$date_updated), by = "weeks")

# plot a faceted view (by status) of days since the issue was created versus
# the date the issue was updated
qplot(data = df, y = date_updated, x = days_since_created,
      color = status, breaks = week_breaks) + facet_grid(. ~ status)

which creates the following graph (you can click on it for a larger view):
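A quick numeric complement to the faceted plot is a simple tally of issues by status, sketched here with a toy status vector standing in for df$status:

```r
# Toy status vector standing in for df$status
status <- c("Open", "Open", "Closed", "Acknowledged", "Closed", "Open")
table(status) # counts per status category
```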

Word Clouds … for the fun of it.

Next, I decided to visually compare the issue descriptions of three very different cities: New Haven, CT; Chicago, IL; and Burlington, VT. To accomplish this I used the R wordcloud package along with the excellent tutorial at the One R Tip A Day blog (see this post). As shown, the code below also uses the R text mining package tm for creating the corpus, removing stop words, stripping out punctuation, and converting all words to lowercase.

Along with the New Haven data, I downloaded Chicago, IL and Burlington, VT sample datasets. The following R code creates a word cloud for a given dataframe df, using the subset of descriptions whose string length is > 0 (not empty):

# source / inspiration: http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html
library(tm)
library(wordcloud)
library(RColorBrewer)

# keep only non-empty descriptions
descript_text <- df$description[nchar(df$description) > 0]

ds <-VectorSource(descript_text)
descript_text.corpus <- Corpus(ds)
descript_text.corpus <- tm_map(descript_text.corpus, removePunctuation)
descript_text.corpus <- tm_map(descript_text.corpus, tolower)
descript_text.corpus <- tm_map(descript_text.corpus, function(x) removeWords(x, stopwords("english")))
tdm <- TermDocumentMatrix(descript_text.corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# pal <- brewer.pal(9, "BuGn")
# pal <- pal[-(1:2)]
pal2 <- brewer.pal(8,"Dark2")
# png("wordcloud.png", width=1280,height=800)
png("wordcloud.png", width=3280,height=1800)
wordcloud(d$word,d$freq, scale=c(8,.3),min.freq=2,max.words=100, random.order=T, rot.per=.15, colors=pal2, vfont=c("sans serif","plain"))
dev.off()
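Before rendering the cloud, it can be useful to sanity-check the frequency table d by inspecting its most frequent terms; a minimal sketch with a toy frequency vector shaped like the rowSums result built above:

```r
# Toy frequency vector shaped like the sorted rowSums result above
v <- sort(c(trash = 12, pothole = 9, street = 7, graffiti = 4), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v, stringsAsFactors = FALSE)
head(d, 3) # the three most frequent terms
```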

The resulting graphs are:

New Haven,CT:

[Figure: Sample of New Haven, CT SeeClickFix Issue Descriptions]

Chicago, IL:

[Figure: Sample of Chicago, IL SeeClickFix Issue Descriptions]

Burlington, VT:

[Figure: Sample of Burlington, VT SeeClickFix Issue Descriptions]

Next Steps:

I look forward to:

  • Creating some interactive visualizations combining R with the D3.js JavaScript library.

  • Refactoring the hackathon-quality code to improve code quality.

  • Building a more comprehensive R client for the SeeClickFix data.

  • Building an R client for the Open311 API.

Source Code

The R source files (“rough but working” at the moment) are available at my public GitHub repository:

 https://github.com/mspan/r-open311

I will update the repo with datasets, knitr files, and additional source files in conjunction with part 2 of this article.

Personal Impact

Thanks to the Civic Hack Day experience, I was inspired to file a complaint in my small town regarding some annoying graffiti on a stop sign about 30 feet from my front door. The town road crew came by and changed out the sign a few days later. It was a small but meaningful improvement in our neighborhood, one that made my kids’ daily walk to school just a little bit nicer. Thanks for the inspiration, SeeClickFix!