Fumi's blog: Project311- About the Data

When the magnitude 9.0 Great East Japan Earthquake struck Japan on March 11, 2011, an enormous amount of information spread through social networking sites and the mass media. While much of it was accurate, there were also many falsehoods and rumors, which underscored just how important accurate information is. How exactly was misleading information conveyed? And what caused these mix-ups?

Verifying the data was impossible during and right after the earthquake, given how chaotic the situation was. But almost two years have passed, and it is time to review the data from the time of the disaster, digest what we can learn from it, and figure out what we can do to prepare for the next disaster. If we don't do it now, we will start forgetting what happened.

That is why we, Google and Twitter, hosted "The Great East Japan Earthquake Big Data Workshop: Project 311" from 9/12 to 10/28, 2012. We provided participants with data that was produced during the week following the earthquake. They reexamined the data, discussed what can be done to prepare for future disasters, and applied their findings to brainstorming new services.

====Data that was provided for the workshop====

Asahi Shimbun newspaper articles from the week after March 11 (source: The Asahi Shimbun Company)
Google Insights for Search (source: Google Japan Inc.)
Text summary of television broadcasts made just after the Great East Japan Earthquake (source: JCC Corp.)
Tweets from the week after March 11 (source: Twitter Japan K.K.)
Transcripts of the audio broadcasted by NHK-G in the 24 hours following the disaster and a ranking of frequently used words (source: Japan Broadcasting Corporation, NHK)
Honda Internavi traveled roads data (source: Honda Motor Company, Ltd.)
Rescuenow's railroad operation information and various disaster-related information (source: Rescuenow. Inc.)
Traffic congestion statistics (source: ZENRIN DataCom Co., Ltd.)
Short link data of bit.ly (source: Bitly Inc)
Citizens' report on damages from earthquake and tsunami, lifeline information (source: Weather News)
Information on earthquake and tsunami prediction, data from AMEDAS (source: Japan Weather Association)

A bit more information about some of these datasets:

Honda's Internavi road data is based on data collected from car navigation systems. Aggregated into a map of drivable roads, it was extremely useful, since many roads were damaged by the tsunami and were not drivable. You can still see it on Google's Crisis Response page.

Here are some videos that explain Honda's data in more detail.

Zenrin DataCom's traffic congestion statistics come from mobile phones. Aggregated, this data is extremely useful for understanding where people were and how they moved. In the following video you can see data from Tokyo on 3/11/2011. People move quickly as they commute in the morning, the city of Tokyo gets really congested, and movement suddenly slows down from around 14:45, when the earthquake struck, since much of the public transportation stopped. People then slowly move out of Tokyo, and the speed picks up toward the night as public transportation recovers.

====Results====

-600+ teams registered and signed the TOS. They were a mixture of researchers, developers, journalists, professors, students, NPOs, etc.
-Massive collaboration among people who had never met, from different universities, different companies, etc.
-50 teams presented their research results at the report event (more teams wanted to present, but event time was limited)
-Report event summary with all the abstracts, slides, and videos
Part1: Information in the devastated area [ja] (8 presentations)
Part2: Crisis information on Twitter [ja] (5 presentations)
Part3: Understanding what happened during the crisis [ja] (12 presentations)
Part4: Chaos in areas around Tokyo [ja] (6 presentations)
Part5: The role of mass media during crisis [ja] (6 presentations)
Part6: What should the citizens tell others, to whom and how [ja] (6 presentations)
Part7: Information circulation via Twitter [ja] (7 presentations)
Poster sessions [ja] (15 presentations)
Comments from the commentators and data providers [ja]

Further posts are planned on individual project reports:
-Comments from Mr. Suzuki from Kesennuma
-Comments from Professor Murai
-Project Hayano, an analysis of iodine emitted from the Fukushima nuclear power plant
-Visualization of the Evacuation Process During the Tsunami

====What worked and what didn't====

-Open Data
Much of the data provided for this workshop was not available until the workshop. It was a good opportunity not only for the participants but also for the data providers to reconsider their policies and prepare for what they can do with their data in the future. They learned the value of opening their data and letting citizens use it to produce more valuable findings. Some companies are already preparing to act. True, we would've loved to have more data providers, but I did hear a comment from a participant that "companies that can't provide data for workshops won't be able to provide data during a crisis anyway."

-1.5 months analysis period
We announced the event on 9/12 and held office hours on 9/19, where all the participants could come to the Google Japan office and the data providers answered all of their questions. We held the mid-term report event on 10/13 and the final report event on 10/28. There was much discussion about this schedule. Some said it was too short for deep analysis. Others said the short period was good because it let them concentrate and come up with results quickly. During the next crisis we won't have 1.5 months to analyze; we will need to move quickly to provide valuable analysis, so I think we did the right thing. I may be wrong.

-5 minute presentation time
The format was a 5-minute presentation followed by 2 minutes of Q&A and comments, with the exception of the top 5 teams, which were voted to speak for 2 extra minutes. At the mid-term report event, I was unable to stop some people from going over time. The presentations were extremely interesting and it was hard to stop them, but I did get some complaints about fairness. At the final report event, we used a gong and stopped everyone, however interesting they were. The event ran smoothly and on schedule. Although 5 minutes was very short, the presentations became crisp, and I did not get bored even after listening to 50 presentations in one day, so I think it was good.

-Final slide rule
After the mid-term report event, one of the participants, Professor Ryugo Hayano, told me we should fix the format. People were just reporting what they learned from the analysis, which is not enough. So for the final report event, we made a rule that presenters end with a last slide presenting their vision of how their analysis will benefit the victims of the Tohoku disaster or help in future disasters. Some actionable proposals came out of those last slides.

-More collaboration
The communication tools among the participants were the offline events, the mailing list, Twitter, Google+, etc. Some people took advantage of them and some didn't. An example of good collaboration was the "geolocation info tweet list project". The Twitter data was just a list of tweets with geo-id and tweet-id, and each participant had to use the Twitter API to get the location information. So one of the projects that came up was building a JSON-format dataset of tweets with geolocation data. Some people announced that they would start working on it so that others could use it, and many who saw its value started helping out. They had never met each other, but they started collaborating online. The "Twitter data cleaning project" was another one. The data provided by Twitter contained line breaks and was difficult to use, so some of the participants started a project to clean the data and provided scripts within hours or a couple of days. Some wrote a script in Python and some in Ruby, some wrote a script to extract posts containing specific words, and some made a tool to turn the data into charts. The thread started with "Houston, I have a problem with Twitter data. I think this is applicable to all of you", and others just jumped in to help. They fixed the issue and avoided duplicate work at a very early stage, so I think collaboration worked well on the tools side. The problem was on the analysis side. Since many teams were working on rumor analysis from Twitter data, it was hard for them to throw away their own analysis and join other groups doing similar analysis. So we had some duplicated effort there, which more collaboration could have avoided.
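
To give a feel for the kind of cleaning script the participants shared, here is a minimal Python sketch. It is not the script anyone actually wrote at the workshop, and the file layout is an assumption: this post does not describe the real format of the Twitter data, so the sketch pretends each record starts with a numeric tweet ID followed by a tab. The names merge_broken_lines and filter_by_keyword are made up for illustration.

```python
import re

# ASSUMPTION: every record begins with a numeric tweet ID and a tab.
# The real workshop data layout is not described in this post.
RECORD_START = re.compile(r"^\d+\t")


def merge_broken_lines(lines):
    """Rebuild one record per tweet from lines split by embedded line breaks.

    A line that does not look like the start of a record is treated as the
    continuation of the previous tweet and is appended to it with a space.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # Blank line inside a tweet: nothing to append.
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def filter_by_keyword(records, keyword):
    """Return only the records whose text contains the given keyword."""
    return [record for record in records if keyword in record]


if __name__ == "__main__":
    # Hypothetical sample: the first tweet was split across two lines.
    sample = [
        "1\tuser_a\ttrains stopped at\n",
        "Shinjuku station\n",
        "2\tuser_b\troads are congested\n",
    ]
    cleaned = merge_broken_lines(sample)
    for record in filter_by_keyword(cleaned, "station"):
        print(record)
```

One limitation of this heuristic: if the text of a tweet itself contains a line that starts with digits and a tab, that line would be mistaken for a new record, so a real script would need a stricter check against the actual format.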

====Comments====

I really think it is important that we remember, reflect on, and learn from past experiences so that we don't make the same mistakes again. It is also important to share what we learned. That is exactly why I'm writing this post.

It is easy for governments and data owners to resist opening their data. They have a bunch of reasons not to. Maybe citizens will not use the data even if they make it open (after a lot of effort). Maybe something bad will happen. This project is the antithesis of that. People WILL come, use the data, and come up with valuable analysis if we open the data. I am glad we were able to prove that.

I think it is easy for analysts, researchers and developers to just complain that the data is not open, that it doesn't exist, that the format is bad, etc. This project is the antithesis of that as well. The dataset was not complete, and it had some restrictions. We didn't have some data that we would've wanted to have. But there will never be a day when EVERYTHING is open. Nor should there be: some data should be open and some, like personal data, should be closed. Complaining is easy, but I believe that if we can show the value of opening data and making use of it, more governments, more companies, and more entities will start opening their data. I think Project311 was evidence of such a move.

Also, the data was too big for some people to handle, and their tools didn't work. But there will never be a day when a perfect tool is ready for you. We just need to collaborate, work around the problems, and tackle the data.

Oh and by the way, it doesn't even have to be "big data". It should be "valuable data", but data that is rubbish to some people may be valuable to others, so don't discount that. "Open data in a standardized format". No PDF, please.

Lastly, open data is not just about crises. The Japanese government, companies and society need to get used to opening various data all the time. Also, "opening data" does not merely mean making data available. It includes the responsibility of citizens, researchers, developers, etc. to make use of that data, so information literacy and data literacy are a big issue here.

====Photos from the workshop====

Professor Jun Murai from Keio University (left), the father of the Internet in Japan, and Mr. Hidemitsu Suzuki (middle), who was in charge of the disaster response in the city of Kesennuma's Crisis Management division, stayed for the full 9 hours of presentations and commented on each and every one of the 50 presentations. In fact, they were still discussing the presentations during the breaks. On the right is Professor Fumihiko Imamura from the Tsunami Engineering Laboratory, Disaster Control Research Center of Tohoku University.

The University of Tokyo kindly offered us their classroom for the presentations.

We livestreamed the presentations on Hangout On Air so that those who could not come to the venue could watch. We also had an NPO in Sendai working on crisis response and a local government official from Ishinomaki city join the hangout to comment on the presentations. I think it is important to involve people on the ground in this research, so that we are not doing these activities for self-satisfaction but working, through direct discussion, on something that is actually useful for those in need.

I have seen too many projects after the earthquake that are self-satisfying for engineers in Tokyo but not useful on the ground. It hurts to hear such honest opinions, but we need to listen to those voices and refocus on what is actually useful, rather than spending valuable resources on projects that are not going to be used.