The U.S. intelligence community has long seen its mission as gathering secrets. But sometimes insights are hiding in plain sight among tweets, blog posts, online videos, newspaper articles, academic journals and public records. Leveraging such open source information can greatly enhance our understanding of the world and of critical events with national security implications.
The Cipher Brief’s Levi Maxey spoke with Jason Matheny, the director of Intelligence Advanced Research Projects Activity (IARPA), the U.S. intelligence community’s over-the-horizon research and development wing, about the value the intelligence community places on open source data.
Jason Matheny: There has been some empirical investigation of the relative value of classified and unclassified analysis. One study measured cost per citation in intelligence products, including documents like the presidential daily briefing, and found that, per citation, open source intelligence cost less than one-tenth as much as classified intelligence. This was an unclassified result.
We have done some work here at IARPA looking at the relative accuracy of open source and classified intelligence, including as part of a program called ACE that we ran here, in which people who lacked clearances made judgments about the same questions that were also posed to people who had clearances. On most events, the difference in accuracy wasn’t significant. But there was a subset of events for which the differences in accuracy were significant and where classified information seemed to make a big difference.
Broader topics, things like how political elections would turn out, whether there would be civil unrest, or whether there would be disease outbreaks – those were the kinds of questions where unclassified information was often quite relevant. The same was true for economic forecasts.
So these were forecasts about events where information is already pretty widely distributed, in which there are multiple observers and multiple participants – things such as political elections, economic events, epidemics and science and technology. All of those were cases where unclassified information was highly accurate, at least when aggregated in ways that were empirically well founded.
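The article doesn’t specify how forecasts were aggregated. One well-studied approach from the forecasting-tournament literature is to average individual probability judgments in log-odds space and then “extremize” the result away from 0.5, which tends to improve calibration when forecasters hold partly independent information. The sketch below is purely illustrative; the function name and the extremizing constant are assumptions, not the actual ACE method.

```python
import math

def aggregate_forecasts(probs, extremize=2.5):
    """Aggregate individual probability forecasts for a binary event.

    Averages the forecasts in log-odds space, then multiplies by an
    extremizing factor to push the pooled estimate away from 0.5.
    """
    # Clamp inputs to avoid infinite log-odds at exactly 0 or 1.
    clamped = [min(max(p, 0.01), 0.99) for p in probs]
    mean_logodds = sum(math.log(p / (1 - p)) for p in clamped) / len(clamped)
    # Convert the extremized log-odds back to a probability.
    return 1 / (1 + math.exp(-extremize * mean_logodds))

# Five forecasters lean toward "yes"; the pooled estimate is more
# confident than their simple average of 0.65.
print(aggregate_forecasts([0.6, 0.7, 0.65, 0.7, 0.6]))
```

The extremizing factor is a tunable parameter; setting it to 1.0 reduces the method to a plain log-odds average.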
Think of the other category of events, where you don’t have many participants, you don’t have lots of observable data, and there aren’t unclassified or open source datasets that are highly relevant. That includes topics like weapons tests or leadership decisions in countries that tightly control information. I think one of the best things to look at is head-to-head comparisons of unclassified versus classified information, and how accurate the judgments based on each turn out to be.
Now, regarding attitudes toward classified versus unclassified information, there has also been some research on this, particularly a paper that coined the term “the secrecy heuristic.” The basic idea is this: take a document that is in fact unclassified, create two versions of it – one correctly marked unclassified, the other mislabeled as classified – and randomly assign the two documents to groups of intelligence analysts or policymakers.
The recipients are likely to place higher credence or confidence in the document marked “classified,” even though the content of the two documents is identical. So there can be a bias toward classified information even when it is not more accurate or more reliable.
While intelligence analysts have long trawled local newspapers and radio broadcasts from around the globe, open source intelligence as a discipline has grown exponentially with the advent of the internet. Some point to the 2009 Green Revolution in Iran as a pivotal moment in the history of OSINT (open source intelligence), with content on blogs and Twitter providing insight into events in a secluded country. But Matheny believes open source collection and analysis has not yet seen its prime.
Matheny: I’m reminded of a Gandhi quote: asked “What do you think of Western civilization?” he replied, “I think it would be a good idea,” suggesting that it hadn’t happened yet. I think that is where we are with open source intelligence. I don’t think it has had its heyday. As an intelligence community, we don’t invest very much in open source intelligence compared to classified sources of intelligence.
Now, that said, it’s unfair to open source intelligence to think about it as a fraction of the intelligence community budget or the allocation of personnel, since so much open source intelligence is performed out in the world. It includes virtually all journalism and crowd-collected social media, and in aggregate, the people and resources allocated to open source intelligence probably dwarf what the intelligence community actually spends.
So, maybe one question is: for what kinds of problems do we need the intelligence community – that is, the 17 national intelligence organizations – to focus on open source? Or is the intelligence community’s comparative advantage in looking at classified sources of intelligence that the rest of society doesn’t have the resources to examine?
I think we probably have more opportunities to leverage open source in the intelligence community – for instance, by making more use of systems that can automatically exploit unclassified imagery, text, social media, web search queries or financial market data.
We had a program at IARPA called “Open Source Indicators” that looked at over 30,000 different open source data streams for their intelligence value in predicting elections, disease outbreaks and societal instability. There were data streams that were surprisingly valuable in helping to predict certain kinds of events.
For example, one of the earliest indicators of disease outbreaks was people canceling dinner reservations, because if they are sick, they will cancel their dinner date. And you can actually detect this by looking at aggregate data from apps like OpenTable.
There were other indicators too, like large numbers of web search queries for symptoms related to a disease, or people describing their symptoms, or the device someone posts from on social media staying home during a weekday instead of traveling to work and back, suggesting that the person is home ill. There are a number of ways of using open source unclassified data to reveal important events in societies around the world, and I think there are lots of opportunities to be inventive in how we extract meaning from open source data.
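The source doesn’t say how spikes in signals like reservation cancellations were detected. A minimal sketch, assuming a simple rolling z-score baseline (the function, window size, threshold and sample data are all illustrative, not the Open Source Indicators method):

```python
from statistics import mean, stdev

def flag_anomalies(daily_counts, window=7, threshold=3.0):
    """Flag days where a count (e.g., dinner-reservation cancellations)
    spikes well above its recent baseline, using a rolling z-score."""
    flagged = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Flag only upward spikes that exceed the z-score threshold.
        if sigma > 0 and (daily_counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# A quiet week of ~20 cancellations a day, then a sudden spike on day 9.
counts = [20, 22, 19, 21, 20, 23, 21, 22, 20, 55]
print(flag_anomalies(counts))  # prints [9]
```

A production system would need seasonality handling (weekends, holidays) and many parallel streams, but the core idea – compare today’s signal to its own recent baseline – is the same.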
There is a whole research question about what sorts of open source data are most useful in answering different intelligence questions, and we have very little empirical data about that research question. We run forecasting tournaments at IARPA in part to answer that question. We do it not just for political events or for societal events, but also science and technology events.
So, what kinds of indicators – say, publications and patents – tell us something about what’s actually happening within the scientific community? What kinds of open source data help us predict cyber attacks – for instance, chatter on hacker forums, or black market prices for malware, which, like other economic goods, rise when demand goes up while supply stays the same? Or changes in web search queries for certain IP addresses that suggest somebody is mapping a network? There are a variety of open source data streams whose value we haven’t measured, but we can measure it. We can look at the ways in which key datasets can be used to help us understand global events sooner and more accurately.
But for many intelligence services, Matheny said, analysts face a bandwidth problem, overwhelmed by the deluge of open source information – and this is where technologies such as machine learning become useful.
Matheny: The volume of open source data is many orders of magnitude greater, so the problems of data overload are even more severe. There simply aren’t enough human eyeballs to look at all of the open source data that is relevant to questions about events.
There need to be tools that describe what is happening in videos, for example. There is a lot of open source video that reflects the activities of terrorist groups. Some terrorist groups post martyrdom videos or how-to videos on making IEDs and other weapons. Detecting that those videos have been posted isn’t possible through keyword searching, because the groups don’t tag their own videos and say “this is a martyrdom video” or “this is an IED how-to video.” So you have to develop tools that automatically identify the content of the video, because it would take far too long for human analysts to pore over literally millions of hours of video each day. Automated methods are essential for the large volumes of text, imagery and video. They can help make sense of the data and identify the key items that then require further human analysis. Without that level of automation to do an initial screen, it would simply be an impossible task.
Open source data can prompt further investigation using classified methods of intelligence collection, Matheny added. LinkedIn, for example, can be an excellent source of data in determining who to target for human or signals intelligence collection.
Matheny: There is a tipping and cueing function that open source intelligence can play, because it is much cheaper to use open source than a classified form of collection. It makes sense to get a first approximation by looking at what is available in open source, in the same way that when I am trying to learn about a subject, my first stop is Wikipedia. Then, if I need a more costly form of information, I might read a journal article or check out a book from the library or buy it from a bookstore.
There are some questions for which open source information might actually be more accurate than classified forms of intelligence. In predicting societal unrest, or in detecting disease outbreaks, unclassified open source information may in fact be more accurate and more timely.
So it doesn’t only play a tipping and cueing role – in some cases it might actually be the principal form of intelligence to rely on.
Science and technology would be another example. Most of the relevant information about science and technology trends comes from publications and patents. If it is a technology with broad commercial value – say, machine learning or advances in semiconductors – you are going to see lots of published work and patents. The raw number of publications and patents does not correlate with impact, but citations are highly predictive. It would be hard to find classified sources of information that tell us as much about computing or materials science as open sources do.