Big Brother has a long history in the movies. One of the better known additions to the genre is “Enemy of the State”, a 1998 thriller directed by Tony Scott. I can’t remember how many times I’ve seen this movie, but it recently resurfaced on TV and the movie continues to entertain. However, fifteen years have taken a toll and the once-futuristic surveillance technology looks commonplace today — when the National Security Agency’s spy satellite zooms down to photograph a building, you immediately think of Google’s Street View.
Another area where the movie falls down is in its depiction of the NSA’s monitoring capabilities. Underground computers at Fort Meade monitor suspect phone calls for trigger words like “bomb,” “president” and “Allah.” But real-world terrorists are not that stupid and intelligence agencies know that simple keyword searches miss implicitly conveyed information.
So, one of the biggest challenges for intelligence services is seeing through language to discover the “hidden” meaning in texts and, particularly, informal communications.
Take for example this snippet of a telephone conversation:
A: Where were you? We waited all day for you and you never came.
B: I couldn’t make it through, there was no way. They…they were everywhere.
A: You should have found a way. You know we need the stuff for the…the party tomorrow. We need a new place to meet…tonight. How about the…uh…uh…the house? You know, the one where we met last time.
Clearly, if the two speakers are suspected terrorists, then this conversation is loaded with hidden meaning. But if they are not, that context is missing and the meaning is simply ambiguous.
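To see why the movie-style keyword search gets nowhere with a conversation like this, here is a minimal Python sketch. It is purely illustrative: the trigger words are the three quoted from the film, and the function name `find_triggers` is made up for this example. It says nothing about how any real monitoring system works.

```python
import re

# The three trigger words quoted from the film.
TRIGGER_WORDS = {"bomb", "president", "allah"}

# The telephone snippet quoted above, one entry per speaker turn.
DIALOGUE = [
    ("A", "Where were you? We waited all day for you and you never came."),
    ("B", "I couldn't make it through, there was no way. "
          "They...they were everywhere."),
    ("A", "You should have found a way. You know we need the stuff for "
          "the...the party tomorrow. We need a new place to meet...tonight. "
          "How about the...uh...uh...the house? You know, the one where we "
          "met last time."),
]


def find_triggers(text, triggers=TRIGGER_WORDS):
    """Return the trigger words that occur as whole words in `text`."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & set(triggers)


# Collect the speakers whose turns contain at least one trigger word.
flagged = [speaker for speaker, turn in DIALOGUE if find_triggers(turn)]
print(flagged)  # prints [] because no turn contains a trigger word
```

The search comes back empty, even though a human reader immediately senses that "the stuff", "the party" and "the house" may stand for something else.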
The above example comes from a fascinating presentation of a new research initiative called Deep Exploration and Filtering of Text (DEFT), which is funded by DARPA, the US agency charged with “maintaining the technological superiority” of the US military.
Over 100 defence contractors went to Washington last month to hear about DEFT, which aims to use automated natural language analysis to unlock the power of inference.
DARPA officials acknowledge that current manual methods used for analyzing texts are just too time-consuming in today’s information age.
Since “Enemy of the State” was filmed, the amount of information has grown exponentially: today there are at least 50bn web pages. So, as well as monitoring phone conversations, modern surveillance has grown to include monitoring websites, blog posts, email messages, Facebook updates, tweets and so on.
Using sophisticated artificial intelligence techniques, DEFT aims to enable defence analysts to efficiently sift massive volumes of content and discover the implicitly expressed, actionable information contained within it.
By building on the NLP technologies developed in other DARPA programmes and ongoing academic research into deep language understanding and artificial intelligence, DEFT aims to address “capability gaps” related to inference, causal relationships and anomaly detection. Put simply, DEFT plans to use natural language processing to find the needles in a very big haystack, automatically summarizing texts so that human analysts can quickly grasp the hidden meaning.
So, in the case of the dialogue above, DEFT would mark up the transcription to show the people, their associations, their activities, the causal links, the geospatial-temporal links and entity-event linking. In this case, a rescheduled meeting is going to take place at “the house”, where meetings have taken place before.
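What might such a mark-up look like as data? The Python sketch below is one guess. The `Entity`, `Event` and `Link` classes, their field names and the labels are all invented for illustration, and the annotations are hand-written from one reading of the dialogue. This is not DEFT's actual output format, which the presentation does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    id: str
    kind: str      # "person", "group", "object" or "place" (hypothetical labels)
    mention: str   # the words used in the transcript


@dataclass
class Event:
    id: str
    kind: str
    participants: List[str]
    place: Optional[str] = None   # geospatial link (an Entity id)
    time: Optional[str] = None    # temporal link, as stated in the transcript
    status: str = "asserted"      # or "planned", "proposed"


@dataclass
class Link:
    kind: str      # e.g. a causal relation between two annotations
    source: str
    target: str


# Hand-written annotations for the dialogue above (one possible reading).
entities = [
    Entity("A", "person", "speaker A"),
    Entity("B", "person", "speaker B"),
    Entity("they", "group", "They"),
    Entity("stuff", "object", "the stuff"),
    Entity("house", "place", "the house"),
]
events = [
    Event("e1", "missed_meeting", ["A", "B"]),
    Event("e2", "obstruction", ["they", "B"]),
    Event("e3", "party", ["A", "B"], time="tomorrow", status="planned"),
    Event("e4", "meeting", ["A", "B"], place="house", time="tonight",
          status="proposed"),
    Event("e5", "meeting", ["A", "B"], place="house", time="last time"),
]
links = [
    Link("caused_by", "e1", "e2"),       # B did not come because "they were everywhere"
    Link("reschedules", "e4", "e1"),     # tonight's meeting replaces the missed one
    Link("needed_for", "stuff", "e3"),   # "the stuff" is needed for "the party"
]


def events_at(place_id):
    """Return every annotated event linked to the given place."""
    return [e for e in events if e.place == place_id]


print([(e.id, e.time, e.status) for e in events_at("house")])
# prints [('e4', 'tonight', 'proposed'), ('e5', 'last time', 'asserted')]
```

Once the transcript is structured like this, a question such as "what has happened, or is about to happen, at the house?" becomes a simple lookup rather than a matter of rereading the conversation.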
The algorithms developed for DEFT will initially be used to mark up narrative English text, conversational English speech (automatically transcribed) and conversational foreign language text (automatically translated).
The bar gets raised in successive phases as the algorithms must then be able to understand conversational foreign language text without prior translation and, even more difficult, conversational foreign language speech that has been automatically transcribed and translated.
Perhaps the most revealing nugget in the DEFT proposal concerns the languages for the data sets against which the algorithms will be tested. In addition to English, contractors must include one or more of the following languages: dialectal Arabic, Cebuano, Chinese, Pashto, Dari, Farsi, Ilocano, Spanish, Tagalog, or Urdu.
Would-be spies can find much more detail on DEFT here… Just joking, NSA.