WaPo Monkey Cage: "Here’s what data science tells us about Hillary Clinton’s emails"

Wednesday, November 2, 2016 - 5:00pm

The Washington Post's "Monkey Cage" blog reports on an analysis of declassified State Department records from the 1970s, specifically "almost a million diplomatic cables from 1973 to 1978 with full text and many kinds of metadata." Applying machine-learning algorithms to these records could provide insight into the article's central question: was "Clinton's team" "negligent when they sent emails on an insecure system that other officials later deemed to be confidential, secret or top secret?" The authors, Matthew Connelly and Rohan Shah, describe their goals and method:

We had two goals: First, find out whether, and to what extent, being classified as “secret” or “confidential” has historically been random or predictable. Second, learn what is normal and what might be considered negligent in how officials manage large numbers of potentially sensitive communications.

Here’s how we did it. Through machine-learning, a type of artificial intelligence, we create algorithms to measure and compare features in a data set that is already classified. In this case, the data consists of State Department communications, and the classes are secret, confidential and unclassified. High-performance computers systematically sort out what tends to differentiate these communications, whether that’s by subject matter, senders and receivers, or words in the message.
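The general idea of learning from records that already carry a label can be illustrated with a very small sketch. The example below is not the authors' method or code: it assumes a plain bag-of-words Naive Bayes classifier with add-one smoothing, and the five "cables" in `TOY_CABLES` are invented placeholders, not real State Department records. The function names `train` and `predict` are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest Naive Bayes log-score for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior: how common this class is among the training records.
        score = math.log(count / total_docs)
        # Add-one (Laplace) smoothing so unseen words do not zero out a class.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy records for illustration only (assumption, not real data).
TOY_CABLES = [
    ("negotiations on arms treaty with foreign minister", "secret"),
    ("military operations briefing for ambassador", "secret"),
    ("draft position for treaty negotiations", "confidential"),
    ("visa processing schedule for consulate staff", "unclassified"),
    ("press release on cultural exchange program", "unclassified"),
]

model = train(TOY_CABLES)
print(predict(model, "visa schedule for consulate"))
print(predict(model, "military operations"))
```

A real study of this kind would also draw on metadata such as senders and receivers, as the article notes, and would need far more data and careful evaluation than this toy allows.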

Connelly and Shah conclude:

Clinton can’t be considered negligent until we know how her record compares with the error rate for the rest of the State Department. Clearly, officials make errors in identifying sensitive information. But even though the government spends more than $16 billion a year guarding official secrets — almost 40 times more than it allocates to answer (or not answer) Freedom of Information Act requests — it has never studied to what extent officials agree on what they should keep secret, and how reliably they protect these secrets. Without that kind of research, we simply cannot know whether Clinton was better or worse than average in recognizing sensitive information and protecting it on a secure system.

The paper to which the article refers is Renato Rocha Souza, Flavio Codeco Coelho, Rohan Shah, Matthew Connelly, "Using Artificial Intelligence to Identify State Secrets," Computers and Society, https://arxiv.org/abs/1611.00356:

Whether officials can be trusted to protect national security information has become a matter of great public controversy, reigniting a long-standing debate about the scope and nature of official secrecy. The declassification of millions of electronic records has made it possible to analyze these issues with greater rigor and precision. Using machine-learning methods, we examined nearly a million State Department cables from the 1970s to identify features of records that are more likely to be classified, such as international negotiations, military operations, and high-level communications. Even with incomplete data, algorithms can use such features to identify 90% of classified cables with <11% false positives. But our results also show that there are longstanding problems in the identification of sensitive information. Error analysis reveals many examples of both overclassification and underclassification. This indicates both the need for research on inter-coder reliability among officials as to what constitutes classified material and the opportunity to develop recommender systems to better manage both classification and declassification.
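To make the two figures in the abstract concrete, the sketch below computes the share of classified items a classifier catches (recall) and a false positive rate from a list of true and predicted labels. It assumes that "false positives" means the fraction of unclassified items wrongly flagged as classified; the abstract does not spell out its definition, so this reading is an assumption. The function name `recall_and_fpr` and the ten labels are invented for illustration.

```python
def recall_and_fpr(y_true, y_pred):
    """Return (recall, false positive rate) for binary labels, 1 = classified."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn)  # share of truly classified items that were flagged
    fpr = fp / (fp + tn)     # share of unclassified items wrongly flagged
    return recall, fpr


# Invented labels (assumption): 4 classified and 6 unclassified records.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

recall, fpr = recall_and_fpr(y_true, y_pred)
print(f"recall={recall:.2f}, false positive rate={fpr:.2f}")
```

Under this reading, the abstract's result would correspond to a recall of about 0.90 with a false positive rate below 0.11.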

Secretary Clinton's e-mails have been released by the U.S. Department of State and can be viewed in its Virtual Reading Room, as announced on its FOIA site. The History Lab site has sorted the released e-mails and made them searchable on its Clinton E-mail Collection page.

Geographical Area: United States
Era: Barack Obama administration: 2009-present