Analyzing the Mueller Report with Python, R, and NLP APIs

The Special Counsel’s investigation into Russian interference in the 2016 United States election, better known as the Mueller Report, has been one of the more fascinating stories of our time. The investigation was responsible for a number of indictments and convictions of political operatives tied to the President and the Trump campaign, but the Report ultimately did not conclude that the President committed obstruction of justice beyond a reasonable doubt. It did, however, indicate that it could not clear the President of those allegations either, leaving the truth-seeking portion of the US public in a tough spot.

All other considerations aside, the report itself is an impressive document: 400-plus pages of testimony, research, and analysis generated over a two-year period, touching nearly every member of the Trump campaign and many in the White House. While I haven’t yet committed to reading all 400 pages, I thought it might be interesting to take the Report on from a data perspective and see what other takeaways it might offer.

Processing the Report

Because the Mueller report was released as a scanned PDF with manual redactions, interacting with the text at any level deeper than just reading is difficult. I knew that if I wanted to do any sort of text analytics on the document, I would have to find a way to extract the text in a format that I could work with more programmatically. After testing a lot of different platforms and options, I ultimately decided to use the following set of tools:

  • Google Cloud Vision API: to perform document text detection (aka OCR with more metadata) and transform each page into an output JSON file
  • Blingfire: to read through and tokenize the text by sentence and word
  • AWS Comprehend: to perform entity, key phrase, and sentiment analysis of the document with a pre-trained natural language processing API

There are a ton of things I left off the table here - others have already done much more extensive investigations and analyses of the Report and its findings. I mostly just wanted to test out some new NLP tools and familiarize myself with the Report at a high level, so I hope it’s interesting to see what I was able to put together!

Google Cloud Vision API

The first stage of my analysis was performed using the Google Cloud Vision document text detection API, which allowed me to read the entire Mueller report in as a PDF and batch output the OCR’ed text into structured JSON files.

After uploading the source file into a publicly available bucket, all I needed to do was create a Google Cloud service account, grant it read/write permissions on Google Cloud Storage, and save the account credentials. From there, I made a single POST request to the Vision API at https://vision.googleapis.com/v1/files:asyncBatchAnnotate to kick off the OCR process.

API Request Body

{
  "requests":[
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "gs://mueller-report-conormclaughlin/mueller-report-searchable.pdf"
        },
        "mimeType": "application/pdf"
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ],
      "outputConfig": {
        "gcsDestination": {
          "uri": "gs://mueller-report-conormclaughlin/output/"
        },
        "batchSize": 2
      }
    }
  ]
}
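
To actually submit this request, a minimal sketch along these lines works, assuming the body above is saved as request.json and the gcloud CLI is authenticated for the service account (the file name and token step are my assumptions, not the exact commands I ran):

import json
import subprocess
import requests

# Fetch an OAuth2 access token for the service account via the gcloud CLI
# (any other credential flow for the service account works just as well).
token = subprocess.run(
    ["gcloud", "auth", "application-default", "print-access-token"],
    capture_output=True, text=True, check=True
).stdout.strip()

# Load the request body shown above ("request.json" is just an illustrative name)
with open("request.json") as f:
    body = json.load(f)

resp = requests.post(
    "https://vision.googleapis.com/v1/files:asyncBatchAnnotate",
    headers={"Authorization": "Bearer " + token},
    json=body,
)
print(resp.json())  # returns an operation name to poll until the OCR job finishes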

Sample OCR Output

{
  "inputConfig": {
    "gcsSource": {
      "uri": "gs://mueller-report-conormclaughlin/mueller-report-searchable.pdf"
    },
    "mimeType": "application/pdf"
  },
  "responses": [
    {
      "fullTextAnnotation": {
        "pages": [ "MANY TEXT BOXES IN HERE"
        ],
        "text": "U.S. Department of Justice\nAttorney Work Produet // May Contain Material Proteeted Under Fed. R. Crim. P. 6fe)\nPapadopoulos was dismissed from the Trump Campaign in early October 2016, after an\ninterview he gave to the Russian news agency Interfax generated adverse publicity. 492\nf. Trump Campaign Knowledge of “Dirt”\nPapadopoulos admitted telling at least one individual outside of the Campaign-\nspecifically, the then-Greek foreign minister—about Russia's obtaining Clinton-related emails.493\nIn addition, a different foreign government informed the FBI that, 10 days after meeting with\nMifsud in late April 2016, Papadopoulos suggested that the Trump Campaign had received\nindications from the Russian government that it could assist the Campaign through the anonymous\nrelease of information that would be damaging to Hillary Clinton.494 (This conversation occurred\nafter the GRU spearphished Clinton Campaign chairman John Podesta and stole his emails, and\nthe GRU hacked into the DCCC and DNC, see Volume I, Sections III.A & III.B, supra.) Such\ndisclosures raised questions about whether Papadopoulos informed any Trump Campaign official\nabout the emails.\nWhen interviewed, Papadopoulos and the Campaign officials who interacted with him told\nthe Office that they could not recall Papadopoulos's sharing the information that Russia had\nobtained \"dirt” on candidate Clinton in the form of emails or that Russia could assist the Campaign\nthrough the anonymous release of information about Clinton. Papadopoulos stated that he could\nnot clearly recall having told anyone on the Campaign and wavered about whether he accurately\nremembered an incident in which Clovis had been upset after hearing Papadopoulos tell Clovis\nthat Papadopoulos thought \"they have her emails.9495 The Campaign officials who interacted or\ncorresponded with Papadopoulos have similarly stated, with varying degrees of certainty, that he\ndid not tell them. Senior policy advisor Stephen Miller, for example, did not remember hearing\nanything from Papadopoulos or Clovis about Russia having emails of or dirt on candidate\nClinton.496 Clovis stated that he did not recall anyone, including Papadopoulos, having given him\nnon-public information that a foreign government might be in possession of material damaging to\nHillary Clinton.497 Grand Jury\n98 Grand Jury\n492 George Papadopoulos: Sanctions Have Done Little More Than to Turn Russia Towards China,\nInterfax (Sept. 30, 2016).\n493 Papadopoulos 9/19/17 302, at 14-15; Def. Sent. Mem., United States v. George Papadopoulos,\n1:17-cr-182 (D.D.C. Aug. 31, 2018), Doc. 45.\n494 See footnote 465 of Volume I, Section IV.A.2.d, supra.\n495 Papadopoulos 8/10/17 302, at 5; Papadopoulos 8/11/17 302, at 5; Papadopoulos 9/20/17 302,\nat 2.\n496 S. Miller 12/14/17 302, at 10.\n497 Grand Jury.\n498 Grand Jury\n"
      },
      "context": {
        "uri": "gs://mueller-report-conormclaughlin/mueller-report-searchable.pdf",
        "pageNumber": 101
      }
    },
    {
      "fullTextAnnotation": {
        "pages": [ "MANY TEXT BOXES IN HERE"
        ],
        "text": "U.S. Department of Justice\nAtomey Work Produet // May Contain Material-Proteeted Under Fed. R. Erin. P. 6te)\nGrand Jury\n499 No documentary evidence, and nothing in the email accounts or other\ncommunications facilities reviewed by the Office, shows that Papadopoulos shared this\ninformation with the Campaign.\ng. Additional George Papadopoulos Contact\nThe Office investigated another Russia-related contact with Papadopoulos. The Office was\nnot fully able to explore the contact because the individual at issue—Sergei Millian—remained\nout of the country since the inception of our investigation and declined to meet with members of\nthe Office despite our repeated efforts to obtain an interview.\nPapadopoulos first connected with Millian via LinkedIn on July 15, 2016, shortly after\nPapadopoulos had attended the TAG Summit with Clovis.500 Millian, an American citizen who is\na native of Belarus, introduced himself as president of [theNew York-based Russian American\nChamber of Commerce,\" and claimed that through that position he had “insider knowledge and\ndirect access to the top hierarchy in Russian politics.9501 Papadopoulos asked Timofeev whether\nhe had heard of Millian.502 Although Timofeev said no, 503 Papadopoulos met Millian in New York\nCity.504 The meetings took place on July 30 and August 1, 2016.505 Afterwards, Millian invited\nPapadopoulos to attend—and potentially speak at-two international energy conferences,\nincluding one that was to be held in Moscow in September 2016.506 Papadopoulos ultimately did\nnot attend either conference.\nOn July 31, 2016, following his first in-person meeting with Millian, Papadopoulos\nemailed Trump Campaign official Bo Denysyk to say that he had been contacted \"by some leaders\nof Russian-American voters here in the US about their interest in voting for Mr. Trump,\" and to\nask whether he should \"put you in touch with their group (US-Russia chamber of commerce).9507\nDenysyk thanked Papadopoulos \"for taking the initiative,” but asked him to \"hold off with\n499 Grand Jury\n500 7/15/16 LinkedIn Message, Millian to Papadopoulos.\n501 7/15/16 LinkedIn Message, Millian to Papadopoulos.\n502 7/22/16 Facebook Message, Papadopoulos to Timofeev (7:40:23 p.m.); 7/26/16 Facebook\nMessage, Papadopoulos to Timofeev (3:08:57 p.m.).\n503 7/23/16 Facebook Message, Timofeev to Papadopoulos (4:31:37 a.m.); 7/26/16 Facebook\nMessage, Timofeev to Papadopoulos (3:37:16 p.m.).\n504 7/16/16 Text Messages, Papadopoulos & Millian (7:55:43 p.m.).\n5057/30/16 Text Messages, Papadopoulos & Millian (5:38 & 6:05 p.m.); 7/31/16 Text Messages,\nMillian & Papadopoulos (3:48 & 4:18 p.m.); 8/1/16 Text Message, Millian to Papadopoulos (8:19 p.m.).\n506 8/2/16 Text Messages, Millian & Papadopoulos (3:04 & 3:05 p.m.); 8/3/16 Facebook Messages,\nPapadopoulos & Millian (4:07:37 a.m. & 1:11:58 p.m.).\n507 7/31/16 Email, Papadopoulos to Denysyk (12:29:59 p.m.).\n94\n"
      },
      "context": {
        "uri": "gs://mueller-report-conormclaughlin/mueller-report-searchable.pdf",
        "pageNumber": 102
      }
    }
  ]
}

Parsing the Report Text

Tokenization

With the whole corpus of the report now in a usable form, I wanted to tokenize the text to further prepare the content for some light NLP. In a very loose sense, tokenization cleans up punctuation and syntax and splits the text into separate, standardized units like sentences and words. I chose the recently open-sourced Blingfire package for Python to perform the tokenization, after being very impressed with the ease with which one can run its text_to_words and text_to_sentences functions.
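
To give a quick sense of what those two functions do, here’s a tiny demo on a made-up snippet (not report text); both take a plain string and return a string:

from blingfire import text_to_words, text_to_sentences

sample = "Mr. Mueller's team issued subpoenas. The report runs 448 pages."
print(text_to_words(sample))      # word tokens separated by single spaces
print(text_to_sentences(sample))  # one sentence per line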

The code below downloads each JSON output file from Google Cloud (each of which has 2 pages of data), tokenizes the text, and adds it to a Python dictionary to save the text and page number.

import requests
from blingfire import text_to_sentences

report = {}

# pages 9-395 hold the main text body; each output file covers two pages
for i in range(9, 395, 2):
    temp_url = 'https://storage.googleapis.com/mueller-report-conormclaughlin/output/output-%s-to-%s.json' % (i, i + 1)
    temp_json = requests.get(temp_url).json()
    for item in temp_json["responses"]:
        page_text = item['fullTextAnnotation']['text']
        tokenized_text = text_to_sentences(page_text)
        for sentence in tokenized_text.split('\n'):
            # strip the boilerplate page header (as OCR'ed, misreadings and all)
            cleaned_text = sentence \
                .replace("U.S. Department of Justice Attorney Work Produet // May Contain Material Proteeted Under Fed. R. Crim.", "") \
                .replace("U.S. Department of Justice Attorney Work Produet // May Contain-Material Proteeted Under Fed. R. Erim.", "") \
                .replace("U.S. Department of Justice Attorney Work Produet // May Contain Material Proteeted Under Fed. R. Erim.", "")
            report[cleaned_text] = item['context']['pageNumber']
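
The R code further on expects a report_text data frame with sentence and page columns. The original write-up doesn’t show that hand-off, but a bridge step along these lines (file name and structure are my assumptions) would get the tokenized sentences out of Python:

import pandas as pd

# Hypothetical bridge step: export the sentence -> page dictionary built above
# to a CSV that can be read into R as the `report_text` data frame used below.
report_text = pd.DataFrame(
    [{"sentence": sentence, "page": page} for sentence, page in report.items()]
)
report_text.to_csv("report_text.csv", index=False)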

Data Intake

NLP with AWS Comprehend APIs

To quickly generate some insights into this data without having to set up and train an NLP model locally, I decided to use AWS’s Comprehend service, which offers a few simple APIs for language, sentiment, entity, and phrase detection. The cloudyr group has built an awesome R package (aws.comprehend, loaded below) that makes using these APIs extremely simple - performing phrase detection is as easy as calling detect_phrases() and passing in your text as an argument.

For the Mueller text, I ran the entire report (tokenized by sentence) through each of the entity, phrase, and sentiment APIs, and then saved the results to local data frames for analysis (and to avoid paying for repeat API calls). Check out the code snippet below to see how this worked in practice - to generate the other data points, I pretty much just swapped detect_entities for detect_sentiment or detect_phrases!

AWS Comprehend Code Example - Detect Entities

library("aws.signature")
library("aws.comprehend")
library(tidyverse)

# log in with saved AWS credentials
use_credentials()

entities_list = list()

for (i in 1:nrow(report_text)) {	# for each sentence in the report
  temp <- as.data.frame(
  			detect_entities(report_text$sentence[i])) %>% 	# get entities from AWS
  				mutate(page = report_text$page[i]	# get sentence's page number
  		  )
  if (typeof(temp$Index) == "double") {		# check that the line has output data
    entities_list[[i]] <- temp 				# if it does, add it to the output list
  }
}

entities <- bind_rows(entities_list)	# create DF of entity results

save(entities,file="entities.Rda")		# save DF 
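
As a side note, the same Comprehend endpoints are also available from Python via boto3; the sketch below (not part of the original R workflow) shows the single-sentence calls that correspond to detect_entities, detect_phrases, and detect_sentiment:

import boto3

# Python equivalents of the R calls above, via the official AWS SDK (boto3)
comprehend = boto3.client("comprehend")

sentence = "Papadopoulos first connected with Millian via LinkedIn on July 15, 2016."

entities = comprehend.detect_entities(Text=sentence, LanguageCode="en")["Entities"]
phrases = comprehend.detect_key_phrases(Text=sentence, LanguageCode="en")["KeyPhrases"]
sentiment = comprehend.detect_sentiment(Text=sentence, LanguageCode="en")["Sentiment"]

print(entities)   # list of dicts with Type, Text, Score, and character offsets
print(phrases)    # list of key phrases with Score and offsets
print(sentiment)  # POSITIVE, NEGATIVE, NEUTRAL, or MIXED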

Visualizing the Report Data

Key Phrases

While “the President” is the dominant character throughout the Report, it’s interesting to look at the cast of other characters who are also mentioned at length in the text.

Key Phrases in the Mueller Report

Distribution of Phrases by Page

Furthermore, we can take the most commonly referenced entities and plot where in the report they show up. Some individuals, like Comey or Flynn, are mentioned primarily in a single section of the report. Other topics and individuals are mentioned throughout, with one or more spikes. If you’re looking to read specifically about McGahn or Papadopoulos, this will show you where to look!

Distribution of Key Phrases in the Mueller Report
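
A rough Python sketch of this kind of distribution plot is below; it assumes the entity results have been exported to a CSV with Text and page columns (an assumption on my part - the original charts were not necessarily produced this way):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the entity results to entities.csv (Text + page columns)
entities = pd.read_csv("entities.csv")

# plot page-by-page mention counts for the most frequently mentioned entities
top = entities["Text"].value_counts().head(6).index
for name in top:
    pages = entities.loc[entities["Text"] == name, "page"]
    plt.hist(pages, bins=40, histtype="step", label=name)

plt.xlabel("Page")
plt.ylabel("Mentions")
plt.legend()
plt.show()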

Sentiment by Page

Finally, we’ll close with a page-by-page breakdown of text sentiment throughout the Report. While much of the language is “lawyerly” and often quite technical, the Report leans negative throughout, with many of the subjects making false statements or caught up in Russian interference in the election.

Sentiment by Page in the Mueller Report