kindofdoon: The Blog of Daniel W. Dichter: NYT "Resistance" Op-Ed: Part 2

Overview

Last month, I waded into forensic linguistics, using natural language processing and other contextual clues to identify the most likely author of the New York Times' controversial op-ed "I Am Part of the Resistance Inside the Trump Administration", published anonymously by a "senior administration official". I concluded that the most likely author was Fiona Hill, a senior but low-profile policy adviser on Russia and President Putin. In this post, I set aside the "Resistance" op-ed and attempt to quantify more generally how accurate my model is, and what that implies for my prediction.

Exploring Validation Data

Like a game of Guess Who, the analysis began by gathering a shortlist of officials considered to be frontrunners by media and betting venues - around 40 in total. For each candidate, I manually collected about 5,000 words of their writing from op-eds, essays, speeches, etc., attempting to maximize similarity to the subject, venue, and tone of the anonymous op-ed. Some compromises had to be made, as the various candidates speak and write for widely varying audiences and occasions. Each candidate ended up with about five writing samples of around 1,000 words each.

My validation process is straightforward: from my original text data, withhold one writing sample (from a known author), train the model on the remaining data, and rank the candidates in terms of linguistic similarity to the withheld writing sample. Then, repeat this process for all remaining writing samples, around 200 in total, and assess. The initial result is underwhelming:

A few authors (Hassett, Hill, McMahon, etc.) were correctly identified most of the time, but most were never correctly identified. However, from my limited understanding, forensic linguistics is more probabilistic than exact, and with such a large field of candidates, determining exact authorship of a relatively short writing sample is a long shot. To dig deeper, I set aside prediction accuracy and looked instead at relative rankings:

This is a bit more encouraging. ~85% of candidates rank in the top half for their respective writing samples, on average. At the bottom of the chart are linguistic chimeras, whose writing samples are apparently so disparate that the author sounds like multiple different voices. The more likely cause, though, is inconsistent text data with significant stylistic differences between samples, e.g. an op-ed vs. a prepared speech. There is another issue: the same "usual suspects" frequently appear as top candidates across multiple different authors. Presumably this is because their linguistic style is highly generic and often "mistaken" for others'. To get a sense of this, I plotted overall ranking averages for all writing samples, including those by the author and those not:

Another way to evaluate this method is by asking how it compares to random guesswork. Plotting this same data as a probability density function (PDF) and a cumulative distribution function (CDF):

Clearly, this method is substantially better than guesswork, though imperfect. The CDF also illustrates a very useful trade-off: pick only the top candidate, and you're 15% likely to be correct. Pick the top two and that rises to 18%; the top five, 40%; the top 10, 65%; the top 15, 75%; and so on. Looking at it from the opposite perspective, we can be 75% sure that the author is not among the bottom 37 - 15 = 22 candidates, and ruling people out may be nearly as useful.
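For readers who want to try this at home, here is a minimal sketch of the hold-one-out validation and the rank PDF/CDF described above. It is not my actual model: the bag-of-words cosine similarity is a crude stand-in for real stylometric features, and the function names and toy samples are invented purely for illustration.

```python
import math
from collections import Counter


def word_counts(text):
    """Bag-of-words counts: a crude stand-in for real stylometric features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leave_one_out_ranks(samples):
    """Withhold each (author, text) sample in turn, pool the remaining texts
    per author, and record the rank (1 = most similar) of the true author."""
    ranks = []
    for i, (true_author, held_out) in enumerate(samples):
        profiles = {}
        for j, (author, text) in enumerate(samples):
            if j != i:
                profiles.setdefault(author, Counter()).update(word_counts(text))
        if true_author not in profiles:
            continue  # the author has no other samples to compare against
        target = word_counts(held_out)
        ordered = sorted(profiles, key=lambda a: cosine(profiles[a], target),
                         reverse=True)
        ranks.append(ordered.index(true_author) + 1)
    return ranks


def rank_pdf_cdf(ranks, n_candidates):
    """Empirical probability of each rank (PDF) and its running total (CDF)."""
    counts = Counter(ranks)
    pdf = [counts[r] / len(ranks) for r in range(1, n_candidates + 1)]
    cdf, running = [], 0.0
    for p in pdf:
        running += p
        cdf.append(running)
    return pdf, cdf


def chance_cdf(n_candidates):
    """CDF for pure guesswork: the top N of K candidates contain the author N/K of the time."""
    return [n / n_candidates for n in range(1, n_candidates + 1)]


# Invented toy data: two authors, one deliberately off-style sample for "A".
toy_samples = [
    ("A", "tariffs trade tariffs markets"),
    ("A", "trade markets tariffs growth"),
    ("B", "security alliance russia security"),
    ("B", "russia alliance security sanctions"),
    ("A", "russia security alliance talks"),
]
ranks = leave_one_out_ranks(toy_samples)
pdf, cdf = rank_pdf_cdf(ranks, n_candidates=2)
baseline = chance_cdf(2)
```

On this toy data, four of the five withheld samples rank their true author first, and the off-style sample ranks its author second, so the first CDF value is 0.8 against a chance baseline of 0.5. The gap between those two curves is what the plots above are measuring.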

Implications for the NYT Op-Ed

On Fiona Hill

Hill is shown here to be one of the most frequent false-positive candidates, i.e. the algorithm often erroneously attributes other authors' text to her. On the other hand, text by Hill is correctly attributed 85% of the time. Per the PDF, I can now ascribe a confidence level of 15% to my prior Hill prediction. This is low in absolute terms but high relative to a field of 37 candidates, where a random guess would be correct 1/37 = 2.7% of the time.
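To put "low in absolute terms but high in relative terms" into numbers, here is a quick back-of-the-envelope check. The variable names are mine; the 15% and the field of 37 come from the text above.

```python
n_candidates = 37          # size of the candidate field
top_pick_accuracy = 0.15   # chance the top-ranked candidate is correct, per the PDF

chance = 1 / n_candidates            # probability that a random guess is correct
lift = top_pick_accuracy / chance    # how many times better than guessing

print(f"chance: {chance:.1%}, lift over chance: {lift:.2f}x")
```

In other words, the top pick is roughly five and a half times more likely to be right than a name drawn from a hat.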

Widening the Net

As explained under the CDF, naming a larger set of candidates increases the certainty that the author is among them. Let's re-examine a graph of overall linguistic similarity from my previous post:

From the PDF, we see, as expected, that the candidate ranked first is most likely to be correct. Counter-intuitively, the next-most likely candidate is not the one ranked second but the one ranked fourth, and so on down the list. So using similarity together with the PDF, we can pick a confidence level, e.g. 60%, and then pick the top N candidates such that, per the CDF, we are 60% confident that the set contains the author. For a simple preponderance of the evidence, i.e. >50% certainty, the field required is 6 candidates. So the author is probably one of:

Fiona Hill
Andrew Wheeler
Mike Pence
Rick Perry
Nikki Haley
Jared Kushner

Conversely, the author is probably not any of the remaining candidates. But "probably" can be deceptive here: at just over 50%, this is still essentially a coin toss. I'd rather be at least "pretty sure". So I'll raise the confidence level to 80%, and from the CDF see that a field of 16 candidates is required. So, we can be pretty sure that the author is one of:

Fiona Hill
Andrew Wheeler
Mike Pence
Rick Perry
Nikki Haley
Jared Kushner
Ryan Zinke
John Bolton
Mike Pompeo
Kevin Hassett
John Sullivan
Linda McMahon
Steven Mnuchin
Betsy DeVos
Jim Mattis
John Kelly

And we can be pretty sure that the author is not:

Mick Mulvaney
Joseph Simons
Alex Azar
Robert Wilkie
Alexander Acosta
Ben Carson
Raj Shah
Larry Kudlow
Ivanka Trump
Gina Haspel
Ajit Pai
Jeff Sessions
Dan Coats
Elaine Chao
Kellyanne Conway
Jon Huntsman
Wilbur Ross
Melania Trump
Kirstjen Nielsen
Robert Lighthizer
Sonny Perdue
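The cut-off behind the two shortlists above can be sketched as a small lookup: walk down the CDF until the chosen confidence level is reached, then split the ranked candidates at that point. The function names, the toy CDF, and the placeholder candidates below are invented for illustration; with my real CDF, the same lookup gives N = 6 at >50% and N = 16 at 80%.

```python
def smallest_field(cdf, confidence):
    """Smallest N such that the top N candidates contain the author
    with at least the given probability, per the CDF."""
    for n, cumulative in enumerate(cdf, start=1):
        if cumulative >= confidence:
            return n
    raise ValueError("confidence level is never reached by this CDF")


def split_field(ranked_candidates, cdf, confidence):
    """Split a similarity-ranked list into (likely, unlikely) at the cut-off."""
    n = smallest_field(cdf, confidence)
    return ranked_candidates[:n], ranked_candidates[n:]


# Invented toy values, NOT the CDF from this post.
toy_cdf = [0.30, 0.45, 0.60, 0.80, 1.00]
toy_ranking = ["P", "Q", "R", "S", "T"]

likely, unlikely = split_field(toy_ranking, toy_cdf, confidence=0.50)
pretty_sure, pretty_sure_not = split_field(toy_ranking, toy_cdf, confidence=0.80)
```

Raising the confidence level can only ever grow the "likely" list, which is the trade described above: more certainty, less specificity.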

Further Speculation

Setting aside linguistic similarity and looking at the "pretty sure" shortlist, suppose we eliminate all officials not involved in foreign policy, the primary policy area of the "Resistance" op-ed:

Fiona Hill
~~Andrew Wheeler~~
Mike Pence
~~Rick Perry~~
Nikki Haley
Jared Kushner
~~Ryan Zinke~~
John Bolton
Mike Pompeo
~~Kevin Hassett~~
John Sullivan
~~Linda McMahon~~
~~Steven Mnuchin~~
~~Betsy DeVos~~
Jim Mattis
~~John Kelly~~

Suppose we then eliminate all candidates who have denied writing the op-ed:

Fiona Hill
~~Mike Pence~~
~~Nikki Haley~~
Jared Kushner
~~John Bolton~~
~~Mike Pompeo~~
John Sullivan
~~Jim Mattis~~

That leaves just:

Fiona Hill
Jared Kushner
John Sullivan
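The two eliminations above amount to simple set filtering, sketched below using the names exactly as struck through in the lists. The variable names are mine, and the groupings just mirror the strikethroughs; they are my own judgment calls, not an authoritative classification.

```python
pretty_sure = [
    "Fiona Hill", "Andrew Wheeler", "Mike Pence", "Rick Perry",
    "Nikki Haley", "Jared Kushner", "Ryan Zinke", "John Bolton",
    "Mike Pompeo", "Kevin Hassett", "John Sullivan", "Linda McMahon",
    "Steven Mnuchin", "Betsy DeVos", "Jim Mattis", "John Kelly",
]

# Struck out in the first pass: not involved in foreign policy.
not_foreign_policy = {
    "Andrew Wheeler", "Rick Perry", "Ryan Zinke", "Kevin Hassett",
    "Linda McMahon", "Steven Mnuchin", "Betsy DeVos", "John Kelly",
}

# Struck out in the second pass: have denied writing the op-ed.
denied = {"Mike Pence", "Nikki Haley", "John Bolton", "Mike Pompeo", "Jim Mattis"}

# Apply both filters while preserving the similarity ranking order.
remaining = [
    name for name in pretty_sure
    if name not in not_foreign_policy and name not in denied
]
print(remaining)
```

Because the list comprehension preserves order, the survivors stay ranked by linguistic similarity.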

And that's as far as I want to take this now. I welcome further speculation and remarks in the comments below.

2 comments:

Kate Hampstead, October 11, 2019 at 9:59 PM
It looks as if Fiona Hill will be testifying before Congress on Monday 10/14/19. If she does and a transcript becomes available, it would be interesting if you could compare her spoken testimony to the excellent analysis you've done in this piece.

Thursday, December 6, 2018
