Last month, I waded into forensic linguistics, writing about using natural language processing, along with other contextual clues, to identify the most likely author of the New York Times' controversial op-ed "I Am Part of the Resistance Inside the Trump Administration", published anonymously by a "senior administration official". My conclusion was that the most likely author was Fiona Hill, a senior-level but low-profile policy adviser on Russia and President Putin. In this post, I set aside the "Resistance" op-ed and attempt to quantify more generally how accurate my model may be, and what the implications are for my prediction.
Exploring Validation Data
Like a game of Guess Who, the analysis began by gathering a shortlist of officials considered to be frontrunners by media and betting venues - around 40 in total. For each candidate, I manually collected about 5,000 words of their writing from op-eds, essays, speeches, etc., attempting to maximize similarity to the subject, venue, and tone of the anonymous op-ed, though some compromises had to be made, as the various candidates speak and write for widely varying audiences and occasions. Each writing sample was around 1,000 words in length, and the number of samples per candidate was about five.
My validation process is straightforward: from my original text data, withhold one writing sample (from a known author), train the model on the remaining data, and rank the candidates in terms of linguistic similarity to the withheld writing sample. Then, repeat this process for all remaining writing samples, around 200 in total, and assess. The initial result is underwhelming:
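The withhold, train, and rank loop described above can be sketched in Python as follows. This is a minimal illustration and not the model used in this analysis: the character-trigram profiles and the cosine similarity are stand-in assumptions, and the function names (`char_ngrams`, `cosine`, `leave_one_out_ranks`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, a crude stand-in for style features."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def leave_one_out_ranks(samples):
    """samples: list of (author, text) pairs.

    For each sample, withhold it, build one profile per author from the
    remaining samples, and record the 1-based rank of the true author
    among all candidates ordered by similarity to the withheld text."""
    results = []
    for held_idx, (true_author, held_text) in enumerate(samples):
        profiles = {}
        for idx, (author, text) in enumerate(samples):
            if idx != held_idx:
                profiles.setdefault(author, Counter()).update(char_ngrams(text))
        if true_author not in profiles:
            continue  # the author's only sample was withheld; nothing to rank
        held = char_ngrams(held_text)
        ranking = sorted(profiles, key=lambda a: cosine(held, profiles[a]), reverse=True)
        results.append((true_author, ranking.index(true_author) + 1))
    return results
```

A rank of 1 means the withheld sample was attributed to its true author; the list of ranks across all samples is the raw material for the accuracy and ranking summaries discussed in this post.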
A few authors (Hassett, Hill, McMahon, etc.) were correctly identified most of the time, but most were never correctly identified. However, from my limited understanding, forensic linguistics is more probabilistic than exact, and with such a large field of candidates, determining exact authorship of a relatively short writing sample is a long shot. To dig deeper, I set aside prediction accuracy and looked instead at relative rankings:
This is a bit more encouraging. On average, ~85% of candidates rank in the top half for their respective writing samples. At the bottom of the chart are linguistic chimeras, whose writing samples are apparently so disparate that the author sounds like multiple different voices. The more likely cause, though, is inconsistent text data with significant stylistic differences between samples, e.g. an op-ed vs. a prepared speech. Another issue exists: the same "usual suspects" frequently appear as top candidates across multiple different authors. Presumably this is because their linguistic style is highly generic and often "mistaken" for others'. To get a sense of this, I plotted overall ranking averages for all writing samples, including those by the author and those not:
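The two diagnostics in this paragraph, how well each author ranks on their own samples and how often a candidate is wrongly placed first on someone else's, could be tallied roughly like this. It is a sketch under assumptions: the input format (one full candidate ranking per withheld sample) and the name `summarize_rankings` are hypothetical, not taken from my actual code.

```python
from collections import Counter, defaultdict


def summarize_rankings(trials):
    """trials: list of (true_author, ranking), where ranking lists every
    candidate from most to least similar to the withheld sample.

    Returns (mean_rank, false_top):
      mean_rank maps each author to the average 1-based rank they receive
      on their own samples;
      false_top counts how often each candidate is ranked first on a
      sample written by someone else."""
    own_ranks = defaultdict(list)
    false_top = Counter()
    for true_author, ranking in trials:
        own_ranks[true_author].append(ranking.index(true_author) + 1)
        if ranking[0] != true_author:
            false_top[ranking[0]] += 1
    mean_rank = {author: sum(r) / len(r) for author, r in own_ranks.items()}
    return mean_rank, false_top
```

Authors with a high `mean_rank` correspond to the "chimeras" at the bottom of the chart, while candidates with a large `false_top` count are the generic-sounding "usual suspects".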
Another way to evaluate this method is to ask how it compares to random guesswork. Here is the same data plotted as a probability density function (PDF) and a cumulative distribution function (CDF):
Clearly, this method is substantially better than guesswork, though imperfect. The CDF also illustrates a very useful trade-off: pick the top candidate, and you're 15% likely to be correct. The top two picks together give 18%, the top five 40%, the top 10 65%, the top 15 75%, and so on. Looking at it from the opposite perspective, we can be 75% sure that the bottom (37-15) = 22 candidates are not the author, and ruling people out may be nearly as useful.
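For readers who want to reproduce this kind of curve, here is a minimal sketch of how an empirical PDF and CDF of the true author's rank can be built, and how to read off the field size needed for a given confidence. The function names are hypothetical, and the input is assumed to be one true-author rank per withheld sample.

```python
from collections import Counter


def rank_cdf(true_ranks, n_candidates):
    """Empirical PDF and CDF of the true author's rank.

    pdf[k-1] is the share of withheld samples whose author placed exactly k-th;
    cdf[k-1] is the share whose author placed in the top k."""
    counts = Counter(true_ranks)
    total = len(true_ranks)
    pdf, cdf, running = [], [], 0
    for k in range(1, n_candidates + 1):
        running += counts[k]
        pdf.append(counts[k] / total)
        cdf.append(running / total)
    return pdf, cdf


def field_size_for(cdf, confidence):
    """Smallest top-N whose cumulative hit rate reaches the requested
    confidence; falls back to the full field if it is never reached."""
    for n, share in enumerate(cdf, start=1):
        if share >= confidence:
            return n
    return len(cdf)


def random_baseline(n_candidates, top_n=1):
    """Chance that top_n uniformly random picks include the author."""
    return top_n / n_candidates
```

With the figures quoted above, a 15% hit rate for the top pick against a chance rate of 1/37, about 2.7%, is roughly 5.5 times better than guessing.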
Implications for the NYT Op-Ed
On Fiona Hill
Hill is shown here to be one of the most frequent false-positive candidates, i.e. the algorithm often erroneously attributes other authors' text to her. On the other hand, text by Hill is correctly attributed 85% of the time. Per the PDF, I can now ascribe a confidence level of 15% to my prior Hill prediction. This is low in absolute terms but high relative to a field of 37 candidates, where the probability of a correct random guess is 1/37, or 2.7%.
Widening the Net
As explained under the CDF, allowing a larger field of authors increases prediction certainty. Let's re-examine a graph of overall linguistic similarity from my previous post:
From the PDF, we see, as expected, that the candidate ranked first is most likely to be correct. Continuing on, the next-most likely candidate is, counter-intuitively, not the one ranked second but the one ranked fourth, and so on down the list. So using similarity together with the PDF, if we pick a confidence level, e.g. 60%, we can pick the top N candidates such that we are 60% confident that the author is contained in the set, per the CDF. For a simple preponderance of the evidence, i.e. >50% certainty, the field required is 6 candidates. So the author is probably one of:
- Fiona Hill
- Andrew Wheeler
- Mike Pence
- Rick Perry
- Nikki Haley
- Jared Kushner
Conversely, the author is probably not any of the remaining candidates. But that wording can be deceptive, as this is still essentially a coin-toss scenario with low certainty. I'd rather be at least "pretty sure". So I'll increase the confidence level to 80%, and from the CDF see that a field of 16 candidates is required. So, we can be pretty sure that the author is one of:
- Fiona Hill
- Andrew Wheeler
- Mike Pence
- Rick Perry
- Nikki Haley
- Jared Kushner
- Ryan Zinke
- John Bolton
- Mike Pompeo
- Kevin Hassett
- John Sullivan
- Linda McMahon
- Steven Mnuchin
- Betsy DeVos
- Jim Mattis
- John Kelly
And we can be pretty sure that the author is not:
- Mick Mulvaney
- Joseph Simons
- Alex Azar
- Robert Wilkie
- Alexander Acosta
- Ben Carson
- Raj Shah
- Larry Kudlow
- Ivanka Trump
- Gina Haspel
- Ajit Pai
- Jeff Sessions
- Dan Coats
- Elaine Chao
- Kellyanne Conway
- Jon Huntsman
- Wilbur Ross
- Melania Trump
- Kirstjen Nielsen
- Robert Lighthizer
- Sonny Perdue
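The earlier observation that the second-most likely position is the one ranked fourth, not second, suggests one way to combine the similarity ranking with the PDF: take rank positions in order of decreasing probability rather than strictly from the top. The sketch below is only one reading of that idea, with a hypothetical function name and toy probabilities that are not my validation results.

```python
def most_probable_ranks(pdf, confidence):
    """Pick rank positions in order of decreasing probability mass until the
    selected positions together reach the requested confidence.

    pdf[k-1] is the probability that the true author sits at rank k.
    Returns (sorted 1-based rank positions, total probability covered)."""
    order = sorted(range(len(pdf)), key=lambda i: pdf[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        if mass >= confidence:
            break
        chosen.append(i + 1)
        mass += pdf[i]
    return sorted(chosen), mass


# Toy PDF (illustrative values only): rank 4 is more likely than rank 2.
toy_pdf = [0.30, 0.05, 0.10, 0.25, 0.20, 0.10]
positions, covered = most_probable_ranks(toy_pdf, 0.7)
```

When the PDF is not monotonically decreasing, a set chosen this way can reach the same confidence with fewer candidates than a contiguous top-N cut, at the cost of skipping over some highly ranked names.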
Further Speculation
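- Fiona Hill
- Andrew Wheeler (eliminated)
- Mike Pence
- Rick Perry (eliminated)
- Nikki Haley
- Jared Kushner
- Ryan Zinke (eliminated)
- John Bolton
- Mike Pompeo
- Kevin Hassett (eliminated)
- John Sullivan
- Linda McMahon (eliminated)
- Steven Mnuchin (eliminated)
- Betsy DeVos (eliminated)
- Jim Mattis
- John Kelly (eliminated)

- Fiona Hill
- Mike Pence (eliminated)
- Nikki Haley (eliminated)
- Jared Kushner
- John Bolton (eliminated)
- Mike Pompeo (eliminated)
- John Sullivan
- Jim Mattis (eliminated)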
That leaves just:
- Fiona Hill
- Jared Kushner
- John Sullivan
And that's as far as I want to take this now. I welcome further speculation and remarks in the comments below.
It looks as if Fiona Hill will be testifying before Congress on Monday 10/14/19. If she does and a transcript becomes available, it would be interesting if you could compare her spoken testimony to the excellent analysis you've done in this piece.
Needless to say I was very intrigued to learn this, and will be watching closely. However, I'm hesitant to introduce this new testimony into this analysis, as it will be spoken and partially in Q&A format. Whenever possible I used long-form essays to match the format of the op-ed and isolate linguistic style. Nevertheless, I will think on how to best follow up with this development.