Thursday, December 6, 2018

NYT "Resistance" Op-Ed: Part 2

Overview

Last month, I waded into forensic linguistics, writing about using natural language processing along with other contextual clues to identify the most likely author of the New York Times' controversial "I Am Part of the Resistance Inside the Trump Administration", published anonymously by a "senior administration official". My conclusion was that the most likely author was Fiona Hill, a senior-level but low-profile policy adviser on Russia and President Putin. In this post, I set aside the "Resistance" op-ed and attempt to quantify more generally how accurate my model may be, and what the implications are for my prediction.

Exploring Validation Data


Like a game of Guess Who, the analysis began by gathering a shortlist of officials considered to be frontrunners by media and betting venues - around 40 in total. For each candidate, I manually collected about 5,000 words of their writing from op-eds, essays, speeches, etc., attempting to maximize similarity to the subject, venue, and tone of the anonymous op-ed. Some compromises had to be made, as the various candidates speak and write for widely varying audiences and occasions. Each candidate ended up with about five writing samples of around 1,000 words each.

My validation process is straightforward: from my original text data, withhold one writing sample (from a known author), train the model on the remaining data, and rank the candidates in terms of linguistic similarity to the withheld writing sample. Then, repeat this process for all remaining writing samples, around 200 in total, and assess. The initial result is underwhelming:



A few authors (Hassett, Hill, McMahon, etc.) were correctly identified most of the time, but most were never correctly identified. However, from my limited understanding, forensic linguistics is more probabilistic than exact, and with such a large field of candidates, determining exact authorship of a relatively short writing sample is a long shot. To dig deeper, I set aside prediction accuracy and looked instead at relative rankings:



This is a bit more encouraging. ~85% of candidates rank in the top half for their respective writing samples, on average. At the bottom of the chart are linguistic chimeras, whose writing samples are apparently so disparate that the author sounds like multiple different voices. The more likely cause, though, is inconsistent text data with significant stylistic differences between samples, e.g. an op-ed vs. a prepared speech. There is another issue: the same "usual suspects" frequently appear as top candidates across multiple different authors. Presumably this is because their linguistic style is highly generic and often "mistaken" for others'. To get a sense of this, I plotted overall ranking averages for all writing samples, including those by the author and those not:



Another way to evaluate this method is by asking how it compares to random guesswork. Plotting this same data as a PDF and a CDF:



Clearly, this method is substantially better than guesswork, though imperfect. The CDF also illustrates a very useful trade-off: pick the top candidate, and you're 15% likely to be correct. The top two picks give 18%, the top five 40%, the top 10 65%, the top 15 75%, and so on. Looking at it from the opposite perspective, we can be 75% sure that the bottom 37 - 15 = 22 candidates are not the author, and ruling people out may be nearly as useful.
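
The validation loop and the rank CDF described in this section can be sketched roughly as follows. This is a minimal illustration, not the model used in the post: the character-trigram cosine similarity is a stand-in I am assuming for whatever linguistic features the real model uses, and the names `profile`, `cosine`, `leave_one_out_ranks`, and `rank_cdf` are hypothetical.

```python
from collections import Counter
import math


def profile(text, n=3):
    """Character n-gram counts: an assumed stand-in for real stylometric features."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leave_one_out_ranks(samples):
    """samples: list of (author, text) pairs.

    For each sample, withhold it, pool the remaining text per author,
    rank all authors by similarity to the withheld sample, and record
    the rank of the true author (1 = most similar).
    """
    ranks = []
    for i, (true_author, held_out) in enumerate(samples):
        pooled = {}
        for j, (author, text) in enumerate(samples):
            if j != i:
                pooled.setdefault(author, Counter()).update(profile(text))
        if true_author not in pooled:
            continue  # the author had no other sample to compare against
        target = profile(held_out)
        ordered = sorted(pooled, key=lambda a: cosine(target, pooled[a]), reverse=True)
        ranks.append(ordered.index(true_author) + 1)
    return ranks


def rank_cdf(ranks, n_candidates):
    """cdf[k-1] = fraction of withheld samples whose true author is in the top k."""
    total = len(ranks)
    return [sum(r <= k for r in ranks) / total for k in range(1, n_candidates + 1)]
```

A random-guess baseline for comparison is simply the straight line k / n_candidates; the further the empirical CDF sits above that line, the more the ranking is worth.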

Implications for the NYT Op-Ed

On Fiona Hill

Hill is shown here to be one of the most frequent false-positive candidates, i.e. the algorithm often erroneously attributes other authors' text to her. On the other hand, text by Hill is correctly attributed 85% of the time. Per the PDF, I can now ascribe a confidence level of 15% to my prior Hill prediction. This is low in absolute terms but high relative to a field of 37 candidates, where a random guess would be right only 1/37 ≈ 2.7% of the time.
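
To make the comparison with guesswork concrete, here is the arithmetic as a short sketch. The two inputs are the figures quoted in this post; the variable names are just illustrative.

```python
n_candidates = 37        # field size used in this post
top1_hit_rate = 0.15     # top-pick accuracy read off the PDF

chance = 1 / n_candidates        # probability that a uniformly random guess is right
lift = top1_hit_rate / chance    # how many times better than chance the top pick is

print(f"chance = {chance:.1%}, lift over chance = {lift:.2f}x")
```

In other words, the top pick is between five and six times more likely to be right than a blind guess, even though it is still wrong most of the time.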

Widening the Net

As explained under the CDF, allowing a larger field of authors increases prediction certainty. Let's re-examine a graph of overall linguistic similarity from my previous post:



As expected, the PDF shows that the candidate ranked first is most likely to be correct. Counter-intuitively, the next-most likely candidate is not the one ranked second but the one ranked fourth, and so on down the list. So, using similarity together with the PDF, we can pick a confidence level, e.g. 60%, and then take the top N candidates such that we are that confident the author is contained in the set, per the CDF. For a simple preponderance of the evidence, i.e. >50% certainty, the field required is 6 candidates. So the author is probably one of:
  1. Fiona Hill
  2. Andrew Wheeler
  3. Mike Pence
  4. Rick Perry
  5. Nikki Haley
  6. Jared Kushner

Conversely, the author is probably not any of the remaining candidates. But language can be deceptive: this is still essentially a coin-toss scenario with low certainty. I'd rather be at least "pretty sure". So I'll increase the confidence level to 80%, and from the CDF see that a field of 16 candidates is required. So, we can be pretty sure that the author is one of:
  1. Fiona Hill
  2. Andrew Wheeler
  3. Mike Pence
  4. Rick Perry
  5. Nikki Haley
  6. Jared Kushner
  7. Ryan Zinke
  8. John Bolton
  9. Mike Pompeo
  10. Kevin Hassett
  11. John Sullivan
  12. Linda McMahon
  13. Steven Mnuchin
  14. Betsy DeVos
  15. Jim Mattis
  16. John Kelly
And we can be pretty sure that the author is not:
  • Mick Mulvaney
  • Joseph Simons
  • Alex Azar
  • Robert Wilkie
  • Alexander Acosta
  • Ben Carson
  • Raj Shah
  • Larry Kudlow
  • Ivanka Trump
  • Gina Haspel
  • Ajit Pai
  • Jeff Sessions
  • Dan Coats
  • Elaine Chao
  • Kellyanne Conway
  • Jon Huntsman
  • Wilbur Ross
  • Melania Trump
  • Kirstjen Nielsen
  • Robert Lighthizer
  • Sonny Perdue
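
Reading a field size off the CDF, as done for the 50% and 80% levels above, amounts to finding the smallest N whose cumulative probability reaches the chosen level. A minimal sketch, assuming the CDF is available as a list where entry k-1 is the probability that the true author is in the top k. The function names and the toy numbers below are hypothetical and are not the values from my validation data.

```python
def smallest_field(cdf, level):
    """Smallest N with cdf[N-1] >= level, where cdf[k-1] = P(author in top k)."""
    for n, p in enumerate(cdf, start=1):
        if p >= level:
            return n
    raise ValueError("confidence level is never reached by this CDF")


def split_field(ranked_candidates, cdf, level):
    """Split a similarity-ranked list into (likely, unlikely) at the given level."""
    n = smallest_field(cdf, level)
    return ranked_candidates[:n], ranked_candidates[n:]


# Toy example with five made-up candidates and a made-up CDF.
toy_cdf = [0.3, 0.5, 0.7, 0.9, 1.0]
likely, unlikely = split_field(["A", "B", "C", "D", "E"], toy_cdf, 0.6)
print(likely, unlikely)
```

Note that this sketch uses "at least" the chosen level; for a strict "more than 50%" reading, the comparison would be `p > level` instead.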

Further Speculation


Setting aside linguistic similarity and looking at the "pretty sure" shortlist, suppose we eliminate all officials not involved in foreign policy, the primary policy area of the "Resistance" op-ed:
  1. Fiona Hill
  2. Andrew Wheeler (eliminated)
  3. Mike Pence
  4. Rick Perry (eliminated)
  5. Nikki Haley
  6. Jared Kushner
  7. Ryan Zinke (eliminated)
  8. John Bolton
  9. Mike Pompeo
  10. Kevin Hassett (eliminated)
  11. John Sullivan
  12. Linda McMahon (eliminated)
  13. Steven Mnuchin (eliminated)
  14. Betsy DeVos (eliminated)
  15. Jim Mattis
  16. John Kelly (eliminated)
Suppose we then eliminate all candidates who have denied writing the op-ed:
  1. Fiona Hill
  2. Mike Pence (eliminated)
  3. Nikki Haley (eliminated)
  4. Jared Kushner
  5. John Bolton (eliminated)
  6. Mike Pompeo (eliminated)
  7. John Sullivan
  8. Jim Mattis (eliminated)
That leaves just:
  1. Fiona Hill
  2. Jared Kushner
  3. John Sullivan
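
The two elimination steps are plain set filters applied to the ranked shortlist. A small sketch: the `foreign_policy` and `denied` sets below are inferred from the differences between the lists in this post, so `denied` only covers the candidates who survived the first filter and is not a complete record of who denied authorship.

```python
# The 16-candidate "pretty sure" shortlist, in similarity order.
shortlist = [
    "Fiona Hill", "Andrew Wheeler", "Mike Pence", "Rick Perry",
    "Nikki Haley", "Jared Kushner", "Ryan Zinke", "John Bolton",
    "Mike Pompeo", "Kevin Hassett", "John Sullivan", "Linda McMahon",
    "Steven Mnuchin", "Betsy DeVos", "Jim Mattis", "John Kelly",
]

# Inferred from the lists above, not an independent classification.
foreign_policy = {
    "Fiona Hill", "Mike Pence", "Nikki Haley", "Jared Kushner",
    "John Bolton", "Mike Pompeo", "John Sullivan", "Jim Mattis",
}
denied = {"Mike Pence", "Nikki Haley", "John Bolton", "Mike Pompeo", "Jim Mattis"}

# List comprehensions keep the original similarity ordering.
after_policy = [c for c in shortlist if c in foreign_policy]
remaining = [c for c in after_policy if c not in denied]

print(remaining)
```

Keeping the filters as separate sets makes it easy to swap in a different policy area or an updated list of denials and see how the final shortlist changes.
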
And that's as far as I want to take this now. I welcome further speculation and remarks in the comments below.
