On Theory, or Something Like It

April 17th, 2018

Back in 2008, Walter Mischel wrote the following for his column in the APS Observer:

Years ago, a wit (I have forgotten his name) called it the toothbrush problem: Psychologists treat other peoples’ theories like toothbrushes — no self-respecting person wants to use anyone else’s. It’s amusing, but it also points to a conflict that we may be nurturing within our profession to the detriment of our science.

It's worth going back and reading these moments of self-reflection, composed in the relative calm before prominent failures to replicate began to accumulate. Mischel is voicing a concern, but he is hardly sounding an alarm. To him, the fact that (a) every tenured faculty member has their own toothbrush, and (b) these toothbrushes appear to coexist in parallel is not a sign that the science being done is fundamentally flawed. Rather, he writes that it's a sign that the pressures of publishing and tenure are making the field, at worst, stressful and inefficient.

Now, compare Mischel's prognosis with this account of psychological science, published by Paul Meehl in 1990.

Null hypothesis testing of correlational predictions from weak substantive theories in soft psychology is subject to the influence of ten obfuscating factors whose effects are usually (1) sizeable, (2) opposed, (3) variable, and (4) unknown. The net epistemic effect of these ten obfuscating influences is that the usual research literature review is well nigh uninterpretable. Major changes in graduate education, conduct of research, and editorial policy are proposed.

It's very much worth your time to read Meehl's article in its entirety, and to do so with a self-critical frame of mind. For a long time, psychology has been a field in which "agree to disagree" has been an acceptable position for everyone to adopt, from graduate students to faculty to editors. The hard truth, however, is that a field in which everyone can agree to disagree has no substantive criterion for deciding what is and is not true. Circa 2008, the status quo was one in which most researchers were content to live in collective wonderment at the possibilities of their theories, without taking one another to task on what, exactly, those theories entailed or implied, or whether those theories were internally consistent, or whether those implications appeared to be at odds with the implications of other theories. It's not that disagreement never occurs, but that a priority has been given to peaceful cohabitation within the field, even in cases where some of us have been walking around with highly suspect toothbrushes.

Obviously, I'm speaking very generally, and the degree to which this has been true has varied from lab to lab and department to department. Nor has it always been true in psychology. Compared to the 1990s and 2000s, the 70s and 80s were bitterly contentious, thanks to the Cognitive Revolution. In some departments, half the faculty could hardly stand to talk to the other half. "You keep to your theory, I'll keep to mine" might in part be a generational reaction. So, for the last 30 years, insofar as a researcher has done "pure psychology" (i.e. without collaboration with engineers, biologists, or medical researchers), I'm comfortable saying that the way to get ahead was to become the champion of an uncontested territory, being "The Expert" in some narrowly defined domain. This conferred both the illusion of mastery (it's easy to be an "expert" on a keyword you coined that no one else is trying to shoulder in on) and the illusion of consensus (if no one is challenging you, it must mean everyone tacitly agrees).

This is a very different state of affairs from how most other natural sciences do things. In physics, the empirical implications of both General Relativity and The Standard Model are at once very specific and in excellent agreement with observed measurement. New theories that can't be squared with this agreed-upon Ground Truth can be immediately discarded. When a theorist vehemently insists, against evidence, on their radical alternative, "agree to disagree" is not on the menu. Such proposals are instead considered "pathological."

It's often said that physics is a "mature" science, relative to psychology, and thus that it is unfair to compare the two. I've never really understood this argument, as we've had well over a century to start getting precise. However, in its current form, mainstream psychology is making no effort to move in that direction. We do have facts that we collectively agree upon, but most are qualitative and nonspecific. In order to establish a ground truth, researchers will have to (a) prioritize the reliability of measurement over the inventiveness of interpretation, and (b) propose theoretical models that are both specific enough and quantitative enough to be carefully compared to those measurements in order to detect a discrepancy. "If it disagrees with experiment, it's wrong."

At present, we neither measure nor model what we study systematically. Ask yourself: When was the last time you ran a two-group experiment, and before collecting data, you made a prediction, with error bars, as to how big you thought the effect size would be? How would you go about generating such a prediction, apart from an intuitive guess? How would you go about putting error bars on that prediction? Supposing you have generated predictions in the past, have you ever taken your result, gone back to your prediction, and looked at how far off it was? Have you perhaps found that it's a different prediction from the one you remember having made? After all, what we believe about an experiment is powerfully influenced by having peeked at the data. And, most importantly, have you scored the discrepancy in terms of the SD of your prediction?
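
To be concrete about what this would involve: the arithmetic is trivial, and what's missing is the discipline of writing the prediction down before the data arrive. Here's a minimal sketch of the bookkeeping, with entirely hypothetical numbers (the predicted effect, its uncertainty, the observed result, and the sample size are all made up for illustration; the standard error uses the usual large-sample approximation for Cohen's d):

```python
# A minimal sketch of recording a quantitative prediction and scoring the
# eventual result against it. All numbers are hypothetical placeholders.

import math

# Before data collection: predict the standardized effect size (Cohen's d)
# and attach an uncertainty to that prediction.
predicted_d = 0.40      # e.g., what our model or a meta-analytic prior implies
prediction_sd = 0.15    # how uncertain we are about that prediction

# After data collection: the observed effect and its approximate standard error.
observed_d = 0.12
n_per_group = 50
observed_se = math.sqrt(2 / n_per_group + observed_d**2 / (4 * n_per_group))

# Score the discrepancy in units of the combined uncertainty:
# both the fuzziness of the prediction and the sampling error count.
combined_sd = math.sqrt(prediction_sd**2 + observed_se**2)
z = (observed_d - predicted_d) / combined_sd

print(f"observed - predicted = {observed_d - predicted_d:+.2f}")
print(f"discrepancy = {z:+.2f} SD")
```

Nothing about this is sophisticated; the point is that the prediction, its error bars, and the score all exist on paper before the data do.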

Instead, we are trained, even encouraged, to take an ad hoc approach: We run experiments that, although clever in their design, are also usually one-off inventions. When every experiment is a little different from everything that has come before, no two experiments can be directly compared. We have been telling ourselves for decades that this brings novelty and impact to the table, but I've come to think it has instead helped us to cover our asses when things don't turn out how we vaguely expected, while also making it impossible to even begin formulating a general theory. It's easy to advocate for our pet toothbrush theory if we are allowed to do so every time we can get p under 0.05, provided we can chalk any surprising discrepancies up to whatever new twist our latest experiment added.

Additionally, most of us take a shotgun approach to measurement: Making only vague and qualitative predictions, we measure as many covariates as we please, and then make a big deal of the main effects, or perhaps the interactions, or perhaps the mediators or moderators, or any (almost always unspecified) combination of the above. In my opinion, the majority of results that are presented in the literature as confirmatory were run as though they were exploratory. Indeed, until the field starts making predictions more specific than "we should see some difference, or perhaps an interaction, I'll let you know once I've looked at the data," it won't be reasonable to say that psychology knows what a "confirmatory" analysis even looks like.
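
It's worth seeing just how much mischief the shotgun buys. The simulation below is a sketch rather than a model of any particular study: both groups are drawn from the same distribution, so there is nothing to find, yet measuring ten outcomes and testing them all turns a nominal 5% false alarm rate into roughly 40%.

```python
# A small simulation of the "shotgun" problem: measure many outcomes with no
# true effect anywhere, test them all, and count how often at least one test
# comes out "significant". The numbers are purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_per_group = 30     # participants per group
n_outcomes = 10      # outcomes/covariates measured "because we can"
n_studies = 5000     # simulated studies
alpha = 0.05

false_alarms = 0
for _ in range(n_studies):
    # The null is true everywhere: both groups come from the same distribution.
    a = rng.standard_normal((n_per_group, n_outcomes))
    b = rng.standard_normal((n_per_group, n_outcomes))
    pvals = stats.ttest_ind(a, b, axis=0).pvalue
    if (pvals < alpha).any():    # "we found something!"
        false_alarms += 1

print(f"P(at least one p < {alpha} across {n_outcomes} outcomes) "
      f"= {false_alarms / n_studies:.2f}")
# Close to 1 - 0.95**10, i.e. about 0.40, versus 0.05 for one pre-specified test.
```

And that is the best case: it assumes the ten outcomes are independent and that no interactions, mediators, or subgroup splits get tested along the way.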

The field has made many mistakes (and it is not alone in doing so - science is hard), but I think the central mistake of the last generation has been the game that we have collectively agreed to play regarding what a "discovery" is. Since the cognitive revolution, the rule has been that upon obtaining a statistically significant result, we in psychology are authorized to announce that something was discovered, and that the something happens to confirm our theory of choice. If that result agrees with the intuitions of luminaries in the field, no further test has been required: That discovery may now join the ranks of the field's many toothbrush theories. This "intuition" element is crucial, and has fueled one fad in psychological research after another. We have been all too accepting of results that fit into the current zeitgeist, because that's much easier than running the experiment on our own time and dime to be sure of it. Only when a result clashes with the intuition of senior researchers do authors have to struggle to get their view accepted. Because we haven't been systematic in checking the work of our colleagues, scads of false positives have made it into print because they matched the fashion. Furthermore, many real effects have likely been badly misinterpreted by researchers who never got around to reading the Methods sections of the papers they cite. It's as though we have spent the last thirty years furiously building houses as fast as we could, only to discover today that some of those houses are unsafe to live in, and that we lack the tools to tell which is which.
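
A bit of back-of-the-envelope arithmetic, in the spirit of Ioannidis's well-known argument, shows how easily that happens. The inputs below are illustrative assumptions rather than estimates of the actual literature, but the point survives most reasonable choices: with a modest base rate of true hypotheses and typical power, a sizeable fraction of "discoveries" are noise even before any questionable practices enter the picture.

```python
# Back-of-the-envelope positive predictive value of a "discovery".
# The inputs are illustrative assumptions, not measurements of the literature.

prior_true = 0.10   # fraction of tested hypotheses that are actually true
power = 0.50        # average power of studies testing true effects
alpha = 0.05        # significance threshold

true_positives = prior_true * power          # real effects that reach p < .05
false_positives = (1 - prior_true) * alpha   # null effects that reach p < .05
ppv = true_positives / (true_positives + false_positives)

print(f"Fraction of significant results that reflect real effects: {ppv:.2f}")
# About 0.53 under these assumptions - and that is before p-hacking,
# selective reporting, or flexible analysis make matters worse.
```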

This brings us, finally, to Amy Cuddy, who had the misfortune to be winning the old game at just the wrong time. She and her collaborators "discovered" an effect (except that they probably didn't - hypothesis testing is a noisy and unreliable filter, and we mistake noise for news all the time), and because it fit like a jigsaw piece into various aspects of the current zeitgeist, she rose far too fast, with too much professional pressure, to have the time to slow down and validate her results. Unlike more senior members of the field who have come under recent scrutiny (being either tenured or retired), she lacked the clout, the analytic training, and the deep bench of former doctoral advisees needed to mount a vigorous defense (although not for lack of trying). The red flags in her original research, and the weaknesses in her latest reply, have all been discussed at length elsewhere, and need not be rehashed here.
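
That parenthetical is worth unpacking, because "noise for news" is a quantitative claim. The sketch below (hypothetical numbers once more) simulates a small but real effect studied with small samples: the only estimates that survive the p < .05 filter are the ones that sampling noise happened to inflate, so the "discovered" effect ends up several times larger than the truth. This is what Gelman and Carlin call a Type M error.

```python
# A small simulation of the significance filter: when power is low, the
# estimates that survive p < .05 are systematically exaggerated.
# All numbers here are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_d = 0.20        # a small but real standardized effect
n_per_group = 20     # a typically small sample
n_studies = 20000    # simulated replications of the same study

sig_estimates = []
for _ in range(n_studies):
    a = rng.standard_normal(n_per_group) + true_d
    b = rng.standard_normal(n_per_group)
    t, p = stats.ttest_ind(a, b)
    if p < 0.05:
        # Record the observed Cohen's d for the studies that "worked".
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        sig_estimates.append((a.mean() - b.mean()) / pooled_sd)

print(f"true effect: d = {true_d:.2f}")
print(f"power: {len(sig_estimates) / n_studies:.2f}")
print(f"average significant estimate: d = {np.mean(sig_estimates):.2f}")
# With these numbers, roughly 9% of studies reach significance, and the ones
# that do average around d = 0.7 - several times the true effect.
```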

In the end, Cuddy deserves neither the Internet frenzy that has been directed at her nor the disdain of colleagues who are clearly afraid that her notoriety will rub off on them. The fault really isn't hers, because she was doing research the way she had been trained. The field failed her, as it continues, as a community, to fail most of us. If she has made bad arguments in defense of her work, I really feel that it's because those are the defensive tactics that the field, and especially researchers in senior positions, have used for decades. She deserves the same chance as the rest of the field to come to grips with a hard truth: The very real possibility that many of the phenomena studied in the last thirty years are mirages, born as much of credulous storytelling as of imperfect analysis. It takes humility to admit that many of our toothbrushes are lousy and that they can't all be right (indeed, most must be either wrong or too vague to be useful for making specific predictions).

How we move forward as a field is unclear. The only easy answer concerns the (hopefully rare) cases of outright fraud: when it is detected, swift exile is the obvious course of action. We can't expel the Hausers and the Stapels of the field fast enough. At the same time, a shift toward making our data public will help smoke out those who remain in hiding. However, fraud isn't what got us here. What ails the field are misunderstandings about how to agree on what "real" effects are, both in terms of how we measure and how we predict measurements. Progress requires that we do both well, but the field presently hardly knows how to do either at all.

Meanwhile, it is likely the senior researchers who have been most guilty of overconfident interpretation, inadvertent p-hacking, and protean unfalsifiability. It appears likely that our departments are overrun with Wansinks who are oblivious to their own bad practices and who greet accusation with astonishment. Many are safe behind the wall of tenure. Because this is a generational problem that afflicts an entire cohort, name-and-shame will not be an effective strategy. An entire generation of psychologists is currently weighing the implications of the replication crisis, and many (although certainly not all) will sooner cling to the familiar way of doing things than admit that much of their past work needs to be called into question. Furthermore, those safe at the top of the tower retain an outsized ability to decide who gets hired, who gets tenure, and what curriculum graduate students will be required to complete. For now, on the wagon of someone's academic career, the squeaky wheel gets the ax rather than the grease.

Another important step will be to rely less on the glib just-so stories that have a way of creeping into our theorizing. Most psychologists know who Kitty Genovese was, but few realize that the story they were taught about her murder is largely false. A shockingly wide range of people believe that women who live together develop synchronized menstrual cycles, even though the evidence suggests they don't. Going forward, serious-minded researchers should approach these bits of "psychology trivia" the way they would a random assertion made in an Internet comment section: With reflexive skepticism.

Our best hope, I think, is for younger researchers who have begun to understand the field's fundamental problems to actively advocate for a better way of doing things. Rather than heaping our own share of scorn on the Internet's latest Two Minutes Hate, we should sit down with our senior collaborators and try to persuade them that there is a better way. A way that yields more durable results and lets us tell fact from fantasy. A way that promises, through its pursuit of accurate prediction, to have greater translational application in medicine, public health, criminal justice reform, and policy making. Yes, doing things differently will mean both relearning and unlearning. Yes, this will mean that some of our past results, although dearly held, may evaporate under deeper scrutiny. Nevertheless, in the face of mounting challenges on both theoretical and technical grounds, we can no longer be a field that refuses to admit error. That ship has sailed, and isn't coming back.

I hope this doesn't come across too harshly. I'm not on a very high horse here; I didn't see the replication crisis coming either, and was happy to do science the way I was trained until the rabble-rousers started asking the hard questions (like "How did Bem get such an obviously absurd result if he followed the proper procedure?"). I have the benefit of hindsight, but the good news is that so do all the rest of us, provided we look back and reflect on how we got here.

I'm trying to take the approach I've described, to the extent that I am able, and it's hard. Building models that make specific predictions has forced me to learn new skills that graduate school offered no training in, and that none of my mentors were able to help with beyond their patience and encouragement while I taught myself (which was invaluable, and for which I remain very grateful). Meanwhile, relating my experimental results to those predictions involves a level of minutiae that has been a tough sell with reviewers and, at times, with my collaborators. I know I'm not advocating an easy path, but I'm hopeful that it's a better one.