A psychologist's thoughts on how and why we play games

Sunday, May 14, 2017

Curiously Strong effects

The reliability of scientific knowledge can be threatened by a number of bad behaviors. The problems of p-hacking and publication bias are now well understood, but there is a third problem that has received relatively little attention. This third problem currently cannot be detected through any statistical test, and its effects on theory may be stronger than those of p-hacking.

I call this problem curiously strong effects.

The Problem of Curiously Strong

Has this ever happened to you? You come across a paper with a preposterous-sounding hypothesis and a method that sounds like it would produce only the tiniest change, if any. You skim down to the results, expecting to see a bunch of barely-significant results. But instead of p = .04, d = 0.46 [0.01, 0.91], you see p < .001, d = 2.35 [1.90, 2.80]. This unlikely effect is apparently not only real, but it is four or five times stronger than most effects in psychology, and it has a p-value that borders on impregnable. It is curiously strong.

The result is so curiously strong that it is hard to believe that the effect is actually that big. In these cases, if you are feeling uncharitable, you may begin to wonder if there hasn't been some mistake in the data analysis. Worse, you might suspect that perhaps the data have been tampered with or falsified.

Spuriously strong results can have lasting effects on future research. Naive researchers are likely to accept the results at face value, cite them uncritically, and attempt to expand upon them. Less naive researchers may still be reassured by the highly significant p-values and cite the work uncritically. Curiously strong results can enter meta-analyses, heavily influencing the mean effect size, Type I error rate, and any adjustments for publication bias.

Curiously strong results might, in this way, be more harmful than p-hacked results. With p-hacking, the results are often just barely significant, yielding the smallest effect size that is still statistically significant. Curiously strong results are much larger and have greater leverage on meta-analysis, especially when they have large sample sizes. Curiously strong results are also harder to detect and criticize: We can recognize p-hacking, and we can address it by asking authors to provide all their conditions, manipulations, and outcomes. We don't have such a contingency plan for curiously strong results.
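To see that leverage concretely, here is a minimal sketch in Python (made-up effect sizes and variances, assuming a simple fixed-effect model) of how a single curiously strong result can drag the pooled estimate:

```python
# Inverse-variance fixed-effect meta-analysis with made-up data.
# Each study contributes weight 1/v, where v is its sampling variance.

def fixed_effect_mean(effects, variances):
    """Inverse-variance weighted mean effect size."""
    weights = [1.0 / v for v in variances]
    return sum(w * d for w, d in zip(weights, effects)) / sum(weights)

# Nine ordinary studies clustered around d = 0.2...
effects = [0.1, 0.2, 0.15, 0.25, 0.3, 0.2, 0.1, 0.25, 0.2]
variances = [0.04] * 9
print(fixed_effect_mean(effects, variances))       # about 0.19

# ...plus one curiously strong result with a large sample (small variance).
effects_cs = effects + [2.35]
variances_cs = variances + [0.01]
print(fixed_effect_mean(effects_cs, variances_cs))  # jumps to about 0.86
```

With these toy numbers, one well-powered d = 2.35 study moves the pooled estimate from about 0.19 to about 0.86 -- far more influence than any barely-significant p-hacked study could exert.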

What should be done?

My question to the community is this: What can or should be done about such implausible, curiously strong results?

This is complicated, because there are a number of viable responses and explanations for such results:

1) The effect really is that big.
2) Okay, maybe the effect is overestimated because of demand effects. But the effect is probably still real, so there's no reason to correct or retract the report.
3) Here are the data, which show that the effect is this big. You're not insinuating somebody made the data up, are you?

In general, there's no clear policy on how to handle curiously strong effects, which leaves the field poorly equipped to deal with them. Peer reviewers know to raise objections when they see p = .034, p = .048, p = .041. They don't know to raise objections when they see d = 2.1 or r = 0.83 or η2 = .88.

Nor is it clear that curiously strong effects should be a concern in peer review. One could imagine the problems that ensue when one starts rejecting papers or flinging accusations because the effects seem too large. Our minds and our journals should be open to the possibility of large effects.

The only solution I can see, barring some corroborating evidence that leads to retraction, is to try to replicate the curiously strong effect. Unfortunately, that costs time and money, especially considering that replications are often expected to collect substantially more data than the original studies. Even after a failed replication, one has to spend another three to five years arguing about why the effect was found in the original study but not in the replication. ("It's not like we p-hacked this initial result -- look at how good the p-value is!")

It would be nice if the whole mess could be nipped in the bud. But I'm not sure how it can.

A future without the curiously strong?

This may be naive of me, but it seems that in other sciences it is easier to criticize curiously strong effects, because the prior expectations on effects are more precise.

In physics, theory and measurement are well-developed enough that it is a relatively simple matter to say "You did not observe the speed of light to be 10 mph." But in psychology, one can still insist with a straight face that (to make up an example) subliminal luck priming led to a 2-standard-deviation improvement in health.

In the future, we may be able to approach this enviable state of physics. Richard, Bond Jr., and Stokes-Zoota (2003) gathered up 322 meta-analyses and concluded that the modal effect size in social psych is r = .21, approximately d = 0.42. (Note that even this is probably an overestimate considering publication bias.) Simmons, Nelson, and Simonsohn (2013) collected data on obvious-sounding effects to provide benchmark effect sizes. Together, these reports show that an effect of d > 2 is several times stronger than most effects in social psychology and stronger even than obvious effects like "men are taller than women (d = 1.85)" or "liberals see social equality as more important than conservatives (d = 0.69)".
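The r-to-d conversion above uses the standard formula for two equal-sized groups; a quick sketch:

```python
import math

def r_to_d(r):
    # Standard conversion assuming two equal-sized groups: d = 2r / sqrt(1 - r^2)
    return 2 * r / math.sqrt(1 - r ** 2)

print(round(r_to_d(0.21), 2))   # → 0.43, in line with the d ≈ 0.42 quoted above
```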

By using our prior knowledge to describe what is within the bounds of psychological science, we could tell what effects need scrutiny. Even then, one is likely to need corroborating evidence to garner a correction, expression of concern, or retraction, and such evidence may be hard to find.

In the meantime, I don't know what to do when I see d = 2.50 other than to groan. Is there something that should be done about curiously strong effects, or is this just another way for me to indulge my motivated reasoning?

Wednesday, March 22, 2017

Comment on Data Colada [58]: Funnel plots, done correctly, are extremely useful

In DataColada [58], Simonsohn argues that funnel plots are not useful. The argument is, for true effect size δ and sample size n:
  • Funnel plots are based on the assumption that r(δ, n) = 0.
  • Under some potentially common circumstances, r(δ, n) ≠ 0.
  • When r(δ, n) ≠ 0, there is the risk of mistaking benign funnel plot asymmetry (small-study effects) for publication bias.

I do not think that any of this is controversial. It is always challenging to determine how to interpret small-study effects. They can be caused by publication bias, or they can be caused by, as Simonsohn argues, researchers planning their sample sizes in anticipation of some large and some small true effects.
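Simonsohn's scenario is easy to simulate. In this sketch (Python rather than R, with made-up paradigms and hypothetical power analyses), every single study is published, yet observed effect size still correlates with sample size because researchers planned n around the effects they anticipated:

```python
import math
import random

random.seed(1)

# Two hypothetical literatures, with no publication bias anywhere:
# researchers correctly anticipate a big effect in paradigm A and a small
# effect in paradigm B, and power their studies accordingly.
paradigms = [(0.8, 25), (0.2, 350)]   # (true delta, per-group n from a power analysis)

studies = []
for delta, n in paradigms:
    for _ in range(50):
        se = math.sqrt(2 / n)            # approximate SE of d for two groups of n
        d_obs = random.gauss(delta, se)  # sampling error; every study is "published"
        studies.append((n, d_obs))

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

ns = [s[0] for s in studies]
ds = [s[1] for s in studies]
print(pearson(ns, ds))   # strongly negative: a "small-study effect" with no bias at all
```

The resulting funnel plot would look asymmetric even though nothing was suppressed -- exactly the interpretive risk at issue.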

There is a simple solution to this that preserves the validity and utility of funnel plots. If your research literature is expected to contain some large and some small effects, and these are reflected by clear differences in experimental methodology and/or subject population, then analyze those methods and populations separately.

For this post, I will call this making homogeneous subgroups. 

Once you have made homogeneous subgroups, r(δ, n) = 0 is not a crazy assumption at all. Indeed, it can be a more sensible assumption than r(δ, δguess) = .6.

Making homogeneous subgroups

Suppose we are interested in the efficacy of a new psychotherapeutic technique for depression and wish to meta-analyze the available literature. 

It would be silly to combine studies looking at the efficacy of this technique for reducing depression and improving IQ and reducing aggression and reducing racial bias and losing weight. These are all different effects and different hypotheses. It would be much more informative to test each of these separately.

In keeping with the longest-running cliche in meta-analysis, here's an "apples to oranges" metaphor.

For example, when we investigated the funnel plots from Anderson et al.'s (2010) meta-analysis of violent video game effects, we preserved the original authors' decision to separate studies by design (experiment, cross-section, longitudinal) and by classes of outcome (behavior, cognition, affect). When Carter & McCullough (2014) inspected the effects of ego depletion, they separated their analysis by classes of outcome.

In short, combine studies of similar methods and similar outcomes. Studies of dissimilar methods and dissimilar outcomes should probably be analyzed separately.

The bilingual advantage example

I think the deBruin, Treccani, and Della Sala (2014) paper that serves as the post's motivating example is a little too laissez-faire about combining dissimilar studies. The hypothesis "bilingualism is good for you" seems much too broad, encompassing far too many heterogeneous studies.

Simonsohn's criticism here has less to do with a fatal flaw in funnel plots and more to do with a suboptimal application of the technique. Let's talk about why this is suboptimal and how it could have been improved.

To ask whether bilingualism improves working memory among young adults is one question. To ask whether bilingualism delays the onset of Alzheimer's disease is another. To combine the two is of questionable value. 

It would be more informative to restrict the analysis to a more limited, homogeneous hypothesis such as "bilingualism improves working memory". Even after that, it might be useful to explore different working memory tasks separately.

When r(δ, n) = 0 is reasonable

Once you have parsed the studies out into homogeneous subsamples, the assumption that r(δ, n) = 0 becomes quite reasonable. This is because:
  • Choosing homogeneous studies minimizes the variance in delta across studies.
  • Given homogeneous methods, outcomes, and populations, researchers cannot plan for variance in delta.
Let's look at each in turn.

Minimizing variance in delta

Our concern is that the true effect size δ varies from study to study -- sometimes it is large, and sometimes it is small. This variance may covary with study design and with sample size, leading to a small-study effect. Because study design is confounded with sample size, there is a risk of mistaking this for publication bias.

Partitioning into homogeneous subsets addresses this concern. As methods and populations become more similar, we reduce the variance in delta. As we reduce the variance in delta, we restrict its range, and correlations between delta and confounds will shrink, leading us towards the desirable case that r(δ, n) = 0.
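A toy illustration with made-up numbers: pooled across two paradigms, δ and n are almost perfectly (negatively) correlated, but within either homogeneous paradigm the correlation vanishes, because every lab ran the same power analysis and n does not vary at all:

```python
import math

# Hypothetical literature: two homogeneous paradigms, each powered as a block.
# Within a paradigm, true effects barely vary and everyone used the same
# power analysis, so sample size carries no information about delta.
paradigm_A = [(0.78, 20), (0.80, 20), (0.82, 20), (0.80, 20)]      # (delta, n)
paradigm_B = [(0.18, 120), (0.20, 120), (0.22, 120), (0.20, 120)]

def pearson(pairs):
    """Pearson r between delta and n; 0.0 if either column is constant."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

print(pearson(paradigm_A + paradigm_B))  # pooled: r(delta, n) near -1
print(pearson(paradigm_A))               # within-paradigm: n never varies, r = 0
```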

Researchers cannot plan for the true effect size within homogeneous subgroups

Simonsohn assumes that researchers have some intuition for the true effect size -- that they are able to guess it with some accuracy such that r(δ, δguess) = .6.

(True and guessed effect sizes, from Data Colada [58].) r = .6 is a pretty strong estimate of researcher intuition, although Simonsohn's concern still applies (albeit less so) at lower levels of intuition.

This may be a reasonable assumption when we are considering a wide array of heterogeneous studies. I can guess that the Stroop effect is large, that the contrast mapping effect is medium in size, and that the effect of elderly primes is zero.

However, once we have made homogeneous subsamples, this assumption becomes much less tenable. Can we predict when and for whom the Stroop effect is larger or smaller? Do we know under which conditions the effect of elderly primes is nonzero?

Indeed, you are probably performing a meta-analysis exactly because researchers have poor intuition for the true effect size. You want to know whether the effect is δ = 0, 0.5, or 1. You are performing moderator analyses to see if you can learn what makes the effect larger or smaller. 

Presuming you are the first to do this, it is unclear how researchers could have powered their studies accordingly. Within this homogeneous subset, nobody can predict when the effect should be large or small. To produce this correlation between sample size and effect size, researchers would need access to knowledge that does not yet exist.

Once you have made a homogeneous subgroup, r(δ, n) = 0 can be a more reasonable assumption than r(δ, δguess) = .6.

Meta-regression is just regression

Meta-analysis seems intimidating, but the funnel plot is just a regression equation. Confounds are a hazard in regression, but we still use regression because we can mitigate the hazard and the resulting information is often useful. The same is true of meta-regression.  
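To make that concrete: the Egger-style test behind the funnel plot is ordinary regression of effect sizes on their standard errors. (The published version uses weighted regression; this unweighted sketch with made-up numbers is only for illustration.)

```python
# Egger-type meta-regression: effect size ~ intercept + slope * SE.
# A nonzero slope is a small-study effect; the intercept estimates the
# effect size of a hypothetical infinitely large study.

def ols(xs, ys):
    """Ordinary least squares; returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

ses = [0.10, 0.15, 0.20, 0.25, 0.30]   # standard errors (made up)
ds = [0.2 + 1.5 * se for se in ses]    # effects constructed to rise with SE

intercept, slope = ols(ses, ds)
print(intercept, slope)   # recovers 0.2 and 1.5, up to floating point
```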

Because this is regression, all the old strategies apply. Can you find a third variable that explains the relationship between sample size and effect size? Moderator analyses and inspection of the funnel plots can help to look for, and test, such potential confounds.

I think that Simonsohn does not see this presented often in papers, and so he is under the impression that this sort of quality check is uncommon. In my experience, though, reviewers are careful about this: mine certainly asked me to rule out confounds in my own funnel plot analysis.

That said, it's definitely possible that these steps don't make it to the published literature: perhaps they are performed internally, or shared with just the peer reviewers, or maybe studies where the funnel plot contains confounds are not interesting enough to publish. Maybe greater attention can be paid to this in our popular discourse.


Into every life some heterogeneity must fall. There is the risk that, even after these efforts, there is some confound that you mistake for publication bias. That's regression for you.

There is also the risk that, if you get carried away chasing perfectly homogeneous subgroups, you may find yourself conducting a billion analyses of only one or two studies each. That is not helpful either, for obvious reasons.

Simonsohn is concerned that we can never truly reach such homogeneity that r(δ, n) = 0 is true. This seems possible, but it is hard to say without access to 1) the true effect sizes and 2) the actual power analyses of researchers. I think that we can at least reach the point at which we have exhausted researchers' ability to plan for larger vs. smaller effects.


The funnel plot represents the relationship between effect size δ and the sample size n. These may be correlated because of publication bias, or they may be correlated because of genuine differences in δ that have been planned for in power analysis. By conditioning your analysis on homogeneous subsets, you reduce variance in δ and the potential influence of power analysis.

My favorite video game is The Legend of Zelda: Plot of the Funnel

Within homogeneous subsets, researchers do not know when the effect is larger vs. smaller, and so cannot plan their sample sizes accordingly. Under these conditions, the assumption that r(δ, n) = 0 can be quite reasonable, and perhaps more reasonable than the assumption that r(δ, δguess) = .6.

Applied judiciously, funnel plots can be valid, informative, expressive, and useful. They encourage attention to effect size, reveal outliers, and demonstrate small-study effects that can often be attributed to publication bias.



I also disagree with Simonsohn that "It should be considered malpractice to publish papers with PET-PEESE." Simonsohn is generally soft-spoken, so I was a bit surprised to see such a stern admonishment.

PET and PEESE are definitely imperfect, and their weaknesses are well-documented: PET is biased downwards when δ ≠ 0, and PEESE is biased upwards when δ = 0. This sucks if you want to know whether δ = 0.

Still, I think PEESE has some promise: assuming there is an effect, how big is it likely to be? Yes, these methods depend heavily on the funnel plot, assuming that any small-study effect is attributable to publication bias, but again, this can be a reasonable assumption under the right conditions. Some simulations I'm working on with Felix Schonbrodt, Evan Carter, and Will Gervais indicate that it's at least no worse than trim-and-fill (low bar, I know).

Of course, no one technique is perfect. I would recommend using these methods in concert with other analyses such as the Hedges & Vevea 3-parameter selection model or, sure, p-curve or p-uniform.

Monday, February 27, 2017

Publication bias can hide your moderators

It is a common goal of meta-analysis to provide not only an overall average effect size, but also to test for moderators that cause the effect size to become larger or smaller. For example, researchers who study the effects of violent media would like to know who is most at risk for adverse effects. Researchers who study psychotherapy would like to recommend a particular therapy as being most helpful.

However, meta-analysis does not often generate these insights. For example, research has not found that violent-media effects are larger for children than for adults (Anderson et al. 2010). Similarly, it is often reported that all therapies are roughly equally effective (the "dodo bird verdict," Luborsky, Singer, & Luborsky, 1975; Wampold et al., 1997).

"Everybody has won, and all must have prizes. At least, that's what it looks like if you only look at what got published."

It seems to me that publication bias may obscure such patterns of moderation. Publication bias introduces a “small-study effect” in which the observed effect size is highly dependent on the sample size. Large-sample studies can reach statistical significance with smaller effect sizes. Small-sample studies can only reach statistical significance by reporting enormous effect sizes. The observed effect sizes gathered in meta-analysis, therefore, may be more a function of the sample size than they are a function of theoretically-important moderators such as age group or treatment type.
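The arithmetic behind that small-study effect can be sketched under the usual large-sample approximation SE(d) ≈ sqrt(2/n) for two groups of n each:

```python
import math

def smallest_significant_d(n_per_group):
    # Approximate SE of d for two equal groups; the smallest d that reaches
    # p < .05 two-tailed sits 1.96 standard errors from zero.
    se = math.sqrt(2 / n_per_group)
    return 1.96 * se

for n in [20, 50, 100, 400]:
    print(n, round(smallest_significant_d(n), 2))
```

With 20 per group, nothing smaller than d ≈ 0.62 can reach significance; with 400 per group, d ≈ 0.14 suffices. Under publication bias, sample size dictates what effect sizes survive to be published.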

In this simulation, I compare the statistical power of meta-analysis to detect moderators when there is, or when there is not, publication bias.


Simulations covered four scenarios in a 2 (effect size: large or medium) × 2 (publication bias: absent or present) design.

When effect sizes were large, the true effects were δ = 0 in the first population, δ = 0.3 in the second population, and δ = 0.6 in the third population. When effect sizes were medium, the true effects were δ = 0 in the first population, δ = 0.2 in the second population, and δ = 0.4 in the third population. Thus, each scenario represents one group with no effect, a group with a medium-small effect, and a group with an effect twice as large.

When studies were simulated without publication bias, twenty studies were conducted on each population, and all were reported. When studies were simulated with publication bias, each simulated study was either published or file-drawered such that at least 70% of the published effects were statistically significant, and file-drawered studies were replaced with further simulated studies until 20 published results were obtained. This keeps the number of studies k constant at 20, which prevents confounding the influence of publication bias with the influence of having fewer observed studies.

For each condition, I report the observed effect size for each group, the statistical power of the test for moderators, and the statistical power of the Egger test for publication bias. I simulated 500 meta-analyses within each condition in order to obtain stable estimates.
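The qualitative mechanism can be sketched in a few lines. (This is a deliberate simplification of the simulation above: each study is reduced to one observed d drawn from Normal(δ, SE), and the censoring rule keeps only results significant in the predicted direction.)

```python
import math
import random

random.seed(42)

def simulate_mean_d(delta, n_per_group, n_studies, pub_bias):
    """Mean observed d across published studies, with or without censoring."""
    se = math.sqrt(2 / n_per_group)   # large-sample SE of d for two equal groups
    crit = 1.96 * se                  # smallest d significant at p < .05 two-tailed
    published = []
    while len(published) < n_studies:
        d = random.gauss(delta, se)
        if pub_bias and d < crit:
            continue                  # file-drawer the nonsignificant study
        published.append(d)
    return sum(published) / len(published)

# With delta = 0, honest reporting averages near zero...
print(simulate_mean_d(0.0, 40, 2000, pub_bias=False))
# ...but censoring on significance manufactures a medium-to-large effect.
print(simulate_mean_d(0.0, 40, 2000, pub_bias=True))
```

For the δ = 0 group, censoring on significance pushes the mean observed effect to roughly d = 0.5, which is why the null group looks significant and the moderator contrast shrinks.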


Large effects.

Without publication bias: 
  • In 100% of the metas, the difference between δ = 0 and δ = 0.6 was detected.
  • In 92% of the metas, the difference between δ = 0 and δ = 0.3 was detected. 
  • In only 4.2% of cases was the δ = 0 group mistaken as having a significant effect.
  • Effect sizes within each group were accurately estimated (in the long run) as δ = 0, 0.3, and 0.6.

With publication bias: 
  • Only 15% of the metas were able to tell the difference between δ = 0 and δ = 0.3.
  • 91% of meta-analyses were able to tell the difference between δ = 0 and δ = 0.6. 
  • 100% of the metas mistook the δ = 0 group as having a significant effect.  
  • Effect sizes within each group were overestimated: d = .45, .58, and .73 instead of 0, 0.3, and 0.6.  
Here's a plot of the moderator parameters across the 500 simulations without bias (bottom) and with bias (top).
Moderator values are dramatically underestimated in the context of publication bias.

Medium effects.  

Without publication bias:
  • 99% of metas detected the difference between δ = 0 and δ = 0.4. 
  • 60% of metas detected the difference between δ = 0 and δ = 0.2. 
  • The Type I error rate in the δ = 0 group was 5.6%. 
  • In the long run, effect sizes within each group were accurately recovered as d = 0, 0.2, and 0.4.

With publication bias:
  • Only 35% were able to detect the difference between δ = 0 and δ = 0.4.
  • Only 2.2% of the meta-analyses were able to detect the difference between δ = 0 and δ = 0.2.
  • 100% of meta-analyses mistook the δ = 0 group as reflecting a significant effect. 
  • Effect sizes within each group were overestimated: d = .46, .53, and .62 instead of δ = 0, 0.2, and 0.4.
Here's a plot of the moderator parameters across the 500 simulations without bias (bottom) and with bias (top).
Again, pub bias causes parameter estimates of the moderator to be biased downwards.


Publication bias can hurt statistical power for your moderators.  Obvious differences such as that between d = 0 and d = 0.6 may retain decent power, but power will fall dramatically for more modest differences such as that between d = 0 and d = 0.4. Meta-regression may be stymied by publication bias.

Monday, February 13, 2017

Why retractions are so slow

A few months ago, I had the opportunity to attend a symposium on research integrity. The timing was interesting because, on the same day, Retraction Watch ran a story on two retractions in my research area, the effects of violent media. Although one of these retractions had been quite swift, the other retraction had been three years in coming, which was a major source of heartache and frustration among all parties involved.

Insofar as some of us are concerned about the possible role of fraud as a contaminating influence in the scientific literature, I thought it might be helpful to share what I learned at the symposium about the multiple steps and stakeholders in a retraction, which may partly explain common frustrations about the opacity and slowness of the process.

The Process

On paper, the process for handling concerns about a paper looks something like this:
  1. Somebody points out the concerns about the legitimacy of an article.
  2. The journal posts an expression of concern, summarizing the issues with the article.
  3. If misconduct is suspected, the university investigates for possible malfeasance.
  4. If malfeasance is discovered, the article is retracted.
We can see that an expression of concern can be posted quickly, whereas a retraction can take years of investigation. Because investigations cannot be rushed, scientific self-correction can be expected to be slow. The exception is that, when the authors voluntarily withdraw an article in response to concerns, a retraction no longer requires an investigation.

Multiple stakeholders in investigations

Regarding investigations, it is not always clear what is being done or how seriously concerns are being addressed. In the Retraction Watch story at the top of the article, the complainants spent about three years waiting for action on a data set with signs of tampering.

From the perspective of a scientist, one might wish for a system of retractions that acts swiftly and transparently. Through swiftness, the influence of fraudulent papers might be minimized, and through transparency, one might be apprised of the status of each concern.

These goals, however, must be balanced against the rights of the accused, who must be considered innocent until found guilty. Because an ongoing investigation can harm one's reputation and career, oversight committees will not comment on the status, or even the existence, of an investigation.

Even when the accused is indeed guilty, they may recruit lawyers to apply legal pressure to universities, journals, or whistleblowers to avoid the career damage of a retraction. This can further complicate and frustrate scientific self-correction.

Should internal investigation really be necessary?

From a researcher's perspective, it's a shame that retraction seems to require a misconduct investigation. Such investigations are time-consuming. It is also difficult to prove intent absent some confession -- this may be why Diederik Stapel has 58 retractions, but only three of eight suspicious Jens Forster papers have been retracted.

Additionally, I'm not sure that a misconduct investigation is strictly necessary to find a paper worthy of retraction. When a paper's conclusions do not follow from the data, or the data are clearly mistaken, a speedy retraction would be nice.

Sometimes we are fortunate enough to see papers voluntarily withdrawn without a full-fledged investigation. Often this is possible only when there is some escape valve for blame: There is some honest mistake that can be offered up, or some collaborator can be offered as blameworthy. For example, this retraction could be lodged quickly because the data manipulation was performed by an unnamed graduate student. Imagine a different case where the PI was at fault -- it would have required years of investigation.


Whistleblowers are often upset that clearly suspicious papers are sometimes labeled only with an expression of concern. These frustrations are exacerbated by the opacity of investigations, in that it is often unclear whether there is an investigation at all, much less what progress has been made in the investigation.

Personally, I hope that journals will make effective use of expressions of concern as appropriate. I also appreciate the efforts of honest authors to voluntarily withdraw papers, as this allows for much faster self-correction than would be possible if a university investigation were necessary.

Unfortunately, detection of malfeasance will remain time-consuming and imperfect. Retraction is quick only when authors are either (1) honest and cooperative, issuing a voluntary withdrawal or (2) dishonest but with a guilty conscience, confessing quickly under scrutiny. However, science still has few tools against sophisticated and tenacious frauds with hefty legal war chests.

Sunday, October 23, 2016

Outrageous Fortune: 2. The variability of rare drops

Growing up, I played a lot of role-playing games for the Super Nintendo. One trope of late-game design in role-playing games is the rare drop -- a highly desirable item that has a low probability of appearing after a battle. These items are generally included as a way to let players kill an awful lot of time as they roll the dice again and again trying to get the desired item.

For example, in Final Fantasy 4, there is an item called the "pink tail" that grants you the best armor in the game. It has a 1/64 chance of dropping when you fight a particular monster. In Earthbound, the "Sword of Kings" has a 1/128 chance of dropping when you fight a particular monster. In Pokemon, there are "shiny" versions of normal Pokemon that have a very small chance of appearing (in the newest games, the chance is something like 1/4096).

Watching people try to get these items reveals an interesting misunderstanding about how probability works. Intuitively, it makes sense that if the item has a 1/64 chance of dropping, then by the time you've fought the monster 64 times, you should have a pretty good chance of having the item.

Although it's true that the average number of required pulls is 64, there's still a substantial role of chance. This is a recurring theme in gaming and probability -- yes, we know what the average experience is, but the amount of variability around that can be quite large. (See also my old post on how often a +5% chance to hit actually converts into more hits.)

It turns out that if your desired item has a drop rate of 1/64, then after 64 pulls, there's only a 63.5% chance that you have the item.
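The chance of at least one drop in x attempts is 1 − (1 − p)^x, by the complement rule. A quick check in Python:

```python
def p_at_least_one(drop_rate, attempts):
    # P(at least one success) = 1 - P(every attempt fails)
    return 1 - (1 - drop_rate) ** attempts

print(round(p_at_least_one(1/64, 64), 3))     # → 0.635
print(round(p_at_least_one(1/128, 128), 3))   # → 0.634
```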

To understand the variability around the drop chance, we have to use the negative binomial distribution. The binomial distribution takes in a number of attempts and a probability of success to tell us how many successes to expect. The negative binomial inverts this: it takes in the number of desired successes and the probability of success to tell us how many attempts we'll need to make. (Strictly, R's parameterization counts the number of failures before the desired number of successes.)

In R, we can model this with the function pnbinom(), which gives us the cumulative density function. This tells us what proportion of players will get a success by X number of pulls.
Let's look at some examples.

Earthbound: The Sword of Kings

The Sword of Kings has a drop rate of 1/128 when the player fights a Starman Super. In a normal game, a player probably fights five or ten Starman Supers. How many players can be expected to find a Sword of Kings in the course of normal play? Some players will decide they want a Sword of Kings and will hang out in the dungeon fighting Starman Supers ("Starmen Super"?) until they find one. How long will most players have to grind to get this item?

By the time the player finds a Sword of Kings, the party is probably so overleveled from killing dozens of Starman Supers that they don't really need the Sword anyway. (From http://starmendotnet.tumblr.com/post/96897409329/sijbrenschenkels-finally-found-sword-of)

We use pnbinom() to get the cumulative probabilities. We'll use the dplyr package too because I like piping and being able to use filter() later for a nice table.

library(dplyr)

# pnbinom() counts failures before the first success, so x - 1 failures
# means the drop arrived within x fights.
eb <- data.frame(x = 1:800) %>% 
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/128))

filter(eb, x %in% c(1, 10, 50, 100, 128, 200, 256, 400))

with(eb, plot(x, p, type = 'l',
              xlab = "Starman Supers defeated",
              ylab = "Probability of at least one drop",
              main = "Grinding for a Sword of Kings"))

A lucky 8% of all players will get the item in their first ten fights, a number that might be found in the course of normal play. 21% of players still won't have gotten one after two hundred combats, and 4% of players won't have gotten the Sword of Kings even after fighting four hundred Starman Supers!

Final Fantasy IV: The Pink Tail

In the last dungeon of Final Fantasy IV, when you encounter monsters in a particular room, there is a 1/64 chance that you will find the "Pink Puff" monster. Every Pink Puff you kill has a 1/64 chance of dropping a pink tail.

ff4 <- data.frame(x = 1:400) %>%
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/64))  # drop within x tries

filter(ff4, x %in% c(1, 5, 10, 50, 64, 100, 200))

Just to find the Pink Puff monster is a major endeavor. Only a lucky 1 in 64 players will find a Pink Puff on their very first combat, and about 15% of players will run into one in the course of normal play (10 combats). But about 21% of players won't have found one even after a hundred combats, and 4% of players won't have found a Pink Puff even after two hundred combats.

Finding a pack of Pink Puffs is a 1/64 chance, and that's just the start of it.

After you find and kill the Pink Puffs, each one still has to drop the pink tail, a 1/64 chance per Pink Puff. So about 21% of players won't find a pink tail even after killing a hundred Pink Puffs. Consider next that one finds, on average, one group of Pink Puffs per 64 combats, and Pink Puffs come in groups of five. You could run through more than a thousand fights in order to find 100 Pink Puffs and still not get a pink tail. Ridiculous!
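Putting the Pink Puff numbers together in a back-of-envelope sketch (1/64 encounter rate, groups of five, 1/64 tail chance per Puff, as above):

```python
# Back-of-envelope math for the pink tail grind.
encounter_rate = 1 / 64     # chance a combat is a Pink Puff group
puffs_per_group = 5
tail_rate = 1 / 64          # tail chance per Pink Puff killed

# Average combats needed to kill 100 Pink Puffs:
combats_for_100_puffs = (100 / puffs_per_group) / encounter_rate
print(combats_for_100_puffs)   # → 1280.0

# ...and even then, the chance of still having no tail:
p_no_tail = (1 - tail_rate) ** 100
print(round(p_no_tail, 2))     # → 0.21
```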

Here's a guy on the IGN forums saying "I've been trying for a week and I haven't gotten one."

Shiny Pokemon

A "shiny" pokemon is a rare, recolored version of an ordinary pokemon. In the newest pokemon game, any wild pokemon has a 1/4096 chance of being shiny.

This is so rare that we'll put things into log scale so that we're calling pnbinom() 100 times rather than 22000 times.

pkmn <- data.frame(x = seq(.1, 10, by = .1)) %>% 
  mutate(p = pnbinom(exp(x), size = 1, prob = 1/4096))
with(pkmn, plot(exp(x), p, type = 'l'))

filter(pkmn, exp(x) <= 500) %>% tail(1)
filter(pkmn, exp(x) <= 2000) %>% tail(1)
filter(pkmn, exp(x) <= 10000) %>% tail(1)

11% of players will find one shiny pokemon within 500 encounters. About 39% will find one within 2000 encounters. 9% of players will grind through ten thousand encounters and still not find one.

There's a video on YouTube of a kid playing three or four Game Boys at once until he finds a shiny pokemon after about 26,000 encounters. (I think this was in one of the earlier pokemon games, where the shiny rate was about 1/8000.) There seems to be a whole genre of YouTube streamers showing off shiny pokemon that they spent thousands of encounters hunting.
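That 26,000-encounter slog is long, but not freakishly unlucky. Taking the often-cited 1/8192 shiny rate for the older games, a quick base-R check:

```r
# Chance of still having no shiny after 26,000 encounters,
# at the older games' 1/8192 shiny rate
p_shiny <- 1/8192
(1 - p_shiny)^26000  # ~0.04: about 4% of players would still be shinyless
```

So roughly one hunter in twenty-five would grind that long and still have nothing to show for it.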

World of Warcraft

I don't know anything about World of Warcraft, but googling for its idea of a rare drop turns up a sword with a 1 in 1,500 chance of dropping as a quest reward.

wow <- data.frame(x = 1:5e3) %>% 
  mutate(p = pnbinom(x, size = 1, prob = 1/1500))
with(wow, plot(x, p, type = 'l'))

filter(wow, round(p, 3) == .5)
filter(wow, round(p, 3) == .8)
filter(wow, x == 2500)

Suppose your dedicated fanbase decides to try grinding for this item. Players can do 1000 quests and only half of them will get this sword. Among players running 2500 quests, 19% of players still won't have gotten one.

My Thoughts

In general, I feel like rare drops aren't worth grinding for. There's a lot of chance in the negative binomial distribution, and you could get very lucky or very unlucky.

Sometimes single-player RPGs seem to include them as a somewhat cynical way to keep kids busy when they have too much time to kill. In this case, players seem to know they're going to have to grind a long time, but they may not realize just how long they could grind and still not get it. The draw for these items seems to be more about the spectacle of the rarity than about the actual utility of the item.

In massively multiplayer games, it seems like drops are made so rare that players aren't really expected to grind for them. An item with a 1/1500 drop chance isn't something any one player can hope to get, even if they are deliberately trying to farm the item. Thus, rare items in MMOs are more like Willy Wonka Golden Tickets that a few lucky players get, rather than something that one determined player works to get. One player could try a thousand times and still not get the item, but across a hundred thousand players trying once, a few will get it, and that's enough to keep the in-game economy interesting.
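The Golden Ticket arithmetic is easy to check in base R. Suppose, hypothetically, that 100,000 players each run the quest once while one diehard runs it a thousand times:

```r
p <- 1/1500

# Across 100,000 one-shot attempts, drops are binomial:
# the community can expect about 67 swords
1e5 * p

# Meanwhile the lone grinder's 1000 quests give less than a coin flip
1 - (1 - p)^1000  # ~0.49
```

The item stays plentiful at the level of the economy while staying nearly unobtainable for any individual.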

My preference is that chance is something that encourages a shift in strategy rather than something that strictly makes your character better or worse. Maybe the player is guaranteed a nice item, but that nice item can be one of three things, each encouraging a different playstyle.

Still, it's fun sometimes to get something unexpected and get a boost from it. Imagine being one of the ~8% of players who gets a Sword of Kings by chance. Another approach is to provide a later, guaranteed way to get the desired item. In the roguelike Dungeon Crawl, some desirable items can be found early through luck, but they are also guaranteed to appear later in the game, where skill rather than chance determines whether you reach them.

Anyway, don't bet on getting that rare drop -- you may find yourself grinding much longer than you'd thought.

Tuesday, October 18, 2016

Publishing the Null Shows Sensitivity to Data

Some months ago, a paper argued for the validity of an unusual measurement of aggression. According to this paper, the number of pins a participant sticks into a paper voodoo doll representing their child seems to be a valid proxy for aggressive parenting.

Normally, I might be suspicious of such a paper because the measurement sounds kind of farfetched. Some of my friends in aggression research scoffed at the paper, calling bullshit. But I felt I could trust the research.

Why? The first author has published null results before.

I cannot stress enough how much an author's record of published null results encourages my trust in a published significant result. With some authors, the moment you read the methods section, you know what the results section will say. When every paper supports the lab's theory, one is left wondering whether there are null results hiding in the wings. One starts to worry that the tested hypotheses are never in danger of falsification.

"Attached are ten stickers you can use to harm the child.
You can stick these onto the child to get out your bad feelings.
You could think of this like sticking pins into a Voodoo doll."

In the case of the voodoo doll paper, the first author is Randy McCarthy. Years ago, I became aware of Dr. McCarthy when he carefully tried to replicate the finding that heat-related word primes influence hostile perceptions (DeWall & Bushman, 2009) and reported null results (McCarthy, 2014).

The voodoo doll paper from McCarthy and colleagues is also a replication attempt of sorts. The measure was first presented by DeWall et al. (2013); McCarthy et al. perform conceptual replications testing the measure's validity. On the whole, the replication and extension is quite enthusiastic about the measure. And that means all the more to me given my hunch that McCarthy started this project by saying "I'm not sure I trust this voodoo doll task..."

Similar commendable frankness can be seen in work from Michael McCullough's lab. In 2012, McCullough et al. reported that religious thoughts influence males' stereotypically male behavior. In 2014, one of McCullough's grad students published that she couldn't replicate the 2012 result (Hone & McCullough, 2014).

I see it as something like a Receiver Operating Characteristic curve. If the classifier has only ever given positive responses, that's probably not a very useful classifier -- you can't tell if there's any specificity to the classifier. A classifier that gives a mixture of positive and negative responses is much more likely to be useful.

A researcher that publishes a null now and again is a researcher I trust to call the results as they are.

[Conflict of interest statement: In the spirit of full disclosure, Randy McCarthy once gave me a small Amazon gift card for delivering a lecture to the Bayesian Interest Group at Northern Illinois University.]

Friday, August 19, 2016

Comment on Strack (2016)

Yesterday, Perspectives on Psychological Science published a 17-laboratory Registered Replication Report, totaling nearly 1900 subjects. In this RRR, researchers replicated an influential study of the Facial Feedback Effect, showing that being surreptitiously made to smile or to pout could influence emotional reactions.

The results were null, indicating that there may not be much to this effect.

The first author of the original study, Fritz Strack, was invited to comment. In his comment, Strack offers four criticisms of the RRR that, in his view, undermine its results to some degree. I am not convinced by these arguments; below, I address each in turn.

"Hypothesis-aware subjects eliminate the effect."

First, Strack says that participants may have learned of the effect in class and thus failed to demonstrate it. To support this argument, he performs a post-hoc analysis demonstrating that the 14 studies using psychology pools found an effect size of d = -0.03, whereas the three studies using non-psychology undergrad pools found an effect size of d = 0.16, p = .037.

However, the RRR took pains to exclude hypothesis-aware subjects. Psychology students were also, we are told, recruited prior to coverage of the Strack et al. study in their classes. Neither of these steps ensures that every hypothesis-aware subject was removed, of course, but they certainly help. And as Sanjay Srivastava points out, why would hypothesis awareness necessarily shrink the effect? It could just as well enhance it through demand characteristics.

Also, d = 0.16 is quite small -- like, 480-per-group for a one-tailed 80% power test small. If Strack is correct, and the true effect size is indeed d = 0.16, this would seem to be a very thin success for the Facial Feedback Hypothesis, and still far from consistent with the original study's effect.
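That 480-per-group figure falls out of a standard power calculation, which base R's power.t.test() can reproduce; a sketch assuming a two-sample t-test with d = 0.16:

```r
# Per-group n needed to detect d = 0.16 at 80% power, one-tailed
power.t.test(delta = 0.16, sd = 1, sig.level = .05,
             power = .80, alternative = "one.sided")
```

It reports n per group, so doubling it means nearly a thousand subjects just to have a four-in-five shot at p < .05.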

"The Far Side isn't funny anymore."

Second, Strack suggests that, despite the stimulus testing data indicating otherwise, perhaps The Far Side is too 1980s to provide an effective stimulus.

I am not sure why he feels it necessary to disregard the data, which indicates that these cartoons sit nicely in the midpoint of the scale. I am also at a loss as to why the cartoons need to be unambiguously funny -- had the cartoons been too funny, one could have argued there was a ceiling effect.

"Cameras obliterate the facial feedback effect."

Third, Strack suggests that the "RRR labs deviated from the original study by directing a camera at the participants." He argues that research on objective self-awareness demonstrates that cameras induce subjective self-focus, tampering with the emotional response.

This argument would be more compelling if any studies were cited, but in either case, I feel the burden of proof rests with this novel hypothesis that the facial feedback effect is moderated by the presence of cameras.

"The RRR shows signs of small-study effects."

Finally, Strack closes by using a funnel plot to suggest that the RRR results are suffering from a statistical anomaly.

He shows a funnel plot that compares sample size and Cohen's d, arguing that it is not appropriately pyramidal. (Indeed, it looks rather frisbee-shaped.)

Further, he conducts a correlation test between sample size and Cohen's d. This result is not, strictly speaking, statistically significant (p = .069), but he interprets it all the same as a warning sign. (It bears mention here that an Egger test with an additive error term is a more appropriate test. Such a test yields p = .235, quite far from significance.)

Strack says that he does not mean to insinuate that there is "reverse p-hacking" at play, but I am not sure how else we are to interpret this criticism. In any case, he recommends that "the current anomaly needs to be further explored," which I will below.

Strack's funnel plot does not appear pyramidal because the studies are all of roughly equal size, and so the default scale of the axes is way off. Here I present a funnel plot with axes of more appropriate scale. Again, the datapoints do not form a pyramid shape, but we see now that this is because there is little variance in sample size or standard error with which to make a pyramid shape. You're used to seeing taller, more funnel-y funnels because sample sizes in social psych tend to range broadly from 40 to 400, whereas here they vary narrowly from 80 to 140.

You can also see that there's really only one of the 17 studies that contributes to the correlation, having a negative effect size and larger standard error. This study is still well within the range of all the other results, of course; together, the studies are very nicely homogeneous (I^2 = 0%, tau^2 = 0), indicating that there's no evidence this study's results measure a different true effect size.

Still, this study has influence on the funnel plot -- it has a Cook's distance of 0.46, whereas all the others have distances of 0.20 or less. Removing this one study abolishes the correlation between d and sample size (r(14) = .27, p = .304), and the resulting meta-analysis is still quite null (raw effect size = 0.04, [-0.09, 0.18]). Strack is interpreting a correlation that hinges upon one influential observation.

I am willing to bet that this purported small-study effect is a pattern detected in noise. (Not that it was ever statistically significant in the first place.)

Admittedly, I am sensitive to the suggestion that an RRR would somehow be marred by reverse p-hacking. If all the safeguards of an RRR can't stop psychologists from reaching whatever their predetermined result, we are completely and utterly fucked, and it's time to pursue a more productive career in refrigerator maintenance.

Fortunately, that does not seem to be the case. The RRR does not show evidence of small-study effects or reverse p-hacking, and its null result is robust to exclusion of the most negative result.