Emma Pierson

Men Outspeak Women: Analysing the Gender Gap in Competitive Debate

Based on an analysis of more than 35,000 speeches spanning over a decade, this article documents a consistent gender gap in speaker scores and examines some of its features and possible explanations.

Debate is a male-dominated activity, and reports of sexism are common and occasionally high-profile (as evidenced by the widely discussed Glasgow debating scandal [1]). Such anecdotes become more powerful when supplemented by statistical analysis, which can offer a broader view of the effects of sexism, the factors at play, and how they change over time.

Here we offer a systematic statistical analysis of how females actually perform in competitive debate. We analyze data from 2,225 teams with 35,062 speaker scores spanning more than a decade, drawn from 14 tournaments: the European University Debating Championships (EUDC) 2001-2013 and the World University Debating Championships (WUDC) 2013. [2] The central result of our analysis is simple and incontrovertible: across all tournaments, male speaker scores are higher than female speaker scores by an average of 1.2 points per round, a highly statistically significant discrepancy (p < 10⁻¹¹, t-test [3]). We first describe a number of interesting characteristics of this “gender gap”; we then present analysis as to its causes; we conclude by discussing potential solutions. To facilitate future analysis, we are making the datasets used in this paper available in CSV format. [4] All data was parsed from online records or, in the case of detailed judging data, solicited from tab directors. (We are not making the latter available due to privacy concerns.)
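
For readers who want to reproduce this kind of calculation from the released CSVs, the following is a minimal sketch of the headline comparison in Python. The file name and column names ("speaker_scores.csv", "gender", "score") are placeholders of our own and may not match the released data.

```python
# Minimal sketch of the headline test: compare per-speech scores for male and
# female speakers with a two-sample (Welch's) t-test.
# File and column names here are placeholders, not the released data's schema.
import pandas as pd
from scipy import stats

speeches = pd.read_csv("speaker_scores.csv")             # one row per speech
male = speeches.loc[speeches["gender"] == "M", "score"]
female = speeches.loc[speeches["gender"] == "F", "score"]

t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)
print(f"mean gap = {male.mean() - female.mean():.2f} points per round, p = {p_value:.2g}")
```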


Characteristics of the gender gap

We discuss three factors that affect the size of the gender gap.

1. Females on female-female teams speak lower than females on gender-mixed teams, and males on male-male teams speak higher than males on gender-mixed teams. While the scores of males on gender-mixed teams are slightly higher than their female partners’ (.2 points per round; p=.007, t-test), this gap is much smaller than the difference for males and females overall (Figure 1). Mixed teams are also rarer than one would expect if males and females were equally likely to partner with either gender: across all years, 40% of teams are mixed, as compared to the 45% one would expect from uniform mixing (p < 10⁻¹⁰, χ²; a sketch of this test appears after this list). This may indicate that males prefer to partner with males, and females with females, but there may also be other causal explanations: people tend to pair with partners of equal experience, for example, and males tend to have more experience, as we discuss below.

Figure 1: The gender gap is larger for single-gender teams.

2. The size of the gender gap varies from year to year. In EUDCs 2002 and 2003, females actually outspoke males. It would be worth considering whether these tournaments had policies that reduced the size of the male-female discrepancy.

3. Females earned higher speaker scores on topics related to gender. We analyzed scores in 9 gender-related topics, including, for example, “This house believes that women should have equal rights and equal obligations in the Army” (EUDC 2002) and “This house believes that custody hearings should not take a child’s biological parentage into account” (EUDC 2009). In every one of these rounds, females spoke better than they had in the tournament as a whole, by an average of .6 speaker points (p < 10⁻⁷, t-test). Males spoke slightly worse, by an average of .1 speaker points, but this difference was not statistically significant.
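
As promised in point 1, here is a minimal sketch of the mixing test: a chi-squared goodness-of-fit test comparing the observed team compositions to what gender-blind pairing would predict. The counts below are placeholders chosen only to match the aggregate figures quoted above; the split between female-female and male-male teams is invented.

```python
# Sketch of the team-composition test: are mixed teams rarer than uniform
# (gender-blind) pairing would predict? Counts are placeholders consistent with
# the aggregates in the text; the FF/MM split is invented.
from scipy import stats

n_ff, n_mixed, n_mm = 300, 890, 1035                    # female-female, mixed, male-male
n_teams = n_ff + n_mixed + n_mm
q = (2 * n_ff + n_mixed) / (2 * n_teams)                # fraction of debaters who are female

# Under uniform mixing, team types occur with probabilities q^2, 2q(1-q), (1-q)^2.
expected = [q**2 * n_teams, 2 * q * (1 - q) * n_teams, (1 - q)**2 * n_teams]
chi2, p = stats.chisquare([n_ff, n_mixed, n_mm], f_exp=expected, ddof=1)  # ddof=1: q estimated from the data
print(f"observed mixed share: {n_mixed / n_teams:.0%}, "
      f"expected under uniform mixing: {2 * q * (1 - q):.0%}, p = {p:.2g}")
```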



What Causes the Gender Gap?

We examined the statistical plausibility of two hypotheses: that the gender gap is caused by differences in experience levels, and that it is caused by sexism in judging. We first note that inferring causation from speaker score statistics alone is a perilous process. It is possible, for example, that females earn higher speaker scores in rounds about gender because they tend to be better informed on these topics and give better speeches; it is also possible that judges are simply sexist, and give females more credit for making identical arguments. One cannot discriminate between these hypotheses on the basis of the speaker score data alone. More broadly, the mere presence of a gender gap in speaker scores does not inherently imply sexism in judging, any more than the presence of a gender gap in ovarian cancer diagnosis rates implies sexism in doctors: males don’t have ovaries, and females may be giving worse speeches. [5] Statistical analyses must thus be supplemented by the more subjective ones published elsewhere in this journal, and ideally also by controlled experiments (in which, for example, the gender of the speaker is manipulated but the speech is otherwise identical).

Is the gender gap attributable to differences in experience?

Males do indeed have more experience than females, as measured by the number of previous Euros attended, and this gap has widened in recent years.

Figure 2: The mean number of previous EUDCs attended by males and females.

Figure 3: Negative correlation between proportion of tournament participants who are female and female speaker scores.

Unsurprisingly, speakers with more experience tend to earn higher speaker scores, with each additional year of experience corresponding to about 1.8 speaker points per round. The combination of these two facts helps explain the negative correlation (Figure 3) between the proportion of tournament participants who are female and how well females do relative to males: more females means more new females, which means lower speaker scores. (Due to the small number of tournaments, this correlation only flirts with significance; p=.06, linear regression.)

It thus seems likely that the experience gap is a partial explanation for the gender gap. But it is probably not a full one. One way to estimate the effect of the experience gap is to multiply each year’s experience gap by the estimated effect of experience: for example, if males had on average two years’ more experience than females, and each year of experience added on average three points to a speaker’s score, we would expect males to speak on average six points higher than females. When we do this, however, a great deal of the gender gap remains to be explained (Figure 4). [6]
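
As an illustration only (the per-year numbers below are made up, not the real figures), the adjustment works like this:

```python
# Illustration of the experience adjustment: multiply each year's experience gap
# (in EUDCs attended) by the estimated value of a year of experience, and compare
# the predicted gap to the observed one. All numbers here are placeholders.
POINTS_PER_YEAR = 1.8   # approximate effect of one extra year of experience (see above)

years = {
    # year: (mean male experience, mean female experience, observed gap in speaker points)
    2012: (1.10, 0.70, 1.4),
    2013: (1.20, 0.65, 1.5),
}
for year, (exp_m, exp_f, observed_gap) in years.items():
    predicted_gap = POINTS_PER_YEAR * (exp_m - exp_f)
    print(f"{year}: predicted from experience = {predicted_gap:.2f}, "
          f"observed = {observed_gap:.2f}, unexplained = {observed_gap - predicted_gap:.2f}")
```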

Figure 4: The experience gap does not fully explain the gender gap.

It is nonetheless worth considering ways to reduce the experience gap. Part of the gap is probably unavoidable if (as one would hope) the proportion of female debaters continues to increase, since this will increase the number of novices; however, one might try to improve female retention. Males are 17% more likely than females to return to future Euros (p=.01, t-test). A female is more likely to return if a higher fraction of her delegation is female, however: a 50% increase in the female fraction of the delegation increases the probability of return by 40% (p=.04, logistic regression). [7] It seems reasonable to conclude, therefore, that one way of reducing the gender score gap is to increase the fraction of females in delegations, which will increase retention, and thereby experience and speaker scores.
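
A minimal sketch of this retention regression is below, using statsmodels on synthetic data: the real analysis uses tournament registration records, and the generated DataFrame here only illustrates the shape of the calculation.

```python
# Sketch of the retention regression: regress whether a female debater returns to
# a future Euros on the female fraction of her delegation (logistic regression).
# The data below is synthetic; the real analysis uses registration records.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
female_fraction = rng.uniform(0, 1, n)                    # female share of each debater's delegation
returned = (rng.uniform(0, 1, n) < 0.3 + 0.2 * female_fraction).astype(int)

df = pd.DataFrame({"female_fraction": female_fraction, "returned": returned})
model = sm.Logit(df["returned"], sm.add_constant(df[["female_fraction"]])).fit(disp=False)
print(model.summary())                                     # coefficient and p-value on female_fraction
```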

Is the gender gap attributable to sexism in judging?

We first note that this hypothesis could be both true and statistically unprovable. The difficulty of testing the hypothesis is compounded by lack of data: while speaker scores are widely available, we had judging data only for WUDC 2013 and EUDC 2013. One strategy is to search for evidence of sexism in individual judges: for example, a judge who consistently gives lower scores to females than other judges do, but does not give lower scores to males, might be called sexist. This strategy, however, is both somewhat cruel to individual judges and statistically fruitless due to the small amount of data for each judge. We tested each individual judge j for sexism as follows: for each female f whom j had judged, we computed the difference between the score j had given f and the score that f had received from other judges; we did the same for each male m; and we compared the means of the two lists of differences using a t-test. This would identify judges who, relative to other judges, consistently scored males more generously than females, or vice versa. After we adjusted significance thresholds to account for the number of judges examined, [8] no judge showed statistically significant signs of sexism, regardless of whether we examined only chair judges or all judges. (When considering non-chair judges, we examined all rooms in which a judge participated, regardless of whether they were chair.)
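
A sketch of this per-judge procedure is below, with hypothetical column names ("judge", "speaker", "gender", "score"); the real judging data is not being released.

```python
# Sketch of the per-judge sexism test: for each judge, compare how much they
# deviate from other judges' scores for female speakers versus male speakers,
# using a Bonferroni-corrected t-test. Column names are hypothetical.
import pandas as pd
from scipy import stats

scores = pd.read_csv("judge_scores.csv")     # one row per (judge, speaker, round)

# Each speaker's mean score from all *other* judges (leave-one-out mean).
grp = scores.groupby("speaker")["score"]
other_mean = (grp.transform("sum") - scores["score"]) / (grp.transform("count") - 1)
scores["diff"] = scores["score"] - other_mean

judges = scores["judge"].unique()
alpha = 0.05 / len(judges)                   # Bonferroni correction (see footnote 8)
for judge, rows in scores.groupby("judge"):
    diff_f = rows.loc[rows["gender"] == "F", "diff"].dropna()
    diff_m = rows.loc[rows["gender"] == "M", "diff"].dropna()
    if len(diff_f) > 1 and len(diff_m) > 1:
        t_stat, p = stats.ttest_ind(diff_f, diff_m, equal_var=False)
        if p < alpha:
            print(f"judge {judge}: male-female gap = {diff_m.mean() - diff_f.mean():.2f}, p = {p:.2g}")
```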

A more fruitful strategy might be to consider properties of the judges in aggregate, but this again yielded few results. We looked for correlations between a judge’s gender bias and three variables: judge skill (as rated by the tournament organizers on a scale of 1-9), judge gender, and the gender inequality index of the judge’s home country. None of these correlations were significant.

We did, however, find evidence that judges are not “gender blind”: they display consistent preferences for one gender or the other, implying that these preferences play a role in how speaker scores are assigned. When we examined the seventeen judges who attended both Euros and Worlds 2013, we found a significant correlation (p=.03) between the gap in scores they gave to males and females at Euros and the gap in scores they gave to males and females at Worlds. This implies that judges are consistent in whether they prefer males to females or vice versa: they are not simply objective observers of arguments who are utterly oblivious to gender. A second piece of evidence for this hypothesis is that, when we conducted the t-tests for judge sexism described above and examined the significance scores for all judges in aggregate, more judges were significant at the .05 level (20% for Euros, 17% for Worlds) than one would expect from random chance. [9] This implies that judges as a whole tended to display preferences for one gender or the other.
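
A minimal sketch of the consistency check follows; the per-judge gaps below are placeholder numbers, not the real ones.

```python
# Sketch of the consistency check: correlate each judge's male-female score gap
# at Euros 2013 with the same judge's gap at Worlds 2013. Values are placeholders.
from scipy import stats

euros_gap  = [0.4, -0.2, 1.1, 0.0, 0.7, -0.5, 0.3, 0.9, -0.1, 0.2,
              0.6, -0.3, 0.8, 0.1, 0.5, -0.4, 1.0]    # one value per judge (17 judges)
worlds_gap = [0.3,  0.0, 0.9, 0.2, 0.5, -0.6, 0.1, 0.7, -0.2, 0.4,
              0.5, -0.1, 0.6, 0.0, 0.4, -0.3, 0.8]    # same judges, at Worlds

r, p = stats.pearsonr(euros_gap, worlds_gap)
print(f"r = {r:.2f}, p = {p:.2g}")   # a significant positive r suggests consistent gender preferences
```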

Thus, while we cannot conclusively demonstrate that individual judges prefer one gender to the other, we do find it likely that the source of the gender gap is not simply that females are worse at debate. If that were the case, and individual judges showed no gender preferences, we would not expect to see consistency in judges’ gender gaps across Euros and Worlds, nor the disproportionate number of significant gender preferences. Given that judges do appear to display gender preferences, and that a gender gap in speaker scores exists, it seems plausible to conclude that this gender gap may be partially attributable to judge gender preferences. We emphasize, however, that the statistical evidence that the gender gap is due in part to judge gender preferences is far less strong than the evidence that the gender gap exists: the latter is incontrovertible, but non-statistical methods may be best suited to demonstrate the former.




Conclusion

We have shown a large and statistically significant gender gap in the speaker scores given to male and female debaters. This gap varies across time, is particularly pronounced for single-gender teams, and is reduced for topics related to gender. It is partially, though probably not completely, attributable to the fact that male debaters have more experience. Judges tend to be consistent in their preferences for one gender or the other, suggesting that sexism may also play a role.

The gender gap has implications beyond debate. To the extent that public speaking is required in more consequential arenas (corporate boardrooms, academic conferences, or chambers of parliament), one might plausibly expect the forces producing a gender gap in debate to produce gender gaps there as well. Indeed, previous research has shown that women do speak less than men in mixed-gender deliberations, and that their suggestions receive more criticism, although this discrepancy can be mitigated if efforts are made to include all participants (for example, through a unanimous decision rule). [10][11]

We conclude by discussing two positive trends and suggestions for accelerating them. First, of late, “elite females” at the high end of the female score distribution have been outspeaking “elite males” at the high end of the male score distribution: in the most recent Worlds and the past three EUDCs, females at the 99th percentile have outspoken males at the 99th percentile. (This is a striking result: in other male-dominated competitions, such as Math Olympiads, the gender skew becomes larger at the very high end. [12]) Because so few debaters sit at this extreme, the statistical significance of our finding is dubious, but it nonetheless suggests ways to reduce the gender gap: high-speaking females could serve as role models or teachers for new females, as well as counterexamples to sexist judges (if they exist). Second, the proportion of females in debate has increased in recent years, although there is some evidence that this has actually increased the gender gap by bringing in novices (and, potentially, by reaching farther down the bell curve).
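
For completeness, here is a sketch of the 99th-percentile comparison mentioned above, using the same placeholder file and column names as the earlier sketches.

```python
# Sketch of the "elite speakers" comparison: within one tournament, compare the
# 99th percentile of per-speaker average scores for females and males.
# File and column names are placeholders, as above.
import numpy as np
import pandas as pd

speeches = pd.read_csv("speaker_scores.csv")
averages = speeches.groupby(["speaker", "gender"])["score"].mean().reset_index()

for gender, rows in averages.groupby("gender"):
    p99 = np.percentile(rows["score"], 99)
    print(f"{gender}: 99th-percentile average score = {p99:.1f}")
```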

We suggest several means of reducing the gender gap. First, increasing female debaters’ experience (both by getting them to more tournaments and by improving retention to male levels) will improve speaker scores. (We note, however, that simply “getting more girls to tournaments” is an incomplete solution: previous research has shown that in some deliberative settings, the more women are present, the more silenced the women become [11].) Second, increasing the number of male-female partnerships might reduce the gender gap: male-female partnerships tend to feature more equal speaker scores, although this may simply be a selection effect. Still, it seems worth trying, particularly given that males and females tend to segregate. Third, one might study tournaments in which the gender gap is smaller: EUDCs 2002-2003, for example. [14] Finally, although this is controversial, we might suggest statistically monitoring judges for signs of sexism: although, as our analysis suggests, such evidence would be difficult to find given the small sample size, the mere awareness by judges that they were being monitored might in itself reduce sexism (it might, of course, also bias judges in the other direction). We hope that some combination of these measures, along with continued study through statistical, experimental, and subjective methods, will continue our progress towards a more equal world.

With thanks to Stephen Boyle, Jens Fisher, Tommy Peto, and especially Shengwu Li for insights and data.

  1. Utton C. “Who said misogyny’s dead? Female students receive sexist heckles at Glasgow Ancients debating competition”. The Independent, 5 March 2013.

  2. Machine-readable data is more readily available for EUDCs than for WUDCs, but we saw no large statistical differences between the two tournaments.

  3. Throughout this paper, we use “p” or “p-value” to refer to the probability (between 0 and 1) that a result at least as large would be seen by chance if there were no real underlying pattern. p-values below .05 are conventionally considered “significant”.

  4. Available online: http://cs.stanford.edu/people/emmap1/math.html, “Statistical Evidence of Sexism in Competitions”.

  5. It is worth considering what the latter hypothesis would even mean: while we would probably all agree that better debate speeches are more persuasive, and that the persuasiveness of a speech increases with objective measures like factual accuracy, persuasiveness also probably increases with qualities like a speaker’s confidence and height, characteristics which may have strong correlations with gender. At what point do gendered standards of good speaking devolve into sexist ones?

  6. It is worth noting, however, that even if the gender gap were entirely attributable to differences in experience, the lack of perfect data might lead us to underestimate the effects of experience.

  7. This number should not be taken too literally, given the hazards of regression and the curious fact that, when one regresses probability of return on both the number of males in the delegation and the number of females in the delegation, only the former is significant (p=.03) with each additional male reducing the probability of return by 8%. The conclusion, perhaps, is that women don’t want more women: they just dislike men.

  8. Using a Bonferroni correction: for example, if we test 500 judges for sexism, each judge must be significant at the .05/500 level, rather than at the usual .05 threshold.

  9. We confirmed this by performing a bootstrap and randomly shuffling the sexes of all competitors.

  10. Karpowitz C, Mendelberg T, Shaker L. “Gender Inequality in Deliberative Participation”. American Political Science Review, 106(3):533-547 (2012).

  11. Kathlene L. “Power and influence in state legislative policymaking: The interaction of gender and position in committee hearing debates”. The American Political Science Review, 88(3):560-576 (1994).

  12. Ellison G, Swanson A. “The Gender Gap in Secondary School Mathematics at High Achievement Levels: Evidence from the American Mathematics Competitions”. Working paper, MIT and NBER (2009).


  14. As well as American tournaments, which often do not have statistically significant gender gaps–thanks to Stephen Boyle for this insight.