Johns Hopkins University, Est. 1876

Author Amanda Neitzel, Center for Research and Reform in Education

Is a 25-minute 5K time good?

It depends. Compared to an elite runner, it’s slow. Compared to the average adult, it’s quite fast. Compared to your own previous time, it might represent real progress.

We intuitively understand this in everyday life. But in education research, we often forget it.

When people ask whether a tutoring program or curriculum “works,” what they often mean is: Is the effect size big enough? Is 0.10 meaningful? Is 0.20 strong?

But that’s the wrong first question.

The better question is: Compared to what?

What an Effect Size Actually Tells Us

In any evaluation—whether a randomized controlled trial or a quasi-experimental study—the estimated impact reflects a difference between two groups:

  • Students who received the program
  • Students who did not—the control group in randomized studies, or more generally, the comparison group

That sounds straightforward. But in practice, what the comparison group experiences can vary enormously, and that variation fundamentally shapes the results.

In other words, effect sizes are not just about how well something works; they are about how much better it works than the alternative.
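To make the arithmetic concrete, here is a minimal sketch of one common way an effect size is computed: a standardized mean difference, where the gap between the two group means is divided by a pooled standard deviation. The article does not say which effect size metric a given study uses, so this choice is an assumption. The function name and the scores are hypothetical and invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_mean_difference(program_scores, comparison_scores):
    """Difference in group means divided by the pooled standard deviation.

    This is one common effect size definition (often called Cohen's d).
    It is an assumption here, not necessarily the metric any given study uses.
    """
    n1, n2 = len(program_scores), len(comparison_scores)
    s1, s2 = stdev(program_scores), stdev(comparison_scores)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(program_scores) - mean(comparison_scores)) / pooled_sd


# Hypothetical test scores, made up for illustration only.
program = [52, 55, 61, 48, 58, 63]
comparison = [50, 53, 59, 46, 56, 61]

d = standardized_mean_difference(program, comparison)
print(f"Effect size: {d:.2f}")
```

Note that the comparison group's scores enter the calculation directly. Change what that group experienced, and therefore how it scored, and the effect size changes even if the program group's results stay exactly the same.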

Tutoring: Not One Comparison, but Many

Consider tutoring, one of the most studied and widely implemented strategies in education today.

A study of tutoring might compare:

  • Tutoring vs. no additional academic support
  • Tutoring vs. another tutoring program
  • Tutoring vs. other supplemental services (e.g., small groups, intervention blocks)
  • Tutoring that replaces core instruction vs. tutoring that adds to it

These are not small differences. They are entirely different questions.

If tutoring is compared to no additional support, we might expect relatively large effects. If it is compared to another structured intervention, effects will likely be smaller. If tutoring replaces core instruction, pulling students out of reading or math blocks, the net benefit may be reduced or even negligible.

Yet these distinctions are often collapsed into a single headline: “Tutoring produced an effect size of 0.20.”

Without context, that number is almost meaningless.

A tutoring program showing an effect size of 0.20 against no additional support may be less impressive than a program showing 0.10 against another high-quality tutoring model. But unless we understand the comparison, we can’t interpret the result.
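A short sketch shows how the same program result can produce very different effect sizes depending on the comparison. All numbers here are hypothetical: an invented test scale, an assumed pooled standard deviation of 50 points, and made-up group means. The effect size is assumed to be a standardized mean difference.

```python
# Hypothetical values for illustration only; none come from a real study.
PROGRAM_MEAN = 510.0  # average score of tutored students (invented scale)
POOLED_SD = 50.0      # assumed pooled standard deviation of scores

# Average score of the comparison group under three different conditions.
comparison_means = {
    "no additional support": 500.0,
    "small-group intervention": 505.0,
    "another tutoring program": 508.0,
}

# Same program outcome, three different effect sizes.
effect_sizes = {
    condition: (PROGRAM_MEAN - comparison_mean) / POOLED_SD
    for condition, comparison_mean in comparison_means.items()
}

for condition, effect_size in effect_sizes.items():
    print(f"Tutoring vs. {condition}: {effect_size:.2f}")
```

In this made-up scenario the tutored students perform identically in every row. Only the comparison changes, and the effect size shrinks as the comparison group receives more support.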

Curriculum Studies: The “Business as Usual” Problem

The same issue arises in evaluations of curriculum and instructional programs.

We are never comparing a reading curriculum to “no reading instruction.” Instead, studies typically compare:

  • Curriculum A vs. Curriculum B
  • A new program vs. “business as usual”

But “business as usual” is rarely well-defined. It might include:

  • A different structured curriculum
  • Teacher-developed materials
  • A mix of approaches that vary across classrooms or schools

In reality, many curricula share common features. They align to similar standards, include overlapping instructional practices, and are implemented under similar constraints.

As a result, even meaningful improvements may produce modest effect sizes.

This is not a sign that “nothing works.” It reflects the fact that we are comparing one reasonable approach to another, not replacing something with nothing.

Why “It Depends” Is the Right Answer

When someone asks, “Is an effect size of 0.10 good?” the honest answer is:

It depends.

And not just on one thing.

It depends on:

  • What the program is being compared to
  • The strength of the research design (e.g., randomized vs. quasi-experimental)
  • What students in the comparison group actually experienced
  • Whether the program adds to or replaces existing instruction
  • The outcome being measured and the time frame

Among these, the comparison condition is often the least visible, yet it is one of the most important.

A small effect against a strong comparison can be more impressive than a larger effect against a weak one. But without understanding the comparison, it is difficult to know which is which.

What This Means for Evidence Users

For practitioners and policymakers, this leads to a simple but powerful habit:

Always ask: What is this being compared to?

Before interpreting results, look for:

  • A clear description of the comparison condition
  • Whether students received alternative supports or services
  • Whether the program replaced or added to existing instruction

These details are sometimes buried in technical sections—or missing altogether—but they are essential for making sense of the findings.

Developing the habit of asking this question is one of the easiest ways to become a more sophisticated consumer of evidence.

What This Means for Researchers

Researchers can strengthen the field by improving how we describe comparison conditions.

This is not easy. “Business as usual” is inherently messy. It varies across schools, classrooms, and even individual teachers.

But even partial transparency helps:

  • Describing typical instruction in comparison classrooms
  • Reporting access to supplemental supports
  • Clarifying whether interventions replace or supplement core instruction

Better descriptions of comparison conditions make findings more interpretable—and more useful for decision-making.

Conclusion

Going back to the 5K analogy: a time only makes sense in context.

Education research is no different.

Effect sizes do not tell us how well something works in absolute terms. They tell us how much better—or worse—it performs relative to an alternative.

And until we understand that alternative, we do not really understand the evidence at all.

So the next time you see an effect size, resist the urge to ask whether it is “good.”

Instead, start with a better question:

Compared to what?
