How to Judge if Research is Trustworthy
[UPDATE Feb. 3, 2012: Please see additional clarification from both of the researchers of the studies cited in this article below.]
Scientists are notorious for questioning the veracity of publicized research — and with good reason. They want to know: Who conducted the research? Where was it published? What were the survey questions?
Such scrutiny matters even more when it comes to evaluating education research that will affect the investment decisions of teachers, parents, and administrators.
Case in point: Does the iPad boost student learning? Is it a solid educational tool? The headline of a recent article in Wired magazine says so, maintaining that the devices are improving student engagement and assessment.
The article draws on two recent studies conducted on iPad apps: one on Houghton Mifflin Harcourt’s Fuse Algebra I app (see MindShift’s coverage here) and one on Motion Math’s fraction app (see MindShift’s coverage here). Both of these studies tout positive results for the apps in question: In the case of the former, state standardized test scores jumped by 20%; in the case of the latter, students’ scores improved an average of 15%.
Both studies were commissioned by the companies in question: Motion Math hired an independent researcher, and Houghton used both the research firm Empirical Education and its own staff to review the data. Neither study has appeared in a peer-reviewed scientific journal, although the Motion Math study is being submitted to one. That doesn’t mean the findings are necessarily invalid, but it does mean it’s particularly important to take a close look at the research designs and the conclusions drawn from them.
For example, the HMH study was not a randomized, controlled experiment, and relied on self-reports by the two teachers who took part in the study. There was no statistical analysis, and no indication of the extent of the spread in the scores. And though the company surveyed four school districts, the only results released came from one school in Riverside County because each district carried out the study in different ways.
John Sipe, senior vice president at Houghton Mifflin Harcourt, acknowledged there’s more work to be done in assessing the results. “It causes us to want to look more and look further,” he said. “This is not a gold standard, this is simply a case study.”
I spoke with Dr. Alicia Chang, the Director of Science and Learning at the education gaming startup Airy Labs, about how non-scientists in particular (in other words, the general public) can evaluate research. Chang has a PhD in cognitive psychology from UCLA and has held post-doc research positions at the University of Delaware’s School of Education and at the University of Pittsburgh’s Learning Research and Development Center. As such, she has designed and implemented a number of research studies about learning, cognition, and developmental psychology.
Q. What are some of the things we should look for in terms of research design?
Chang: First, a “controlled experiment” should actually be an experiment. Some critical features include random selection of participants (typically conducted at a university). Studies done comparing classrooms or pre-existing groups are automatically not true experiments, but rather “quasi-experiments.” In a true experiment, the experimenter manipulates the independent variables — so you’d have to randomly assign kids to teachers/classrooms if you wanted a real experiment comparing classrooms, but obviously that would be really hard to do in reality.
Second, look for whether or not it’s peer-reviewed. White papers are self-published (and funded!) by companies for marketing purposes, so they can basically say whatever they want. Scholarly articles published in journals go through rigorous peer reviews by experts in the field so they can be considered more or less objective, and not for financial gain. Peer review does not help the cause of for-profit companies because the cycles can take months or years to complete, so by the time your paper gets published, your product cycle is long over. (I’m hoping that in the future maybe startups/companies can work collaboratively with researchers in order to improve this — I’ve tossed around the idea of establishing lab sites at universities and/or having a scientific advisory board, for example. I think data will be really important in the near future, and people will want to see measurable learning gains supported by solid evidence, or at least here’s to hoping!)
Third, no financial conflict of interest. This preserves integrity/objectivity of the results.
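To make Chang’s first criterion concrete, random assignment simply means letting chance, rather than pre-existing classrooms, decide who receives the intervention. Here is a minimal sketch in Python using made-up student names (purely illustrative, not data from either study):

```python
import random

# Toy illustration of random assignment (hypothetical students):
# chance, not pre-existing groupings, decides who gets the intervention.
students = [f"student_{i}" for i in range(20)]
rng = random.Random(42)  # seeded so the example is reproducible
rng.shuffle(students)
treatment, control = students[:10], students[10:]
print(len(treatment), len(control))  # two groups of 10
```

With pre-existing classrooms, by contrast, any difference in outcomes could just as easily reflect differences in the teachers or the students who happened to be in each room.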
Q. What are some of the things we should look for in terms of conclusions/analysis?
Chang: When I was teaching research methods, one of my pet peeves was the use of the word “proof” in science. One thing to keep in mind is that real science is always tentative and changing, so if a marketing message indicates “proven” results, whoever wrote that tagline probably does not have a real science background.
Q. What are some of the “warning signs” that research (or conclusions) may be flawed?
Chang: First, a very limited sample size without random assignment. For example, in the studies that were cited above, these were existing classrooms with different teachers.
Second, correlational analyses claiming causal relationships. This is one of the first things you learn in statistics. Just because two things co-occur, you can’t conclude that one thing causes the other. In mainstream media, you’ll often see grand conclusions like “eating fried food doesn’t cause heart disease!” but they didn’t actually only feed one group of people fried food and do a longitudinal study comparing them to a group that ate zero fried food. There is no way to tell a direct causal relationship there. This happens a lot with brain research, but most people are unaware that you can’t conclude causality with a lot of human neuroscience.
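Chang’s fried-food example is easy to simulate. In the toy sketch below (invented numbers, unrelated to the studies above), a hidden third factor drives two variables that have no causal link to each other, yet they come out strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hidden confounder drives BOTH variables; neither causes the other.
confounder = rng.normal(size=1000)
variable_a = confounder + rng.normal(scale=0.5, size=1000)
variable_b = confounder + rng.normal(scale=0.5, size=1000)

r = np.corrcoef(variable_a, variable_b)[0, 1]
print(f"correlation: {r:.2f}")  # strong correlation, zero causation
```

A correlational analysis here would find a robust relationship between the two variables, even though intervening on one would have no effect whatsoever on the other.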
Q. For folks that aren’t familiar with statistics, how can we better understand if the results are actually “statistically significant”?
Chang: In the HMH study, for example, they don’t actually do any stats and only report percentages. This can be misleading because it might look like a big difference (e.g. 78% vs. 49%) but without any real analyses you don’t know if it’s “meaningful,” because the scores might appear different due to various reasons (perhaps one group had a distribution where half the kids did really well and half the kids didn’t and it just drove up the means). But in the HMH case they didn’t even show actual understanding of algebraic concepts, just the overall percentage of kids that tested proficient on the state standards, which is a completely different and tangential measure.
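Chang’s point about spread can be illustrated with a toy t-test (hypothetical scores, not the HMH data; this sketch uses SciPy’s `ttest_ind`). The same six-point gap in means is statistically significant against a tightly clustered control group but not against a high-spread one:

```python
from scipy import stats

# Hypothetical test scores (not the HMH data).
treatment = [72, 74, 76, 78, 80]        # mean 76
control_tight = [66, 68, 70, 72, 74]    # mean 70, low spread
control_spread = [40, 55, 70, 85, 100]  # mean 70, high spread

# Identical 6-point mean difference, very different p-values:
_, p_tight = stats.ttest_ind(treatment, control_tight)
_, p_spread = stats.ttest_ind(treatment, control_spread)
print(f"tight: p={p_tight:.3f}  spread: p={p_spread:.3f}")
```

Reporting only the two means (76% vs. 70%) would make both comparisons look equally convincing; the statistical analysis is what reveals that one difference could easily be noise.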
Q. When can we generalize research beyond the study group?
Chang: In a peer-reviewed study published in a scholarly journal, you can usually assume a level of integrity. Respectable journals will not allow critical design flaws to get through. You can also look at effect sizes and the way they selected their subjects. If they seem to be reasonable (a lot of studies use college undergrads as their sample, which isn’t entirely generalizable to the whole population, but generalizable to — let’s say — middle class, well-educated people), they are probably more or less generalizable.
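The effect sizes Chang mentions can be computed directly. Cohen’s d, one common effect-size measure, scales the difference in group means by the pooled standard deviation; here is a sketch with made-up scores:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd

# Made-up scores: a 6-point gain against a pooled SD of about 3.2
# works out to d of roughly 1.9, a very large effect.
print(round(cohens_d([72, 74, 76, 78, 80], [66, 68, 70, 72, 74]), 2))
```

Unlike a raw percentage difference, an effect size lets readers compare the magnitude of a result across studies with different tests and different scales.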
[EDITOR’S NOTE, Feb. 3, 2012: We have heard from both Professor Michelle Riconscente, from the University of Southern California, who conducted the Motion Math study referenced in the article, and Denis Newman, CEO of Empirical Education, which conducted its own study of the Houghton Mifflin Harcourt algebra iPad app separate from the one cited in the article. Both of these notes add useful additional context and points to ponder to this piece. I encourage you to read the letters below.]
From Michelle Riconscente:
I am an advocate for highly rigorous research methods and the author of the Motion Math efficacy study. As such, I was both pleased to see MindShift raise awareness about the trustworthiness of research, and disappointed to see that, by engaging only one source (who is in competition with both companies mentioned), the article fell short of its goal to offer readers a balanced perspective on this important issue. Moreover, the article incorrectly implied that the Motion Math study shares the limits of the Houghton Mifflin Harcourt study. The two studies differ dramatically in their research designs.
Like much of the education research published in peer-reviewed journals, the Motion Math study used a controlled experimental design with random assignment by class to test the effectiveness of an individual-level (not class-level) intervention. Additionally, the Motion Math study applied an “extra” experimental feature called a “crossover design” which helped ensure that the increases in math scores were truly attributable to the app and not to something else, such as their teachers, or previous math instruction.
Though Dr. Chang offered some good advice to readers, she inaccurately indicated that the Motion Math study only compared classes taught by different teachers. She also states that white papers “can basically say whatever they want.” All the more reason why the criteria for judging their validity must be the quality and transparency of their rationale, methods, results, and interpretations. Those who read the Motion Math study will find that it fully meets these standards for rigorous and trustworthy research.
Prof. Michelle Riconscente
University of Southern California
From Denis Newman:
It’s great to see evidence-based discussions of how the iPad might be beneficial for schools (January 23 entry). It is particularly exciting to see school systems like Riverside USD taking a look at their own data as a way to support their local decisions about what’s working.
Unfortunately, the article unintentionally led to a confusion about which results were based on the study that my company, Empirical Education Inc., conducted and which were based on Riverside’s own data analysis. Riverside was one of four districts participating in our study, which was commissioned by Houghton Mifflin Harcourt (HMH) to study the effectiveness of their iPad app for eighth grade algebra. The comprehensive report on this study is being reviewed by HMH, and at this point, none of the findings have been released. This study was indeed a “randomized control trial.” Each participating teacher taught one randomly selected section of Algebra I using the iPad app, and their other sections continued with the conventional textbook. Riverside provided Empirical Education with data for the two teachers (and their nine sections) who had volunteered in their district. Unlike the other three districts, Riverside took a further step and ran the numbers themselves. That data analysis is where the “Figure 1” in your recent post originated. We applaud the district for taking a look at their own data, but we need to clarify that this doesn’t necessarily represent the overall results once they include the other districts participating in the study (we haven’t yet released the results!).
The confusion increased with this post, which criticized Empirical Education’s work as follows:
“The HMH study was not a randomized, controlled experiment, and relied on self-reports by the two teachers who took part in the study. There was no statistical analysis, and no indication of the extent of the spread in the scores. And though the company surveyed four school districts, the only results released came from one school in Riverside County because each district carried out the study in different ways.”
The study, that is, HMH’s study as opposed to Riverside’s, was conducted the same way in the four districts. And we certainly used appropriate statistical methods. But, interestingly, if she is referring to the Riverside report, they had a (very small) randomized experiment—randomization was “within teacher.” We agree that, while the raw results from Riverside look promising, statistical analysis is important. Probably, the major caution that your readers should consider is that we can’t generalize the results of Riverside’s mini-experiment to other teachers who may have less experience with or enthusiasm for new technology. Each district that pilots a new program like this should check its own results.
A final note on the Watters article concerns her interview of Dr. Alicia Chang, who makes some errors in her advice to your readers about how to evaluate research. I base these comments on Empirical Education’s experience working with the US Department of Education, school systems, and other educational publishers (e.g., see the Guidelines we authored that were published by the software industry’s trade association). First, she states that “studies done comparing classrooms or pre-existing groups are automatically not true experiments,” which is incorrect. Most field experiments in education randomize clusters of students: commonly classes, teachers, grade-level teams, or schools.
Second, she is asked how we can generalize the results of a study, but doesn’t directly answer that question. Instead she mentions design flaws and how they may be caught by reviewers. But a flawlessly designed study may still have a very narrow range of generalization. For example, our study of the iPad app, although well designed, cannot be taken to apply to all math software. This is a very important caution—a flawless experiment conducted in one district may be irrelevant to decisions in a different district. This goes back to our appreciation of Riverside’s examination of their own results. Their study may have been flawed, but it provided them with a snapshot specific to their own school district. The timely examination of their results may have informed their decisions without having to wait for the official results—which may be less relevant to their local conditions—to finally be released.
Denis Newman, Ph.D.
CEO Empirical Education Inc.