How to Judge if Research is Trustworthy

January 31, 2012

B. Gilliard

[UPDATE Feb. 3, 2012: Please see additional clarification from both of the researchers of the studies cited in this article below.]

Scientists are notorious for questioning the veracity of publicized research — and with good reason. They want to know: Who conducted the research? Where was it published? What were the survey questions?

Such scrutiny is all the more important when it comes to evaluating education research that will affect the investment decisions of teachers, parents, and administrators.

Case in point: does the iPad boost student learning? The headline of a recent article in Wired magazine maintains that it is a solid educational tool and that the devices are improving student engagement and assessment.

The article draws on two recent studies conducted on iPad apps: one on Houghton Mifflin Harcourt’s Fuse Algebra I app (see MindShift’s coverage here) and one on Motion Math’s fraction app (see MindShift’s coverage here). Both of these studies tout positive results for the apps in question: In the case of the former, state standardized test scores jumped by 20%; in the case of the latter, students’ scores improved an average of 15%.

Both studies were commissioned by the companies in question: Motion Math hired an independent researcher, and Houghton used both the research firm Empirical Education and its own staff to review the data. Neither study has appeared in a peer-reviewed scientific journal, although the Motion Math study is being submitted to one. That doesn’t mean the findings are necessarily invalid, but it does mean it’s particularly important to take a close look at the research designs and the conclusions drawn from them.

For example, the HMH study was not a randomized, controlled experiment, and relied on self-reports by the two teachers who took part in the study. There was no statistical analysis, and no indication of the extent of the spread in the scores. And though the company surveyed four school districts, the only results released came from one school in Riverside County because each district carried out the study in different ways.

John Sipe, senior vice president at Houghton Mifflin Harcourt, acknowledged there’s more work to be done in assessing the results. “It causes us to want to look more and look further,” he said. “This is not a gold standard, this is simply a case study.”

I spoke with Dr. Alicia Chang, the Director of Science and Learning at the education gaming startup Airy Labs, about how non-scientists in particular (in other words, the general public) can evaluate research. Chang has a PhD in cognitive psychology from UCLA and has held post-doc research positions at the University of Delaware’s School of Education and at the University of Pittsburgh’s Learning Research and Development Center. As such, she has designed and implemented a number of research studies about learning, cognition, and developmental psychology.

Q. What are some of the things we should look for in terms of research design?

Chang: First, a “controlled experiment” should actually be an experiment. Some critical features include random selection of participants (a study of this kind is typically conducted at a university). Studies done comparing classrooms or pre-existing groups are automatically not true experiments, but rather “quasi-experiments.” In a true experiment, the experimenter manipulates the independent variables, so you’d have to randomly assign kids to teachers/classrooms if you wanted a real experiment comparing classrooms, but obviously that would be really hard to do in reality.
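The distinction Chang draws can be sketched in a few lines of code. This is a minimal illustration (Python, with an invented pool of 24 students), not a recipe from either study: the quasi-experiment compares groups as they already exist, while the true experiment shuffles the pool before splitting it.

```python
import random

random.seed(42)  # fixed seed, for a reproducible illustration

# A hypothetical pool of 24 students.
students = [f"student_{i:02d}" for i in range(24)]

# Quasi-experiment: compare two pre-existing classrooms as-is.
# Any difference may reflect the teachers or the prior grouping,
# not the intervention being tested.
classroom_a, classroom_b = students[:12], students[12:]

# True experiment: shuffle the pool, then split, so every student
# has an equal chance of landing in either condition.
pool = students[:]
random.shuffle(pool)
treatment, control = pool[:12], pool[12:]
```

Random assignment doesn't make the groups identical; it makes any pre-existing differences a matter of chance rather than of how the classrooms were formed.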

Second, look for whether or not it’s peer-reviewed. White papers are self-published (and funded!) by companies for marketing purposes, so they can basically say whatever they want. Scholarly articles published in journals go through rigorous peer reviews by experts in the field so they can be considered more or less objective, and not for financial gain. Peer review does not help the cause of for-profit companies because the cycles can take months or years to complete, so by the time your paper gets published, your product cycle is long over. (I’m hoping that in the future maybe startups/companies can work collaboratively with researchers in order to improve this — I’ve tossed around the idea of establishing lab sites at universities and/or having a scientific advisory board, for example. I think data will be really important in the near future, and people will want to see measurable learning gains supported by solid evidence, or at least here’s to hoping!)

Third, no financial conflict of interest. This preserves integrity/objectivity of the results.

Q. What are some of the things we should look for in terms of conclusions/analysis?

Chang: When I was teaching research methods, one of my pet peeves was the use of the word “proof” in science. One thing to keep in mind is that real science is always tentative and changing, so if a marketing message indicates “proven” results, whoever wrote that tagline probably does not have a real science background.

Q. What are some of the “warning signs” that research (or conclusions) may be flawed?

Chang: First, a very limited sample size without random assignment. For example, in the studies that were cited above, these were existing classrooms with different teachers.
Second, correlational analyses claiming causal relationships. This is one of the first things you learn in statistics. Just because two things co-occur, you can’t conclude that one thing causes the other. In mainstream media, you’ll often see grand conclusions like “eating fried food doesn’t cause heart disease!” but they didn’t actually only feed one group of people fried food and do a longitudinal study comparing them to a group that ate zero fried food. There is no way to tell a direct causal relationship there. This happens a lot with brain research, but most people are unaware that you can’t conclude causality with a lot of human neuroscience.
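Chang's fried-food example can be simulated directly. The sketch below (Python with NumPy; all numbers invented) constructs two variables that never influence each other; both are driven by a hidden third factor, yet they come out strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hidden confounder: say, an underlying lifestyle factor.
lifestyle = rng.normal(size=n)

# Both variables depend on the confounder, but NOT on each other.
fried_food_intake = 2.0 * lifestyle + rng.normal(size=n)
heart_disease_risk = 2.0 * lifestyle + rng.normal(size=n)

r = np.corrcoef(fried_food_intake, heart_disease_risk)[0, 1]
print(f"correlation r = {r:.2f}")  # close to 0.8, despite zero direct causation
```

A naive reading of that correlation would conclude that fried food drives heart disease; only because we wrote the data-generating code do we know it does not. Observational data alone cannot distinguish the two stories.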

Q. For folks who aren’t familiar with statistics, how can we better understand whether the results are actually “statistically significant”?

Chang: In the HMH study, for example, they don’t actually do any stats and only report percentages. This can be misleading because it might look like a big difference (e.g. 78% vs. 49%), but without any real analyses you don’t know if it’s “meaningful,” because the scores might appear different for various reasons (perhaps one group had a distribution where half the kids did really well and half didn’t, and that drove up the mean). But in the HMH case they didn’t even measure actual understanding of algebraic concepts, just the overall percentage of kids who tested proficient on the state standards, which is a completely different and tangential measure.

Q. When can we generalize research beyond the study group?

Chang: In a peer-reviewed study published in a scholarly journal, you can usually assume a level of integrity. Respectable journals will not allow critical design flaws to get through. You can also look at effect sizes and the way they selected their subjects. If they seem to be reasonable (a lot of studies use college undergrads as their sample, which isn’t entirely generalizable to the whole population, but generalizable to — let’s say — middle class, well-educated people), they are probably more or less generalizable.
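Since "effect size" may be an unfamiliar term, here is a minimal sketch of one common measure, Cohen's d (Python with NumPy; the score lists are invented for illustration):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled SD)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Invented test scores for an app group and a control group.
app_group = [75, 80, 85, 90, 95]
control = [70, 75, 80, 85, 90]

d = cohens_d(app_group, control)
print(f"Cohen's d = {d:.2f}")  # ~0.63
```

By common (and rough) rules of thumb, d around 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large; unlike a raw score gap, the measure accounts for how spread out the scores are.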

[EDITOR’S NOTE, Feb. 3, 2012: We have heard from both Professor Michelle Riconscente, from the University of Southern California, who conducted the Motion Math study referenced in the article, and Denis Newman, CEO of Empirical Education, which conducted its own study of the Houghton Mifflin Harcourt algebra iPad app separate from the one cited in the article. Both of these notes add useful additional context and points to ponder to this piece. I encourage you to read the letters below.]

From Michelle Riconscente:

I am an advocate for highly rigorous research methods and the author of the Motion Math efficacy study. As such, I was both pleased to see MindShift raise awareness about the trustworthiness of research, and disappointed to see that, by engaging only one source (who is in competition with both companies mentioned), the article fell short of its goal to offer readers a balanced perspective on this important issue. Moreover, the article incorrectly implied that the Motion Math study shares the limits of the Houghton Mifflin Harcourt study. The two studies differ dramatically in their research designs.

Like much of the education research published in peer-reviewed journals, the Motion Math study used a controlled experimental design with random assignment by class to test the effectiveness of an individual-level (not class-level) intervention. Additionally, the Motion Math study applied an “extra” experimental feature called a “crossover design” which helped ensure that the increases in math scores were truly attributable to the app and not to something else, such as their teachers, or previous math instruction.

Though Dr. Chang offered some good advice to readers, she inaccurately indicated that the Motion Math study only compared classes taught by different teachers. She also states that white papers “can basically say whatever they want.” All the more reason why the criteria for judging their validity must be the quality and transparency of their rationale, methods, results, and interpretations. Those who read the Motion Math study will find that it fully meets these standards for rigorous and trustworthy research.

Prof. Michelle Riconscente
University of Southern California
riconsce@usc.edu

——-

From Denis Newman:

It’s great to see evidence-based discussions of how the iPad might be beneficial for schools (January 23 entry). It is particularly exciting to see school systems like Riverside USD taking a look at their own data as a way to support their local decisions about what’s working.

Unfortunately, the article unintentionally led to confusion about which results were based on the study that my company, Empirical Education Inc., conducted and which were based on Riverside’s own data analysis. Riverside was one of four districts participating in our study, which was commissioned by Houghton Mifflin Harcourt (HMH) to study the effectiveness of their iPad app for eighth grade algebra. The comprehensive report on this study is being reviewed by HMH, and at this point, none of the findings have been released. This study was indeed a “randomized control trial.” Each participating teacher taught one randomly selected section of Algebra I using the iPad app, and their other sections continued with the conventional textbook. Riverside provided Empirical Education with data for the two teachers (and their nine sections) who had volunteered in their district. Unlike the other three districts, Riverside took a further step and ran the numbers themselves. That data analysis is where the “Figure 1” in your recent post originated. We applaud the district for taking a look at their own data, but we need to clarify that this doesn’t necessarily represent the overall results once the other districts participating in the study are included (we haven’t yet released the results!).

The confusion increased with this post, which criticized Empirical Education’s work as follows:

“The HMH study was not a randomized, controlled experiment, and relied on self-reports by the two teachers who took part in the study. There was no statistical analysis, and no indication of the extent of the spread in the scores. And though the company surveyed four school districts, the only results released came from one school in Riverside County because each district carried out the study in different ways.”

The study (that is, HMH’s study as opposed to Riverside’s) was conducted the same way in all four districts. And we certainly used appropriate statistical methods. But, interestingly, if she is referring to the Riverside report, they had a (very small) randomized experiment; randomization was “within teacher.” We agree that, while the raw results from Riverside look promising, statistical analysis is important. Probably the major caution your readers should consider is that we can’t generalize the results of Riverside’s mini-experiment to other teachers who may have less experience with or enthusiasm for new technology. Each district that pilots a new program like this should check its own results.

A final note on the Watters article concerns her interview of Dr. Alicia Chang, who makes some errors in her advice to your readers about how to evaluate research.  I base these comments on Empirical Education’s experience working with the US Department of Education, school systems, and other educational publishers (e.g., see the Guidelines we authored that were published by the software industry’s trade association).  First, she states that “studies done comparing classrooms or pre-existing groups are automatically not true experiments,” which is incorrect. Most field experiments in education randomize clusters of students: commonly classes, teachers, grade-level teams, or schools.

Second, she is asked how we can generalize the results of a study, but doesn’t directly answer that question. Instead she mentions design flaws and how they may be caught by reviewers. But a flawlessly designed study may still have a very narrow range of generalization. For example, our study of the iPad app, although well designed, cannot be taken to apply to all math software. This is a very important caution—a flawless experiment conducted in one district may be irrelevant to decisions in a different district. This goes back to our appreciation of Riverside’s examination of their own results. Their study may have been flawed, but it provided them with a snapshot specific to their own school district. The timely examination of their results may have informed their decisions without having to wait for the official results—which may be less relevant to their local conditions—to finally be released.

Denis Newman, Ph.D.
CEO Empirical Education Inc.
dnewman@empiricaleducation.com


  • http://twitter.com/JackCWest Jack West

    A great primer on how to examine research. This is particularly helpful when examining white papers; a tool being used more frequently in education with the edtech boom. 

    Skeptic that I am, I would also encourage consumers of research to consider the value of teacher surveys. If all we ever do is examine how standardized test scores are affected by a given intervention, we are limiting our scope too much. Seasoned teachers know, from a host of measures they don’t have time to synthesize and quantify, when something is improving learning.

    The talent is always in the building. Unless teachers are incentivized otherwise (as might be the case when they are given technology by those who are measuring its effectiveness), they can usually be trusted to accurately report how something is improving their practice.

    • http://twitter.com/aliciac Alicia

      Thanks for your comment, Jack! I’ve worked a lot with teachers and have gotten great feedback from them.

      I agree that survey data can also be extremely useful when the analyses are done correctly. Some things to keep in mind are the sample size, how representative the sample is, and the effect size.

      All in all, quantitative as well as qualitative measures can all be used in solid research designs.

  • Mcfadden Justin

    Too often with new technologies (see any research on 1-to-1 laptop initiatives), the hype outweighs the actual results of the proposed treatment. In a world where we need kids to think more, we need less ‘research’ like this making the media rounds. It makes it seem like the answer is to give students iPads and then they will do better in school, when it’s obvious from looking at ‘studies’ such as this that no one is benefitting except those making money off the initiative.

    • http://mistakengoal.com/ Kevin R. Guidry

      Well said!  An employee of Adobe recently wrote an article (http://mashable.com/2012/01/06/tablet-publishing-education/ ) praising the benefits of tablet computers and it was even less impressive than this article.  My reaction to that article (http://mistakengoal.com/blog/2012/01/08/i-dont-trust-this-article-and-heres-why/ ) was very similar to our reaction to the Wired article discussed here.  There are too many of these utopian fantasies portrayed as objective research summaries!  I wish the media would wise up and stop publishing this stuff but these articles attract a lot more attention than would measured, level-headed, and nuanced research that doesn’t make grand claims or promises.

    • Lucy Buckner

      Justin, 

      Did you actually read the ‘studies’? The Motion Math one in particular has an extremely solid design. I suggest you read it, then come back and support your comment.

  • http://software-carpentry.org/ Greg Wilson

    I wish there was something like the FDA for ed-tech.

  • http://catenary.wordpress.com/ Jorge Aranda

    This is great advice, especially for evaluating controlled (or supposedly controlled) experiments. Two concerns, though. First, it’s dangerous to assume that unless a study is a double-blind controlled experiment it has no value, or even that such experiments are necessarily better than other kinds of studies. Darwin sustained his arguments with purely qualitative data, but they were still enormously powerful.

    Second, even when a controlled experiment satisfies the criteria above, it could depend on terrible constructs or measurements that make it worthless, while a potentially compromised experiment (for instance, due to conflict of interest) with better constructs might actually lead to valuable insights. But there’s no easy way to determine this—you need to develop your judgment.

  • Philipp Schmidt

    Thanks for this Audrey – much needed constructive input to fix the problems that exist in education research. I would argue that the problems run deeper than promoting corporate interests in the form of sponsored research. A lot of the peer-reviewed academic education research relies on case studies and anecdotal evidence. I’m not proposing we switch to the other side (as many in the big data world would suggest) because I do think that learning is complex and data-driven approaches too often measure the things that are easy to measure rather than those that are important. But we have to find a healthy balance (or combine many approaches) in order to advance our understanding of how learning works.

  • Hatch Early Learning

    This advice is a great example of how crucial it is to assess the validity and reliability of such studies when discussing the effectiveness of new products on education. Dr. Kyle Snow from NAEYC spoke on this same issue regarding early childhood education in a webinar (recording linked below), where he discussed the necessity of balancing judgment and maintaining objectivity when evaluating studies based on who sponsored them and how they were conducted. Snow also comments on how to weigh the results of such studies when no studies exist for competing products. (Webinar: http://www.hatchearlychildhood.com/pages/webinar-oct-2011-research-education-technology) (Q&A: http://blog.hatchearlychildhood.com/can-you-trust-businesses)

    Dr. Dale McManis

    Research Director

    Hatch Early Learning

    dmcmanis@hatchearlychildhood.com

  • Charles Bernhard

    This article is a bit of a mess. The way it is written, using words like “tout” and implying that both studies have issues with reliable methods while offering an example only from the Houghton Mifflin study, is an example of inexperienced journalism. This is a case of “let’s stir things up” without real investigation or an informed view on the topic.

  • http://textandpixelreflections.com/ denarosko

    Challenge assumptions: I prefer multi-modal approaches in the true sense of the word, and do not believe that just because a study is randomized that it is “accurate,” valid, or reliable. I find such language pretentious, presumptuous, and dismissive of other ways of knowing, and more importantly, at risk where ethics are concerned. Maternity research provides a sobering example. I find this article biased to a research genre touted by “science” as scientific. I recommend designing research per context; I write more on implications elsewhere. 

    • TR

      What on earth are you talking about? This doesn’t make sense.