FEATURES: Selling Software
By Todd Oppenheimer
How vendors manipulate research and cheat students
When companies that sell instructional software used to come calling on Reid
Lyon, expert on reading instruction and former advisor to President Bush,
he played a little game. First, he listened politely to the sales
reps’ enthusiastic pitches and colorful demonstrations of how
computer software can build reading skills in new ways. Then he asked to
see their technical manuals.
“I always found nothing in there that would help
the consumer determine if this stuff really works,” said Lyon, who
last year ended a 10-year stint as chief of the Child Development and
Behavior Branch at the National Institute of Child Health and Human
Development, which sponsors studies on reading. Regarding software, Lyon
said, he would “rarely see any data that I would consider
credible.” These encounters with software makers happened so
often—three to four times a month in Lyon’s
experience—that he observed a pattern. “They always came in
excited, and left very depressed.”
Educational software makers may get rebuffed by
authorities like Lyon, whose endorsements, companies believed, could lead
to governmental stamps of approval, and thus explosive sales. But they
usually get warmer receptions in the offices of the nation’s school
superintendents, who are, after all, their primary customers. The system
was not supposed to work this way. President Bush’s No Child Left
Behind Act (NCLB) famously requires that any instructional materials
supported by federal aid be proven to work through “scientifically
based research.” Unfortunately, scientific proof is defined in many
ways. In the world of education software, definitions abound.
Wrestling with Slippery Evidence
According to the Institute of Education Sciences (IES),
the primary overseer of research within the Department of Education,
“scientifically based research” fits the following criteria: It
randomly assigns its test subjects to comparable groups; it yields
reliable, measurable data; if the study makes any claims about what causes
its effects, it “substantially eliminates plausible competing
explanations”; its methods are clear enough that other researchers
can repeat or extend them; and, finally, the study has been accepted by a
peer-reviewed journal or equivalent panel of “independent
experts.”
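To make the first of these criteria concrete, random assignment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IES procedure: the roster, the pretest scores, and the function name randomly_assign are all hypothetical.

```python
import random
from statistics import mean


def randomly_assign(students, seed=0):
    """Shuffle a copy of the roster and split it into two halves.

    Because chance alone decides who lands in which half, the two
    groups should look alike on average before any program is used.
    """
    rng = random.Random(seed)
    shuffled = list(students)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical roster of (student_id, pretest_score) pairs.
score_rng = random.Random(42)
roster = [(i, score_rng.gauss(50, 10)) for i in range(200)]

treatment, control = randomly_assign(roster, seed=1)

# Gap in average pretest score between the two groups.
pre_gap = mean(s for _, s in treatment) - mean(s for _, s in control)
print(f"treatment={len(treatment)} control={len(control)} pretest gap={pre_gap:.2f}")
```

If the pretest gap between the groups is small, a later difference in reading scores is easier to attribute to the product rather than to who happened to receive it.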
But not everyone reads bulletins from the Department
of Education, or interprets them the same way. Purveyors of education
products of all kinds make claims that they’re based on scientific
proof, and thus “aligned” to federal requirements. A good many
trot out studies, sometimes great numbers of them, that appear to have
followed one or more steps that are hallmarks of gold standard scientific
research. A select few software programs (such as Cognitive Tutor, an
unusually sophisticated math program, and Fast ForWord, a language program)
have gone through at least some careful vetting. But the vast majority fall
somewhere along the other side of the scale. As Lyon puts it, claims
“are based on any kind of document—whether it’s an
unpublished technical manual, an opinion piece, or an editorial. These
people either don’t understand the law’s requirements, or
they’re trying to game the system.”
Federal authorities were supposed to lend districts a
helping evaluative hand. This was to be done primarily through a project
funded by the IES called the What Works Clearinghouse (WWC). Created in
2002, the WWC gathered a team of top-flight research experts whose mission
was to review studies done on a range of instructional packages—both
traditional and electronic—and rate the quality of their achievement
data (see Figure 1). After four years and an expenditure of $23 million,
the WWC had evaluated studies on only 32 products. While the WWC itself
earned good ratings for the rigor of its work, plenty of people were
frustrated with how little was getting done—and how few studies met
the agency’s standards. After complaints mounted, the WWC sped up its
work. By December 2006, it had reviews out on 51 products. To accomplish
this, the WWC went through 255 studies. Still, the vast majority of studies
(75 percent) did not meet the agency’s scientific standards, even
with some “reservations.”
By this time, critics were calling the WWC “the
Nothing Works Clearinghouse.” The nickname carries an important
double meaning. To some, it’s another example of governmental
blockheadedness—specifically, that understanding how teaching and
learning work in the real world is beyond the skill of a federal agency. To
others, including many leaders in the research community, the message is
actually more harsh. It is that most new classroom gimmicks don’t add
much of value, and studies packaged to suggest otherwise are to be treated
with great suspicion. In fairness, suspicious research sometimes contains
perfectly innocent flaws. That’s because truly scientific research is
extremely difficult, time-consuming, and costly—and thus very
rare—which is precisely why the WWC has found so few studies to be
satisfactory.
To compensate for the WWC’s academic outlook and
pace, many other organizations, both private and governmental, have
developed their own evaluation systems to help schools navigate the
dizzying array of curricular products on the market today. (These include a
RAND Corporation resource, called the Promising Practices Network; the
Comprehensive School Reform Quality Center, from American Institutes for
Research; the Best Evidence Encyclopedia, out of Johns Hopkins University;
and even a global survey called the International Campbell Collaboration.)
While some offer useful information, their criteria and standards vary
widely. And this may further confuse, or mislead, school purchasing agents.
The IES has since 2003 been working on its own
evaluation of educational software, through “gold standard”
methods of scientific research. The ambitious $15 million study, due
sometime early in 2007, has some peculiar characteristics. IES did not
begin by selecting the most popular products, but instead asked software
producers to volunteer; it then chose 15 products from among those that
did. And while the evaluation methods of IES appear to have been exacting,
the study will answer nothing more than the most general question: Does
educational software, as a class, tend to work? The individual evaluations
of the 15 packages will not be released.
This is odd for two reasons. First, the basic question
about software’s general effectiveness has long been answered: as a
whole, it works no better than cheaper traditional materials. (Former North
Carolina State professor Thomas Russell watched so many studies come to
this conclusion over the years that he eventually compiled a book on the
subject. Covering 355 different studies done since the early 1900s, the
book was titled, aptly, The No Significant Difference Phenomenon.) The
more precise answer is that it depends on which software
you’re using, with what ages, and in what circumstances. But if you
are a teacher or administrator trying to make a shopping decision,
“this [study] isn’t going to help you,” admits the IES
study’s lead researcher, Mark Dynarski of Mathematica Policy
Research, Inc., in Princeton, New Jersey. “We’re trying to help
Congress, which is spending more than $700 million a year to support
technology.” As basic as the general answer may be to education
insiders, Dynarski believes it will be news to policymakers. “If it
is this difficult to know whether this stuff helps, why is everyone so
anxious to know whether to purchase it?”
The second oddity of the IES arrangement is the
bargain it struck. To be included in the study, companies had to donate
software for 132 schools and teacher training. In return, they get two
important gifts: a free study, complete with a federal stamp of approval,
and the study’s individual evaluations. Companies can package and
spin those evaluations however they like, since no one besides IES will
have those details. “I don’t know what they will do with the
data,” Dynarski said.
All of which raises a very large and thorny question:
What really does happen, on the ground, inside the schools, when research
spin and marketing hype collide with desperate classrooms?
Money, Money Everywhere
Software sales to schools have certainly been robust.
According to Simba Information, a media analysis firm based in Stamford,
Connecticut, the nation’s K–12 schools bought $1.9 billion of
electronic curricular products in 2006. While that is less than a fourth of
the instructional materials market as a whole, the electronic
sector’s growth has been vigorous—up 4.4 percent from 2005 to
2006, as compared with the 2.6 percent growth rate of the overall
instructional products market.
Among the many electronic products that schools buy,
the most visible have been those geared toward reading. Building up the
nation’s reading skills was of course the main impetus behind No
Child Left Behind. Today, schools can pick from a cornucopia of federally
funded initiatives in this domain. There is Reading First, which aims $1
billion a year at grades K–3; various funds for
“supplemental” products; special programs to promote
educational technology or “comprehensive school reform”; and
numerous initiatives under Title I, the overarching federal fund for poor
students.
This plethora of options—and money—has
produced an abundance of new marketing opportunities for software
companies. Strangely, one of the richest of those is NCLB’s
scientifically based research requirement. Conceived as a strict rule with
clear methodological standards, it has instead become a versatile
tool—a Cuisinart for raw statistics, ideal for marketing hustle and
deception. It has not helped matters that, as with many laws, it is beyond
the ability of those it affects most to understand its crucial details; nor
did the law’s creators endow it with any sort of enforcement system.
L.A.’s $50 Million Gamble
Consider the story of Waterford Early Reading,
distributed by Pearson Digital Learning. Pearson is the nation’s
leading seller of educational software, and its Waterford program is used
in 13,000 classrooms in all 50 states. The WWC has not yet evaluated
Waterford, but it is one of the 15 products that IES has elected to study.
So what could be known about products like Waterford if
government evaluators chose to look into and report on the daily
experiences of teachers and students? One particular school district, Los
Angeles Unified (LAUSD), has a long and remarkably troubled history with
this product. In July of 2001, LAUSD decided to spend nearly $50 million on
Waterford, instantly making itself the company’s largest customer.
Waterford is designed for the earliest readers,
students in grades K–2. It requires students to spend 15 to 30
minutes a day with various multimedia exercises, and costs $200 to $500 per
student. In launching the program, Roy Romer, then LAUSD superintendent,
said this “is like putting a turbocharger in a car engine. We are
going to accelerate reading performance in Kindergarten and first
grades.”
Several years later, the district’s own
evaluation unit pronounced the program a failure. In a 2004 report, its
second with negative findings, the evaluators said, “There were no
statistically significant differences on reading assessments between
students who were exposed to the courseware and comparable students who
were not exposed to the courseware.”
Some teachers found the program helpful, but many did
not. Pearson and other supporters of the program argued that
Waterford’s effectiveness was compromised by the fact that teachers
didn’t fully use the product—and when they did, they often used
it incorrectly. But L.A.’s last evaluation found that “neither
the amount of usage nor the level of engagement had an impact on
achievement.”
The report, which did not make the news until early
2005, stunned L.A. school officials. Even Romer backtracked. “As I
looked at this, it didn’t provide as much bang for the buck as I
would have liked,” he told the Los Angeles Times. The district has since scaled
back the Waterford program, using it as more of a sideline specifically for
students with learning difficulties. (One problem was that Waterford was
taking time away from students’ primary literacy lessons, thereby
causing actual declines in achievement.) Teachers and administrators both
say sidelining Waterford has helped. But it also means that the district is
getting a lot less for its $50 million than it planned on. School board
members soon questioned the wisdom of the whole venture.
What happened here? And what lessons do
Waterford’s rise and partial fall in Los Angeles offer education
policymakers, not only in other states but also in Washington?
Lesson One
The first lesson is to beware of seemingly persuasive
numbers. Many curriculum producers started promoting their
“scientific” research very soon after NCLB required it—an
impossible feat, considering the many years it takes to conduct solid
scientific studies.
How does questionable research get produced? In
Waterford’s case in Los Angeles, Julie Slayton, an analyst in
LAUSD’s Program Evaluation and Research Branch and one of the authors
of the Waterford evaluations, says Pearson did not try to tilt the
evaluators’ basic data, “but they did their best to make the
report come out favorably. They tried to make it focus on implementation
instead of effectiveness.” In other words, Pearson wanted the
question to be about the district’s wobbly use of Waterford—not
whether the program itself inherently worked.
After failing to change the district’s opinion,
Pearson prepared a preliminary evaluation of its own—a
“briefing packet” for Superintendent Romer, full of numbers
indicating that Waterford was producing dramatic achievement gains. Ted
Bartell, director of research for LAUSD, was not pleased. In a memorandum
dated May 8, 2002, he urged Pearson to spell out “the methodological
limitations” of its studies. Bartell argued that Pearson’s
sample sizes (60 students) were too small for solid conclusions to be drawn
from them; that there was no evidence that Waterford and not other factors
caused the gains; and that the gains were too small to be meaningful in any
case, or to be representative of the district as a whole.
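Bartell’s first objection, about sample size, can be illustrated with a standard rule of thumb from statistical power analysis. The sketch below is not a reconstruction of his memo or of Pearson’s analysis. It assumes, hypothetically, that 60 students were split into two equal groups of 30 (the memo as described does not say how they were divided), and it uses the common textbook approximation for the smallest effect a two-group comparison can reliably detect, expressed in standard deviations.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Approximate smallest true effect (in standard deviations) that a
    two-group comparison of means would detect with the given power,
    using a two-sided test and the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # extra margin needed for power
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


# Hypothetical group sizes: 30 per group corresponds to 60 students total.
for n in (30, 300, 3000):
    print(n, round(minimum_detectable_effect(n), 2))
```

Under these assumptions, a study with 30 students per group would reliably detect only effects of roughly 0.7 standard deviations or more, which is very large for a classroom program. Smaller but still real gains would be hard to distinguish from chance.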
In the following months, Pearson continued to generate
numbers indicating success, but the data drew from problematic sources.
Some used the Academic Performance Index (API), which covers all grades in
a school; this confounded the picture of K–2, where Waterford was
used. Some involved the California English Language Development Test
(CELDT), which is meant only for non-English speakers and tests advanced
literacy skills that Waterford doesn’t directly teach. Others drew
from reading inventories created by Pearson itself. “They kept using
gains and measures that are not relevant to their program,” says
Lorena Llosa, a former LAUSD research analyst who was completing a
doctorate in applied linguistics that drew on CELDT data.
In retrospect, Slayton says, “Pearson does an
enormously aggressive job. They pressured us. They are in your face. They
are obnoxious, and they don’t go away. They are very, very good at
giving people a show.”
Before Los Angeles invested in Waterford, there were
plenty of signs that the program might not work quite as the company
promised. For years, Pearson, like many companies, had been gathering
studies that offered evidence of Waterford’s effectiveness. But
various independent evaluators had pointed out that the bulk of these
studies have methodological problems (lack of control groups, small sample
sizes, missing information, numbers based on subjective survey data, and so
forth). “In light of these limitations,” the LAUSD evaluators
said in their first report, in 2002, “much of the information from
these evaluations should be interpreted with caution.”
Lesson Two
Why didn’t Los Angeles officials do that? One
reason is that researchers who evaluate classroom exercises and educators
who work inside those classrooms represent two often conflicting
cultures—this is lesson Number Two. As an illustration, when Ronni
Ephraim, the district’s chief instruction officer, was asked if she
had looked at the research on Waterford before supporting its use in LAUSD,
she said she had not, largely because it did not seem relevant.
“Every classroom situation is different,” she says. “And
nothing compares to L.A. I’d rather listen to my own teachers.”
Ephraim’s worldview is broadly shared. To the average administrator,
the sensations of success or failure inside your own classrooms are going
to feel a lot more relevant than abstract statistics drawn from schools on
the other side of the country. Is it any wonder, then, that NCLB’s
scientific research requirements have been so widely ignored?
To researchers, however, Ephraim’s way of
thinking can make an instruction method look like it’s working when
it’s not. All too often, some other environmental factor is driving
the improvement; sometimes, in fact, the gains are just normal growth
associated with getting older. For reassurance that none of this is the
case, researchers commonly begin by looking for two facts in particular:
first, studies of the program published in independent, refereed
journals—ideally those of high repute; and second, the
researchers’ use of truly comparable groups.
So far, very few commercial programs meet these
standards—although many claim they do. One such example is
Renaissance Learning, Inc., whose lead product, Accelerated Reader (AR), is
used in more than half the nation’s public schools. Renaissance has
built such a following behind AR that it regularly holds massive annual
conferences that feel like religious revival meetings. Testimonials at
these conferences are typically adorned with lengthy, seemingly solid
studies proving AR’s power. Yet none of these studies have held up to
serious scrutiny. “These studies all suffer from serious confounds or
design problems that make it impossible to show that AR improves
reading,” says Tim Shanahan, professor of urban education at the
University of Illinois, Chicago, and director of its Center for Literacy.
Shanahan also was a member of the National Reading Panel, the group of
reading experts chosen by congressional mandate that published an exacting
report in 2000 evaluating decades of research on reading instruction.
Other commercial packages, both computerized and
traditional, have not fared much better. In the What Works Clearinghouse
ratings, not a single product has more than one study fully meeting WWC
research standards. This includes two well-respected software packages: I
CAN Learn and Cognitive Tutor. It should be noted, however, that a rigorous
study does not by itself show that a product is effective. Fast ForWord, for
example, had one study that cleared the WWC’s top bar. That study
found that while Fast ForWord was effective with language development, it
was ineffective with reading achievement.
Although Waterford has yet to be reviewed by the WWC,
it has been treated to four evaluations published in peer-reviewed journals. The first was a 2002 study in the Journal of Experimental Child Psychology, which found Waterford to be helpful in only one of nine areas
the researchers tested. The second, in 2003, was a mixed evaluation in Reading Research Quarterly, the
journal of the International Reading Association. Why the ambivalence?
“The more careful the study, the more mixed the results. That’s
the punch line,” explains Tracy Gray, an expert on educational
technology with American Institutes for Research.
The next two studies—a 2004 evaluation in the Journal of Literacy Research (JLR),
the publication of the National Reading Conference, and a 2005 article in Reading & Writing Quarterly—were
both positive. But it’s not clear these were full or fair contests.
Neither study says much about what the non-Waterford students did while
their peers played with the new computers; typically, the “control group” gets nothing of comparable novelty or potential power. The 2004
study, Lyon says, suffers from “fatal flaws.” And the 2005
study used unusually small test pools: 46 students, who were divided into
four even smaller sub-groups. Three of these four, including a
non-Waterford group, all made gains. So the study’s praise for
Waterford rests on one small group: 12 low-performing 1st graders who did
not use Waterford. “That’s promising, but it’s not enough
evidence to justify spending a lot of money,” says Shanahan.
Pinning Down the Truth
These varied interpretations illustrate yet another
lesson: Experts don’t all agree on what constitutes good research. As
an example, Wayne Linek, professor of education at Texas A&M
University, in Commerce, Texas, and co-editor of the JLR, finds Lyon’s view
too narrow. True experimental designs that compare treatment groups to
untreated “control” groups can turn students into “guinea
pigs,” which Linek considers unethical. Furthermore, Linek says,
these kinds of studies can lean on quantitative measures—frequency of
eye movements, for example, or recognition of certain letters—that
are suspect and thus “bad predictors.”
Linek is partly right to complain about the current
obsession with measurable data, which is a double-edged sword. On the one
hand, the data are critical—they allow other researchers to replicate
the original study’s findings, or take them further. On the other
hand, any study that looks for data in obscure factors like eye movements
can be justly criticized for missing the main event, despite the fact that
it qualifies for publication in any number of relatively credible journals.
Linek’s first point raises an even more
important question: What’s wrong with turning students into guinea
pigs, anyway? Medical researchers do this with people all the time. If
mistakes are going to be made, isn’t it more humane to make them with
a small number of test subjects than with the general population?
While the research community debates such questions,
the commercial sector has felt free to devise its own interpretations. Andy
Myers, chief operations officer for Pearson Digital Learning, takes a sunny
view of Waterford’s scientific grounding. “In our
experience,” he says, “the more in-depth the evaluation process
is, the more the Waterford program shines.” When asked about the
weaknesses that Lyon and others see in the Waterford studies, Myers
acknowledged the studies’ limitations. Then, echoing Ephraim,
L.A.’s chief instruction officer, Myers questioned the relevance of
the research itself. “The studies may or may not meet the rigorous
standards of the What Works Clearinghouse,” he said. “But
what’s more important to a district is, ‘Does it work with our
students?’ Then they’re going to want to expand it.”
When a prominent company like Pearson dismisses the
quantitative research, it exploits the fact that teachers and their
administrators don’t understand it, and excuses them for disregarding
it.
If the education world has been living in a kind of
scientific denial, that era may be fast drawing to a close. At this point,
the advocates of classic, quantitative science clearly have the ear of the
Bush administration. And they are setting today’s education
standards. With its next round of software research, IES will release
individual evaluations. It is also expanding its evaluations to include
textbooks. Eventually, a complete list of governmentally approved product
ratings will be just a mouse click away. That means that someday, it will
be easy for the marketplace—namely, district superintendents and
their purchasing agents—to embrace high scientific standards. When
that time comes, the spoils won’t go to companies that have busied
themselves ginning up studies full of tilted numbers. The victors will
instead be those who practice true R&D—that is, companies that
dare to use these intervening years of confusion to subject their materials
to research that is both rigorous and independent.
Todd Oppenheimer is the author of The Flickering Mind: Saving Education from the False Promise of
Technology, which was a finalist for the 2003
book award from Investigative Reporters & Editors. He can be reached at
www.flickeringmind.net.