Tinkering with TIMSS
By Gerald W. Bracey
Despite Commissioner Forgione's attempt to put the best face on it, the final-year TIMSS study does not permit a comparison of achievement across nations, Mr. Bracey avers. When people make such comparisons, they are engaging in a political exercise, not an intellectual one.
Illustration © 1998 by Deborah Zemke
COMMISSIONER Pascal Forgione's article in the June 1998 Kappan, "Responses to Frequently Asked Questions About 12th-Grade TIMSS," answers some questions, but it leaves others dangling and finesses some with sophistry. Indeed, the title of his article perpetuates a myth that is among the worst distortions of the whole matter: that this study is about "12th grade." In many countries, it is about grade 13 or grade 14 or about students in their third year of study of a subject or about students enrolled in a curriculum concentrated on math and science and little else.
Commissioner Forgione attempts to deal with the 12th-grade issue when he says, "The purpose of this component of TIMSS was not to compare students of the same age or years of schooling, but rather to compare students at a similar point in the education system: the end of secondary school." If that be true, the release of the TIMSS data could perhaps be considered the worst public relations debacle in the history of the U.S. Department of Education. Here is the view of TIMSS that most people could deduce from the TIMSS press releases: Around the world, a school is a school is a school. In these schools, our 12th-graders went up against their 12th-graders and got trounced, and even our best 12th-graders went up against their best 12th-graders and got trounced.
Comments from the media and from educators clearly demonstrate that this interpretation is the one that holds sway:
American high school seniors have scored far below their peers from many other countries on a rigorous new international exam in math and science.1 (Emphasis added.)
American high school seniors -- even the best and brightest among them -- score well below the average for their peers in other countries.2 (Emphasis added.)
[TIMSS] presents findings concerning the standing of U.S. 12th-graders . . . compared to their peers participating in TIMSS.3 (Emphasis added.)
The conclusion is unmistakable: The longer students stay in American schools, the farther they fall behind their age-mates in most industrialized nations of the world.4 (Emphasis added.)
Some of our science and math folks have been saying for years that our best kids are the best in the world. Well, they're not.5
American 12th-graders scored near the bottom of the industrialized world.6
American 12th-graders [scored] at the very bottom of the rankings.7
If our students are so bad, why is the economy so good?8
Hey, we're No. 19!9
I have done my share to blast the media for distorting reports on the condition of American education, but I find it inconceivable that so many reporters, pundits, and educators could have misinterpreted the data in precisely the same way. It's about students, they wrote, not systems.
Indeed, while the official TIMSS report Mathematics and Science Achievement in the Final Year of Secondary School makes it clear that the study is about systems and that the systems differ greatly,10 Pursuing Excellence, the report from the U.S. Department of Education, focuses on students and emphasizes the similarities across nations:
As is discussed in more detail in chapter 4, the most recent data indicate that in most countries participating in TIMSS secondary school enrollment rates are similar to that of the United States. Not only do the TIMSS countries have most of their secondary school-age population enrolled in school, the strict quality controls discussed earlier ensured that the sample of students taking the mathematics and science general knowledge assessments were representative of the entire population at the end of secondary school.11 (Boldface in the original.)
Pursuing Excellence then acknowledges that "there are still many other differences between the secondary school systems of the countries," but it employs specious logic to contend that the comparisons are still "entirely appropriate" because "the end of secondary school is the culmination of each country's attempts to prepare all young people for living in society." This establishes a new, curious, and, at best, squishy criterion for determining the comparability of nations: "readiness for living in society."
The "quality controls," of course, did not work. They did not "ensure" representative samples. Indeed, only five of the 21 countries that took part in the general knowledge tests and the advanced math test met the criteria for sampling and participation rate, and only six met the criteria for the physics test. The quality controls didn't work, but Pursuing Excellence acts as if they didn't even exist: "On the mathematics portion of the general knowledge assessment, U.S. students scored below the international average and among the lowest of the 21 countries" (p. 26). An identical statement is made for the science results.
Of what use are criteria for quality control if violating them makes no difference to the discussion of results? Commissioner Forgione is silent about the existence or importance of the quality controls. But if one assumes that the criteria actually bear on the validity of the data, one is left with a TIMSS "final year" study that includes only five countries.
Let me now take up each of Commissioner Forgione's six questions.
1. Differences in age. Forgione deals with and dismisses the age differences by using enrollment rates for the various countries. But that just won't wash.
Consider tables A 5.18 and A 5.20 in Pursuing Excellence. These tables divide countries into three groups: those that improved relative to their eighth-grade performance, those that remained the same, and those that declined. They also show the age differentials between students tested at grade 8 and in their final year for these three groups. The five countries that improved their relative standings in mathematics were those whose students had the youngest average age at grade 8 and the oldest in the final year (14.0 and 19.3 years, respectively, for a difference of 5.3 years). The group that declined in their relative standings in mathematics were those whose students had the oldest average age at grade 8 and the youngest in the final year (14.4 and 18.0, respectively, for an average age differential of 3.6 years). The group showing no change, not surprisingly, was in the middle: 14.2 years and 18.8 years, for a difference of 4.6 years. The results for science are identical. Age matters.
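The arithmetic behind these age differentials can be checked with a short sketch. The ages are the ones quoted above for the mathematics comparison; the variable names and group labels are illustrative shorthand, not terms taken from Pursuing Excellence.

```python
# Average ages (in years) at grade 8 and in the final year, as quoted in the
# text for mathematics. The group labels are illustrative shorthand.
age_by_group = {
    "improved": (14.0, 19.3),
    "no change": (14.2, 18.8),
    "declined": (14.4, 18.0),
}

# Years elapsed between the two testing points for each group.
age_gap = {
    group: round(final_year - grade8, 1)
    for group, (grade8, final_year) in age_by_group.items()
}

# How much more time the improving group had than the declining group.
extra_years = round(age_gap["improved"] - age_gap["declined"], 1)

for group, gap in age_gap.items():
    print(f"{group:>9}: {gap} years between grade 8 and final-year testing")
print(f"improved vs. declined: {extra_years} additional years")
```

On these figures, the group that improved had well over a year more between the two tests than the group that declined.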
Forgione also ignores differences in the structure of education. He fails to note that some countries tested only students in programs that concentrate on mathematics and science. He observes that Norwegian and Swedish students start school later than American students, which affects when they finish, but he doesn't mention that these students have studied physics for three years. Neither he nor anyone else has dealt with the extraordinary performances of Greece and Cyprus, which were near the bottom in grades 4 and 8 and on the final-year general knowledge test, yet performed well in advanced math or physics. The top 25% of Cypriot students did not score as high as the average student in 12 of the 20 other countries, yet Cyprus topped the world on the calculus items. This result strikes me as . . . impossible.
It is certainly legitimate to debate whether or not American high schools should offer multi-year programs in the sciences. It is also legitimate to wonder how the nations with multi-year programs manage to produce and recruit sufficient numbers of qualified teachers (assuming they do), given the number of teachers already teaching math and science out of field in this country. But it is not legitimate to compare students with one year of physics study to those with three.
2. Differences in enrollment rates. Commissioner Forgione states that the secondary enrollments of the various countries are "roughly comparable." What on earth does "roughly comparable" mean in a research study? Is a country with only 77% of its students enrolled "roughly comparable" to one with 97%?
Forgione further contends that higher enrollment rates are associated with higher scores. But this is because enrollment rates are confounded with age. The impact of age on the TIMSS results can be seen from a different perspective when one considers the following reproduction of part of Table A 5.14 from Pursuing Excellence:
| Country | Percent of 18-Year-Olds Still in Secondary School |
| --- | --- |
| (Australia) | 32 |
| (Austria) | 56 |
| (Canada) | 34 |
| (Denmark) | 71 |
| (France) | 60 |
| (Germany) | 82 |
| Hungary | 65 |
| (Iceland) | 65 |
| New Zealand | 33 |
| (Norway) | 83 |
| (Slovenia) | n.a. |
| Sweden | 87 |
| Switzerland | 75 |
| Average | 60 |
| (United States) | 22 |
Readers should note that the countries in parentheses are those failing to meet sampling and/or participation rate criteria. In addition, the average (mean) is distorted, and lowered in this case, by extreme values. The median percentage of 18-year-olds still in secondary school in the other nations is 65%.
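The contrast between mean and median can be checked directly from the rows reproduced above. The sketch below is only an illustration: the dictionary name is hypothetical, Slovenia is omitted because no figure is given, and because only part of Table A 5.14 is reproduced here, the mean of these twelve values need not equal the printed average of 60.

```python
from statistics import mean, median

# Percent of 18-year-olds still in secondary school, as reproduced in the
# table above. Slovenia (n.a.) is omitted; the United States is kept separate.
pct_enrolled_at_18 = {
    "Australia": 32, "Austria": 56, "Canada": 34, "Denmark": 71,
    "France": 60, "Germany": 82, "Hungary": 65, "Iceland": 65,
    "New Zealand": 33, "Norway": 83, "Sweden": 87, "Switzerland": 75,
}
us_pct = 22

values = sorted(pct_enrolled_at_18.values())
other_median = median(values)   # middle of the sorted values
other_mean = mean(values)       # pulled toward the extreme low values

print(f"median of listed countries: {other_median:.0f}%")
print(f"mean of listed countries:   {other_mean:.1f}%")
print(f"United States:              {us_pct}%")
```

On the listed values the median is 65, matching the figure in the text, while the three countries in the low 30s pull the mean below it.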
These enrollment percentages are for nations that scored higher than the U.S. on the mathematics general knowledge test. Clearly, the composition of secondary schools looks different in some of them. In the U.S., almost four-fifths of 18-year-olds have left secondary school.
Forgione claims that only 75% of our 17-year-olds are enrolled in secondary school, compared to an international average of 82%. Thus, he contends, if lower enrollments favor higher scores, the U.S. is actually at an advantage and should have done better. He also notes that Pursuing Excellence shows that fully 87% of people in the U.S. between the ages of 25 and 34 have completed high school. The 75% figure does not make sense. If only 75% of our 17-year-olds are in school, how could 87% of our 25- to 34-year-olds have completed secondary school, a figure in line with other attainment statistics?
Moreover, Forgione's own NCES publication, The Condition of Education 1996, gives the enrollment of 17-year-olds as 92.4% for 1994, the highest rate ever (p. 40, though this table does not appear in the 1997 edition). Forgione's 75% figure comes from the 1997 edition of the OECD's Education at a Glance.12 The 1996 edition of Education at a Glance puts the figure at 86%. Such a large change in one year is highly improbable, particularly considering that, for other countries, the 1996 and 1997 figures are virtually identical.
3. Defining the U.S. mathematics population. I cannot comprehend Commissioner Forgione's exposition of this issue. Twenty-three percent of the items on the advanced mathematics test presumed that a student had taken a calculus course. Yet we also tested our students who were enrolled in precalculus classes. These students scored 100 points lower than American students who were actually wrapping up a year in calculus. The calculus group scored just below the international average. Commissioner Forgione justifies the inclusion of precalculus students on the grounds of "fairness." To be "fair," our sample had to be as large as those of other nations. "Would it be fair," he asks, "to compare 7% in the U.S. with over 16% in Canada, 20% in France, or 33% in Austria?"
Is it fair to test students on material they haven't studied? Fairness is an issue to which Forgione should be acutely sensitive. He was director of testing for the state of Connecticut for a number of years. During his tenure, the minimum-competency-testing madness swept the country, culminating, in one sense, with the 1978 ruling in Debra P. v. Turlington. In that decision, the court held that for a test to be used to qualify students for high school graduation, the state had to prove that the test had instructional validity. That is, the state had to show that students had had an opportunity to learn the material tested. This is fair.
As a minor digression, it is worth noting that these issues were discussed at considerable length in The Courts, Validity, and Minimum Competency Testing, edited by George Madaus of Boston College, an institution that serves as the headquarters for the TIMSS project in the U.S.13 William Schmidt, the TIMSS director of research, and several of his colleagues contributed a chapter, during the course of which they stated, "If tests without instructional validity are being used for certification, the students who fail such tests are being penalized for the failures of the schools and teachers -- and not for their own inadequacies. The rational basis for judging student performance in school is undermined."14 Since we know that the TIMSS advanced math test cannot have instructional validity for American precalculus students, Schmidt's analysis applies, although it is the selection of the students, not the failures of teachers and schools, that is producing the low scores.
Kappan readers should peruse the Madaus volume as a model of careful attention to psychometric issues -- in stark contrast to the general sloppiness surrounding the discussion of TIMSS. For instance, to the best of my knowledge, no discussion of TIMSS tests has even mentioned the sine qua non of tests: their reliabilities. Indeed, for TIMSS they are not impressive. The general knowledge test might be considered acceptable, but the advanced test is clearly iffy, and no self-respecting psychometrician would accept the reliability of the physics test: the coefficient ranges from .48 to .77 depending on country, with a median of .70, a value that accounts for less than half of the variance. (The mean value of the coefficients is .67.)
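The "less than half of the variance" remark can be sketched as a one-line calculation. This follows the reading applied in the text, in which the coefficient is squared to obtain a share of variance; that reading is taken as an assumption here, and the labels are illustrative.

```python
# Physics-test reliability coefficients quoted in the text.
reliability = {"lowest": 0.48, "mean": 0.67, "median": 0.70, "highest": 0.77}

# Assumption carried over from the text: squaring a coefficient gives the
# share of variance it accounts for.
variance_share = {label: round(r ** 2, 2) for label, r in reliability.items()}

for label, share in variance_share.items():
    print(f"{label:>7}: r = {reliability[label]:.2f}, r squared = {share:.2f}")
```

Under that reading the median coefficient of .70 corresponds to .49, just under half.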
One wonders if the low reliabilities are partly a function of poor items. Elsewhere in this issue, Jianjun Wang provides a sample of TIMSS items that exhibit a variety of technical problems, including improper keying, multiple correct answers, and inappropriate scoring guidelines. It is not clear from Wang's essay how many items suffer from such worrisome qualities.
Our precalculus students are not at risk of not graduating; the point is that, ever since Debra P., the necessity of ensuring that tests match what was taught has been an important consideration in the construction of many tests. But we know that 50% of American students had not had an opportunity to learn at least 23% of the items on the TIMSS advanced mathematics test. We also know that American TIMSS officials had an opportunity to judge the fairness of the items but chose not to -- a choice that defies comprehension.
In the interest of fairness, I should note that Forgione was not commissioner when TIMSS was designed. That he should so vigorously defend the indefensible, though, simply shows the extent to which technical and professional considerations in TIMSS have been subordinated to political considerations.
In connection with sample definition, Commissioner Forgione presents the average percentage of the population tested across countries as 19%. This is the mean of all nations. But a quick look at the percentages of tested students shows that the mean is misleading; again, the median is the appropriate statistic. Slovenia reports a 75% figure, and this greatly influences the mean. Finally, if the Forgione doctrine of sample-size "fairness" is to be used, why wasn't it applied to the Russian Federation (2%), Lithuania (3%), Cyprus (9%), or Greece (10%)?
Again, it is legitimate to ask why such a relatively small percentage of American students are actually taking calculus in high school. The answer would no doubt be that, while Europe started moving calculus and analytic geometry into the secondary schools around the turn of the century, it was only after Sputnik that American high schools started offering calculus, which was still seen largely as a college-level course. When European secondary schools began offering calculus, American secondary education was expanding rapidly, and secondary teachers had little familiarity with the subject.15 The proper place for calculus in a course of study can be rethought, and maybe it should be. But testing precalculus students on a test that presumes calculus advances nothing -- least of all "fairness."
Commissioner Forgione engages in some slippery-slope thinking in arguing that maybe our precalculus students are not at a disadvantage after all, because "in none of the countries were students selected on the basis of whether they had taken calculus." One must wonder, then, why 23% of the items covered calculus. There are a number of reasons why students in other countries would be more likely to encounter calculus, prime among them being that many were in science/mathematics-tracked curricula.
Forgione's contention is also refuted by the test/curriculum matching analysis. This analysis was conducted because, "when comparing student achievement across countries, it is important for the comparisons to be as fair as possible."16 Thus curriculum experts rated whether or not each item was covered in a country's curriculum. For the advanced math test, the maximum score was 82. Calculus items accounted for 23% of the items. The mean number of items rated as fitting the curricula of various countries was 73, a reduction of only about 11%, and for only two countries did the unmatched items amount to 20% of the total.
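The comparison at the heart of this argument is simple enough to sketch. The figures are those given in the text; treating the maximum score of 82 as the base for the percentage is an assumption made for illustration, as are the variable names.

```python
max_score = 82         # maximum score on the advanced math test
mean_matched = 73      # mean number of items rated as fitting national curricula
calculus_share = 0.23  # share of items that presumed calculus

# Items that, on average, fell outside a country's curriculum.
unmatched = max_score - mean_matched
unmatched_share = unmatched / max_score

print(f"unmatched on average: {unmatched} of {max_score} ({unmatched_share:.0%})")
print(f"calculus items alone: {calculus_share:.0%}")
```

The average unmatched share is well under the 23% taken up by calculus items, which suggests that most countries rated the calculus items as part of their curricula.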
Lest readers wonder why the calculus items were not flagged as not matching our curriculum, let me point out that this curriculum matching was not performed in the United States. U.S. participants in TIMSS decided to accept uncritically all items at all grades -- just to see how the students would cope. Given the discussion of instructional validity above and given the stakes riding on the TIMSS tests, this strikes me as highly irresponsible.
4. Unbelievable numbers. Data obtained after publication of my May Kappan article on TIMSS17 have convinced me that the TIMSS figure on the proportion of American students who work more than three hours a day (55%) is accurate with regard to American high school seniors. Forgione's comments do not address the most salient aspect of this figure: that it is unique among nations.
In fact, the official TIMSS report spins these figures curiously. The report states that, while in about half of the countries few students work much, "in Australia, Canada, Iceland, the Netherlands, New Zealand, Norway, and the United States, at least one-fourth of students reported working for three hours or more each day."18 This is true, but misleading. For all these countries except Canada, the proportions are barely one-fourth and are less than half of the U.S. figure. The proportions:
| Country | Students Working Three or More Hours a Day |
| --- | --- |
| United States | 55% |
| Australia | 25% |
| Canada | 39% |
| Iceland | 26% |
| Netherlands | 26% |
| New Zealand | 27% |
| Norway | 27% |
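A quick check of the "less than half of the U.S. figure" claim, using the proportions listed above. The names are illustrative; only the percentages come from the text.

```python
# Percent of students reporting three or more hours of work a day.
pct_working = {
    "United States": 55, "Australia": 25, "Canada": 39, "Iceland": 26,
    "Netherlands": 26, "New Zealand": 27, "Norway": 27,
}

us = pct_working["United States"]

# Countries whose proportion is less than half of the U.S. figure.
below_half_of_us = sorted(
    country
    for country, pct in pct_working.items()
    if country != "United States" and pct < us / 2
)

print(f"half of the U.S. figure: {us / 2}%")
print("below that threshold:", ", ".join(below_half_of_us))
```

Every country in the list except Canada falls below the threshold.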
It is also true that, in all but three of the other 20 countries, the relationship between work and performance is linear: the more hours worked, the lower the score. In the U.S. it is curvilinear. Those who don't work scored 484; those who work up to 14 hours a week scored 506 (above the international average for all students); those who work 21 to 35 hours a week scored 474; and those who work more than 35 hours a week scored 448. This curvilinear relationship in the U.S. might be for reasons different from those in Austria, Canada, and New Zealand. Students in the U.S. who don't work at all might well be low-income minority youngsters living in areas where jobs are hard to come by. The usual interpretation of the curvilinear relationship, though, is that some work teaches time management and responsibility.
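The difference between a linear and a curvilinear pattern can be made concrete with the U.S. scores quoted above. This is a minimal sketch: the band labels are paraphrased from the text, and no score is supplied for hours between 14 and 21 because the text gives none.

```python
# (hours-worked band, mean score) for U.S. students, in the order quoted.
us_scores = [
    ("no work", 484),
    ("up to 14 hours a week", 506),
    ("21 to 35 hours a week", 474),
    ("more than 35 hours a week", 448),
]

scores = [score for _, score in us_scores]

# Change in score from each band to the next.
steps = [later - earlier for earlier, later in zip(scores, scores[1:])]

# A strictly "more work, lower score" pattern would have no positive step.
falls_at_every_step = all(step <= 0 for step in steps)
peak_band = max(us_scores, key=lambda pair: pair[1])[0]

print("step changes:", steps)
print("falls at every step:", falls_at_every_step)
print("highest-scoring band:", peak_band)
```

The first step is upward and the rest are downward, so the scores peak among students who work a moderate number of hours rather than among those who do not work at all.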
I mention these issues but cannot resolve them. And that is the point: there are cultural and social variables whose impact is known, or at least likely, that remain impossible to quantify or factor out.
5. Cultural differences. No doubt part of the performance of American students has to do with the "school-related factors" to which Forgione alludes. This conclusion can be derived from earlier TIMSS curriculum and video analyses, and I would not wish to diminish the importance of rethinking both the U.S. math and science curricula and how they are taught. At the same time, I question some of the analyses alleging that cultural variables had little or no effect. Given the large proportion of American students who work and the clear impact of that work on performance, I wonder what kind of statistical analysis would have even been possible to lead Forgione to conclude that work had no "statistically significant relationship to our relative performance."
6. The relevance of the results. Commissioner Forgione observes that 38% of our high school graduates do not go on to college. I noted the importance of that fact in an op-ed piece in USA Today on the day the TIMSS data for the final year of secondary school were released.19 Students who slide through high school and don't go to college are at a decided disadvantage in competing for decent jobs.
Aside from that important fact, this section is mostly an assortment of the kind of "scare statistics" that surrounded the propagandistic release of A Nation at Risk and the establishment of grossly unrealistic performance levels by the National Assessment of Educational Progress.20
I do not believe that Forgione can provide concrete support for his statement, "For our long-term economic well-being, nearly all experts in the field of national economic productivity believe that the knowledge and skill levels of America's students in science, mathematics, and technology will become increasingly important." This is the kind of mythical mantra some reformers have been mindlessly chanting for years. It was debunked in the New York Times a couple of years ago. Said Times education writer Peter Applebome, "Many educators and economists are increasingly skeptical of the notion that better schools mean a more prosperous nation."21 Peter Cappelli of the University of Pennsylvania was quoted as stating that "the link between education and the national economy is tenuous in all but the grossest sense -- say the difference between developed and undeveloped nations."
Finally, none of Commissioner Forgione's remarks address a factor of some import: motivation. I do not know about the presence of a "senior slump" in other nations, but here is what one reporter recently said about the phenomenon in this nation: "Donny Watkins, a slender 18-year-old, wants to spend this spring at Arlington's Wakefield High School doing what most American high school seniors do in their last semester. Next to nothing."22 The students tested for the final year study were tested as close to the end of the school year as possible, and that makes sense if you want them to finish their calculus or physics courses. So our 12th-graders were tested in May. American 12th-graders . . . tested . . . in May?
The test results of the TIMSS final year study raise some interesting issues, but they do not permit a comparison of achievement across nations. When people make such comparisons, they are engaging in a political exercise, not an intellectual one. Indeed, hearing and reading the various discussions of TIMSS, I have come to wonder whether, as a condition of receiving a political appointment, one must agree to leave one's technical expertise at the door.
2. Debra Viadero, "U.S. Seniors Near Bottom in World Test," Education Week, 4 March 1998, p. 1.
3. Gilbert A. Valverde, "TIMSS High School Results Released," U.S. National Research Center, Report No. 8, Michigan State University, April 1998, p. 4.
4. James Dobson, Family News from Dr. James Dobson, Focus on the Family, Colorado Springs, April 1998, p. 1.
5. Senta Raizen, Executive Director of the National Center for the Improvement of Science Education, quoted in Viadero, op. cit.
6. Ethan Bronner, "U.S. Twelfth-Graders Rank Poorly in Math and Science, Study Shows," New York Times, 25 February 1998, p. A-1.
7. William Raspberry, "The Good News About U.S. Schools," Washington Post, 12 March 1998, p. A-15.
8. Robert Samuelson, "Stupid Students, Smart Economy?," Washington Post, 12 March 1998, p. A-15.
9. John Leo, "Hey, We're No. 19!," U.S. News & World Report, 9 March 1998, p. 14.
10. Ina V. S. Mullis et al., Mathematics and Science Achievement in the Final Year of Secondary School (Chestnut Hill, Mass.: Boston College, 1998).
11. Pursuing Excellence: A Study of U.S. 12th-Grade Mathematics and Science Teaching, Learning, and Achievement in International Context (Washington, D.C.: National Center for Education Statistics, NCES 98-049, 1998), p. 20.
12. Education at a Glance (Paris: Organisation for Economic Cooperation and Development, 1997).
13. George F. Madaus, ed., The Courts, Validity, and Minimum Competency Testing (Boston: Kluwer-Nijhoff, 1982).
14. William F. Schmidt et al., "Validity as a Variable," in Madaus, p. 138.
15. Jeremy Kilpatrick, personal communication, 8 June 1998.
16. Mullis et al., p. C-1.
17. Gerald W. Bracey, "TIMSS, Rhymes with 'Dims,' as in 'Witted,'" Phi Delta Kappan, May 1998, pp. 686-87.
18. Mullis et al., p. 117.
19. Gerald W. Bracey, "Comparisons Mean Little," USA Today, 25 February 1998, p. 11-A.
20. For a discussion of the latter problem, see my Research column in the April issue: Gerald W. Bracey, "About Those NAEP Proficiency Levels (Again)," Phi Delta Kappan, April 1998, p. 630.
21. Peter Applebome, "Better Schools, Uncertain Results," New York Times, 16 March 1997, sect. 4, p. 5.
22. Jay Mathews, "Schools Treat 'Senior Slump,'" Washington Post, 27 May 1998, p. B-1.
Last updated 9 September 1998
URL: http://www.pdkintl.org/kappan/kbra9809.htm
Copyright 1998 Phi Delta Kappa International