I'm surprised to see so many HNers come out against standardized testing. Sure, it's imperfect and this teacher got screwed, but how are administrators supposed to make informed, data-driven decisions on what the right course of action is? Many of us extol the virtues of A/B testing landing page designs and gathering deep metrics and analytics but turn around and slam educators for making decisions based on standardized test scores.
Give me a better metric for success that you can measure over the course of a year.
I'm surprised to see so many HNers come out against counting lines of code. Sure, it's imperfect and this developer got screwed, but how are managers supposed to make informed, data-driven decisions on what the right course of action is? Many of us extol the virtues of A/B testing landing page designs and gathering deep metrics and analytics but turn around and slam managers for making decisions based on line code counts.
Your post is a logical fallacy. The fact that a bad metric exists does not mean all metrics are bad.
To address your specific fallacy, the output of coding is heterogeneous while the output of teaching is homogeneous. If I code today, it's a realtime optimization system. Tomorrow it might be a search product. When I taught, the output was always the same: students who understand calculus.
The post is a parody. It points out the flawed reasoning used in the parent.
It appears you are claiming that when you taught, the output was the same each day: each day your students learned and understood calculus. In reality, each day you may have taught an aspect of calculus, but the topics varied from day to day. Furthermore, assuming undergrad level, it's hard to believe that every one of your students understood each topic you taught.
The dichotomy you've laid out between programming and teaching is not apt.
The output was the same each semester, not each day. It was the same for me and the guy down the hall. It was the same for every person teaching calc 1 between the last curriculum change and today.
Not every student understood every topic. So what?
Then the output isn't the same if not all students had the same outcome. Furthermore, you used a day as the time period when talking about programming; I assumed, as a matter of consistency, you meant roughly the same time period for teaching. Every semester I teach calculus is different. Two different people don't teach the same way even if the curriculum is the same.
Your statements here strengthen the parody. Merge sort is the same for everyone, right? So let's just go by lines of code. The outcome is the same.
A mistake in wording on my part - I should have said "the desired outcome is the same". Sorry for the confusion.
Two different people don't teach the same way even if the curriculum is the same.
So what? The goal is the same. It is meaningful to measure whether my students know more or less calculus than yours.
If the goal of programming were to re-implement merge sort over and over then it would be an effective metric to count the number of correct merge sort implementations.
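The premise of the joke can even be made mechanical. For a homogeneous task like merge sort, "a correct implementation" is checkable automatically; a minimal sketch (the checking-against-`sorted()` approach is just one illustrative choice):

```python
# Sketch: for a homogeneous task like merge sort, correctness is
# mechanically checkable, so "count correct implementations" is at
# least a well-defined metric.
import random

def merge_sort(xs):
    """Classic top-down merge sort; returns a new sorted list."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# The "metric": does the implementation agree with a known-good oracle?
data = [random.randint(0, 99) for _ in range(100)]
assert merge_sort(data) == sorted(data)
```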
The inability to compare a realtime monitoring system to a web scraper is what makes programming harder to measure.
But in many cases one can measure programmers by output. For example, at Styloot, one task we have is building web scrapers. They all have a pretty straightforward (and identical) goal. An effective metric for programmer performance on this task would be to measure the # of scrapers written or (better) the # of items correctly scraped.
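A metric like that is trivial to compute once the goal is identical across workers. A hedged sketch with invented names and numbers (not Styloot's actual data or pipeline):

```python
# Hypothetical example: score scraper authors by items correctly scraped.
# Authors, scrapers, and counts below are invented for illustration.
from collections import defaultdict

results = [
    # (author, scraper, items_scraped, items_correct)
    ("alice", "shop_a", 1200, 1150),
    ("alice", "shop_b",  800,  790),
    ("bob",   "shop_c", 3000, 1500),  # high volume, low accuracy
]

score = defaultdict(int)
for author, _scraper, _scraped, correct in results:
    score[author] += correct  # credit only correctly scraped items

# totals: alice 1940, bob 1500
```

Note the metric credits *correct* items, not raw volume, so the high-volume, low-accuracy author doesn't come out ahead.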
See also quant traders - the goal is homogeneous (increase profit, adjusted for risk), and your code is measured by how well it achieves that goal.
Your post is a logical fallacy. The fact that a bad metric exists does not mean all metrics are bad.
Yes.
That was, to some extent, my point :-)
That and the fact that many HN folk will have encountered management doing dumb things like equating velocity or cycle time or lines of code with productivity. Which may explain their aversion to simplistic metrics like this.
To address your specific fallacy, the output of coding is heterogeneous while the output of teaching is homogeneous. If I code today, it's a realtime optimization system. Tomorrow it might be a search product. When I taught, the output was always the same: students who understand calculus.
Really? I can easily see it from the opposite point of view. When I code I'm just delivering running-tested-features. When I teach (which I have done, and still do) every student is different - they often start from radically different positions of attitude, aptitude and existing knowledge.
Unlike website visitors, the teachers in a school are not acting completely independently of each other; they are part of a social group. Even if you could adequately measure their 'teaching quality', you might miss factors that contribute to the success of the school as a whole.
And measuring 'teaching quality' is not as simple as counting ad click-throughs. Education is at least as much about helping people learn about their role in the world as it is about learning facts, concepts, and tools. Just measuring the increase of students' scores on standardized tests misses a lot of the impact teachers have on their students' minds.
Nobody in his right mind would judge a developer solely on the number of features he has implemented during the previous year; how robust and maintainable the code is matters too.
If you want to take these factors into account, you have to actually read the code, understand it, and ideally talk to the developer about it.
Unless you do this, you cannot judge the quality of the code.
The same holds true for evaluating teachers, imho. Administrators need to talk to students, parents and teachers and form a comprehensive image of a teacher's ability. I really think it is that simple. Certainly not perfect. But not less perfect than standardized testing.
As a German, maybe I am misunderstanding some properties of the US educational system.
This is spot on. Does anyone know of a single instance of standardized teacher testing that recognizes important teacher qualities beyond their skill in imparting knowledge to students?
Doesn't that just imply that the tests are failing to measure what we want teachers to teach, and that we should fix the tests?
The whole reason that test results are being used, as opposed to manual assessment, is that test results have proven to be a better predictor of long-term outcomes. Test results aren't perfect, but they're better than what was previously used.
"test results have proven to be a better predictor of long term outcomes"
I'm not sure what long-term outcomes you're referring to or how they were measured, but if they're education-related, the measurement was probably more test results. Yes, test results are a good predictor of future test results.
Remember the kerfuffle about Germany's bad performance in the PISA study? We also sometimes use standardized testing, but not to compare individual teachers; instead, we use it to argue endlessly about the slightly different curricula and school policies of the different German states.
Addendum:
My understanding is that in the United States anyone can train relatively easily to become a school teacher. In Germany they are tenured civil servants with at least 3-4 years of study.
Education is at least as much about helping people learn about their role in the world, as it is about learning facts, concepts, and tools.
Could you explain what this means more precisely? Specifically, how can I differentiate a student who knows "about their role in the world" from one who doesn't?
Maybe it isn't possible to adequately measure this. I'm reminded of the "I know it when I see it" reasoning famously employed by the Supreme Court (in reference to pornography). I doubt that it is possible to develop adequate metrics to measure teacher ability.
I gave an example of something that almost everyone agrees exists but is very hard to define or quantify. Anyone can play the game you are playing here. How do you know your measuring device really exists? Your point is without merit.
How do you know your measuring device really exists?
How can I know a calculus test exists? Very easily. I look at it, feel it, etc. How can I know calculus ability exists? Again, very easily - it predicts outcomes on a set of correlated exams, the existence of which I verify with sight, touch, etc.
You seem to want to argue that education only produces vague, immaterial and unmeasurable outcomes. Let's take that as a given: in that case, why not just eliminate education spending and save $900B/year?
Outcomes not being well defined != unnecessary or worth getting rid of.
You seem to confuse physical existence and ideas. Presumably a physical object can be painted, but ideas can't. Lots of things exist that can't be verified by sight, touch, etc. In my opinion, being a good teacher is not something that can be reasonably measured. I'm open to the possibility that I'm wrong but have seen no evidence that I am.
After we came out of the church, we stood talking for some time together of Bishop Berkeley's ingenious sophistry to prove the nonexistence of matter, and that every thing in the universe is merely ideal. I observed, that though we are satisfied his doctrine is not true, it is impossible to refute it. I never shall forget the alacrity with which Johnson answered, striking his foot with mighty force against a large stone, till he rebounded from it -- "I refute it thus."
Doctors have a maxim of "first do no harm." I believe that maxim is being violated by some of these attempted fixes.
I'm not saying they should give up the search for something that works, but they absolutely should not cause harm just for the sake of doing something.
I don't think that this really fits into my analogy, but if losing a few good teachers is the price we need to pay for preventing a toxic culture of no-accountability where administrators are not allowed to make metrics-based decisions then so be it.
If you have the wrong metrics, you cannot make the right decisions. If you are throwing away good teachers because the rest are really good at gaming the tests then you have done far more harm than good.
The analysis shows that the test results contain zero information, unless you believe that teachers change quality randomly every year. If correlation begins to appear in those evaluations, it will be because the test has been gamed.
>Give me a better metric for success that you can measure over the course of a year. //
Define success.
If one gets the best score but has no social skills, no friends, no fun and no creativity, then is that success?
This (only considering a test score) is like doing A/B testing but only looking at impressions and not at conversions, or only looking at conversion rates and not at conversion value. In web terms you can succeed while reducing both your impressions and conversion rates (by taking more money off fewer people and only attracting motivated customers).
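Concretely, with hypothetical numbers: a variant can lose on both impressions and conversion rate yet win on the metric that actually matters, total conversion value.

```python
# Hypothetical A/B result: variant B has fewer impressions and a lower
# conversion rate than A, but higher value per conversion -- and wins
# on revenue. All numbers are invented.
variants = {
    "A": {"impressions": 10_000, "conversions": 500, "value_per_conv": 10},
    "B": {"impressions":  4_000, "conversions": 160, "value_per_conv": 40},
}

stats = {}
for name, v in variants.items():
    stats[name] = {
        "rate": v["conversions"] / v["impressions"],        # conversion rate
        "revenue": v["conversions"] * v["value_per_conv"],  # total value
    }

for name, s in stats.items():
    print(name, f"rate={s['rate']:.1%}", f"revenue={s['revenue']}")
```

Judging only by impressions or rate picks A; judging by value picks B. The analogue of "value" for schools is exactly what standardized scores fail to capture.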
Sorry that's more destructive than constructive WRT what the right course of action is.
Are there [large scale] education systems that consider happiness of pupils, reduction in bullying, positive social interaction and such to be metrics to assess and improve?
I don't think you should be optimizing for test scores though, you should be optimizing for "education", which is ill-defined and hard to measure. In the analogy to A/B testing, you wish to maximise your profits, not the number of email addresses you collect.
I had this discussion with a guy who is now the state superintendent of public education. Our discussion led to classroom visits of about 10 minutes per instructor, multiple times each semester. Evaluation criteria were: 1) Are learning objectives for the lesson clearly visible or otherwise available? 2) What percent of students were on task? 3) ... I forget the others, but there were five. (It's been about five years, but I think they included the students knowing how they were to be evaluated, the teaching style used, e.g., lecture, group work, etc., and whether the instructor was using data to inform the approach. It came down to the standard description of leadership: vision, expectations, support, feedback.)
He was pretty firm that this method would be better (he'd spent years thinking about it), but we both agreed that it required a level of intervention by the administrator that, although it could reasonably be expected, was unlikely. Using just the test data is the lazy way, which means it's the method most will use.
Still, I believe _some_ standardized testing is important. ... but there are two types of tests: norm testing and standards testing, and both types have their uses.
Give me a better metric for success that you can measure over the course of a year.
This is assuming that the net effect of the tests is positive, thus we should keep them until we find something better. But a lot of comments are pointing at the harm caused by standardized testing. For people who think the net effect is negative, the rational thing to do is get rid of them until we find something better.
So if you have 10 good teachers, you just rank them and kick out the "worst" 2?
If you rank the teachers in a strict order, you have to make someone the worst and someone else the best teacher. But is the administration going to check how much worse the worst teacher really is? I doubt it.
No. You set performance goals for teachers specific to the teacher and district, evaluate the performance objective and make educated decisions based on that data. It doesn't need to be ranked, but it does need to be empirical.
"Did test scores improve relative to everybody else year over year?" is a pretty good start. Of course, by definition, this won't work for all districts.
Actually, based on my experience of dealing with hundreds of teachers and thousands of students, that won't work either. Test scores are a poor proxy for what students actually know or can do, especially in mathematics. The original linked article provides evidence for that. You put in the caveat "Of course ... this won't work for all districts" but then you're arguing for exceptions to be allowed. That's just a mess.
There is no simple fix. There are bad teachers (and I'm leaving the term undefined - it's a bit like porn - undefinable, but recognisable) who get good test results and glowing evaluations, and there are superb teachers who get mediocre test results and undistinguished evaluations.
It's easy from the outside or from a limited perspective to suggest "obvious" methods of assessment or "obvious" actions to improve the situation, but in the end, no one has really defined what they mean by "good teaching," so proposing assessments of something undefined will just result in more proxies to be distorted.
Except that now you are ranking, by introducing the notion of "relative to everybody else".
Why should test scores improve year over year? If everyone is getting a 2/6 then yes, but students in the same grade are not going to get smarter every year. That is the major flaw in using student tests to measure teachers, IMO. You are going to get a bell curve around 4/6 from the test results for the majority of teachers. When you don't, it is really hard to tell whether the reason is the teaching, the curriculum, the students, or the test and its grading.
Give me a better metric for success that you can measure over the course of a year.