Grading

22 April 2006

I had a long talk with a student last week. We talked about a number of things, including my role in the Auckland MBA programme and how grading is done. I’ve done a lot of thinking about grading over the years, so I thought I might start putting my thoughts down in the hope that I will eventually turn them into a paper.

I’ll start by saying that, in my opinion, grading is not an exact science. Nevertheless, it is a system designed to produce reasonably consistent results. My personal view is that, in an overall sense, it is difficult, if not impossible, to reliably measure a student’s achievement to three significant figures. Indeed, going beyond two produces an illusion of accuracy, and for some, going beyond a single figure stretches the credibility of the system. But, as will be seen later, such talk of figures may, in itself, be nonsensical.

A simple definition

Grading is a process of evaluation whereby a student’s work is assessed and given a grade. A student who completes a course typically ends up with a series of grades that are then combined to produce a result.

There are a number of approaches to grading. The two most prevalent are norm-based (or norm-referenced) grading and criterion-based grading.

Norm-based grading

In norm-based grading, students are ranked relative to the performance of others in the class. It is also known as grading on the curve, a name that arises from the way the ranked work is fitted to a normal distribution (the classic bell curve). The consequences of norm-based grading include:

  • Increased competition within the class, which can result in higher performance
  • Increased pressure to perform, which can result in higher levels of plagiarism, cheating, and other undesirable behaviours
  • Relatively easy to administer and implement
  • The final grades are meaningless; they are a measure of ranking rather than of absolute performance
  • Competitive forces discourage collaboration and peer support, and can even lead to sabotaging of peers’ work

Nevertheless, norm-based grading systems are widely used, particularly in the United States.

For me, perhaps the most significant problem with grading on the curve is the meaninglessness, or arbitrariness, of the curve itself. There are few good explanations as to why any given class should produce a normal distribution. For example, compare an open-entry first-year class with an invitation-only honours class: why should the grade distributions be at all similar? Furthermore, as with other systems, calculating the overall grade, which effectively adds the curves’ effects together, doesn’t seem to make much sense.

Criterion-based grading

In criterion-based assessment, work is graded against a scale that is determined before the assessment. This is the system that is used throughout much of the University of Auckland (and, dare I say, is endemic throughout much of New Zealand’s education system).

The consequences of this type of system are:

  • It provides little information about the relative performance of the student (it’s interesting how often a class asks for the grade distribution so that they can see their relative performance).
  • It reduces competition between students.
  • It allows many students to achieve the same grade, e.g. if 30% of the class meets the standards for an A+, that is what they get.
  • The objectives/scales can be mis-targeted: they may be set too high or too low, measure the wrong things, or fail to measure all that is needed.
  • And, as said elsewhere, “Because of tendency of learning expectations to be mismatched with real learning outcomes, encourages ad hoc grade adjustments, thus contributing to meaningless grades.”
  • And, also from the same source, “Unduly constrains curriculum development by discouraging the use of very short assignments and/or by encouraging teacher to force exam or assignment to fit into point system easily calculated into scale.”

Overall, both systems have strengths and weaknesses (and it’s nice to know what they are). Most of the rest of this entry focuses on criterion-based grading.

The use of broad criteria

Here, the broad grading list looks something like this:

Grade Description
A+ Rare, outstanding+
A Exceptional; beyond what was expected
A- Excellent
B+ Polished, very good
B Covers everything expected; comprehensive; demonstrated good understanding
B- Good coverage, minor flaws
C+ Demonstrated adequate understanding of the fundamentals but some gaps
C Some understanding, but gaps
C- Just adequate[^1]
D+ Inadequate, lack of understanding
D Very inadequate, lack of understanding
D- Very poor

Neither part of the grading list is without some contention. For example, the plus/minus system (e.g. A, A+, A-) is not universally accepted. Until quite recently some well-known institutions, such as MIT and Stanford[^2], only used the letter grades (A, B, C, etc.). There are a number of arguments as to why the plus/minus system should not be used. These range from concerns about increased competition between students through to concerns about how reliably one can distinguish between the letter grade itself and the plus/minus[^3].

The verbal descriptions may also be considered contentious[^4], and from time to time faculty do discuss[^5] the exact meaning of these descriptions. Nevertheless, they are what has been accepted by the institution.

What is clear from this list is that grades represent an ordered series of categories. As soon as one accepts this, a number of issues arise.

  1. How does one combine a series of grades to arrive at an overall grade?
  2. How big do we expect the categories to be?
  3. Are the categories a relative measure or are they absolute? That is, is “Just adequate” for a first-year undergraduate student the same thing as for a final-year master’s student?

These are not trivial matters, and they have major impacts on students, not only on their results but also on the amount of effort they put into their work.

Combining grades

For the moment, let’s assume that a student’s achievements can be reliably assessed and appropriate grades awarded. How does one take a series of equally weighted grades, say A, A-, A, A, and arrive at a grade that truly represents the student’s overall achievement? Remember, these are categories. It’s like saying we have three apples and a pear: what do you have overall? (Or maybe it is like saying we have three fruit and one vegetable.)

In the previous example, is the student an A student or an A- student overall? Many systems rely on assigning a mark to each grade and then finding the central tendency. These give the student an A- if averaging is used and the fractional result is rounded down, or an A if the mode is used. To me, common sense seems to call for an A. But whilst common sense works here, what if there are more grades or a more varied distribution? Many people use the mean (average) to calculate the answer, but I would suggest that the mode is much more appropriate. Try it out; make up some patterns of grades and see which method gives you a final grade that seems the most sensible.

But, in doing all of this work, we have ignored the question of how we assign numbers to grades in order to do these calculations. Should it be A = 3, B = 2, C = 1, or should it be A = 10, B = 5, C = 1? In other words, how much harder is it to get an A than a C?

Anyone who flicks through their academic transcript, or who asks, will soon know that here we have the following scale for calculating Grade Point Averages (GPA).

Grade Points
A+ 9
A 8
A- 7
B+ 6
B 5
B- 4
C+ 3
C 2
C- 1
D+ 0
D 0
D- 0
Anything else 0
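
To make the earlier A, A-, A, A example concrete, here is a small Python sketch that applies the GPA points above to a set of equally weighted grades. The function names (mean_points, modal_grade, grade_for) and the "fail" label are made up for illustration. This is only a way of trying out patterns of grades, not how any institution actually calculates results.

```python
import math
from collections import Counter
from statistics import mean

# GPA points as listed in the table above (every D grade, and anything else, is 0).
GRADE_POINTS = {
    "A+": 9, "A": 8, "A-": 7,
    "B+": 6, "B": 5, "B-": 4,
    "C+": 3, "C": 2, "C-": 1,
}
# Reverse lookup for the passing grades only.
POINTS_TO_GRADE = {points: grade for grade, points in GRADE_POINTS.items()}


def mean_points(grades):
    """Average the GPA points of equally weighted grades."""
    return mean(GRADE_POINTS.get(g, 0) for g in grades)


def modal_grade(grades):
    """Return the most frequent grade; ties go to the grade seen first."""
    return Counter(grades).most_common(1)[0][0]


def grade_for(points):
    """Map whole GPA points back to a grade ("fail" is a hypothetical label for 0)."""
    return POINTS_TO_GRADE.get(points, "fail")


grades = ["A", "A-", "A", "A"]
avg = mean_points(grades)

print("mean points:          ", avg)
print("mean, rounded down:   ", grade_for(math.floor(avg)))
# Note: round() rounds exact .5 values to the nearest even number.
print("mean, rounded nearest:", grade_for(round(avg)))
print("mode:                 ", modal_grade(grades))
```

For this pattern the mean is 7.75 points, which becomes an A- if the fraction is dropped and an A if it is rounded to the nearest point, while the mode is an A. Changing the list of grades shows how quickly the two methods, and the two rounding choices, can part company.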

Whilst this conversion is used for ‘summing’ grades between courses, most departments use an entirely different scale[^6] if they need to do ‘grade math’. However, the use of such scales for within-course calculations seems to be falling out of favour because they tend to encourage some students to focus on the ‘grade math’ rather than on the learning (for example, “Was my C grade a 52 or a 54?”). Of course, this problem also exists within an assignment where individual components[^7] are assessed: how should they be aggregated?

As one might notice, we are already a long way away from discussing the actual performance of the student.

Conclusion

At the end of the day, the goal must be to have lecturers who can (in a reliable and consistent manner) say “In my opinion, based on the work that was submitted[^8], this student is an X”, where X is some grade value. No grading system can be perfect, but through the use of good judgement, most lecturers can be (and are) consistent[^9] in the assessment of students’ performance (but, of course, some students will always dispute that).

As a final note, a number of schools, particularly in the United States, have grading policies that seem to boil down to “Having looked at the assignments, plus anything the lecturer might additionally include (but has not mentioned), the final grade will be given”.

Some example policies TBA

Here are a few policies gathered from the ‘net.

After the average grade for each student is computed numerically using the weighting listed above, Prof. Farhi will discuss those students who are just a point or two below the grade borderlines with the recitation instructors and tutors. On the basis of this discussion, Prof. Farhi may use his discretion to push a small number of students above the borderline. The most common reason for such a grade increase is the case of a student who has shown very significant improvement during the term. MIT

[1]: Grades below C- are failing grades. That is, D-range grades are restricted to failing work.

[2]: Apparently, MIT have moved to using the plus/minus system internally, but students’ transcripts only report letter grades without the plus/minus.

[3]: The argument often goes along the lines of “A lecturer can reliably distinguish between an A student, a B student, a C student, and so on; but moving to plus/minus grades introduces greater unreliability into the system and promotes a false sense of accuracy.”

[4]: For example, what exactly does “Just adequate” mean?

[5]: These discussions must be recognised for what they are: not sources of disagreement, but a way to build a shared understanding of what each grade actually means. They produce tacit, rather than explicit, knowledge.

[6]: One popular scale has A+ as > 90, A > 85, A- > 80, B+ > 75, B > 70, B- > 65, C+ > 60, C > 55, C- > 50, D+ > 45, D > 40, D- < 40. Notice the non-linearity in the scale.

[7]: Whilst rubrics are often seen as more useful, they also have their own pitfalls. They often fall into the “addition of grades” problem, and rarely (if ever) do they provide an exhaustive list of attributes.

[8]: Assessment is meant to be based on what is being assessed, and not on the effort that went into it (unless that is explicitly part of the assessment).

[9]: There have been a number of tests to see if this is true. As with much research the results are mixed, but overall the evidence supports the assertion.