How reliable are your assessments?

I recently ran a session, as part of one of our training programs, around the topic of assessment. As part of this I discussed with delegates the factors that affect the reliability of any assessment we undertake and the strategies we can use to improve it. For any secondary colleagues reading this, the topic is particularly pertinent as we all work towards assigning accurate and valid TAGs. Much of what I discussed on the training program stemmed from a fantastic resource recently published by Evidence Based Education, outlining their “Four Pillars of Assessment”: Purpose, Reliability, Validity and Value.

Reliability is essentially about the accuracy and consistency of an assessment (and the inferences we draw from it). There are many factors that contribute to the reliability of an assessment, but perhaps most critical are the precision of the questions/tasks asked, plus the accuracy and consistency of the interpretations derived from the responses we collate.

What we must consider, or accept, is that no assessment will ever be 100% reliable, but by understanding this we can act to improve it. One of the most significant ways to improve reliability is to focus on “rater reliability”, basically the accuracy and consistency of the marking. Rater reliability can be considered in two contexts: inter-rater reliability (i.e. consistency across subject teams) and intra-rater reliability (i.e. an individual’s own consistency). The latter of these two is often overlooked.
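To make the idea of rater reliability a little more concrete, here is a minimal sketch (an entirely hypothetical example, not taken from the Evidence Based Education resource) of how agreement between two markers could be quantified, using simple percent agreement alongside Cohen’s kappa, which corrects that agreement for chance:

```python
from collections import Counter

# Invented grades from two markers scoring the same ten scripts (1-9 scale).
marker_a = [7, 5, 6, 8, 4, 5, 7, 6, 3, 5]
marker_b = [7, 6, 6, 8, 4, 5, 6, 6, 3, 4]

n = len(marker_a)

# Observed agreement: the proportion of scripts given identical grades.
observed = sum(a == b for a, b in zip(marker_a, marker_b)) / n

# Chance agreement: how often the two markers would agree by luck alone,
# given each marker's overall distribution of grades.
count_a, count_b = Counter(marker_a), Counter(marker_b)
expected = sum(count_a[g] * count_b[g] for g in set(marker_a) | set(marker_b)) / n ** 2

# Cohen's kappa: observed agreement corrected for chance (1 = perfect, 0 = chance level).
kappa = (observed - expected) / (1 - expected)

print(f"Percent agreement: {observed:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Even on invented grades like these, a kappa noticeably below 1 is a useful reminder that two experienced markers looking at the same scripts can diverge more than we might assume.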

We acknowledge that achieving consistency across individuals is challenging (especially when the material being assessed is inherently subjective, e.g. English Literature, History and Art) and therefore often allocate considerable time to minimising this variation; however, we rarely stop to consider the factors that may affect our own consistency. It is not a giant leap of logic to realise that our decisions, and therefore our grading of assessments, may vary depending on the time of day, our workload, our emotional state, previous lesson experience and even how hungry we are. As such, despite the pressures we face around assessment points (particularly the current TAG situation), it is important that we take time to reflect on and monitor our own capacity to be reliable.

Inter-rater reliability is inherently challenging; however, the Evidence Based Education report does highlight some ways we can improve it:

  • Using exemplar student work to clarify what success looks like in specific assignments
  • Blind marking assignments to reduce bias
  • Blind moderating samples of students’ work
  • Using well-crafted closed multiple-choice questions

Based on these suggestions I have been talking to some of our curriculum leaders at Durrington to find out how they approach the standardisation and moderation of assessments.

In Geography, Sam Atkins selects 2 to 3 papers from the cohort (covering a range of student types) and ensures all staff have copies of these for standardisation. All team members meet to mark these papers together, with either Sam or one of the experienced exam markers in the team (of which we are fortunate to have several) taking the lead in advising on the definitive mark. For each of the longer-answer questions a clear set of criteria for level marking is established, and staff can use the copied papers as yardsticks. Staff are then encouraged to mark question by question rather than a whole paper at a time, allowing their understanding of the mark scheme to be refined and comparative judgements to be made. Once all staff have completed their marking, they return their scripts to Sam. Sam then randomly selects papers from each teacher (plus any they have flagged as being unsure of) and these are moderated. Where common errors or issues are flagged, these are recorded and sent out to all members of staff, who are asked to recheck their papers for these particular issues.

In PE, Tom Pickford follows a similar approach to that above. A key part of the team’s discussion when awarding marks/grades for extended writing is to ensure a consistent understanding of each level within the mark scheme. The standardised answers are given a grade, and the reasons why each is neither of the adjacent levels are clearly outlined. This process of making the success criteria explicit is a vital part of ensuring high reliability.

In English, Andy Tharby and Jacqueline Bradley create a booklet of answers across the grade range. The team then meet to discuss why each example has achieved the grade it has (and therefore the differences between each) and then use this as a reference in their own marking. Andy and Jacky then sample a selection of each team member’s marking and give individual feedback. Once grades have been awarded and entered into the system, they review the data looking for any anomalous entries when compared to a student’s historical data. If any anomalies are identified, then that student’s assessment is also moderated.

Finally, in maths (statistically one of the most reliable subjects for assessment due to the closed nature of the majority of questions), Shane Borrett has been trialling a new strategy in an attempt to further improve reliability and remove the potential for teacher bias. When marking their most recent set of year 11 assessments, Shane allocated questions across the higher and foundation papers to individual staff members. Each staff member was responsible for marking the same set of questions (approximately a double-page spread) for every paper completed, before moving the papers on to the next member of staff to mark the next double-page spread, and so on until the papers were fully marked. The benefit of this is that staff become highly familiar with their section of the mark scheme and are able to make comparative judgements from paper to paper. Furthermore, as they are marking students they do not teach, there is a reduced risk of bias. When speaking to Shane about this approach, it was clear he felt that it had helped improve the accuracy and consistency of marking; however, he did admit that the approach had some drawbacks, primarily that as staff were not marking their own classes’ assessments (nor the whole paper) it was harder for them to get a “feeling” of how their students had done and where they needed to improve. While this was not a significant issue for this final set of year 11 assessments, it could be more problematic for interim assessments that would normally be used not only summatively, but also formatively to inform future teaching.
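For illustration only, the short sketch below (a hypothetical example, not the system Shane actually uses, with invented paper IDs and question blocks) shows how such an allocation might be organised, pairing each marker with one block of questions across every paper:

```python
# A hypothetical sketch of allocating one block of questions to each marker,
# so that every marker marks the same block on every paper and can
# internalise that section of the mark scheme.
papers = [f"paper_{i:03d}" for i in range(1, 121)]          # invented paper IDs
blocks = ["Q1-Q4", "Q5-Q8", "Q9-Q12", "Q13-Q16"]            # double-page spreads
markers = ["Marker A", "Marker B", "Marker C", "Marker D"]  # one marker per block

# Pair each marker with a block; the papers simply rotate through all markers.
allocation = {marker: block for marker, block in zip(markers, blocks)}

for marker, block in allocation.items():
    print(f"{marker} marks {block} on all {len(papers)} papers")
```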

Ben Crockett
