Reliability of a beef cattle locomotion scoring system for use in clinical practice
  1. Jay Tunstall,
  2. Karin Mueller,
  3. Oscar Sinfield and
  4. Helen Mary Higgins
  1. Department of Livestock and One Health, Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Cheshire, UK
  1. Correspondence to Dr Helen Mary Higgins, Department of Livestock and One Health, Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Cheshire, UK; h.higgins{at}


Background Locomotion (lameness) scoring has been used and studied in the dairy industry; however, to the authors’ knowledge, there are no studies assessing the reliability of locomotion scoring systems when used with beef cattle.

Methods A four-point scoring system was developed and beef cattle filmed walking on a firm surface. Eight veterinary researchers, eight clinicians and eight veterinary students were shown written descriptors of the scoring system and four video clips for training purposes, before being asked to score 40 video clips in a random order. Participants repeated this task 4 days later.

Results The intra-observer agreement (the same person scoring on different days) was acceptable with weighted mean Kappa values of 0.84, 0.81 and 0.84 respectively for researchers, clinicians and students. The inter-observer agreement (different people scoring the same animal) was acceptable with weighted Gwet’s Agreement Coefficient values of 0.70, 0.69 and 0.64 for researchers, clinicians and students. Most disagreement occurred over scores one (not lame but imperfect locomotion) and two (lame, but not severe).

Conclusion This scoring system has the potential to reliably score lameness in beef cattle and help facilitate lameness treatment and control; however, some disagreements will occur especially over scores one and two.

  • lameness
  • cattle
  • locomotion

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, an indication of whether changes were made, and the use is non-commercial. See:


Introduction

Lameness in cattle is considered to be a critical welfare issue,1 2 with lame beef cattle being a specific issue due to the risks of being left untreated for a long period of time.3 Identification of lame animals is considered to be an important step in dealing with individual animals, but also in acknowledging and understanding the scale of the problem. As such, the UK dairy industry is encouraging farmers to locomotion score cattle,1 4 and a sheep locomotion scoring tool is available.5 Locomotion scoring also allows benchmarking, meaning that farmers can understand how they compare with others, and allows improvements or deteriorations to be measured over time. However, this requires scorers to be able to give an animal with any given locomotion the same score on any given day. Furthermore, it requires different scorers to give an animal with any given locomotion the same score in order for the results to be consistent. In particular, a practical and easy-to-use scoring system is needed that can be used by veterinary surgeons in clinical practice. This differs from scoring systems designed specifically for research purposes, which typically have considerable detail and a large number of possible scores. While that detail is necessary for research, it makes such systems more complex and hence less practical for clinicians, and it is not essential for the routine monitoring and control of lameness in clinical practice.

Any locomotion scoring system would ideally have been proven to be valid, in the sense that it accurately measures lameness, and also reliable, meaning that scores are consistent (repeatable) when independent measurements are performed. Although assessing validity of a locomotion scoring system can be challenging, reliability can be assessed in two regards. Inter-observer reliability relates to multiple people scoring the same animal and asks the question: how consistent are the scores they assign? In other words, do different people agree with each other over the same animal? Intra-observer reliability relates to the same person scoring the same animal on different occasions (with degree of lameness unchanged) and asks the question: to what extent does a person agree with themselves?

In dairy cattle, several scoring systems have been developed and reported in the literature, typically based on a combination of subjective visual observations such as back arching, stride length, weight-bearing and so forth.4 6 7 However, to the authors’ knowledge, none of these have been assessed for reliability when scoring beef cattle.

The aim of this study was to assess the inter-observer and intra-observer reliability of a locomotion scoring system for use with beef cattle in clinical practice by veterinary surgeons.


Locomotion scoring system

A four-point locomotion scoring system was developed following consultation between the authors. It was based on two current dairy cattle scoring systems, with due consideration for the practicalities and specific attributes of beef cattle.4 7 Of these two dairy cattle scoring systems, the AHDB system is commonly used in practice in the UK, and the Sprecher system is well publicised and cited internationally. The descriptors are given in table 1.

Table 1

Proposed locomotion scoring system designed for use in beef cattle (adapted from Sprecher et al7 and AHDB4 scoring systems)

Video clips and online completion

Video clips were created by filming both suckler cows and finishing cattle walking on a firm surface, either from the rear, the side or a transition from side to rear. Where necessary, the animal intended as the focus of the video was identified with an arrow to avoid confusion and any audio was removed. These clips were examined by three experienced researchers to ensure a sufficient range of scores were present (approximately ten of each score, zero, one, two and three) and yielded a total of 40 video clips for assessment by participants. The researchers also selected four additional video clips to be used for training purposes (one clip for each score) that they agreed were typical of each score.

The 40 assessment video clips and 4 training clips were uploaded onto the University of Liverpool’s virtual learning environment (VITAL—Virtual Interactive Teaching at Liverpool), which uses Blackboard 2018 (Blackboard, Washington, USA). This platform enabled participants to view the training videos alongside the descriptors for each score at the start. They could re-play the training clips as many times as they wanted. Thereafter, they were asked to watch each of the 40 assessment videos and assign a score to each. Videos lasted between 1 and 18 s. Each assessment video could also be re-played as many times as the observer wanted. The order of the assessment videos was randomised for each participant. After four days, the observers were asked to repeat the entire task, that is, to watch the training videos and read the descriptors again and re-score the 40 assessment videos, which were presented again in a randomised order.


Observers

Observers were a convenience (non-random) sample of eight private practice veterinary surgeons (‘clinicians’ or ‘C’) involved with livestock work and undertaking postgraduate livestock courses alongside their clinical role, eight veterinary researchers/lecturers involved with livestock research/teaching (‘researchers’ or ‘R’) and eight veterinary students, in years three to five of a five-year course (‘students’ or ‘S’). Observers were coded 1–8 for each group ordered by their intra-observer exact agreement percentage.

Data analysis

The data were exported from VITAL into Microsoft Excel 2016 (Microsoft, Redmond, Washington, USA). Statistical tests were conducted in Minitab V.18.1 (Minitab Statistical Software, State College, Pennsylvania, USA) and R (R Core Team, 2019), including Computing Chance-Corrected Agreement Coefficients R Package (irrCAC, Gwet 2019). P values are reported as continuous values and without setting any arbitrary threshold.8 9 Quadratic weightings were used to produce weighted Kappa values and AC2 values.
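For reference, the block below writes out the conventional definition of quadratic agreement weights for q ordered categories and evaluates it for the four-point scale. The paper does not print the formula, so this standard form is an assumption about the weighting used.

```latex
w_{ij} = 1 - \frac{(i-j)^2}{(q-1)^2}, \qquad i,j \in \{0,1,2,3\},\quad q = 4
% |i-j| = 0 : w = 1                     (exact agreement, full credit)
% |i-j| = 1 : w = 1 - 1/9 \approx 0.89  (adjacent scores, near-full credit)
% |i-j| = 2 : w = 1 - 4/9 \approx 0.56
% |i-j| = 3 : w = 0                     (maximal disagreement, no credit)
```

Under these weights, a one-point disagreement (for example, score one versus score two) is penalised far less than a three-point disagreement.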

Intra-observer agreement

Percent exact agreement (and agreement within ±1 and ±2 scores) was calculated for each observer across the 40 videos, and mean values for the three different groups (ie, the researchers, clinicians and students) were compared with paired t-tests. Differences between the same observer at the first and second scoring (intra-observer agreement) were examined using weighted Cohen’s Kappa values,10 and the difference between mean values for researchers, clinicians and students was compared using paired t-tests. Systematic bias between attempts for each scorer was investigated by subtracting each observer’s second score from their first, and performing a one-sample t-test on the resulting value (null hypothesis: the mean value equals zero, alternative hypothesis: the mean value is not equal to zero).
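The intra-observer measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis (which used Minitab and R). The function names are hypothetical, scores are assumed to be integers 0 to 3, and an unscored clip is assumed to be recorded as `None` and dropped from the comparison.

```python
def _scored_pairs(first, second):
    """Keep only clips that were scored in both sessions (None = not scored)."""
    return [(a, b) for a, b in zip(first, second) if a is not None and b is not None]


def agreement_within(first, second, tolerance=0):
    """Percentage of clips whose two scores differ by at most `tolerance` points."""
    pairs = _scored_pairs(first, second)
    hits = sum(abs(a - b) <= tolerance for a, b in pairs)
    return 100.0 * hits / len(pairs)


def quadratic_weighted_kappa(first, second, n_categories=4):
    """Cohen's kappa with quadratic weights for one observer's two sessions.

    Undefined (division by zero) if chance agreement equals 1, for example
    when every clip receives the same score in both sessions.
    """
    pairs = _scored_pairs(first, second)
    n, q = len(pairs), n_categories

    def weight(i, j):
        return 1.0 - (i - j) ** 2 / (q - 1) ** 2

    # Joint proportions of (session 1 score, session 2 score).
    observed = [[0.0] * q for _ in range(q)]
    for a, b in pairs:
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(q)]
    col = [sum(observed[i][j] for i in range(q)) for j in range(q)]

    p_o = sum(weight(i, j) * observed[i][j] for i in range(q) for j in range(q))
    p_e = sum(weight(i, j) * row[i] * col[j] for i in range(q) for j in range(q))
    return (p_o - p_e) / (1.0 - p_e)


# Made-up scores for eight clips, for illustration only.
session_1 = [0, 0, 1, 1, 2, 2, 3, 3]
session_2 = [0, 1, 1, 2, 2, 2, 3, 3]
exact = agreement_within(session_1, session_2)        # exact agreement
within_one = agreement_within(session_1, session_2, 1)  # agreement within one score
kappa = quadratic_weighted_kappa(session_1, session_2)
```

In this toy example both disagreements are between adjacent scores, so the weighted Kappa stays high even though exact agreement is only moderate. The study's results show the same pattern.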

Inter-observer agreement

Inter-observer scores were investigated using each observer’s first attempt at scoring the videos.

The percentage of video clips that an observer agreed on with each individual observer in their group (ie, the researchers, clinicians and students) was calculated to produce seven scores. The mean of these scores produced the mean exact agreement for that observer. This was repeated for each of the 24 observers to initially assess the agreement within groups. Agreement was formally analysed using quadratic weighted Gwet’s Agreement Coefficient 2 (AC2). An AC2 value was produced for each group of observers (researcher, clinician or student) and overall for all observers. AC2 values were adjusted using Critical Values provided by Gwet.11 12
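The two inter-observer measures can be sketched as follows. The AC2 function follows the general published form of Gwet's weighted coefficient for multiple raters (weighted observed agreement, corrected by a chance term built from the mean category proportions). It is a sketch under stated assumptions: quadratic weights, integer scores 0 to 3, `None` for a missing score and hypothetical function names. It does not reproduce the study's adjustment using Gwet's critical values, so its output corresponds to the unadjusted values only.

```python
def mean_pairwise_exact_agreement(target, others):
    """Mean of the exact-agreement percentages between one observer and each
    other observer in the group (seven comparisons in a group of eight)."""
    per_pair = [
        100.0 * sum(a == b for a, b in zip(target, other)) / len(target)
        for other in others
    ]
    return sum(per_pair) / len(per_pair)


def gwet_ac2(ratings, n_categories=4):
    """Unadjusted Gwet's AC2 with quadratic weights.

    `ratings` holds one list per video, each with one score per observer.
    """
    q = n_categories
    weights = [[1.0 - (k - l) ** 2 / (q - 1) ** 2 for l in range(q)] for k in range(q)]

    # counts[i][k]: number of observers who gave video i the score k.
    counts = []
    for video in ratings:
        row = [0] * q
        for score in video:
            if score is not None:
                row[score] += 1
        if sum(row) > 0:
            counts.append(row)
    n = len(counts)

    # Weighted observed agreement, averaged over videos with two or more scores.
    terms = []
    for row in counts:
        r_i = sum(row)
        if r_i < 2:
            continue
        weighted = [sum(weights[k][l] * row[l] for l in range(q)) for k in range(q)]
        terms.append(
            sum(row[k] * (weighted[k] - 1.0) for k in range(q)) / (r_i * (r_i - 1))
        )
    p_a = sum(terms) / len(terms)

    # Chance agreement from the mean proportion of scores in each category.
    pi = [sum(row[k] / sum(row) for row in counts) / n for k in range(q)]
    total_weight = sum(sum(w) for w in weights)
    p_e = total_weight / (q * (q - 1)) * sum(p * (1.0 - p) for p in pi)
    return (p_a - p_e) / (1.0 - p_e)


# Made-up scores: four videos, three observers each.
toy_ratings = [[0, 0, 1], [1, 1, 1], [2, 3, 3], [3, 3, 3]]
toy_ac2 = gwet_ac2(toy_ratings)
```

Unlike Kappa, the chance term here depends on how evenly scores are spread across categories, not on each rater's own marginal distribution. This is one reason AC2 is often chosen when there are many raters.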

For each video, the mode score was determined and considered to be the correct score. One video was bimodal, and the mean score was used to determine which mode to consider correct. All videos of each score were then grouped and an AC2 value generated for each score to show the agreement of observers for each individual locomotion score. This was performed for each group of observers, and overall.
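The consensus rule can be illustrated with a short sketch. The paper states only that the mean was used to choose between the two modes of the bimodal video. Picking the mode closest to the mean is an assumption made here, and `consensus_score` is a hypothetical name.

```python
from collections import Counter


def consensus_score(scores):
    """Return the modal score; if several scores tie for most frequent,
    return the tied score closest to the mean of all scores
    (the lower score wins if two are equally close)."""
    counts = Counter(scores)
    top = max(counts.values())
    modes = sorted(score for score, count in counts.items() if count == top)
    if len(modes) == 1:
        return modes[0]
    mean = sum(scores) / len(scores)
    return min(modes, key=lambda mode: abs(mode - mean))


# Made-up example: scores 1 and 2 are tied, and the single 3 pulls the mean towards 2.
tied_example = consensus_score([1, 1, 1, 2, 2, 2, 3])
```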

Ethical approval was granted by the University of Liverpool Ethics committee. It is reported in accordance with the guidelines for reporting reliability and agreement studies.13


Results

The distribution of scores, as determined by the mode score for each video, was score 0: 12 clips, score 1: 10 clips, score 2: 9 clips and score 3: 9 clips. The results for one video were bimodal; therefore, the mean score was used to determine which mode to consider the correct score.

Intra-observer agreement

Three observers each failed to score one clip in their second scoring session (a different clip in each case). Each of these observers' scores for the clip they missed was excluded from the analysis of intra-observer agreement.

For all 24 observers, the mean exact agreement between first and second observation was 66.0 per cent with a 95 per cent CI of 61.9–70.1 per cent; it was 68.0 per cent (61.7–74.3 per cent) for researchers, 63.3 per cent (51.7–74.9 per cent) for clinicians and 66.8 per cent (60.9–71.7 per cent) for students (table 2). Agreement within one score (95 per cent CI in brackets) was achieved as follows for researchers, clinicians and students: 98.4 per cent (96.8–100 per cent), 97.5 per cent (95.0–100 per cent) and 98.7 per cent (97.1–100 per cent). The clinicians achieved 99.7 per cent agreement within two scores; the researchers and students achieved 100 per cent agreement within two scores. The clinicians achieved 100 per cent agreement within three scores.

Table 2

Per cent exact agreement between locomotion scores given during sessions 1 and 2 (and within 1 and 2 points) for each observer

The mean weighted Kappa value for agreement between first and second observation was 0.84 with a 95 per cent CI of 0.78–0.89 for researchers, 0.81 (0.73–0.89) for clinicians and 0.84 (0.82–0.86) for students (see also table 3). As shown in table 4, there may be some systematic bias between observations for some observers (examples could include researchers 1 and 2, clinician 6 and students 3, 6 and 7).

Table 3

Weighted Kappa values for each observer’s agreement between sessions 1 and 2

Table 4

Mean difference between locomotion scores given during first and second sessions and results of one-sample t-tests
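The systematic-bias check summarised in table 4 can be sketched as a one-sample t statistic on the per-clip differences (first score minus second score). This is an illustration using only the Python standard library. The function name is hypothetical, `None` is assumed to mark an unscored clip, and the p value (which needs the t distribution with n − 1 degrees of freedom) is left to a statistics package.

```python
import math
import statistics


def bias_t_statistic(first, second):
    """Return (mean difference, t statistic, degrees of freedom) for the
    null hypothesis that the mean of (first - second) is zero.

    Undefined (division by zero) if every difference is identical.
    """
    diffs = [a - b for a, b in zip(first, second) if a is not None and b is not None]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / standard_error, n - 1


# Made-up scores for six clips: this observer scored two clips lower in session 2.
mean_diff, t_value, dof = bias_t_statistic([1, 2, 2, 3, 1, 0], [1, 1, 2, 2, 1, 0])
```

A positive mean difference indicates that the observer tended to score lower in the second session, and a negative one that they tended to score higher.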

Inter-observer agreement

The mean exact agreement percent was 61.6 (95 per cent CI 59.5 to 63.7) for researchers, 57.6 (50.3 to 64.9) for clinicians and 54.6 (51.6 to 57.7) for students (see also table 5). The AC2 values were 0.70 (unadjusted 0.81, 95 per cent CI 0.76 to 0.86), 0.69 (unadjusted 0.80, 95 per cent CI 0.77 to 0.84) and 0.64 (unadjusted 0.75, 95 per cent CI 0.69 to 0.81) for researchers, clinicians and students, respectively (table 5). The overall adjusted AC2 value for all observers was 0.72 (unadjusted 0.75, 95 per cent CI 0.69 to 0.81).

Table 5

Mean exact agreement and Gwet’s AC2 for each group of observers (researchers, clinicians and students) and for all observers combined (AC2 values adjusted for critical values*)

The adjusted AC2 values created for each locomotion score are displayed in table 6. They show almost perfect or substantial agreement for videos scoring either zero or three (as determined by the mode score). There was substantial or moderate agreement for videos scoring two, and substantial agreement for videos scoring one according to the interpretations determined by Landis and Koch14: <0.00=poor; 0.00–0.20=slight; 0.21–0.40=fair; 0.41–0.60=moderate; 0.61–0.80=substantial; 0.81–1.00=almost perfect.
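The Landis and Koch bands quoted above can be written as a small lookup, which is convenient when labelling many coefficients at once. The published bands are given to two decimal places. Treating each upper bound as inclusive is an assumption made here, and the function name is hypothetical.

```python
def landis_koch_label(value):
    """Map an agreement coefficient to its Landis and Koch description."""
    if value < 0.0:
        return "poor"
    # (inclusive upper bound, label) pairs in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if value <= upper:
            return label
    raise ValueError("agreement coefficients cannot exceed 1")


# Adjusted inter-observer AC2 values reported for the three groups.
labels = {group: landis_koch_label(ac2) for group, ac2 in
          {"researchers": 0.70, "clinicians": 0.69, "students": 0.64}.items()}
```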

Table 6

Inter-observer agreement coefficient (Gwet’s AC2) for researchers, clinicians, students and all 24 observers combined (AC2 values adjusted with critical values*)


Discussion

Locomotion scoring is currently relied on in the livestock sector, both to identify lame animals and to determine a herd level prevalence, including enabling benchmarking. Although locomotion scoring is criticised for being subjective, this subjectivity can be reduced by using a scoring system with good reliability, both by the same scorer when scoring on different occasions, and by different scorers scoring the same cattle. Lack of knowledge of the reliability of a scoring system makes it difficult to fully acknowledge its subjectivity.

This study has assessed the reliability of the proposed beef locomotion scoring system, that is, its consistency. However, it should be emphasised that it has not assessed the validity of the scoring system, which still needs testing. Neither inter-observer nor intra-observer reliability addresses the issue of accuracy, because observers can consistently agree with each other, and with themselves on different occasions, and still be wrong.

When using this locomotion scoring system, researchers, clinicians and students achieved at least substantial agreement in both the intra-observer and inter-observer assessments with all results greater than 0.61 (classed as ‘substantial’ according to Landis and Koch14). This suggests that if the same observer scores the clips on different occasions, or if different observers score the clips, over the 40 clips they could expect to achieve substantial agreement. However, at the level of each score (table 6), scores zero and three show almost perfect or substantial agreement, with score one showing substantial agreement and score two showing moderate or substantial agreement. This indicates that there is less agreement between observers over the actual locomotion score categories. This also shows that most disagreement is likely to be around scores one and two, and as such care should be taken when scoring animals believed to be in these categories. In veterinary practice, it is generally considered important to lift the feet of animals equivalent to the score two and three descriptors and treat them appropriately. Therefore, on an individual animal basis, where an observer is unsure if an animal is a score one or score two, we suggest that it may be worthwhile to take one of two options, with the aim of reducing the risk of missing lame animals: (1) score these unsure animals as a two, ensuring that they have their feet lifted and are treated if appropriate, or (2) create a new category of ‘unsure’, requiring a timely re-score.

The observers were all provided with training before watching the scoring videos. Although some evidence suggests that training can improve agreement for on-farm scoring systems,15 there is also some evidence to suggest that training may not lead to much improvement in intra-observer or inter-observer agreement for locomotion scores,16 17 but more scoring sessions, that is, more experience, may lead to improvements in inter-observer agreement.18 If observers had been given further experience of using the system, for example, a number of practice clips to score (with answers shown afterwards), this may have led to improved inter-observer agreement. This is supported by evidence indicating that experienced observers perform better than inexperienced observers.19

This scoring system has not been studied with farmer observers. This would be worthwhile future work. The observers used for this study were not a random sample, and this may be a limitation of the study. Due caution should therefore be taken when extrapolating results to the wider population. In particular, the clinicians selected were all experienced veterinary surgeons undertaking further qualifications. Less experienced clinicians (eg, new graduate clinicians) may not be as reliable. However, the veterinary students studied showed almost perfect intra-observer agreement, and only slightly lower inter-observer agreement than the researcher group (AC2: researcher value of 0.70 compared with a student value of 0.64), yet still substantial agreement with each other. However, at the level of individual locomotion scores (table 6), the students showed a slight trend towards lower AC2 values than the researchers and clinicians, suggesting that experience may lead to improved agreement on each specific locomotion score category.

The exact agreement between sessions was generally high (mean=66.0 per cent (SD 9.8) for all observers). However, the range is quite wide (40.0 per cent to 80.0 per cent) as there were a number of outliers that are likely to have skewed the results (eg, clinicians 1 and 2). This suggests that some observers are not as good as others, and perhaps before individuals use this scoring system in practice, they should test their own agreement (precision). The videos used in this study can be made into a package for this use, and if individuals find that their intra-observer agreement is poor, they may want to practise and train before reattempting the package with the aim of increasing their intra-observer agreement. Systematic bias between attempts could also be identified and controlled. Inter-observer agreement could also be assessed in the same way in clinicians working across the same farms to ensure that they are scoring similarly.
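As a consistency check, the mean and SD quoted above reproduce the 95 per cent CI reported for all observers in the Results, if that interval is the usual t-based interval for a mean over n = 24 observers. The paper does not state the method, so this is an assumption.

```latex
\bar{x} \pm t_{0.975,\,23}\,\frac{s}{\sqrt{n}}
  = 66.0 \pm 2.069 \times \frac{9.8}{\sqrt{24}}
  \approx 66.0 \pm 4.1
  \;\Rightarrow\; (61.9,\ 70.1)
```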

In the second scoring session, there was some evidence that some observers had systematic bias in how they scored, that is, they systematically increased or decreased their scores between sessions one and two. However, these changes were in different directions (some observers increased their mean scores, and some decreased them), and the mean changes were small. In the authors’ opinion, this bias is small and unlikely to have a detrimental impact on the assessment of the scoring system.

The video clips used were variable in length. The authors felt that this reflected on-farm locomotion scoring, where on occasions, scorers will need to score quickly. As all observers scored the same clips, and as it was possible to watch the clips as many times as required, the authors do not believe that this negatively affects the assessment of the scoring system.

The authors have now used the system for research purposes and added a fifth point20 to enable differentiation of severely lame animals from those that have non-weight-bearing limbs. However, this distinction was not considered clinically relevant, as a score 3 and a score 4 would both constitute severe lameness, warranting examination and suitable treatment. For practical use, the authors would recommend using the zero to three system described in this study.

There is some disagreement regarding the categories from Landis and Koch.14 Some suggest higher scores should be achieved before agreement is considered ‘substantial’ or ‘almost perfect’. For this reason, all values have been provided so that readers can interpret as required. However, in the authors’ opinion, the intra-observer and inter-observer agreement across the 40 video clips is considered acceptable when compared with similar studies in the literature.5 16 21–23


The authors would like to thank the farmers who allowed their cattle to be filmed, the clinicians, researchers and students who took part in this study, as well as Denis Duret and Bob Read at The University of Liverpool who provided assistance with the online platform.



  • Funding The authors would like to thank the Animal Welfare Foundation who funded this research under the Norman Hayward Fund (grant ref. NHF_2016_05_KM).

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Ethics approval University of Liverpool Ethics Committee (VREC533a).

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available on reasonable request. The videos created during this study are available on request.
