The gait scoring system developed by Manson and Leaver was used by five experienced observers to assess the gait of 83 milking Holstein-Friesian cows in a live recording session, and video recordings were made. The agreement between the scores of the observers at the live session, and between each observer’s scores at the live session and a video session, were compared at three levels of stringency. The scores of the observers were highly variable at all but the least stringent threshold – whether a cow had a score of less than 3 or 3 or more, that is, whether it was not lame or lame.
Lameness in dairy cattle is associated with pain, reduced dry matter intake and weight loss (O’Callaghan 2002), and milk yield decreases before clinical signs appear (Green and others 2002). There are also costs associated with treatment, infertility and culling. It is therefore important to detect lameness early, to improve the welfare of the cows and the profitability of the farm.
Lameness is usually first detected visually, without necessarily using a formal scoring system, by the herdsman, who will notify the routine hoof trimmer or, less frequently, a veterinarian. They may then use a scoring system to assess the cow’s abnormalities in gait or behaviour and assign it a lameness score. Alternatively, they may make an informal observation that the animal is not walking normally, although the use of formal scoring systems is increasingly being encouraged by assurance schemes. The cow may then be treated. The reliability of lameness scoring systems in cattle and horses has been questioned (Sprecher and others 1996, Winckler and Willen 2001, Engel and others 2003, Flower and Weary 2006), and the systems available all identify different characteristics as important factors for the detection of lameness (Whay 2002).
Scoring systems are subjective and in some cases the agreement between observers has been as low as 37 per cent (Engel and others 2003). Even the agreement between one observer’s observations can be as low as 56 per cent (O’Callaghan and others 2003). This ‘observer effect’ on the lameness score is particularly sensitive during the onset of lameness; trained observers may agree that a cow is either clinically lame or not lame, but agreement as to whether a cow is unsound may be much lower (Winckler and Willen 2001). In spite of this variability, locomotion scoring is the trusted and only practical way of assessing lameness (Whay 2002, Kujala and others 2008).
The root of the problem could be the qualitative nature of the scoring systems. Using criteria including ‘moderately lame’ and ‘slightly lame’ means that unless observers have viewed a large number of cows and developed agreed criteria, there can be no basis on which interobserver variability can be reduced. Some scoring systems give a more thorough description on which to base the score, for example, ‘The cow stands with level back posture and gait is normal’ (Sprecher and others 1996), but again the assignment of a normal gait is left entirely up to the observer. Without clear and objective descriptions, a scoring method is open to individual interpretation.
There have been few investigations of the reliability and repeatability of the system of locomotion scoring described by Manson (1986) and Manson and Leaver (1988). Manson (1986) showed that different observers agreed 89 per cent of the time, and repeat measurements by the same observer agreed 84 per cent of the time. The large number of categories and therefore high resolution of the system made it more acceptable for use in scientific research (Whay 2002). This study aimed to assess the reliability of the system by measuring the differences between different observers scoring the same herd and, by using video, to assess its repeatability by measuring the differences between the lameness scores assigned by the same observer to the same cow observed repeatedly. It also aimed to assess whether video might be used as a means of ‘remote scoring’ a herd, a method that could possibly have application for farm assurance schemes.
Materials and methods
The cows were scored and the videos were taken at a farm with 83 Holstein-Friesian milking cows. During the 12 months before the study, between 3 and 10 per cent of the cows had been treated for lameness each month.
The scoring system
The panel of five observers (Table 1) scored the cows by the Manson and Leaver system. This system uses a nine-point graduated scale with discrete 0.5 intervals, ranging from completely sound (1.0) to extreme difficulty in rising and other indicators of severe lameness (5.0) (Table 2). The system requires each cow to be observed for a prolonged period (30 seconds to one minute) and during a number of activities (including turning) (Whay 2002). The herd was small and milked by a single herdsman, and as a result many of the cows left the parlour singly and could be held back from the exit race easily. The layout of the parlour allowed the observers to view the cows walking straight for approximately 25 m before they turned a right-hand corner and walked away (Fig 1); they could therefore view the cows from the side and behind, and as they turned.
The live scoring session took place during afternoon milking. Each member of the panel was given a score sheet, a copy of the scoring system guidelines and some pictorial aids to help them make their decisions. The score sheet was designed to minimise ‘central bias’ (Engel and others 2003) by asking the observer to write down a number rather than mark a continuous scale. The sheet also included a space to assign an affected limb, if the observers felt confident that they could. The observers viewed the cows from the collecting yard, where they were free to walk around, and they were asked not to discuss their scores with one another. During the session a technician intervened when necessary to ensure that the cows passed along the walkway as uniformly as possible. The cows were identified by their freeze marks.
Filming took place from two days before to two days after the live scoring session, during the morning and afternoon milkings. The herd was filmed from the side and from behind with two video cameras (HDR-HC1 RGB 25 Hz; Sony). Because the collecting yard was in normal use during four of the five filming days, it was not technically possible to film the cows walking from the location where the observers stood (Fig 1). No artificial lighting was required. The videos were uploaded on to hard disk using Studio 7 (Pinnacle Systems), and then cut and synchronised into sections showing individual cows using VirtualDub (available to download from www.virtualdub.org).
The videos of individual cows were randomised and videos in which cows urinated, stood still or overlapped with one another in the video were discarded. To obtain cows with different levels of lameness, the cows used in the final video were selected on the basis of the scores they were assigned during the live scoring session. Cows with a mean score of less than 2 during the live session were designated to be sound; cows with a mean score between 2 and 3 inclusive were designated as unsound but not lame; cows with a mean score greater than 3 were designated as lame. The final video contained 30 cows, made up of 19 sound cows, nine unsound cows and two lame cows. The video contained no repeats because the freeze brands on the rear of the animal were conspicuous and it was felt that the observers would be able to recognise individual animals.
Every effort was made to use the videos from the live scoring session, but because some of these cows had to be excluded, four videos were used from milkings before and after the scoring session. These videos were taken within 48 hours of the live scoring session and none of the cows underwent any veterinary treatment during this time.
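The rule used above to designate cows for the final video can be written out as a short sketch. The function name and the example scores are hypothetical; only the cut-offs (mean below 2, between 2 and 3 inclusive, above 3) come from the text.

```python
from statistics import mean


def designate(live_scores):
    """Classify a cow from the mean of its live-session scores."""
    m = mean(live_scores)
    if m < 2:
        return "sound"
    if m <= 3:  # 2 to 3 inclusive
        return "unsound"
    return "lame"


# Hypothetical scores from five observers for three cows
print(designate([1.0, 1.5, 1.0, 2.0, 1.5]))  # mean 1.4
print(designate([2.0, 2.5, 3.0, 2.0, 2.5]))  # mean 2.4
print(designate([3.5, 3.0, 4.0, 3.5, 3.0]))  # mean 3.4
```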
The video was shown, on a computer, to each of the observers individually three weeks after the live scoring session. It was felt that this was adequate time to eliminate any problems with observers remembering a particularly lame or otherwise memorable cow. The video was less than 15 minutes in length, but the observers were allowed to view each cow as many times as they wanted. For the video session, the observers were given the same score sheets and pictorial aids as used in the live session.
During the video scoring, a technician recorded how many times each cow was viewed by each observer.
The data were analysed using SPSS v 15. The scores from the live session were used to test the agreement between the scores of the five observers (interobserver reliability), and the scores from the video session were used to test the agreement between each observer’s scores (intraobserver reliability). In order to assess whether the cows chosen for the video session were representative of the herd, the scores of the cows in the video recorded from the live session were also tested for the agreement between the scores of the observers.
The scores from all the observers were tested for normality using a Kolmogorov-Smirnov test applied at three thresholds: scores, T1 and T2 (Table 3). These thresholds were introduced to assess the level at which the observers could agree. For example, even if they disagreed on the absolute score, they might agree that the animal was lame. The mean score was used to represent the average for normally distributed data, and the median was used for data that were not normally distributed. The standard error of the mean or median was used to calculate 95 per cent and 99 per cent confidence intervals (CIs), which were used to detect significant differences at the 5 per cent and 1 per cent levels.
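The effect of the thresholds can be illustrated with a minimal sketch. The cut point for T2 (a score of less than 3 versus 3 or more) is taken from the text. The cut points for T1 are an assumption made purely for illustration, because Table 3 is not reproduced here. The function names and the example scores are hypothetical.

```python
from bisect import bisect_right


def collapse(score, cut_points):
    """Return the index of the band a score falls into.

    A score equal to a cut point is placed in the upper band.
    """
    return bisect_right(cut_points, score)


T2_CUTS = [3.0]       # from the text: <3 not lame, >=3 lame
T1_CUTS = [1.5, 3.0]  # ASSUMPTION: 1.0 sound, 1.5 to 2.5 unsound, >=3 lame


def proportion_agree(a, b, cut_points=None):
    """Proportion of cows on which two observers agree at a threshold."""
    if cut_points is not None:
        a = [collapse(x, cut_points) for x in a]
        b = [collapse(x, cut_points) for x in b]
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical scores from two observers for the same five cows
obs_a = [1.0, 1.5, 2.5, 3.0, 2.0]
obs_b = [1.5, 2.0, 2.0, 3.5, 1.0]

print(proportion_agree(obs_a, obs_b))           # exact scores
print(proportion_agree(obs_a, obs_b, T1_CUTS))  # assumed T1 bands
print(proportion_agree(obs_a, obs_b, T2_CUTS))  # lame or not lame
```

In this made-up example the two observers never assign the same score, yet they agree on every cow once the scores are reduced to lame or not lame.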
Agreement between observers at different thresholds during the live scoring session
A proportional agreement (PA) and Cohen’s kappa (κ) agreement statistic were used to check for significant agreement between observers at the three thresholds. In order to assess the suitability of κ as an agreement statistic for this study, its correlation with PA was investigated. A weighted kappa (κw) was calculated at the scores threshold to show whether the observers were scoring similar, if not identical, scores. κw gives exact agreements a value of 1, and disagreements that differ by one category or more are given a value of 1–d/(c–1), where d is the number of categories by which the observations disagree and c is the total number of categories. Thus, discrepancies of 0, 1, 2, 3 and 4 with nine categories in total are weighted 1, 0.875, 0.75, 0.625 and 0.5, respectively (Altman 1991).
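The three statistics described above can be sketched for a single pair of observers as follows. This is an illustration of the standard definitions, not the SPSS procedure used in the study; the function names and the example scores are hypothetical. The weights follow 1–d/(c–1) with c=9.

```python
from collections import Counter

# Nine-point Manson and Leaver scale: 1.0 to 5.0 in steps of 0.5
CATEGORIES = [1.0 + 0.5 * i for i in range(9)]


def linear_weight(d, c):
    """Agreement weight for scores that differ by d of c categories."""
    return 1 - d / (c - 1)


def agreement(a, b, categories=CATEGORIES):
    """Return (PA, kappa, linearly weighted kappa) for two observers."""
    n, c = len(a), len(categories)
    idx = {cat: i for i, cat in enumerate(categories)}
    joint = Counter((idx[x], idx[y]) for x, y in zip(a, b))
    row = Counter(idx[x] for x in a)
    col = Counter(idx[y] for y in b)

    def kappa(weight):
        observed = sum(weight(i, j) * k for (i, j), k in joint.items()) / n
        expected = sum(weight(i, j) * row[i] * col[j]
                       for i in row for j in col) / n ** 2
        # Undefined (division by zero) when expected agreement is 1
        return (observed - expected) / (1 - expected)

    def exact(i, j):
        return 1.0 if i == j else 0.0

    def linear(i, j):
        return linear_weight(abs(i - j), c)

    pa = sum(x == y for x, y in zip(a, b)) / n
    return pa, kappa(exact), kappa(linear)


# Hypothetical scores from two observers for the same four cows
obs_a = [1.0, 1.0, 2.0, 2.0]
obs_b = [1.0, 1.5, 2.0, 2.0]
pa, k, kw = agreement(obs_a, obs_b)
print(round(pa, 3), round(k, 3), round(kw, 3))  # 0.75 0.6 0.75
```

The single disagreement in this example is by one category only, so κw is higher than κ.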
The ability of the observers to assign an affected limb to an unsound animal was tested by recording whether the observers agreed on the affected limb. In order to assess whether it was easier to assign an affected limb to cows with higher lameness scores, the correlation between the score and the level of agreement was calculated.
Comparison between observers during the live scoring session
The observers’ scores from the live session were not normally distributed (Fig 2).
κ was shown to be correlated linearly with PA across all three thresholds (R²=0.811). There was no significant agreement (κ>0.7) at any of the thresholds measured. κw was significant at the scores threshold, but it was also statistically indistinguishable from 1 (Table 4); for this reason κw was not calculated at the other thresholds.
The distribution of κ values between the observers was indistinguishable from the normal distribution at all three thresholds. For this reason the mean values were used and the 99 per cent and 95 per cent CIs were calculated. The PA scores, that is, the number of agreements divided by the number of animals scored, were normally distributed at all three thresholds and the mean values were therefore used. For both κ and PA, there was a significant increase in agreement at each threshold (P<0.01 for κ between all thresholds and PA between T1 and T2, and P<0.05 for PA between scores and T1) (Fig 3).
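The summary of PA across observers can be sketched as below. The sketch assumes that agreement was computed for each pair of observers and that the CIs were built from normal quantiles; neither detail is stated in the text, and with few pairs a t-based interval would be wider. The observer labels and scores are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, stdev


def pairwise_pa(scores_by_observer):
    """PA for each pair of observers: agreements / animals scored."""
    result = {}
    pairs = combinations(scores_by_observer.items(), 2)
    for (name_a, a), (name_b, b) in pairs:
        agreements = sum(x == y for x, y in zip(a, b))
        result[(name_a, name_b)] = agreements / len(a)
    return result


def mean_ci(values, level=0.95):
    """Mean and a normal-approximation CI from the standard error."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95 per cent
    return m, (m - z * se, m + z * se)


# Hypothetical scores from three observers for the same four cows
scores = {
    "obs1": [1.0, 1.5, 2.0, 3.0],
    "obs2": [1.0, 2.0, 2.0, 3.0],
    "obs3": [1.5, 1.5, 2.0, 3.5],
}
pa = pairwise_pa(scores)
print(pa)
print(mean_ci(list(pa.values()), 0.95))
print(mean_ci(list(pa.values()), 0.99))
```

Two thresholds would then be judged to differ significantly when the corresponding intervals do not overlap, which is one reading of how the CIs were used here.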
Comparisons of the scores assigned by each observer at the live and video sessions
Each observer’s scores from the video session were compared with the scores they assigned to the same 30 cows during the live session. κ could not be calculated in all cases. The PA and κ scores were normally distributed and the mean values were used for both of them.
There was a significant increase in PA between T1 and T2 (P<0.01), but not between scores and T1 (Fig 4). κ increased significantly between T1 and T2 (P<0.05), but remained very low.
Comparisons of the levels of agreement of the scores assigned during the video session and the live session
In order to assess whether the cows chosen for the video session were representative of the herd, the levels of agreement between the scores assigned by the observers during the video session and the live session were compared. The PA scores were similar across all three thresholds and increased significantly between scores and T1, and T1 and T2, for the live observations of the whole herd and the video observations of the selected 30 cows (Fig 5).
Designating an affected limb on an unsound animal during the live session
Observer 4 did not select an affected limb on any of the unsound cows and was removed from this part of the analysis.
The other observers agreed about the affected limb on one cow, but in the other cows at least one observer did not feel confident enough in their decision to assign an affected limb, and in some cases the observers assigned different limbs (Table 5).
None of the observers assigned a forelimb as the affected limb, possibly because it is harder to do this confidently, or simply because none of the cows had an affected forelimb.
The threshold system
Using a system of decreasingly stringent thresholds allowed the data to be analysed in a way that may be more applicable to the use of the system in the field. For example, it may not be important for a research team or herdsman to assign exactly the same score consistently to an animal. It may be more useful to know whether the system can recognise unsound animals consistently, or simply detect clinical lameness reliably. The thresholds used were arbitrary in their boundaries, but it is reasonable to assign a score of more than 1 to an animal that is not sound. The threshold system implies that an animal that one observer thinks is not sound and another observer thinks is clinically lame is more likely to have something wrong with it than an animal that both the observers think is unsound but not clinically lame.
Live scoring session
The increase in agreement between the observers when fewer categories were used is to be expected (March and others 2007). Only when the threshold was set to lame or not lame did the observers agree on more than 80 per cent (88.3 per cent) of cases, and when using the actual scores the agreement was only 33.3 per cent, considerably lower than the 89 per cent reported by Manson and Leaver (1988). Although the κ value increased with each threshold, it showed significant agreement between two observers only at T2 (κ=0.79), suggesting that the only threshold reliable for this herd was T2, lame or not lame. The weighted κw suggested that although the observers could not agree on a score, the difference between their scores was rarely large, supporting the findings obtained with the threshold system.
The disagreement between the observers was probably due to several factors (Engel and others 2003) and not solely to the scoring system. Every effort was made to allow the observers to have the best chance of scoring the cattle consistently, but they were able to view each cow only once, and for a limited period of time.
Observer agreement between live and video scoring sessions
There was an increase in agreement as the thresholds became less stringent. The κ value remained low at all the thresholds, although this may have been an effect of the population of cows (see ‘Limitations of the κ statistic’ below). The observers’ video scores agreed with their live scores 86.7 per cent of the time at T2, but only 30 per cent of the time when the scores themselves were analysed.
The angles of view of the video cameras and the observers were different, and the results of the two sessions were therefore not directly comparable.
The low intraobserver reliability indicates that observers received differing visual indicators of lameness in each of the two sessions. By contrast, high intraobserver reliability in combination with low interobserver reliability would indicate that the same visual information was being received by the observers but the different observers were simply interpreting it differently. The results suggest that this was not the case, and that the observers either saw different indicators during the two sessions but interpreted them correctly, or were unable to interpret the information consistently. The second possibility seems unlikely for trained veterinarians.
It was not the purpose of this investigation to analyse which indicators were apparently more noticeable during the live and video scoring sessions, but it would be useful to know, to allow scoring systems to be designed around them. It is possible that the disadvantage of not being able to see the cows in three dimensions during the video sessions was offset to some extent by the ability to watch them several times.
Assigning an affected limb
The observers’ attempts to assign an affected limb would have been hindered by their inability to see the right flank of the animal until it walked away from them after turning the corner (Fig 1). Assigning an affected limb can be difficult unless a cow is lame, and it was therefore not surprising that the agreement between the observers was poor.
There was a non-significant correlation between the median score of the cows and agreement about the affected limb, presumably because it is easier to assign an affected limb to animals that are more obviously lame.
It was not possible to compare the observers’ assessments with observations of the cows’ feet. However, the main purpose of the investigation was to assess the level of agreement between the observers, and it has been shown that the agreement was poor.
Limitations of the κ statistic
The κ agreement statistic has been used to assess the levels of agreement between the observers, and between each observer’s live and video scores. One limitation observed in this study is that it does not work if matching categories are not represented in the population. For example, in the video session observer 1 did not score any of the cows 3.5, but did in the live session. This meant that no κ statistic could be calculated until the next threshold (if matching categories were represented there). This is mainly a limitation of small or skewed populations like that used in this study. Whether κ is appropriate for use with skewed data is not clear, and there are arguments in favour (Vach 2005) and against (Hoehler 2000). In this study, κ correlated well with PA, suggesting it was an appropriate indicator of agreement (Fig 6).
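The sensitivity of κ to a skewed population can be shown with a small sketch. The data are invented: 30 cows scored twice as lame or not lame, with two disagreements in each scenario. Taking the union of the categories used in either session is an assumption of this sketch, not something the study describes, and the function name is hypothetical.

```python
def pa_and_kappa(a, b):
    """Return (PA, kappa); kappa is None when it is undefined."""
    n = len(a)
    categories = set(a) | set(b)  # ASSUMPTION: union of both sessions
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1:
        return observed, None  # both sessions used one category only
    return observed, (observed - expected) / (1 - expected)


# Skewed herd: almost every cow is scored not lame in both sessions
live_skewed = ["not lame"] * 28 + ["lame", "not lame"]
video_skewed = ["not lame"] * 28 + ["not lame", "lame"]

# Balanced herd with the same two disagreements
live_balanced = ["not lame"] * 14 + ["lame"] * 14 + ["lame", "not lame"]
video_balanced = ["not lame"] * 14 + ["lame"] * 14 + ["not lame", "lame"]

print(pa_and_kappa(live_skewed, video_skewed))
print(pa_and_kappa(live_balanced, video_balanced))
```

Both scenarios give the same PA (28 of 30), but κ is slightly below zero for the skewed herd and about 0.87 for the balanced one, which is consistent with κ remaining low in a population containing few lame cows.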
Implications for cattle welfare
The Manson and Leaver scoring system is not the preferred choice of herdsmen and/or foot trimmers, and this investigation has shown that the system was highly variable at all but the least stringent threshold. However, this threshold may be acceptable for many herdsmen and foot trimmers who do not need to differentiate between scores of 1.0 and 1.5. Scientific researchers such as those developing automated lameness detection systems (Bicalho and others 2007), or wanting to detect preclinical lameness, should be aware that the system can be highly variable, as any ‘gold standard’ based on the system could be influenced by the variability of the scores. A less sophisticated scoring system may be more suitable from a herd management perspective, where small changes in locomotion are not as important as whether or not the animal is lame. Scientific research may require a higher resolution, but it is unlikely that scientific investigations would score the cows coming out of the milking parlour or be restricted in where the observers stood. Without these drawbacks the agreement levels might have been higher.
The industry should look to the scientific community for a more reliable means of detecting lameness by novel techniques (Kujala and others 2008), such as those used to detect lameness in horses (Audigié and others 2001, 2002, Pfau and others 2007). Flower and others (2005) showed that hoof pathologies have a quantifiable effect on the kinematics of walking cattle, but that study was limited to two hoof pathologies and gave no indication of the method’s reliability in detecting lameness.
The authors thank Neville Gregory and John Fishwick for participating in the study, and Defra for funding the study. AMW is the holder of a BBSRC Research Development Fellowship and a Royal Society Wolfson Research Merit Award.