Reliability of equine visual lameness classification as a function of expertise, lameness severity and rater confidence
  1. Sandra Dorothee Starke1 and
  2. Maarten Oosterlinck2
  1. 1School of Engineering, University of Birmingham, Birmingham, UK
  2. 2Department of Surgery and Anaesthesiology of Domestic Animals, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
  1. E-mail for correspondence; maarten.oosterlinck{at}


Visual equine lameness assessment is often unreliable, yet a full understanding of this issue is lacking. Here, we investigate visual lameness assessment using near-realistic, three-dimensional horse animations presenting with 0–60 per cent movement asymmetry. Animations were scored at an equine veterinary seminar by attendees with various expertise levels. Results showed that years of experience and exposure to a low, medium or high case load had no significant effect on correct assessment of lame (P≥0.149) or sound horses (P≥0.412), with the exception of a significant effect of case load exposure on forelimb lameness assessment at 60 per cent asymmetry (P=0.014). The correct classification of sound horses as sound was significantly (P<0.001) higher for forelimb (average 72 per cent correct) than for hindlimb lameness assessment (average 28 per cent correct): participants often saw hindlimb lameness where there was none. For subtle lameness, errors often resulted from not noticing forelimb lameness and from classifying the incorrect limb as lame for hindlimb lameness. Diagnostic accuracy was at or below chance level for some metrics. Rater confidence was not associated with performance. Overall, visual gait assessment may be unlikely to reliably differentiate between sound and mildly lame horses, irrespective of an assessor’s background.

  • lameness
  • gait analysis
  • horses


Lameness is the most common equine health problem in the UK, accounting for a third of all problems reported in the Blue Cross National Equine Health Survey.1 The ability to perform a visual lameness examination is hence a fundamental skill of a veterinarian.2 Visual lameness assessment marks the first step towards successful diagnosis and subsequent rehabilitation of a patient,3 4 where the horse is typically evaluated for notable movement asymmetry during trot. Several objective studies indicate that the most reliable kinematic lameness pointer is vertical movement asymmetry of the head for forelimb lameness5–7 and of the sacrum and tubera coxae for hindlimb lameness.5 7–11 Although the introduction of technology into the clinical setting enables capturing and quantifying movement asymmetry, this does not replace the diagnostic abilities and decision-making process of a skilled and experienced clinician.12 It therefore remains imperative to train our eyes and mind in the best possible way to correctly assess horses.

Incorrect visual lameness assessment can have profound effects on a patient’s outcome. The reliability of gait assessment has hence been of interest to the equine community for a long time.6 10 13–25 This body of work demonstrated that the task poses a significant challenge to practitioners, where especially subtle lameness is inherently difficult to classify and diagnose.14 21 However, to date, a full understanding of the issues surrounding incorrect visual lameness assessment is missing. Video recordings or live presentations of cases with spontaneously presenting or induced lameness, where movement cannot be stringently controlled, give only a rough approximation regarding classification accuracy. Early work to establish basic movement asymmetry detection abilities in humans using simple block animations estimated a detection threshold of 25 per cent asymmetry.20 Computation of movement asymmetry detection thresholds using signal detection theory estimated this threshold at close to 15 per cent.26 Yet the fundamental aspects of visual lameness assessment remain poorly understood: whether experience improves detection accuracy, how detection accuracy relates to the amount of gait asymmetry, how correct classification of forelimb and hindlimb lameness compare and how reliably sound and lame horses can be differentiated. The objective of this study was therefore to establish a theoretical competency baseline for visual lameness detection across veterinarians with various expertise levels, using strictly controlled, near-realistic three-dimensional (3D) computer animations of sound and lame horses.

Materials and methods


This study was conducted as part of the 20th equine veterinary seminar organised by the ‘Wetenschappelijke Vereniging voor de Gezondheid van het Paard vzw’. The event was held for equine veterinarians working in first-line practice as well as in referral centres and for students in equine veterinary medicine. It took place at the Faculty of Veterinary Medicine, Ghent University, Belgium, on February 3, 2018. A total of 89 available TurningPoint voting handsets were randomly distributed among the 120 event participants to engage in an interactive lameness classification session. Participation was voluntary and participants were free to not answer questions.


3D animations (figure 1 and online supplementary materials) of sound and lame horses at trot were created as part of a project developing an e-learning application. Animations were based on high-resolution, 3D motion capture data recorded from a Thoroughbred horse. Using the recorded movement trajectories, average joint angles of each limb segment were calculated across the left and right limbs and offset by 50 per cent stride duration, to create a perfectly symmetrical virtual trot cycle. These data were then used to generate the movement of a 3D horse model in MotionBuilder (Autodesk, USA). To create animations of lame horses, the vertical movement trajectories of head (forelimb lameness) and sacrum (hindlimb lameness) were modified according to detailed literature data.5 This resulted in the generation of high-fidelity 3D animation clips of trotting horses with 10–60 per cent movement asymmetry (in 10 per cent increments) between the two vertical movement amplitudes of the head or sacrum during each stride.

This asymmetry compares to normal subjective veterinary judgement as follows: in one study5 using the American Association of Equine Practitioners (AAEP) categorical 0–5 scale, an average of 44 per cent asymmetry in vertical head movement and 40 per cent in vertical sacrum movement corresponded to a score of 1 ("lameness is difficult to observe and is not consistently apparent, regardless of circumstances"). Further, 86 per cent asymmetry in vertical head movement and 60 per cent in vertical sacrum movement approximated a score of 2 ("lameness is difficult to observe at a walk or when trotting in a straight line but consistently apparent under certain circumstances (e.g. weight-carrying, circling, inclines, hard surface, etc.)").5 In another study, average general movement asymmetry of 30 per cent for head movement and 25 per cent for sacrum movement approximated scores of mild lameness (1/4), while 45 per cent for head movement and 37 per cent for sacrum movement approximated scores of moderate lameness (2/4), with overlap in ratings.6 10 The animated horses in this study can hence be considered to display mild to moderate lameness.

Horses presenting with forelimb lameness were shown in frontal view and horses with hindlimb lameness were shown in rear view. However, as in real life, the animations also displayed head movement from rear views and pelvis movement from frontal views. In addition to the different levels of lame horses, sound horses (here describing horses with symmetrical vertical head and pelvis movement) were rendered for each of the two views. For added realism, 10 different coat colours were applied to the animations at random. For each view, each asymmetry level was featured once and three animations were rendered showing a sound horse, resulting in a total of 18 clips, each comprising 10 strides.
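The construction of the symmetrical trot cycle described above can be illustrated with a minimal sketch. It assumes one stride of a joint-angle curve per limb, sampled at an even number of equally spaced time points; the function names and the sample values are hypothetical and are not taken from the authors' processing pipeline.

```python
def shift(curve, k):
    """Circularly advance a one-stride curve by k samples."""
    k %= len(curve)
    return curve[k:] + curve[:k]


def symmetrise(left, right):
    """Average left and right limb curves after removing the half-stride offset.

    Returns a (left, right) pair in which the right curve is an exact
    half-stride-shifted copy of the left curve.
    """
    if len(left) != len(right) or len(left) % 2:
        raise ValueError("curves must have the same, even number of samples")
    half = len(left) // 2
    aligned_right = shift(right, half)           # bring right limb in phase with left
    mean = [(l + r) / 2 for l, r in zip(left, aligned_right)]
    return mean, shift(mean, -half)              # re-apply the 50 per cent offset


# Hypothetical joint angles (degrees) over one stride, four samples per limb.
left_angles = [0.0, 1.0, 2.0, 3.0]
right_angles = [2.2, 3.0, 0.0, 1.0]              # roughly half a stride out of phase
sym_left, sym_right = symmetrise(left_angles, right_angles)
```

After this step both limbs follow the same averaged curve, half a stride apart, which is the property the text describes as a perfectly symmetrical virtual trot cycle.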

Supplementary file 1

Figure 1

Snapshot of animations used in the study. For examples of clips, please refer to online supplementary material 1.

Data collection and preprocessing

The session started with participants responding to six general questions (see online supplementary material 2) regarding their years of experience, work setting, case load, confidence in assessing forelimb and hindlimb lameness and whether they would more likely classify a horse as sound or lame if they are unsure. Following this, the lameness assessment session started, where the 18 clips were shown in random order using a list of random numbers generated in Microsoft Excel (simple randomisation). Participants had approximately 30 seconds to examine each video before submitting their vote. Clips ran in a continuous loop during this time.
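The clip order in the study was drawn from random numbers generated in Microsoft Excel. As an illustration only, the same kind of simple randomisation can be sketched in Python; the function name and the seed are assumptions made for reproducibility of the example, not part of the study protocol.

```python
import random


def presentation_order(n_clips, seed=None):
    """Return the clip numbers 1..n_clips in a random presentation order."""
    rng = random.Random(seed)        # dedicated generator so the order is reproducible
    order = list(range(1, n_clips + 1))
    rng.shuffle(order)               # simple randomisation: every order equally likely
    return order


# 18 clips, as in the study; the seed value is arbitrary.
order = presentation_order(18, seed=2018)
```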

Data were gathered on a laptop computer with dedicated software (TurningPoint), exported in spreadsheet format and analysed separately for horses presented in front view (detection of forelimb lameness) and rear view (detection of hindlimb lameness). Part of the analysis was conducted for data sets grouped by years of experience and case load exposure. Given the response distribution, years of experience was partitioned into ‘Student’, ‘<1 year’, ‘>1 year but ≤5 years’ (none of the participants selected this) and ‘≥6 years’. Similarly, case load exposure was partitioned into monthly case loads of ‘<3 to 10 cases’, ‘11 to 20 cases’ and ‘21+ cases’. Statistical significance was defined as P<0.05.

Lameness detection

For horse animations displaying movement asymmetry, the percentage of participants correctly classifying the horse animation as lame was calculated for each asymmetry level across all participants who voted on the horse. This was done (A) across all participants and (B) across participants grouped by the above categories. A Fisher’s exact test was run to examine—for each asymmetry level—whether the participant group (by years of experience and case load) had a significant effect on the proportion of participants assessing a horse correctly with respect to both presence of lameness and affected limb. For this purpose, participant counts were processed in IBM SPSS Statistics V.24 using 3 x 2 cross tabs. Reasons for classification errors were calculated across all participants as ‘false negative’ (lameness not detected) and ‘wrong limb’ (lameness detected but incorrect limb declared lame).
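The authors ran Fisher’s exact test on 3 x 2 cross tabs in SPSS. The sketch below shows how such a test can be computed for a table with any number of rows and two columns (the Freeman-Halton extension of Fisher’s exact test), by enumerating all tables with the observed margins. The function name and the example counts are hypothetical; they are not the study data, and the SPSS implementation may differ in detail.

```python
from math import comb


def fisher_exact_rx2(table):
    """Two-sided exact P value for an r x 2 table of counts.

    table: list of [correct, incorrect] counts, one row per participant group.
    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row_totals = [a + b for a, b in table]
    col1_total = sum(a for a, _ in table)
    denom = comb(sum(row_totals), col1_total)

    def prob(first_col):
        # Multivariate hypergeometric probability of one table.
        num = 1
        for r, a in zip(row_totals, first_col):
            num *= comb(r, a)
        return num / denom

    def tables(i, remaining):
        # Enumerate first-column counts consistent with both margins.
        if i == len(row_totals) - 1:
            if remaining <= row_totals[i]:
                yield [remaining]
            return
        for a in range(min(row_totals[i], remaining) + 1):
            for rest in tables(i + 1, remaining - a):
                yield [a] + rest

    p_obs = prob([a for a, _ in table])
    total = sum(p for p in map(prob, tables(0, col1_total))
                if p <= p_obs * (1 + 1e-9))   # tolerance for float ties
    return min(1.0, total)


# Hypothetical counts: [correct, incorrect] for three case load groups.
p_value = fisher_exact_rx2([[4, 6], [8, 2], [9, 1]])
```

Full enumeration is practical here because the groups are small; for large counts a library routine would be the more sensible choice.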

Soundness classification

For sound horses, percentage correct was calculated across all participants voting on the horse for each of the multiple clips showing a horse in front view and rear view. This was done (A) across all participants and (B) across participants grouped by the above categories. A Fisher’s exact test was run to examine whether the participant group (by years of experience and case load) had a significant effect on the proportion of participants assessing a horse correctly as sound. A Pearson χ2 test was run to examine whether the proportion of participants correctly classifying horses as sound differed between the assessment of forelimb and hindlimb lameness.
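A Pearson χ2 test comparing two proportions can be sketched as follows for a 2 x 2 table. This is an illustration under stated assumptions: no continuity correction is applied (the text does not say whether one was used), the function name is hypothetical, and the example counts are invented to mimic a 72 per cent versus 28 per cent split rather than taken from the study.

```python
from math import erfc, sqrt


def pearson_chi2_2x2(table):
    """Pearson chi-square statistic and P value (1 df) for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical votes on sound horses: [classed sound, classed lame].
chi2, p = pearson_chi2_2x2([[54, 21],    # front view (forelimb assessment)
                            [21, 54]])   # rear view (hindlimb assessment)
```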

Diagnostic accuracy

To examine diagnostic accuracy, four metrics were calculated which are commonly used to assess the ‘usefulness’ of diagnostic tests. This was calculated for data across all asymmetry levels combined and based solely on the correct detection of presence of lameness, irrespective of the affected limb. Sensitivity (the proportion of all lame horses correctly classed as lame) and specificity (the proportion of sound horses correctly classed as sound) were calculated as:

Sensitivity=TP/(TP + FN) and

Specificity=TN/(TN + FP),

where TP is true positives (lame horses classed lame), FN is false negatives (lame horses classed sound), TN is true negatives (sound horses classed sound) and FP is false positives (sound horses classed lame). Positive predictive value (PPV, the likelihood of a horse actually being lame if classified lame) and negative predictive value (NPV, the likelihood of a horse actually being sound if classified sound) were calculated as:

PPV=TP/(TP + FP) and

NPV=TN/(TN + FN).

For PPV and NPV, it is important to note that these metrics depend on the prevalence of lameness, here estimated as a 1:1 ratio of sound to lame horses.
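The four metrics can be computed directly from the counts of true and false classifications. The sketch below follows the definitions given in the text; the function name is hypothetical and the counts are invented for illustration, chosen to give a 1:1 ratio of sound to lame horses.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV from classification counts."""
    return {
        "sensitivity": tp / (tp + fn),   # lame horses correctly classed lame
        "specificity": tn / (tn + fp),   # sound horses correctly classed sound
        "ppv": tp / (tp + fp),           # classed lame and actually lame
        "npv": tn / (tn + fn),           # classed sound and actually sound
    }


# Hypothetical counts: 100 lame and 100 sound presentations.
metrics = diagnostic_metrics(tp=69, fn=31, tn=72, fp=28)
```

Because PPV and NPV are computed from counts in both the lame and the sound group, changing the ratio of sound to lame horses changes them even when sensitivity and specificity stay the same.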

Correlation with confidence rating

To examine the correlation between self-rated confidence and actual percentage correct on the task, percentage correct was calculated for each participant who had voted on 10 or more horses. Regression analysis was performed to examine whether increased confidence is associated with better task performance.
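A simple linear regression of percentage correct on self-rated confidence, together with the coefficient of determination, can be sketched as below. The text does not specify the regression model or the confidence scale, so an ordinary least squares line is assumed; the function name and the data points are hypothetical.

```python
def linear_fit_r2(x, y):
    """Ordinary least squares slope, intercept and R squared for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)   # squared Pearson correlation
    return slope, intercept, r_squared


# Hypothetical data: confidence rating and percentage of horses assessed correctly.
confidence = [0, 1, 2, 2, 3, 3, 4, 4]
percent_correct = [30, 55, 50, 60, 45, 65, 50, 60]
slope, intercept, r_squared = linear_fit_r2(confidence, percent_correct)
```

An R squared close to zero would indicate that confidence explains almost none of the variation in performance.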

Classification bias

For all participants who had voted for at least two sound horses in a given view, the percentage of correctly assessed horses was calculated. Classification bias was taken from the answer to the question, ‘If you are unsure whether lameness is present, would you more likely classify the horse as sound or lame?’ (Answers: Sound, Lame). An independent samples t test was run to examine the effect of classification bias on the correct classification of soundness.



Results

A total of 50 participants provided details on monthly case load exposure and years of experience, 44 provided a vote on confidence for forelimb lameness, 42 provided a vote on confidence for hindlimb lameness and 42 indicated a decision bias. Across individual clips, 18–31 participants who had also provided details about case load exposure and years of experience voted, with an average of 25 respondents voting per clip. A total of 31 participants provided a vote for 10 or more horses.

Lameness detection

For both forelimb and hindlimb lameness, increasing asymmetry resulted in a larger percentage of assessors correctly classifying the horse as lame in the correct limb (figure 2). For forelimb lameness, this increased from 17 per cent of all participants at the 10 per cent asymmetry level to 77 per cent of participants at the 60 per cent asymmetry level. For hindlimb lameness, this rose from 32 per cent of all participants at the 10 per cent asymmetry level to 70 per cent of participants at the 60 per cent asymmetry level.

Figure 2

Percentage of participants correctly classifying a horse as lame (x, grey) and those who also correctly determined the affected limb (o, black) for assessment of forelimb lameness (top) and hindlimb lameness (bottom).

Years of experience (figure 3, top) had no significant effect on the proportion of correct assessments at any asymmetry level for forelimb lameness (P≥0.149) and hindlimb lameness (P≥0.186). Case load exposure (figure 3, bottom) had a significant effect on the proportion of correct assessments at the 60 per cent asymmetry level for forelimb lameness only (P=0.014), where participants in the groups seeing 11–20 cases and 21+ cases per month substantially outperformed the group seeing less than 3 to 10 cases per month. Case load exposure had no significant effect on the proportion of correct assessments at any asymmetry level for hindlimb lameness (P≥0.199).

Figure 3

Percentage of participants, grouped by years of experience (top) and case load (bottom), correctly classifying a horse as lame (light shade) and those who also correctly determined the affected limb (dark shade). Experience: green—‘Student’, red—‘<1 year’, blue—‘≥6 years’. Case load: green—‘<3 to 10 cases’, red—‘11 to 20 cases’, blue—‘21+ cases’. 

For forelimb lameness, more than half (78 per cent) of participants correctly detected the presence of movement asymmetry (irrespective of the affected limb) at the 30 per cent asymmetry level (figures 2 and 3, light shading). For hindlimb lameness, more than half (75 per cent) of participants correctly detected the presence of movement asymmetry at the 20 per cent asymmetry level (figures 2 and 3, light shading). The asymmetry detection threshold can hence be approximated to around 25 per cent for forelimb lameness and 15 per cent for hindlimb lameness. This approximation is derived from standard signal detection methods, which define the detection threshold at the level where 50 per cent of stimuli are detected correctly. Here, we substituted this for the level at which 50 per cent of participants correctly detected lameness.

Reasons for error (figure 4) were similar between forelimb and hindlimb lameness for the lower (10 per cent) and higher (40 per cent +) ends of the asymmetry scale. For 20 per cent asymmetry, while most errors for forelimb lameness arose from false negatives, more than half of those for hindlimb lameness arose from selecting the wrong limb. Similarly, for 30 per cent asymmetry, all errors were due to selecting the wrong limb for hindlimb lameness, while less than two-thirds were attributed to this reason for forelimb lameness.

Figure 4

Reasons for error when assessing forelimb (filled bars) and hindlimb (open bars) lameness. Green—correct assessment, orange—incorrect limb selected, red—false negative (horse incorrectly declared sound).

Soundness classification

For horses that presented sound with 0 per cent movement asymmetry (figure 5), the percentage of participants classifying horses correctly was not significantly affected by years of experience (P≥0.461) or case load (P≥0.412). The percentage of participants correctly classifying sound horses as sound was significantly higher (P<0.001) for the assessment of forelimb lameness (average 72 per cent correct) compared with hindlimb lameness (average 28 per cent correct).

Figure 5

Percentage of participants correctly classifying horses as sound (mean±sd across three horses presenting sound in each view). Years of experience and case load coded by colour (please refer to figure 3 for reference). Black circles: average percentage across all participants who correctly classified horses as sound.

Diagnostic accuracy

Results for the four measures of diagnostic accuracy are shown in table 1 as a function of experience and case load exposure as well as across all participants. For forelimb lameness across all participants, sensitivity (0.69) and specificity (0.72) were similar. For hindlimb lameness, sensitivity (0.81) was substantially higher than specificity (0.28) due to the high number of false positives noted in the assessment of hindlimb lameness. Assuming a prevalence of lameness at a 1:1 ratio of sound and lame horses presented for evaluation, around 70 per cent of classifications as sound or lame across all participants can be expected to be correct for forelimb lameness (PPV: 0.71, NPV: 0.70) and around chance level at 56 per cent for hindlimb lameness (PPV: 0.53, NPV: 0.59).
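The dependence of the predictive values on prevalence can be made explicit by computing them from sensitivity, specificity and an assumed prevalence. The sketch below uses the rounded sensitivity and specificity reported in the text with a prevalence of 0.5 (the 1:1 ratio); the function name is hypothetical. Because the inputs are rounded, the results can differ from the reported values in the second decimal place.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV for a given prevalence of lameness (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Rounded values from the text, prevalence assumed at 1:1 sound to lame.
fore_ppv, fore_npv = predictive_values(0.69, 0.72, prevalence=0.5)
hind_ppv, hind_npv = predictive_values(0.81, 0.28, prevalence=0.5)
```

Passing a different `prevalence` shows how the same sensitivity and specificity would translate into predictive values in a population with more or fewer lame horses.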

Table 1

Measures of diagnostic accuracy

Correlation with confidence rating

There was no significant correlation between self-rated confidence in lameness assessment and actual performance on the task (figure 6) for forelimb (R2=0.13, P=0.066) and hindlimb (R2<0.001, P=0.978) lameness. The only notable divergence from this trend was for three participants with self-rated confidence of 0 in the assessment of forelimb lameness, who assessed only half the number of horses correctly compared with their peers.

Figure 6

Correlation between self-rated confidence and the percentage of correctly assessed horses (mean±se) across all participants. Grey circles indicate single data points.

Classification bias

Out of 42 participants who responded to the classification bias question, 10 (24 per cent) reported that they would classify the horse as normal and 32 (76 per cent) as lame if unsure. Bias had no significant effect on the percentage of horses correctly declared sound for assessment of forelimb lameness (P=0.739) and hindlimb lameness (P=0.806).


Discussion

Accuracy of visual gait assessment: evaluation of lame and sound horses

This study demonstrated that the accuracy of visual gait assessment is imperfect across an asymmetry range of 10–60 per cent, representing mild to moderately lame horses. For both forelimb and hindlimb lameness, movement asymmetry of 40 per cent or more was required for at least half of all participants to classify the horse as lame in the correct limb. This asymmetry level would be scored around 1/5 on the AAEP scale5 or 1–2/4 (mild to moderate).6 10 Only the group with the highest case load exposure (21+ cases per month) reached 100 per cent correct when classifying forelimb lameness with 50 and 60 per cent movement asymmetry. There was no evidence that years of experience or case load exposure affected performance, except for a significant effect of case load exposure on forelimb lameness assessment at 60 per cent asymmetry. Forelimb lameness is considered more common in veterinary practice than hindlimb lameness,4 which could explain this finding.

The ability to classify the correct limb as lame was comparable between assessment of forelimb and hindlimb lameness. This is somewhat surprising given that hindlimb lameness is generally considered to be more difficult to assess than forelimb lameness.4 6 8 10 19 Instead, the most apparent difference between forelimb and hindlimb lameness assessment was in the large proportion of ‘false positives’ (sound horses classed as lame) in the assessment of hindlimb lameness: participants frequently saw lameness where in fact there was none. This resulted in poorer specificity of 0.28 and a lower positive predictive value (PPV) of 0.53, compared with the same measures for forelimb lameness (0.72 specificity, 0.71 PPV). In effect, a PPV of 0.53 means that the likelihood of a horse actually being lame if classified lame is at roughly chance level (50 per cent) for hindlimb lameness, assuming a reasonable prevalence of lameness at presentation of 50 per cent (1:1 ratio of sound to lame horses). Errors made when evaluating lameness with smaller asymmetry near the detection threshold differed: for forelimb lameness, they often resulted from not noticing the lameness. For hindlimb lameness, they were often due to classifying the incorrect limb as lame. Due to simultaneous axial rotation and vertical movement of the pelvis,5 9 27 28 observation of pelvic movement for hindlimb lameness assessment is perhaps more complex than observation of the simpler head movement for forelimb lameness. This more complex pattern may increase the chances of falsely triggering the misperception of movement asymmetry as lameness. More attention should be given to these concepts when teaching students and young or inexperienced veterinarians about observation for hindlimb lameness.

Diagnostic accuracy across the full asymmetry range was disappointing, especially for the discrimination between sound and lame horses, and should serve as a warning sign regarding the predictive value of visual gait assessment, especially for subtle lameness.

This study provided further evidence that the correct identification of sound horses poses a challenge to assessors across the experience scale for assessing especially hindlimb lameness (average 28 per cent correct), but also forelimb lameness (average 72 per cent correct). These findings were unaffected by years of experience or case load exposure. Similarly, correct classification of sound horses was previously at chance level for students evaluating video recordings25 and reportedly posed a challenge during gait scoring for lameness assessment during lunging.24 These results highlight the importance that training should emphasise recognition of normal as much as abnormal movement patterns. Other domains have shown a positive impact of training such ‘normal’ patterns on subsequent discrimination skills.29

Visual asymmetry detection thresholds and implications for assessing subtle lameness

While this study did not allow for the repeat presentation of stimuli, which is usually done in signal detection, the general visual asymmetry detection threshold was approximated as the percentage of participants correctly detecting movement asymmetry (irrespective of the affected limb). For forelimb lameness, 78 per cent of participants recognised 30 per cent asymmetry and 27 per cent recognised 20 per cent asymmetry. The detection threshold (by definition the threshold at which 50 per cent of classifications are correct) is hence expected around the 25 per cent asymmetry mark. For hindlimb lameness, 75 per cent of participants recognised 20 per cent asymmetry and 44 per cent recognised 10 per cent asymmetry. The detection threshold is hence expected around the 15 per cent asymmetry mark. This lower threshold may arise from tubera coxae movement amplifying movement asymmetry through a ‘hip hike’,5 8 28 but may also be the result of the higher number of false positives. In light of existing asymmetry perception thresholds, the present study substantiates earlier estimates of limits to human asymmetry perception at around 25 per cent asymmetry20 and 15 per cent asymmetry.26
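The thresholds above lie midway between the two asymmetry levels that bracket the 50 per cent mark. As an alternative illustration, and not the authors' method, the crossing point can be estimated by linear interpolation between the two bracketing levels, using the percentages quoted in the text. The function name is hypothetical.

```python
def crossing_level(levels, percent_detected, criterion=50.0):
    """Asymmetry level at which detection first crosses the criterion.

    Interpolates linearly between adjacent asymmetry levels; returns None
    if the criterion is never crossed from below.
    """
    points = list(zip(levels, percent_detected))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < criterion <= y1:
            return x0 + (criterion - y0) / (y1 - y0) * (x1 - x0)
    return None


# Percentages of participants detecting asymmetry, as quoted in the text.
forelimb_threshold = crossing_level([20, 30], [27, 78])   # about 24.5 per cent
hindlimb_threshold = crossing_level([10, 20], [44, 75])   # about 11.9 per cent
```

Under this assumption the forelimb estimate stays close to 25 per cent, while the hindlimb estimate falls somewhat below 15 per cent; with only one clip per asymmetry level, either figure is a rough approximation.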

The above findings suggest that we cannot expect human observers, of any expertise level, to reliably classify subtle gait asymmetry correctly. Results demonstrate that for horses presenting with movement asymmetry of 20 per cent and below, it will be unlikely that visual movement asymmetry assessment can reliably classify or detect these cases correctly. This probably explains why at least in the past, horses presenting with around 18 per cent movement asymmetry were included in cohorts of horses classified as ‘normal’ as indicated by the mean and 1 sd.5 It is further important to note that during the observation of real horses during trot up, movement will be less consistent than in the animations of this study. Observation in the field may hence require more pronounced asymmetry than that reported here. For those horses that present with subtle lameness, the importance of this symptom has to be evaluated in the light of a complete and thorough examination, including diagnostic anaesthesia and the exacerbation of lameness, for example, through flexion tests, evaluation on the circle and even ridden evaluation. Objective gait analysis may assist in these circumstances to verify observations.

How frequently horses presenting with movement asymmetry in the region of 20 per cent suffer from clinically significant pathology is currently unknown, where the ‘normal’ range of movement asymmetry in horses is still an open question. When applying current thresholds from objective gait analysis, 72 per cent of 220 riding horses perceived sound by their owner exceeded thresholds for lameness.30 Similarly, only 25 per cent of horses considered sound by their owner were assessed as sound under all circumstances during a comprehensive subjective evaluation.31 In the future, differentiating between small movement asymmetry caused by pathology and that caused by other reasons (such as handedness and conformation) will remain one of the big challenges for both objective and observational gait analysis.

Classification of subtle lameness: issues and outlook

Results of this study may cause concern for some regarding the general reliability of visual gait assessment. Visual assessment has further been shown to be influenced by factors such as expectation bias17 and trotting speed.23 However, systematic training and practice may help overcome current limitations. A positive outlook was given by a study on human gait analysis, which achieved high interobserver agreement by selecting ‘observers [that] had taken the exact same visual analysis training course, had academic backgrounds which were very alike and also [had] a similar professional experience.’32 This suggests that by employing extensive and consistent training, interobserver agreement (and by extension the reliability of decisions) can reach very good levels.

The present study did not find a systematic association between self-rated confidence and actual performance. Hence, the confidence of a practitioner is not predictive of his/her assessment quality. In contrast to this study’s findings, veterinary students showed a strong correlation between self-assessed and actual performance.25 In the future, better opportunities for regular self-assessment may result in improved awareness of one’s own abilities and limitations.

Supplementary file 2

Supplementary file 3

Supplementary file 4


Acknowledgements

SDS thanks the Eranda Foundation for funding the LamenessTrainer project, through which the animations were developed, as well as her LamenessTrainer team Gregory Miles, Stephen May and Sarah Channon. MO thanks Jan Velghe and Dirk De Maerteleire for technical support with the voting system.



  • Funding This study was funded by the Eranda Foundation.

  • Competing interests None declared.

  • Patient consent Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.
