EQB's Published Heart Study a Wake-up Call for Believers

Post by Silver Deputy » Wed Sep 14, 2016 11:15 am

Going back to the late 1990s, when I was a young man and my father, a shopper of yearlings, used EQB’s heart service to streamline his choices, I have always been attracted to the clarity EQB’s data provide. Most methods for evaluating untrained young horses at auction don’t rely on data at all, or in the case of pedigree-related statistics, speak to the horse’s family, not the individual under consideration. The knowledge that one yearling’s left ventricular cross-sectional area in diastole (LVD) measures 2000 mm2 larger than another yearling’s potentially brings evaluation into much sharper focus.

Last year’s October 24 edition of the BloodHorse contained an article featuring EQB’s heart service. The backbone of the piece consists of explanations and background provided by EQB President Jeff Seder and Vice-President Patti Miller. They take pains to stipulate what their heart measurements can and cannot tell you, and how they have learned to leverage and interpret them better.

Although it had been a long time since I sat silently by listening to Patti’s tutorials, or read EQB’s promotional material, little in the piece was new to me. I hadn’t been aware, however, that EQB had converted an internal review of their data into an April 2003 article in the Journal of Equine Veterinary Science. While it would be natural to wonder if the research had been rendered largely obsolete by more recent investigations, Seder presents it as the go-to source for the most complete understanding of their work.

I had looked at EQB’s heart reports at the Saratoga yearling sales in days of yore imagining what would become of the corresponding horses, and here was the answer to that mystery, repeated thousands of times over, culminating in an ultimate verdict. From the enthusiastic nature of Seder’s citations in the BloodHorse article, it appeared that my pro-heart measurement hunch, born from compelling case studies and theory, would gain foundation when I became acquainted with the actual data.

The experience of reading the article was peculiar. The classic journal article arc is clear, with a hypothesis, data intended to support it, and follow-up data to expand its reach. But in the typical statistical study, the commentary almost seems an afterthought, provided more for the sake of completeness than to illuminate a mystery or set a tone for the paper. Even if the data do not speak for themselves, they do most of the work, and the issues that require explication concern subtleties. In reviewing this paper, however, the results I was reading about, and the ones I was seeing, ran on separate tracks. Omissions ruled the day, with so much that seemed to demand comment receiving none. Results that seemed to point to the failure of the hypothesis were hailed as a success. The disconnect for me did not consist of the interpretation of a table or two, but persisted for virtually the entire piece.

The paper begins by discussing the technical side of heart measurement and introduces dimensions represented by the abbreviations LVD, LVS, and SW. These are ways of looking at heart size. The reliability of the heart measurements, meaning that when hearts are measured to be a certain size, they basically are, is shown in several mathematical forms, all of which are frankly apparent from the first. The amount of space dedicated to this presentation is curious because low reliability would not invalidate any correlations found between heart size and performance; on the contrary, it could serve as an explanation for a lack of findings. Establishing high reliability is not necessary to secure the argument that better horses have bigger hearts.

Because Seder, Vickery III & Miller eventually adjust heart size based on age, sex, and weight, they establish the relationship of the y (heart size) to the x variables to show that this is proper procedure. The data leave little doubt that heart size does indeed need to be assessed in relation to these variables, meshing with intuition. However, the researchers’ analysis of the effect of age is poor and erodes trust in the overall quality of the work. First, they adduce R-squares of .98 for colts and .92 for fillies, relating age and heart size. This suggests a striking, maybe even a stunning, correlation. But the actual analysis performed is not what you would assume, and the discrepancy is not mentioned. The meaning is not that, as would go with the full sample size of 7434, over ninety percent of heart size for an individual horse can be predicted from age. What the researchers have done is to calculate average heart size for all colts and fillies of a given age, and then regress these averages on the months variable (the n is 16 for 16 months). This inflates the R-square and covers up the troubling fact that heart size is shown to stay essentially constant through spaces of months, and even to get smaller for fillies over some time periods!
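To see why regressing monthly averages instead of individual horses inflates R-square, here is a minimal simulation. Every number in it (the growth rate, the scatter, 400 horses per month, the 12-to-27-month range) is a made-up assumption for illustration, not a value from the paper.

```python
import random

def r_squared(xs, ys):
    """R-square of a simple least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

random.seed(1)
months = range(12, 28)            # 16 hypothetical age brackets, in months
ages, hearts = [], []
for m in months:
    for _ in range(400):          # hypothetical 400 horses per bracket
        ages.append(m)
        # hypothetical growth of 20 units per month plus large individual scatter
        hearts.append(4000 + 20 * m + random.gauss(0, 500))

# Analysis A: every horse is a data point
r2_individual = r_squared(ages, hearts)

# Analysis B: one averaged data point per month (n = 16)
mean_by_month = [
    sum(h for a, h in zip(ages, hearts) if a == m) / 400 for m in months
]
r2_of_means = r_squared(list(months), mean_by_month)

print(f"R-square, individual horses: {r2_individual:.3f}")
print(f"R-square, monthly averages:  {r2_of_means:.3f}")
```

Averaging strips out the horse-to-horse scatter within each month, so the second figure comes out far higher than the first even though the underlying relationship is identical.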

The illogical nature of these results points out that the research methodology for capturing the trend is flawed. The data are cross-sectional, with each month’s bucket containing different horses. What likely happened, since the sample size is well into the hundreds for almost all months, is that the horses evaluated in some age samples were simply better than those evaluated in other age samples. Because sales occur at different times of the year, and heart measurements are often taken in advance of a sale, the individual age brackets undoubtedly had different proportions of sales affiliations. Sales differ in quality, leading to a difference in quality in the age brackets. This problem results from using a cross-sectional approach when a longitudinal one is needed. A longitudinal analysis would have measured the same horses at every time point. Yet Seder, Vickery III & Miller explain their counterintuitive growth chart as reflecting puberty, training, or drug regimens at different ages, while glossing over, if not outright ignoring, the cross-sectional nature of their data and the problems it produces.

This error has limited implications for the rest of the study, because the growth curves do not factor into the ultimate adjustment of heart size for age. Instead of adjusting for age, sex, weight, and year measured (I’ll get to this one shortly) through the method of covariation so all horses could be compared on the basis of their heart size without the extraneous variables, the vast majority of the possible comparisons were instead thrown out for each horse, with only horses of the same age, sex, weight, and year measured remaining (with a small range allowed for age and weight, obviously). With 16 ages, two sexes, say five weight groups, and two-and-a-half time intervals, the sample was divided by around 400 as standard practice (but the more common a horse’s age, weight, and year measured were for the study, the fewer reference horses were thrown out). So instead of gauging heart size in relation to 7,433 other horses, often only a few other horses served as sources of comparison. This conservative approach was costly and unnecessary, since the relationship between the nuisance variables and heart size could have been ascertained from the data.
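The "divided by around 400" estimate is simple multiplication, sketched here. The five weight groups and two-and-a-half time intervals are the rough guesses given above, not figures from the paper.

```python
ages = 16            # monthly age brackets
sexes = 2
weight_groups = 5    # rough guess ("say five")
year_windows = 2.5   # rough guess for the number of time intervals

cells = ages * sexes * weight_groups * year_windows
total_horses = 7434
avg_per_cell = total_horses / cells

print(cells)                    # 400.0
print(round(avg_per_cell, 1))   # 18.6
```

That is only an average of about 19 horses per reference group; horses with an uncommon age, weight, or measurement year would have been compared against far fewer.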

The approach probably has the effect of weakening rather than distorting, and is therefore not a fatal error, with the large original sample size the saving grace. The departure from best practice is dramatic, however, leaving the impression of amateur, not expert work. The study thanks biostatistician J. Richard Trout, PhD, at its conclusion. I don’t know enough about the work biostatisticians usually do to judge whether it is a good fit for the challenges of this analysis, but I can say that research psychologists habitually deal in similar datasets, and you will not find a published study in that field that eschews covariation for this “bucket” approach. It is at best a crude solution, needlessly curtailing sample size.

Perhaps the researchers believed their work would have the most credibility if methods were kept as simple as possible. I also suspect the Journal of Equine Veterinary Science has more of a veterinary orientation than a statistical one. But the mystique of EQB rests on the notion that they operate on a higher level than the typical thoroughbred advisor; they join horsemanship with scientific expertise. When they water down their research, this claim is compromised.

Returning to the decision to limit comparison horses to those measured within a year of the subject, EQB cites “the possible effects of gradual small changes in calibration, methodology, and external variables” (such as the rate of steroid usage) as the rationale. I can well believe that an LVS of 4500 mm2 in the year 1995 warrants a different interpretation than an LVS of 4500 mm2 in the year 2000. It’s a reasoned hypothesis. However, unlike with age, sex, and weight, the data that support controlling for year measured are not presented. An analysis such as the average heart size recorded for each calendar year would be necessary to justify the “time measured” restriction. The additional condition, rather quietly introduced, strikes me as one that may have moved a non-significant result between heart size and performance into the significant category. That in itself would not mean the step was inappropriately taken, but it is necessary to demonstrate that there was an actual initial problem in comparing spaced measurements at face value.

Once each horse had its own reference group, its heart size was ranked within that group and converted to a percentile rank. But subsequent analyses are not conducive to percentile data, and the move is a bit like using the oven to dry clothes: you can get away with it, but you’re not using the product correctly or in the way it was intended. It’s hard for me to believe that anyone with a thorough understanding of the mathematics involved could stomach taking this step. As I tried to work with the percentile data, I ran into illogicalities and dead ends. In a preemptive vein for which they deserve credit, Seder, Vickery III & Miller accurately characterize the problem from a technical standpoint, but say that the average reader’s greater familiarity with percentiles constituted a worthy tradeoff, and that results were effectively the same when mathematically sound standardized scores (z-scores) were used instead.

This example probably is illustrative of the authors weighing the relative value of simplicity versus accuracy differently than I do. Certainly, there is much to be said for simplicity in this debate, and if that were the crux of the difference in our methodological approaches, there wouldn’t be much to comment upon. The disagreement runs deeper, however.

If the percentile transformation does in fact lead to incoherence, as I contend, why did it yield the same pattern of results as the z-scores? The unreliability of the interpretations throughout the paper means that I cannot just take the authors’ word that results were unaffected, but looking past this, I see two reasons. The first is that for capturing means, percentiles should work quite well. A large heart by percentile rank will be a large heart by standard score. The serious problems only come in for significance testing. The second point is that a standard score approach could have problems of its own unique to the extreme segmentation of these data. Standard scores could come out as screwy outliers in groups of two horses, five horses, etc. Based on my experience, I am actually strongly inclined to believe the authors did indeed dodge a bullet with the percentile rendering, meaning that it led to the right conclusions. But I can’t say I take the standard score results as the measuring stick of this.
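As a small illustration of how standard scores can misbehave in tiny reference groups, consider groups of just two horses. The heart areas below are hypothetical, and the sketch assumes the sample standard deviation is used.

```python
import statistics

def z_scores(values):
    """Standard scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Two hypothetical reference groups of two horses each (areas in mm2)
close_pair = z_scores([4000, 4010])   # hearts nearly identical
wide_pair = z_scores([3000, 6000])    # hearts wildly different

print([round(z, 3) for z in close_pair])   # [-0.707, 0.707]
print([round(z, 3) for z in wide_pair])    # [-0.707, 0.707]
```

In a group of two, the scores are always about -0.71 and +0.71 no matter how far apart the hearts are, so the standard score carries no information about the actual size gap.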

If neither conversion system was a good choice, you might ask what would have been. The answer, again, is that covariation should have been employed, not segmentation.

Groups were also formed to capture racing performance. The idea was to compare good horses and not-so-good horses, and then compare their hearts. The good horses (high earners) were those with $10,000 or more in earnings-per-start, and the bad horses (low earners) those with $2,000 or less in earnings-per-start (since the horses were offered at auction from 1995-2000, the scale of earnings was probably lower than it would be today).

In defining their groups the way they did, Seder, Vickery III & Miller employ an infamous technique, held up for opprobrium in statistics classes the world over. Turning continuous variables (here, earnings) into categorical ones (earning groups) is in itself discouraged, but antennae really go up when the data are divided into only high-performing and low-performing groups. This is the oldest trick in the statistical book. It is what you do if you want to make your effect seem greater than it is. The general correlation between variables will explode as a consequence. The authors never give the basic statistic we want, which is the correlation between heart size and performance, for all horses. Instead, the middle 53% of the dataset by earnings, those with between $2,000 and $10,000 a start, are thrown out.
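The inflation from an extreme-groups design can be shown with simulated data. In this sketch the "true" correlation of 0.2 and the sample of 20,000 are arbitrary assumptions, not estimates from the paper; only the 34% / 53% / 13% split mirrors the groups described above. Unlike the paper, it keeps earnings continuous within the retained groups rather than converting them to categories.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(7)
true_r = 0.2     # hypothetical population correlation, NOT from the paper
n = 20000
heart, earnings = [], []
for _ in range(n):
    h = random.gauss(0, 1)
    e = true_r * h + math.sqrt(1 - true_r ** 2) * random.gauss(0, 1)
    heart.append(h)
    earnings.append(e)

r_all = pearson_r(heart, earnings)

# Keep only the bottom 34% and top 13% by earnings, dropping the middle 53%
ranked = sorted(zip(earnings, heart))
low_cut, high_cut = int(0.34 * n), int(0.87 * n)
kept = ranked[:low_cut] + ranked[high_cut:]
r_extremes = pearson_r([h for _, h in kept], [e for e, _ in kept])

print(f"all horses:          r = {r_all:.3f}")
print(f"extreme groups only: r = {r_extremes:.3f}")
```

The second correlation comes out higher than the first purely because the middle of the earnings distribution was discarded; nothing about the horses changed.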

What I have written so far is merely preamble. While, with the benefit of hindsight, the flaws and irregularities of the introduction and methods were as ominous as a horse climbing early in a race under a hard ride, I believed the crudeness of the research did not necessarily speak to the importance of the variable under investigation. So it was with full suspense that I finally came to the unveiling in Table 9.

This shows that, for the three measures of heart size, high earners averaged to be in the 53rd percentile (53.12 on the LVD or diastole measure, 52.72 on the LVS or systole measure, and 53.29 on wall thickness). Very good horses would seem to differ little then from the general thoroughbred population. But Heather Smith Thomas quotes Seder claiming much more in the BloodHorse profile.

"One of the first things we learned in the Olympic sports-medicine movement was that elite athletes were as different physically from normal people as sick/injured people are from normal people.

"They had to build databases for human athletes, and I had to do that for horses. Everything in the textbooks was about normal horses or diseased or injured horses and it didn’t apply because these athletes are physically different."

If the average star runner has a heart at the 53rd percentile, it stands to reason that some of these horses must have very large hearts, and some very small hearts. The standard deviations, ranging from 27.17 (SW) in percentile units to 28.43 (LVS) in percentile units, confirm this. Standard deviations usually supplement means and allow us to infer the percentage of individuals that had scores above or below certain values. But with the data already converted to percentiles, the standard analysis no longer applies. This problem has larger implications, as significance testing, in a way the lifeblood of the paper, requires the viability of standard deviations. But for now, it is adequate to note that the mean/SD combination quoted makes it clear that many horses with large hearts and many with small hearts are in the distribution. For comparison, the average percentile heart among all 7434 horses measured would of course be 50, and if all percentiles were represented equally, the standard deviation would be 28.87. So not only is the average heart size among the very good horses only 3.04 percentile units higher, their spread, as represented by standard deviation, is 97% as much. For general size of heart, Table 9 tells you everything you need to know: good horses’ hearts look like all horses’ hearts.
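The benchmark figures in this paragraph can be checked in a few lines. The 28.87 value is the standard deviation of scores spread evenly from 0 to 100 (100 divided by the square root of 12); the means and standard deviations are the Table 9 values as quoted above.

```python
import math

uniform_sd = 100 / math.sqrt(12)
print(round(uniform_sd, 2))                 # 28.87

high_earner_means = [53.12, 52.72, 53.29]   # LVD, LVS, SW
avg_edge = sum(high_earner_means) / 3 - 50
print(round(avg_edge, 2))                   # 3.04

for sd in (27.17, 28.43):                   # smallest and largest SDs quoted
    print(round(sd / uniform_sd, 2))        # 0.94, then 0.98
```

The two quoted standard deviations come to roughly 94% and 98% of the evenly spread benchmark, which brackets the 97% figure given above.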

Data on unsuccessful runners (although, it should be noted, representing 34% of runners compared to 13% for the > $10,000/start group, not the mathematical obverse of successful runners) are compiled, too, and significance tests comparing the groups are the bedrock of the argument for heart size’s importance. The percentile ranks for the low earners on the three size variables cluster around 46 (45.93 – 46.45). Again, standard deviations approach the 28.87 expectation for a random sample and do not suggest any meaningful clustering at sizes lower than those found among the high earners.

Now, EQB’s position, as per the BloodHorse article, seems to be that there are more good hearts than good horses, as heart is only one prerequisite for success. While I don’t see how high earners’ and low earners’ hearts could truly have such asymmetrical profiles, the low earner data does not directly contradict the role EQB says heart size plays in racing success the way the high earner data does.

Through significance testing (p-values are shown to be below .001), the authors establish that high earners have bigger hearts on all three variables than low earners. The unanswered question is, how much bigger? Establishing the size of differences, not just that they exist, is a standard part of any analysis, with established tools that enable comprehension and allow comparison from analysis to analysis. Cohen’s d, for instance, ranges from 0.22 to 0.26 for the three measures of heart size. A d of 0.20 is needed for an effect to even be considered small. It has to rise to 0.50 to be moderate, and 0.80 to be large. So these differences fall far short of the moderate level. Hopefully, this result simply extends your intuitive sense of one group having an average percentile rank of 53, and the other having an average percentile rank of 46.
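For anyone who wants to see how Cohen's d is computed, here is a minimal sketch. The inputs are round, illustrative numbers near the means and standard deviations quoted above, not exact Table 9 values, and the pooled standard deviation is taken as a simple average of the two variances, which assumes the groups are weighted equally.

```python
import math

def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d, pooling the two variances with equal weight (an assumption)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd

# Illustrative inputs: high earners near the 53rd percentile, low earners near
# the 46th, both with a standard deviation of about 28 percentile units.
d = cohens_d(mean_a=53.0, sd_a=28.0, mean_b=46.2, sd_b=28.0)
print(round(d, 2))   # 0.24
```

A gap of about seven percentile points against a spread of about 28 is a quarter of a standard deviation, which is where the 0.22 to 0.26 range comes from.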

We can also take the means and standard deviations and turn them into a correlation coefficient. This is helpful, because many of you are comfortable with the correlation coefficient’s meaning and scale. The correlation between earning group and heart size, depending on the variable, ranges from .11 to .13. Square those values, and we learn that heart size predicts about 1% of the variance in earnings. And this isn’t among all horses, remember, but just among a group of high earners and low earners, a setup designed to increase the size of effects.
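The conversion from Cohen's d to a correlation can be sketched as follows. This uses the common shortcut formula that assumes two equal-sized groups; that is my assumption about how figures like these are obtained, and with unequal groups the correlation would come out slightly lower.

```python
import math

def d_to_r(d):
    """Convert Cohen's d to a point-biserial r, assuming two equal-sized groups."""
    return d / math.sqrt(d ** 2 + 4)

for d in (0.22, 0.26):
    r = d_to_r(d)
    print(round(r, 2), round(r ** 2, 3))
# 0.11 0.012
# 0.13 0.017
```

Squaring the correlations gives between 1% and 2% of variance explained.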

Just as is the case with significance testing, the mathematical precision of effect sizes is compromised by the percentile basis of the data. I cannot conceive of a situation, however, where the percentile formulation somehow led to means and standard deviations that did not give an accurate idea of the magnitude of the real differences.

Defenders are sure to point to the minuscule p-values as evidence of heart playing a large role in performance. But experienced analysts know to differentiate between significance, which is the mere establishment of an effect greater than 0, and effect size. These are separate questions. A link does not have to be strong to produce a low p-value; it can just have a large sample size behind it. That is the case here.
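A quick sketch shows how the same small effect yields very different p-values depending on sample size. The group sizes are hypothetical, and the p-value uses a normal approximation rather than an exact t-test.

```python
import math

def two_sided_p(d, n_a, n_b):
    """Approximate two-sided p-value for a standardized mean difference d."""
    z = d * math.sqrt(n_a * n_b / (n_a + n_b))
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.24   # a small effect, in the range discussed above

p_big = two_sided_p(d, 1000, 1000)   # hypothetical large groups
p_small = two_sided_p(d, 30, 30)     # hypothetical small groups

print(f"n = 1000 per group: p = {p_big:.2g}")
print(f"n = 30 per group:   p = {p_small:.2g}")
```

With a thousand horses per group the effect is significant far beyond the .001 level; with thirty per group the identical effect would not come close to significance.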

Strangely, Table 9 compares high and low earners not just on heart size but on physical size, as represented by Weight, Height, and HTWT (HTWT is described as a product of the first two, but, as Tables 11 and 12 show, has a much higher correlation with Weight than with Height). This may be strange, but is extremely useful for gauging the effect size of heart. In terms of Weight, high earners beat low earners 60.11 to 47.25 (scores reflecting percentiles again); on Height, they had a 65.81 to 53.47 edge; and on HTWT, they were bigger on average, 58.53 to 45.32. Cohen’s d places the difference between the groups at 0.45 for Weight and Height, and at 0.47 for HTWT. To review, 0.26 (SW) was the biggest effect size among heart variables.

The small difference in heart size between groups could be explained as being revelatory of the thoroughbred industry. The argument would go that the method of identifying a good horse at a yearling or 2-year-old sale is so little understood that the scale of advantage demonstrated in the heart analysis compares favorably with alternate tools. But the height and weight difference between high and low earners effectively puts this contention to bed.

Not only does this show that another variable distinguishes good and bad runners more effectively than heart does, but the elemental nature of the other variable suggests that mean percentiles for high earners around 60 just scratch the surface of what would be possible with an assessment tool reflecting the most informed thinking in the industry. The performance of height and weight as predictors may show the very opposite of the initial hypothesis: high and low earners may differ starkly on many variables. As bloodstock analysts, we either greatly underestimate how important the most basic horse characteristic is, or more holistic ratings of conformation and athleticism would, as expected, differentiate good and bad horses much more dramatically than size does. Heart’s role in an overall assessment, then, would be less central still.

So that you can have a complete understanding of EQB’s method and interpret the results for yourself, one detail that I should explain is that the size percentiles had different reference groups than the heart percentiles. Making a fine reading of the study, it appears that size rankings required the same sex, 30-day-age proximity, and year of measurement for comparison horses, but did not restrict the group to horses of similar weight. So we need not confront the oddity of weight predicting racing performance when it had little variation.

Another point to consider is that, without the weight restriction, the size reference groups must have been larger, meaning that size was probably represented more accurately by the percentile scores than heart was. Some of size’s better performance as a predictor of racing performance can then perhaps be attributed to this. But the onus is on the principal investigators here to demonstrate the prepotency of their heart variable; we cannot assume it would emerge with a more reliable measure. And certainly the representation of the heart effect overall seems bolstered by a number of decisions motivated less by the superiority of the particular approach and more by obtaining the desired outcome. We will continue to encounter these cases.

While the first EQB heart assessments I was privy to had the separate LVD, LVS, and SW dimensions, they seemed to converge in a single rating of the horse’s cardiovascular standing. Later, the picture seemed more complicated, and we were told certain horses had ideal sprinters’ hearts, for instance.

The value of considering performance in relation to distance does emerge in the paper. The difference could be illusory, simply reflecting an arbitrary definition of groups. But taking it at face value, the tendency of high earners to have above-average hearts (a 53rd percentile average rating, you will remember) seems not to hold among sprinters (percentile means of 49, 48, and 52 for LVD, LVS and SW, respectively). The good news is that, when sprinters are taken out of the whole group of high earners, the average rating rises for the rest of the horses, to the 57th and 58th percentiles. In the terminology of the statistics and psychology fields with which I am familiar, the overall effect of heart size is qualified by sprint/route proclivity.

We are left again considering effect size and whether, if we subset to routers, heart size does indeed have practical value. Taking the high earner mean and standard deviation for LVS, the variable with the highest mean percentile among routers, and comparing them to the previously defined low earners, the effect size improves to 0.41. However, from three perspectives, this number still looks meager. First, the effect size guidelines still place it in the small range. Second, it is still not a better discriminator of very good and ordinary horses than a simple rating of a horse’s size. Third, the effect that does emerge is still only achieved with the discredited technique of comparing extreme groups (> 10k/start, < 2k/start).

I’m also not convinced that routing inclination really is the driving force behind the group labeled routers having bigger hearts than the group labeled sprinters. It seems entirely possible that the routers were simply superior horses: the glory in racing is in routing, and while both groups had to fulfill a 10k earning/start requirement, the routers on average could have exceeded this rate by more than the sprinters. The EQB-termed routers might also be better because no cap was put on their sprinting success: routers had to have made over $10,000/start with a three-route-minimum sample, with no attention paid to their sprint record. Their success sprinting could have been at any level, including outpacing their success routing. (This is why I can’t simply call them “routers.”) There is a chance the sprinting group really was better sprinting than the routing group (after all, the routing group had no requirement of sprinting success), but it is almost certain its liabilities routing exceeded the routers’ liabilities sprinting: to qualify for the sprint group, sprinters had to be under $2,000/start in their route races. Striving to be fair, however, this alternative hypothesis explaining the routers’ advantage in heart size in terms of greater ability acknowledges the link between heart and ability in the first place, something that is hardly incidental to the study. So, again, I don’t doubt the link, only that it is large enough to be important.

The greater correlation between heart size and performance with routers comes with numerous other qualifications. First, Seder, Vickery III & Miller address only horses who made > $10,000 a start. It doesn’t require significant cynicism to suppose that this is neither randomly chosen nor thematically considered, and that the distance effect did not emerge with quotidian horses. More importantly, the routers are not just contrasted with sprinters, but with rank sprinters: their sprint success (> $10,000 a start) came in races under 7f, and their route difficulties were extreme, as they earned < $2,000 a start routing. Most good horses have some facility sprinting and routing, but these horses did not.

It is important to examine exactly what is found rather than just accepting the implication that routers have bigger hearts than sprinters of the same ability. The actual findings are not applicable to most horses one would actually be evaluating. At best, the findings need to be extended to a broader group, and at worst, the peculiar stipulations that were instituted reflect that attempts to do this failed.

The same table that shows that horses that did well only in races under 7f have smaller hearts (but not necessarily thinner heart walls) than horses whose success included routes shows the same trend with physical size. To be more specific, the HTWT percentile advantage of routers over sprinters was 10.21 on average, with a 0.39 d. This compares to a 10.49 router/sprinter differential, 0.37 d for LVS, and an 8.29 differential, 0.29 d for LVD. Mean percentiles for Weight, Height, and HTWT among high-earning routers were 64.12, 69.51, and 63.06.

It would be tempting to ascribe routers having bigger hearts than sprinters as simply a natural consequence of their being bigger overall. But since the heart size ratings controlled for physical size, the two findings are best considered independently. If there is a connection, it is again that the definitions of routers and sprinters may have resulted in the routers being better. Even more so than for heart, physical size is correlated with racing success.

The course of a research paper can be compared to a horse’s career, which typically begins with an anonymous maiden special weight and progresses to more decisive tests. While I would argue that the meaning of this study’s descriptive heart statistics (mean and standard deviation) is dispositive, the greater apparent complexity of a statistical model, and its more immediate applicability, make it the equivalent of a first big stakes test for a horse, holding much more interest for most readers than the preliminaries. As I see it, the twist that the statistical model puts on the basic descriptives in this case is twofold: first, it leverages the individual variables to capture their combined power; second, it frames the critical question in the manner the audience naturally asks it, in terms of racing performance resulting from heart size, and not the other way around.

The paper does not provide nearly enough detail to derive a complete understanding of the inner workings of the statistical model, and categorical modeling is also not my strongest subject. But the general approach is as follows. Among the heart dimensions (LVD, LVS, SW, and the redundant “Stroke Volume”) and the size variables (HTWT, etc.), the weights that lead to the best predictions for “high earner”/“not high earner” are found. Each horse’s actual status is known, which can be simplified as “1” or “0.” And then alongside this is a prediction for the horse. This would initially not be a “1” or a “0” but would be a percentage within that range, depending on how the horse’s heart and size characteristics placed its chances. The final weights are those that minimize the cumulative differences between actuals and predicteds. The sample again is all high earners and low earners, with horses whose earnings-per-start ranged from $2,000 to $10,000 excluded.

Every variable tested is guaranteed to help prediction, but not necessarily parsimoniously. In keeping with tradition, Seder, Vickery III, and Miller only include the variables with weights significantly different than 0. They do not tell us what these weights are, but they tell us which variables made the cut. For combined sexes, in order of impact, these are HTWT, SW, and LVS.

The next step is to make the predictions simply “1” or “0,” and not percentages, so that each horse has a definitive predicted status as a high earner or low earner, and an actual status as a high earner or low earner. Rounding is employed, and the horses are forced into the categories based on whether the prediction for them is over or under .5. With this division completed, the difference in performance between the predicted high earners and low earners can be observed, and the practical potential of the model understood.
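The fitting and rounding steps described above can be sketched with a bare-bones logistic regression. This is my assumption about the general kind of model involved; the paper does not give enough detail to confirm it. The data are entirely synthetic, the two predictors merely stand in for variables like HTWT and SW, and the "true" weights are invented.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(weights, x):
    """Predicted chance of being a high earner; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Find weights by gradient descent on the average log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_prob(weights, x) - y   # predicted minus actual (1 or 0)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Synthetic horses: two standardized predictors (hypothetical stand-ins).
random.seed(3)
rows, labels = [], []
for _ in range(500):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    true_p = sigmoid(-1.0 + 0.5 * x[0] + 0.2 * x[1])   # made-up true weights
    rows.append(x)
    labels.append(1 if random.random() < true_p else 0)

weights = fit_logistic(rows, labels)
probs = [predict_prob(weights, x) for x in rows]
predicted = [1 if p > 0.5 else 0 for p in probs]   # the rounding step

print("fitted weights:", [round(w, 2) for w in weights])
print("share predicted to be high earners:", sum(predicted) / len(predicted))
```

Each horse first gets a probability between 0 and 1, and only the final line of the fitting section forces it into one category or the other.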

The authors break up results into “blind” and “non-blind” tests, and into colts and fillies. The blind tests aim to preempt the naysayers who claim using the same sample for deriving the weights and evaluating their effectiveness predisposes toward positive results. I will accept at face value the researchers’ finding that this inherent bias was not a major factor and will concentrate on the non-blind results -- mainly arbitrarily, but also because they encompass the full sample size of 1479. As dividing into colts and fillies also seems less authoritative and more specialized, plus was not seen to alter the basic picture, I will address only “combined sexes.”

The EQB model tabbed a little under half (46.4%) of the prospects to be high earners. These forecasts were wrong more often than they were right (only 37.3% of these horses did in fact become high earners). However, this hit rate seems impressive compared to the record of the predicted low earners, who only became high earners, and defied expectation, 20.4% of the time. The “good” group was in fact good on the track 83% more often than the “bad” group.

A prediction of “high earner” can be labeled an aggressive one in the sense that it is going against the starting probabilities. The model suffers from over-aggressiveness. Its predictions were right 60.0% of the time. But calling every horse in the sample a “low earner” at the starting point would have netted a much better hit rate of 71.7%. This deficit relative to chance performance does point to something’s being out of kilter, and I will explore the issue later. But unlike when filling out an NCAA bracket, the goal is not strictly to be right as often as possible, but to be able to distinguish good horses and bad horses with some perspicacity. While the two tiers that are created are ultimately devoid of sense (with a 46-54 split, they’re not the best-50% rated prospects and the bottom 50% of prospects, for example, nor are they horses that actually had a 50%+ chance of being a high earner or low earner), the first tier does greatly outperform the second tier.
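The hit rates above can be reconstructed from the percentages already quoted. The horse counts in this sketch are my own back-calculation from rounded percentages, so they may be off by a horse or two from the paper's actual table.

```python
n = 1479                                  # non-blind, combined-sexes sample
pred_high = round(0.464 * n)              # horses the model called high earners
pred_low = n - pred_high
true_pos = round(0.373 * pred_high)       # predicted high, actually high
false_neg = round(0.204 * pred_low)       # predicted low, actually high
false_pos = pred_high - true_pos
true_neg = pred_low - false_neg

accuracy = (true_pos + true_neg) / n
baseline = (false_pos + true_neg) / n     # call every horse a low earner
lift = (true_pos / pred_high) / (false_neg / pred_low)

print(f"accuracy  {accuracy:.1%}")        # 60.0%
print(f"baseline  {baseline:.1%}")        # 71.7%
print(f"lift      {lift:.2f}")            # 1.83
```

The lift of 1.83 is the "83% more often" figure: the model separates the two tiers well even though its overall accuracy trails the all-low-earner baseline.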

Again, one of the elements that can distinguish a model from a descriptive report is the estimate of a combined effect. In this case, the selection of the tier 1 prospects was derived from the HTWT, SW, and LVS variables.

The inclusion of HTWT seems out of place here. Nothing in the presentation of the subject belies the title: “The Relationship of Selected Two-dimensional Echocardiographic Measurements to the Racing Performance of 5431 Yearling and 2003 Two-year-old Thoroughbred Racehorses.” The subject is echocardiographic measurements, not general measurements. Clients do not engage EQB’s services to learn how big prospects are. (Presumably, they can figure this out on their own.)

Research of the kind undertaken in this paper frequently makes use of control variables in models. But that is not HTWT’s function here, as the heart variables already control for HTWT. What HTWT is doing is simply making the model more predictive. This is glossed over. The hope seems to be that the reader will blithely attribute all of the model’s predictive power to the value of heart readings.

EQB does not show how each of the variables was weighted in the model, but they do tell us that HTWT had the strongest initial association with being a high earner. This was a given from the mean percentile comparison of high earners and low earners shown in Table 9. In fact, in a model environment, size’s preeminence would only grow, since the intercorrelated heart variables would largely cancel each other out (although the same may have happened to HTWT if it was subjected to Weight and Height being in the model at the same time, a point which isn’t clear to me from the write-up). In any event, the improvement in the post-model probability of selecting a high earner is undoubtedly owed in no small measure to size.

My frustration that size obscured the true effect of heart size on performance dissipated when I realized that a later table essentially provided the data unvarnished. At first blush, Table 14’s summary showing the nonblind, combined sexes model performance and a later analysis (Table 26) showing percentage of high earners for every heart variable in every quartile range do not seem commensurate. But framing the model results in terms of high earners, rather than as a ratio of high earners to low earners, bridged much of the gap. Then, in what was just a stroke of good luck, the model predicted high and low earners in almost equal number, facilitating comparison to Table 26, which has its groups divided into quartiles that can easily be consolidated into halves.

The demonstration of the model’s effectiveness used only high earners and low earners. Table 26 did not narrow the sample based on earnings, as long as the horses made three starts in North America. So the sample is the same as the one used in the test for the model, except it is over twice as large, since “middle earners” were included as well as low and high earners. Therefore, the overall percentage of high earners in Table 26 is a little over 13%, while the percentage of high earners in Table 14 is 28.3%. (The middle earners included in Table 26 increased the percentage of non-high earners.) As I have stressed, the complete sample is the right one to use, allowing for an accurate gauge of the effect.

I have an inkling an example is sorely needed. Table 26 shows that the bottom quartile in SW were high earners 10.8% of the time, with the groups representing progressively larger hearts high earners 13.1%, 13.1%, and 16.3%, respectively. Averaging the first and second quartiles, and then the third and fourth, the resulting median division shows an increase in success from 12.0% to 14.7%, or 22.5%. This is a gauge of the effect of SW alone. By comparison, in the model test, LVS, SW, and HTWT together selected a group that became high earners 82.7% more often than the rejected group. So single heart variables were a lot less predictive than the model’s array of variables.
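
The median-split calculation in the example above can be sketched in Python. It assumes the four quartiles hold equal numbers of horses, so that a simple average of two quartile rates gives the rate for the half; the list name is illustrative.

```python
# SW high-earner rates by quartile as quoted from Table 26 (percent).
sw_quartile_rates = [10.8, 13.1, 13.1, 16.3]

# Assumes equal-sized quartiles, so each half is the plain average of two quartiles.
lower_half = sum(sw_quartile_rates[:2]) / 2
upper_half = sum(sw_quartile_rates[2:]) / 2

# Relative improvement of the large-heart half over the small-heart half.
edge_unrounded = upper_half / lower_half - 1

# The post rounds the lower half to 12.0 before dividing.
edge_as_quoted = 14.7 / 12.0 - 1

print(f"lower half {lower_half:.2f}%, upper half {upper_half:.2f}%")
print(f"edge (unrounded halves) {edge_unrounded:.3f}")
print(f"edge (rounded halves)   {edge_as_quoted:.3f}")
```

Carrying the unrounded 11.95% through gives an edge of about 23.0% instead of 22.5%; the difference is rounding only and does not change the comparison with the model's 82.7%.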

Even though the statistical regression showed that SW was the most correlated heart variable with high earner status, the improvement in high earner rate dividing hearts into large and small categories was actually a bit better for LVD (36.0%) and LVS (30.2%). To explain this seeming contradiction, I would say that, first, the median approach does not acknowledge the predictive value of the SW second quartile outperforming the SW first quartile. Second, as mentioned before, the quartile analysis had a different and larger sample than the model, as “middle” earners were included.

While the two analyses provide a meaningful comparison, the reduction in the size of the effect cannot all be laid at the door of the elimination of physical size as a variable. There are additional differences, some of which I note only out of scientific compunction, and some of which seem more material. First, the 82.7% improvement represents a theoretical world in which there are only low earners and high earners, while the ~30% improvement does not involve this complication. Additionally, within the samples, the model’s tier 1 group was slightly more select than the top 50%, giving it an advantage in distinguishing itself compared to the exact top 50% used in the LVS analysis (although I believe the relative advantage of the way the split was done in the model test to be infinitesimal).

The 82.7% also represented not just SW and HTWT but LVS as well. Combining the effects of the heart variables is not only legitimate, but necessary for assessing their potential. However, I have reason to think it is not responsible for much of the composite success in prediction. Table 11 shows the three heart variables intercorrelate with R-squareds of .51, .70, and .74. Correlations are more commonly shown as rs, and those numbers are .71, .84, and .86. So if you know a horse’s standing on one of the variables, you have a good idea of its standing on the others. After accounting for one of the heart variables, the unique predictive value of either of the two other heart variables is likely limited.
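
The conversion from R-squared to r is just a square root, as this short Python sketch shows. It assumes the correlations are positive (larger on one heart measure going with larger on the others), since a square root alone cannot recover the sign.

```python
import math

# Pairwise R-squared values among the three heart variables, as quoted from Table 11.
r_squared = [0.51, 0.70, 0.74]

# Assumes positive correlations; the sign is not recoverable from R-squared.
r = [round(math.sqrt(value), 2) for value in r_squared]

# Share of variance in one heart variable NOT shared with the other.
unshared = [round(1 - value, 2) for value in r_squared]

print("r:", r)
print("unshared variance:", unshared)
```

The second list is the complementary view: between roughly a quarter and a half of the variance in one measure is not shared with another, which is the room left for a second heart variable to add anything.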

It is simply a fact as well that every single analysis shows physical size to be more correlated with performance than any individual heart variable. Seder, Vickery III & Miller specify this was the case in the section where they detail variable selection in the models. And working from Table 26, the same median approach applied to HTWT finds a 58.8% difference between Tier 1’s high earner rate and Tier 2’s. Losing physical size is obviously more of a blow to prediction than losing a somewhat redundant heart variable.

The colt-only and filly-only models also suggest a second heart variable fine-tunes rather than drives predictions. Logic dictates that if the weight in the combined sexes model was large, at least one of the sub-models should have seen it emerge as significant. Yet, while the pattern was constant in all three models, with HTWT most predictive and SW second-most predictive and significant, neither LVS nor LVD was significant in the colt-only and filly-only models. LVS was probably significant in the combined model just because of the larger sample size resulting from combining colts and fillies. Again, the paper does not supply us with LVS’s weight, but the information given is entirely consistent with a weak effect.

While the imbalance does not directly implicate the utility of their model, any discussion would be incomplete without mentioning that Seder, Vickery III, & Miller’s numbers don’t add up. In the demonstrations I found of logit probability models and the equivalent that classified cases into groups, the ratio of the predicted cases either exactly mirrored the actual ratio or was even more slanted toward the larger initial group. But taking EQB’s combined sexes’ nonblind high earner/low earner model (the pattern is the same with all of the other high earner/low earner models), the 72% actual low earners shrank to 54% in prediction.

If we knew nothing about any of these horses, if we did not know their heart size or physical size, our best guess would decidedly be to put them in the low-earner category, since low earners outpaced high earners by a ratio of between 2- and 3-to-1. If we are to believe the classification results, we need to believe that the data was so strong on 46% of the candidates that they defied the base probability and were more likely to be high earners than low earners. Note that this was not just the prediction for a few exceptional horses, but for nearly half the sample. And it was not a prediction that was borne out: the projected high earners only made good on it 37% of the time.

So many horses with a large discrepancy between their supposed chance of being a high earner (50%+) and their starting pre-model probability of being a high earner (28%) seems particularly curious because the data in general point to a weak model. Without very strong links between heart size and high earner status and physical size and high earner status, the model should not produce weights strong enough to yield aggressive predictions, no matter what a horse’s characteristics were. And there do not seem to be these links.

Remember that beyond the simple “1” or “0” classifications of predictions, horses have an initial, exact prediction, such as .32 (which would mean that a horse’s characteristics lent it a 32% chance of being a high earner). While the numbers comprising the dichotomous classification do not need to sum to the actual 418 high earners in the sample, the raw, input predictions do need to sum to 418. Termed another way, just as the percentage of high earners in the sample is 28.26%, the average predicted value needs to be .2826.

But the scenario of 46.4% of the sample having probabilities of at least .5, and yet the average probability equaling .283, does not make much sense. If the horses rejected as likely high earners were all assigned zero chance of achieving this, which defies reason, the average probability of being a high earner of the horses that passed the threshold could only be .6093 (.6093*.4638*1479 = 418). Taking another case, if the model found that all of the predicted high earners implausibly had exactly a .5 chance of being a high earner, that would still not leave much room for the horses not predicted to be high earners: their average chance would just be .0946. In order for the numbers to add up, it seems to me that an extremely convenient and perhaps impossible thing had to have happened: a model that could not support predictions well over .5 somehow made many of them, while making very few just under .5. In other words, among the good candidates to be high earners, an inordinate percentage made the grade, while there were next-to-no near misses. One would assume the model was blind, and could not have a tendency of disproportionally making predictions just over .5. The implied histogram sticks out like a half-decorated Christmas tree.
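
The two boundary cases above can be checked with the following Python sketch. It takes as given the premise stated earlier, that the model's raw predicted probabilities must sum to the 418 actual high earners. That property holds for a standard logit with an intercept fit by maximum likelihood and scored on the same sample; whether the paper's procedure matches that description is an assumption here, and the variable names are illustrative.

```python
# Figures quoted in the post.
N = 1479
N_HIGH = 418              # raw predictions are assumed to sum to this
SHARE_PRED_HIGH = 0.4638  # share of horses with predicted probability >= .5

n_pred_high = SHARE_PRED_HIGH * N
n_pred_low = N - n_pred_high

# Case A: every rejected horse is given zero chance.
# Then the predicted-high group must carry the entire sum on its own.
avg_high_if_rest_zero = N_HIGH / n_pred_high

# Case B: every predicted-high horse sits exactly at the .5 threshold.
# Whatever is left of the sum is spread over the rejected horses.
avg_low_if_high_at_half = (N_HIGH - 0.5 * n_pred_high) / n_pred_low

print(f"Case A: ceiling on the tier 1 average {avg_high_if_rest_zero:.4f}")
print(f"Case B: ceiling on the tier 2 average {avg_low_if_high_at_half:.4f}")
```

Any real distribution of predictions has to fall between these two extremes, which is what makes the required shape so narrow: tier 1 averages somewhere between .5 and about .61, while tier 2 averages somewhere between 0 and about .09.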

The only hypothesis I can even advance in support of this scenario would be if the percentile scores on the heart and size dimensions were often based on only a few observations and scores such as 100%, 75%, etc., were abundant. This would certainly indicate a larger methodological problem but could also lead to predicted values clustering and not forming a coherent distribution. However, the means and standard deviations for the variables in Table 9 indicate no such abnormality. The standard deviations essentially match what you would see if all percentile scores were approximately equally represented.
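
The benchmark referred to above, the standard deviation that percentile scores would have if every score were equally represented, is easy to compute. The Python sketch below treats percentiles as the whole numbers 1 through 100, which is an assumption about how the scores were recorded.

```python
import statistics

# Assumes percentile scores are the integers 1..100, each equally represented.
uniform_percentiles = range(1, 101)

mean_uniform = statistics.mean(uniform_percentiles)
sd_uniform = statistics.pstdev(uniform_percentiles)

print(f"mean {mean_uniform:.1f}, standard deviation {sd_uniform:.2f}")
```

A standard deviation in the high 20s is therefore what an evenly spread set of percentile scores looks like.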

I hold researchers to a high standard. If I am tough, it is mostly because crude, insufficiently considered research and misinterpretation bother me to my core. I also have come to terms with the fact that mincing words does not work for me and ends up being counterproductive. My natural manner is marked by right angles and not curves. In my case, an understated tone would be transparently insincere, motivated by cowardice, not conciliatory impulse, and my criticisms would sound even more contemptuous than they do now.

In addition to not pulling punches, I am driven by the imperative to be fair as I can be. I need to guard against the trap of writing an analysis starting from the premise that EQB’s paper is just terrible. The risk of succumbing to negative bias is multi-pronged: first, there are truly striking weaknesses with the paper, and this can lead to indiscriminate rejection of everything in it; second, the extreme close scrutiny I have subjected the paper to is grist for extreme criticism (an exhaustive search will always unearth faults); third, far-reaching criticism can seem interesting and indispensable, while measured criticism can seem superfluous.

The seeming illogic of the model test numbers puts my goals of not mincing words and being fair into conflict. I believe the paper is misleading almost from beginning to end, and I believe the researchers are more knowledgeable than some of their omissions of analysis would lead you to believe. It’s also easy to see that the imbalance of high and low earners may have created a stumbling block where high earner predictions could not be made, and the model therefore not tested. With backtracking, such as perhaps redefining the categories to be more even, this didn’t have to be an insurmountable problem, but one can certainly see why it was more tidy, and why it would have been tempting, to keep the original framework.

Alleging fudged data is a step further than I am willing to go, however. I cannot sit silently by. I must bring the questions up. But for all I know, the answer to the mystery might lie in key facts I am overlooking, or in key elements of the model or model test I am not understanding.

If I had to guess, I would say that my critique is well-founded and yet the high-earner predictions did result in earnest from fixed methodology. While it’s very possible that deficits in my knowledge base or, despite repeated reviews of the relevant section, lingering oversights explain why I could not reconcile the numbers, my hunch is that the paper does not lay out the road map that would be required to grasp the procedures. Although elsewhere in the paper I was virtually always able to resolve my confusions, I believe the paper is often sorely in need of clarification. This seems partly by design: there are significant problems that the researchers presumably would prefer went undetected. So, what I’m missing about the model test might serve to explain the numbers but not to justify them. The paper is rife with examples of unorthodox interventions, sometimes adopted with limited comment, that do violence to the basic issues under investigation, and I believe jury-rigged procedures probably were at play here. I think it is too much to hope for a deus ex machina, particularly considering what seem to be some daunting and irreconcilable initial facts.

Table 26’s comparison of the high-earner rates by heart size is the piece of evidence that gives me the most pause when I contend that EQB’s own data argue for the limited value of its heart scans. To review, this table shows that horses with the 50% largest hearts earn more than $10,000-a-start approximately 30% more often than horses with the 50% smallest hearts. Heart scans can probably do better when more than one measure is used and when registered more precisely (i.e., when the exact percentile is noted instead of just above- or below-average).

The analysis I constructed from Table 26 is just a rearrangement of a comparison I reported on previously. High earners having a mean heart percentile score of 53 with a standard deviation of 28 is the same information in a different form, and this was a window into the issue that made it obvious to me that heart size was relatively unimportant.

I have the facility with numbers one gets from working with them habitually, yet I would not have guessed these as two sides of the same coin. No less than our eyes, our statistical sense can play tricks on us. So if we compare above-average hearts to below-average, we see a 30% high earner edge for above-average hearts. If we compare above-average hearts just to the overall average, the edge is 13.5%. If we state the gain in percentage points, it is 1.8, probably the least convincing of the framings (1.8 reflects the difference between the average high-earner rate for above-average LVD, LVS, and SW horses, 15.1%, and the overall 13.3% high-earner rate).
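
The three framings can be laid side by side in a short Python sketch. The below-average rate is not quoted directly in this paragraph, so the sketch backs it out from the other two figures on the assumption that the above- and below-average halves are the same size; that step and the variable names are illustrative.

```python
# Rates quoted in the post (percent).
above_avg_rate = 15.1  # average high-earner rate for above-average LVD, LVS, SW
overall_rate = 13.3    # high-earner rate for the whole Table 26 sample

# Assumes equal-sized halves, so the overall rate is the midpoint of the two halves.
below_avg_rate = 2 * overall_rate - above_avg_rate

edge_vs_below = above_avg_rate / below_avg_rate - 1   # above vs below average
edge_vs_overall = above_avg_rate / overall_rate - 1   # above average vs everyone
gain_in_points = above_avg_rate - overall_rate        # percentage points

print(f"implied below-average rate {below_avg_rate:.1f}%")
print(f"edge over below-average    {edge_vs_below:.1%}")
print(f"edge over overall          {edge_vs_overall:.1%}")
print(f"gain in percentage points  {gain_in_points:.1f}")
```

The same pair of rates produces an edge of roughly 30%, 13.5%, or 1.8 points depending purely on the denominator chosen.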

Ultimately, I am most comfortable in the mean and standard deviation realm and in my ability to interpret these data. And thank goodness for Cohen’s effect size guidelines: they go a long way toward eliminating the subjectivity habitual with interpretation. And again, the comparison of heart to physical size, categorically showing heart to be less important, is the trump card, and places the heart effect in perspective.

We can disagree about the importance of heart and particularly about whether it has any practical value. What I do not think we can disagree on is that the narrative that has taken hold far exceeds the reality. Consider this claim of Seder’s in the BloodHorse article.

"We never discount desire, but of all the thousands of super graded stakes winners, we’ve never seen one with a really tiny heart."

I don’t know how Seder defines a “super graded stakes winner” (although how super can this specimen be if it’s been encountered thousands of times), and I don’t know what “a really tiny heart” is. The statement is simultaneously extreme and hedging, conveying unease. I do know that the authors’ study finds over 10% high earners among horses with hearts in the bottom 25%, compared to 13.3% overall, so to the extent the statement can be evaluated, it is false. The quote furthers the narrative that heart selection is the secret to selecting good horses, a breakthrough in its modest realm akin to Salk’s solving polio or Darwin’s discovering evolution.

“Then we know what is different about the good ones; it’s the heart, in the ones that succeeded,” Seder says.

The data make it abundantly clear that heart does not deserve this special place of privilege.

While Seder sometimes makes extravagant claims for heart’s explanatory power, sometimes his tone and Miller’s are more measured. “Seder says they don’t even look at the heart of a horse until it has passed three or four of their other tests,” Thomas writes. While the quote discussing “super graded stakes winners” stands out for being categorical, ultimately what EQB is contending is that heart should be used as a screening device. The suggestion is clearly that buyers should eliminate horses with below-average hearts from their list of prospects. In my opinion, the evidence also does not support this approach: the heart sizes of good horses show very little consistency.

My main reason for examining the importance of EQB’s heart data was simple: I was curious. Beyond my personal history and data analytic bent, the obvious and important practical applications gave the issue special resonance. However, I certainly could have engaged in the same process of analysis and then sat on my findings.

Having gone public, I do not expect you to be immediately converted to my side. You may imbibe the rhythm of my case and impute at least an element of truth, but it is natural to feel overwhelmed by an avalanche of argument, particularly when it is statistically based. You are wise to be skeptical and to suspect that you would be equally convinced by the other side if it had its spokesperson.

So you revert to messenger credibility as your touchstone. EQB’s team can claim status, experience, advanced degrees, and work that was granted publication. Don’t get me wrong, these are most decidedly points of recommendation, not grounds for indictment. But ultimately they do not make up for glaring deficiencies and weak argument. Granting EQB due respect is one thing; placing blind faith in their construals, another. Reverence for science imbues EQB with much of their appeal, but the hallmark of embracing science is not the amount of enthusiasm with which we glom on to scientific products, but the extent to which we form our beliefs from the available evidence. To this end, skepticism is not just permissible, but crucial.

EQB also does not deserve the final word on the utility of their work because they are hardly a disinterested party. Take note of Seder’s words here as recorded by Smith Thomas.

"So I went after that aspect [the “size and functioning of the heart”] – but couldn’t do it because we didn’t have the right protocol nor the right equipment. As it turned out we also needed mountains of data. By the time we arrived at something useful it was 20 years later and I’d spent millions of dollars."

Having made such a sizeable investment, EQB’s demonstrated lack of objectivity is not the least bit surprising.

Compared to other fields and sports such as baseball, data analytics in racing is at a primitive stage. This will continue to be the state of affairs if declared results are accepted at face value and not examined. Intercourse of ideas is needed, not shouting in the wind. Moreover, a truly active audience keeps researchers honest. If I know you will be evaluating what I do, my level of care will increase.

Data is fashionable and data is comforting. It endows us with a sense of power. We run toward light in the darkness. But the potential does not equate to actual insight. All data is not good data, and we can evaluate the tools that we use. If EQB’s heart data doesn’t work, it is time to invest in other approaches, rather than accepting the company’s spin as gospel.

-David Harris

Mahubah
Freshman Sire
Posts: 2774
Joined: Thu Sep 16, 2004 2:23 pm
Location: Lake City, Florida

Re: EQB's Published Heart Study a Wake-up Call for Believers

Postby Mahubah » Fri Sep 16, 2016 6:41 pm

Thanks for sharing such a detailed statistical analysis. Question for you: is the Journal of Equine Veterinary Science a publication which requires peer review of papers prior to printing them?
"A man who was merely a man and said the sort of things Jesus said would not be a great moral teacher...You must make your choice. Either this man was, and is, the Son of God: or else a madman or something worse." C. S. Lewis

Silver Deputy
Newborn
Posts: 8
Joined: Tue Mar 25, 2008 10:02 am

Re: EQB's Published Heart Study a Wake-up Call for Believers

Postby Silver Deputy » Sat Sep 17, 2016 2:36 pm

Mahubah wrote:Thanks for sharing such a detailed statistical analysis. Question for you: is the Journal of Equine Veterinary Science a publication which requires peer review of papers prior to printing them?


I certainly think so. I just got the paper from EQB's website. Here's a link, or if it doesn't work, it's under "Scientific Journal Publications" on the website.

http://www.eqb.com/Domains/www.eqb.com/ ... ournal.pdf

You can see the refereed mark in the top left-hand corner of the article. JEVS's mission (http://www.j-evs.com/content/aims) and editorial board (http://www.j-evs.com/content/edboard) are also worth looking at. While in no way an expert about such things, I am surprised the paper was published, even though statistics may be something the majority of the board is acquainted with rather than expert in.

wgc517
Allowance Winner
Posts: 381
Joined: Tue Nov 30, 2004 7:27 pm
Location: East Coast

Re: EQB's Published Heart Study a Wake-up Call for Believers

Postby wgc517 » Wed Nov 23, 2016 7:42 am

Hi David,

I read your post on EQB and I was just wondering what part of the business you were in, (Bloodstock agent, breeder, racing etc.)? I ask because it was the longest post I ever saw on here and written in a way that does not seem like it comes from someone who would normally post on this board. Although I have not used their services, (I am a small breeder) I do follow them and find their website and the selection processes they discuss very interesting. Maybe because I read a lot of peer reviewed studies in my other job and I know the value and support they offer.

Would you be kind enough to share your qualifications/affiliations with us? I am trying to understand if you are a credible source (statistician or of similar background) or someone with an axe to grind.

Silver Deputy
Newborn
Posts: 8
Joined: Tue Mar 25, 2008 10:02 am

Re: EQB's Published Heart Study a Wake-up Call for Believers

Postby Silver Deputy » Thu Dec 01, 2016 12:08 am

wgc517 wrote:Hi David,

I read your post on EQB and I was just wondering what part of the business you were in, (Bloodstock agent, breeder, racing etc.)? I ask because it was the longest post I ever saw on here and written in a way that does not seem like it comes from someone who would normally post on this board. Although I have not used their services, (I am a small breeder) I do follow them and find their website and the selection processes they discuss very interesting. Maybe because I read a lot of peer reviewed studies in my other job and I know the value and support they offer.

Would you be kind enough to share your qualifications/affiliations with us? I am trying to understand if you are a credible source (statistician or of similar background) or someone with an axe to grind.


I am a very small breeder. My family has been involved in racing on a somewhat larger scale as owners. I can talk about California Chrome, the Jimmy Durante Stakes, what have you, with the best of them if I need to. I studied Psychology in school and came to statistics that way. It's funny that you fall into the exact mode I anticipated in my conclusion. You would do better to evaluate my critique on its merits. My experience and knowledge of EQB is exactly as I describe in the paper. I am not well connected, nor am I coming from a place where you would say, "Aha! Of course he thinks that." I have no natural incentive in this case. Any axe I have to grind stems from what I discovered and is born of personal conviction.

Patuxet
Grade III Winner
Posts: 1150
Joined: Fri Dec 01, 2006 10:36 pm
Location: New England & Florida

Re: EQB's Published Heart Study a Wake-up Call for Believers

Postby Patuxet » Thu Jan 05, 2017 1:35 pm

Fascinating! Thank you for sharing your time-consuming research and articulating your thought processes, analyses and conclusions so clearly. You've given us much to weigh and reflect on.

Allison
"He is pure air and fire and the dull elements of earth and water never appear in him; he is indeed a horse ..." Wm. Shakespeare - Henry V