What is gold standard and what is ground truth? (2024)

"What has not been examined impartially, has not been well examined. Scepticism is therefore the first step towards truth." (Denis Diderot, Philosopher)

Clinical decision-making is complex and based upon accurate evaluation of clinical findings using diagnostic tests and reference standard data. Given that many aspects of dental examination are not direct measures, but rely on indirect measures, it is important for clinicians to understand the basic principles and terms used to assess the accuracy of diagnostic tests and to appropriately evaluate published literature regarding these tests. Luckily, there is a variety of readily available metric systems to assess the quality of diagnostic test studies and to help clinicians better understand evidence-based literature.

Dentistry, or shall we say Clinical Dentistry, is becoming more complex and patients have become better informed. Importantly, health care has also shifted focus to emphasize evidence-based practice (EBP). EBP is considered the gold standard for health professional decision-making. No one can deny that the activities in the field of evidence-based Dentistry have grown exponentially in the last decade. However, we cannot forget that Pierre Fauchard (1678-1761) may have been the first to warn the dental field about the concept of evidence, taking into consideration the practices of the time. Fauchard and James Lind (1716-1790) were both concerned about the health of sailors dying of scurvy and, for this reason, conceptualized a "clinical trial" involving the use of vitamin C to counteract the disease. The former even tested techniques for the removal of caries, dental restoration and implants.

The true meaning of evidence-based Dentistry is grounded in a solid understanding and application of clinical epidemiology principles to reduce any confusion that may exist due to academic training. Epidemiology is defined as the "Science of making predictions about individual patients or a group, by recounting clinical events in similar patients in order to ensure that the predictions are correct". Clinical epidemiology is "a subfield that applies the principles and methods of epidemiology to study the occurrence and outcomes of disease in people with a given illness".1

The ability to precisely define a question of interest (clinical question), derive relevant information from databases, differentiate research methodology and select statistical procedures, as well as the ability to critically evaluate studies and understand their implications for care, are required skills.2 However, let us not be too optimistic; there are drawbacks. Ironically, political, social and economic pressure limits the time available for practitioners to seek answers to clinical questions. Furthermore, a surprising number of studies are published every week, ranging from the best to the worst.

This paper will discuss one clinical question, among several that can be built "epidemiologically": diagnostic test accuracy. In other words, a diagnostic accuracy study provides estimates of the ability of a diagnostic test to discriminate between patients with or without a pre-defined health condition, comparing the results with a standard reference test. There will always be one predictor variable (result of the test) and an outcome (presence or absence of the disease).3 Furthermore, we add the concept of ground truth, which is a set of measures known to be more accurate than the measurements of the system you are testing.

The term gold standard refers to a benchmark that is the best available under reasonable conditions. Indeed, it is not the perfect test, but merely the best available one that has a standard with known results. This is especially important when faced with the impossibility of direct measurements.4 In Dentistry, for example, micro computed tomography can be considered a gold standard for the diagnosis of proximal carious lesions of posterior teeth, as microscopic examination of the enamel has demonstrated its accuracy.5 In the past, referring to an examination as the gold standard meant that it was unqualifiedly the most accurate procedure. However, in present clinical practice, even though the intent of the term has not changed, its use is dependent upon the context of the statistical method being used.

A gold standard study may refer to an experimental model that has been thoroughly tested and has a reputation in the field as a reliable method. The correct interpretation of a diagnostic test demands that one master specific concepts such as sensitivity, specificity, prevalence, and positive and negative predictive values. The sensitivity of a test is defined as the proportion of people with the disease who test positive (true positives). The specificity of a test is the proportion of people without the disease who test negative (true negatives). In some literature, one can find the term 1-specificity, which is the rate of false positives (in other words, the percentage of people without the disease who are incorrectly identified as positive). Typically, a Receiver Operating Characteristic (ROC) curve is used as a graphical representation of the relationship between sensitivity and specificity: it plots sensitivity against 1-specificity across the possible cut-off values. The area under the curve represents the accuracy of the test. The closer the value is to one, the greater the test accuracy. In many clinical scenarios, there is a trade-off between sensitivity and specificity. This trade-off is related to the fact that some people will clearly be normal while others will clearly have the condition. However, there will inevitably be a group of patients who fall in a middle zone (neither clearly normal nor abnormal). In such instances, an arbitrary cut-off will be used to distinguish between normal and abnormal. Any screening test used to distinguish between patients in this circumstance will have a trade-off between sensitivity and specificity. One way to address this dilemma is to use a combination of diagnostic tests to develop a diagnosis.
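The relationship between the cut-off, sensitivity, specificity and the area under the ROC curve can be illustrated with a minimal Python sketch. The scores, the disease labels and the function names below are hypothetical and serve only as an illustration; a score at or above the cut-off is assumed to count as a positive test.

```python
# Hypothetical data: a continuous test score per patient and the
# reference-standard result (True = disease present).
scores = [1, 2, 3, 4, 5, 6, 7, 8]
has_disease = [False, False, False, True, False, True, True, True]


def sensitivity_specificity(scores, has_disease, cutoff):
    """Treat score >= cutoff as test-positive and compare with the reference."""
    pairs = list(zip(scores, has_disease))
    tp = sum(1 for s, d in pairs if d and s >= cutoff)
    fn = sum(1 for s, d in pairs if d and s < cutoff)
    tn = sum(1 for s, d in pairs if not d and s < cutoff)
    fp = sum(1 for s, d in pairs if not d and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, has_disease):
    """Return (1 - specificity, sensitivity) pairs, one per candidate cut-off."""
    points = [(0.0, 0.0)]  # a cut-off above every score: nobody tests positive
    for cutoff in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, has_disease, cutoff)
        points.append((1.0 - spec, sens))
    return points


def area_under_curve(points):
    """Trapezoidal area under consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


for cutoff in (4, 5, 6):
    sens, spec = sensitivity_specificity(scores, has_disease, cutoff)
    print(f"cut-off {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
print("AUC:", area_under_curve(roc_points(scores, has_disease)))
```

In this toy sample, lowering the cut-off raises sensitivity at the cost of specificity and raising it does the opposite, which is the trade-off described above.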

Positive predictive value is the probability that patients who test positive truly have the condition of interest (true positives). Negative predictive value, on the other hand, is defined as the probability that patients who test negative truly do not have the disease (true negatives). It is important to recognize that diagnostic tests are influenced by the prevalence of the disease in the population being tested. Prevalence is the probability that an individual in a population has the disease (based on clinical characteristics and demographic data) and includes both newly diagnosed cases and existing cases. Likelihood ratio is the ratio between the probability of a particular outcome of a diagnostic test in individuals with the disease and the probability of that same outcome in individuals without the disease. This may be positive or negative.6
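As a minimal sketch, the following Python function derives these quantities from the four cells of a 2x2 table (test result against reference standard). The counts and the function name are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def predictive_values(tp, fp, fn, tn):
    """Summarize a 2x2 table of test result versus reference standard.

    tp/fn: diseased patients testing positive/negative.
    fp/tn: non-diseased patients testing positive/negative.
    Assumes no denominator is zero (e.g. specificity below 1).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
        # PPV: share of test-positive patients who truly have the disease
        "ppv": tp / (tp + fp),
        # NPV: share of test-negative patients who truly are disease-free
        "npv": tn / (tn + fn),
        # Likelihood ratios compare diseased with non-diseased patients
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Hypothetical sample: 1000 patients, 100 of whom have the disease.
summary = predictive_values(tp=80, fp=90, fn=20, tn=810)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the test has a sensitivity of 0.80 and a specificity of 0.90, yet fewer than half of the positive results are true positives, because the disease is present in only 10% of the sample.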

To best understand how and why diagnostic tests function, a basic understanding of Bayes' theorem is needed. Bayes defined probability as "the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening".7 For example, the probability that a person with a positive test for oral cancer actually has the condition depends not only on the relationship between events, but also on the accuracy of the test and the prevalence of the condition in the population sample. Thus, if one wishes to evaluate the operating characteristics of a diagnostic test and selects a sample consisting of only a few people with oral cancer, whereas another individual evaluates the same diagnostic test in a sample with a greater proportion of people with oral cancer, test sensitivity, specificity, positive and negative predictive values may vary considerably even though the test procedure was identical.8
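The role of prevalence can be made concrete with Bayes' theorem. The sketch below is a simplified illustration: it holds sensitivity and specificity fixed at a hypothetical 90% each and varies only the prevalence, so it isolates the effect on the probability of disease after a positive test. All values and names are assumptions chosen for the example.

```python
def post_test_probability(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Same hypothetical test (90% sensitivity, 90% specificity) applied to
# three hypothetical samples that differ only in disease prevalence.
for prevalence in (0.01, 0.10, 0.50):
    probability = post_test_probability(prevalence, 0.90, 0.90)
    print(f"prevalence {prevalence:.2f} -> P(disease | positive) = {probability:.3f}")
```

Under these assumptions, the same positive result is far more convincing in a sample where half of the people have the disease than in one where only 1% do.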

An ideal diagnostic method hypothetically presents a sensitivity of 100% with respect to detection of injury or illness (identifying all cases of injury or disease in all specimens or individuals evaluated, with no false negatives) and a specificity of 100% (without false positives, that is, without pointing to injury or illness where there is none). In practice, however, there is no perfect gold standard. Instead, we have a method with the greatest sensitivity and the highest specificity. Therefore, the gold standard diagnostic method of the past has probably been replaced today.

Higher sensitivity values increase negative predictive values. Higher specificity values increase positive predictive values. Thus, if a test had perfect sensitivity and specificity, all people with a positive test result would have the disease, while all patients with a negative test would not have the disease. In practice, however, there is a trade-off between these values. This concept is important in instances in which the diseases have a poor prognosis. In these cases, one might want the test to have higher specificity so as not to unduly distress patients with many false-positive results. Alternatively, if a disease is easily treatable, it might be more important to screen the population at risk by means of a test with higher sensitivity, even at the cost of lower specificity. For patients who are a false positive, a second test can be used to confirm diagnosis.9

For example, in Medicine, contrast angiography (arteriography) was a former gold standard for heart disease. A recent study reported the sensitivity of angiography to be 66.5% and the specificity to be 82.6%. Now magnetic resonance angiography (MRA) has become the new gold standard, with a reported sensitivity of 86.5% and a specificity of 83.4%.10 The acceptance of a new gold standard default method takes time and exhaustive evidence, especially if the internal validity is consistent and acceptable.

As for ground truth, it can signify the mean value from the collection of data from a particular experimental model (that preferentially uses a gold standard method) representing a behavioral reference. For example, using a universal shear testing machine to evaluate the strength of a new resin for bracket bonding, we obtain a value of X. This value can be compared to a reference value obtained by previous observations. Thus, if the resulting X value is similar to or higher than those found in ground truth, it can be said that this new resin has an appropriate value. There is a consensus that the clinical resistance pattern for bracket bonding corresponds to something around 6.8 MPa (this value fits the definition of ground truth better than that of gold standard, as it cannot be precisely checked).11 So this value can be used as reference ground truth to accept or reject the hypothesis that a particular new resin has admissible clinical strength or resistance. Therefore, in simple terms, a gold standard test refers to a diagnostic method with the best accuracy, whereas ground truth represents the reference values used as standard for comparison purposes.
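The comparison against a ground-truth reference can be sketched in a few lines of Python. The 6.8 MPa reference comes from the text; the measurements, the function name and the optional tolerance (one possible reading of "similar to") are hypothetical, and a real study would use an appropriate statistical test rather than a bare comparison of means.

```python
from statistics import mean

REFERENCE_MPA = 6.8  # ground-truth reference value quoted in the text


def meets_reference(strengths_mpa, reference=REFERENCE_MPA, tolerance=0.0):
    """True when the mean strength is at least the reference minus a tolerance."""
    return mean(strengths_mpa) >= reference - tolerance


# Hypothetical shear bond strength measurements, in MPa.
new_resin = [7.1, 6.5, 7.4, 6.9]
weak_resin = [5.0, 5.5, 6.0]

print("new resin acceptable:", meets_reference(new_resin))
print("weak resin acceptable:", meets_reference(weak_resin))
```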

In a recent study, authors classified midpalatal suture ossification in five maturation stages.12 A total of 140 cone-beam computed tomography (CBCT) scans of the midpalatal suture were collected and blindly classified into five stages. The images were used as ground truth reference. Subsequently, 30 images were randomly evaluated and reclassified by three experienced orthodontists. The authors found strong agreement in the proposed classification method, with kappa index ranging from 0.82 to 0.93. However, for this diagnostic method of suture maturation to become a gold standard, histological confirmation is required to test specificity and sensitivity. In other words, it should be tested whether CBCT scans of "no suture" really mean midpalatal suture tissue absence or the opposite in their five stages.
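For readers unfamiliar with the kappa index, the following Python sketch computes an unweighted Cohen's kappa between a ground-truth classification and one examiner. It is an illustration only: the stage labels and ratings are hypothetical, and the text does not state which kappa variant the cited study used.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement between two raters beyond chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportion per category
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical maturation stages (A to E) for ten scans.
ground_truth = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"]
examiner = ["A", "A", "B", "C", "C", "C", "D", "D", "E", "D"]

print("kappa:", cohen_kappa(ground_truth, examiner))
```

Here the examiner agrees with the reference on 8 of 10 scans, but kappa is lower than 0.8 because part of that agreement would be expected by chance alone.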

When a clinician or researcher is interested in critiquing a study that describes the process for evaluating a diagnostic test, or in conducting such a study, it is important to check that the study follows the rules described in the literature. The Standards for Reporting of Diagnostic Accuracy Studies (STARD)13 is a list containing 25 items used to critically evaluate the quality of a particular diagnostic test study. Another accepted format used to evaluate studies of diagnostic tests is the Quality Assessment of Studies of Diagnostic Accuracy Included in Systematic Reviews (QUADAS).14 The latter is a 14-item checklist (answers can be "yes", "no" or "unclear") used to measure potential risk of bias in systematic reviews. Systematic reviews of these studies may follow the format proposed by the Cochrane Collaboration, available in the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy (http://srdta.cochrane.org/handbook-dta-reviews).

In conclusion, gold standard data or method is related to something that has already been checked (histologically, microscopically, chemically, etc.) and presents the best accuracy (sensitivity and specificity). Ground truth means data and/or method related to consensus or reliable values/aspects that can be used as references, but were not or cannot be checked. We recommend more exposure to concepts of clinical epidemiology in dental schools to ensure the best evidence-based practice.

REFERENCES

1. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 3rd ed. New Jersey: Prentice Hall Health; 2009.

2. Cardoso JR. Fontes SV, Fukujima MM, Cardeal JM. Fisioterapia neurofuncional. Fundamentos para a prática. São Paulo: Atheneu; 2007. Fisioterapia baseada em evidências; pp. 29–38.

3. Korevaar DA, van Enst WA, Spijker R, Bossuyt PM, Hooft L. Reporting quality of diagnostic accuracy studies: a systematic review and meta-analysis of investigations on adherence to STARD. Evid Based Med. 2014;19(2):47–54.

4. Versi E. "Gold standard" is an appropriate term? BMJ. 1992;305(6846):187.

5. Soviero VM, Leal SC, Silva RC, Azevedo RB. Validity of MicroCT for in vitro detection of proximal carious lesions in primary molars. J Dent. 2012;40(1):35–40.

6. Haynes RB, Sackett DL, Guyatt GH, Tugwell P. Clinical epidemiology: how to do clinical practice research. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2006.

7. An essay towards solving a problem in the doctrine of chances by the late Rev Mr. Bayes, communicated by Mr. Price, in a letter to John Canton MA and FRS. Read December 23, 1763. First publication. Philos Trans R Soc Lond. 1764;53:370–418. http://www.stat.ucla.edu/history/essay.pdf

8. Mazur DJ. A history of evidence in medical decisions: from the diagnostic sign to Bayesian inference. Med Decis Making. 2012;32(2):227–231.

9. Saah AJ, Hoover DR. "Sensitivity" and "specificity" reconsidered: the meaning of these terms in analytical and diagnostic settings. Ann Intern Med. 1997;126(1):91–94.

10. Greenwood JP, Maredia N, Younger JF, Brown JM, Nixon J, Everett CC, et al. Cardiovascular magnetic resonance and single-photon emission computed tomography for diagnosis of coronary heart disease (CE-MARC): a prospective trial. Lancet. 2012;379(9814):453–460.

11. Reynolds IR. A review of direct orthodontic bonding. Br J Orthod. 1975;2:171–178.

12. Angelieri F, Cevidanes LH, Franchi L, Gonçalves JR, Benavides E, McNamara JA Jr. Midpalatal suture maturation: classification method for individual assessment before rapid maxillary expansion. Am J Orthod Dentofacial Orthop. 2013;144(5):759–769.

13. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Standards for Reporting of Diagnostic Accuracy. Clin Chem. 2003;49:1–6.

14. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol. 2003;10:25.
