© 2002 American Thoracic Society
A Decision Tree for Tuberculosis Contact Investigation
Lung Health Center; School of Health Related Professions; School of Medicine; Comprehensive Cancer Center, Division of Biostatistics, University of Alabama at Birmingham; Department of Biostatistics, School of Public Health; and Alabama Department of Public Health, Division of Tuberculosis Control, Birmingham, Alabama
Correspondence and requests for reprints should be addressed to Lynn B. Gerald, Ph.D., M.S.P.H., University of Alabama at Birmingham Lung Health Center, NHB 104, 619 19th Street South, Birmingham, AL 35249-7337. E-mail: geraldl@uab.edu
ABSTRACT
The University of Alabama at Birmingham and the Alabama Department of Public Health recently developed a logistic regression model showing those variables that are most likely to predict a positive tuberculin skin test in contacts of tuberculosis cases. However, translating such a model into field application requires a stepwise approach. This article describes a decision tree developed to assist public health workers in determining which contacts are most likely to have a positive tuberculin skin test. The Classification and Regression Tree analysis was performed on 292 consecutive cases and their 2,941 contacts seen by the Alabama Department of Public Health from January 1, 1998, to October 15, 1998. Several decision trees were developed and were then tested using prospectively collected data from 366 new tuberculosis cases and their 3,162 contacts from October 15, 1998, to April 30, 2000. Testing showed the trees to have sensitivities of 87-94%, specificities of 22-28%, and false-negative rates between 7 and 10%. The use of the decision trees would decrease the number of contacts investigated by 17-25% while maintaining a false-negative rate that was close to that of the presumed background rate of latent tuberculosis infection in the state of Alabama.
Key Words: tuberculosis; decision tree; contact investigation; Classification and Regression Tree analysis
Investigation of contacts of active tuberculosis (TB) cases using tuberculin skin testing (TST) is an important epidemiologic tool in identifying TB infection. Reports by the Institute of Medicine and the Advisory Council for the Elimination of Tuberculosis have cited the importance of developing more effective methods of identifying contacts with a high risk of infection (1, 2). In addition, budgetary constraints on state and local health departments have created a strong impetus to streamline contact investigation programs without sacrificing disease control. The University of Alabama at Birmingham and the Alabama Department of Public Health Division of Tuberculosis Control have recently completed a project that evaluated existing contact investigation procedures in Alabama and developed standardized protocols and a computer-based case management system, including a contact investigation module (L.B.G., unpublished data). The data collected using the contact investigation module were subsequently used to create a model showing variables most likely to predict a positive TST among contacts of active TB cases (3). The contact TST model used generalized estimating equations (GEEs) (4), a type of logistic regression, to identify risk factors. This model is mathematically sound but poses two major difficulties in translating the results to field application. First, to predict the TST result of a contact using the GEE model, one must have information available on all variables used by the model. This is not always practical in the contact investigation. Second, the odds ratios produced by the GEE method are difficult to translate into rules for prioritizing contacts for investigation.
Classification and regression tree (CART) analysis can address these difficulties and has been successfully used to assist in the diagnosis of myocardial infarction (5) and to differentiate between benign and malignant breast tumors (6). Recently, a decision tree created using CART was developed to predict which TB patients should be isolated upon admission to the hospital (7). The authors created an algorithm with a 100% sensitivity and a specificity of 48%. A sensitivity of 100% was desired at the cost of a lower specificity due to the nature of TB as an infectious disease. The model had a negative predictive value of 100%. Although the investigators realized they would be required to isolate more patients than necessary, they wanted to ensure that they did not miss any infectious cases. The number of patients requiring isolation could be reduced by more than 40% using this decision tree without increasing the risk of the spread of infection.
This study used CART analysis to develop several simple decision trees to assist TB field workers in establishing priorities in contact investigation. The goal is to reduce the number of contacts investigated without sacrificing disease control.
METHODS
Sample
The sample (learning sample) used to develop the decision tree included 292 consecutive confirmed TB cases with a total of 2,941 contacts identified in the state of Alabama from January 1, 1998, to October 15, 1998. Data collected from 366 new TB cases and their 3,162 contacts from October 15, 1998, to April 30, 2000 (test sample) were used to test the decision tree.
Study Variables
Certain variables (including positive human immunodeficiency virus [HIV] status and whether the case was homeless) were thought to be important in determining the probability of infection of contacts; however, there were too few cases or contacts with these traits to use these variables in the model. Current smoking status of the case was significant but was not included in the analysis because of the large amount of missing data. There were almost no missing data for any of the other variables.
Analysis
Although the GEE model can be used to predict a positive skin test among contacts, the limitations associated with the model make the translation of the results to field application cumbersome. For example, to predict the probability of a contact having a positive TST using the GEE model, data must be available on all variables used in the model. In addition, the odds ratios produced by the GEE model are difficult to translate into decision rules for the investigation of contacts. To overcome these difficulties, the CART (13) method was used to develop a decision tree to predict the TST result of a contact. CART allows one to assign different penalties for misclassification. In the case of TB contact investigations for the purpose of disease control, false-negative errors (classifying contacts with positive TSTs as negative) are more serious than false-positive errors (classifying contacts with negative TSTs as positive). Therefore, we assigned a higher penalty to false-negative errors. Several different penalty weights were tested. A penalty of two allowed us to develop a decision tree with a 10% false-negative rate (close to the presumed background prevalence of latent TB infection in Alabama), yet allowed for a substantial reduction in the number of contacts examined. The final result from CART is a decision tree that has a minimal misclassification cost. More detail on the CART analysis can be found in the online data supplement. CART analysis included the seven variables significant in the GEE model and three environmental exposure variables that the clinicians felt were important exposure variables: (1) whether or not ventilation of the exposure environment is minimum (minimum ventilation includes closed windows and doors and window/fan exhaust), (2) whether or not place of exposure to case is home, and (3) whether or not contact is exposed in a nursing home, prison, group home, or boarding house.
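The effect of the false-negative penalty on a single terminal node can be illustrated with a short sketch. This is not the authors' software: the function names are hypothetical, the false-positive cost of 1 is an assumption (the text reports only the penalty of two for false negatives), and the tie-breaking choice is also an assumption. The sketch labels a node with whichever class gives the lower total misclassification cost.

```python
# Penalty of two for false negatives is taken from the text.
FALSE_NEGATIVE_PENALTY = 2.0
# Assumption: false positives carry a unit cost (not stated in the text).
FALSE_POSITIVE_PENALTY = 1.0


def node_costs(n_positive, n_negative):
    """Return (cost if the node is labelled negative, cost if labelled positive).

    n_positive / n_negative are the numbers of TST-positive and TST-negative
    contacts that fall into the terminal node.
    """
    cost_if_negative = FALSE_NEGATIVE_PENALTY * n_positive  # every positive is missed
    cost_if_positive = FALSE_POSITIVE_PENALTY * n_negative  # every negative is over-called
    return cost_if_negative, cost_if_positive


def label_node(n_positive, n_negative):
    """Pick the label with the lower misclassification cost.

    Assumption: ties are resolved toward investigating the contact.
    """
    cost_if_negative, cost_if_positive = node_costs(n_positive, n_negative)
    if cost_if_positive <= cost_if_negative:
        return "investigate"
    return "do not investigate"


# Hypothetical node with 40 positive and 60 negative contacts: with equal
# costs the majority class (negative) would win, but the penalty of two
# makes labelling it negative cost 80 versus 60 for labelling it positive.
print(label_node(40, 60))
```

Under these assumed costs a node is labelled for investigation once more than one third of its contacts are TST positive, rather than more than one half.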
Total hours per month was grouped into terciles: 0-24, 24-120, and 120+ in the CART analysis. The decision trees were then tested using prospectively collected data from 366 new TB cases and their 3,162 contacts.
RESULTS
During the time period in which the data were collected, the state of Alabama had a TB incidence rate of 8.8 per 100,000, the sixth highest rate in the country. Characteristics of the TB cases and their contacts used to create the GEE and CART models are shown in Table 1. The mean number of contacts per case was 10 (median, 4; range, 1 to 181). The overall infection rate among all contacts was approximately 18%.
Table 2 shows the univariate analysis. The decision tree produced by the CART method is shown in Figure 1. To classify a contact, one examines the designated predictor at each node. If its value is in the category indicated, the contact belongs to the left subtree, and if not, the contact belongs to the right subtree. This determination is made at every node encountered until a terminal node is reached. The decision tree was developed with an 87% sensitivity and a 44% specificity (see Table E1 in the online data supplement). The predictive tree was validated in a separate data set of 366 new TB cases and their 3,162 contacts, where it performed with a 91% sensitivity and a 28% specificity (Table 3). Use of this tree would allow TB field workers to decrease the number of contacts investigated by 776 (25%) for the test sample, thus greatly reducing the costs associated with contact investigation. However, 59 (10%) of the contacts with positive TSTs would not be investigated.
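The node-by-node procedure described above (go left when the predictor is in the indicated category, otherwise go right, until a terminal node is reached) can be sketched as follows. The tree in this example is a hypothetical two-node toy, not the tree of Figure 1, and the class names, field names, and prediction labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    """Terminal node carrying the predicted TST result."""
    prediction: str


@dataclass
class Node:
    """Internal node: a predictor test with a left and a right subtree."""
    question: str
    in_category: Callable[[dict], bool]
    left: Union["Node", Leaf]    # taken when the predictor is in the category
    right: Union["Node", Leaf]   # taken otherwise


def classify(tree, contact):
    """Walk from the root to a terminal node and return its prediction."""
    node = tree
    while isinstance(node, Node):
        node = node.left if node.in_category(contact) else node.right
    return node.prediction


# Hypothetical toy tree (NOT Figure 1): split on cavitary disease first,
# then on total exposure hours per month.
toy_tree = Node(
    question="Does the case have cavitary disease?",
    in_category=lambda c: c["case_cavitary"],
    left=Leaf("positive"),
    right=Node(
        question="Is exposure more than 120 hours per month?",
        in_category=lambda c: c["hours_per_month"] > 120,
        left=Leaf("positive"),
        right=Leaf("negative"),
    ),
)

print(classify(toy_tree, {"case_cavitary": False, "hours_per_month": 10}))
```

Because the walk stops at the first terminal node reached, a field worker only needs the predictors that lie along one path through the tree.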
It is important to note that the trees developed by the CART method can be pruned by the investigator according to his or her needs. For example, if the tree divides the contacts into illogical groups, the groups should be disregarded and the tree pruned at that point. For instance, in Figure 1, the yellow highlighted node implies that more hours spent with the case decreases the likelihood of a positive TST. This most likely reflects an artifact from the small number of contacts in this node. Because this node is illogical, the tree should be pruned at this point. Furthermore, positive TSTs in children are less likely to be due to the background rate, and contacts who are children are considered very high priority. Therefore, we would also recommend pruning the tree at the blue highlighted nodes. Figure E1 in the online data supplement shows the pruned tree, eliminating these nodes, and Table E2 shows the sensitivity and specificity of this tree.
Although demographic characteristics such as age, race, and sex are scientifically valid predictors, it may be that the significance of demographic factors in the model is due to differences in the background rate of infection in different subgroups of the population. Furthermore, although these characteristics may be related to transmission, it may also be that it is not race and sex per se that increase a contact's risk of having a positive TST. Rather, such demographic characteristics are related to social and cultural factors that are difficult to measure but have an impact on the transmission of TB. Therefore, in examining a model for use in clinical practice, we ran the decision trees without the demographic characteristics. The resulting tree is shown in Figure 2. This tree was developed with a 94% sensitivity and a 24% specificity and performed with a 92% sensitivity and a 20% specificity in the test data (see Table 4).
The use of this tree would allow for a 17% reduction in the number of contacts investigated.
The tree shown in Figure 2 was pruned at the two highlighted nodes where results were not logical. The resulting tree is shown in Figure 3. In addition, the Alabama Department of Public Health categorizes all children who are less than 15 years of age as high risk. Therefore, we would always investigate children regardless of the prediction of the algorithm. Such a decision tree performed with a 90% sensitivity and a 22% specificity in the test data and would allow for a 20% reduction in the number of contacts investigated (Table 5).
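The performance figures quoted in this section all derive from the four cells of a two-by-two table of predicted versus observed TST results. The sketch below shows the standard formulas; the function name and the example counts are hypothetical and are not taken from Tables 3 to 5. The text available here does not define "false-negative rate" precisely, so two plausible readings are computed and labeled as such.

```python
def tree_performance(tp, fn, fp, tn):
    """Summarize a tree's predictions against observed TST results.

    tp: TST-positive contacts the tree flags for investigation
    fn: TST-positive contacts the tree does not flag
    fp: TST-negative contacts the tree flags
    tn: TST-negative contacts the tree does not flag
    """
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Contacts the tree would not investigate, as a share of all contacts.
        "reduction_in_contacts": (tn + fn) / total,
        # Two possible readings of "false-negative rate" (assumption):
        "missed_share_of_positives": fn / (tp + fn),
        "positive_share_of_uninvestigated": fn / (tn + fn),
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
for name, value in tree_performance(tp=90, fn=10, fp=300, tn=100).items():
    print(f"{name}: {value:.3f}")

# One figure that can be checked against the text: 776 of the 3,162
# test-sample contacts is about 24.5%, which the article reports as 25%.
print(f"{776 / 3162:.1%}")
```

The second reading of the false-negative rate is the one that can be compared directly with a background prevalence, since both are proportions of a group of people who were not selected for their exposure.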
DISCUSSION
Our results show that simple decision trees can be developed to assist TB field workers in prioritizing contacts for investigation. Our goal in developing these trees was to provide field workers with a scientifically sound yet simple method of prioritizing contacts. When these decision trees are applied to the test sample, we could eliminate between 17 and 25% of the contacts investigated. This is a substantial reduction in the number of contacts examined and would mean that large amounts of resources previously devoted to contact investigation could be redirected toward other important activities such as directly observed therapy. In addition, the false-negative rate of these trees ranged from 7 to 10%. This rate is close to the presumed background prevalence of latent TB infection in the state of Alabama. Although this is encouraging, we acknowledge that some persons with recent transmission of TB are likely to be included in this false-negative rate. The decision trees presented are easy to interpret and apply to TB contact investigations. Although a positive TST among contacts to an active TB case can be predicted using a logistic regression model (3), all information related to factors in the logistic model has to be available to make a prediction for a contact. Decision trees allow field workers to act on information that is currently available. Furthermore, tree-structured classification better detects interactions among variables because the process of recursive partitioning is directed toward finding such meaningful relations (13). Using a decision tree, we can predict positive TSTs among contacts to an active TB case by using several simple rules. Using the tree shown in Figure 3, we would investigate contacts if (1) the case to which they were exposed had cavitary disease, or (2) the total exposure time per month was larger than 120 hours, or (3) the contact was less than 15 years of age.
In these instances, it is not necessary to collect further information. If the case did not have cavitary disease and the exposure time was less than 120 hours per month, then the following contacts would be investigated: (1) all contacts exposed to smear positive cases in their home, and (2) contacts exposed to smear positive cases in places other than their home where the ventilation was minimal. This hierarchical nature of the decision tree model allows TB field workers to prioritize the contact information that they need to determine the likelihood of a positive TST, thereby saving time and resources. We included the demographic characteristics of contacts in the decision trees to show readers how such characteristics are strong predictors of skin test positivity. This may well be due to varying background rates in subpopulations as well as social and cultural issues related to transmission. When developing decision trees for clinical practice, one must consider the implications of selecting contacts for investigation based on such factors as race and sex. For example, selection of persons for investigation based on race or sex may harm the credibility of the public health system. Such practices may be seen by the public as biased and unfair. Therefore, we would recommend a decision tree such as the one shown in Figure 3 for use in the field. One limitation of this study is the low number of HIV-positive cases or contacts in our population. Although HIV status in contacts may be important in transmission of TB, the impact of TB infection, should it occur, is grave. Whether or not a case's HIV status alters the infectiousness of pulmonary TB is controversial (14). The decision tree presented can still be applied in such areas with the additional rule of always investigating contacts who are HIV-positive.
In fact, the Alabama Department of Public Health uses that rule in addition to investigating all contacts of HIV-positive cases because of the likelihood of shared exposures that would increase a contact's risk of being HIV positive. Furthermore, the issue of foreign-born cases and contacts is important in areas where these groups are prevalent. Recent studies indicate the importance of this risk factor in controlling TB, and our trees may have limitations in areas where there is a high prevalence of foreign-born persons (15-17). Public health officials in such areas should bear these limitations in mind. Future studies should examine the effectiveness of such decision trees in the field. One consideration that must be addressed in future studies is the cost-benefit analysis of the use of such decision trees in determining prioritization of contacts for investigation, such as the studies that have been conducted on DOT versus self-administration of TB therapy (18), isoniazid prophylaxis (19), and screening for latent TB infection (20, 21). The potential cost savings of using the decision tree to prioritize contacts for investigation should be weighed against the costs of treatment and future infections related to contacts who are classified as false-negatives by the tree. Use of such trees can result in a substantial reduction of the number of contacts investigated by health departments. Such potential cost savings could be used to increase rates of adherence to latent TB infection treatment among contacts who are evaluated. However, such savings need to be weighed against the costs of potentially missing a contact recently infected with TB (22). Recent development of new tests for latent TB infection implies that in coming years we may be able to distinguish latent TB infection from immunization with BCG and from exposure to atypical mycobacteria or mycobacteria other than TB (23, 24).
Therefore, algorithms for prioritizing contact investigation created with the use of such new tests should become even more specific.
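The Figure 3 rules stated in this Discussion can be written as a short function. This is a sketch based only on the wording of the text, not on the figure itself: the function and parameter names are hypothetical, and the handling of an exposure of exactly 120 hours per month is an assumption, since the text says "larger than 120 hours" in one place and "less than 120 hours" in another. The additional rule of always investigating HIV-positive contacts is not encoded here.

```python
def should_investigate(
    case_cavitary,         # case has cavitary disease
    hours_per_month,       # total exposure time per month, in hours
    contact_age_years,     # age of the contact, in years
    case_smear_positive,   # case is smear positive
    exposed_at_home,       # place of exposure is the home
    minimal_ventilation,   # exposure environment had minimal ventilation
):
    """Return True if the contact would be prioritized for investigation."""
    # First tier: any one of these is sufficient, and no further
    # information needs to be collected.
    # Assumption: exactly 120 hours does not count as "larger than 120".
    if case_cavitary or hours_per_month > 120 or contact_age_years < 15:
        return True
    # Second tier: contacts of smear-positive cases exposed at home, or
    # exposed elsewhere under minimal ventilation.
    if case_smear_positive and (exposed_at_home or minimal_ventilation):
        return True
    return False


# Hypothetical contact: adult, 30 hours per month, non-cavitary but
# smear-positive case, exposed outside the home in a poorly ventilated space.
print(should_investigate(False, 30, 40, True, False, True))
```

Ordering the checks this way mirrors the hierarchy described in the text: ventilation and place of exposure only need to be recorded for contacts who are not already selected by the first tier.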
Conclusion
Although this study emphasizes the science of TB contact investigation, it is important to remember that contact tracing is also an art requiring other forms of expertise and intuition. The use of this decision tree, combined with the experience of TB field workers and health department officials, is one potentially important step in reaching the goal of TB elimination while ensuring efficient and effective use of public health resources.
FOOTNOTES
Supported by a grant from the National Heart, Lung, and Blood Institute of the National Institutes of Health.
This article has an online data supplement, which is available from this issue's table of contents online at www.atsjournals.org
Received in original form February 18, 2002; accepted in final form May 7, 2002
REFERENCES