MUTUAL INFORMATION ANALYSIS USED TO DETERMINE MOST INFORMATIVE FEATURES IN BREAST CANCER DIAGNOSIS

Monday, October 25, 2010
Sheraton Hall E/F (Sheraton Centre Toronto Hotel)
Yirong Wu, PhD (1), Elizabeth S. Burnside, MD, MPH, MS (1), Oguzhan Alagoz, PhD (1), Mehmet U.S. Ayvaci, MS (1), David Page, PhD (1) and Ross Shachter, PhD (2); (1) University of Wisconsin-Madison, Madison, WI, (2) Stanford University, Stanford, CA

Purpose: We aim to develop a methodology to determine which mammography feature variables predict breast cancer most efficiently and accurately. We use Bayesian reasoning and mutual information analysis to inform decision makers which tests or features would be most valuable in the diagnosis of breast cancer.

Method: Our database consisted of 9986 structured reports comprising demographic factors and mammographic findings. We matched these reports with our Comprehensive Cancer Center tumor registry, which served as our reference standard. We tested our methodology on example features including mass margin, breast density, calcification shape, and architectural distortion. We used the tree-augmented naïve Bayes (TAN) learning algorithm to train a Bayesian network on these features using 10-fold cross-validation. We obtained the mutual information between the target variable (benign/malignant) and each feature variable using the variable elimination algorithm. We estimated the probability of breast cancer conditioned on each variable individually and then compared predictive accuracy by area under the ROC curve (AUC). With a threshold that penalized false negative (FN) results 50 times as heavily as false positive (FP) results, we calculated the number of erroneous diagnoses (FN and FP) for each variable.
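As an illustration of the quantity being ranked, the mutual information between a discrete feature X and the diagnosis Y is I(X; Y) = sum over x, y of p(x, y) log2[ p(x, y) / (p(x) p(y)) ]. The sketch below is not the procedure used in this study, which obtained the probabilities from the trained Bayesian network by variable elimination; it is a simpler plug-in estimate computed directly from paired observations. The function name and the toy records are hypothetical and are not drawn from the study database.

```python
import math
from collections import Counter


def mutual_information(features, labels):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations.

    Assumption: probabilities are estimated as raw relative frequencies,
    not read out of a Bayesian network as in the abstract.
    """
    n = len(features)
    joint = Counter(zip(features, labels))  # counts of (x, y) pairs
    count_x = Counter(features)             # marginal counts of x
    count_y = Counter(labels)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy records (NOT study data): mass margin vs. diagnosis.
margin = ["spiculated", "circumscribed", "spiculated", "circumscribed",
          "circumscribed", "spiculated", "circumscribed", "circumscribed"]
diagnosis = ["malignant", "benign", "malignant", "benign",
             "benign", "benign", "benign", "malignant"]

print(f"I(margin; diagnosis) = {mutual_information(margin, diagnosis):.3f} bits")
```

Computing this value for each feature and sorting in descending order gives a ranking of the kind reported in the Result paragraph; a feature that is independent of the diagnosis scores 0 bits.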

Result: We ranked the feature variables by their mutual information with the target variable and found mass margin > calcification shape > breast density > architectural distortion. ROC analysis reinforced this ranking, showing mass margin to be the most informative variable and architectural distortion the least informative for distinguishing benign from malignant breast disease (see Table below). At our predefined threshold, the numbers of FN and FP diagnoses for each variable were consistent with that variable's mutual information.

Feature                     AUC      FN number   FP number
All features                0.815    74          3700
Mass margin                 0.740    93          4650
Calcification shape         0.697    96          4800
Breast density              0.647    103         5150
Architectural distortion    0.634    103         5150
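The abstract does not state how the 50:1 penalty was turned into a decision threshold. One common way, shown here only as an assumption, is to minimize expected cost: a case is called positive when its predicted probability of cancer is at least cost_FP / (cost_FP + cost_FN), which is 1/51 for a 50:1 ratio. The sketch below applies that rule and counts FN and FP errors; the function names and the example probabilities are hypothetical, not study data.

```python
def cost_threshold(cost_fn, cost_fp):
    """Probability cutoff that minimizes expected misclassification cost.

    Assumption: the standard expected-cost rule; the abstract does not
    say this is how its threshold was derived.
    """
    return cost_fp / (cost_fp + cost_fn)


def count_errors(probs, labels, threshold):
    """Return (FN, FP) counts when calling a case positive if prob >= threshold.

    labels: 1 = malignant, 0 = benign.
    """
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return fn, fp


# Hypothetical predicted probabilities of cancer (NOT study data).
probs = [0.01, 0.50, 0.03, 0.001]
labels = [1, 1, 0, 0]

threshold = cost_threshold(cost_fn=50, cost_fp=1)
fn, fp = count_errors(probs, labels, threshold)
print(f"threshold = {threshold:.4f}, FN = {fn}, FP = {fp}")
```

A cutoff this low favors sensitivity: it accepts many false positives to avoid each false negative, in line with the FP counts in the table being much larger than the FN counts.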

Conclusion: Mutual information can be used to specify the relative importance of mammographic feature variables for breast cancer diagnosis. Extending this mutual information analysis methodology could rank the value of different variables and thereby guide the selection and application of variables or diagnostic tests in the pursuit of optimal breast cancer diagnosis.