*Preya Shah, Harvard College ’13*

#### Abstract

Early detection of breast cancer is crucial for successful treatment of the disease. While mammography screenings are relatively effective, only around 20% of biopsies conducted after abnormal mammogram results suggest tumor malignancies. The purpose of this project is to use a simple multivariate logistic model to determine the probability of tumor malignancy based on several mammogram BI-RADS covariates. Results and comparison with a decision-tree algorithm suggest that the logistic model can allow for successful detection of malignant tumors with fewer unneeded biopsies.

#### 1 Introduction

Breast cancer is the most common form of cancer among women, leading to over 40,000 fatalities per year in the United States.1 Since breast cancer is treatable in its beginning stages, early detection is integral to combating the disease and increasing chances of survival. Mammography is currently the most effective available technique for breast cancer screening; however, its ability to predict malignancies is imperfect.2 It is estimated that out of every 1000 women screened, 80 to 100 will be “recalled,” i.e., called back for additional evaluation, often consisting of additional mammographic views and/or other scans (such as ultrasound and MRI). Of those patients, about 15 will be recommended to have a breast biopsy, and 2 to 5 of those biopsied will be found to have breast cancer.3,4,5 Indeed, it is estimated that approximately 80% of breast biopsy results are benign findings.6 As biopsies are invasive, expensive, and potentially risky, it is important to minimize unnecessary biopsies while maximizing the probability of catching malignant tumors.

To provide a standard classification system for mammography studies, the American College of Radiology has developed the Breast Imaging Reporting and Data System (BI-RADS).7 One component of BI-RADS is a list of seven mammography assessment codes which are assigned by a radiologist after interpreting a mammogram. The seven codes are as follows:

0: Incomplete

1: Negative

2: Benign finding(s)

3: Probably benign

4: Suspicious abnormality

5: Highly suggestive of malignancy

6: Known biopsy, proven malignancy

These categorizations represent the radiologist’s assessment of the chance of malignancy, and are informed by several BI-RADS attributes. These attributes differ depending on the type of abnormality found on the mammogram (e.g., masses, calcifications, architectural distortions). For masses, these attributes include mass shape, margin, and mass density. Recently, several computer-aided detection (CAD) techniques have been developed which aim to determine breast tumor malignancy using these BI-RADS attributes. The goal of these systems is to aid physicians in their decision on whether or not to perform a biopsy based on an abnormal mammogram. One particular CAD system developed by M. Elter et al. uses a decision tree algorithm which utilizes the BI-RADS attributes to characterize tumors as either malignant or benign.8

The purpose of this study is to develop a multivariate logistic model which can serve as an additional system to characterize the malignancy of a breast tumor. The logistic model determines which BI-RADS characteristics are most influential in the indication of a malignant tumor. Additionally, results of this study indicate that the model’s predictions are comparable to those given by the BI-RADS assessments or CAD techniques, leading to the possibility for better detection results and fewer unwarranted biopsies.

#### 2 General Methods

The data utilized for this project was obtained from a publicly available mammography database provided by UC Irvine.9 The database is based on modern full-field digital mammograms, and the dataset contains data from 961 mass regions collected at a large radiological center from 2003 to 2006. 515 (53.6%) of these mass regions are benign and 446 (46.4%) are malignant. Notably, the providers of the dataset did not indicate what percent of the mammograms were screening mammograms (used to check for breast cancer in asymptomatic women) versus diagnostic mammograms (used to check for breast cancer in patients with prior symptoms). However, the large percentage of malignant regions and the young age of some of the screened patients suggest that the mammograms are likely to be diagnostic.

The following information is provided for each data point:

**Covariates:**

BI-RADS assessment: 0 to 6 (ordinal, a characterization for the data)

Age: patient’s age in years (integer)

Shape: round=1 oval=2 lobular=3 irregular=4 (nominal)

Margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)

Mass Density: high=1 iso=2 low=3 fat-containing=4 (ordinal)

**Dependent variable/outcome:**

Severity: benign=0 or malignant=1 (binary)

Using Google Refine, the dataset was cleaned to remove data points with missing information, and was further separated into two sets: a training set and a test set. 100 data points were selected at random to be part of the test set; the rest became the training set. As expected, the percent of patients with malignancy was similar in the training set and the test set (48.56% in the training set and 48% in the test set). The full characteristics of the training and test sets are given in Figure 1. For age, the median, min, max, and 25th and 75th percentiles are listed. For shape, margin, mass density, BI-RADS code, and severity, the percent of patients falling into each category is listed.
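The random hold-out split described above can be sketched in Python as follows. This is only an illustration, not the Google Refine workflow used in the study; the function name `split_holdout`, the fixed seed, and the placeholder records are all assumptions.

```python
import random

def split_holdout(records, n_test=100, seed=0):
    """Randomly hold out n_test records; return (train, test)."""
    if n_test > len(records):
        raise ValueError("n_test exceeds the number of records")
    rng = random.Random(seed)            # fixed seed only for reproducibility
    indices = list(range(len(records)))
    rng.shuffle(indices)
    test_idx = set(indices[:n_test])     # the first n_test shuffled indices form the test set
    train = [r for i, r in enumerate(records) if i not in test_idx]
    test = [r for i, r in enumerate(records) if i in test_idx]
    return train, test

# Placeholder records standing in for cleaned data points (not the real data).
records = list(range(500))
train, test = split_holdout(records)
```

After such a split it is worth comparing the malignancy rate in both sets, as the study does, to confirm that the hold-out set is representative.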

Once the data was obtained and cleaned, a multivariate logistic regression was performed on the training data using Matlab. Logistic regression was used because the dependent variable (severity) is binary, and because the logistic model can estimate probability of a malignant tumor as a function of age, shape, margin, and mass density.

The logistic model is given by:

$$P(Y = 1 \mid X) = G(X\beta)$$

where X is a design matrix of the covariates, Y is the dependent variable (where Y = 1 if the mass is malignant and 0 if not), β is a vector of the regression coefficients, and G is given by:

$$G(z) = \frac{e^{z}}{1 + e^{z}}$$

Since the shape and margin covariates are given in the dataset as discrete nominal variables with no clear a priori ordering, they were replaced by binary variables. For shape, the binary variables are oval, lobular, and irregular (where each variable is 1 if the mass is characterized by that shape and 0 otherwise), with round as the baseline. Similarly, for margin, the variables are microlobulated, obscured, spiculated, and ill-defined (where each variable is 1 if the mass is characterized by that margin and 0 otherwise), with circumscribed as the baseline. Age and mass density remained as integer variables.
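The indicator coding described above might look like the following sketch. The column names, the `encode_row` helper, and the explicit intercept column are assumptions made for illustration; the numeric codes follow the covariate list given earlier.

```python
# Nominal codes mapped to indicator names; code 1 is the baseline in both cases.
SHAPE_LEVELS = {2: "oval", 3: "lobular", 4: "irregular"}        # baseline: round (1)
MARGIN_LEVELS = {2: "microlobulated", 3: "obscured",
                 4: "ill_defined", 5: "spiculated"}             # baseline: circumscribed (1)

def encode_row(age, shape, margin, density):
    """Build one design-matrix row as a dict of named covariates."""
    row = {"intercept": 1, "age": age, "density": density}      # intercept is an assumption
    for code, name in SHAPE_LEVELS.items():
        row[name] = 1 if shape == code else 0
    for code, name in MARGIN_LEVELS.items():
        row[name] = 1 if margin == code else 0
    return row

# An irregular (4), spiculated (5) mass of low density (3) in a 57-year-old patient.
example = encode_row(age=57, shape=4, margin=5, density=3)
```

A round, circumscribed mass has every indicator equal to 0, so its effect is absorbed into the baseline rather than given its own coefficient.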

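To estimate the β coefficients of the logistic regression, we must maximize the log likelihood function; in other words, we must choose the set of β values which maximizes the likelihood of the data in the dataset. Likelihood is defined for a particular set of β values as the probability of obtaining the observed data if the logistic model is correct and those β values are the true values. The log likelihood equation is given by:

$$\ell(\beta) = \sum_{j=1}^{n} \left[ y_j \ln G(x_j \beta) + (1 - y_j) \ln\left(1 - G(x_j \beta)\right) \right]$$

where $x_j$ is the j-th row of X, $y_j$ is the corresponding outcome, and n is the number of data points.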
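As a rough illustration of maximum-likelihood estimation, the sketch below fits a logistic model by plain gradient ascent on a small synthetic dataset. The study used Matlab and does not state which optimizer was applied, so the algorithm, step size, iteration count, and toy data here are all assumptions.

```python
import math

def sigmoid(z):
    """Logistic function G(z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(beta, X, y):
    """Sum over rows of y*ln(p) + (1 - y)*ln(1 - p), with p = G(x . beta)."""
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, row)))
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Gradient ascent on the mean log likelihood, starting from beta = 0."""
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for row, label in zip(X, y):
            err = label - sigmoid(sum(b * v for b, v in zip(beta, row)))
            for k, v in enumerate(row):
                grad[k] += err * v          # gradient of the log likelihood
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Synthetic, non-separable toy data: an intercept column plus one centred covariate.
X = [[1.0, x] for x in (-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)]
y = [0, 0, 1, 0, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(X, y)
```

In practice a library routine or a Newton-type method would converge faster; gradient ascent is used here only because it makes the link to the log likelihood explicit.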

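Finally, marginal effects of a specific covariate i can be calculated using the following equation:

$$\frac{\partial P(Y = 1 \mid X)}{\partial x_i} = \beta_i \, G(X\beta)\left[1 - G(X\beta)\right]$$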
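For a logistic model, the marginal effect of covariate i is the coefficient β_i scaled by p(1 - p), where p is the predicted probability. The sketch below computes this and checks it against a finite-difference estimate. The coefficient values and covariate ordering are hypothetical and are not the fitted values from this study.

```python
import math

def predict_prob(beta, x):
    """P(Y = 1 | x) under the logistic model."""
    z = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(beta, x, i):
    """Derivative of P(Y = 1 | x) with respect to covariate i: beta_i * p * (1 - p)."""
    p = predict_prob(beta, x)
    return beta[i] * p * (1.0 - p)

# Hypothetical coefficients: intercept, age, mass density code.
beta = [-3.0, 0.05, -0.2]
x = [1.0, 57.0, 3.0]

analytic = marginal_effect(beta, x, 1)      # effect of one additional year of age

# Central finite difference as a sanity check on the analytic formula.
h = 1e-5
x_up = [1.0, 57.0 + h, 3.0]
x_dn = [1.0, 57.0 - h, 3.0]
numeric = (predict_prob(beta, x_up) - predict_prob(beta, x_dn)) / (2 * h)
```

Because p(1 - p) peaks at p = 0.5, the marginal effect of any covariate is largest where the model is most uncertain and shrinks as the predicted probability approaches 0 or 1.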

These methods were implemented in Matlab, and results of the regression model were compared with the BI-RADS assessment approach and Elter’s CAD decision tree technique.

#### 3 Results

**3.1 Logistic regression**

Once the conversion of shape and margin from nominal variables to binary variables was made, the optimal regression coefficients were calculated on the training data. Results are shown in the table below.

The signs of the coefficients can tell us the following: masses which are lobular, irregular, microlobulated, obscured, ill-defined, or spiculated have a higher probability of being malignant than those which do not have those properties. Furthermore, older patients are more likely to have a malignant tumor, and more dense tumors are more likely to be malignant (recall that the mass density scale runs from 1-4 with 1 being most dense).

For each data point, the probability of malignancy was calculated using the results obtained from the logistic regression. Overall, masses with higher probability of malignancy are associated with higher BI-RADS codes, as intuitively expected.

As age is the only continuous variable in this dataset, we can visualize probability of malignancy as a function of age while fixing shape, margin, and mass density variables. Background literature research suggests that characteristics of a benign tumor include round or oval shape, circumscribed margin, and low mass density, while characteristics of a malignant tumor include irregular shape, spiculated margins, and high mass density. Therefore, $P_{\text{malignancy}}$ vs. age was plotted for three different types of tumors: (1) benign-type (round, circumscribed, low mass density), (2) in-between (lobular, obscured, isodense), and (3) malignant-type (irregular, spiculated, high mass density). As expected, all three tumor types demonstrate a logistic trend in which probability of malignancy increases with age. Furthermore, the tumor type with malignant characteristics has a higher probability of malignancy for all ages than the tumor types with benign characteristics.
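The age curves for the three tumor profiles can be generated along the following lines. The coefficients below are placeholders chosen only to have the signs reported in this section; they are not the fitted values from the study, and the names `COEF`, `PROFILES`, and `p_malignant` are hypothetical.

```python
import math

# Placeholder coefficients for illustration only; NOT the study's fitted values.
COEF = {"intercept": -5.0, "age": 0.05, "density": -0.1,
        "lobular": 0.5, "irregular": 1.5, "obscured": 0.8, "spiculated": 2.0}

# Each profile lists its non-baseline indicators and its density code (1 = high, 3 = low).
PROFILES = {
    "benign-type": {"density": 3},                               # round, circumscribed, low density
    "in-between": {"lobular": 1, "obscured": 1, "density": 2},   # isodense
    "malignant-type": {"irregular": 1, "spiculated": 1, "density": 1},  # high density
}

def p_malignant(age, profile):
    """Predicted probability of malignancy at a given age for a fixed profile."""
    z = COEF["intercept"] + COEF["age"] * age
    z += sum(COEF[name] * value for name, value in profile.items())
    return 1.0 / (1.0 + math.exp(-z))

ages = list(range(20, 91, 10))
curves = {name: [p_malignant(a, prof) for a in ages]
          for name, prof in PROFILES.items()}
```

Each entry of `curves` can then be plotted against `ages`; with a positive age coefficient all three curves rise with age, and they differ only by a horizontal shift set by the profile.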

The marginal effects of age on malignancy can also be visualized for the three types of tumors. We note that the marginal effect for malignant-type tumors is highest at lower ages, whereas the marginal effect for benign-type tumors is highest at old age. This makes sense because at old ages, malignant-type tumors are almost guaranteed to be malignant, so the marginal effect is minimal.

Marginal effects of mass density were calculated using a fixed age of 57 (the median age of the dataset) for benign-type (round shape and circumscribed margin), in-between (lobular shape and obscured margin), and malignant-type (irregular shape and spiculated margin) tumors. The marginal effect was -0.020 for benign-type tumors, -0.038 for in-between tumors, and -0.018 for malignant-type tumors. Because the mass density code decreases as density increases (1 = high, 4 = fat-containing), these negative values indicate that, at the median age of the dataset, increased mass density increases the probability of malignancy. The magnitude of the effect is larger for in-between tumors than for malignant-type or benign-type tumors.

**3.2 Assessing the predictive value of the logistic model**

The final, and perhaps most relevant, stage of this project is the assessment of the logistic model’s predictive value. In other words, how effective is the model in predicting malignancy compared with the radiologists’ BI-RADS assessments and the CAD decision tree model? A metric for testing effectiveness was developed. First, a function mimicking the CAD decision tree model was coded in Matlab. The decision tree model, shown in Figure 4, consists of a series of questions which utilize the BI-RADS covariates to determine whether or not a tumor is malignant. In the study by Elter et al., the CAD decision tree model also included a confidence value for each assigned classification. By varying the confidence threshold, they generated receiver operating characteristic (ROC) curves of sensitivity (true positive fraction) vs. specificity (1 - false positive fraction). High sensitivity is often considered more important than high specificity in breast cancer prediction (i.e., it is better to falsely classify a benign region as malignant than to falsely classify a malignant region as benign). The authors therefore included several performance measures, including the specificity at a fixed sensitivity of 0.95, to reflect the greater cost of false negatives relative to false positives.8 In the present study, to simplify comparison between the CAD and logistic models, only the binary decision of “malignant” or “benign” is considered for the CAD model, and confidence values were not considered.
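To make the sensitivity and specificity measures concrete, the sketch below computes both at a given threshold and then finds the best specificity among thresholds that reach a target sensitivity. It is a generic illustration with made-up scores, not a reimplementation of the measures in Elter et al.; the function names are hypothetical.

```python
def sensitivity_specificity(scores, outcomes, threshold):
    """Classify score >= threshold as malignant; return (sensitivity, specificity).

    Assumes outcomes contains at least one malignant (1) and one benign (0) case.
    """
    tp = fn = tn = fp = 0
    for score, outcome in zip(scores, outcomes):
        predicted = 1 if score >= threshold else 0
        if outcome == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    return tp / (tp + fn), tn / (tn + fp)

def specificity_at_sensitivity(scores, outcomes, target=0.95):
    """Best specificity over all observed thresholds whose sensitivity reaches target."""
    feasible = []
    for threshold in set(scores):
        sens, spec = sensitivity_specificity(scores, outcomes, threshold)
        if sens >= target:
            feasible.append(spec)
    return max(feasible)   # never empty: the lowest score gives sensitivity 1.0

# Made-up confidence scores and true outcomes (1 = malignant).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 1, 0, 0, 0]
best_spec = specificity_at_sensitivity(scores, outcomes)
```

Evaluating `sensitivity_specificity` over every observed threshold yields the points of an ROC curve.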

Now that the decision tree model is implemented, each data point can be associated with four pieces of information. Three of them are predictive measures: the BI-RADS assessment code, $P_{\text{malignant}}$ as calculated by the logistic model, and the CAD decision. The fourth is the actual outcome as indicated by the severity field (1 if the tumor is malignant, and 0 if it is benign). We can therefore calculate the effectiveness of the three models by determining the percent of correct predictions, false positive predictions, and false negative predictions for each.

To aid in this calculation, we can assume that all cases with a BI-RADS assessment of greater than 4, a $P_{\text{malignant}}$ greater than or equal to 0.5, or a CAD decision of 1 are predicted to be malignant. Note that for the purpose of this study, the cost of false negatives is viewed as equal to the cost of false positives, which is why 0.5 is chosen as the cutoff for malignancy. Future studies can assign different costs to prioritize a higher sensitivity (i.e., fewer false negatives).
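The three decision rules and the resulting percentages can be sketched as follows. The cutoffs are the ones stated above; the helper names, the toy cases, and the choice to express every rate as a percent of all cases are assumptions.

```python
def classify(bi_rads, p_malignant, cad_decision):
    """Apply the three decision rules; 1 means predicted malignant."""
    return {
        "bi_rads": 1 if bi_rads > 4 else 0,
        "logistic": 1 if p_malignant >= 0.5 else 0,
        "cad": 1 if cad_decision == 1 else 0,
    }

def rates(predictions, outcomes):
    """Percent correct, false positive, and false negative over all cases."""
    n = len(outcomes)
    pairs = list(zip(predictions, outcomes))
    correct = sum(1 for p, o in pairs if p == o)
    false_pos = sum(1 for p, o in pairs if p == 1 and o == 0)
    false_neg = sum(1 for p, o in pairs if p == 0 and o == 1)
    return {"correct": 100.0 * correct / n,
            "false_positive": 100.0 * false_pos / n,
            "false_negative": 100.0 * false_neg / n}

# Toy cases: (BI-RADS code, logistic probability, CAD decision, true severity).
cases = [(5, 0.9, 1, 1), (4, 0.6, 1, 1), (4, 0.3, 1, 0), (3, 0.2, 0, 0), (4, 0.4, 0, 1)]
outcomes = [c[3] for c in cases]
decisions = [classify(c[0], c[1], c[2]) for c in cases]
summary = {model: rates([d[model] for d in decisions], outcomes)
           for model in ("bi_rads", "logistic", "cad")}
```

Raising or lowering the 0.5 cutoff in `classify` is the simplest way to trade false positives against false negatives for the logistic model.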

The effectiveness of the three models was determined on the test set as well as a new training set of 100 data points. Results of both tests are summarized in Figure 5. Note that a false positive indicates that a tumor predicted to be malignant is actually benign, and a false negative indicates that a tumor predicted to be benign is actually malignant. False positives are undesirable because they lead to unnecessary biopsies; false negatives allow malignant tumors to be undetected and untreated, decreasing chances of the patient’s survival.

Results of the metric on the training data indicate that the logistic model designed in this study has lower rates of false positives than the decision tree model and lower rates of false negatives than the BI-RADS assessment model.

Since the logistic model was fit to the training data, its performance on that same data is likely to be optimistic, so it is important to test the model on new data in order to truly measure effectiveness. Therefore, the same calculations were made on the test data set of 100 new data points. Results on the test data indicate that the logistic model is actually the most successful of the three models in predicting malignancies, with 78% accuracy, lower rates of false positives than the decision tree model, and lower rates of false negatives than the BI-RADS assessment model. In general, it appears that the CAD decision tree assessment results in a high percentage of false positives (which can lead to excess biopsy operations) and the BI-RADS assessment results in a high percentage of false negatives (which can lead to untreated malignant tumors). The logistic model balances the percentages of false positives and false negatives.

These results are highly promising in their indication that the simple logistic model is comparable to, and perhaps more effective than, the BI-RADS and CAD assessments. Further tests should be done on larger datasets to corroborate these findings.

#### 4 Conclusions

This project utilizes a multivariate logistic regression to develop a model for the probability of a malignant tumor as a function of age, shape, margin, and mass density. The strategy of maximizing the log-likelihood function with respect to the regression coefficients was employed, and several marginal effects calculations were made to determine the effects of the various covariates on probability of malignancy. The model allows for simple probability calculations upon input of the BI-RADS mammogram covariates, and can facilitate the decision-making process when considering the need for biopsy.

This research serves as a stepping stone for several areas of further work. The next step to validate the results of this study is to repeat the regression with different training and test set sizes, and use cross-validation to ensure that every example from the original dataset has the same chance of appearing in the training and testing set. Furthermore, a range of cut-offs of the predictive probability of malignancy in the logistic regression score should be tested, in addition to the cut-off of 0.5 used in the study. For each of these cut-offs, the sensitivity and specificity can be estimated and compared to the sensitivities and specificities of the CAD decision tree and BI-RADS approaches. Another area for future work is to gather a dataset with screening mammogram data from asymptomatic women to confirm the results and determine statistical significance of the findings. It is important to note that BI-RADS scores from mammograms are often not the only factor involved in the decision of whether or not to biopsy. This is especially true in younger women, for whom mammograms have low accuracy due to higher density breast tissue; a young symptomatic patient would likely get a biopsy in spite of a low BI-RADS score.11
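One way to set up the cross-validation proposed above is k-fold splitting, in which every data point appears in exactly one test fold. The sketch below only generates the index splits; the function name, the choice of k, the seed, and the dataset size in the usage line are assumptions.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)         # shuffle once, then deal into folds
    folds = [indices[i::k] for i in range(k)]    # fold sizes differ by at most one
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Illustrative usage on a hypothetical dataset of 23 points.
splits = list(k_fold_indices(23, k=5))
```

The model would be refit on each `train_idx` and scored on the matching `test_idx`, and the per-fold rates averaged to obtain a less split-dependent estimate than a single hold-out set.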

Overall, the logistic model is an excellent first step in developing a simple quantitative measure that can guide radiologists in mammogram analyses and recommendations for biopsies, leading to more effective early stage breast cancer detection.

#### 5 References

1. A. Jemal, T. Murray, and E. Ward. Cancer statistics, 2005. Cancer J. Clin. 55, 10-30. 2005.

2. L. L. Humphrey, M. Helfand, B. K. Chan, and S. H. Woolf. Breast cancer screening: A summary of the evidence for the U.S. Preventive Services Task Force. Ann. Intern Med. 137, 347-360. 2002.

3. Nelson HD, Tyne K, Naik A, Bougatsos B, Chan BK, Humphrey L. Screening for Breast Cancer: An Update for the U.S. Preventive Services Task Force. AHRQ Publication No. 10-05142-EF-5, November 2009.

4. Berg W, Hendrick E, et al. Frequently Asked Questions about Mammography and the USPSTF Recommendations: A Guide for Practitioners. [http://sbi-online.org/associations/8199/files/Detailed Response to USPSTFGuidelines-12-11-09-Berg.pdf].

5. Rosenberg RD, Yankaskas BC, et al. Performance benchmarks for screening mammography. Radiology. 2006;241(1):55-66.

6. Liang W, Lawrence W, Burnett CB, et al. Acceptability of diagnostic tests for breast cancer. Breast Cancer Res Treat. 2003;79:199-206.

7. American College of Radiology, Breast Imaging Reporting and Data System BI-RADS, Atlas 2006.

8. M. Elter, R. Schulz-Wendtland and T. Wittenberg. The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Medical Physics 2007;34(11):4164-4172.

9. Frank, A. and Asuncion, A. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. 2010. [http://archive.ics.uci.edu/ml]

10. Liao, T.F. Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models. Quantitative Applications in the Social Sciences. SAGE, 1994;101:18.

11. Yankaskas BC, Haneuse S, et al. Breast Cancer Surveillance Consortium. Performance of first mammography examination in women younger than 40 years. J Natl Cancer Inst. 2010;102(10):692-701.