A machine learning approach to the accurate prediction of monitor units for a compact proton machine

Baozhou Sun(a), Dao Lam, Deshan Yang, Kevin Grantham, Tiezhi Zhang, Sasa Mutic, and Tianyu Zhao
Department of Radiation Oncology, Washington University School of Medicine, 4921 Parkview Place, Campus Box 8224, St. Louis, MO 63110, USA

(Received 26 September 2017; revised 26 January 2018; accepted for publication 5 February 2018; published 23 March 2018)

Purpose: Clinical treatment planning systems for proton therapy currently do not calculate monitor units (MUs) in passive scatter proton therapy due to the complexity of the beam delivery systems. Physical phantom measurements are commonly employed to determine the field-specific output factors (OFs) but are often subject to limited machine time, measurement uncertainties, and intensive labor. In this study, a machine learning-based approach was developed to predict output (cGy/MU) and derive MUs, incorporating the dependencies on gantry angle and field size for a single-room proton therapy system. The goal of this study was to develop a secondary check tool for OF measurements and eventually eliminate patient-specific OF measurements.

Method: The OFs of 1754 fields previously measured in a water phantom with calibrated ionization chambers and electrometers for patient-specific fields with various range and modulation width combinations for 23 options were included in this study. The training data sets for machine learning models in three different methods (Random Forest, XGBoost, and Cubist) included 1431 (~81%) OFs. Ten-fold cross-validation was used to prevent "overfitting" and to validate each model. The remaining 323 (~19%) OFs were used to test the trained models. The difference between the measured and predicted values from the machine learning models was analyzed. Model prediction accuracy was also compared with that of the semi-empirical model developed by Kooy (Phys. Med. Biol. 50, 2005). Additionally, the gantry angle dependence of the OFs was measured for three groups of options categorized by the selection of the second scatterers. The field size dependence of the OFs was investigated for measurements with and without patient-specific apertures.

Results: All three machine learning methods showed higher accuracy than the semi-empirical model, which shows a considerably large discrepancy of up to 7.7% for treatment fields with full range and full modulation width. The Cubist-based solution outperformed all other models (P < 0.001) with a mean absolute discrepancy of 0.62% and a maximum discrepancy of 3.17% between the measured and predicted OFs. The OFs showed a small dependence on gantry angle for small and deep options while they were constant for large options. The OF decreased by 3%–4% as the field radius was reduced to 2.5 cm.

Conclusion: Machine learning methods can be used to predict OFs for double-scatter proton machines with greater prediction accuracy than the most popular semi-empirical prediction model. By incorporating the gantry angle dependence and field size dependence, the machine learning-based methods can be used for a sanity check of OF measurements and bear the potential to eliminate the time-consuming patient-specific OF measurements. © 2018 American Association of Physicists in Medicine [https://doi.org/10.1002/mp.12842]

Key words: machine learning, monitor units, output factor, proton therapy

1. INTRODUCTION

Proton beam therapy, characterized by its low entrance dose to the target and a sharp dose fall-off at the distal range, has significant dosimetric advantages in sparing normal tissue.1,2 However, this advanced technology has not been widely available because of its prohibitive cost and the space required to host a facility. There are currently 25 proton treatment centers in clinical operation in the US and many others are under construction.
With the reduction of cost, it is anticipated that the technology will continue to proliferate in the future, increasing the accessibility of proton therapy to cancer patients.


One of the most important processes in radiation therapy to ensure safe and accurate treatments is to convert the prescribed dose into machine-deliverable monitor units (MUs). Due to the complexity of the beam delivery systems, current proton treatment planning systems (TPSs) do not calculate MUs of treatment fields for passively scattered proton delivery techniques. In current clinical practice, the field-specific output factor (OF), i.e., dose per MU (d/MU), is determined with measurements prior to treatment. This process is time-consuming and requires beam time.


To avoid incorrect dose delivery to patients, the actual MUs delivered for each radiation field should be checked independently with a secondary method. Such an independent verification of MUs for proton therapy is very important for identifying errors in the MU derivation. However, there is no commercial software to check MU measurements. Kooy et al. developed a semi-empirical model to predict OFs for a passive scattering proton therapy system at the Massachusetts General Hospital (MGH).3 In this model, the OF was modeled as a function of the combined parameters of range (R), modulation width (M), and source shift changes, which is given as

d/MU = CF × (1 + a1·r^a2) × (1 + a3·(R − minR))    (1)

where CF is a constant that corrects the output variations for different options, and a1, a2, and a3 are option-specific fitting parameters. Range is denoted by R, and minR is the lowest range of the option. The value of r is a function of R and M, explicitly r = (R − M)/M. This model is highly dependent on the vendor-specific definitions of R and M. Some vendors define the modulation width from the proximal 90% to the distal 90% dose level, while others define it from the proximal 98% or 95% dose to the distal 90% dose level. A nominal modulation width has to be used to provide a better fit of the measured OFs. A constant of 0.91 was used to convert the University of Pennsylvania's definition (proximal 90% to distal 90% dose level) to the original theoretical definition of proximal 100% to distal 100%.4 Mevion defines R as the distal 90% dose level of the normalized percentage depth dose and M as the distance between the proximal 95% and distal 90% dose levels. Ferguson et al. implemented a similar model using different variations in the definitions of R and M to predict OFs from limited data points for the Mevion S250.5 In addition, Kooy's model predicts OFs with relatively large deviations (>3%) for full range and full modulation (where R = M).6 This is because, with full range and full modulation width, the percentage depth dose shows a flat dose distribution from the entrance to the distal 90% dose level. Therefore, there is a large uncertainty between the nominal modulation width (used for fitting) and the actually measured modulation width, resulting in a significant deviation in the fitting.

Machine learning is an interdisciplinary field combining computer science and statistics to develop models with the intent of delivering maximal predictive accuracy.7 These new mathematical tools open a new horizon for predicting treatment outcomes8 and for quality assurance9–11 in the field of radiotherapy. In this work, we built several different machine learning models with a part of the measured OFs as the training data. The accuracy of the machine learning-based models was evaluated using the rest of the measured OFs and subsequently compared with the values from the analytical model published by Kooy et al. By incorporating the OF dependence on gantry angle and field size, we demonstrated that the machine learning-based approach can achieve a more robust and accurate prediction of MUs for passively scattered proton therapy.
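For concreteness, Eq. (1) can be expressed as a short Python function; the numeric values passed below are hypothetical placeholders, not fitted parameters from Ref. 3.

```python
def kooy_output_factor(R, M, CF, a1, a2, a3, min_R):
    """Semi-empirical d/MU of Eq. (1): CF * (1 + a1*r^a2) * (1 + a3*(R - minR)),
    with r = (R - M) / M. All parameters are option-specific fit values."""
    r = (R - M) / M
    return CF * (1.0 + a1 * r ** a2) * (1.0 + a3 * (R - min_R))

# Hypothetical illustration only; CF, a1, a2, a3 are NOT the published fit values.
print(kooy_output_factor(R=15.0, M=10.0, CF=1.0, a1=0.02, a2=0.5, a3=0.01, min_R=13.3))
```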

2. MATERIALS AND METHODS

2.A. Mevion S250 beam line

The Mevion S250 proton therapy system utilizes a superconducting magnet synchrocyclotron with a passive double scattering method for beam delivery. The maximum magnetic field is 9 T. The system uses low power and is compact and lightweight, which allows for a gantry-mounted cyclotron. The treatment gantry rotates from 5° to 185°. The proton beam exits the synchrocyclotron and enters the beam-shaping system through a vacuum window approximately 2 m away from the isocenter. Protons are accelerated to 250 MeV before passing through the field shaping system (FSS). The FSS comprises a first scatterer, 14 range modulator wheels (RMWs), two range shifters, three second scatterers, and a final absorber. With different combinations of the FSS, 24 options for different field sizes, ranges, and modulation widths are available for treating targets with range up to 32 g/cm2 with a maximal field size of 14 cm, and up to 25 g/cm2 in depth with a maximum field size of 25 cm. All options are categorized into three groups, "small," "deep," and "large," based on the selection of the second scatterers. Options 1 to 12 are "large" options and treat targets up to 25 g/cm2 in depth with a maximum field size of 25 cm and modulation width up to 20 g/cm2. Options 13 to 17 are "deep" options and treat targets with depths from 20 to 32 g/cm2 and a maximum field size of 14 cm. The maximum modulation width is limited to 10 cm for the five deep options. Options 18 to 24 are "small" options and treat targets up to 20 g/cm2 in depth with a maximum field size of 14 cm and modulation widths from 2 to 20 g/cm2.

The spread-out Bragg peak (SOBP) is achieved using an approach similar to that of other double-scatter systems. Proton beams pass through a range modulator wheel with steps of different thicknesses. The wheel spins during treatment and delivers individual Bragg peaks to the proper depths, such that they combine to make a broad SOBP of uniform dose. A control system adjusts the relative intensities of the individual Bragg peaks that create the SOBP through time-dependent corrections to the proton beam current, synchronized with the rotation of the range modulator wheel. This is required in order to correct slopes or ripples in the SOBP and improve the uniformity of dose as a function of depth.12

The proton beams are calibrated following the IAEA TRS-398 protocol.13 The dose monitor chambers in the nozzle are calibrated to produce 1 cGy/MU for a user-defined condition. The reference calibration condition of this machine is option 20 with a range of 15 g/cm2, a modulation width of 10 cm, and a field size of 10 × 10 cm. The calibration measurement point is at the depth of the center of the SOBP (10 g/cm2).


2.B. Output measurements and calculations of monitor units

Output measurements were made in a water phantom in SAD geometry using a calibrated Farmer ionization chamber FC65 or a CC13 chamber (IBA Dosimetry, Schwarzenbruck, Germany) aligned to the isocenter and the center of the SOBP. Due to the SAD geometry, no inverse square correction was needed. The gantry was rotated to 0° for all the measurements. The output dependence on snout position was investigated in previous reports and therefore is not discussed here.14,15 We measured OFs for a variety of patient fields across 23 of the 24 options. Option 13, with range from 28 to 32 cm, had never been used clinically and was excluded from this study. For field radii larger than 3.5 cm, OFs were measured with a Farmer chamber in a 10 × 10 cm square field. For field radii smaller than 3.5 cm, OFs were measured with a cross-calibrated CC13 chamber with patient-specific apertures. For all the measurements, compensators were not used in order to reduce the uncertainty in high dose gradient areas. To reduce the uncertainties due to machine output fluctuations, the OF was measured at the calibration condition (R = 15 g/cm2 and M = 10 cm) prior to the patient-specific field measurements and used to normalize the patient-specific OF measurements. The measured OFs (OFmeas) in units of cGy/MU were used to derive the patient-specific MUs with corrections for gantry angle (fgantry) and field size (ffsz). The MU for each planned field is calculated as MU = dose/OF, where the dose was calculated from a verification plan applied to the water phantom without a compensator, and OF = OFmeas × fgantry × ffsz.

2.B.1. Gantry angle dependence (fgantry)

The MU was derived using the OFs measured at a gantry angle of 0 degrees. For this gantry-mounted synchrocyclotron, the energy spectrum may be altered as the gantry rotates, resulting in variation of the output. We measured the relative OFs vs gantry angle for large, small, and deep options using a cylindrical plastic phantom, attached to the snout, holding a Farmer ionization chamber in the center. The snout extension was adjusted and fixed during the rotation to keep the sensitive volume of the ionization chamber at the radiation isocenter. The measurements were repeated for different options with various ranges and modulation widths.
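A minimal sketch of the MU derivation described in Section 2.B, with the gantry-angle and field-size corrections applied to the measured output (ffsz is defined in the next subsection); the function and variable names are illustrative, not from any released tool.

```python
def monitor_units(planned_dose_cGy, of_meas_cGy_per_MU, f_gantry=1.0, f_fsz=1.0):
    """MU = dose / OF, where OF = OFmeas * f_gantry * f_fsz (Section 2.B)."""
    of_corrected = of_meas_cGy_per_MU * f_gantry * f_fsz
    return planned_dose_cGy / of_corrected

# Example: 180 cGy planned, measured output 0.95 cGy/MU, small correction factors.
mu = monitor_units(180.0, 0.95, f_gantry=0.99, f_fsz=0.97)
print(round(mu, 1))
```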


2.B.2. Field size dependence (ffsz)

The field size factor ffsz was introduced as the ratio of the dose at the isocenter with a certain aperture, denoted by Dap, to the dose measured with a fully opened ring aperture (the largest field size opening), denoted by Dopen: ffsz = Dap/Dopen. For patient-specific treatment fields with field radius less than 3.5 cm, the OFs were measured with and without apertures. In addition, the same measurements were also performed with and without apertures for several square fields with larger field sizes at the calibration condition. For different snout positions, the same physical aperture projects a different field size at the isocenter. Regardless of the snout position, the field size was defined as the projection of the radiation field at the isocenter. It is well known that the output factors for small fields decrease due to charged particle disequilibrium as the field size becomes comparable to the lateral penumbra. The loss of proton fluence due to scattering on the central axis can be described by applying the pencil beam formalism.16 The normalization factor accounting for "missing" protons, which come from outside of a finite-size field, is given by

F(r, σtot) = 1 − exp[−r² / (2σtot²(z))]    (2)

where r is the field radius and σtot is the standard deviation of the angular distribution at depth z; σtot has contributions from the virtual source size and from scatter in the beamline and the patient. Eq. (2) provides a simple form of the OF vs field radius.

2.C. Machine learning models

Predicting OF is a supervised learning problem for which several solutions exist with different requirements and performance. In this study the inputs (features) to the models were option number, range, and modulation width, which were the three major factors determining the OFs. The parameters r = (R − M)/M and minR used in the semi-empirical model are not independent variables and therefore were not used as inputs. The predictors have both numerical values (range and modulation width) and a categorical feature (option number). Here we opted for supervised learning methods that can work with both numerical and categorical features. Tree-based learning algorithms, including random forest, XGBoost, and Cubist, are considered to be among the best and most widely used in data science.17,18 With high accuracy and robustness against overfitting, they can handle nonlinear relationships and high-dimensional data. These algorithms are described below.

2.C.1. Random forest

Random forest19 is an ensemble method in machine learning in which several decision tree predictors are combined. Randomization contributes greatly to the performance of the random forest prediction since it is applied to both the feature space (random feature selection) and the sample space (bagging). In random forest regression, the output prediction is simply the average of the outputs of the decision trees in the forest. As a result of the randomization, while the bias of the forest remains the same as that of one tree, the variance decreases and therefore the error is reduced. Random forest inputs can be either categorical or numerical, which is suitable for the problem in this paper since the option variable is categorical. The scikit-learn20 implementation of random forest offers hyperparameter tuning for the number of trees, maximum depth of trees, maximum number of features, minimum number of samples per split, and minimum number of samples per leaf. We used 1000 trees to achieve the best fitting results. The average score, i.e., mean absolute error and maximum absolute error, was almost the same when the number of trees was larger than 1000, which confirmed that the random forest algorithm did not over-fit the data.
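A minimal scikit-learn sketch of this random forest regression, assuming the three features described above (option number, R, M) and the 1000-tree setting from the text; the toy data values are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Feature columns: option number, range R (cm), modulation width M (cm); toy values.
X = np.array([[20, 15.0, 10.0],
              [21, 12.5,  8.0],
              [22, 10.2,  6.0],
              [20, 14.0,  9.0]])
y = np.array([1.000, 0.981, 0.962, 0.995])  # toy measured OFs (cGy/MU)

# 1000 trees, as described in the text; the trees can split directly on the
# integer option label (a one-hot encoding is an equally valid choice).
rf = RandomForestRegressor(n_estimators=1000, random_state=0)
rf.fit(X, y)
print(rf.predict([[21, 13.0, 9.0]]))
```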


2.C.2. XGBoost

XGBoost21,22 belongs to the class of gradient boosted regression trees (GBRT). GBRT is a generalization of boosting to arbitrary differentiable loss functions. It builds the model by combining a set of weak decision trees in an iterative, forward manner with gradient descent optimization. XGBoost provides parallel tree boosting that solves many data science problems in a fast and accurate way. The advantages of XGBoost are its natural handling of mixed-type data and its robustness to outliers. Several parameters need to be tuned to reach a near-optimal solution. In the original implementation, the main parameters to tune were the learning rate and the depth of the trees. The implementation used in this paper22 provided tuning for the learning rate, maximum depth of a tree, L1 regularization coefficient, L2 regularization coefficient, and subsample rate. These parameters were optimized to achieve the best score after 1000 randomized searches during the training and validation process.
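A sketch of the corresponding XGBoost regressor using the xgboost package's scikit-learn wrapper; the hyperparameters shown are the ones named above, but the specific values are placeholders rather than the tuned values from this study.

```python
import numpy as np
from xgboost import XGBRegressor

X = np.array([[20, 15.0, 10.0], [21, 12.5, 8.0], [22, 10.2, 6.0], [20, 14.0, 9.0]])
y = np.array([1.000, 0.981, 0.962, 0.995])  # toy OFs (cGy/MU)

# The tunable knobs mentioned in the text: learning rate, tree depth,
# L1/L2 regularization, and subsample rate (values here are placeholders).
model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    reg_alpha=0.0,   # L1 regularization coefficient
    reg_lambda=1.0,  # L2 regularization coefficient
    subsample=0.8,
)
model.fit(X, y)
print(model.predict(np.array([[21, 13.0, 9.0]])))
```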


2.C.3. Cubist

Cubist23 is a tree-based model similar to random forest. During the optimization, the random forest algorithm performs predictions using an average of the training points within the terminal node of a given branch, while the Cubist algorithm builds a linear regression model at each terminal node. Cubist is a specific model tree that uses linear model smoothing, rule creation, and tree pruning. It also has a boosting procedure to combine several model trees. Predictions are adjusted using nearby points from the training set. Cubist model training implemented in R was simplified in that the main parameters to tune were the number of committees and neighbors. Committees are somewhat analogous to the number of decision trees that contribute their predictions to the final prediction, and neighbors represent the number of neighboring training points which can be used to aid in prediction.9 Unlike random forest, the number of committees in Cubist is limited to 100. We used the default neighbor control value of 0 in the algorithm. While Random Forest and XGBoost were implemented in Python (Python Software Foundation, https://www.python.org/), the Cubist algorithm was implemented in R.24

2.D. Data set and workflow for training, validation, and testing

Figure 1 shows the process of building and validating the predictive models with 1431 OFs for 23 options measured before April 2016, and testing with 323 OFs measured from April 2016 to May 2017. Ten-fold cross-validation was used to prevent "overfitting", which is the problem of learning overly complex models that are well-tuned to the training data but perform poorly on new data.8 There are two approaches to searching for the optimal hyperparameters: grid search and randomized search. Grid search is suitable when the machine learning algorithm has few hyperparameters to tune or when the hyperparameters are discrete. Randomized search works well when more hyperparameters are tuned and they are continuous. In this paper, we used grid search for Random Forest and Cubist. For XGBoost, randomized search was used to search the hyperparameter space 1000 times to look for the optimal parameters. During the cross-validation, the training data were initially divided into 10 sets, with one set put aside as a "validation set". The models were built from the other nine sets, using the remaining "validation set" to validate the models. This process was repeated 10 times such that every portion was used to assess the performance of the models. The tunable model parameters were iterated to minimize the mean absolute error, i.e., the difference between predicted and measured output values on the validation set.
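A sketch of the 10-fold cross-validation with randomized hyperparameter search described here, applied to the XGBoost model; the synthetic data, search ranges, and the reduced number of iterations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

# Toy stand-ins for the (option, R, M) features and measured OFs.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 25, 200),
                     rng.uniform(5, 20, 200),
                     rng.uniform(2, 15, 200)])
y = 1.0 + 0.002 * X[:, 1] - 0.003 * X[:, 2] + rng.normal(0, 0.005, 200)

search = RandomizedSearchCV(
    XGBRegressor(n_estimators=300),
    param_distributions={
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 8),
        "subsample": uniform(0.5, 0.5),
        "reg_lambda": uniform(0.0, 5.0),
    },
    n_iter=50,                         # the paper searches 1000 times
    cv=10,                             # 10-fold cross-validation
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```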

FIG. 1. Workflow for training, cross-validation, and final testing of predictive models. [Color figure can be viewed at wileyonlinelibrary.com]


The performance of the models was the average score of the model trained on each fold. Using the cross-validation method, the tunable model parameters were optimized from the training data. After the models were built, the final performance of each model was evaluated on a separate test data set.

Table I shows the number of measurements for each option; the options are grouped as large (options 1-12), deep (options 14-17; option 13 was excluded), and small (options 18-24). Because the majority of treatment sites were CNS tumors and pediatric patients, options 20-23, with proton range from 7 to 13.2 g/cm2, were the dominant options for those disease sites, whereas options 15 and 16, with range between 20 and 27 g/cm2, were the predominant options for prostate patients. The predicted OFs were compared with the measured values and the difference was defined as:

difference = 100% × [(d/MU)calc − (d/MU)meas] / (d/MU)meas    (3)
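Eq. (3) and the summary statistics reported later in Table II reduce to a few lines of array arithmetic; the arrays below are placeholders.

```python
import numpy as np

of_meas = np.array([0.981, 0.962, 1.005, 0.990])   # measured d/MU (placeholder)
of_calc = np.array([0.985, 0.958, 1.001, 0.996])   # model-predicted d/MU (placeholder)

diff = 100.0 * (of_calc - of_meas) / of_meas        # Eq. (3), in percent
print(np.mean(np.abs(diff)), np.std(diff), np.max(np.abs(diff)),
      np.mean(np.abs(diff) <= 2.0) * 100.0)         # MAE, SD, max, % within 2%
```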

3. RESULTS

Figure 2 shows histograms of the percentage difference between the modeled output predictions and the measurements for the testing data set for the three machine learning models [Figs. 2(a)-2(c)] and the semi-empirical model developed by Kooy et al. [Fig. 2(d)].

TABLE I. Number of fields per option used for training and testing the models.

Option | Maximum range (cm) | Minimum range (cm) | No. of fields for training | No. of fields for testing
1  | 25.0 | 22.6 | 26  | 4
2  | 22.5 | 20.9 | 25  | 3
3  | 20.8 | 18.8 | 35  | 7
4  | 18.7 | 16.8 | 87  | 17
5  | 16.7 | 14.9 | 85  | 17
6  | 14.8 | 13.2 | 47  | 8
7  | 13.1 | 11.5 | 30  | 16
8  | 11.4 | 10.0 | 46  | 12
9  | 9.9  | 8.6  | 36  | 13
10 | 8.5  | 7.3  | 31  | 20
11 | 7.2  | 6.1  | 16  | 11
12 | 6.0  | 5.0  | 16  | 9
14 | 29.5 | 27.1 | 27  | 3
15 | 27.0 | 24.6 | 29  | 13
16 | 24.5 | 22.1 | 35  | 15
17 | 22.0 | 20.1 | 16  | 0
18 | 20.0 | 17.8 | 36  | 4
19 | 17.7 | 15.4 | 64  | 10
20 | 15.3 | 13.3 | 140 | 18
21 | 13.2 | 11.2 | 237 | 31
22 | 11.1 | 9.1  | 210 | 35
23 | 9.0  | 7.0  | 112 | 35
24 | 6.9  | 5.0  | 45  | 12
The normal distribution fits were also displayed in Fig. 2. The model based on the Cubist algorithm outperformed all other models with the smallest mean absolute error, standard deviation (SD), and maximum absolute error. The modeled output from the Cubist algorithm was within 2% of the measurements for more than 97% of these fields, with only one point exceeding 3%. The normal distribution fit of the histogram from the Cubist algorithm [Fig. 2(c)] shows a sigma of 0.8%. A t-test with P < 0.001 confirmed with significant statistical power that the Cubist algorithm provided more accurate predictions than all other methods. Table II shows the mean absolute error, mean error, and maximum difference between the predicted and measured OFs for the testing data set. Table II also shows the percentage of predictions that were within 2% and 3% of the measured values. On the other hand, the semi-empirical model showed a maximum difference of up to 7.7% between model predictions and measurements. The large prediction errors came from data points with small r = (R − M)/M values. Figure 3 shows the plot of prediction errors vs r. It is evident that the semi-empirical model significantly under-predicts the output for fields with small r values. This finding is consistent with the observation by Kim et al., who reported that the semi-empirical model had large prediction errors when r was less than 0.3.6 At the condition of full range and full modulation, the percentage depth dose shows a flat dose distribution from the entrance to the distal 90% dose level. Therefore, there is a larger uncertainty in determining the nominal modulation width (the distance from the proximal 95% to the distal 90% dose level), resulting in a larger deviation in the fitting. For the random forest method, the largest prediction error (up to 5.4%) also occurred when r was small. However, the errors were more symmetric, with largely equal odds of being over- or under-predicted. The Cubist and XGBoost algorithms, on the other hand, were less sensitive to r.

It is plausible to expect that extra data used in the training would provide higher prediction accuracy. An important question to answer is: what number of training data sets is required by the machine learning models for a converged prediction? To answer this question, we investigated how the number of samples used for training impacted the prediction accuracy of the Cubist algorithm. As the number of samples was not equally distributed across the options in the training data set, we built the Cubist model with 10%, 20%, 30%, ..., and 100% of the number of training data sets in each option and calculated the prediction errors during validation and testing.
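A sketch of this learning-curve experiment, using a scikit-learn random forest as a stand-in for the Cubist model (which was fitted in R) and a simple prefix of the training set rather than the per-option sampling used in the paper; the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.column_stack([rng.integers(1, 25, 1500),
                     rng.uniform(5, 30, 1500),
                     rng.uniform(2, 20, 1500)])
y = 1.0 + 0.002 * X[:, 1] - 0.003 * X[:, 2] + rng.normal(0, 0.005, 1500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

for frac in np.arange(0.1, 1.01, 0.1):   # 10%, 20%, ..., 100% of the training set
    n = int(frac * len(X_train))
    model = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train[:n], y_train[:n])
    train_err = mean_absolute_error(y_train[:n], model.predict(X_train[:n]))
    test_err = mean_absolute_error(y_test, model.predict(X_test))
    print(f"n={n:4d}  train MAE={train_err:.4f}  test MAE={test_err:.4f}")
```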



FIG. 2. Histogram of percent differences between predicted and measured OFs for (a) Random Forest algorithm; (b) XGBoost; (c) Cubist; (d) Semi-empirical model developed by Kooy, et al. [Color figure can be viewed at wileyonlinelibrary.com]

TABLE II. Percent differences between predicted and measured output.

                          | Random forest (%) | XGBoost (%) | Cubist (%)  | Kooy (%)
Mean absolute error ± SD  | 0.88 ± 0.83       | 0.83 ± 0.69 | 0.62 ± 0.52 | 0.83 ± 0.96
Mean error ± SD           | 0.16 ± 1.20       | 0.02 ± 1.08 | 0.14 ± 0.78 | 0.01 ± 1.27
Maximum absolute error    | 5.36              | 3.48        | 3.17        | 7.67
Difference within 2%      | 92.2              | 93.2        | 97.5        | 93.8
Difference within 3%      | 97.2              | 98.5        | 99.7        | 98.5

Figure 4 shows learning curves of the deviations (mean absolute percentage error) between the measured and predicted values vs the number of measurements used in the training set. The training error was calculated as the error on the fraction of the training data set instead of the whole data set. With fewer data, the model tended to over-fit the training data. As the training set size increased, the training error became larger. However, the model managed to fit the validation set better and the validation error decreased. The two learning curves converged with increasing training data. The accuracy did not show significant improvement beyond approximately 1200 samples; i.e., ~1200 data points are sufficient to achieve a mean absolute error of less than 0.7%. With ~1200 training data points used, the gap between training and testing accuracy was ~0.2%. The learning curves shown in Fig. 4 indicate that the Cubist model can predict the OFs with low bias.

The OFs as a function of gantry angle are shown in Fig. 5. The OFs measured at different gantry angles were normalized to the value at a gantry angle of zero degrees. We measured multiple options for each option group (large, deep, and small options). The final gantry angle dependence was averaged for each group. Figure 5 shows that the large options have very small gantry angle dependence and deviate from unity by less than 0.5%. The OFs show increasing sensitivity to gantry angle for small and deep options.

The averaged output at a gantry angle of 180 degrees is 1.5% and 2% lower than that at a gantry angle of zero degrees for deep and small options, respectively. Fifth-order polynomial fits are shown in Fig. 5 and the fitting parameters were used to correct the gantry angle dependence in the MU derivation.

Figure 6 shows the measured field size factor ffsz vs field radius. Due to the irregular shape of patient fields, the field radius was calculated as the maximum distance from the aperture edge to the central axis. The OF measurements were performed on patient-specific fields covering multiple options and depths. Figure 6 shows that the field size factors were close to 1 for field radii equal to or larger than 2.5 cm. The secondary electrons generated by protons have a small range and deposit their energy locally. Therefore, as long as the field radius was larger than 2.5 cm, the aperture did not affect the output. When the field radius decreased to 2 cm, the output was reduced by 3%–4%. The solid line represents the fit of Eq. (2).
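A sketch of fitting Eq. (2) to measured field size factors with scipy; the radius and factor values below are placeholders, not the measured data of Fig. 6.

```python
import numpy as np
from scipy.optimize import curve_fit

def field_size_factor(r, sigma_tot):
    """Eq. (2): fraction of protons retained for a field of radius r (cm)."""
    return 1.0 - np.exp(-r**2 / (2.0 * sigma_tot**2))

radius = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 5.0, 8.0])             # cm, placeholder
ffsz_meas = np.array([0.93, 0.965, 0.99, 0.995, 0.998, 1.0, 1.0])  # placeholder ratios

(sigma_fit,), _ = curve_fit(field_size_factor, radius, ffsz_meas, p0=[1.0])
print(f"fitted sigma_tot ~ {sigma_fit:.2f} cm")
```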



FIG. 3. Distribution of errors vs r = (R − M)/M for the four methods. The maximum absolute errors occurred when r was close to 0 for both the semi-empirical model and the random forest model. (a) Random Forest; (b) XGBoost; (c) Cubist; (d) Semi-empirical model developed by Kooy et al. [Color figure can be viewed at wileyonlinelibrary.com]


FIG. 4. Learning curves: testing and training error vs the number of data points used to build the Cubist model. [Color figure can be viewed at wileyonlinelibrary.com]

4. DISCUSSION

A model capable of accurately predicting the OFs for passively scattered proton beams is critical to ensure correct MU delivery. The model can serve as a secondary verification of measured OFs. In this work we built three models based on machine learning methods to predict OFs and compared them with the semi-empirical models developed for the MGH proton machine3,25 and later implemented for other proton machines.4,5


FIG. 5. Normalized output factors as a function of gantry angle for different option groups. [Color figure can be viewed at wileyonlinelibrary.com]

We have concluded that all three machine learning models provided better results in terms of maximum absolute difference than the previously developed semi-empirical model, which has a large prediction error for proton fields with full range and full modulation width. To overcome these limitations, the vendor-specific definitions of R and M have to be modified to achieve a better fit.4,5 The prediction accuracy of the machine learning approach presented in this study shows less dependence on the r value. Ferguson et al. implemented the semi-empirical models for the Mevion S250 by adding various factors to R or M.5



FIG. 6. Comparison of calculated and measured field size factors (the ratio of the measurement with an aperture to that without an aperture) vs field radius (cm). The circles show the measurement results, and the solid line shows the prediction from Eq. (2). [Color figure can be viewed at wileyonlinelibrary.com]

The models were fitted to 177 data points for all 24 options. Different definitions of R and M were used to find the best fit of the limited data. In addition, the data used for building the model were the same data set used for testing; therefore, the model could predict well on the training data but may not hold in general. Our models were built from over 1400 data points for 23 options and tested on an independent set of data points. Our approach avoids overfitting and the prediction results are more reliable.

Among the three machine learning techniques, both Cubist and XGBoost generated comparable results (with 0.62% ± 0.52% and 0.83% ± 0.69% mean absolute error and 3.17% and 3.48% maximum difference, respectively). The computation time for Cubist was much shorter than that required by the XGBoost algorithm. The Cubist model was the most straightforward method to implement in R24 and provided the most accurate prediction. The open source Cubist library for R can be downloaded at https://topepo.github.io/Cubist/. The data with the three features (R, M, and option number) were fed into the Cubist function to build the model. The final Cubist model was an ensemble of a committee of 100 models. Each model had 30 rules to specify the OF prediction based on the input features. The final prediction was the average of the 100 outputs from the committee. An example of a rule learned by Cubist is: "if Option > 12 and R <= 5.89, then OF = 0.956172 − 0.1019 M + 0.0542 R".
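Read literally, such a rule set is straightforward to evaluate; the toy sketch below shows how one committee member's rules map the three features to an OF and how the committee outputs are averaged. Only the first rule is the one quoted above; the fallback rule and the single-member committee are invented for illustration.

```python
def committee_member(option, R, M):
    """Toy evaluation of Cubist-style rules; only the first rule is from the text."""
    if option > 12 and R <= 5.89:
        return 0.956172 - 0.1019 * M + 0.0542 * R   # rule quoted in the text
    return 1.0                                       # placeholder fallback rule

# The final Cubist prediction averages the outputs of all committee members.
members = [committee_member]                         # the real model has 100 members
of_pred = sum(m(option=20, R=5.5, M=3.0) for m in members) / len(members)
print(of_pred)
```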

The accuracy of the measured OFs is important to the prediction accuracy. The daily machine OF can easily fluctuate by 1%–2%. To eliminate this machine deviation, the OF at the reference field (Option 20, R = 15 cm and M = 10 cm) is always measured, with temperature and pressure corrections, before the patient-specific measurements. The measured patient-specific OFs are normalized to the machine OF at the reference condition.

As shown in Fig. 4, one of the limitations is the requirement for a relatively large number of data points (~1200 measured OFs, which may require nearly 450 patients with each patient treated with 2–3 fields) to build an accurate model. It is not realistic to implement the Cubist model for a single treatment room within the first year after commissioning. Because all Mevion S250 machines have the same FSS, our model may be used to predict OFs obtained at other institutions. Our next step is to validate the model and assess its accuracy using data from other institutions. With further tests, the OF prediction model will not only provide a secondary check of the measurements but also possibly replace the time-consuming OF measurements.

The single-room proton machine with the synchrocyclotron mounted on the gantry eliminates the need for a beam transport line and reduces the space requirement. Due to the large weight of the synchrocyclotron, the output dependence on gantry angle has to be carefully evaluated. Our measurements show that the output decreased by 1.5%–2% for small and deep options, while it was constant for large options. The gantry angle dependence is likely caused by a slight shift of the superconducting coils due to gravity. To compensate for the energy shift with gantry rotation, a variable thickness wedge was introduced at the extraction point inside the synchrocyclotron to fine-tune the proton energy at different gantry angles. The variable thickness wedge may introduce a slight energy spectrum difference and has a small impact on the monitor chamber. The second scatterer, which acts like a Gaussian smoothing of the energy spectrum, is thicker for large options than for small and deep options. Therefore, the skewed spectrum is better smoothed and compensated by a thicker second scatterer, resulting in a smaller output dependence on gantry angle.

The field size factor of the proton output was investigated for a small number of patient-specific fields. It should be noted that the field size factor is complicated and depends on many factors, such as snout position, depth, and aperture size. A field size normalization from a simple pencil beam model explains the field size dependence of the measured output factors. The simple model does not take into account extra aperture edge scatter. We use a divergent cut of the apertures, so the aperture scatter is expected to be small and to contribute dose only at shallow depths.26 The snout position is another factor we do not consider. However, the field radius is defined at the isocenter plane, which is independent of the snout position. To accurately model the field size factor, more measurements are needed. Ideally, the OF measurements should include the field-specific apertures so that the field size can be used as a feature in the machine learning models. This is another limitation of the current machine learning-based prediction: it did not include the field size. In clinical practice, we perform the OF measurement with apertures for small field sizes. With more OFs measured for small field sizes, it will be possible to include the field radius in the model.


5. CONCLUSION

The accurate determination of OFs is one of the most important components of proton treatments using the Mevion S250 double-scatter proton machine. We employed machine learning methods to predict the OFs for patient-specific treatment fields and compared them to Kooy's semi-empirical model. The machine learning models outperformed the semi-empirical model with improved prediction accuracy in terms of maximum absolute error. Among the three machine learning algorithms investigated, the Cubist algorithm provides the most accurate and robust prediction. This model is currently being used as an important secondary check of MUs. The MU derived from the model can potentially replace the time-consuming OF measurement. In addition, we have derived correction factors for the change in OF with field size and gantry angle, which were included in the MU derivation.

CONFLICTS OF INTEREST

The authors have no conflicts to disclose.

a) Author to whom correspondence should be addressed. Electronic mail: [email protected]

REFERENCES

1. Goitein M, Lomax AJ, Pedroni ES. Treating cancer with protons. Phys Today. 2002;55:45–51.
2. Kotagal S. Radiotherapy for the future. Pediatrics. 2004;114:44–49.
3. Kooy HM, Rosenthal SJ, Engelsman M, et al. The prediction of output factors for spread-out proton Bragg peak fields in clinical practice. Phys Med Biol. 2005;50:5847.
4. Lin L, Shen J, Ainsley CG, Solberg TD, McDonough JE. Implementation of an improved dose-per-MU model for double-scattered proton beams to address interbeamline modulation width variability. J Appl Clin Med Phys. 2014;15:297–306.
5. Ferguson S, Ahmad S, Jin H. Implementation of output prediction models for a passively double-scattered proton therapy system. Med Phys. 2016;43:6089–6097.
6. Kim DW, Lim YK, Ahn SH, et al. Prediction of output factor, range, and spread-out Bragg peak for proton therapy. Med Dosim. 2011;36:145–152.
7. Dhar V. Data science and prediction. Commun ACM. 2013;56:64–73.


8. Kang J, Schwartz R, Flickinger J, Beriwal S. Machine learning approaches for predicting radiation therapy outcomes: a clinician's perspective. Int J Radiat Oncol Biol Phys. 2015;93:1127–1135.
9. Carlson JN, Park JM, Park S-Y, Park JI, Choi Y, Ye S-J. A machine learning approach to the accurate prediction of multi-leaf collimator positional errors. Phys Med Biol. 2016;61:2514.
10. Valdes G, Scheuermann R, Hung C, Olszanski A, Bellerive M, Solberg T. A mathematical framework for virtual IMRT QA using machine learning. Med Phys. 2016;43:4323–4334.
11. Valdes G, Chan MF, Lim SB, Scheuermann R, Deasy JO, Solberg TD. IMRT QA using machine learning: a multi-institutional validation. J Appl Clin Med Phys. 2017;18:279–284.
12. Hill PM, Klein EE, Bloch C. Optimizing field patching in passively scattered proton therapy with the use of beam current modulation. Phys Med Biol. 2013;58:5527.
13. Andreo P, Burns DT, Hohlfeld K, et al. Absorbed dose determination in external beam radiotherapy: an international code of practice for dosimetry based on standards of absorbed dose to water. IAEA Technical Reports Series No. 398. Vienna, Austria: IAEA; 2000.
14. Daartz J, Engelsman M, Paganetti H, Bussiere M. Field size dependence of the output factor in passively scattered proton therapy: influence of range, modulation, air gap, and machine settings. Med Phys. 2009;36:3205–3210.
15. Zheng Y, Ramirez E, Mascia A, et al. Commissioning of output factors for uniform scanning proton beams. Med Phys. 2011;38:2299–2306.
16. Hong L, Goitein M, Bucciolini M, et al. A pencil beam algorithm for proton dose calculations. Phys Med Biol. 1996;41:1305.
17. Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA: ACM; 2006:161–168.
18. Valdes G, Luna JM, Eaton E, Simone CB II, Ungar LH, Solberg TD. MediBoost: a patient stratification tool for interpretable decision making in the era of precision medicine. Sci Rep. 2016;6:37854.
19. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
20. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
21. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–1232.
22. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785–794.
23. Kuhn M, Weston S, Keefer C, Coulter N, Quinlan R. Cubist: rule- and instance-based regression modeling. R package version 0.0.15. 2013.
24. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2013.
25. Kooy HM, Schaefer M, Rosenthal S, Bortfeld T. Monitor unit calculations for range-modulated spread-out Bragg peak fields. Phys Med Biol. 2003;48:2797.
26. Zhao T, Sun B, Grantham K, et al. Commissioning and initial experience with the first clinical gantry-mounted proton therapy system. J Appl Clin Med Phys. 2016;17:24–40.