Photometric Determination of Quasar Candidates - Semantic Scholar

1 downloads 0 Views 188KB Size Report
Sheelu Abraham and Ninan Sajeeth Philip. Santhom Computing ... could detect quasars with above 96.96% confidence regarding their true classifi- cation.
**FULL TITLE** ASP Conference Series, Vol. **VOLUME**, **YEAR OF PUBLICATION** **NAMES OF EDITORS**

Photometric Determination of Quasar Candidates Sheelu Abraham and Ninan Sajeeth Philip Santhom Computing Facility, St. Thomas College, Kozhencheri, India -689641 Abstract. We describe an efficient and fast method for the detection and classification of quasars using a machine learning tool, making use of photometric information from SDSS DR7 data release. The photometric information used are the ten independent colours that can be derived from the 5 filters available with SDSS and the machine learning algorithm used is a difference boosting neural network (DBNN) that uses Bayesian classification rule. An adaptive learning algorithm was used to prepare the training sample for each region. Cross validations were done with SDSS spectroscopy and it was found that the method could detect quasars with above 96.96% confidence regarding their true classification. The completeness at this stage was 99.01%. Contaminants were mainly stars and the incorrectly classified quasars belonged to a few specific patches of redshifts. Color plots indicated that the colors of some stars and quasars in those redshits were indistinguishable from each other and was the major cause of their incorrect classification. A confidence value (computed posterior Bayesian belief of the network) was assigned to every object that was classified. Most of the incorrect classifications had a low confidence value. This information may be used to filter out contaminants and improve the classification accuracy at the cost of reduced completeness.

Introduction Modern surveys are capable of gathering several terabytes of data on a daily basis. Data management and its classification thus becomes a real challenge in astronomy. Automated classification tools such as Bayesian classifiers, Support Vector Machines, Decision trees etc have found extensive application in large surveys. Thus the need for more robust photometric classification and photometric redshift estimation tools are rapidly becoming apparent. This brief paper describes the use of a machine learning tool for classification and the identification of interesting objects from large surveys considering photometric identification of quasar candidates from Sloan Digital Sky Survey (SDSS1 ) as an example. Although it may be argued that spectroscopy is more reliable for classification, the inherent spectroscopic limitations at fainter levels pauses significant limitations on deep sky surveys. SDSS is an all sky survey covering 11663 deg2 of the northern sky providing optical images of 357 million celestial objects (Abazajian et al. 2009). We used a Difference Boosting Neural Network (DBNN) (Philip & Joseph 2000), which

1

http://www.sdss.org/dr7/

1

2

Abraham and Philip

is a machine learning implementation of the Bayesian theorem, for this study. DBNN is an extension of Naive Bayes learning algorithm (Elkan 1997). It is a supervised neural network and thus requires a training data to configure the network before it can be used for classification of unseen data. Object Classification The classification was restricted to the colour space dominated by quasars (Richards et al. 2004) in SDSS Data Release 7 that gave 90033 unresolved objects for which spectroscopic identification was available. The dataset consisted of point sources of all spectral types from the region. A representative sample of this data was taken as the training sample. The classification was done using the ten colours, derived from five filters, namely u-g, u-r, u-i, u-z, g-r, g-i, g-z, r-i, r-z, i-z and the u magnitude, all obtained from the SDSS DR7 photometric catalog after applying necessary corrections. The classification label taken was the object’s spectral type given by SDSS. In the training process, the Bayesian estimate for each of the training example is computed and stored by the network. This information is later used for classification when a new data is presented to the network. When a prediction is made, in addition to a predicted spectral type, DBNN returns the computed Bayesian posterior estimate for its prediction. This value can be used as a measure of the confidence the network has for each prediction. In our example, the training data consisted of 1698 stars, 2875 quasars, 529 galaxies and 178 late-type stars. After training, the network was tested on the entire dataset containing 90033 objects. The network predictions for the objects are given in Table 1 . It may be noted that the network was able to classify quasars with a completeness of 99.01% and was able to correctly pick up about 96.96% quasars from the total number of quasars in the region. The contamination from other sources were about 3.04%. Table 1. The total number of objects and the percentage of correct classification by the machine learning tool for each class are shown.

Object

Total no. of objects

Correctly Predicted

Total Predicted

% of prediction

Star Galaxy Quasar Star-Late

11867 585 77552 29

9964 53 76789 2

10703 129 79198 3

83.96 6.77 99.01 6.89

Total

90033

86808

90033

96.42

The confidence plot for correctly and incorrectly classified objects are shown in Figure 1. In the histogram, the gray region corresponds to correctly classified (passed) objects and the black to incorrectly classified (failed) objects. It may

Photometric Determination of Quasar Candidate

3

Figure 1. Confidence levels of correctly classified and incorrectly classified objects. Gray colour corresponds to correctly classified objects and black to incorrectly classified. Note the relative abundance of incorrectly classified objects at lower confidence levels.

Figure 2. Colour merges between quasars classified as stars and stars classified as quasars are shown. The ten colour features used are given on the x-axis and their ranges are on the y-axis. Gray colour corresponds to quasars incorrectly classified as stars and white to stars incorrectly classified as quasars.

be noted that we can remove most of the incorrectly classified objects by adding certain cut off on the confidence level. This process, however, removes a good number of correctly classified objects that falls within that confidence level. This is the cost of the confidence cut off.

Beyond Classification There were 763 quasars that got incorrectly classified in the dataset. We analyzed the failed data in detail. The main contaminants were stars (∼90.7%). The colour plots shown in Figure 2 shows the colours of those failed objects in all the ten features. It may be noted that the features for both classes span the same colour space in all ten features. The reason for this is obvious in Figure 3(a) which reveals that failures are at some specific patches of redshift, where the redshifted colours of quasars matches that of a few stars.

4

Abraham and Philip

Figure 3. The incorrectly classified objects in the data set are from specific patches of redshifts. The right side plot shows histogram after removing objects with confidence below 60%. Note that it removes most of the contaminants.

The important point to be noted here is that Bayes rule assigns lower confidence value to objects belonging to such overlapping patches in the feature space and that they can be easily filtered out for separate analysis as shown in Figure 3. Of these, we found 197 quasars that were incorrectly labeled as galaxy by SDSS. Conclusion In addition to photometric classification, machine learning tools can have interesting and useful applications in the study of the outliers that may be used to discover the inherent limitations in a system and the better understanding of the universe. It is found that using a Bayesian network, objects that exhibit peculiar feature can be detected easily by filtering out objects with lower confidence values. Low confidence could also signify the presence of a new class of objects in the dataset. This is a very desirable feature for the selection of candidates for follow up observations. Acknowledgments. This work was supported by ISRO Respond project grant No: ISRO/RES/2/339/2007-08. Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautis and Space Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society. The SDSS web site is htpp://www.sdss.org/. References Abazajian, K. N., et al. 2009, ApJS, 182, 543 Philip, N. S., & Joseph, K. B. 2000, Journal of Intelligent Data Analysis, 4, 463 Elkan, C. 1997, Technical Report No.CS97-557 Richards, G. T., et al. 2004, ApJS, 155, 257