
Meteorological Phenomena Forecast Using Data Mining Prediction Methods

František Babič¹, Peter Bednár², František Albert¹, Ján Paralič², Juraj Bartók³, and Ladislav Hluchý⁴

¹ Department of Cybernetics and Artificial Intelligence, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9/B, 042 00 Košice, Slovakia
[email protected], [email protected]
² Centre for Information Technologies, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Němcovej 3, 042 00 Košice, Slovakia
{peter.bednar,jan.paralic}@tuke.sk
³ MicroStep-MIS spol. s.r.o., Čavojského 1, 841 08 Bratislava, Slovakia
[email protected]
⁴ Institute of Informatics of the Slovak Academy of Sciences, Dúbravska cesta 9, 845 07 Bratislava, Slovakia
[email protected]

Abstract. The occurrence of various meteorological phenomena, such as fog or low cloud cover, has a significant impact on many human activities, such as air or ship transport operations. The management of air traffic at airports was the main motivation for designing effective mechanisms for the timely prediction of these phenomena. For both phenomena, meteorologists already use physical models based on differential equations as simulations. Our goal was to design, implement and evaluate a different approach based on suitable techniques and methods from the data mining domain. The selected algorithms were applied to historical data from meteorological observations at several airports in the United Arab Emirates and Slovakia. In the first case, fog occurrence was predicted from METAR messages with algorithms based on neural networks and decision trees. In the second case, low cloud cover was forecast at the national Slovak airport in Bratislava with decision trees. The whole data mining process was managed by the CRISP-DM methodology, one of the most widely accepted in this domain.

Keywords: meteorological data, prediction, decision trees, neural networks.

P. Jędrzejowicz et al. (Eds.): ICCCI 2011, Part I, LNCS 6922, pp. 458–467, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction
The most important meteorological phenomena include clouds, hurricanes, lightning, rain and fog. These events strongly influence many day-to-day activities, so their effective forecast is an important decision factor in domains such as traffic and transport, agriculture, tourism and public safety. We have selected the management of air traffic at specified local airports as the business goal that our solution shall support. The experimental forecast of meteorological phenomena is a very difficult process based on many sources containing raw data in


various formats. These input datasets have to be preprocessed in cooperation with domain experts in order to obtain good-quality data with all necessary attributes and metadata. The main goal of the presented work is to examine the suitability of data mining methods for forecasting selected meteorological phenomena in the specific conditions of local airports in the United Arab Emirates and Slovakia. Both cases have their specifics and differences that have to be considered during data preparation and the selection of suitable prediction methods. The obtained results will be compared with existing methods already in use in order to create an effective automatic system for fog and low cloud forecasting at airports. This Airport Weather System will be deployed by the MicroStep-MIS company, based on agreed contracts, in the examined localities.

The paper is organized as follows: the first section contains an introduction and a brief presentation of the data mining domain and of the relevant Slovak national project called Data Mining Meteo; the next section describes several similar or relevant research approaches; the core section is devoted to a detailed description of the whole data mining process in both cases, based on the CRISP-DM methodology; and the paper closes with a short summary and a sketch of our future work.

1.1 Data Mining
In our case, data mining covers the application of the whole process, with well-selected methods, to heterogeneous meteorological data in order to effectively predict the specified phenomena [13]. This process can also be labeled knowledge discovery in databases; that name is mainly used in academia, but in this paper we understand data mining as the whole knowledge discovery process.
This field combines methods from statistics, artificial intelligence, machine learning and database management, and is applied in business (insurance, banking, retail), scientific research (astronomy, medicine) and government security (detection of criminals and terrorists) [14].

CRISP-DM (CRoss Industry Standard Process for Data Mining¹) is an industry initiative to develop a common, tool-neutral standard for the whole data mining process. The methodology is based on practical experience that various companies gained while solving data mining tasks. The whole process can be understood as a life cycle containing six main phases:

• Business understanding is oriented to the specification of the business goal, followed by the transformation of this goal into concrete data mining task(s).
• Data understanding covers the collection of the input data necessary for the specified tasks, together with their understanding and initial description.
• Data preparation contains all performed preprocessing methods, such as data transformation, integration, aggregation, reduction, etc.
• In the Modeling phase, the selected algorithms are applied to the prepared data, producing specific models.
• Evaluation assesses the obtained models based on specific metrics, which depend on the type of model used.
• Deployment contains the exploitation of the created data mining models in real cases, their adaptation and maintenance, and the collection of acquired experience and knowledge.

¹ http://www.crisp-dm.org/

1.2 DMM Project
The tasks described in this paper have been solved within the Slovak national project called Data Mining Meteo. The project consortium consists of one business partner, MicroStep-MIS², with extensive experience in meteorology, and two scientific partners with experience in data integration and data mining: the Institute of Informatics of the Slovak Academy of Sciences³ and the Faculty of Electrical Engineering and Informatics⁴, Technical University of Košice. Each partner has long-term experience in the relevant domains. MicroStep-MIS develops, deploys and markets monitoring and information systems in the fields of meteorology, seismology, radiation and emission monitoring, as well as crisis information systems. The Institute of Informatics of the Slovak Academy of Sciences is the leading Grid Computing research institution in Slovakia and has experience in (among others) parallel and distributed computing, grid computing, and the application of these technologies in the Earth Sciences domain. The team from the Faculty of Electrical Engineering and Informatics has practical experience with data mining in various domains, including the mining of unstructured textual data, and with the modification of data mining algorithms for distributed GRID environments.

² http://www.microstep-mis.com/
³ http://www.ui.savba.sk/index_en.php
⁴ http://web.tuke.sk/fei-cit/index-a.html

2 State-of-the-Art
The prediction of visibility-reducing fog starts with a common 3D meteorological model executed for a limited region; its outputs are converted into visibility using empirical formulae [8]. This approach by itself cannot achieve results of satisfactory quality, and common meteorological models often fail to handle inversion weather conditions, which commonly produce fog. Therefore, several experimental models are in development worldwide that further process the results of the common meteorological model: 1D physical fog modeling methods and statistical post-processing of model outputs [9], [10]. The result is then interpreted by a meteorologist, who takes further factors into account: mainly his or her experience with meteorological situations and local conditions, satellite images, real-time data from meteorological stations suggesting that fog has started to form or that conditions are favorable for its occurrence, the condition of the soil in the target locations, snow cover, recent fog occurrences, etc.

A research group from the Italian Aerospace Research Centre developed several fog classifiers based on Bayes networks [1]. The same method was used in [6] to create a basic network structure that was further adapted to local prediction models. This approach was implemented and tested in the conditions of major Australian



airports, and the achieved results represent more than 55 forecasted fogs in a row instead of the previous operational 7-8 forecasted fog cases. Fog formation and its important parameters were identified from a historical dataset collected at the International Airport of Rio de Janeiro [2].

Fog forecasting with association rule mining is described in [12]. That paper presents in detail the whole process, which starts with the collection of relevant data and their understanding; continues with data pre-processing, e.g. feature selection and feature construction, handling of missing values and data transformation; and ends with model creation and rule generation. The identified association rules, with computed confidence and support, represent combinations of factors that triggered fog occurrence. All these rules are then stored in a knowledge base and used to create a relevant expert system.

The weather forecasting problem can be stated as a special case of time series prediction. Time series prediction uses neural network algorithms that are able to learn important characteristics from past and present information. The created model is then used to predict future states of the investigated time series [3], [4]. This approach was used in [5] for fog prediction at Canberra International Airport. The authors created a 44-year database of standard meteorological observations and used it to develop, train and test a neural network and to validate the obtained results. The network was trained to produce forecasts for 3 h, 6 h, 12 h and 18 h time intervals. Results with a cross-validated mean value of 0.937 in 3 lead times indicate good forecasting ability of the network, which was robust to error perturbations in the meteorological data.

Y. Radhika and M. Shashi [7] proposed using the Support Vector Machine method for weather prediction. Time series of daily maximum temperature at a location were analyzed to predict the maximum of the next day. The performance of the Support Vector Machine was compared with that of a Multi Layer Perceptron using the Mean Square Error metric. In the first case the error was in the range of 7.07 to 7.56, whereas in the second it varied between 8.07 and 10.2.

Delays in aircraft traffic caused by weather conditions were predicted at Frankfurt airport with neural networks, decision trees, linear regression and fuzzy clustering [11]. In this case the travel time was used as the target value for the algorithms listed above. The obtained results documented easier interpretation of decision tree and clustering results, as well as up to 20% higher prediction accuracy than with simple mean estimators.
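To make the support and confidence values attached to the association rules in [12] more concrete, the following minimal sketch computes both for a single rule. The observations and the attribute names (`high_humidity`, `calm_wind`, `fog`) are invented for illustration; they are not taken from [12] or from our datasets.

```python
from typing import FrozenSet, Iterable, Tuple


def support_confidence(
    records: Iterable[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    records = list(records)
    if not records:
        raise ValueError("at least one record is required")
    # Records in which all antecedent conditions hold.
    n_antecedent = sum(1 for r in records if antecedent <= r)
    # Records in which antecedent and consequent hold together.
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical observations: each record is the set of conditions that held.
observations = [
    frozenset({"high_humidity", "calm_wind", "fog"}),
    frozenset({"high_humidity", "calm_wind", "fog"}),
    frozenset({"high_humidity", "calm_wind"}),
    frozenset({"high_humidity"}),
    frozenset({"calm_wind"}),
]

rule_support, rule_confidence = support_confidence(
    observations,
    antecedent=frozenset({"high_humidity", "calm_wind"}),
    consequent=frozenset({"fog"}),
)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
# support=0.40, confidence=0.67
```

Support measures how often the whole rule is observed, while confidence measures how often the consequent follows once the antecedent holds; a rule mining tool would typically keep only rules above chosen thresholds for both.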

3 Proposed Approach for Detection of Fog and Low Cloud Cover
The data mining process was divided into two branches based on the two specified business goals: fog forecast and low cloud cover prediction. Several operations were very similar, but each branch has its own specifics and dependencies that had to be handled.


3.1 Business Understanding
In the first case we focused on the best prediction of fog occurrence at a given locality, where any suitable historical data can be used. We identified two possible prediction errors with different costs, i.e. it is more costly to predict a fog which does not appear than to miss a fog which suddenly appears. We specified binary classification, fog or no fog, as our primary data mining task. Even though our main goal was to obtain a prediction of the best possible quality, the interpretation of the rules used for prediction can also be interesting: the ability to comprehensively describe the processes leading to the occurrence of fog.

The prediction of low cloud cover is quite different; we defined five classification categories, each representing the relevant fraction of the sky covered by low cloud. The sky is divided into eight fragments, so e.g. a low cloud cover index of 2/8 means that two sky fragments (out of 8) are covered by low cloud.

3.2 Data Understanding
The data understanding phase started with the selection of data relevant to the specified problems. We investigated several available data sources, such as sets of physical quantities measured automatically by meteorological stations or radars, sets of physical quantities computed by standard physical models, etc. Once the identified data were available, we performed an initial data examination in order to verify their quality. These operations were extended with the calculation of basic statistics for key attributes and their correlations. We selected different input samples to test and evaluate the suitability of the relevant data mining methods for our goals.

For the purposes of fog forecasting we collected historical data geographically covering the area of 10 airports in the United Arab Emirates, mainly located around Dubai and the north coastline, with a time span of 10 years and a granularity of one hour. The quality of the available meteorological data was low, with a high number of missing records (on average 30% of records per airport, for some airports as much as 90%).

In the second case the historical dataset contained data from selected ceilometers and the relevant METAR messages describing weather conditions at the Bratislava airport in Slovakia for the past three years (2007 - 2010). Ceilometers are sensors routinely deployed at airports to measure cloud base heights above the points of their installation. The three ceilometers used here determine the height of the cloud base by laser in 15 s intervals. Sometimes ceilometer data are also used to determine the cloud amount using the FAA method (an approach developed by the U.S. Federal Aviation Administration that uses the widely recognized Rational Coefficient to describe cloud cover), where the result is a simple combination of laser reflection counts in different height categories.

3.3 Data Preparation
Data pre-processing is usually the most complex and most time-consuming phase of the whole data mining process, usually taking 60 to 70 percent of the overall time. We applied the necessary pre-processing operations to obtain datasets ready for



implementation of the selected mining methods. Meteorological data from all sources (i.e. data extracted from messages, meteorological stations and physical model predictions) were integrated into one relational database. The performed operations were very similar for both cases, with small modifications based on the characteristics of the data and the related data mining tasks.

The first step was data extraction from the meteorological messages broadcast by the meteorological stations in METAR format. This message format is fixed, with standard codes denoting the parts of the messages and the data values. The output of this task was structured raw data stored in a relational database. Each record in the integrated database has an assigned valid from/to time interval and the 3D coordinates of the measured area (i.e. ranges for longitude, latitude and altitude). Since each data source had a different data precision and/or granularity, the goal of the next operation was to interpolate the measured values and compute additional data for the requested area and time with the specified data granularity. The same approach was used to replace missing values.

We selected a representative sample for the subsequent modeling phase. Reduction was necessary because of the technical restrictions inherent to some methods, but it can also simplify the task at hand by removing irrelevant attributes and records, and thus even increase the quality of the results. Consultations with the domain experts resulted in specified valid ranges of values and in the detection of invalid data. Out-of-range data were treated as missing values. We also identified the principal problem specific to fog prediction: the unbalanced distribution of fog-positive and fog-negative examples; fog occurred in only 0.36% of cases. We tried to integrate an additional data source, the Climatological Database System (CLDB⁵), but the data quality still had to be improved.
We identified key attributes for both cases:

• Fog prediction: e.g. actual weather at the airport such as fog, drizzle or rain; cloud amount in the first layer; overall cloud amount; visibility; relative humidity; etc. (parameters from METAR messages).
• Low cloud cover prediction: e.g. detection status, CAVOK (Ceiling And Visibility Okay, which indicates no cloud below 1 500 m, visibility of 10 km or more and no cumulonimbus at any level), detection level 1, detection level 2, detection level 3 (parameters from ceilometers and also from METAR).
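The range check and the replacement of missing values described earlier in this section can be illustrated with a simplified sketch. It works on a single, evenly spaced series only, whereas the interpolation in the project also covered the spatial dimension; the function names and the valid range of 0 to 100 for relative humidity are assumptions made for this example.

```python
from typing import List, Optional, Sequence

Series = Sequence[Optional[float]]


def mask_out_of_range(values: Series, low: float, high: float) -> List[Optional[float]]:
    """Turn values outside [low, high] into None so they count as missing."""
    return [v if v is not None and low <= v <= high else None for v in values]


def interpolate_missing(values: Series) -> List[Optional[float]]:
    """Fill interior gaps of an evenly spaced series by linear interpolation.

    Gaps before the first or after the last known value are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    # Walk over neighbouring pairs of known values and fill the gap between them.
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


# Hypothetical hourly relative humidity in percent; 250.0 is an invalid reading.
humidity = [80.0, 250.0, None, 86.0, 88.0]
cleaned = interpolate_missing(mask_out_of_range(humidity, 0.0, 100.0))
print(cleaned)  # [80.0, 82.0, 84.0, 86.0, 88.0]
```

With up to 90% of records missing at some airports, long gaps would dominate such a series, so in practice a maximum gap length for interpolation would probably be needed as well.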

Additionally, the data were enhanced with derived attributes computed using empirical formulas, such as ratios of physical attributes or trends. These attributes included information about the fog situation at neighboring airports (average for the 3 or 5 airports closest to the target area) and relative humidity computed empirically from temperature and dew point. Trend attributes were computed for temperature, dew point, relative humidity, the difference between temperature and dew point, and pressure.

In the second case, low cloud cover, we integrated data from the ceilometers and selected METAR attributes into one dataset for the modeling phase. This aggregation was based on assigning the relevant METAR message (issued every half an hour) to each ceilometer record (taken every 15 s). The final dataset contains 1081 columns: 120 time points x 3 ceilometers x 3 attributes for each ceilometer + 1 target attribute CTOT extracted from the METAR messages.
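The column count of the low cloud cover dataset can be checked with a short sketch that flattens one window of ceilometer records and appends the target. Note that 120 records taken every 15 s span 30 minutes, which matches the half-hourly METAR interval. The nested-list layout and the function name `build_row` are assumptions of this example, and the three attributes per ceilometer are left unnamed because they are not enumerated here.

```python
from typing import List, Sequence

TIME_POINTS = 120   # 120 records x 15 s = 30 minutes, one METAR interval
CEILOMETERS = 3
ATTRIBUTES = 3      # attributes per ceilometer record (not named in this sketch)


def build_row(window: Sequence[Sequence[Sequence[float]]], ctot: float) -> List[float]:
    """Flatten window[time][ceilometer][attribute] and append the CTOT target."""
    if len(window) != TIME_POINTS:
        raise ValueError(f"expected {TIME_POINTS} time points, got {len(window)}")
    row: List[float] = []
    for record in window:
        if len(record) != CEILOMETERS:
            raise ValueError(f"expected {CEILOMETERS} ceilometers per time point")
        for attributes in record:
            if len(attributes) != ATTRIBUTES:
                raise ValueError(f"expected {ATTRIBUTES} attributes per ceilometer")
            row.extend(attributes)
    row.append(ctot)  # target attribute taken from the assigned METAR message
    return row


# Dummy window filled with zeros; real values would come from the ceilometers.
dummy_window = [
    [[0.0] * ATTRIBUTES for _ in range(CEILOMETERS)] for _ in range(TIME_POINTS)
]
example_row = build_row(dummy_window, ctot=6.0)
print(len(example_row))  # 1081
```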

⁵ http://www.microstep-mis.com/index.php?lang=en&site=src/products/meteorology/cldb


In both cases we started with an initial time series of five records, which was further modified in several iterations based on the results of the modeling phase. All these pre-processing operations were carried out with an SQL database, IBM SPSS software and our own applications designed and implemented in Microsoft .Net or C#.

3.4 Modeling
The modeling phase is the core of the whole data mining process, in which the selected data mining methods are applied to the pre-processed data. Our models are simple predictors for time series, where the prediction of outputs for time t+1, …, t+K is based on the sequence of historical data (i.e. a time “window”) from time …, t-2, t-1, t. The prediction of outputs is limited to one hour into the future, i.e. K = 1. There is a whole range of prediction methods, from statistical methods to artificial intelligence methods, such as linear or logistic regression models, Support Vector Machines, neural nets, probabilistic models, decision/regression trees and lists, etc. We tested several of these methods as provided in the IBM SPSS data mining environment and finally selected decision tree models and neural networks.

In order to obtain optimal results, all parameters of these algorithms were tuned while testing several strategies for dividing the input dataset into training and test sets. This division was especially important for fog prediction because of the unbalanced distribution of fog-positive and fog-negative cases. We tested three types of division: random division with 90% of examples for training and 10% for testing; the same random division with stratification; and 10-fold cross validation with stratification. In the case of low cloud cover we have so far carried out initial experiments with decision tree algorithms such as C5.0 or CHAID, based on a division into 50% for training and 50% for testing.

3.5 Evaluation
The created models for fog forecasting were evaluated using these measures:

• Recall = TP / (TP + FN);
• False alarm = FP / (TP + FP);
• True skill score = Recall - False alarm;

where TP (FP) is the number of true (false) positive examples and TN (FN) is the number of true (false) negative examples.
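The three measures can be computed directly from the counts of a binary fog / no fog classification, as in the minimal sketch below. It follows the definitions given above, in which the false alarm term is FP / (TP + FP); this differs from the Hanssen-Kuipers form of the true skill statistic, which subtracts FP / (FP + TN). The counts in the example are hypothetical and are not taken from our experiments.

```python
from typing import NamedTuple


class FogScores(NamedTuple):
    recall: float
    false_alarm: float
    true_skill_score: float


def fog_scores(tp: int, fp: int, fn: int) -> FogScores:
    """Compute recall, false alarm and true skill score as defined in Section 3.5."""
    if tp + fn == 0:
        raise ValueError("recall is undefined without positive examples")
    if tp + fp == 0:
        raise ValueError("false alarm is undefined without positive predictions")
    recall = tp / (tp + fn)
    false_alarm = fp / (tp + fp)
    return FogScores(recall, false_alarm, recall - false_alarm)


# Hypothetical counts for illustration only.
scores = fog_scores(tp=30, fp=20, fn=10)
print(f"recall={scores.recall:.2f}, "
      f"false alarm={scores.false_alarm:.2f}, "
      f"true skill score={scores.true_skill_score:.2f}")
```

None of the three measures uses TN, which is convenient for strongly unbalanced data: the very large number of correctly predicted fog-free hours cannot inflate the score.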

Table 1. Accuracy of generated models (90% training set, 10% testing set) for fog prediction

Model             Recall        False alarm    True skill score
Decision trees    0.77 ± 0.8    0.44 ± 0.14    0.33 ± 0.19
Neural networks   0.68 ± 0.8    0.41 ± 0.1     0.26 ± 0.12
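The stratified variant of the 90% / 10% division listed in Section 3.4 can be sketched as follows. This is an illustrative implementation using only the standard library, not the procedure offered by IBM SPSS; the function name, the seed and the synthetic labels (36 fog cases in 10 000 records, i.e. the 0.36% share mentioned in Section 3.3) are assumptions of this example.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split record indices so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test example per class; a class with a single record
        # therefore ends up in the test part only.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Synthetic labels: 1 = fog, 0 = no fog.
fog_labels = [1] * 36 + [0] * 9964
train_idx, test_idx = stratified_split(fog_labels, test_fraction=0.1, seed=42)
fog_in_test = sum(fog_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), fog_in_test)  # 9000 1000 4
```

A purely random 10% sample of such data could easily contain one or two fog cases, or none at all, which makes recall on the test set very unstable; stratification removes this source of variance.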

The results presented in Table 1 represent the prediction of continuous fog occurrence, i.e. if fog = 1 in t_{i-1} (the previous record in the timeframe) then fog = 1 in t_i (the target attribute). In the next step we eliminated exactly these records and continued the experiments on the new cleansed dataset. This resulted in models with lower prediction quality that nevertheless covered more representative and required situations for prediction at the



airports. We experimented continually with the data distribution in order to balance positive and negative records of fog occurrence. In the current training dataset only 0.2% of cases were fog-positive. We tried to balance the data using simple re-sampling or 10-fold cross validation with stratification (True skill score = 41.44 ± 0.049). The obtained results are plausible and comparable with other existing methods, but they still need improvement in several directions: the quality of the input data can be improved by including data extracted from satellite images; clustering analyses can be used to identify representative patterns among all negative records of fog occurrence, in the same quantity as the positive records; and the created models should be made easier for domain experts to understand.

In the case of low cloud cover we started with basic experiments using the decision tree algorithm C5.0 and 10-fold cross validation for the evaluation of the generated models. Based on the first results, we implemented stratified 10-fold cross validation in order to balance the training data for each target value. The generated models had an accuracy of around 80%, but we identified several inconsistencies in the data characteristics, see Table 2.

Table 2. Coincidence matrix for C5.0 model

CTOT value (original vs. predicted)     -1    0.0    1.5    3.5    6.0    8.0    9.0
-1                                    2016     77    128    128    186     14      0
0.0                                      1     14      0      1      5      2      0
1.5                                      4      0      8      1      1      0      0
3.5                                     11      2      8    258     29      4      0
6.0                                     47      5      5     51    768     70      0
8.0                                      4      0      0      1     47    156      0
9.0                                      0      0      0      0      0      0      1
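The accuracy of around 80% can be reproduced from the counts in Table 2 as the share of records on the main diagonal. The sketch below is only an arithmetic check; the result does not depend on which axis of the matrix holds the original and which the predicted values, and the variable names are chosen for this example.

```python
from typing import List, Sequence

# Counts copied from Table 2; both axes use the same order of CTOT categories.
CTOT_LABELS = [-1.0, 0.0, 1.5, 3.5, 6.0, 8.0, 9.0]
COINCIDENCE_MATRIX: List[List[int]] = [
    [2016, 77, 128, 128, 186, 14, 0],
    [1, 14, 0, 1, 5, 2, 0],
    [4, 0, 8, 1, 1, 0, 0],
    [11, 2, 8, 258, 29, 4, 0],
    [47, 5, 5, 51, 768, 70, 0],
    [4, 0, 0, 1, 47, 156, 0],
    [0, 0, 0, 0, 0, 0, 1],
]


def overall_accuracy(matrix: Sequence[Sequence[int]]) -> float:
    """Share of records whose original and predicted categories coincide."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("the matrix contains no records")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


accuracy = overall_accuracy(COINCIDENCE_MATRIX)
print(f"{accuracy:.3f}")  # 0.795
```

Of the 4053 records in the matrix, 3221 lie on the diagonal, and 2016 of those belong to category -1, so the overall figure is dominated by records with a missing target value.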

These results are plausible too, but they show a relatively high number of falsely classified records, mainly in category -1. This can be caused by the unbalanced distribution of the target attribute; by the high number of missing values in the target attribute, which are labeled with the value -1; and by the low number of records in the category labeled 9 (other meteorological phenomenon, no low cloud). Based on these findings, we carried out several further experiments in which we eliminated records without a target value or merged the three target categories -1, 0.0 and 9.0 into one category. The main problem was the relatively high number of missing values in the target attribute CTOT, which will be addressed as part of our future work.

3.6 Deployment
The main aim of this last phase is to design and develop a practical deployment plan for the best generated models. This plan covers the strategy for implementation, monitoring and maintenance in order to provide an effective application. In our case, the generated models with a detailed description will be used as an integrated part of the Airport Weather System developed by MicroStep-MIS, a company specialized in the design, development and manufacturing of various monitoring and information systems. The whole data mining process will be evaluated in order to identify “best practices” for future projects of a similar character.


4 Conclusion
The prediction of various meteorological phenomena is an important factor for many human activities. We have predicted fog occurrence at several airports in the United Arab Emirates and low cloud cover at the Slovak national airport in Bratislava. In both cases we collected the necessary historical data, mainly from METAR messages and ceilometers. All data were preprocessed and evaluated using the selected data mining methods: decision trees and neural networks. We used the CRISP-DM methodology, one of the most widely used methodologies in this domain, to carry out the whole data mining process. This approach contains six main phases with relevant operations, conditions and opportunities; see section 3 for detailed information.

We have implemented the whole chain of data preprocessing operations, which extracts and integrates data from various meteorological sources, as our own application; it will be presented as one of the core outputs of the Data Mining Meteo project. The description of the methods used and their parameters can serve in the future as supporting or inspiring material for anyone performing similar experiments; based on the available information it will be possible to avoid repeating the same inappropriate steps and thus to save money, time and energy. According to the preliminary results presented in section 3.5, our models can be compared with the other existing methods based on the global physical model and empirical rules.

Our future plans contain several interesting and promising tasks, mainly oriented towards data processing: collection and integration of additional data sources (e.g. satellite images); experimental evaluation of various balancing strategies for positive and negative fog records (e.g. clustering analyses resulting in representative negative patterns); descriptive data mining; and further experiments with additional algorithms for low cloud cover.

Acknowledgments. The work presented in the paper was supported by the Slovak Research and Development Agency under the contract No. VMSP-P-0048-09 (40%); the Slovak Grant Agency of Ministry of Education and Academy of Science of the Slovak Republic under grant No. 1/0042/10 (30%) and project implementation: Development of the Center of Information and Communication Technologies for Knowledge Systems (ITMS project code: 26220120030) (30%) supported by the Research & Development Operational Program funded by the ERDF.

References
1. Zazzaro, G., Pisano, F.M., Mercogliano, P.: Data Mining to Classify Fog Events by Applying Cost-Sensitive Classifier. In: International Conference on Complex, Intelligent and Software Intensive Systems 2010, pp. 1093–1098 (2010) ISBN 978-1-4244-5917-9
2. Ebecken, F.F.: Fog Formation Prediction In Coastal Regions Using Data Mining Techniques. In: International Conference On Environmental Coastal Regions, Cancun, Mexico, vol (2), pp. 165–174 (1998) ISBN 1-85312-527-X
3. Acosta, G., Tosini, M.: A Firmware Digital Neural Network for Climate Prediction Applications. In: Proceedings of IEEE International Symposium on Intelligent Control 2001, Mexico City, Mexico (2001) ISBN 0-7803-6722-7
4. Koskela, T., Lehtokangas, M., Saarinen, J., Kaski, K.: Time Series Prediction With Multilayer Perceptron, FIR and Elman Neural Networks. In: Proceedings of the World Congress on Neural Networks, pp. 491–496. INNS Press, San Diego (1996)
5. Fabbian, D., de Dear, R., Lellyett, S.: Application of Artificial Neural Network Forecasts to Predict Fog at Canberra International Airport. Weather and Forecasting 22(2), 372–381 (2007)
6. Weymouth, G.T., et al.: Dealing with uncertainty in fog forecasting for major airports in Australia. In: 4th Conference on Fog, Fog Collection and Dew, La Serena, Chile, pp. 73–76 (2007)
7. Radhika, Y., Shashi, M.: Atmospheric Temperature Prediction Using SVM. International Journal of Computer Theory and Engineering 1(1), 1793–8201 (2009)
8. Gultepe, I., Müller, M.D., Boybeyi, Z.: A new visibility parameterization for warm fog applications in numerical weather prediction models. J. Appl. Meteor. 45, 1469–1480 (2006)
9. Bott, A., Trautmann, T.: PAFOG - a new efficient forecast model of radiation fog and low-level stratiform clouds. Atmos. Research 64, 191–203 (2002)
10. COST 722 - Short range forecasting methods of fog, visibility and low clouds. Final Report, COST Office, Brussels, Belgium (2007)
11. Rehm, F.: Prediction of Aircraft Delay at Frankfurt Airport as a Function of Weather. Presentation from German Aerospace Center, Germany (2004)
12. Viademonte, S., Burstein, F., Dahni, R., Williams, S.: Discovering Knowledge from Meteorological Databases: A Meteorological Aviation Forecast Study. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS, vol. 2114, pp. 61–70. Springer, Heidelberg (2001)
13. Hluchý, L., Habala, O., Tran, D.V., Ciglan, M.: Hydro-meteorological scenarios using advanced data mining and integration. In: The Sixth International Conference on Fuzzy Systems and Knowledge Discovery, vol. 7, pp. 260–264. IEEE Computer Society, Los Alamitos (2009) ISBN 978-0-7695-3735-1
14. Clifton, C.: Encyclopedia Britannica: Definition of Data Mining (2010), http://www.britannica.com