Performance Comparison of Learning Techniques for Intelligent Channel Assignment in Cognitive Wireless Sensor Networks

Chayaphon Tanwongvarl and Soamsiri Chantaraskul

The Sirindhorn International Thai-German Graduate School of Engineering (TGGS)
King Mongkut's University of Technology North Bangkok
1518 Pibulsongkram Road, Bangsue, Bangkok, Thailand
e-mail: [email protected], [email protected]

Abstract— With the increasing number of devices sharing the 2.4 GHz ISM band, the coexistence problem has become one of the major issues experienced by Wireless Sensor Networks (WSNs). Cognitive Wireless Sensor Networks (CWSNs) have been proposed in order to achieve reliable and efficient communication via spectrum awareness and intelligent adaptation. The learning and decision making technique is one of the core competencies of such a system. In this work, three machine-learning techniques under the umbrella of Reinforcement Learning (RL), namely GPOMDP, Episodic-Reinforcement, and True Policy Gradient, are implemented for our proposed learning and decision engine of CWSN. A simulation model has been developed and used for the investigation, and results are obtained for performance comparison in terms of prediction accuracy and WSN system performance. From this study, True Policy Gradient offers better prediction accuracy in comparison with the other two techniques. As a result, the CWSN implementing True Policy Gradient offers the lowest packet delay under an interference environment.

Keywords— cognitive wireless sensor networks, channel assignment, GPOMDP, Episodic-Reinforcement, True Policy Gradient

I. INTRODUCTION

Cognitive Wireless Sensor Networks (CWSNs) have been proposed recently in order to enhance the system performance of Wireless Sensor Networks (WSNs) by providing more efficient spectrum usage. This is due to the fact that there has been a dramatic increase in demand for communication devices operating on the 2.4 GHz ISM band, such as WiFi access points, wireless sensors, and Bluetooth devices. Additionally, the usage environment of such devices is likely to have many transceivers coexisting in the same area. This leads to unavoidable and higher interference among them. As a result, the coexistence problem has become one of the major problems within the 2.4 GHz ISM band. The general conclusion drawn from the studies in [1] and [2] is that IEEE 802.15.4 devices, which are part of WSNs, are severely affected by other systems, especially IEEE 802.11 devices operating on an overlapping frequency band in the same area.

Designed for Low-Rate Wireless Personal Area Networks (LR-WPANs), wireless sensor devices based on the IEEE 802.15.4 standard are limited to low transmit power as they are usually battery-powered devices. The IEEE 802.15.4 standard specifies up to 16 channels; however, with the currently standardized MAC (Medium Access Control) protocol, the entire network, once established, will usually remain on the same channel. As a result, other devices that start transmitting on an overlapping frequency will interfere with this WSN [3].

The Cognitive Radio (CR) concept offers wireless system performance enhancement by combining spectrum awareness with intelligent decision making. Based on such a concept, CWSN has been proposed with the aim of enhancing awareness about the network and environment, and making adaptive decisions based on the application objectives. This work employs the concept of CR. Three machine-learning techniques under the umbrella of Reinforcement Learning (RL), namely GPOMDP [4], Episodic-Reinforcement [5], and True Policy Gradient [5], are employed in our proposed learning and decision engine and investigated. A performance comparison in terms of prediction accuracy and WSN system performance is given in this paper.

The rest of this paper is organized as follows. In Section two, the structure of our proposed intelligent channel assignment for CWSN is given. Section three provides an overview of the three proposed learning techniques. In Section four, the simulation model developed in this work for the performance evaluation and the test scenario are illustrated. Section five shows the simulation results and discussion. Finally, the paper is concluded in Section six.

II. INTELLIGENT CHANNEL ASSIGNMENT

The operational flow of the intelligent channel assignment developed here, which includes the learning and decision making engine, is shown in Figure 1. The figure illustrates the interaction between the learning and decision making process (Agent) and the channel environment (State) for reinforcement learning, which occurs at the PAN coordinator (client computer). The concept of basic learning and decision making is similar to driving on a multi-lane highway under built-up traffic conditions. When another car moves into the lane that we are currently using, congestion increases. Being able to learn and predict what will happen, we try to avoid the foreseen situation in order to prevent an accident as well as traffic delay. This is equivalent to our radio making a decision and changing to a channel that is free or has less noise in order to avoid a communication collision with others.

As shown in Figure 1, the process of the proposed system starts when the sensor nodes send channel observation information to the coordinator. The coordinator then maps the information from all sensor nodes into a single grid, called here the channel observation map, which summarizes the current state of the channels. In the next step, the learning and decision making engine provides an action according to the input information from the channel observation map. The action together with the channel observation map is then input to the classification process, which offers a positive or negative response. This response (a system performance measurement) is treated as a reward for the chosen action and is sent back to the learning and decision making engine for optimization.

Figure 1. Intelligent channel assignment flow for wireless sensor networks

A. Channels Mapping

The channel mapping process occurs at the network coordinator and can be seen in Figure 2. The coordinator maps the channel observation information from all sensor nodes and quantizes it into eight levels for each of the 16 channels.

Figure 2. Channel mapping process
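As an illustration of this mapping step, the following sketch (a minimal example, not the authors' implementation) aggregates per-node channel observations into a single 16-channel map quantized to eight levels. The normalized [0, 1] interference readings, the per-channel maximum aggregation, and all function and variable names are assumptions made for the example.

```python
import numpy as np

NUM_CHANNELS = 16  # IEEE 802.15.4 channels in the 2.4 GHz band
NUM_LEVELS = 8     # quantization levels of the channel observation map


def build_observation_map(node_reports):
    """Combine per-node channel observations into one quantized map.

    node_reports: iterable of length-16 arrays holding interference
    readings normalized to [0, 1] (assumed representation).
    Returns 16 integer levels in the range 0..7.
    """
    reports = np.asarray(list(node_reports), dtype=float)
    # Keep the worst (largest) reading per channel so that a channel
    # that is busy anywhere in the network is flagged as busy.
    combined = reports.max(axis=0)
    # Quantize into eight discrete levels per channel.
    return np.minimum((combined * NUM_LEVELS).astype(int), NUM_LEVELS - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reports = [rng.random(NUM_CHANNELS) for _ in range(5)]  # five sensor nodes
    print(build_observation_map(reports))
```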

B. Learning and Decision Making in Channel Assignment

Table 1 illustrates the internal process of the learning and decision making engine with the Policy Gradient being utilized.

TABLE I. CLIENT COMPUTER USING POLICY GRADIENT
Input: N = number of states (number of channels), M = number of actions (left-down, right-down), γ = discounted reward

1. While the system is running do
2.   Get the observation from the coordinator node
3.   Automatically create the transition probability table of the sixteen-state problem based on the actions
4.   Automatically create the reward table based on the channel level selection
5.   Calculate the expected reward map from the policy parameters θ1, θ2, θ3, ...
6.   Obtain the Policy Gradient
7.   Run the learning process to optimize the policy
   end
1. Set the initial channel
2. Get the action from the optimized policy
3. Get the appropriate channel
4. Send a control message to the coordinator to change to the new frequency

In general, there are two types of decision making process: the deterministic approach and the stochastic approach. The deterministic approach offers a predictable result, while the stochastic approach is suitable for a non-static environment, in which the result cannot be predicted accurately. However, the stochastic approach can express the probability of such a result by relating the distributed parameters together. In Reinforcement Learning, most problems are discrete problems, which are a type of stochastic problem, such as the spectrum usage of the ISM frequency band. Table 1 explains the process for computing the solution, in this case the suitable channel for the WSN to utilize. When the channel environment is observed, steps 3 and 4 determine the system environment limitations and possible criteria, which are used in step 5 for probability estimation. Step 5 calculates all possible solutions based on a Markov process, which analyses the parameters' behavior for future prediction by adapting the policy parameters θ1, θ2, θ3, ... (the policies), which are non-stationary. The end result of this step is the plot of the discounted distribution of the Markov process (the expected reward in this work) based on the specified problem and policy. Note that the state (x) is the current channel usage status (16 possible states). The action (a) is the choice of changing to a channel that has less interference. Based on the channel observation mapping, the channel changing direction can be either to the left or to the right hand side depending on the next state. The reward (r) is the result of changing channel, and its value depends on whether the new channel has less or more interference than the current one. Steps 6 and 7 find the resulting parameters based on the formulas shown in the next section. After that, the learning process obtains the appropriate policy that best fits the environment. The system then sets the initial channel as the current channel that the sensor network is using, and calculates the next appropriate channel based on the obtained policy.
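To make the steps of Table 1 concrete, the sketch below restates them as a simple REINFORCE-style loop over the sixteen channel states and two channel-shift actions. The softmax policy parameterization, the reward defined as the negative interference level of the channel entered, the learning rate, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

NUM_STATES, NUM_ACTIONS = 16, 2      # 16 channels; actions: shift down / shift up
GAMMA, LEARNING_RATE = 0.9, 0.1

rng = np.random.default_rng(1)
theta = np.zeros((NUM_STATES, NUM_ACTIONS))  # policy parameters


def policy(state):
    """Softmax action probabilities for the current channel state."""
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()


def step(state, action, channel_levels):
    """Action 0 shifts one channel down, action 1 shifts one channel up."""
    next_state = (state - 1) % NUM_STATES if action == 0 else (state + 1) % NUM_STATES
    reward = -float(channel_levels[next_state])  # less interference -> higher reward
    return next_state, reward


def run_episode(start_state, channel_levels, horizon=20):
    """Collect one episode and apply a REINFORCE-style update to theta."""
    global theta
    state, grads, rewards = start_state, [], []
    for _ in range(horizon):
        probs = policy(state)
        action = rng.choice(NUM_ACTIONS, p=probs)
        grad = np.zeros_like(theta)          # grad of log pi(a|s) for a softmax policy
        grad[state] = -probs
        grad[state, action] += 1.0
        grads.append(grad)
        state, reward = step(state, action, channel_levels)
        rewards.append(reward)
    discounted_return = sum(GAMMA ** t * r for t, r in enumerate(rewards))
    theta += LEARNING_RATE * discounted_return * sum(grads)


if __name__ == "__main__":
    channel_levels = rng.integers(0, 8, size=NUM_STATES)  # quantized observation map
    current_channel = 0
    for _ in range(200):                                   # learning loop
        run_episode(current_channel, channel_levels)
    best_action = int(np.argmax(policy(current_channel)))
    print("suggested channel shift:", "down" if best_action == 0 else "up")
```

In practice, the reward returned by the classification process of Figure 1 would replace the synthetic interference levels used in this sketch.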

III. OVERVIEW OF THE PROPOSED LEARNING TECHNIQUES

The term cognitive radio (CR) has been used to refer to radio devices that are capable of learning and adapting to their environment [6]. However, only in recent years has there been a growing interest in applying machine learning to Cognitive Radio Networks (CRNs) [7]. Reinforcement learning is probably the most general framework suitable for a learning problem such as channel assignment. Nevertheless, most of the methods proposed in the reinforcement learning community are not applicable to problems with more than three or four degrees of freedom and/or cannot deal with parameterized policies [5]. Policy gradient methods, on the other hand, are applicable to high-dimensional systems such as channel assignment.

A. General Assumption

Following [8], we assume the probability distribution shown in Equation (1). It characterizes the probability of reaching the new state $x_{t+1}$ when a specific action $a_t$ is taken in state $x_t$ at the current time step $t$:

$x_{t+1} \sim p(x_{t+1} \mid x_t, a_t)$.   (1)

Furthermore, based on policy gradient methods, all actions of the agent are generated by a probabilistic policy

$a_t \sim \pi_\theta(a_t \mid x_t)$,   (2)

where the index $\theta$ denotes the parameters of the policy. In addition, the value function and the state-action value function are given by the Bellman equations

$V^\pi(x) = \sum_{a} \pi_\theta(a \mid x) \left[ r(x,a) + \gamma \sum_{x'} p(x' \mid x,a)\, V^\pi(x') \right]$,   (3)

$Q^\pi(x,a) = r(x,a) + \gamma \sum_{x'} p(x' \mid x,a)\, V^\pi(x')$,   (4)

where $\gamma$ denotes the discount factor for problems with an unlimited horizon. The goal of policy gradient methods is to optimize the expectation of the cumulated reward function

$J(\theta) = \frac{1}{\sum_{t=0}^{H} \gamma^{t}} \, E\left\{ \sum_{t=0}^{H} \gamma^{t} r_t \right\}$.   (5)

The term $\sum_{t=0}^{H} \gamma^{t}$ denotes a normalization factor that ensures the reward weights sum to one. Note that we assume a value of $\gamma = 1$ for all further equations and experiments of this work. An alternative representation of Equation 5 computes this expectation as the integral over the state distribution $d^\pi(x)$ and the integral over all actions of the policy:

$J(\theta) = \int_{X} d^\pi(x) \int_{A} \pi_\theta(a \mid x)\, r(x,a)\, da\, dx$.   (6)

The next subsection uses this equation to derive the policy gradient for the parameters $\theta$.

B. Gradient Estimation

Referring to [8], the likelihood-ratio gradient estimate is

$g \approx \left\langle \left( \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid x_t) \right) \left( \sum_{t=0}^{H} \gamma^{t} r_t \right) \right\rangle$.   (7)

Next, we can derive the episodic REINFORCE gradient estimate as

$g_{RF} \approx \left\langle \left( \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid x_t) \right) \left( \sum_{t=0}^{H} \gamma^{t} r_t - b \right) \right\rangle$,   (8)

where $b$ is an estimate of the optimal baseline. Finally, the GPOMDP gradient estimate is

$g_{GP} \approx \left\langle \sum_{t=0}^{H} \left( \sum_{k=0}^{t} \nabla_\theta \log \pi_\theta(a_k \mid x_k) \right) \left( \gamma^{t} r_t - b_t \right) \right\rangle$,   (9)

where $b_t$ is a time-step dependent baseline.

Figure 3. Map of True Policy Gradient

Figure 4. Map of Episodic Reinforce
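To make the difference between the estimators above concrete, the sketch below computes both the episodic REINFORCE and the GPOMDP gradient estimates from a batch of sampled episodes. Each episode is assumed to be a pair of per-step score terms ∇_θ log π_θ(a_t|x_t) and rewards; for simplicity the baselines are taken as mean returns rather than the optimal baselines of [5], and all function and variable names are illustrative.

```python
import numpy as np


def reinforce_gradient(episodes, gamma=1.0):
    """Episodic REINFORCE: (sum of score terms) * (return - baseline)."""
    returns = [sum(gamma ** t * r for t, r in enumerate(rews)) for _, rews in episodes]
    baseline = np.mean(returns)                       # simplified constant baseline
    grad = np.zeros_like(episodes[0][0][0])
    for (scores, _), ret in zip(episodes, returns):
        grad += sum(scores) * (ret - baseline)
    return grad / len(episodes)


def gpomdp_gradient(episodes, gamma=1.0):
    """GPOMDP: each reward is paired only with score terms of earlier steps."""
    horizon = len(episodes[0][1])                     # assumes equal-length episodes
    # per-time-step baseline: mean discounted reward over episodes (simplified)
    baselines = np.mean([[gamma ** t * r for t, r in enumerate(rews)]
                         for _, rews in episodes], axis=0)
    grad = np.zeros_like(episodes[0][0][0])
    for scores, rews in episodes:
        running = np.zeros_like(grad)
        for t in range(horizon):
            running += scores[t]                      # sum_{k<=t} grad log pi
            grad += running * (gamma ** t * rews[t] - baselines[t])
    return grad / len(episodes)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, horizon, n_ep = 4, 10, 50
    # each episode: (list of per-step score vectors, list of per-step rewards)
    episodes = [([rng.normal(size=dim) for _ in range(horizon)],
                 list(rng.random(horizon))) for _ in range(n_ep)]
    print("REINFORCE:", reinforce_gradient(episodes))
    print("GPOMDP:   ", gpomdp_gradient(episodes))
```

Because each reward is paired only with the score terms of earlier steps, the GPOMDP estimate typically has lower variance than episodic REINFORCE for the same number of sampled episodes.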

TABLE II. SIMULATION PARAMETERS

Device Parameters           IEEE 802.15.4        Interference
Application: Packet Rate    1-32 (packet/sec)    19 (packet/sec)
MAC: Data Rate              250 kbps             1024 kbps
Radio: Output Power         -25 dBm              -10 dBm
Carrier Frequency           2410–2460 MHz        2410–2460 MHz

Figure 5. Map of GPOMDP

Figures 3, 4 and 5 illustrate the expected reward maps of True Policy Gradient, Episodic Reinforce, and GPOMDP, respectively. The background plot represents the reward map and the foreground plot represents the map of the policy gradient estimation. The blue plot in the background shows the calculated reward map of each learning approach. For each type of estimation, the initializing values are represented by red dots and the appropriate policy, i.e., the suitable estimate, is represented by green crosses. In the next section, the resulting accuracy is observed from the WSN test simulation environment.

IV. SIMULATION MODEL

To observe the performance of our proposed learning approaches for CWSN, a simulation model is developed. The simulation model for WSNs employed here is based on the open-source Castalia simulator [9], which is built on the OMNeT++ platform [10]. The main features of Castalia are a channel model based on empirically measured data and an advanced radio model based on real low-power radios. The WSN test scenario used here is given in Figure 6 and the simulation parameters are given in Table 2.

V. SIMULATION RESULTS

Figure 7 shows the simulation result in terms of estimation accuracy for the three proposed approaches. The accuracy is measured by evaluating the chance that the estimator offers an inaccurate or impossible solution. It can be seen that True Policy Gradient offers the highest estimation accuracy while GPOMDP provides the lowest value.

Figure 7. Accuracy comparison

In Figure 8, the simulation result showing the average packet end-to-end delay is given. The simulation is carried out with node-generated traffic of 7.6 kbps. The result for the system without the implementation of intelligent channel assignment is also provided as a reference. It can be seen that without intelligent control, the system exhibits the highest packet delay. From the result of this study, it can be concluded that True Policy Gradient offers the lowest delay as a result of the high estimation accuracy seen earlier in Figure 7.

Figure 6. Simulation scenario

Figure 8. Average packet end-to-end delay performance comparison

VI. CONCLUSIONS

This work investigates the performance of CWSN under three proposed machine-learning techniques under the umbrella of Reinforcement Learning (RL): GPOMDP, Episodic-Reinforcement, and True Policy Gradient. A performance comparison in terms of prediction accuracy and WSN system performance is provided. It can be concluded from this study that True Policy Gradient offers better prediction accuracy than the other two techniques. As also illustrated by the simulation results, the WSN implementing True Policy Gradient offers the lowest average packet end-to-end delay under an interference environment.

ACKNOWLEDGMENT

This work was supported by a scholarship from TOT Public Company Limited of Thailand.

REFERENCES

[1] D. Yang, Y. Xu, and M. Gidlund, "Coexistence of IEEE 802.15.4 based networks: A survey," in Proc. IECON 2010, Nov. 2010, pp. 2107–2113.
[2] A. Sikora and V. F. Groza, "Coexistence of IEEE 802.15.4 with other systems in the 2.4 GHz-ISM-band," in Proc. IEEE Instrumentation and Measurement Technology Conference, May 2005, pp. 1786–1791.
[3] C. Liang, "Interference characterization and mitigation in large-scale wireless sensor networks," doctoral dissertation, Johns Hopkins University, Baltimore, Maryland, 2011.
[4] J. Baxter and P. L. Bartlett, "Reinforcement learning in POMDP's via direct gradient ascent," in Proc. 17th International Conf. on Machine Learning, 2000, pp. 41–48.
[5] J. Peters and S. Schaal, "Reinforcement learning of motor skills with policy gradients," Neural Networks, vol. 21, no. 4, pp. 682–697, May 2008.
[6] J. Mitola and G. Q. Maguire, "Cognitive radio: making software radios more personal," IEEE Personal Communications, vol. 6, no. 4, pp. 13–18, Aug. 1999.
[7] M. Bkassiny, Y. Li, and S. K. Jayaweera, "A survey on machine-learning techniques in cognitive radios," IEEE Communications Surveys & Tutorials, vol. 15, no. 3, pp. 1136–1159, Third Quarter 2013.
[8] A. Witsch, "Applying Policy Gradient Reinforcement Learning to Optimise Robot Behaviours," 2010.
[9] Castalia Wireless Sensor Network Simulator. [Online]. Available: https://castalia.forge.nicta.com.au/index.php/en/. [Accessed: Mar. 2015].
[10] OMNeT++ Discrete Event Simulator. [Online]. Available: http://www.omnetpp.org. [Accessed: 25-Feb-2015].