
Massively Parallel Neural Signal Processing on a Many-Core Platform

Although the ensemble empirical mode decomposition (EEMD) method and Hilbert-Huang transform (HHT) offer an unrivaled opportunity to understand neural signals, the EEMD algorithm's complexity and neural signals' massive size have hampered EEMD application. However, a new approach using a many-core platform has proven both efficient and effective for massively parallel neural signal processing.

The past decade has brought remarkable advances in the processing of neural signals—that is, neural activity signatures1 such as electroencephalogram (EEG), magnetoencephalography (MEG), electrocorticography (ECoG), magnetic resonance imaging (MRI), and functional MRI (fMRI). Neural signal analysis is vital in detecting, diagnosing, and treating brain disorders and related diseases.2

Neural signals are naturally nonlinear and nonstationary. Prior to the introduction of nonlinear methods, linear algorithms—such as short-time Fourier transform, Wigner-Ville distribution, and wavelet filtering—had been extensively used in spectral analysis of EEG recordings.3 A major shortfall of linear approaches is that neural signals' temporal patterns, such as instantaneous amplitude and phase or frequency, can't be accurately estimated.4

The recent advent of ensemble empirical mode decomposition (EEMD),2 in conjunction with the Hilbert-Huang transform (HHT),5 has revolutionized the study of neural signals, giving neuroscientists an unrivaled opportunity to understand the true physical and neurophysiological meanings of neural signals. However, the EEMD algorithm's complexity and neural signals' routinely massive size hamper EEMD application in neural science research and practice. Here, we present an approach to this problem that uses a many-core platform (a GPU for home entertainment) that enables massively parallel neural signal processing. By analyzing a multichannel EEG of absence seizures, we demonstrate our approach's efficiency and effectiveness in identifying the seizure state.

1521-9615/11/$26.00 © 2011 IEEE. Copublished by the IEEE CS and the AIP.

Why EEMD?

Dan Chen China University of Geosciences

Lizhe Wang Indiana University

Gaoxiang Ouyang and Xiaoli Li Yanshan University, Qinhuangdao, China


This article has been peer-reviewed.

EEMD breaks down a complicated nonlinear and nonstationary signal without a basis function—such as sine or wavelet functions—into a collection of oscillatory intrinsic mode functions (IMFs) that are embedded in the signal and reflect the signal's true physical characteristics. To do this, EEMD adds white noise to the signal under investigation to form an ensemble of noise-added signals and takes full advantage of white noise's statistical characteristics to perturb the signal in its true solution neighborhood, and to cancel itself out after serving its purpose. EEMD eliminates mode mixing in all cases automatically and excels in resistance to noise, which makes EEMD far superior to its linear counterparts in processing neural signals.

On the other hand, as we describe later, the EEMD algorithm demands repetitively processing many trials of the noise-added signal in an ensemble, and the outputs' precision depends on the number of trials, which should be large enough to neutralize the effect of the added noise. Furthermore, as experimental techniques for recording neural activities have been advancing quickly—rapidly increasing the number of channels (electrodes) and sampling frequencies—neural signals' density and spatial scale have been increasing exponentially. Neural signal analysis with EEMD is thus a problem that's not only data intensive, but also highly compute intensive. For example, it typically takes a few weeks to process one hour's EEG data using a high-end desktop computer.

Neural signals have high temporal and spatial parallelism, while the EEMD algorithm has built-in parallelism. As researchers have noted, parallelism is the future of computing.6 The key to effectively improving a compute- and data-intensive application's performance lies in appropriately using the potential parallelism, in various forms, embedded in the target application. It's a natural choice to develop a parallel EEMD on a high-performance computing (HPC) architecture7 to fulfill the performance requirement by using the multidimensional parallelism in an application of neural signal analysis.
For more information on this, see the sidebar, "Related Work in HPC Research and GPGPU Computing."

Many-core computing platforms—GPUs—provide massively parallel computing capability.8 We developed parallel EEMD approaches that successfully adopt general-purpose computing on the GPU (GPGPU), a state-of-the-art technique empowered by a many-core computing platform. Our approaches properly parallelize the EEMD application into numerous concurrent fine-grained subtasks, each of which handles a small piece of neural signal (time series). The proposed approaches have been developed using Nvidia's CUDA, a general-purpose parallel computing architecture. The intensive computation involved in the whole problem space is then mapped onto numerous threads evenly executed by the many parallel GPU cores.

Here, we present a case study using EEMD and HHT to analyze an EEG dataset generated by an epilepsy patient with absence seizures. We first decomposed the EEG dataset using EEMD into a number of IMFs. We then calculated the Hilbert-Huang spectral entropy (HHSE). These efforts aimed to explore the feasibility of distinguishing seizure states based on the theory of the HHT and EEMD approach. Experimental results show that HHSE is highly suited for detecting seizure states. After that, we carried out a series of performance studies on the GPGPU-enabled parallel approaches. For a one-hour EEG segment, the parallel approach can complete analysis in less than an hour; the original serial counterpart requires hundreds of hours to handle the same task.

Neural Signal Structure

The neural signals we investigated are a massive EEG dataset continuously captured—using multiple simultaneous electrodes—from an epilepsy patient over several hours. As a result, the obtained EEG dataset persists as a multichannel time series. The volume of the EEG dataset prior to analysis scales in three dimensions: the time (T), the number of channels (NC), and the sampling rate (SR). The total number of variables in the multichannel time series is T × NC × SR.

When processing the source EEG dataset with EEMD, we apply a sliding time window to each individual EEG time series, corresponding to one channel, to generate numerous short time series (epochs) covered by the windows. There's an overlap of a fixed length between two consecutive epochs. We write the ratio of the overlap's length against an epoch's length as ro. Clearly, the size of the data in the form of epochs "swells" to 1/(1 − ro) of the source EEG dataset. Obviously, the number of epochs that might be extracted from a source time series is proportional to T.

In the initial stage of an EEMD method, white noise is repeatedly added to each epoch to generate an "ensemble" of trials, where each trial represents an instance of the original epoch mixed with noise. A trial is the elementary data chunk computed by the EEMD algorithm. Typically, the number of trials (NT) in an ensemble should be in the hundreds. After initialization,

Related Work in HPC Research and GPGPU Computing

There are several interesting neuroscience research projects supported by high-performance computing; we describe some here and offer an overview of massively parallel computing with general-purpose GPUs.

Examples: HPC-Supported Neuroscience Research

The application of cutting-edge HPC techniques to address compute- and data-intensive neuroscience problems is still in its infancy. Existing work falls into two main categories.

The first category is the management and storage of massive neural data. The effective management of complex and massive experimental data remains a grand challenge.1 Researchers have attempted to improve the management of massive neuroimaging data by integrating database management systems (DBMS) with Grid computing and cluster-based computing.2 The approach centers on the DBMS, which both facilitates storing and sharing functional magnetic resonance imaging (fMRI) data and eases its analysis.

The second category is the analysis of massive neural data. Work by J. Adam Wilson and Justin C. Williams is closely related to ours.3 To process massive electrocorticographic (ECoG) signals, Wilson and Williams used GPGPU to parallelize the processing algorithms. Although they've obtained encouraging results, their approach focuses on accelerating matrix-matrix multiplication and the autoregressive method. The two target methods are generally used in research areas and disciplines that aren't specific to neural signal analysis. In this study, we aim to facilitate an efficient solution for core issues in neural signal analysis—first, to extract the signal's true physical characteristics, and second, to monitor the signal's evolving dynamics.

Massively Parallel Computing Using GPGPU

A modern GPU possesses single-instruction, multiple-data (SIMD) cores, high on-chip bandwidth, and explicit data transfers between fast local memories and external dynamic RAM. All of these establish the GPU as the go-to solution when coping with data- and compute-intensive problems. GPGPU has recently boomed with the enhanced programmability of GPUs. Relatively high abstraction levels have emerged, and Nvidia's Compute Unified Device Architecture (CUDA) is salient. The current CUDA fosters a software environment that allows programming on a GPU with a slightly extended version of C, so a CUDA-based application can execute on a GPU's many cores in a massively parallel fashion.

A CUDA-based application defines C functions as "kernels" to be explicitly executed on the GPU, or device. Usually, the kernels are invoked from the CPU ("host") application's main executable and are executed N times via N individual CUDA threads. CUDA threads are grouped into blocks, and threads in a block can cooperate by efficiently sharing data through a fast shared memory and by synchronizing their execution to coordinate memory access. Threads, memories, and synchronization are exposed to developers for fine-grained data and thread parallelism—such as dealing with a single float-type variable per thread. A developer should partition the problem into a set of fine-grained subtasks, which can be mapped to CUDA threads accordingly.

References

1. J.L. Teeters et al., "Data Sharing for Computational Neuroscience," Neuroinformatics, vol. 6, no. 1, 2008, pp. 47–55.
2. U. Hasson et al., "Improving the Analysis, Storage and Sharing of Neuroimaging Data Using Relational Databases and Distributed Computing," Neuroimage, vol. 39, no. 2, 2008, pp. 693–706.
3. J.A. Wilson and J.C. Williams, "Massively Parallel Signal Processing Using the Graphics Processing Unit for Real-Time Brain-Computer Interface Feature Extraction," Frontiers in Neuroengineering, vol. 2, no. 11, 2009; www.ncbi.nlm.nih.gov/pubmed/19636394.

and immediately before being fed to the EEMD algorithm, the data's size in the form of trials (see Figure 1a) explodes by hundreds of times over that of the source EEG dataset—that is, in terabytes. As Figure 1b shows, the formation of the initialized EEG dataset can be viewed as a three-level hierarchy. Each EEG data channel (time series) can be singled out from the whole multichannel dataset at the top level. At the middle level, many epochs can be generated from an EEG time series by applying a sliding time window. At the bottom level, an epoch spawns numerous trials by adding white noise to itself.

The order of processing different trials has no influence on the final results as long as trials rooted in the same source epoch are properly bundled. Temporal parallelism exists along the X dimension (time T), and spatial parallelism exists along the Y (data channels, NC) and Z (trials, NT) dimensions in the initialized dataset in the form of trials. As Figure 1b shows, from the top to the bottom levels, the grain of parallelism becomes exponentially finer. Because the number of data channels is fixed, the extremely fine-grained parallelism concentrates in the other two dimensions.
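The windowing-and-ensemble initialization above, and the claim that trial order doesn't matter, can be sketched in a few lines of Python (a toy, dependency-free illustration; the function names, the 40-sample window, and the noise amplitude are our own choices for the example, not values from the article):

```python
import random

def make_epochs(series, window, overlap):
    """Slide a window over one channel's series; consecutive epochs share
    `overlap` samples, so ro = overlap / window and the data "swell" by 1/(1 - ro)."""
    step = window - overlap
    return [series[i:i + window] for i in range(0, len(series) - window + 1, step)]

def make_trials(epoch, n_trials, noise_amp, seed=0):
    """Form an ensemble by adding independent white noise to the epoch NT times."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, noise_amp) for x in epoch] for _ in range(n_trials)]

def ensemble_mean(trials):
    """Average the ensemble column by column (the per-epoch averaging EEMD performs)."""
    return [sum(col) / len(trials) for col in zip(*trials)]

# Toy single-channel "EEG": 100 samples, window = 40, overlap = 30 (ro = 0.75).
series = [float(i % 7) for i in range(100)]
epochs = make_epochs(series, window=40, overlap=30)   # 7 epochs
trials = make_trials(epochs[0], n_trials=8, noise_amp=0.1)

# Averaging is commutative, so shuffling the trials in an ensemble leaves
# the ensemble mean unchanged (up to floating-point rounding).
shuffled = trials[:]
random.Random(1).shuffle(shuffled)
diff = max(abs(a - b) for a, b in zip(ensemble_mean(trials), ensemble_mean(shuffled)))
```

Because only the bundling by source epoch matters, a scheduler is free to run the trials in any order across cores.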

Hilbert-Huang Transform and EEMD

As Figure 2 shows, for a nonstationary signal x(t), the EEMD method can decompose the signal into a series of intrinsic mode functions (IMFs), imf_i (i = 1, 2, …, N), where N is the number of IMFs.5 The signal x(t) can be written

x(t) = Σ_{i=1}^{n−1} imf_i(t) + r_n(t).   (1)

Applying the Hilbert transform and the concept of Shannon entropy, we can construct the Hilbert-Huang spectral entropy (HHSE), which denotes a more accurate and nearly continuous distribution of the signal's energy. IMFs and HHSE precisely characterize the true physical nature of signal x(t). More in-depth discussions on EEMD and HHT are available elsewhere.2,5

Given the size of a time window and the overlap of two consecutive windows, the whole procedure for processing an epoch of signal in a time window, x(t), occurs in several steps. We first need to decompose a signal to obtain its true IMFs prior to calculating its HHSE; we do this in eight steps:

• Step 1. Calculate the number of IMFs in x(t): NI = log2(length(x(t))) − 1; set the amplitude of the white noise to be added to x(t); set the number of trials in the ensemble, NT; and set k = 1.
• Step 2. In processing the kth trial, set xk(t) = x(t) + nk(t), where nk(t) denotes the white noise; set i = 1; and initialize the residual signal ri(t) = xk(t).
• Step 3. After initiating the extraction of the ith IMF (in the kth trial), set j = 1 and hj−1(t) = ri(t); then calculate the local maxima hmax(t) and minima hmin(t) of hj−1(t).
• Step 4. Interpolate hmax(t) and hmin(t) using cubic spline interpolation to extract the upper and lower envelopes of hj−1(t); calculate mj−1(t), the mean of the upper and lower envelopes; then compute hj(t) = hj−1(t) − mj−1(t).
• Step 5. Check whether hj(t) is an IMF; if the termination requirement isn't satisfied, return to Step 3 with j = j + 1. Otherwise, the ith IMF in the kth trial is obtained, imfi[k](t) = hj(t); then set ri+1(t) = ri(t) − imfi[k](t) as the new residual signal for sifting the (i + 1)th possible IMF.
• Step 6. If ri+1(t) still has at least two extrema, return to Step 3 with i = i + 1; otherwise, the decomposition procedure on the kth trial stops. Thus the noise-added signal xk(t) can be decomposed as

xk(t) = Σ_{i=1}^{NI−1} imf_i[k](t) + r_NI(t),


Figure 1. The multidimensional electroencephalogram (EEG) dataset. (a) View of the EEG dataset at different levels, from data channel to trial. (b) The formation of the dataset in three dimensions: time, trials in an ensemble, and data channels.

where rNI(t) is the residual, or trend, of xk(t).

• Step 7. If k isn't greater than NT, return to Step 2 with k = k + 1; otherwise, the preset trials have been completed, and the ith IMF of x(t) is the average of the results over all the trials,

imf_i(t) = (Σ_{k=1}^{NT} imf_i[k](t)) / NT,

where i = 1, 2, …, NI.
• Step 8. Based on imfi(t), i = 1, 2, …, NI, we can eventually obtain the HHSE of x(t); we then return to Step 1 to process the next epoch, if any.

We've measured the overhead of the serial implementation. The instructions implementing the EEMD algorithm (Steps 2 through 6, known as an EEMD procedure) consume approximately 99.8 percent of the overall execution time. Hence, we trust the bottleneck is with these EEMD procedures and focus on them here, tackling the bottleneck by parallelizing the EEMD procedures.
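The sifting core of the procedure above (Steps 3 through 6) can be sketched for a single trial in dependency-free Python. Piecewise-linear envelopes and a fixed sifting count stand in for the article's cubic splines and termination test, and the noise addition of Step 2 and averaging of Step 7 are omitted, so the IMFs are only qualitative; the exact-reconstruction property, however, holds by construction:

```python
import math

def local_extrema(h):
    """Step 3: indices of interior local maxima and minima."""
    maxima = [i for i in range(1, len(h) - 1) if h[i - 1] < h[i] >= h[i + 1]]
    minima = [i for i in range(1, len(h) - 1) if h[i - 1] > h[i] <= h[i + 1]]
    return maxima, minima

def envelope(idx, values, n):
    """Piecewise-linear envelope through the extrema, padded with the epoch's
    end points (the article uses cubic spline interpolation here)."""
    xs = [0] + idx + [n - 1]
    ys = [values[0]] + [values[i] for i in idx] + [values[-1]]
    env, j = [], 0
    for x in range(n):
        while xs[j + 1] < x:
            j += 1
        t = (x - xs[j]) / (xs[j + 1] - xs[j])
        env.append(ys[j] + t * (ys[j + 1] - ys[j]))
    return env

def sift_one_imf(r, n_sift=8):
    """Steps 3-5: repeatedly subtract the mean envelope from the residual."""
    h, n = r[:], len(r)
    for _ in range(n_sift):
        maxima, minima = local_extrema(h)
        if not maxima or not minima:
            break
        upper, lower = envelope(maxima, h, n), envelope(minima, h, n)
        h = [h[i] - 0.5 * (upper[i] + lower[i]) for i in range(n)]
    return h

def emd(x, max_imfs):
    """Step 6's outer loop: peel off IMFs until the residual has too few extrema."""
    imfs, r = [], x[:]
    for _ in range(max_imfs):
        maxima, minima = local_extrema(r)
        if len(maxima) + len(minima) < 2:   # Step 6 stopping rule
            break
        imf = sift_one_imf(r)
        imfs.append(imf)
        r = [r[i] - imf[i] for i in range(len(r))]
    return imfs, r

# Two superposed tones; the decomposition always satisfies x = sum(IMFs) + residual.
x = [math.sin(0.3 * i) + 0.5 * math.sin(0.05 * i) for i in range(256)]
imfs, resid = emd(x, max_imfs=4)
recon = [sum(col) + resid[i] for i, col in enumerate(zip(*imfs))]
err = max(abs(a - b) for a, b in zip(x, recon))
```

Each IMF is subtracted from the running residual, so summing the extracted IMFs with the final residual reproduces the input exactly, mirroring the decomposition in Step 6.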

Figure 2. Decomposition of an EEG epoch into averaged intrinsic mode functions (IMFs). (a) An epoch of the original EEG time series. (b) The first, (c) second, (d) third, and (e) fourth IMFs.

Multilevel Parallelism: Calculating HHSE on EEMD

Calculating HHSE on EEMD manifests a perfect paradigm of single instruction, multiple data (SIMD).

The Parallelism Concept

The performance improvement of parallel applications is determined by the parallelism, in various forms, within the problem, the software environment, and the hardware platform (see Figure 3). We use the term degree of parallelism (DoP) to quantify parallelism on these fronts. A problem's DoP means the number of portions of it that can be concurrently solved and executed with the same results as those obtained in a serial manner. A problem's parallelism might be spatial, temporal, or organizational, and might concern data or computations. A software environment's DoP denotes the number of concurrent tasks that the environment supports at the development stage (for example, you might set fewer than 500 threads in a thread pool on Windows Vista; see http://msdn.microsoft.com/en-us/magazine/cc163327.aspx).

A hardware platform’s DoP equals the number of processing units, which operate in parallel in a collaborative manner (for example, the number of CPUs/cores in a parallel computer, the number of processing nodes in a distributed system, and the number of cores in a GPU). The grain of a problem determines the type of parallel/distributed systems suitable for solving the problem. For example, the Internetbased Grid computing platform typically handles coarse-grained problems, while high-end computer clusters are more suited to fine-grained ones. Multicore computers and many-core platforms (such as GPUs) often apply to the problems of extremely fine grains. The EEMD application’s parallelism can be characterized in at least two levels (see Figures 4a and 4b). At the epoch level, the EEMD procedure for an epoch of time series is treated as a whole. The data in an epoch are input to the same EEMD procedure individually, and the outputs from any EEMD procedure instance won’t be consumed by another. The DoP equals the number of epochs, which increases linearly with the EEG dataset’s size. At the trial level, a trial (a noise-added epoch) is treated as a whole. Given a number of trials per EEMD instance, each trial’s decomposition is always performed independently from the others. An original epoch’s IMFs (see Figure 2) are inferred only after computing the ensemble of trials. The DoP equals the number of trials per ensemble × the number of epochs. A task at the trial level handles only a short time series (such as approximately 1,000 points). The instructions in steps 2 through 7 remain the same when processing any trials. Even at the trial level, we might be able to further divide the computation into finer subtasks to exhaustively utilize a many-core platform, such as a GPU, with extremely fine-grained parallelism.

Massively Parallel EEMD on GPGPU

We developed two alternative massively parallel EEMD implementations on top of CUDA, adapting to trial-level parallelism.

GPGPU-Aided EEMD

We first considered SG-EEMD, which uses a single GPU to support parallel EEMD. Figure 5a shows the execution task graph as a directed acyclic graph (DAG). The job of processing the original signal (or any portion of the signal) can be mapped to a number (denoted by NW in the figure) of concurrent epoch-level tasks. Moreover, each epoch-level task can be further decomposed and mapped to a number (denoted by NT in the figure) of trial-level subtasks, which are executed by an ensemble of CUDA threads.

Figure 5b shows how an individual epoch-level task can be handled by SG-EEMD. Given that an epoch (Epoch[i]) consists of 1,024 data points, we produced a scheme that maps this task to CUDA threads in the following steps:

• Step 1. After initialization, the host invokes 1,024 CUDA threads, evenly grouped as 8 blocks, to calculate the standard deviation (STD) of the 1,024 points of the original epoch. The EEMD procedure then makes NT (no less than 100) duplicates of Epoch[i]/STD and adds white noise to each duplicate to generate an individual Trial[j] (j = 1, 2, …, NT) in an ensemble.
• Step 2. To deal with the NT trials, the host initializes NT thread blocks with 128 threads operating in each block. Each thread calculates only the extrema of 8 data points. Obviously, we exploit a parallelism here that's even finer than the trial level.
• Step 3. The host invokes NT blocks with only one thread per block to handle the calculation of envelope lines, which is compute intensive. This means NT cores on the GPU are assigned to these subtasks, each core computing one trial's envelope lines individually.
• Step 4. The host once again initializes NT thread blocks with 128 threads operating in each block. There are 128 × NT CUDA threads working together to extract an IMF of the NT trials.

Steps 2 through 4 loop 10 times until all IMFs of the ensemble's trials are completely extracted. We have an adaptive CUDA thread setting. The cubic interpolation method (Steps 3 and 4) used to calculate envelope lines is complex and hard to parallelize, and is encapsulated in a single CUDA thread.
Our design minimizes the number of threads in a thread block9 while creating as many thread blocks as possible when executing threads of this type. Hence, these threads can make the most of the CUDA cores for intensive computations and occupy as much fast shared memory (manipulated by each block) as possible to buffer the intermediate data generated during their executions. On the contrary, we maximize the number of threads in each block for simple operations.
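The stage-by-stage thread assignments above can be tabulated as a small host-side helper (Python used as pseudocode for the launch configuration; the stage names and dictionary layout are ours, and the CUDA kernels themselves aren't reproduced):

```python
def launch_config(stage, epoch_len=1024, n_trials=100):
    """Grid/block sizes for the four stages of one epoch-level SG-EEMD task."""
    if stage == "std":          # Step 1: 1,024 threads in 8 blocks compute the STD
        return {"blocks": 8, "threads_per_block": epoch_len // 8}
    if stage == "extrema":      # Step 2: 128 threads/block, 8 data points per thread
        return {"blocks": n_trials, "threads_per_block": 128,
                "points_per_thread": epoch_len // 128}
    if stage == "envelope":     # Step 3: serial cubic interpolation, 1 thread/block
        return {"blocks": n_trials, "threads_per_block": 1}
    if stage == "extract_imf":  # Step 4: 128 x n_trials threads extract one IMF
        return {"blocks": n_trials, "threads_per_block": 128}
    raise ValueError(f"unknown stage: {stage}")

total_imf_threads = (launch_config("extract_imf")["blocks"]
                     * launch_config("extract_imf")["threads_per_block"])  # 12,800
```

The helper makes the adaptive setting explicit: many single-thread blocks for the serial envelope stage, and wide 128-thread blocks for the simple data-parallel stages.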


Figure 3. The relationship between the degree of parallelism (DoP) in the problem, software environment, and hardware platform. At the programming stage, the DoPs of the problem and the underlying parallel/distributed computing libraries bound each other. At runtime, the DoPs of the parallel threads/processes and the underlying parallel/distributed hardware platform will also bound each other.


Figure 4. Ensemble empirical mode decomposition (EEMD) application parallelism. Possible parallelization includes (a) the epoch level and (b) the trial level.



Figure 5. Executing general-purpose GPU (GPGPU)-aided EEMD. (a) We can view the execution of a G-EEMD application as processing a number of epochs; processing an epoch consists of many trial threads. (b) Our design adopts a flexible scheme of assigning CUDA threads in four different stages. HHSE stands for the Hilbert-Huang Spectral Entropy.

Using Multiple GPUs

SG-EEMD adopts an execution mode of "single CPU thread, many CUDA (GPU) threads"—that is, a single CPU thread performs the executions from the point at which the neural data are fetched from storage to the point at which the HHSE is calculated. However, the overhead incurred by fetching data from storage is significant. Also, we want to explore the new EEMD approach's scalability and the potential for using multiple GPU devices. We therefore developed another GPGPU-aided EEMD—MG-EEMD—that has a "multiple CPU threads, many CUDA threads" design to hide the data-access overhead and exploit multiple GPU devices when possible. To achieve this, we used CPU-based multithreading and hybrid parallelization as follows.

First, to achieve CPU-based multithreading, GPU_Num + 1 concurrent CPU threads are created at the host end, where GPU_Num denotes the number of GPUs embedded in the host. An individual CPU thread is dedicated to loading data from the file system to the host's main memory and initializing the epochs to be consumed by CUDA threads.

To achieve hybrid parallelization, the other GPU_Num CPU threads manage the same number of GPUs (if there's more than one). Each of these CPU threads corresponds to an individual GPU device and assigns an exclusive channel of EEG data to the designated GPU device for processing. In other words, we enable parallelization at the data-channel level. In the context of each CPU thread, the corresponding GPU still processes the EEG time series at the trial level. Such a hybrid parallelization scheme maps the most coarse-grained tasks to GPU devices and the finest subtasks to CUDA threads.
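The "multiple CPU threads, many CUDA threads" organization can be skeletonized with standard Python threading (a sketch only: the queue-based hand-off and function names are ours, and `process_epoch` stands in for the per-GPU, trial-level CUDA work):

```python
import queue
import threading

def mg_eemd_pipeline(channels, n_gpus, process_epoch):
    """One loader thread feeds a work queue; n_gpus worker threads (one per
    GPU) each take whole channels, mirroring the data-channel-level split."""
    work = queue.Queue()
    results, lock = {}, threading.Lock()

    def loader():
        # Dedicated CPU thread: fetch data and initialize epochs, hiding I/O latency.
        for name, epochs in channels.items():
            work.put((name, epochs))
        for _ in range(n_gpus):
            work.put(None)          # one shutdown sentinel per worker

    def worker(gpu_id):
        # Each of these CPU threads drives one GPU device.
        while True:
            item = work.get()
            if item is None:
                break
            name, epochs = item
            out = [process_epoch(e) for e in epochs]  # trial-level work on the "GPU"
            with lock:
                results[name] = out

    threads = [threading.Thread(target=loader)]
    threads += [threading.Thread(target=worker, args=(g,)) for g in range(n_gpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Toy run: six "channels" of three epochs each, processed by two "GPUs".
chans = {f"ch{i}": [[float(i), float(i + 1)]] * 3 for i in range(6)}
res = mg_eemd_pipeline(chans, n_gpus=2, process_epoch=sum)
```

Because the loader runs concurrently with the workers, data access overlaps computation, which is the same overhead-hiding effect MG-EEMD exploits.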

Experiments and Discussion

We implemented the GPGPU-aided EEMDs using CUDA 2.3 and analyzed an EEG dataset with absence seizures (see Figure 6). Our testbed is a desktop computer equipped with an Nvidia GeForce GTX 295 video card, a last-generation card designed for gaming that has dual GPU chips with 240 CUDA cores and 896 Mbytes of Graphics Double Data Rate, version 3 (GDDR3) memory per GPU (for more information, see www.nvidia.com/object/product_geforce_gtx_295_us.html).

Analysis Using G-EEMD and HHSE

The EEG dataset was obtained from a patient with absence epilepsy. The EEG signals were recorded from the scalp surface at F3, F4, C3, C4, O1, and O2 in the International 10–20 System using six electrodes. The onset and offset of absence


seizures were determined by the start and end times of spike-wave discharges (SWDs); the SWDs were defined as large-amplitude rhythmic 3–4 Hz discharges with typical SW morphology lasting less than a second. Our study investigated absence seizure by analyzing the EEG dataset via calculation of HHSE. Figure 7 shows a segment of six-channel, 80-second-long EEG recordings covering the period of an absence seizure (the onset is at 63 seconds). The parameter settings for the EEMD algorithm are epoch = 1,000, overlap = 750, white noise amplitude for an EEMD trial = 0.1 × standard deviation of an epoch, and 100 trials per ensemble. Figure 7b shows the Hilbert-Huang spectrum of the first channel, emphasizing two EEG segments: a seizure-free segment (from 0–4 s) and a seizure segment (from 64–68 s). We checked the results against those obtained using alternative EEMD implementations, and the differences are negligible.

The EEG recordings demonstrate a high-amplitude firing pattern during the seizure state. Figures 8a and 8b show the corresponding Hilbert-Huang spectrum and the marginal spectrum. The difference in the spectrum distributions of the two EEG segments is apparent. In the absence


Figure 6. EEG signals collection and analysis with GPGPU-aided EEMD. The testbed desktop computer is equipped with an Nvidia GeForce GTX 295 video card, which has dual GPU chips with 240 CUDA cores and 896 Mbytes of Graphics Double Data Rate, version 3 (GDDR3) per GPU. HHSE stands for the Hilbert-Huang Spectral Entropy; IMF stands for intrinsic mode functions.


Figure 7. Segment of the six-channel EEG recordings of an absence seizure. The recordings are 80 seconds long; the seizure’s onset is at 63 seconds.


Figure 8. Results from seizure-free and seizure EEG epochs. The EEG epochs during (a) seizure-free and (b) seizure intervals. The Hilbert-Huang spectrum during (c) seizure-free and (d) seizure intervals. The marginal spectrum during (e) seizure-free and (f) seizure intervals.

seizure state, a significant increase of 3–4 Hz activity can be observed.

The low-frequency band of 0–2 Hz and the frequency band over 30 Hz were removed to filter out artifacts and highlight the middle- and low-frequency brain activities, which make strong contributions during seizures. We then calculated HHSE within the band of 2–30 Hz. Figure 9 plots the HHSE time courses for all six channels of the EEG recordings. Clearly, the entropy values decrease significantly during the seizure state.
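As a rough illustration of why the entropy drops, here's a dependency-free sketch of a spectral entropy restricted to a 2–30 bin band. It uses a naive Fourier power spectrum as a stand-in for the Hilbert-Huang marginal spectrum the HHSE is actually built on, so the numbers are not HHSE values, but the qualitative effect is the same: a rhythmic, SWD-like oscillation concentrates its energy in few bins and scores low, while a broadband epoch scores high:

```python
import cmath
import math

def power_spectrum(x):
    """Naive O(n^2) DFT power spectrum; fine for a short illustrative epoch."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 for k in range(n // 2)]

def spectral_entropy(x, band):
    """Shannon entropy of the power distribution restricted to `band` (bin indices)."""
    p = power_spectrum(x)
    sel = [p[k] for k in band]
    total = sum(sel)
    probs = [v / total for v in sel if v > 0]
    return -sum(q * math.log2(q) for q in probs)

n = 128
band = range(2, 31)
# A single rhythmic component (all energy in one bin) vs. a chirp that sweeps
# its instantaneous frequency across the band (energy spread over many bins).
rhythmic = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
chirp = [math.sin(2 * math.pi * (3 + 0.1 * t) * t / n) for t in range(n)]
low, high = spectral_entropy(rhythmic, band), spectral_entropy(chirp, band)
```

The ordering `low < high` mirrors the observed behavior: entropy collapses when a dominant rhythmic discharge takes over during a seizure.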

Table 1. Results of performance evaluation on GPGPU-aided EEMDs.

EEG data                                        | Number of epochs | Number of trials | SG-EEMD* execution time | MG-EEMD** execution time
One-channel segment                             | 2,688            | 268,800          | 583 seconds             | —
Six-channel segment                             | 16,128           | 1,612,800        | 3,473 seconds           | 1,488 seconds
Entire six-channel dataset (24-hour recordings) | 530,820          | 53,082,000      | 31.81 hours             | 13.62 hours

* An SG-EEMD is a single GPU that supports a parallel EEMD.
** An MG-EEMD uses multiple GPUs to support a parallel EEMD.


Figure 9. Analysis of the EEG dataset with absence seizure. (a) Time courses of the Hilbert-Huang spectral entropy of the six channels of EEG recordings. (b) The line shows the mean entropy value over all six channels, and the bars represent the standard deviation.

Performance Evaluation

We began the performance test by analyzing the first 2,628.65 seconds of one channel of EEG data. Table 1 shows the execution times to process one- and six-channel EEG data segments and then the whole EEG dataset (24-hour recordings) using the GPGPU-aided EEMDs. The SG-EEMD needs only 583 seconds to process the one-channel EEG data of 2,628.65 seconds. When dealing with six-channel EEG data of the same length using one GPU, it can barely support the requirement of real-time processing. The MG-EEMD further reduces the execution time to 42 percent of the SG-EEMD's. The result is reasonable because MG-EEMD can exploit two GPUs, and the MG-EEMD's design takes advantage of the minimal synchronization overhead among multiple GPUs on the same host—in contrast to traditional parallel and distributed computing platforms.10 EEG data loading also incurs significant overhead in the SG-EEMD application.

MG-EEMD uses an additional CPU thread to load data, which operates simultaneously with the CUDA threads; it therefore effectively hides the overhead of accessing massive EEG data. Our results also show that the relative performance of the alternative EEMD implementations remains almost the same as the neural data under investigation scale up. In the third test, we compared the performance of SG-EEMD and MG-EEMD on the whole EEG dataset, and the results were exactly as expected.
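The loading-thread idea can be illustrated with a generic producer-consumer pipeline: a dedicated loader thread fills a bounded queue while the consumer processes earlier chunks, so I/O latency overlaps with computation. This is a minimal sketch of the pattern only; the `process` callback stands in for the GPU EEMD kernel launches, and none of the names here come from the authors' code.

```python
import queue
import threading

def run_pipeline(chunks, process, n_slots=2):
    """Overlap data loading with processing via a bounded queue.

    chunks  : iterable yielding raw EEG segments (the I/O side).
    process : callable applied to each segment (stands in for GPU work).
    n_slots : queue depth; 2 gives classic double buffering.
    """
    q = queue.Queue(maxsize=n_slots)
    sentinel = object()  # marks end of input

    def loader():
        for chunk in chunks:
            q.put(chunk)          # blocks when all slots are full
        q.put(sentinel)

    threading.Thread(target=loader, daemon=True).start()

    results = []
    while True:
        item = q.get()            # typically ready immediately: I/O ran ahead
        if item is sentinel:
            break
        results.append(process(item))
    return results
```

With `n_slots=2`, the loader stays at most one chunk ahead of the consumer, bounding memory use while still hiding the load latency behind each processing step.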

The EEMD application is data intensive and has profound parallelism at multiple levels. The many-core computing platform provides massive computing capability for data- and compute-intensive applications. The GPGPU-aided approaches exhibit excellent scalability when handling massive data from one channel to multiple channels. This is enabled by

• a proper mapping of the application's data and computation parallelism to the parallelism of CUDA threads and GPUs, and
• the many-core architecture's extensive scalability.

We can envision further performance improvements to EEMD applications using a host with more GPUs or a platform with more GPUs linked via a high-speed network such as InfiniBand. We anticipate that, in the near future, a hybrid computing technology combining GPGPU and cutting-edge parallel and distributed computing will prevail in analyzing massive neural signals with rapidly increasing density and channel counts.

Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (grants 90820016 and 60804036), the Hundred University Talent of Creative Research Excellence Programme (Hebei, China), the Programme of High-Resolution Earth Observing System (China), and the Fundamental Research Funds for the Central Universities (CUGL100608, CUG, Wuhan). Dan Chen gratefully acknowledges support from the Birmingham-Warwick Science City Research Alliance.

Computing in Science & Engineering

References

1. S. Sanei and J. Chambers, EEG Signal Processing, John Wiley & Sons, 2007, pp. 35–50.
2. Z.H. Wu and N.E. Huang, "Ensemble Empirical Mode Decomposition: A Noise-Assisted Data Analysis Method," Advances in Adaptive Data Analysis, vol. 1, no. 1, 2009, pp. 1–41.
3. I.J. Rampil and R.S. Matteo, "Changes in EEG Spectral Edge Frequency Correlate with the Hemodynamic Response to Laryngoscopy and Intubation," Anesthesiology, vol. 67, no. 1, 1987, pp. 139–142.
4. X.L. Li et al., "Analysis of Depth of Anaesthesia with Hilbert-Huang Spectral Entropy," Clinical Neurophysiology, vol. 119, no. 11, 2008, pp. 2465–2475.
5. N.E. Huang et al., "The Empirical Mode Decomposition and the Hilbert Spectrum for Nonlinear and Non-Stationary Time Series Analysis," Proc. Royal Society London A, vol. 454, 1998, pp. 903–995.
6. L. Wang et al., "Cloud Computing: A Perspective Study," New Generation Computing, vol. 28, no. 2, 2010, pp. 137–146.
7. V. Kindratenko, G.K. Thiruvathukal, and S. Gottlieb, "High-Performance Computing Applications on Novel Architectures," Computing in Science & Eng., vol. 10, no. 6, 2008, pp. 13–15.
8. O. Schenk, M. Christena, and H. Burkharta, "Algorithmic Performance Studies on Graphics Processing Units," J. Parallel and Distributed Computing, vol. 68, no. 10, 2008, pp. 1360–1369.
9. J.A. Wilson and J.C. Williams, "Massively Parallel Signal Processing Using the Graphics Processing Unit for Real-Time Brain-Computer Interface Feature Extraction," Frontiers in Neuroengineering, vol. 2, no. 11, 2009; www.ncbi.nlm.nih.gov/pubmed/19636394.
10. D. Chen et al., "Synchronization in Federation Community Networks," J. Parallel and Distributed Computing, vol. 70, no. 2, 2010, pp. 144–159.

Lizhe Wang is a principal research engineer at Indiana University's Pervasive Technology Institute. His research interests include high-performance computing, Grid computing and e-Science, and cloud computing. Wang has a Doctor of Engineering degree from the University of Karlsruhe, Germany. Contact him at [email protected].

Gaoxiang Ouyang is a postdoctoral research fellow in the Institute of Electrical Engineering at Yanshan University, Qinhuangdao, China. His research interests include EEG analysis, neural models, and dynamic systems. Ouyang has a PhD in manufacturing engineering from the City University of Hong Kong. Contact him at [email protected].

Xiaoli Li is a professor and head of the Department of Automation at the Institute of Electrical Engineering, Yanshan University, Qinhuangdao, China. His research interests include computational intelligence, signal processing and data analysis, monitoring, biosignal analysis, and manufacturing systems. Li has a PhD in mechanical engineering from Harbin Institute of Technology, China. Contact him at [email protected].

Dan Chen is a professor in the School of Computer Science, China University of Geosciences, Wuhan. His research interests include computer-based modeling and simulation, high-performance computing, and neuroinformatics. Chen has a PhD in computer engineering from Nanyang Technological University, Singapore. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

November/December 2011 
