NASA Conference Publication 3315

1995 Science Information Management and Data Compression Workshop

Edited by James C. Tilton
Goddard Space Flight Center
Greenbelt, Maryland

Proceedings of a workshop sponsored by the National Aeronautics and Space Administration in cooperation with the Washington/Northern Virginia Chapter of the Institute of Electrical and Electronics Engineers Geoscience and Remote Sensing Society, and held at NASA Goddard Space Flight Center, Greenbelt, Maryland, October 26-27, 1995

National Aeronautics and Space Administration
Goddard Space Flight Center
Greenbelt, Maryland
1995

This publication is available from the NASA Center for AeroSpace Information, 800 Elkridge Landing Road, Linthicum Heights, MD 21090-2934, (301) 621-0390.

FOREWORD

The Science Information Management and Data Compression Workshop was held on October 26-27, 1995, at the NASA Goddard Space Flight Center, Greenbelt, Maryland. This NASA Conference Publication serves as the proceedings for the workshop. The workshop was organized by the Information Sciences Technology Branch, Space Data and Computing Division of the NASA Goddard Space Flight Center, and was supported by the Office of Mission to Planet Earth, NASA Headquarters. The workshop was held in cooperation with the Washington/Northern Virginia Chapter of the Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote Sensing Society.

The goal of the Science Information Management and Data Compression Workshop was to explore promising computational approaches for handling the collection, ingestion, archival and retrieval of large quantities of data in future Earth and space science missions. It consisted of fourteen presentations covering a range of information management and data compression approaches that are being or have been integrated into actual or prototypical Earth or space science data information systems, or that hold promise for such an application. The papers were selected from those submitted in response to a widely distributed Call for Papers, and were presented in three sessions. Discussion was encouraged by scheduling ample time for each paper.

The workshop was organized by James C. Tilton and Robert F. Cromp of the NASA Goddard Space Flight Center.

Workshop Organizers

James C. Tilton, Mail Code 935, NASA GSFC, Greenbelt, MD 20771; phone: (301) 286-9510; FAX: (301) 286-1776; Internet: tilton@chrpisis.gsfc.nasa.gov

Robert F. Cromp, Mail Code 935, NASA GSFC, Greenbelt, MD 20771; phone: (301) 286-4351; FAX: (301) 286-1776; Internet: cromp@sauquoit.gsfc.nasa.gov

CONTENTS

A Spatially Adaptive Spectral Re-ordering Technique for Lossless Coding of Hyper-spectral Images .... 1
Nasir D. Memon, Northern Illinois University, DeKalb, IL, USA; and Nikolas Galatsanos, Illinois Institute of Technology, Chicago, IL, USA

A Comparison of Model-Based VQ Compression with other VQ Approaches .... 13
Mareboyana Manohar, Bowie State University, Bowie, MD, USA; and James C. Tilton, NASA Goddard Space Flight Center, Greenbelt, MD, USA

Remotely Sensed Image Compression Based on Wavelet Transform .... 23
Seong W. Kim, Heung K. Lee, Soon D. Choi, Korea Advanced Institute of Science and Technology, Taejon, Korea; and Kyung S. Kim, Korea Institute of Science and Technology, Taejon, Korea

A Packet Data Compressor .... 35
Mitchell R. Grunes, Allied Signal Technical Services Corp.; and Junho Choi, Naval Research Laboratory, Washington, DC, USA

Reduction and Coding of Synthetic Aperture Radar Data with Fourier Transforms .... 45
David G. Tilley, Elkridge, MD, USA

Selecting a General-Purpose Data Compression Algorithm .... 55
Gary Jason Mathews, NASA Goddard Space Flight Center, Greenbelt, MD, USA

Compression Research on the REINAS Project .... 65
Glen Langdon, Jr., Alex Pang, Craig M. Wittenbrink, Eric Rosen, William Macy, Bruce R. Montague, Carles Pi-Sunyer, Jim Spring, David Kulp, Dean Long, Bryan Mealy and Patrick Mantey, Baskin Center for Computer Engineering & Information Sciences, University of California-Santa Cruz, Santa Cruz, CA, USA

KRESKA: A Compression System for Small and Very Large Images .... 75
Krystyna W. Ohnesorge and Rene Sennhauser, MultiMedia Laboratory, University of Zürich, Zürich, Switzerland

Alternate Physical Formats for Storing Data in HDF .... 91
Mike Folk and Quincey Koziol, National Center for Supercomputing Applications, University of Illinois, Champaign-Urbana, IL, USA

Context-Based Retrieval of Remote Sensed Images Using a Feature-Based Approach .... 103
Asha Vellaikal and Son Dao, Information Sciences Lab, Hughes Research Laboratories, Malibu, CA, USA; and C.-C. Jay Kuo, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA, USA

Machine Learning for a Toolkit for Image Mining .... 115
Richard L. Delanoy, Massachusetts Institute of Technology / Lincoln Laboratory, Lexington, MA, USA

Dissemination of Compressed Satellite Imagery within the Navy SPAWAR Central Site Product Display Environment .... 123
Oleg Kiselyov and Paul Fisher, Computer and Information Sciences, Inc., Denton, TX, USA

Data Management and Scientific Integration within the Atmospheric Radiation Measurement Program .... 131
Deborah K. Gracio, Larry D. Hatfield, Kenneth R. Yates and Jimmy W. Voyles, Pacific Northwest Laboratory, Richland, WA, USA; Joyce L. Tichler, Brookhaven National Laboratory, Upton, NY, USA; Richard T. Cederwall, Mark J. Laufersweiler, Marion J. Leach, Lawrence Livermore National Laboratory, Livermore, CA, USA; and Paul Singley, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Object-Oriented Structures Supporting Remote Sensing Databases .... 139
Keith Wichmann, Global Science and Technology, Greenbelt, MD, USA; and Robert F. Cromp, NASA Goddard Space Flight Center, Greenbelt, MD, USA

N96-15447

A Spatially Adaptive Spectral Re-Ordering Technique for Lossless Coding of Hyper-spectral Images

Nasir D. Memon
Department of Computer Science, Northern Illinois University, DeKalb, IL 60115
[email protected]

Nikolas Galatsanos
Electrical Engineering, Illinois Institute of Technology, Chicago, IL
[email protected]

Abstract

In this paper, we propose a new approach, applicable to lossless compression of hyper-spectral images, that alleviates some limitations of linear prediction as applied to this problem. According to this approach, an adaptive re-ordering of the spectral components of each pixel is performed prior to prediction and encoding. This re-ordering adaptively exploits, on a pixel-by-pixel basis, the presence of inter-band correlations for prediction. Furthermore, the proposed approach takes advantage of spatial correlations, and does not introduce any coding overhead to transmit the order of the spectral bands. This is accomplished by using the assumption that two spatially adjacent pixels are expected to have similar spectral relationships. We thus have a simple technique to exploit spectral and spatial correlations in hyper-spectral data sets, leading to compression performance improvements as compared to our previously reported techniques for lossless compression. We also look at some simple error modeling techniques for further exploiting any structure that remains in the prediction residuals prior to entropy coding.

1 Introduction

Recent years have seen a tremendous increase in the generation, storage and transmission of multi-spectral image data. With the advent of high spectral resolution images, also known as hyper-spectral images, the need for efficient compression techniques has become increasingly important. This is due to the high data rates that result from the large number of spectral bands in such images. For example, the High-Resolution Imaging Spectrometer (HIRIS), scheduled to be launched by NASA as part of the Earth Observing System (EOS), is designed to acquire images in 192 spectral bands at 30m spatial resolution, resulting in data rates of more than 200 Megabits per second.

Although much work has been done on lossy compression of multi-spectral images, algorithms that exploit the special nature of multi-spectral images for lossless compression have started emerging only recently [6, 13]. More specifically, as noticed by researchers in other image processing areas, the spectral correlation in such images cannot be assumed to be stationary in nature. In other words, the correlation between bands i and j cannot be assumed to be the same as that between bands i+k and j+k (see for example [4]). Linear predictive techniques based on a stationarity assumption, although applied with reasonable success to 2-D images, will therefore yield sub-optimal compression performance on hyper-spectral data.

Furthermore, the most popular techniques used for lossy compression, such as transform coding and vector quantization [13, 12, 5], are not directly applicable to lossless compression. Very powerful context-based techniques are known for lossless image coding (for a survey of lossless compression techniques see [11]), but they cannot be extended to capture spectral correlations in a straight-forward manner. This is because of the large alphabet size of hyper-spectral data (10 to 16 bits per component): the number of contexts needed to effectively capture both spatial and spectral correlations gets to be prohibitively large, and the implementation of such a context-based scheme quickly becomes infeasible.

Given the non-stationary nature of spectral correlations, the question arises as to which band(s) are best for predicting a given band. It has been observed in [6] that bands which are far apart spectrally can be highly correlated. Hence, to effectively exploit spectral redundancies, the spectral bands can be re-ordered prior to prediction. In [14], the problem of computing an optimal ordering of the spectral bands of an entire image, together with the corresponding optimal prediction coefficients, was formulated in a graph theoretic setting, admitting an O(n^2) solution for an n-band image. Although an optimal ordering leads to significant improvements in prediction, the technique is two-pass: the ordering and the prediction coefficients are first computed in one pass over the entire image and then transmitted as overhead. A further limitation of this approach is the assumption that the optimal ordering and predictors do not vary spatially; in fact, spectral relationships change spatially, depending on the composition of the objects being imaged. Taking this non-stationarity into account, Rao et al. [12] re-order the bands and compute optimal predictors on a block-by-block basis, the block size being kept reasonably large in order to keep the overhead needed to transmit the ordering small. They observed that the prediction obtained with an entire image treated as one block leads to poorer performance as compared to that obtained with a smaller block. Furthermore, they also found that using the previous band for prediction gives results that are very close to those obtained after computing an optimal ordering.

In this paper, we propose a new approach, applicable to lossless compression of hyper-spectral images. According to this approach, an adaptive re-ordering of the spectral components of each pixel is performed prior to prediction and encoding. This re-ordering exploits, on a pixel-by-pixel basis, the presence of inter-band correlations for prediction. Furthermore, the proposed approach takes advantage of spatial correlations, and does not introduce any coding overhead to transmit the order of the spectral bands. This is accomplished by using the assumption that two spatially adjacent pixels are expected to have similar spectral relationships. We thus have a simple technique to exploit spectral and spatial correlations in hyper-spectral data sets, leading to significant performance improvements as compared to our previously reported techniques for lossless compression [10].

2 Spatially Adaptive Spectral Modeling

Let X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) be two spatially adjacent pixels of an n band multi-spectral image, and assume that we scan the image so that X, one of the causal neighbors of the current pixel Y, has already been encoded. Our assumption is that two spatially adjacent pixels have similar spectral relationships. Hence, if some 'model' that captures the spectral relationships at X can be computed from the values of X alone, the same model can be used to estimate (in a manner to be specified below) the spectral components of Y, and only the resulting prediction error at Y need be encoded. Since X has already been transmitted, the decoder can derive the identical model parameters, and no side information is required. After Y is transmitted, we update the model and proceed to the next pixel. Continuing in this manner, we obtain an adaptive technique that effectively exploits both spectral and spatial correlations, processes the image in a single pass, and would thus be well suited for real-time applications.

The question that remains is which spatially 'adjacent' pixel X should be used to estimate the model parameters for the current pixel Y, and how this choice can be made optimal over the entire image. An answer to these questions can be obtained by formulating the problem in a graph theoretic setting, in a manner similar to [3, 7]. Specifically, we construct a weighted directed graph by assigning a vertex to each pixel of the image and adding a pair of directed edges between every two 'adjacent' pixels X and Y, where adjacency is based on any commonly employed notion, such as the 4-neighborhood or 8-neighborhood of a pixel. The weight on the edge going from X to Y represents the 'coding cost' incurred to encode Y when using the model parameters derived from X, and the weight on the edge going from Y to X represents the cost of encoding X from the model of Y, and vice-versa for every adjacent pair. Now, it is easy to see that a minimum weight spanning tree of this weighted directed graph specifies an optimal order in which to adaptively model the entire image: we begin at the root and traverse the tree in depth-first order, encoding each pixel by using the model parameters of its parent. It is well known that a minimum weight spanning tree of a directed graph can be found in O(V log E + E) operations for sparse graphs [8], where V is the number of vertices and E is the number of edges; in our case E = 8V, so the graph is essentially sparse.

However, the computational and memory resources needed to construct this graph and compute its minimum spanning tree would be prohibitive for many applications, since computing the edge weights alone requires a computation equivalent to one or more additional passes over the image data. Furthermore, different choices of the cost function and of the modeling scheme lead to different optimizations: the cost assigned to an edge can be the entropy of the resulting residuals, or, when the residuals are treated as iid zero-mean Laplacian or Gaussian values, a simple function of their magnitudes [9]. The optimal ordering obtained from the minimum spanning tree nevertheless remains of interest, because it gives us a performance bound against which the simple heuristic described in the next section can be compared.
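To make the graph construction concrete, the following is a minimal sketch, not the authors' implementation. In particular, the 'model derived from X' is taken here to be simply the permutation that sorts X (the heuristic developed in the next section is of this kind), and the coding cost is the Laplacian-motivated sum of absolute residuals; both are stand-in assumptions.

```python
import numpy as np

def edge_weight(x, y):
    """Coding cost of encoding pixel y given adjacent pixel x.

    Illustrative stand-in for a 'model derived from X': the model is
    the permutation that sorts x, and the cost is the sum of absolute
    first differences of y re-ordered by that permutation.
    """
    order = np.argsort(x)          # model parameters derived from x alone
    y_re = y[order].astype(float)  # y re-ordered by x's permutation
    resid = np.diff(y_re)          # first-order prediction residuals
    return np.abs(resid).sum()

def build_graph(image):
    """Directed, weighted edges between 8-neighbour pixels of an
    (rows, cols, bands) image, as (from, to, weight) triples."""
    rows, cols, _ = image.shape
    edges = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                           (0, 1), (1, -1), (1, 0), (1, 1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w = edge_weight(image[r, c], image[rr, cc])
                    edges.append(((r, c), (rr, cc), w))
    # A minimum-weight spanning tree over these edges specifies the
    # optimal parent (and hence encoding order) for every pixel.
    return edges
```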

3 Spectral Re-ordering

Recent studies [12] have shown that the spectral intensity values of spatially homogeneous regions of a multi-spectral image are often related across bands by a simple multiplicative model; that is, I[i,j,r] = α_rs I[i,j,s] + β_rs, where I[i,j,k] is the intensity value at spatial location (i,j) in band k. It is also known that there can exist bands which are spectrally far apart and yet highly correlated, so a band selection scheme that considers only adjacent bands can miss the bands which yield the best prediction. Keeping these two facts in mind, we have experimented with a simple re-ordering and modeling scheme that is based on spectral intensity values.

For a given pixel X = (x1, x2, ..., xn) from an n band image, we first re-order its spectral components into ascending order. In other words, let σ be a permutation on n points that sorts the components of X in ascending order, and let the resulting vector be X̃ = (x̃1, x̃2, ..., x̃n), where

x̃1 = x_σ(1) ≤ x̃2 = x_σ(2) ≤ ... ≤ x̃n = x_σ(n).

It is clear that the re-ordered vector X̃ is more amenable to approximation by a linear model than the original vector X. Furthermore, the sorting step brings closer together spectral components that may be highly correlated although widely separated in the original vector. For example, if we wish to approximate X̃ with a piecewise linear function f(·) at a specified error tolerance, we would expect to do so with fewer segments than are needed for X itself. In recent work, Bhaskaran et al. [2] have used piecewise linear approximations for waveform coding at a prescribed error tolerance; the number of segments of such an approximation provides a natural cost of encoding a vector, and under this cost the re-ordered vector is the cheaper of the two. Note also that the sorting step requires only n log n operations for an n band pixel.

Transmitting the permutation σ itself would, of course, introduce exactly the kind of coding overhead we wish to avoid. Instead, we exploit the assumption that spatially adjacent pixels have similar spectral relationships: the current pixel is re-ordered using the permutation derived from a spatially adjacent pixel (or from a simple function of neighboring pixels) that has already been encoded. Since this permutation is available to the decoder, no side information is needed; and if the spectral relationships of the two pixels are indeed similar, the result is almost as well ordered as X̃ itself. In figure 1 we plot the spectral components of a random pixel selected from an AVIRIS image, in the original band order, in ascending order, and spectrally re-ordered by a neighboring permutation. It can be clearly seen from this figure that the re-ordered vectors can be better approximated by a linear model using a given number of parameters as compared to the original vector.
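As a minimal sketch of the definitions above (the helper names are ours, not the paper's), the re-ordered vector and its sorting permutation can be computed as follows; the inverse operation is what a decoder would use to restore the original band order.

```python
import numpy as np

def reorder(x):
    """Return sigma and the ascending re-ordered vector x_tilde,
    with x_tilde[i] = x[sigma[i]] as in the definition above."""
    sigma = np.argsort(x, kind="stable")
    return sigma, x[sigma]

def restore(x_tilde, sigma):
    """Invert the re-ordering: place x_tilde[i] back at band sigma[i]."""
    x = np.empty_like(x_tilde)
    x[sigma] = x_tilde
    return x
```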

Figure 1: The spectral components of a pixel before and after re-ordering. (The plot shows intensity versus band number for the curves 'original_order', 'ascending_order' and 'spectrally_re-ordered'.)

Table 1: Entropy of prediction residuals with different prediction schemes.

Prediction Scheme | Residual Entropy
Best JPEG | 6.13
Spectral differencing | 8.19
Spectral re-ordering and differencing | 6.27
Spectral re-ordering, adaptive first order prediction | 5.36
Spectral re-ordering, adaptive second order prediction | 5.48
Prediction trees | 5.52
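The residual entropies reported in Table 1 are zero-order (memoryless) entropies of the prediction residuals. For reference, a minimal computation of this quantity:

```python
import numpy as np

def zero_order_entropy(residuals):
    """Zero-order entropy, in bits per sample, of an integer residual
    array -- the quantity tabulated in Table 1."""
    _, counts = np.unique(residuals, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```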

Under the proposed re-ordering, the cost function used for weighting the edges of the graph described in Section 2 takes a simple form. If we assume that the components of the prediction error E are zero-mean and Laplacian, then the cost of encoding E is minimized by minimizing the sum of the absolute values of the components of E [9]. The weight of the directed edge going from X to Y is then simply the sum of absolute prediction residuals obtained when the components of Y are re-ordered with the permutation derived from X and then predicted.

Implementations of the proposed lossless compression technique are currently in progress. In Table 1 we show some preliminary results obtained with a portion of an AVIRIS image of the Cuprite mining district, Cuprite, Nevada. The image contained 200 lines, 200 samples and 224 spectral bands. These results were obtained with simple prediction schemes, with the intention of testing the validity of the re-ordering approach. The entry 'Best JPEG' refers to the zero-order entropy of prediction residuals obtained by applying each of the eight predictors of lossless JPEG on each band of the image and selecting the best. Here no spectral information is being taken into account; the prediction is purely spatial. The entry 'Spectral differencing' gives the zero-order entropy of prediction residuals after taking differences among pixels in adjacent spectral bands but the same spatial location. In other words, a fixed, first-order spectral predictor is employed with coefficient 1. The third entry, 'Spectral re-ordering and differencing', gives the zero-order entropy of prediction errors after first re-ordering the spectral components by using the permutation that sorts the average of the horizontal, vertical and left-diagonal neighbors, and then taking differences as above. It is seen that a considerable reduction in entropy is attained by simply re-ordering the spectral components prior to differencing.

The fourth entry, 'Spectral re-ordering, adaptive first order prediction', refers to a technique that re-orders the spectral components as above but uses the value α_i · x̃_i to predict the value of the next spectral component x̃_{i+1} of the re-ordered pixel X̃ = (x̃1, ..., x̃n). Here α_i was computed as ỹ_{i+1}/ỹ_i from the vector Ỹ = (ỹ1, ..., ỹn) obtained by averaging the intensity values of the vertical, horizontal and left-diagonal neighbors and arranging the averages in ascending order. The entry 'Spectral re-ordering, adaptive second order prediction' refers to a technique identical to the one above, but using the previous two components of the re-ordered vector to estimate the value of the next component.

Finally, in order to make comparisons with previously reported techniques, we also list the entropy obtained with 'prediction trees', a spatially adaptive prediction technique described in earlier work [10]. We see that the new schemes give improvements over the more sophisticated prediction trees, despite a considerable reduction in computational complexity and memory usage. Also, since the proposed technique processes the data in band-interleaved order, an image can be coded as pixels are acquired, provided an adjacent band is available, and the scheme is simple enough to implement in hardware for real-time applications. Further work on improved re-ordering and prediction schemes is in progress.
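Our reading of the scheme behind the fourth table entry can be summarized in the following sketch (an illustration, not the authors' code): the permutation comes from the average of the three causal neighbors, so the decoder can recompute it without side information, and the coefficients α_i are ratios of consecutive components of that sorted average. The seeding of the first component is our own choice.

```python
import numpy as np

def predict_pixel(x, north, west, northwest):
    """Sketch of 'spectral re-ordering, adaptive first order
    prediction': returns the residuals to be entropy coded.

    x                      -- current pixel, shape (n_bands,)
    north, west, northwest -- its causal neighbours, same shape
    """
    # Average of the vertical, horizontal and left-diagonal neighbours;
    # its sorting permutation is reproducible at the decoder, so the
    # ordering costs no side information.
    y = (north + west + northwest) / 3.0
    sigma = np.argsort(y, kind="stable")

    x_re = x[sigma].astype(float)   # re-ordered current pixel
    y_re = y[sigma]                 # re-ordered neighbour average

    pred = np.empty_like(x_re)
    pred[0] = y_re[0]               # seed for the first component (our choice)
    # alpha_i = y_re[i+1] / y_re[i] scales the previous re-ordered
    # component of x to predict the next one.
    alpha = y_re[1:] / np.maximum(y_re[:-1], 1e-9)
    pred[1:] = alpha * x_re[:-1]
    return x_re - pred
```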

4 Modeling Prediction Errors

If the residual image that remains after prediction is treated as an iid source, it can be coded efficiently using any of the standard variable length coding techniques, such as Huffman coding or arithmetic coding. Unfortunately, prediction residuals usually violate the iid assumption: ample structure generally remains in the residual image even after applying the most sophisticated prediction schemes. Hence, in order to encode the residual image efficiently, we need a model that captures whatever structure remains after prediction.

Errors of prediction schemes are well described by a composite source model, a notion first introduced by Berger [1]. In composite source modeling, the data is treated as an interleaved sequence of subsequences generated by multiple sub-sources, each sub-source having its own associated probability distribution. Error modeling schemes that use a composite source model consist of two components:

1. A family of probability mass functions, one for each sub-source, used to encode the symbols attributed to that sub-source. The probability mass functions could be static, with the same set used for all images. Alternatively, they could be constructed for each image by scanning it in a first pass and then sent to the decoder as side information (an off-line, forward adaptive technique); or they could be adapted to the statistics of the image as it is being encoded, with the decoder performing the identical adaptation, as in adaptive Huffman or adaptive arithmetic coding (an online, backward adaptive technique).

2. A switching scheme that indicates which sub-source is active, and hence which probability mass function is to be used to encode the current symbol. The switching decisions, similarly, can either be provided to the decoder explicitly as side information (forward adaptive), or be based only on information that is available to both encoder and decoder from the previously encoded data, so that the switching is integrated in an implicit manner (backward adaptive).

We constructed a composite source model consisting of four sub-sources, where each sub-source represents a different level of activity in the region of the image currently being encoded. Recall that in the active regions of an image the magnitudes of the prediction residuals will be large, while in the quasi-constant regions the residuals will be smaller. The four different measures that we studied for determining the activity level of the region containing the pixel currently being encoded were:

- PPS: the magnitude of the prediction residual at the spectral component that was encoded just before the current spectral component of the current pixel.
- PC: the magnitude of the prediction residual at the same spectral location but at a spatially adjacent pixel.
- AME: the average magnitude of the prediction residuals over the entire set of spectral components of a spatially adjacent pixel.
- COMB: a combination of the above, formed from the magnitudes of the two components PPS and PC.

Each measure was quantized into one of the four sub-sources by a set of thresholds T_s. The entropies obtained with the different switching schemes are listed in Table 2.

Table 2: Entropy of prediction residuals with different sub-source switching schemes.

Switching scheme | Entropy
SSE | 5.19
COMB | 5.18
AME | 5.31

Encoding a given image with four different probability mass functions rather than a single one incurs almost no increase in computational cost, because the adaptive statistics are updated in the same manner; it does, however, result in a larger memory requirement. Our results indicate that even this simple form of error modeling gives a further reduction in the entropy of the prediction residuals, and work on more sophisticated error modeling schemes is in progress.
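A minimal sketch of a backward-adaptive composite source coder along the lines described above (the thresholds and the seeding of the frequency tables are our placeholders, not values from the paper):

```python
import numpy as np

def switch_context(pps, pc, thresholds=(2, 8, 32)):
    """Backward-adaptive sub-source selection using the COMB measure
    as we reconstruct it: the activity estimate combines the residual
    magnitudes PPS (previous spectral component) and PC (same component
    of an adjacent pixel); thresholds T_s quantize it into one of four
    sub-sources. The threshold values here are placeholders."""
    activity = (abs(pps) + abs(pc)) / 2.0
    for k, t in enumerate(thresholds):
        if activity < t:
            return k
    return len(thresholds)          # most active sub-source

class SubSource:
    """One adaptive frequency table per sub-source. Encoder and decoder
    apply the same update after each symbol, so no side information is
    transmitted (the backward adaptive case)."""
    def __init__(self, nsyms):
        self.freq = np.ones(nsyms)  # Laplace-smoothed symbol counts
    def prob(self, s):
        return self.freq[s] / self.freq.sum()
    def update(self, s):
        self.freq[s] += 1
```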



Context-Based Retrieval of Remote Sensed Images Using a Feature-Based Approach

Asha Vellaikal†, Son Dao† and C.-C. Jay Kuo‡

†Information Sciences Lab, Hughes Research Laboratories, Malibu Canyon Road, Malibu, CA 90265; {asha,son}@isl.hrl.hac.com

‡Signal and Image Processing Institute, Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089-2564; [email protected]

ABSTRACT

A feature-based representation for a remote sensed image database is described in this work. Feature primitives are extracted from the images by clustering spatially local pixels, and the resulting cluster features model the types of queries which are expected to occur frequently in the database. Preliminary experimental results show that the feature-based representation provides a very promising tool for content-based access.

1 INTRODUCTION

Efficient management of image data has been an important research issue in the database area for the past few years, and it is becoming increasingly important in the context of remote sensing, since there will be a tremendous increase in the size of available image data sets over the coming years. To give an example, NASA's Earth Observing System (EOS) will generate data at rates exceeding terabytes everyday when it becomes operational. Apart from the storage issue, it is important that the stored data allow the user very flexible means of access.

Current image database systems allow users to retrieve images based on metadata such as the sensor with which an image was taken, the geographical location it covers, the time at which it was taken, and so on. This does not give the user the ability to query the database based on image content. The ability to retrieve images based on content, which has been exploited in other fields such as multimedia, will greatly enhance the utility of remote sensed image databases. To give an example of a content-based query in remote sensing, consider the following query given by Dozier [2]:

• Find a Landsat Thematic Mapper image of the Bubbs Creek area which is cloud-free, from the time closest to April 1, 1992, and give the snow-covered area and the snow grain-size.

Queries of this type cannot be answered from metadata alone; the information needed is extracted from the image data itself, based on its content. Most of the work on content-based retrieval has been done in the multimedia domain; in the remote sensing domain, related work on feature-based indexing has been reported by Barros et al. [5], whose ideas we extend in this paper.

In this work, we describe a feature-based representation of remote sensed images which can aid content-based retrieval. Feature primitives are first extracted from the images by clustering spatially local pixels in multispectral space, and these extracted primitives are then processed to form an index to the image database. Indexing is done on these features, which makes it easier to process a wide variety of content-based queries, including queries which were not anticipated at the time of insertion.

This paper is organized as follows. Section 2 gives the background concepts regarding "content" in multispectral remote sensed images. The feature-based representation is explained in Section 3, and query processing using the representation is dealt with in Section 4. Experimental results are given in Section 5.

2 BACKGROUNDS

2.1 Feature-Based Representation

Compared with standard attribute data, images are unstructured in the sense that the basic unit of an image is the pixel, which is just a measurement of intensity as recorded by a sensor. The "content" of an image is a derived quality, which is typically determined by the application domain, and the semantic meaning of content will differ widely depending on that domain; human perception of content, for example, relies on characteristics such as color and texture [3]. Since it is practically impossible to develop a representation which encodes the entire information content of an image, it is important to develop representations that support the types of queries which will often be posed to the database, using content-descriptors which are computationally efficient to derive and to compare.

There are several characteristics which good image representations should ideally possess. The representation should be able to respond to a wide variety of queries, should be compact so as to minimize storage space, and should consist of models which make use of appropriate knowledge of the application domain. An approach in which the content-descriptors are derived features of the image can be termed a feature-based representation.

Figure 1 shows the schematic of a feature-based representation model. The image data are distinguished into physical images and logical images. The physical images are the actual images, which can be stored in secondary or tertiary storage devices, while the logical images are derived from the physical images and are much more compact, so they can be stored in primary memory. All access methods and content models are applied to the logical images. Knowledge bases or class models can be applied to the logical images to calculate information regarding the content of the corresponding physical images. These derived content attributes can be dynamic, in that they are calculated in response to a query and need not always be stored in the system. As opposed to such derived content attributes, the logical image data are static, as they are a permanent representation of

the physical image data.

Figure 1: Schematic of a feature-based representation model. (Knowledge bases / class models operate on the feature-based logical representation, which is derived from the actual images held in primary, secondary or tertiary storage; content attributes derived at query time are dynamic.)

2.2 "Content" in Multispectral Remote Sensed Images

Remote sensed images are mainly multispectral in nature: the intensity values of an image are gathered at a number of discrete wavelength bands of the electromagnetic spectrum. The majority of the bands utilized for earth resource monitoring lie in the visible/infrared wavelength range, with some in the microwave range. The different materials present on the earth's surface interact differently with radiation of different wavelengths, and this property is useful in distinguishing between the ground covers of interest. The intensity values in each band are quantized into discrete brightness levels, the number of levels depending on the radiometric resolution of the sensor. Figure 2(a) shows the wavelength ranges of four of the bands of the NOAA Advanced Very High Resolution Radiometer (AVHRR), the imagery used in our experiments; the original images had a radiometric resolution of 12 bits, which were calibrated to 8 bits. In addition, after georeferencing with a map projection, the geographical area covered by an image can be specified by the latitude and longitude co-ordinates of its corner pixels (not shown), so that the location of each individual pixel is known.

Figure 2: (a) Details of the different AVHRR bands: Band 1, 0.58-0.68 microns; Band 2, 0.72-1.10 microns; Band 3, 3.55-3.93 microns; Band 4, 10.50-11.50 microns. (b) Scatter of land, snow and cloud pixels in the space of band intensity values, showing that pixels of the same class tend to cluster.

Two main classes of "content" by which remote sensed images are analyzed are as follows:

1. Classification into ground classes: Here the remote sensed images are classified so as to categorize each pixel of the image into one of a set of ground classes; for example, using the AVHRR image, each pixel can be classified as snow, land or cloud [6]. Classification makes use of the fact that pixels belonging to the same class tend to cluster together in a subspace of the multidimensional multispectral space, as shown in Figure 2(b); spatial features such as texture, shape and size can also be used to aid classification. Maximum likelihood classification is one method that can be employed. To explain it briefly, let x denote the vector of intensity values of a pixel in the different bands. Given K classes, the maximum likelihood decision rule assigns x to the class with the largest posterior probability:

x ∈ w_i if p(w_i|x) > p(w_j|x) for all j ≠ i,

where p(w_i|x) is the conditional probability of class w_i given the intensity vector x. It is common to represent the class-conditional distribution of each class as a multivariate normal distribution,

p(x|w_i) = (2π)^(-N/2) |Σ_i|^(-1/2) exp{ -(1/2) (x - m_i)^t Σ_i^(-1) (x - m_i) },

where m_i and Σ_i are the mean vector and covariance matrix of the pixels in class w_i, and N is the number of bands.

2. Calculation of derived products: Another important use of remote sensed images is the calculation of numerical products whose values are typically dependent directly on the intensity values of the image. Some examples of these derived products for NOAA AVHRR data are the Normalized Difference Vegetation Index (NDVI) and the Sea Surface Temperature (SST).

Another important characteristic of remote sensed images is that the pixels are geographically registered to a physical location on the earth's surface. Thus queries in content-based retrieval of remote sensed images will often involve subportions of an image, since the user's interest typically refers to a particular geographic area rather than to the image as a whole; this would differ from content-based queries on images of other kinds.
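For illustration, the maximum likelihood rule above can be implemented directly; this is a sketch assuming equal class priors, and the function names are ours:

```python
import numpy as np

def log_likelihood(x, mean, cov):
    """log p(x | w_i) for the multivariate normal class model of
    Section 2.2."""
    d = x - mean
    n = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(cov, d))

def classify(x, class_params):
    """Assign x to the class maximizing p(w_i | x); with equal priors
    this is the largest class-conditional likelihood.
    class_params: {name: (mean, cov)} estimated from training pixels."""
    return max(class_params,
               key=lambda w: log_likelihood(x, *class_params[w]))
```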

3 FEATURE-BASED REPRESENTATION OF MULTISPECTRAL DATA

There is intrinsically a lot of redundancy present in multispectral remote sensed images. This is due to the fact that pixels at nearby spatial locations tend to have similar intensity values in all the bands, and the bands themselves are correlated; this also results in a high probability of adjacent pixels belonging to the same class. The redundancy makes a reduction in storage and transmission cost through compression possible. Compression methods based on the transform domain (e.g. JPEG, wavelets) exploit this redundancy, but the representations they produce are not useful for feature extraction, as the image has to be decompressed before access or interpretation is possible [8]. A technique which is naturally suited for joint compression and feature extraction is vector quantization (VQ). Tilton et al. [9,10] have studied the use of VQ for multispectral image compression. However, the strength of VQ techniques with respect to content was not considered there, as the emphasis was more on compression. In this section, we will develop a feature-based representation of multispectral images based on clustering of pixels in the multidimensional intensity space.

Let X = {x_i : i = 1, 2, ..., n} be the set of n p-dimensional vectors formed by the pixels of a spatially local region of a multispectral image with p bands. The clustering groups the vectors of X into a set of clusters C = {C_i : i = 1, 2, ..., N}, with n >> N, while trying to minimize a distortion measure J(X;C). Each cluster is described by a set of features,

C_i = C_i(m, Σ, c),

where m is the mean of the vectors mapped onto that cluster, Σ is the covariance matrix of the cluster, and c is the number of vectors in the cluster. These cluster features describe the local intensity distributions in the image, and the accuracy with which the distributions are represented is dependent on the number of clusters; typical measures used to choose the number of clusters are the intercluster distances, the cluster variances, and the number of pixels per cluster. The distortion measure we used is given by

J(X;C) = (1/n) Σ_{i=1}^{n} || x_i − C_j(m) ||²,

where each vector x_i is mapped to the cluster C_j with the nearest mean, using the nearest neighbor rule [11]. Once the clustering is complete, an index map I, which maps each pixel to the index of the cluster it belongs to, can be used to encode the image in a manner similar to VQ-based compression. Note that the clusters, together with the index map, can also be used to answer queries involving spatial information: in one case, the image can be segmented by reclassification of the clusters to form spatially homogeneous regions having spectral values close to each other; in another case, regions with similar spectral distributions can be identified even when they are not spatially contiguous, since perturbations due to class mixtures and sensor noise will often be captured by separate clusters.

There were several practical reasons for subdividing the image and clustering each subportion separately rather than clustering the entire image globally. First of all, clustering an entire image into thousands of clusters is computationally expensive and impractical; it is better to cluster a given subimage into a lower number of clusters. Also, local clustering tends to produce tighter clusters: global clusters tend to spread across region boundaries and to average out local distributions, while local clustering retains the local features which queries will often involve, since query regions do not necessarily cover the full image.

Figure 3 shows the feature extraction scheme. The image is first divided into sectors; the pixels of each sector are clustered to get the cluster feature set, and each pixel is then encoded to a cluster index to form the feature index map.

Figure 3: A feature-based representation scheme for multispectral images. (Local image → feature extraction by clustering of pixels in the local image → cluster features; encoding of each pixel to a cluster index yields the feature index map, which together form the global multispectral image representation.)

Thus our feature-based representation for multispectral image data can be summarized as follows:

database = { S = (s_k : k = 1, 2, ..., p), metadata }

Each entry consists of various sectors S and other metadata, such as the source name, geographical location, sun angle, moisture, etc. Each sector s consists of the cluster set C along with the feature index map I and other metadata:

s = { C, I, metadata }

The cluster features serve as the logical image in the representation model discussed in Section 2.1.
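A compact sketch of the sector clustering and the resulting cluster features follows. A plain k-means-style loop is used as a stand-in, since the paper does not mandate a specific clustering algorithm; 64 clusters per sector is the choice used in Section 5.

```python
import numpy as np

def cluster_features(pixels, n_clusters=64, iters=20, seed=0):
    """Cluster the (n, p) spectral vectors of one sector; return the
    per-cluster features C_i(m, Sigma, c), the index map I and the
    distortion J(X;C) defined above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(pixels, dtype=float)
    means = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(iters):
        # nearest-neighbour mapping of every vector to a cluster mean
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        for j in range(n_clusters):                # recompute the means
            if (idx == j).any():
                means[j] = X[idx == j].mean(0)
    feats = []
    for j in range(n_clusters):
        members = X[idx == j]
        cov = (np.cov(members.T) if len(members) > 1
               else np.zeros((X.shape[1], X.shape[1])))
        feats.append((means[j], cov, len(members)))
    J = ((X - means[idx]) ** 2).sum(1).mean()      # J(X;C)
    return feats, idx, J
```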

4 QUERY PROCESSING USING MULTISPECTRAL FEATURE PRIMITIVES

Queries regarding content can be processed using the feature-based representation developed in Section 3. There are two ways of processing queries at this point. In the first case, certain queries which can be anticipated in advance are processed at the time of insertion, and the database is indexed on the results; such queries can then be answered directly from the index without too much computation at query time. In the second case, ad hoc queries for which no apriori information is stored are analyzed at run time using the logical image data, which is computationally more expensive. The query processing techniques themselves are the same in both cases; the cases differ only with regard to when the processing is done. Due to the lack of literature regarding standard query operations concerning content, and the lack of an accepted benchmark, we limit ourselves in this paper to query scenarios which we consider representative: queries regarding classes (snow, cloud), queries regarding values, and data mining. We consider query processing for each of these cases separately.

Query by Class: This form of query concerns the presence and extent of a particular ground class in an image, for example the percentage of area covered by snow. The premise of our approach is that clusters whose means lie close to the characteristic distribution of a class in the multispectral space are very likely to contain pixels belonging to that class, and that as the cluster mean moves farther away from the class mean, the possibility of the cluster containing pixels of that class diminishes. This distance between a cluster and a class should be measured in terms of the class distribution. Keeping this in mind, we formulate a decision rule to determine whether a cluster contains pixels belonging to a certain class as follows. A cluster with mean vector C(m) is judged to contain pixels of class c if

(C(m) − m_c)^t Σ_c^(−1) (C(m) − m_c) < (m σ_c)^t Σ_c^(−1) (m σ_c),

where m_c and Σ_c are the mean vector and covariance matrix of the class, σ_c is the vector of standard deviations of the class distribution, and m is a parameter expressing how far, in standard deviations, a qualifying cluster mean may lie from the class mean. Obviously, as m increases, more clusters will qualify for a given class. Note that the above decision rule uses only the mean of the cluster and not the cluster covariance matrix information; more sophisticated decision rules will be needed in practice, since the distributions of clusters and classes can overlap. To answer a query regarding the extent of a class in an image, each cluster in the cluster set C is checked with the above rule, and the pixel counts of the qualifying clusters are summed; the sum, as a fraction of the sector size, gives the area percentage for that class. The decision rule is admittedly very simplistic, but it can form the basis for processing more complex queries involving functions of several classes. Experimental results regarding this form of query are presented in Section 5.
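The query-by-class test reduces to a Mahalanobis-style comparison per cluster. A minimal sketch of our rendering of the rule (function names are ours):

```python
import numpy as np

def cluster_matches_class(cluster_mean, class_mean, class_cov, m):
    """Decision rule of Section 4: the cluster mean C(m) qualifies for
    class c when its covariance-weighted distance from the class mean
    is below that of a point m standard deviations out."""
    d = cluster_mean - class_mean
    sigma = np.sqrt(np.diag(class_cov))        # per-band std deviations
    lhs = d @ np.linalg.solve(class_cov, d)
    rhs = (m * sigma) @ np.linalg.solve(class_cov, m * sigma)
    return lhs < rhs

def class_extent(clusters, class_mean, class_cov, m, n_pixels):
    """Fraction of a sector judged to belong to the class: the pixel
    counts c of all qualifying (mean, cov, count) clusters, summed."""
    hits = sum(c for mu, _, c in clusters
               if cluster_matches_class(mu, class_mean, class_cov, m))
    return hits / n_pixels
```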

Query by Value: As mentioned in Section 2.2, the numerical values of the derived products are typically dependent directly on the intensity values of the pixels. Thus queries about these products can be considered to be queries regarding the intensity values in the image. These queries can be directly processed using the feature-based representation, as the cluster features are directly representative of the intensity distributions in the image.

Data Mining: Data mining refers to algorithms which try to discover interesting concepts from data. Cromp and Campbell [12] discussed some important applications of data mining to remote sensing. We believe that our representation of remote sensed images can be applied to some aspects of data mining. For instance, anomaly detection is a method by which artifacts, such as outlier values from the standard statistical range, are discovered to be a manifestation of a previously unknown phenomenon. The cluster representation of the multispectral image is very concise and lends itself to statistical processing. We present some preliminary results towards this end.

5 EXPERIMENTAL RESULTS

The experiments presented here were conducted on NOAA AVHRR imagery made available to us by the National Operational Hydrologic Remote Sensing Center (NOHRSC). The images were already georegistered and radiometrically calibrated, and each image was around 1000 x 1000 pixels. Bands 1, 4 and 5 were used as the primary bands, following the suggestion of Baglio and Holroyd [6] that these bands are sufficient for snow monitoring. The images were classified into three classes, viz. snow, land and cloud. For our experiments, we developed a maximum likelihood classifier, for which we estimated the characteristic values for the three classes from the available images. Table 1 gives the estimated mean and variance of each band for each class, along with the total mean and variance.

Table 1: Mean and variances for the different classes.

Band | Total Mean | Total Variance | Cloud Mean | Cloud Variance | Snow Mean | Snow Variance | Land Mean | Land Variance
1 | 36.9 | 700.6 | 58.0 | 612.9 | 43.2 | 481.4 | 16.8 | 119.5
2 | 34.7 | 468.1 | 50.0 | 436.1 | 38.4 | 267.6 | 18.3 | 103.7
3 | 86.7 | 155.8 | 91.3 | 111.8 | 81.5 | 29.1 | 90.1 | 31.0
4 | 79.1 | 157.4 | 71.9 | 110.6 | 78.3 | 27.5 | 85.6 | 34.7
5 | 7.01 | 73.9 | 19.4 | 78.4 | 3.18 | 14.2 | 4.60 | 16.1

3O 2O I

10

&22

0.05

0i 50

50

100

150

200

o.

50

100

150

.

L_O

To get a better indication of clustering performance across bands, the normalized MSE (NMSE) for band i was also computed, defined as

    NMSE_i = MSE_i / (variance of band i).

The NMSE values for bands 1, 4 and 5 are given in Figure 4(d)-(f) respectively; the average NMSE was measured to be 0.019, 0.075 and 0.122 for Band 1, Band 4 and Band 5. As can be seen, the NMSE performance for the band with the larger variance (band 1) is considerably better than that for the bands with lower variances (bands 4 and 5). This implies that the clustering performance is dominated by the bands with higher variances, whose values lie farther apart. We had earlier reported [13] clustering methods which tried to give comparable performance between the different bands, but this resulted in the expense of higher computational complexity. Improved distortion measures are certainly needed, and how the accepted measures relate to query performance remains to be studied.

Figure 4: (a)-(c) MSE for Band 1, Band 4 and Band 5 respectively. (d)-(f) NMSE for Band 1, Band 4 and Band 5 respectively.
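To make the distortion measures concrete, the following sketch (our own illustration, not code from the original system; the array names and the 64-cluster choice are assumptions) shows how the per-band MSE and NMSE of the cluster mapping can be computed:

    import numpy as np

    def cluster_mse_nmse(band, labels, means):
        # band:   2-D array of original pixel intensities for one sector/band
        # labels: 2-D array giving the final cluster index of each pixel
        # means:  1-D array with the mean intensity of each cluster in this band
        reconstruction = means[labels]             # map pixels to cluster means
        diff = band.astype(float) - reconstruction
        mse = np.mean(diff ** 2)                   # mean squared error
        nmse = mse / np.var(band.astype(float))    # normalize by band variance
        return mse, nmse

    # Hypothetical usage: one 256x256 sector quantized into 64 clusters.
    rng = np.random.default_rng(0)
    band = rng.integers(0, 256, size=(256, 256))
    labels = rng.integers(0, 64, size=(256, 256))
    means = np.array([band[labels == k].mean() for k in range(64)])
    print(cluster_mse_nmse(band, labels, means))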

Query by Class: This experiment was carried out for the three different classes. The class of each cluster is taken to be the class given by the maximum likelihood classification decision rule, using the means and covariances estimated for the three classes. An image sector is then judged as belonging to a class if more than m% of its pixels are classified as belonging to that class, and queries of the form "find all sectors containing more than m% of a given class" were checked against this representation.

The retrieval performance was measured using two quantities, recall and efficiency. If A represents the number of images which were correctly retrieved, B represents the number of images which were missed, and C represents the number of images which were falsely judged as relevant, the two measures are defined as

    recall = A / (A + B),    efficiency = A / (A + C).

Figure 5: (a) Recall versus m. (b) Efficiency versus m. (c) Recall versus Efficiency.
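As a minimal illustration of these two measures (our own sketch; the sets of image identifiers are hypothetical):

    def recall_and_efficiency(retrieved, relevant):
        # recall = A/(A+B), efficiency = A/(A+C), with A = correctly retrieved,
        # B = relevant but missed, C = retrieved but not relevant.
        retrieved, relevant = set(retrieved), set(relevant)
        A = len(retrieved & relevant)
        B = len(relevant - retrieved)
        C = len(retrieved - relevant)
        recall = A / (A + B) if A + B else 1.0
        efficiency = A / (A + C) if A + C else 1.0
        return recall, efficiency

    # e.g. 3 of 4 relevant sectors retrieved, plus 2 false positives:
    print(recall_and_efficiency({1, 2, 3, 9, 10}, {1, 2, 3, 4}))  # (0.75, 0.6)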

Figure 5(a)-(b) shows the recall and the efficiency as a function of m for the three classes. As expected, the performance varies considerably with the stringency of the condition: as m increases, fewer of the relevant sectors are retrieved, so the recall in general decreases, while the efficiency increases since fewer unwanted sectors qualify. The recall performance is more stable for the land class, which gives the best results; the worst performance, for the cloud class, could be attributed to its large spread in the intensity space, because clusters belonging to the other classes lie nearby and are also retrieved. The plot of recall versus efficiency is shown in Figure 5(c). Ideally we would like queries to retain high recall without the cost of efficiency, and it should be possible to improve the performance by studying how the class distributions map onto the cluster representation.

Query by Value: Here, we experimented with queries regarding the intensity values in the image sectors, of the type: Find all image sectors in which the percentage of pixels with values in a given range on a given band satisfies a given condition. Unlike the class queries, these queries could be checked directly against the feature-based representation, since the cluster features directly incorporate the spectral information of each sector. The queries were made with conditions on the band values and on the corresponding percentages, and the stringency of both conditions was varied. Table 2 gives the retrieval performance for different queries by value. As can be seen, the performance is in general very good, but it is dependent on the stringency of the query conditions: as the conditions on the bands are made narrower, the recall decreases while the efficiency again increases.
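The following sketch (ours; the data layout is an assumption) shows how such a query can be answered from the cluster features alone, approximating the percentage of pixels in a value range by the total population of the clusters whose means fall in that range:

    import numpy as np

    def percent_in_range(cluster_means, cluster_sizes, lo, hi):
        # Estimate the percentage of a sector's pixels whose value on one band
        # lies in [lo, hi], using only per-cluster means and populations.
        means = np.asarray(cluster_means, dtype=float)
        sizes = np.asarray(cluster_sizes, dtype=float)
        inside = (means >= lo) & (means <= hi)
        return 100.0 * sizes[inside].sum() / sizes.sum()

    def query_by_value(sectors, lo, hi, min_percent):
        # sectors: mapping of sector id -> (cluster_means, cluster_sizes)
        # Return the sector ids satisfying the percentage condition.
        return [sid for sid, (means, sizes) in sectors.items()
                if percent_in_range(means, sizes, lo, hi) >= min_percent]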

Table 2: Retrieval Performance for Different Queries by Value.

Condition on Bands            Condition on Percentages    Recall(%)    Efficiency(%)
Band5 >= 11                   >= 50                       100.0        20.06
3 < Band5 <= 11               > 20 & <= 50                99.58        -
Band1 > 20 & Band5 <= 5       >= 50                       95.83        -

Note that it was possible to increase the efficiency considerably (four times in this case) by relaxing the recall performance of the queries used.

Data Mining (Anomaly Detection): Since the cluster representation retains statistics such as the average and variance of each band in each sector, the answers to statistical queries can in general be provided by the database management system directly from the stored features, and sectors whose statistics lie outside the standard statistical range can be flagged as potential anomalies for further study.
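As one simple illustration of this anomaly-detection idea (a hedged sketch of ours; the z-score statistic and the threshold are assumptions, not taken from the paper), outlier sectors can be flagged directly from the stored per-sector statistics:

    import numpy as np

    def flag_anomalous_sectors(sector_means, threshold=3.0):
        # sector_means: array of shape (n_sectors, n_bands) holding the mean
        # intensity of each sector on each band, computed from cluster features.
        sector_means = np.asarray(sector_means, dtype=float)
        mu = sector_means.mean(axis=0)               # population mean per band
        sigma = sector_means.std(axis=0) + 1e-12     # population spread per band
        z = np.abs(sector_means - mu) / sigma        # z-score of every sector
        return np.where((z > threshold).any(axis=1))[0]  # outlier sector indices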

ACKNOWLEDGMENTS

This work was supported in part by grant ASC-9350309 and by a Presidential Fellowship. The authors would like to thank Tim Szeliga of NOHRSC for providing the AVHRR images used in this research and for some help with the classification algorithms.

REFERENCES

[1] "The Future of Remote Sensing from Space: Civilian Satellite Systems and Applications," OTA-ISC-558, U.S. Government Printing Office, Wash., D.C., July 1993.

[2] J. Dozier, "How Sequoia 2000 Addresses Issues in Data and Information Systems for Global Change," Sequoia 2000 Technical Report 92/14, 1992.

[3] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloustos, and G. Taubin, "The QBIC project: Querying images by content using color, texture and shape," Proceedings of SPIE Storage and Retrieval for Image and Video Databases, February 1993.

[4] M. Arya, W. Cody, C. Faloustos, J. Richardson, and A. Toga, "Qbism: Extending a dbms to support 3d medical images," Proceedings of the Tenth Int. Conference on Data Engineering (ICDE), pp. 314-325, 1994.

[5] J. Barros, J. French, W. Martin, P. Kelly, and M. White, "Indexing Multispectral Images for Content-Based Retrieval," Proceedings of the 23rd AIPR Workshop, Wash., D.C., Oct. 1994.

[6] J.V. Baglio and E.W. Holroyd, "Methods for Operational Snow Cover Area Mapping Using the Advanced Very High Resolution Radiometer," Technical Report, National Operational Hydrologic Remote Sensing Center, March 1989.

[7] J.A. Richards, Remote Sensing Digital Image Analysis: An Introduction, Springer-Verlag, New York, 1993.

[8] A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, 1992.

[9] J.C. Tilton, D. Han, and M. Manohar, "Compression Experiments with AVHRR Data," Proceedings of the IEEE Data Compression Conference, pp. 411-420, 1991.

[10] M. Manohar and J.C. Tilton, "Progressive Vector Quantization of Multispectral Image Data Using a Massively Parallel SIMD Machine," Proceedings of the IEEE Data Compression Conference, pp. 181-190, 1992.

[11] B.S. Everitt, Cluster Analysis, Edward Arnold, London, 1993.

[12] R.F. Cromp and W.J. Campbell, "Data Mining of Multidimensional Remotely Sensed Images," Proceedings of the Int. Conference on Information and Knowledge Management, Wash., D.C., 1993.

[13] A. Vellaikal, C.-C. Kuo, and Son Dao, "Content-Based Retrieval of Remotely Sensed Images Using Vector Quantization," Proceedings of SPIE Visual Information Processing IV, Orlando, FL, pp. 471-480, 1995.

N96-15457

Machine Learning for a Toolkit for Image Mining*

Richard L. Delanoy
Massachusetts Institute of Technology / Lincoln Laboratory
244 Wood Street, Lexington, MA 02173-9108
(617) 981-2545
[email protected]

ABSTRACT

A prototype environment is described that enables a user with very limited computer skills to collaborate with a computer algorithm to develop search tools (agents) that can be used for image analysis, creating metadata for tagging images, searching for images in an image database on the basis of image content, or as a component of computer vision algorithms. Agents are learned in an ongoing, two-way dialogue between the user and the algorithm. The user points to mistakes made in classification. The algorithm, in response, attempts to discover which image attributes are discriminating between objects of interest and clutter. It then builds a candidate agent and applies it to an input image, producing an "interest" image highlighting features that are consistent with the set of objects and clutter indicated by the user. The dialogue repeats until the user is satisfied. The prototype environment, called the Toolkit for Image Mining (TIM), is currently capable of learning spectral and textural patterns. Learning exhibits rapid convergence to reasonable levels of performance and, when thoroughly trained, classification accuracy appears to be competitive with other classification techniques.

1. INTRODUCTION Vast image

databases

ers to find particular

being

images.

accumulated

For example,

today much

are overwhelming

of the data currently

and civilian sensors are recorded and never accessed, image search. And the problem is getting worse, given planned

and deployed.

levels of computer debate [1,2]. Computer tadata

expertise

assistance

(key words

to locate

images While

these

data can be made

and hardware

will become

and computed during

resources

increasingly

indices

for some applications.

lected over Florida threshold.

How

very useful,

metadata

available

from military

of a continuing

to facilitate image

with varying

cloud index

and impractical

national

such searches.

as a whole),

one might ask for Landsat

a computed

is insufficient

to researchers

is the subject

necessary

For example,

obtained

of us-

in part because of the difficulty of the fleet of imaging satellites being

used to label each

June 1993 in which

the capabilities

images

was greater for many

Me-

can be used col-

than some other search

problems, such as finding images of moveable targets in military applications, finding tropical storms in weather satellite images, prospecting for mineral deposits, locating potential archeological

sites, or finding

a face in images

generated

by an airport

surveillance

camera.

* The work described here has been supported by the United States Air Force. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Air Force.


For these and many other applications, the only solution is to search the image contents directly. Until recently, the only means for searching through image data on the basis of content has been customized algorithms designed to search for a particular signature. While these are often effective, such algorithms are expensive to create and maintain, suitable only for limited applications, and require a computer vision expert to build and maintain.

An active area of current research and product development has been the construction of generalized database retrieval and classification tools with user-friendly environments for users with no experience in computer vision. These tools generally allow the user to provide a single example, be it a drawn shape or a box enclosing a subimage of pixels. The algorithm then attempts to characterize the example as a set of attributes (e.g., edges, spectral characteristics, eigenvalues) and compares these attributes with those computed for images in the database. A "distance" calculation comparing the attributes of the example with those of images in the database is computed and used as the criterion for selection. While capable of attention-getting demonstrations, these tools have exhibited only fair performance and have been applied only to small databases in limited domains. Examples of this class of tools are QBIC, developed by Niblack, et al. at IBM [3] and Multispec, developed by Landgrebe and Biehl at Purdue University [4].

The main shortcoming with this approach is that of using only a single example. The algorithm cannot learn from mistakes, cannot decide which attribute values associated with that example are useful for discrimination, and therefore there is no way to make the agent any better other than to try a different example. There is some chance that a user might get good discrimination from a single example. But, when the user is not getting the desired results, a more ideal algorithm would be one capable of learning from mistakes, incorporating not only examples of the object but also counter-examples of false alarms. From these, an algorithm can learn which attributes or combinations of attributes are discriminating, and uses them all.

Using tools of knowledge-based signal processing developed at Lincoln Laboratory [5,6,7], a prototype Toolkit for Image Mining (TIM) has been implemented that learns which attributes are discriminating from a user simply pointing to mistakes of classification. This paper describes the intended look and feel of a proposed environment, along with details of the underlying techniques of machine learning used to implement the TIM prototype. An evaluation of the technique and a discussion of possible applications is also presented.

2. THE TOOLKIT ENVIRONMENT

The premise of TIM is that a user, having found one image containing some object or region, would like to find other such images in a large database. An agent is generated in a collaborative process in which the user indicates mistakes of omission and commission to TIM and TIM responds with a visual representation of its current understanding of the user's intentions. When finished, the agent can be used in a variety of ways, including autonomously searching through large databases.

Figure 1. Example session of the TIM prototype. On the left is one of 3 infrared bands of an image of a town. Frames labelled SW, MW and LW are the short wave, medium wave and long wave bands of the subimage at tile 40 (near the center of the image on the left), which contains two bridges crossing over a river. The frame marked INTEREST is the result of applying an agent constructed for the detection of water from the set of example and counter-example pixels supplied by the user. These pixels are marked on frame LW in blue and yellow, respectively (appearing as black pixels in the river and white pixels along the highway at lower left in the gray-scale rendering shown here).

session

with TIM

would

be to fill in information

about

the sensor and the data available. This information can include filenames of images used for training. Menus would provide options for those attributes that the algorithm consider

as candidates

for learning.

For example,

the value of each spectral band and the value the set of candidate attributes. Following

initialization,

if the data

source

of each ratio of bands

the main user/machine

interface

is multispectral, might

is a screen

to be is to then

be considered

that would

as

look sim-

ilar to Figure 1. The screen presents an image consisting of several spectral bands (in this case long, medium and short wave infrared bands) and a feedback image called an interest image (right). With the left mouse button, the user clicks on pixels that are examples of the objects or regions to be sought. With the right mouse button, the user indicates counter-examples rithm

of objects builds

or regions

a new template

plies the template

that should encoding

to the various

bands

be ignored.

the current

With

of the input image,

117

each input

understanding

by the user, the algo-

of the users intentions,

and displays

the resulting

ap-

interest

image. The process repeats, using the same or other images, until the user is satisfied with the classification performance.

Once satisfied, the user can apply the agent to database searches. The user might specify existing metadata (time, place, or sensor) that might help constrain the search. In addition, the user would select from a menu some criterion for image retrieval. For example, the user might request that the agent retrieve any image for which the average interest value is greater than some threshold. The user explores the images retrieved by the agent to assess performance and to look for interesting object or region variants found by the agent. At this point, the user has the option of reentering TIM to further refine the agent.
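A hedged sketch of this retrieval criterion (ours, with illustrative names; `agent` stands for a trained template returning a per-pixel interest image):

    import numpy as np

    def retrieve(database, agent, threshold):
        # database: mapping of image name -> pixel array
        # agent:    callable returning an interest image with values in (0,1)
        return [name for name, image in database.items()
                if np.mean(agent(image)) > threshold]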

3. MACHINE LEARNING

The core technology of TIM is a new machine learning algorithm, based on techniques of knowledge-based image processing, in particular functional template correlation (FTC), developed at Lincoln Laboratory. Functional templates are generalized matched filters, based on fuzzy set theory, that can be used to create customized, knowledge-based image processing operations [5,6].

FTC can most quickly be understood as a simple extension of autocorrelation. Where the kernel of autocorrelation consists of literal image values (essentially a subimage of the image to be probed), the kernel used in FTC is a set of indices that reference scoring functions. These scoring functions encode a mapping between image pixel values and scores to be returned for each element of the kernel. High scores are returned whenever the input image value falls within the fuzzy limits of expected values. Low scores are returned whenever the input value falls outside these limits. The set of scores returned from the scoring functions are averaged and clipped to the continuous range (0,1). As in autocorrelation, if the feature being sought can vary in orientation, the maximum average score computed across multiple orientations is returned as the match of the kernel.

Scoring functions allow the encoding of ranges of acceptable values, thereby encoding uncertainty. In addition, scoring functions can be used to encode physical properties (knowledge) of the object being sought. As a result, functional templates generally tend to be more powerful than their standard, generic analogs. Functional templates have been used in a variety of ways, including 2-D shape matching, edge detection, thin line detection, thinning, spectral pattern matching, morphology, convolution, homotopic transformations, and data fusion. FTC has been successfully applied to several computer vision applications, including automatic target recognition [8,9] and the automatic detection of hazardous wind shears in Doppler weather radar images for the Federal Aviation Administration (FAA) [7,10,11].

In order to probe multispectral images, scoring functions are constructed for each spectral band. In this case, each functional template consists of a series of kernels, each having a single element with an associated, unique scoring function. Transformations of spectral bands, such as the ratio of two bands, might also be considered as inputs and likewise have a scoring function. In addition, texture can be decomposed into primitives (e.g., local variance, fractal dimension, contrast, periodicity, orientation) that can be encoded as pixel values representing a local neighborhood. The scoring functions for each spectral band, spectral transformation, and textural transformation would all be applied in tandem, producing a single average score (i.e., interest value) for each pixel of the output interest image.

Given this structure for spectral and textural functional template correlation, functional template learning is implemented by constructing scoring functions for each band value or computed attribute. Briefly, the approach we have taken is based on the construction of distribution histograms, one for example and another for counter-example values of each attribute. The scoring function is a mapping of the degree of overlap between examples and counter-examples at each bin of the two histograms. For example, high scores are returned for input values that have many examples and few or no counter-examples. Low scores are returned for values that have many counter-examples and few or no examples.
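A minimal sketch of this construction (our illustration, assuming 8-bit attribute values and a simple ratio-of-overlap mapping; the actual TIM scoring functions are not published in this form):

    import numpy as np

    def learn_scoring_function(examples, counters, bins=32, vmax=256):
        # Histogram the example and counter-example values of one attribute.
        edges = np.linspace(0, vmax, bins + 1)
        h_ex, _ = np.histogram(examples, bins=edges)
        h_cx, _ = np.histogram(counters, bins=edges)
        p_ex = h_ex / max(h_ex.sum(), 1)
        p_cx = h_cx / max(h_cx.sum(), 1)
        # Score each bin by the degree to which example mass dominates:
        # near 1 where examples dominate, near 0 where counter-examples do.
        table = np.where(p_ex + p_cx > 0, p_ex / (p_ex + p_cx + 1e-12), 0.5)

        def score(value):
            i = min(int(value * bins / vmax), bins - 1)
            return float(table[i])

        return score

    def interest(pixel_attributes, scoring_functions):
        # Average the per-attribute scores and clip to (0,1), as in FTC.
        s = [f(v) for f, v in zip(scoring_functions, pixel_attributes)]
        return float(np.clip(np.mean(s), 0.0, 1.0))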

The N most discriminating scoring functions are collected by iteratively building test templates that are applied to the set of examples and counter-examples. First, the single scoring function is found that does the best job in discriminating between examples and counter-examples. Then, the second scoring function is found that, when combined with the first scoring function in a single functional template, achieves the best discrimination. Additional scoring functions are added until either N scoring functions have been collected or until the addition of any other scoring function fails to achieve an incremental gain in performance.

Figure 2. Input image and interest image with selected subimages resulting from the application of the functional template taught to recognize water.

4. EXAMPLES AND EVALUATION

Agent training and use are illustrated with two different examples. In the first, 3-band (long wave, medium wave, and short wave) infrared images of urban and suburban scenes

were used. Images were divided into 50 by 50 pixel tiles and trained on one tile at a time. Figure 1 shows the end of training an agent (functional template) to detect water. Figure 2 shows the application of this agent to the overall image, resulting in the interest image on the right.

threshold.

In the same way that an agent might

might use the agent with a selection

criterion

select

to select

images

A more difficult task is the classification of vegetation Mapper (TM) images. Huber and Casler [13] have described a variety Their

of simple

area

classification

of study,

a region

metrics in central

and formulas Colorado,

tiles from a large

image,

from an image

the

database.

types from Landsat Thematic work in which they evaluated

developed

by wildlife

was classified

biologists.

pixel-by-pixel

into

14

vegetation types. Actual ground data were collected and used for scoring. They concluded that "using remote sensing information in complex terrains can be a difficult and relatively inaccurate process". With no technique 50%. When they added digital elevation facing, alarms.

they

Using

reached

nearly

70%

did the percentage data to determine

detection,

the same TM data, Augusteijn

(an enhanced

version

although

of correct classifications exceed the direction in which terrain was with

et al. [ 14] used

of back-propagation)

substantial

numbers

a Cascade-Correlation

[ 15] to classify

homogenous

of false

neural

regions

net

on the basis

of spectrum and texture. The data chosen by them for training and testing consisted of 8X8 pixel boxes containing only one vegetation type. Of the 14 vegetation types, only 9 were available in boxes of this size. The task was to train the neural net on half of these boxes and then attempt to identify vegetation were able to achieve 99% correct The same TM image learning.

Using

alarms) trained

was used to perform

the TIM environment,

of the vegetation typically

types in the remaining boxes. Using this approach, classifications using only spectral data.

rapid,

types

a preliminary

agents

in 100 X 100 pixel

with reasonable

detection

evaluation

were trained subimages.

performance

of functional

to identify

adjacent

with 6% false images,

continued.

After

the results several

average performance used for training.

These

results

alarms

on the training

Convergence

were scored,

images

and the process

detections

better than those reported

for 13 thoroughly was 94% correct then applied

classification

the agent reached

and 6% false alarms

by Huber

was

with 5% false

were

of indicating

were used in this fashion,

of 90% correct

are clearly

The templates

each

to a solution

(e.g., 80% detection

image.

template

pixel-by-pixel

in as little as 50 pixel inputs. The average classification accuracy agents (1 vegetation type was not present in the training image)

detections

they

errors

an asymptote

on subimages

and Casler,

to

not every

even when they

incorporated topographic data in addition to the 6 TM bands. It is more difficult our results with those presented by Augusteijn et al. Where the agents generated

to compare using TIM

were

only

to classify

mistakes

made

forced

homogenious

to classify

every

8X8 pixel patches.

TIM were in the boundaries biguous

spectral

characteristics

pixel, Many,

between

the

previous

if not most,

regions.

These

that are likely

well as by TIM.

120

study

attempted

of the classification boundaries

to be misclassified

often have transitional, by the human

truthers

by amas

5. DISCUSSION

We have designed a way for a user and a machine to collaborate in the training of agents to recognize particular spectral or textural patterns. Being provided with a set of examples of the intended pattern and counter-examples of patterns to be ignored, the learning algorithm creates a functional template of the intended pattern. The trained agent would then be sent into a large database to autonomously search for other images containing the same pattern. As mistakes are made, the user can continue to refine the agent by adding more examples and counter-examples. Because functional templates have a simple, readily interpretable representation, it is possible to edit learned agents to achieve predictable changes in performance. It is believed that these learning techniques can be extended to the learning of shapes.

Customized search algorithms dedicated to a single application may well outperform agents generated using TIM. However, this approach is expensive and therefore impractical for most potential users of image databases. The interactive training of agents using the TIM environment would allow anyone with an understanding of a particular image domain to build reasonably good search and classification tools.

Agents generated by TIM might be used for several different applications other than content-based image retrieval, including (1.) discovering equations for indices that can be used as metadata to tag database images, (2.) discovering which spectral bands are necessary for a particular task when bandwidth limitations prevent the transmission of all data collected, (3.) prioritization of data storage (e.g., should data be stored on primary, secondary, or tertiary devices) on the basis of image content, (4.) building components for more complex algorithms, and (5.) quantitative image analysis and trend analysis for scientists and policy planners.

It is likely that a tool such as TIM, by enabling computer users with limited experience to inexpensively build useful computer vision tools, should facilitate the exploration and use of image databases.

REFERENCES 1. NASA's EOSDIS development TEC-92-24, February 1992. 2. Broader involvement GAO/IMTEC-92-40,

approach

of the EOSDIS May 1992.

is risky.

community

U. S. General

is needed.

3. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, Taubin. The QBIC project: Querying images by content Science, February 1993. 4. D. Landgrebe and L. Biehl. An Introduction University, 2.15.94 edition, February 1994.

to MultiSpec.

Accounting

U. S. General

Office

Report

Accounting

GAO/IM-

Office

Report

D. Petkovic, R Yanker, C. Faloutsos, and G. using color, texture, and shape. Computer

School

of Electrical

Engineering,

Purdue

5. R. L. Delanoy, J. G. Verly, and D. E. Dudgeon. Functional templates and their application to 3-D object recognition. In Proc. of the Intl. Conf. on Acoustics, Speech, and Signal Processing, San Francisco, CA, March 1992.

121

6.R.L. Delanoy andJ.G. Verly.Computer scoring function.

apparatus and method June 1993.

for fuzzy template

matching

using a

U. S. Patent No. 5,222,155,

7. Richard L. Delanoy and Seth W. Troxel. Journal, 6(I): 187-212, Spring 1993.

Machine

intelligent

gust front detection.

Lincoln

Laboratory

8. R. L. Delanoy, J. G. Verly, and D. E. Dudgeon. Machine intelligent automatic recognition of critical mobile targets in laser radar imagery. Lincoln Laboratory Journal, 6(I): 16 l- 186, Spring 1993. 9. J. G. Verly, R. L. Delanoy, and C. H. Lazott. Principles and evaluation of an automatic target recognition system for synthetic aperture radar imagery based on the use of functional templates. Automatic Target Recognition lli, Proceedings ofSPIE, Vol. 1960, Orlando, FL, 14-16 April 1993, pp. 57-71. 10. Richard L. Delanoy and Seth W. Troxel. Automated gust front detection using knowledge-based processing. In IEEE 1993 National Radar Conference, Boston, MA, April 1993.

signal

l 1. M. M. Wolfson, R. L. Delanoy, M. C. Liepins, B. E. Forman, and R. G. Hallowell. The ITWS microburst prediction algorithm. Preprints 5th Conference on Aviation Weather Systems, Vienna, VA, August, 1993. 12. R. L. Delanoy MIT Lincoln

and R. J. Sasiela. Machine Laboratory, March 1995.

Learning

13. H. P. Huber and K. E. Casler. Initial analysis Sensing, 11(5):907-912, 1990.

for a Toolkit for Image Mining.

of Landsat

TM data for elk habitat

Technical

mapping.

Report

1017,

Int. J. Remote

14. M. F. Augusteijn, L. W. Clemens, and K. A. Shaw. Performance evaluation of texture measurements for ground cover identification in satellite images by means of a neural network classifier. Technical Report EAS-CS-93-8, University of Colorado, Colorado Springs, CO, October 1993. 15. S. E. Fahlman and C. Lebiere. vances in Neural Information

The cascade-correlation learning architecture. In D. Touretzky, Processing Systems 2. Morgan Kaufmann, San Mateo, 1990.

122

editor, Ad-

N96-15458

Dissemination of compressed satellite imagery within the Navy SPAWAR Central Site Product Display environment*

Oleg Kiselyov and Paul Fisher
Computer and Information Sciences, Inc.
303 N. Carroll, Suite 108, Denton TX 76201
Phone: (817) 484-1165, FAX: (817) 484-0586
Email: [email protected], [email protected]
http://replicant.csci.unt.edu/~oleg/SIMDC95/Paper_Summary.html

Abstract The paper presents a case study of integration of compression techniques within a satellite image communication component of an actual tactical weather information dissemination system. The paper describes history and requirements of the project, and discusses the information flow, request/reply protocols, error handling, and, especially, system integration issues: specification of compression parameters and the place and time for compressor/decompressor plug-ins. A case for a non-uniform compression of satellite imagery is presented, and its implementation in the current system is demonstrated. The paper gives special attention to challenges of moving the system towards the use of standard, non-proprietary protocols (smtp and http) and new technologies (OpenDoc), and reports the ongoing work in this direction.

I. History of the project

The Central Site Product Display (CSPD) system has been developed for the Space and Naval Warfare Systems Command (SPAWAR) for disseminating a whole range of weather and oceanographic information to a variety of clients afloat or ashore [1]. The clients (not necessarily related to the Navy) can request/receive data via serial (dial-up), Internet and VSAT connections. The available information includes pressures, temperatures and wind directions on and above land/ocean surfaces, barometric heights, temperatures on the ocean surface and at specific depths, precipitation, primary observational data (weather reports), storm warnings, and a number of satellite images of different kinds (IR, Visual, DMSP). The satellite images are usually 512x512 or 1024x1024 8-bit graylevel pictures.

CSPD is being developed upon the foundation of a very popular software system, NODDS, built by the Fleet Numerical Meteorology and Oceanography Center (FNMOC) at Monterey, CA. NODDS has been in operation for more than 4 years. The server part of the system runs on a Cyber mainframe. It was written in FORTRAN with modules dating as far back as 1987. The client portion of NODDS was written in QuickBASIC for PC-DOS. The hardware base of the old system is obviously in need of updating, as is the software. The new CSPD server has to work on modern-day workstations (HP 9000, Sun SPARC, Concurrent's MAXION RISCstation) while the clients are to run on almost any platform: UNIX workstations, Macintosh and PC computers. Furthermore, there is a clear need to expand the system to include more specialized weather products, to enhance display of received information, and to decrease communication time. Improving satellite imagery communication is of special priority, especially as far as flexibility of requesting and compressing images is concerned. The new system should allow the user to browse and request all available satellite imagery (rather than a few pre-selected pictures), to choose a compression algorithm and its parameters, and to evaluate the anticipated image degradation when a lossy compression has been selected.

* This work was supported in part by US Navy SPAWAR Grant N00039-94-C-0013 "Compression of Geophysical Data" and by US Army Research Office TN 93-461 administered by Battelle Research Office.


The user should also be able to specify a non-uniform compression of certain areas/features that he considers particularly important. Again, the rationale for using compression is to reduce the communication time.

Computer and Information Sciences, Inc. (CIS) has been working on this project as one of the contractors. CIS is solely responsible for handling of satellite imagery within CSPD. We have written the software to compress/decompress satellite images using technology developed at CIS, as well as code to convert satellite imagery from DEF to TIFF formats, to view images under CSPD and standalone, and to catalog imagery. We have also completely re-written the server part of CSPD in C++, and perform its maintenance and enhancement.

II. Request-delivery information flow and protocols

The current CSPD system is client-server, though there are capabilities for a broadcast-type dissemination. A NODDS server distributes weather and satellite information in response to queries from CSPD clients. The server uses NODDS/TEDS databases (managed by the Empress database manager) as well as a collection of flat files to look up the requested data. The net functionality of the server is as follows: reading a NODDS input (request) string, retrieving grid data from a TEDS database, encoding them using Navy encoding routines (rainform-gold or OTH-gold), and transmitting the encoded data to the requester. Text files, synoptic and upper air reports, and satellite image requests are handled as well.

To formulate a request, a user has to select a geographical area he would like to get weather information for, and pick the products he is interested in for the selected area. Satellite imagery is treated as a regular 2D product. A request for a satellite image always carries in it specifications for a compression method to use, and its parameters. Originally they are set to some "default" compression specifications, which can be evaluated and changed by the user using the compression manager (see below). Whenever the server comes across a request for a satellite image, it queries the database to see if an image of the corresponding type (IR, visual, etc.) covering a specified area (and valid for a specific period of time) is present in the database. If the image exists, the server obtains the name of the corresponding TIFF file and launches a compressor plug-in to compress the file according to the parameters specified in the request. If the compression was successful, the resulting binary code is included in the return packet. Otherwise the server sends a message detailing the error (as a matter of fact, the entire contents of stderr produced by the compressor). When the client receives the reply, it unpacks and preprocesses it. At this time, the client runs a decompressor plug-in, which decodes the received binary code into a PGM file. The latter is used by a display module to show the picture and, possibly, lay temperature, pressure, etc., isoplots and other information over the satellite image.

It should be stressed that the compressor and decompressor are separately compiled and run executables (plug-ins), which are launched (spawned) whenever there is a need to encode/decode image data. The modules are available for a variety of platforms, and rather versatile: they support a very simple and unified interface. The fact that request/reply parsing and compression/decompression are performed by separate executables has several advantages. Indeed, one can modify the compression module (say, by adding more sophisticated compression techniques) without any need to recompile/relink client or server executables. Moreover, the compressor can be run by programs other than the CSPD server, as the functionality the plug-in provides - image compression - can be used in some other contexts, for example, within WWW environments.

Another important advantage of having compression/decompression modules as separately run plug-ins is better error handling and improved robustness to error conditions. The root NODDS server module can only detect very simple errors, mainly related to the syntax of a request


string. Most semantic errors, for example, when a requested image file cannot be found or read, there is no space on device for the encoded file, when the specified compression parameters do not make sense, etc., are detected when actually performing the compression. When the server root module launches the compressor through the system() function, it redirects the compressor's stdout/stderr into a temporary file. If the compressor succeeded, it writes "OK" into a special file and creates a file with the compressed image data. The latter is included by the NODDS server in its output stream, PACKED.DAT (following appropriate headers). UU-, Base64-, etc., encoding can also be performed if necessary at this stage. If the compressor failed, its standard error stream (which explains in detail the compressor's actions and the reasons for failure) is inserted into the PACKED.DAT as a special message to a client. In any event, a problem with locating/compressing an image would never cause the NODDS server to crash, and the user would always get some answer to his query.

As was mentioned before, we have operational compressor/decompressor plug-ins for a variety of platforms: Sun Sparc, HP 9000/7xx, Macintosh. The modules are rather versatile and independent: they can be run as "filters" that translate their input into output (that is, read a TIFF/PGM/XWD file and make a compressed output, or read the compressed image code and prepare a TIFF/PGM/XWD file). This makes it straightforward to integrate the plug-ins into various environments. For example, recently we have cast the decompressor modules as Web browser helpers, with the help of a very simple shell script. This made it possible to view the (compressed) satellite images retrieved from the NODDS/TEDS databases through Mosaic/Netscape. That is, clicking/linking on an item of type image/x-sowlp-gray (extension .slpg) causes the compressed image to be delivered, unpacked, and displayed. From the user's point of view, it looks no different than retrieving a JPEG picture. It is even possible to play with different compression ratios: one merely needs to type the value of a quality parameter in a specified slot of the web page [2] and click on the "Compress Now" button. Having received the request, an HTTP server compresses the image and sends the compression ratio, compression time, and the compressed data back to the client. The user thus can note how long it took to communicate the image, and see how good the decompressed picture turns out to be. This interactive compression demo is available from [2]; please contact the authors for decompression helpers.

Since http and smtp both use the same MIME protocol to deal with non-textual data, the decompressor modules-helpers can be used just as they are to view encoded images sent through e-mail, that is, asynchronously. The same viewer that helps out Mosaic when an image/x-sowlp-gray stream comes in can be used to make sense of an e-mail message with an image/x-sowlp-gray "attachment". The mailers take care of ASCII encoding if necessary, and they do it transparently and automatically. Using electronic mail to send out satellite imagery has a tremendous advantage: one is no longer limited to TCP/IP or any other particular communication protocol or media; one can send mail from a FIDONET site to a mainframe, from a PC on a Novell network to a Mac, etc. One can also take full advantage of all e-mail frills and embellishments, like mailing lists, automatic broadcasting, mail aliases, mail exchangers, etc.
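The server-side pattern just described can be sketched as follows (our reconstruction in Python rather than the system's actual implementation; the plug-in's command-line interface shown here is an assumption):

    import os
    import subprocess
    import tempfile

    def run_compressor(compressor, image_path, out_path, params):
        # Launch the compressor plug-in as a separate executable, capturing
        # its stderr so a failure can be reported back to the client.
        with tempfile.TemporaryFile() as errlog:
            status = subprocess.call([compressor, image_path, out_path, *params],
                                     stderr=errlog)
            errlog.seek(0)
            diagnostics = errlog.read().decode(errors="replace")
        if status == 0 and os.path.exists(out_path):
            with open(out_path, "rb") as f:
                return f.read()          # compressed code to include in PACKED.DAT
        raise RuntimeError(diagnostics)  # error text forwarded as a client message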

III. The compression manager

The NODDS server always compresses satellite images for transmission. The compression method and its parameters are taken from a satellite product request string received from a client. When the user first selects a satellite product in formulating a request, the corresponding request string receives some "default" compression specifications. It is the compression manager that lets the user customize the default settings. Thus the place of the compression manager can be illustrated by the chart in Figure 1. The compression manager allows interactive selection of a compression method and its parameters. Note, the compression is almost always lossy. To give the user an idea how well the delivered


image would look and whether important features the user is looking for would be preserved, the compression manager loads up some (relatively small) sample image and displays it along with the compressed and decompressed picture, using the parameters the user has set up. Figure 2 shows a snapshot of the component manager screen. The new version of the compression manager supports non-uniform compression as well; see below. At present, the compression manager implements a few variants of a modified Laplacian pyramid compression method (multi-resolutional analysis with overcomplete frames) discussed in papers presented at the Data Compression Conferences [3], [4]. See also [5] for a complete list of references.

in

compression

The very nature of environmental images, or any image for that matter, suggests that not every detail of the picture is equally important to the observer. For example, the area of the hurricane eye on a satellite image should be of high resolution, while the tight cloud cover of the hurricane body is less informative and may be rendered with a lower resolution, though it cannot be completely discarded. In disseminating weather information over a fleet, a meteorologist at a particular ship needs very accurate data on the cloud cover, wind direction, temperature, etc., just in the vicinity of his ship. The information about what is going on outside that small area is used for prognosis and does not have to be of very high precision. Accordingly, the amount of loss and inaccuracy that can be tolerated during the communication varies not only from one user to another but also from one region of the image to another. This raises the problem of a non-uniform, lossy compression, i.e., compression where the loss varies with the location/features/frequencies, etc., and tailoring such compression to a particular user and circumstances. Preserving the information during compression to the extent the user needs, but not more, helps to drastically reduce the amount of data that has to be transmitted. In the present implementation of non-uniform compression [3], the regions of special interest are specified by assigning to them a degree of relative importance, using an "aplhachannel", or a "mask". The mask is an 8-bit image of the size of the original one with pixel values telling a relative weight of the corresponding area: the bigger the weight, the higher the importance. Note that the mask is necessary only for compression; a non-uniformly compressed image code has the same format and structure as a uniformly compressed one. That is, non-uniform compression does not require a special decompression; the same decoder restores both kinds of compressed imagery. To simplify transmission of the mask in a request to the server, we, at present, have limited specification of the importance area to a single rectangular region. Therefore, only a few extra characters in the request string are necessary to solicit the non-uniform compression. For example, a specification [ 10,( 1,2)-( 127,255)] means that a rectangular area of an image with the upper left comer at (1,2) and the lower right comer at (127,255) is to be assigned a weight of 10 during the compression. The rest of the image (outside this rectangle) is assigned weight 1 by default. Thus, features within the rectangular area would incur 10 times less distortion than features outside the rectangle. Note, xrnin, xmax, yrnin, and ymetx are specified in pixels of the image to compress before

of the target image: therefore, one has to have some idea of the size specifying the non-uniform compression parameters.

126

Pp.r._ian

Giilf

Arp_

f!_req_!.__._.! SAT

... SATV010UW

... 0040...

LAPPRENO

+itt+l

I

+.otl+

_

NODDS

Server

Manager

Compression

SATV011

Find

Image

] Find

Compressor

Fig. 1. Compression

Manager's Change

MIDDLE Original

image

[Lap

Pyr

place

in the overall

Compression

the

(IR)

imaq_

_Compression

,F-Quality

method--

parameters--

1100401 select

flow

Pat+_ms

EAST SATELLITE

_ompress

information

I}I}I}I} a product

compression

11}I_I}I}] to

tune

parameters

[-] Non-uniform

compression

In image of

I Cancel

Fig. 2. Snapshot

of the Component

127

Manager

screen

I

Note that the region of low distortion is blended smoothly into the somewhat more distorted "background", without any sharp and conspicuous boundaries between the areas of different importance. Thus the picture restored after decompression does not have a "cut-and-paste", or "patched", look at all. These smooth transitions are a notable feature of the present non-uniform compression method. The compression manager discussed above can also be used to select a region of special interest, assign a degree of importance to it, and preview the effect of the compression on a sample image.

Fig. 3. Snapshot of the Component Manager - Change Compression Params screen with preview of non-uniform compression.

of non-uniform

Note that the compression ratio shown on the picture is significantly underestimated. image is subsampled to be as small as 128x 128. It is well known that smaller images compress. In our experience, when the same method is applied to a full-size 512x512 image, the compression ratio would be at least 4 times as much.

V.

Challenges

and

future

of special on a sample

compression The sample are tougher satellite

directions

The new version of the CSPD clients and server is being phased in. However, the system, and especially its satellite image delivery part is under further development. The challenges here are: - Allowing acquired

interactive browsing of new imagery a new image available for retrieving;

128

and letting

a client

know

that the server

has

to

- Keeping local/customary configurations simplify the request protocol and allow

(profiles)

of clients

on the server,

which

would

- Automatic delivery of weather products: each client receives (without an explicit request) periodical updates of information which it "usually" asks for. The information is also tailored to the current location of clients: for example, the region of special importance in nonuniform compression changes as the client moves; - Binding to;

of an image,

- Overlay image.

of relevant

that is pinning geographical

the picture

down

and meteorological

to the exact information

area on the globe on a full-gray-scale

it pertains satellite

Another challenge is integration of the request/delivery of satellite imagery and other grid products within http/smtp protocols: developing weather-conscious Web servers/browsers. One possible scenario: the user starts with a map of the entire world or some predefined region, as in the Xerox Map server [6]. The user can zoom in/out, etc., until he nails down precisely the area he is interested in. Then he can select weather products from a list of products and/or a list of product groups. A click on the submit button brings, after some communication delay, a familiar "weather map" view. Note, the server may have already recognized the user (or his site), so it can present him with a hot-list of his personal requests, that is, requests which were previously defined/asked for. Thus the user can skip on the area definition and communication, and simply say, "I want my usual". Also, it is very easy for the server to determine if there are new products added to the database since the last time the user had talked to the server. The server then can make a list of the new products on user's request. Since the server knows hide some products which are "for your eyes only".

exactly

whom

it is talking

to, it may also

It should be stressed that the scenario above looks very similar to the existing CSPD/NODDS interface, on the surface of it. The challenge now is the employing of standard tools as Mosaic, Netscape and httpd, and standard protocols, instead of proprietary clients and proprietary communication engines. Another difference is that the product lists and other configuration information are now stored on the server, not on a client site. Therefore, the information always stays current, and there is no need to waste bandwidth broadcasting product updates to all clients, even those who have already received updates, or who are not interested. One more difference of the proposed scenario is that the defined areas are also stored on the server (as a site "profile"). This arrangement cuts down on the communication time: the only information sent between the client and the server during the initial dialog is just the names of the cataloged areas/profiles. There is also no need to send the whole list of products being requested to the server, very often all the same list, and all over again: the list of products (with all needed configuration information and preferences) is already on the server. Keeping it there also helps in billing (should such feature be necessary) and accounting. Using standard Web browsers helps in one more respect: caching. Many Web browsers, Netscape for one, do their own caching, and do it quite efficiently and customizably. One can tell Netscape where to keep the cache, how often to purge it and how long to keep data in it (for a single session only, or for certain number of days, etc.). We are also looking into using architecture, for example, CyberDog.

a new generation

References 1. http://www.metoc.navy.mil/metoc/ 2. http://mozart.compsci.com/pictures/sowlp_compress.html

129

of Web browsers,

based

on the OpenDoc

3. Kiselyov, O. and P. Fisher, "Wavelet Compression with Feature Preservation and Derivative Definition," in Proc. DCC'92, 1992 Data Compression Conference, Snowbird, Utah, p. 442, March 24-27, 1992.

4. Kiselyov, O. and P. Fisher, "Pyramidal Image Decompositions: A New Look," in Proc. DCC'93, 1993 Data Compression Conference, Snowbird, Utah, p. 479, March 30-April 2, 1993.

5. ftp://replicant.csci.unt.edu/pub/oleg/README.html or http://replicant.csci.unt.edu/~oleg/ftp/

6. http://pubweb.parc.xerox.com/map/

N96-15459

Data Management and Scientific Integration within the Atmospheric Radiation Measurement Program

Deborah K. Gracio, Larry D. Hatfield, Kenneth R. Yates, Jimmy W. Voyles
Pacific Northwest Laboratory, Richland, Washington

Joyce L. Tichler
Brookhaven National Laboratory, Upton, New York

Richard T. Cederwall, Martin J. Leach, Mark J. Laufersweiler
Lawrence Livermore National Laboratory, Livermore, California

Paul Singley
Oak Ridge National Laboratory, Oak Ridge, Tennessee

1. Introduction

The Atmospheric Radiation Measurement (ARM) Program has been designed by the U.S. Department of Energy with the goal to improve the predictive capabilities of General Circulation Models (GCMs) in their treatment of clouds and radiative transfer effects. To achieve this goal, three experimental testbeds will collect atmospheric data, from a variety of instrumentation, that will be used to drive the GCMs. Each site, known as a Cloud and Radiation Testbed (CART), calls for the deployment of a highly redundant data collection system. The first CART site was deployed in April 1992 in the Southern Great Plains (SGP), Lamont, Oklahoma, with the other two sites to follow in September 1995 in the Tropical Western Pacific and in 1997 on the North Slope of Alaska.

Approximately 400 MB of data per day is transferred via the Internet from the SGP site to the ARM Experiment Center at Pacific Northwest Laboratory in Richland, Washington. The Experiment Center is central to the ARM data path and provides for the collection, processing, analysis and delivery of ARM data. Data are received from the CART sites from a variety of instrumentation, derived data products and external data sources. The Experiment Center processes these data streams in near real-time on a continuous basis, providing a three-month running archive while delivering data to the ARM Science Team. A primary requirement of the ARM Program is to preserve and protect all data produced or acquired. This function is performed by the ARM Archive at Oak Ridge National Laboratory, where leading edge technology is employed for the long-term storage of ARM data. The Archive also provides access to data for participants outside of the ARM Program.

The ARM Program involves a collaborative multi-disciplinary effort by teams from various DOE National Laboratories, each providing areas of expertise. This paper will discuss the collaborative methods in which the ARM teams translate the scientific goals of the Program into data products. By combining atmospheric scientists, engineers, systems engineers, and software engineers, the ARM Program has successfully designed and developed an environment where advances in understanding the parameterizations of GCMs can be made.

2. History

Planning for the ARM Program began in the fall of 1989; a description of the initial program is available in the ARM Program Plan (U.S. Department of Energy 1990). The technical approach of the ARM Program and the design of the CART sites is discussed in more detail by Stokes and Schwartz (1994).

The design of the CART data systems was part of the initial program plan (Melton et al. 1991). The plan called for a distributed environment, the CART Data Environment (CDE), which included an Archive, an Experiment Center, and a collection of data systems distributed across all of the CART facilities for the collection and processing of data. The CDE was to be implemented "based on use of existing solutions wherever possible, evolutionary implementation of functionality, and parallel implementation of independent subsystems" (Melton et al. 1991). Figure 1 shows the flow of data through the CART Data Environment.

Figure 1: ARM Data flow through the CART Data Environment.

The Southern Great Plains (SGP) CART site was the first to become operational, with instrument deployment beginning in April 1992. Figure 2 is a map of the SGP site and the instruments deployed there. The site is slightly larger than a GCM grid cell, and is approximately 350 km (N-S) by 250 km (E-W). The central facility contains the greatest number and variety of instruments. The principal objective of these experimental testbeds is to quantitatively describe the spectral radiative energy balance profile within a wide range of meteorological conditions. Measurements are taken at a finer spatial resolution than GCMs utilize. The purpose for these measurements is to improve the parameterizations of sub-grid scale processes for use in GCMs. The sites will be operated for 7-10 years, continuously collecting measurements of atmospheric radiation and associated atmospheric and surface properties (Stokes and Schwartz 1994).

3. How the Technical Infrastructure Facilitates the Scientific Activities of the ARM Science Team

Figure 2: SGP CART site with instrument locations

Figure 3: Scientific functions of the ARM infrastructure (science needs are translated into data requirements; the ARM scientific objectives are to relate radiative fluxes to temperature, water vapor, clouds and surface properties, and to develop and test GCM parameterizations for water vapor, clouds, and surface properties)

Interaction between the Data and Science Integration Team (DSIT) and the ARM Science Team (ST) is a primary mechanism that guides data collection and management within the ARM Program. This interaction leads to the translation of scientific goals and objectives into data requirements that define experiments for individual ST members. Figure 3 is a conceptual diagram of the science-related functions of the ARM technical infrastructure. The flow illustrated in the figure begins with a scientific dialogue between the ARM scientists,

collectively or individually, and the ARM infrastructure to define specifically what data are needed to accomplish the stated scientific objectives. From this point, the infrastructure implements the actions necessary to provide the required data to the scientists. Scientific feedback from users of the data is a key feature for improving the quality and usefulness of the data.

ARM is unique in its approach of interacting proactively with funded scientists. The interactive process between the DSIT and the ST is facilitated by a liaison from the DSIT, who understands the ST member's research and assists in the identification of required data. This process identifies data requirements not currently met by data within ARM. There are several actions that the ARM Program can take to obtain the required data. In some cases, the required data may be available from a source outside ARM; the acquisition and management of data from outside ARM can present difficulties that are not encountered in the planned acquisition of routine data.

Several steps are taken to enhance the observational capability of the applicable ARM Cloud and Radiation Testbed sites when data requirements point to the need for more observations. This enhancement can sometimes be as simple as the implementation of an Intensive Observation Period (IOP), where the temporal resolution of the observations is increased (e.g., more frequent radiosonde launches). Enhancement may be gained through the acquisition and

deployment of additional instrumentation at existing facilities. Finally, the development and testing of new instruments may be required, and this is done within the Instrument Development Program, a component of the ARM infrastructure.

In many cases, data requirements may be satisfied by calculations. A calculated data quantity may be defined by the merger of two or more data streams to produce an unobserved quantity (e.g., the vertical velocity calculated from the horizontal wind divergence). Algorithms used within ARM are normally developed by scientists within or outside of the Science Team.

One of the strengths of ARM is the potential for synergism through interaction and information exchange among ST members. This interaction happens most naturally around science issues and related data sets of common interest. To date, the two most active areas have been instantaneous radiative flux (IRF) and single-column modeling (SCM). The DSIT is a catalyst for developing such interactive working groups and communicating the needs of the working group to the ARM infrastructure. The identification of "showcase" data sets of common interest helps focus the attention of the ST and the infrastructure, increases the benefit of the collected data sets, and stimulates progress not possible by ST members working in isolation.

4. Data Sources and Products

ARM data are normally collected in an ongoing, continuous manner, punctuated by IOPs.

These two methods of collecting data complement each other. The "first-generation" data stream is the observations taken directly from the instrumentation in the field.

The quality of the data is assessed and documented by using, at a minimum, threshold checks (minimum, maximum, and delta). The ongoing, continuous nature of regular ARM operations requires an automatic approach in analyzing the data. To this end, the concept of Derived Products has been defined. A Derived Product is the definition of a procedure which uses existing ARM data streams or models as inputs and creates a new data stream. The procedure is run automatically and continuously as long as there is input data, and the output (called a "derived product") becomes a new data stream. Prospective derived products may be identified by any part of the program, from instrument mentors to science team members or any place in between.

There are two distinct types of derived products. The first type consists of data processing, including smoothing, interpolation, extrapolation, time synchronization of different data streams, and/or time averaging. An example of this type of product would be a procedure that applies new calibrations to existing data, thus creating a "second-generation" data stream. These algorithms are designed to either reformat the data to make it easier for ST members to use, or to reprocess the original dataset to improve the quality of the data.
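As an illustration of this first type, the following minimal sketch (not ARM code; all names are hypothetical) shows a procedure that applies a new calibration to an existing stream and attaches minimum/maximum/delta threshold flags of the kind described above, producing a "second-generation" stream:

#include <cmath>
#include <cstddef>
#include <vector>

struct Sample {
    double time;     // observation time (arbitrary units)
    double value;    // measured quantity
    bool   flagged;  // set when a threshold check fails
};

// Produce a "second-generation" stream: new calibration plus QC flags.
std::vector<Sample> deriveSecondGeneration(const std::vector<Sample>& in,
                                           double gain, double offset,
                                           double minOk, double maxOk,
                                           double maxDelta) {
    std::vector<Sample> out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        Sample s = in[i];
        s.value = gain * s.value + offset;                 // apply new calibration
        s.flagged = (s.value < minOk || s.value > maxOk);  // min/max checks
        if (i > 0) {                                       // delta check vs. previous
            double delta = std::fabs(s.value - out[i - 1].value);
            if (delta > maxDelta) s.flagged = true;
        }
        out.push_back(s);
    }
    return out;  // the derived product becomes a new data stream
}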

The second type of derived product generates new physical quantities, either from algorithms that combine two or more existing data streams, or from models driven by inputs from the ARM data streams. These algorithms or models come either from ST members who are interested in the derived data, or from within the ARM infrastructure.

Quality Measurement Experiments (QMEs) are a subset of derived products designed specifically to enhance the quality of existing ARM data streams. QMEs are part of a two-pronged effort to ensure that ARM data is of known and reasonable quality: the first approach focuses on self-consistency within a single data stream at the time of processing, while the second comes from the intercomparison of different ARM data streams. A QME compares data streams which are somehow related against a set of expectations to identify anomalies, such as inconsistent algorithms or incorrectly operated instruments, and the information it generates helps to identify the root cause of these anomalies.
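A QME of the intercomparison kind can be pictured as in the following sketch (hypothetical names, not ARM's implementation), which compares two related streams at coincident times against a simple agreement expectation and counts the violations:

#include <cstdio>
#include <map>

// time -> value for two data streams believed to measure related quantities
using Stream = std::map<long, double>;

int countAnomalies(const Stream& a, const Stream& b, double tolerance) {
    int anomalies = 0;
    for (const auto& [t, va] : a) {
        auto it = b.find(t);          // compare only coincident times
        if (it == b.end()) continue;
        double diff = va - it->second;
        if (diff < -tolerance || diff > tolerance) {
            ++anomalies;              // expectation violated: flag for review
            std::printf("anomaly at t=%ld: %f vs %f\n", t, va, it->second);
        }
    }
    return anomalies;
}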

To meet the scientific objectives of ARM using multiple instruments, it is necessary to acquire data from sources external to ARM. External data products which are currently being acquired to augment measurements at the Southern Great Plains CART site include surface meteorological observations from the Oklahoma Mesonet and from a Kansas mesonet operated by the University of Kansas, National Weather Service (NWS) surface and upper air data, observations from the NOAA Wind Profiler Demonstration Network (including RASS data as they become available), GOES satellite data, Eta and RUC model analyses from the NMC, Stage III precipitation data from the Arkansas River Basin Forecast Center, and cloud products derived from GOES and AVHRR satellite data by a group at NASA Langley. Some of these data sets are obtained in near real-time, while others are delayed until the responsible organization has had the time to perform quality assurance on the data. Some of the external data streams are already being incorporated into derived data products, and various other combinations are planned. It is anticipated that the need for external data will increase with the addition of the CART sites in the Tropical Western Pacific (TWP) and the North Slope of Alaska; arrangements to acquire GMS satellite data for the TWP site are underway.

5. Challenges

The ARM DSIT is made up of a diverse set of disciplines including electrical, mechanical, and software engineers, mathematicians, physicists, and atmospheric scientists. Members of the team are located at five different national laboratories, thus increasing the requirements for enhanced communications. Electronic mail, phone and video conferencing, and travel have become significant contributors to communication among team members.

The group responsible for designing and implementing the ARM SGP data system faced a number of challenges. Specifically, the team had to implement a complex distributed system in a relatively short amount of time, with team members located in three separate time zones.

Furthermore, the SGP site was not close to any of the laboratories at which the team worked, and since it was located in the middle of farmland, the site was started with no communication facilities. The team was designing a system which was to run continuously for an estimated 10 years and be capable of near real-time delivery of data to its ultimate users, the ARM Science Team.

The ARM Program Management developed a general method of tracking and resolving problems by means of a subgroup, the Problem Review Board (PRB), which meets via weekly conference calls. A database was developed to assist the PRB by recording all reported problems via Problem Identification Forms (PIFs) and Corrective Action Reports (CARs). In addition, the quality of data streams and the resolution of problems are documented and stored in the database in the form of Data Quality Reports (DQRs).

6. Collaboration with Researchers and Programs outside ARM

Collaboration occurs on both the individual and program level. In many instances, scientists outside ARM conduct research related to ARM scientific objectives. The exchange of scientific ideas and data is mutually beneficial to both parties. In some cases, ARM has strengthened these collaborations by identifying the scientists as adjunct ST members. In other cases, ARM has established funding arrangements for providing desired data to ARM (e.g., NASA providing satellite-derived cloud data products). Since ARM addresses a specific part of the overall problem of global climate research, the results of other research programs can provide scientific understanding, approaches, algorithms, and data that enhance certain parts of the ARM Program. Also, by combining the observational resources and scientific goals of ARM with those of other programs, certain scientific objectives can be

achieved that individual programs could not achieve on their own (e.g., ARM and SHEBA interactions in the Arctic). The density of instrumentation and the long-term period of data collection

at the ARM CART sites have attracted several programs that wish to take advantage of the ARM sites as benchmarks. Examples of collaborations include the Global Energy and Water Experiment (GEWEX), through its GCIP subprogram in the Southern Great Plains, and the TOGA-COARE program in the Tropical Western Pacific.

7. Future Direction

Translating science needs into data requirements and delivering data streams of known and reasonable quality are fundamental principles of the ARM Program. Maturity of our capability to realize these principles will enhance the scientific productivity of ARM Principal Investigators (PIs). To keep pace with the increasing capacity of ARM PIs to make progress toward the ARM programmatic objectives to improve GCMs and understand the atmospheric radiation-cloud

radiative feedback, the DSIT will maintain a vigorous policy of upgrading the software and hardware of our data system and optimizing the critical loop between our applied science and modeling efforts and those of the ARM ST. In essence, the aim is to make more high quality data, information, and derived products available to our customers. Focus groups within the DSIT are working to develop effective methods for designing derived products, for our system development activities, and for the operation and maintenance of our data environment. These focus groups, along with the infusion of new technologies and the utilization

and re-use of previously developed techniques, form the basis of an integrated approach to product development, operations, and maintenance. In particular, object-oriented techniques, meta-data tracking, an open-architecture approach, standard database relations, and the use of the World Wide Web for cross-platform display and data access tools will continue to have a positive impact on our ability to manage and deliver ARM data. In the future, we will keep the needs of our customers and the scientific objectives of the ARM Program in view as we develop new system requirements and plan the integration of new technologies into the operational data system. Understanding how well we meet those needs forms the basis of our measures of the effectiveness of the DSIT and of data access at our archive.

References

Melton, R. B., Campbell, A. P., Edwards, D. M., Kanciruk, P., and Tichler, J. L. 1991. Design of the CART Data System for the Department of Energy's Atmospheric Radiation Measurement (ARM) Program. Proceedings, American Meteorological Society.

Stokes, G. M., and Schwartz, S. E. 1994. The Atmospheric Radiation Measurement (ARM) Program: Programmatic Background and Design of the Cloud and Radiation Testbed. Bulletin of the American Meteorological Society, Vol. 75, No. 7, 1201-1221.

U.S. Department of Energy. 1990. Atmospheric Radiation Measurement Program Plan. DOE/ER-0442.

N96-15460

Object-Oriented Structures Supporting Remote Sensing Databases

Keith Wichmann
Global Science & Technology, Inc.
Code 935, NASA/Goddard Space Flight Center
Greenbelt, MD 20771
kwichma@balmure.gsfc.nasa.gov

Robert F. Cromp
Code 935, NASA/Goddard Space Flight Center
Greenbelt, MD 20771
cromp@sauquoit.gsfc.nasa.gov

Abstract

Object-oriented databases show promise for modeling the complex interrelationships pervasive in scientific domains. To examine the utility of this approach, we have developed an Intelligent Information Fusion System based on this technology, and applied it to the problem of managing an active repository of remotely-sensed satellite scenes. The design and implementation of the system is compared and contrasted with conventional relational database techniques, followed by a presentation of the underlying object-oriented data structures used to enable fast indexing into the data holdings.

1 Introduction

In the late 1990s, NASA will launch a series of satellites to study the Earth as a dynamic system. The information system required to manage data from these satellites will be one of the world's largest, requiring a capacity about 1,000 times the size of the Library of Congress. The enormous size of the data holdings and the complexity of the information system pose several challenges to computer scientists, who must apply advanced techniques from software engineering, artificial intelligence, computer graphics, and signal processing if there is any practical chance of success. A data management system should provide users the ability to locate data pertinent to their needs. Limited computational resources, however, directly impact the level of detail users can employ in formulating their queries. In an extensive remote sensing repository, it is both impractical and unnecessary to manage every individual pixel of each satellite scene. Such an extreme level of granularity is best left to individual users who can apply a geographic information system to a restricted number of images. Rather, it is better to record limited information about each scene, and use this metadata to satisfy users' queries. For a number of years, researchers in the Information Sciences and Technology Branch, Code 935, at NASA/Goddard Space Flight Center, have strived to build a fully automated Intelligent Information Fusion System (IIFS) that enables unsophisticated users to access up-to-date satellite imagery [2, 4, 9]. To enable efficient search of large data holdings, the IIFS uses novel methods for extracting image content from the data, specialized data structures for storing data acquired from all over the world, and object-oriented representations to express the complex interrelationships pervasive throughout scientific domains. These advanced techniques are transparent to the users, yet allow for fast and easy access to data which meet the users' needs.

2 Rationale Behind an Object-Oriented Approach

Until recently, designers had no real choice but to apply a relational database when confronted with managing any sizable amount of data. In the late 1980s, however, commercial vendors made it viable for system builders to consider using object-oriented databases for limited purposes, most generally in an evaluative mode. At the same time, educational institutions introduced an object-oriented philosophy in their computer science classes, resulting in a new generation of computer scientists who have been weaned on this approach.

Seemingly, object-oriented databases are now an alternative to relational databases. However, as with any new technology, it is essential that a comparison be drawn with the older, established technology.

A relational database management system (RDBMS) uses the relational algebra as described by Codd [3]. This was a calculated tradeoff from the flexibility that the CODASYL model provided. In return for a limitation on flexibility, RDBMSs provide an increase in performance because of the simplicity of the relational algebra in query processing. In order to apply the relational algebra, the data must be normalized to a well-defined form. The manual process of applying normalization can be time consuming and error prone. Furthermore, normalization can also lead to performance problems during query processing when dealing with complex relationships [10]. In applications that have simple relationships, the relational database is a good choice. However, in the remote sensing domain, the model can be quite complex to fully express the functionality of the system. In cases like these, the object-oriented database is more likely to perform better.

Baroody and DeWitt [1] have previously shown that the object-oriented approach can perform all functions of a relational database. In addition, by allowing the equivalent of pointers from a language such as C to be stored in the database, OODBMSs can store more complex structures and relationships than RDBMSs. This reduces the need for mapping data from their application storage to their database storage (sometimes referred to as the "semantic gap" or an "impedance mismatch"). In many cases, the OODBMS ties in to a host language (e.g., C++) such that there is a direct one-to-one mapping because the structures are identical. Naturally, there are tradeoffs here as well. The mapping causes an extra step whenever an object migrates into memory to set its pointers to something valid. Once an object is in memory, no additional database overhead is necessary until the transaction commits. The database can represent any in-memory data structure because of this representation strategy. While the relational database relies on associative lookups, joins and indices on key attributes, the object-oriented database can employ domain-specific structures for providing fast access directly to the object or objects of interest. In essence, the database uses structures which allow a very quick, navigational search that only involves referencing a small number of objects. RDBMSs must adhere to the table structure and as such have difficulty representing spatial data and other complex structures [5]. It is necessary to write code to convert the in-memory structure to the relational structure and vice versa [8]. OODBMSs can represent the domain-specific search structures without changes to the database server itself.

The OODBMS differs in the balance of responsibility afforded the client and the server. An RDBMS, in general, performs all processing on the server side and only communicates the results to the client. An OODBMS tends to perform all of its processing on the client side. This leaves the server to handle accessing the disk and transmitting the raw data to multiple clients, allowing the workload to be better distributed. As a compromise between the two models, hybrid databases improve over RDBMSs by storing complex objects in the nodes of a table, but do nothing to improve on the expressiveness of the relationships. Examples of hybrid databases include Illustra and Postgres. Hybrid models support SQL queries, a standard from the RDBMS community, and researchers/vendors in OODBMSs have begun to embrace this notion. While the hybrid database allows more complex objects, it does so with a cost in performance. It does provide an environment more familiar to RDBMS users, but it does not have the full capabilities of an OODBMS.

The IIFS utilizes an OODBMS for several reasons. The remote sensing domain can be very complex.
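The "semantic gap" argument can be made concrete with a small sketch (illustrative only; this is not the IIFS schema): the pointer-based object graph below is both the in-memory form and, in an OODBMS, essentially the stored form, whereas an RDBMS would decompose it into tables joined through key attributes.

#include <cstddef>
#include <string>
#include <vector>

struct Sensor;  // forward declaration

struct Platform {
    std::string name;
    std::vector<Sensor*> sensors;      // many-to-many edge stored as pointers
};

struct Sensor {
    std::string name;
    std::vector<Platform*> platforms;  // traversal needs no join or index
};

// Navigational access: follow pointers instead of joining Platform, Sensor,
// and a Platform_Sensor link table on key attributes.
std::size_t sensorCount(const Platform& p) { return p.sensors.size(); }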
The need for the system to be scalable in the number of items of data that it can handle causes several problems that an OODBMS can solve. First, the table structure of relational data requires that more data be accessed than is absolutely necessary to carry out a search, since all data in the table are grouped together on disk. As the number of objects in the database grows, the amount of data touched also increases. The OODBMS allows a search structure that facilitates accessing only data that match the search attribute for that structure. Next, the OODBMS requires less coding to convert data into the proper form for storage, leading to fewer coding errors and less development time. It is necessary to introduce transactions, but the pointers map directly into the database. Finally, the distributed nature of most OODBMSs allows exploitation of CPU cycles spread across the network.

Figure 1: The IIFS object model (classes include Element, Observation, Data Product, Browse Product, Platform, Instrument, Instr-Chan, Sensor, Channel, Temporal Index, Temporal Index Element, HyperCylinder, SQT-Node, Content Index, Content Template, Machine, Method, Algorithm, Product Preference, and Plan; element state includes Level, Type, Resolution, Form, Condition, and Process)

3 The Object Model

The IIFS employs the object model that is partially illustrated in Figure 1. The diagram is a minor diversion from standard entity relationship diagrams that incorporates the notion of inheritance. Each rectangle represents a class. A superclass is indicated by a containing rectangle. The lines indicate the relationships between the classes, and the numbers indicate the cardinality of the relationship, with a star representing "many."

The platform, sensor, instrument and channel classes represent the source of the data. The platform is the physical location on which a sensor is mounted when it gathers the data:

* In space, a satellite or the space shuttle can be a platform.
* In the atmosphere, an airplane or a balloon can be a platform.
* On the ground, a weather station can be a platform.

The sensor is the atomic unit that gathers data. Each sensor has a set of channels, each of which looks at a different, possibly overlapping, frequency range and reports back its response when the frequency of the desired area matches that of the channel. An instrument is the combination of a specific sensor on a specific platform (e.g., the Thematic Mapper is a sensor which, carried aboard LANDSAT-5, is a particular instrument). The same sensor can be on many different platforms, yet each combination is instantiated as a different instrument instance. Of particular note, a specific channel of a specific instrument is the source of an atomic unit of data.

(query (instrument LANDSAT-6 TM TM-4) (complexdate 1 1 4 85 2 4 85 4095 15 1 31))
(query (containment (latlong 33 21 35 23)) (content USGS (forest 20 40)))
(dump Content)
(ingest iifs@balmure:/mass/data LANDSAT-5 TM 5/4/90 33.2 -43.4)
(add-template LandUse (water urban agriculture rangeland forest))

Figure 2: Sample s-expressions for communicating between modules of the IIFS

There are three other major attributes used to search for the data: temporal, spatial, and content. The temporal search uses hypercylinders, the temporal index, and sets of temporal index element objects. The spatial search employs sphere quadtrees (SQTs) and sphere quadtree index nodes [6]. The content search applies the

content index, content template and content object classes. Each of these has the capability of reporting back the set of data that corresponds with the attribute value. These search structures extract elements from the database that match the given attributes.

The primary data structure in the IIFS is an element. Elements in the database are either observations directly from an instrument, or products derived from the original raw data (observation). Since algorithms produce the derived data, the stored information about the routines includes the machines each can run on, the accounts needed to access the data and executable program, any necessary parameters, and a record of applied algorithms and their order of application. In this way, if a better algorithm develops, it is possible to regenerate data products that used the old algorithm.

The storage of algorithms allows the implementation of semantic caching. Since storage space will surely never be large enough to store everything, the database will evaluate the data to see what can be deleted safely. The IIFS can recalculate products when needed if the database stores all of the original data as well as all algorithms. Therefore, a good choice would be to remove the least recently used data product from the archive. The system then marks the element as not available. The history facilitates the regeneration of the data by reapplying the algorithms in order.
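A minimal sketch of such a policy (assumed names; not the IIFS implementation) keeps an LRU list of derived products and marks an evicted product as unavailable rather than deleting its lineage:

#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

struct Element {
    bool available = true;
    std::vector<std::string> history;  // algorithms applied, in order
};

class SemanticCache {
    std::list<std::uint32_t> lru_;     // front = most recently used
    std::unordered_map<std::uint32_t, std::list<std::uint32_t>::iterator> pos_;
public:
    void touch(std::uint32_t id) {
        auto it = pos_.find(id);
        if (it != pos_.end()) lru_.erase(it->second);
        lru_.push_front(id);
        pos_[id] = lru_.begin();
    }
    // Evict the least recently used product; its data can be rebuilt later by
    // reapplying element.history to the archived raw observations.
    bool evict(std::unordered_map<std::uint32_t, Element>& db) {
        if (lru_.empty()) return false;
        std::uint32_t victim = lru_.back();
        lru_.pop_back();
        pos_.erase(victim);
        db[victim].available = false;  // marked not available, not deleted
        return true;
    }
};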

The IIFS also relies on a planner, scheduler and dispatcher to interact with the data stored in the

database. The planner generates the necessary steps to produce data products that do not currently exist or to process newly ingested data sets to get the associated metadata, and in this vein, it acts as a transaction manager. The browse product preferences in the object model list the algorithms that should run on all new data to generate the metadata. The planner incorporates policies that the user provides, such as "accuracy is preferable to shorter compute time," in order to generate the desired product. The scheduler accounts for the resources available in the system and arranges for simultaneous requests to be processed, producing an estimate of the time required. The dispatcher implements that schedule and monitors the progress of each process.

The entire implementation is in C++. An interface to the outside world is necessary so that all programs that utilize the database do not have to be clients of the database itself. A server exists that accepts a specialized language of s-expressions that describes the query and searches the database on the behalf of the program. Remote procedure calls (RPCs) carry the s-expression language to the server and the results back to the client. The language allows querying of the elements in the database, ingesting of new data, ingesting of new content templates and the extraction of extent information. Multiple interfaces can share this common interface into the database. Figure 2 shows some examples of the s-expression language.

4 Query Objects

The object-oriented domain also allows a useful encapsulation of the query, as illustrated in Figure 3. In the IIFS, a query is simply a list of query elements. A query component is an abstract class used to encapsulate the processing of a query and the retrieval of the representation of the query for passing to a query optimizer. It has subclasses that represent each attribute type used for a search, such as temporal, spatial, source and content searches. In addition, there is an abstract subclass called query connectors used to provide logical connectives such as AND and OR. Each connector has a list of the query components that it is connecting, and any element of that list can be another query connector. A connector object can also be one of the items in the list of query components maintained by a different query connector, thus allowing arbitrary combinations of query components. A query assumes a logical AND connector for each of its elements.

Figure 3: Query objects (QueryComponent with spatial, temporal, source and content subclasses; QueryConnector with AND and OR connectors; Query and Standing Query)
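The composite arrangement of components and connectors can be sketched as follows (a minimal illustration with assumed names, not the IIFS class definitions): each component evaluates itself to a set of element ids, and connectors combine child results, so connectors can nest to arbitrary depth.

#include <cstdint>
#include <memory>
#include <set>
#include <vector>

using IdSet = std::set<std::uint32_t>;

struct QueryComponent {
    virtual ~QueryComponent() = default;
    virtual IdSet evaluate() const = 0;  // e.g. run a spatial or temporal search
};

struct AndConnector : QueryComponent {
    std::vector<std::unique_ptr<QueryComponent>> parts;  // may hold connectors
    IdSet evaluate() const override {
        IdSet acc;
        bool first = true;
        for (const auto& p : parts) {
            IdSet s = p->evaluate();
            if (first) { acc = std::move(s); first = false; continue; }
            IdSet next;
            for (std::uint32_t id : acc)   // keep ids present in both sets
                if (s.count(id)) next.insert(id);
            acc = std::move(next);
        }
        return acc;
    }
};

struct OrConnector : QueryComponent {
    std::vector<std::unique_ptr<QueryComponent>> parts;
    IdSet evaluate() const override {
        IdSet acc;
        for (const auto& p : parts) {      // union of all child results
            IdSet s = p->evaluate();
            acc.insert(s.begin(), s.end());
        }
        return acc;
    }
};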

The IIFS also allows specification of a query as a standing query. This is a query that executes at some specified interval automatically. The results are emailed to a list of users until the expiration date supplied with the original query. Each mail message also includes the information about the query and a unique identifier for specifying a query to remove from the system. The system requires this extra information at the creation of the query. The natural mapping of in-memory structures to database structures means that there is no additional work to store this in the database other than maintaining the list that has all the standing queries in it.

Each query component can return a string that the OODBMS query optimizer can use as input for a search. The OODBMS generic query processor does not know how to use the navigational search structures built into the database, so it must rely on B-trees, hash tables and simple iteration over the existing objects to determine the query result. Each component can also return the set of data that matches the attributes that it describes by utilizing the navigational search structures described in detail below. This flexibility allows quick implementation of a variety of search techniques and becomes important in the section on multiple attribute searches.

4.1 Spatial Searches

The IIFS employs sphere quadtrees (SQTs) as the spatial search structure [7]. The SQT, in addition to providing a logarithmic search, better represents the globe by not introducing artifacts at the poles, as

is the case with latitude/longitude-based structures. The SQT uses an icosahedron (a 20-faced polyhedron) with triangles for each face as its basis. Each face subdivides into four sub-triangles (trixels) recursively. To achieve a resolution of one kilometer over the face of the Earth requires building an SQT to 14 levels, implying that any cell of this tiling can be indexed by answering 14 four-way questions, or 28 yes-no questions. Furthermore, the resulting geometry uses all integer math, which produces much faster results than the floating point calculations that would be encountered using latitudes and longitudes. Each trixel that has no children (leaf trixel) maintains a collection of the elements that fall in its region. The trixel splits into four new subtrixels when there are too many items in the bin. The system then inserts the original trixel's elements into the appropriate sub-trixel's bin, where the bin size limit can be a fixed number or can vary depending on the depth of the tree.
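The following sketch suggests how such a trixel tree might look (an illustration under assumed names, not the IIFS code); it models a single icosahedron face, packs one 2-bit child index per level, and splits a leaf bin when it overflows:

#include <array>
#include <cstdint>
#include <memory>
#include <vector>

struct Entry { std::uint32_t path; std::uint32_t id; };  // path: 2 bits/level

struct Trixel {
    int depth = 0;
    std::vector<Entry> bin;                        // elements in this region
    std::array<std::unique_ptr<Trixel>, 4> child;  // four sub-triangles
    bool isLeaf() const { return !child[0]; }
};

constexpr int kMaxDepth = 14;          // ~1 km cells on the Earth's surface
constexpr std::size_t kBinLimit = 32;  // could instead vary with depth

int childDigit(std::uint32_t path, int depth) {
    return (path >> (2 * (kMaxDepth - 1 - depth))) & 3;  // next 2-bit digit
}

void insert(Trixel& t, Entry e) {
    if (!t.isLeaf()) {                 // descend toward the matching subtrixel
        insert(*t.child[childDigit(e.path, t.depth)], e);
        return;
    }
    t.bin.push_back(e);
    if (t.bin.size() <= kBinLimit || t.depth == kMaxDepth) return;
    for (auto& c : t.child) {          // overflow: split into four subtrixels
        c = std::make_unique<Trixel>();
        c->depth = t.depth + 1;
    }
    std::vector<Entry> old;
    old.swap(t.bin);                   // redistribute into the new children
    for (const Entry& e2 : old) insert(*t.child[childDigit(e2.path, t.depth)], e2);
}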

Figure 4: IIFS content index section (three attributes of a land use/land cover template — Forest, Urban, Water — each divided into ten bins covering ten-percent ranges, with entries pointing to elements having the given properties)

4.2 Temporal Searches

Each piece of data described in the IIFS has a reference in the temporal index as well. The structure consists of an indexed list of Julian days, each with a list of the elements that fall on that day. The IIFS creates the day object only at the insertion of the first element with that date. A B-tree indexes the days so that range searches perform very quickly. The system performs a variety of temporal searches down to the granularity of a day. Granularities less than a day require accessing the actual data.
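A minimal sketch of this index (illustrative names; std::map stands in for the B-tree) might look as follows:

#include <cstdint>
#include <map>
#include <vector>

using JulianDay = long;
using TemporalIndex = std::map<JulianDay, std::vector<std::uint32_t>>;

void addElement(TemporalIndex& idx, JulianDay day, std::uint32_t id) {
    idx[day].push_back(id);  // operator[] creates the day entry if absent
}

// Range search, inclusive: all elements whose day lies in [first, last].
std::vector<std::uint32_t> query(const TemporalIndex& idx,
                                 JulianDay first, JulianDay last) {
    std::vector<std::uint32_t> hits;
    for (auto it = idx.lower_bound(first);
         it != idx.end() && it->first <= last; ++it)
        hits.insert(hits.end(), it->second.begin(), it->second.end());
    return hits;
}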

4.3 Content Searches

Searches based on content help fulfill two major requirements of the IIFS. First, the user should be able to search for data in terms that are familiar to their own domain. The system accomplishes the first by requiring algorithms to interpret data into that domain. The second major requirement is that the IIFS must provide the capability to return the smallest possible group. As the number of objects in the system grows, the number of objects returned by a given query will also increase. If no additional attributes are available to search on, this requirement will fail. The IIFS system allows the introduction of various content templates. Each template consists of a name for the template and a name for each attribute of the template. The content object stores the data as a vector of numbers that correspond to the order of the attributes in the template, representing the percentage of that attribute in the data. For instance, a template might be used for land cover. Its attributes might be agriculture, urban, water, forest, rangeland and barren, with the vector representing what percentage of the data represents each of those classes. The database maintains a list of all templates. Each template lists all elements that have had its classification algorithm run.

Another step in precomputing all possible queries would be to break up the range of values (typically 0-100 percent) into bins. The bin points to the data that fall in that range for each attribute in the template. This would require more storage, but would produce quicker results during a query. Figure 4 shows a partial example. It shows three of the attributes in a land use/land cover template divided into ten bins, each representing a ten percent range. There are entries pointing to some of the elements in the system that have the given properties. By dynamically adjusting the bin sizes and adding new bins as they reach a given limit, the search can always be logarithmic. This allows the addition of new templates at any time, but the system will reexamine old data sets to classify them using the new template. This could be a very costly process, but flexibility can be added to the system to dictate when that processing should occur.
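The binning idea can be sketched as follows (hypothetical names and a fixed ten-bin layout, whereas the text notes the bins could instead adjust dynamically):

#include <array>
#include <cstdint>
#include <string>
#include <vector>

struct ContentTemplate {
    std::string name;                     // e.g. "LandUse"
    std::vector<std::string> attributes;  // e.g. water, urban, forest, ...
    // bins[attribute][percent / 10] -> element ids in that percentage range
    std::vector<std::array<std::vector<std::uint32_t>, 10>> bins;

    ContentTemplate(std::string n, std::vector<std::string> attrs)
        : name(std::move(n)), attributes(std::move(attrs)), bins(attributes.size()) {}

    // 'percent' is assumed to be parallel to 'attributes'.
    void add(std::uint32_t id, const std::vector<double>& percent) {
        for (std::size_t a = 0; a < attributes.size(); ++a) {
            int b = static_cast<int>(percent[a] / 10.0);
            if (b > 9) b = 9;             // 100% falls into the top bin
            bins[a][b].push_back(id);
        }
    }

    // Elements whose attribute 'a' lies in [lo, hi) percent, bin-aligned.
    std::vector<std::uint32_t> query(std::size_t a, int lo, int hi) const {
        std::vector<std::uint32_t> hits;
        for (int b = lo / 10; b < (hi + 9) / 10 && b < 10; ++b)
            hits.insert(hits.end(), bins[a][b].begin(), bins[a][b].end());
        return hits;
    }
};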

4.4 Source Searches

The source in the IIFS models the platform and sensor that it came from. This is an excellent example of where the complexity of the domain is apparent. There is a many to many relationship between platform and sensor that expresses very easily in OODBMSs but requires a mapping to an extra table in RDBMSs. The platform describes the physical location of the instrument used to get the data. The sensor describes the physical properties of the device, including the wavelengths to which it responds. The same sensor can be on many different platforms, and one platform can have many sensors on it. The search structure is identical to this description. Since there are not many platforms, that part of the search executes very quickly, but a hash function minimizes the search time. Each sensor maintains a list of data that it produced. Each piece of data maintains a list of channels that are missing. This list is normally empty and searches of the list are not necessary. This implies that the sensor gathers all channels simultaneously. 4.5

Multiple

Attribute

Searches

A single attribute search is not expressive enough for most searches. The IIFS allows multiple attribute searches using any combination of attributes. Figure 5 shows the four major search structures with selected connections to the elements. Each attribute as represented by a query element does its own search, each of which is logarithmic. The system then combines the results conjunctively or disjunctively as declared. More elegant strategies are being implemented. The first strategy is to evaluate which attribute will produce the smallest return set. This implies that there exists a quick way to determine the approximate number of items returned, which is not an easy problem. Obviously, the search stops if any attribute returns zero elements. Furthermore, by performing the intersection on the two smallest sets, the system performs the least amount of work. This also reduces the number of comparisons in the following stages. The second strategy involves employing the query mechanism of the database. The database can also apply this in addition to the first strategy, but does not require it. Once the search produces a set that is less than some threshold value, the database's query mechanism iterates over each remaining element testing each member for the remaining attributes simultaneously. Indices will not exist previously because the system created the set dynamically. However, the database could create an index on the set if it is beneficial.
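The smallest-set-first strategy might be sketched as follows (illustrative only; each IdSet is assumed to be the result of one attribute's own logarithmic search):

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

using IdSet = std::set<std::uint32_t>;

IdSet conjunction(std::vector<IdSet> results) {
    if (results.empty()) return {};
    // Smallest sets first; an empty set lets us stop immediately.
    std::sort(results.begin(), results.end(),
              [](const IdSet& a, const IdSet& b) { return a.size() < b.size(); });
    IdSet acc = results.front();
    for (std::size_t i = 1; i < results.size() && !acc.empty(); ++i) {
        IdSet next;
        std::set_intersection(acc.begin(), acc.end(),
                              results[i].begin(), results[i].end(),
                              std::inserter(next, next.begin()));
        acc.swap(next);  // intersecting small sets first minimizes work
    }
    return acc;
}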

Figure 5: IIFS search indices (the four major search structures with selected connections to the elements)

4.6 Correlated Queries

A correlated query is a query in which the qualification for membership in the result set is a relationship (correlation) to another element in the result set. This kind of query can never return a single hit because a relationship requires two elements. The returned elements will have a relationship with another element in the set. For instance, the query might be for data covering locations where two different sensors passed over in the same hour. This capability allows searches for relationships in the data, and is a very powerful addition to simple searches for attributes. The above search breaks down into three stages. The first stage is a query for the first set of eligible elements from the first platform. The second stage is for the eligible elements from the other platform. The third and final stage tests for the existence of the relationship between the two sets. The IIFS creates and fills in a special bin structure with two bins for each time slot. The reason for two bins

is to keep hits from the different platforms separate. It could use a single bin, but then it would have to search to see if there are entries from both to determine if the contents of that bin are a hit. By separating

them, two non-empty bins from the same time period constitute a hit and provide an ordering for the hits. Figure 6 illustrates the structure used. In the case of spatially-correlated queries, the sphere quadtree acts as the bin structure. The IIFS creates two sphere quadtrees and populates them with data from each query. Similarly, the database can correlate queries by source or content. Finally, the system might implement multiply correlated queries. One example would be a search for data from two different platforms that cover the same area and occur in the same period. Also, more complex relationships than equality or proximity would be useful. This would enable a query like two images taken three months apart over the same area.
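A sketch of the two-bin structure for a temporally-correlated query follows (names and the choice of an hourly slot key are assumptions for illustration):

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct SlotBins {
    std::vector<std::uint32_t> bin1;  // hits from the first platform
    std::vector<std::uint32_t> bin2;  // hits from the second platform
};

using CorrelationBins = std::map<long, SlotBins>;  // hour -> two bins

// Ordered (by time slot) pairs of element lists that satisfy the correlation.
std::vector<std::pair<long, const SlotBins*>> hits(const CorrelationBins& bins) {
    std::vector<std::pair<long, const SlotBins*>> out;
    for (const auto& [slot, sb] : bins)
        if (!sb.bin1.empty() && !sb.bin2.empty())  // both platforms present
            out.emplace_back(slot, &sb);
    return out;
}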

4.7 Distributed Queries

Distribution of the data can increase the throughput of the system and decrease response time for a given query. In the IIFS, the spatial search structure has some properties that lend themselves to distribution. Each of the twenty faces of the icosahedron can be distributed to separate CPU nodes. If any machine node approached a predefined saturation level, its data could be redistributed to other nodes. The IIFS maintains a master index so that the distribution system could pass queries to the appropriate machine for processing. Throughput increases because independent queries of physically separated areas execute simultaneously. Response time decreases for queries that cover more than one face of the icosahedron because the search for the remaining attributes can occur in parallel.
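A sketch of the face-to-node master index (names assumed for illustration):

#include <array>
#include <cstddef>
#include <string>
#include <vector>

struct MasterIndex {
    std::array<std::string, 20> nodeForFace;  // icosahedron face id -> host

    // Route a query to the set of nodes whose faces it touches; those nodes
    // can then search their local SQTs in parallel.
    std::vector<std::string> route(const std::vector<int>& faces) const {
        std::vector<std::string> nodes;
        for (int f : faces)
            nodes.push_back(nodeForFace[static_cast<std::size_t>(f)]);
        return nodes;
    }
};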

Figure 6: Bin structure for a correlated query (each date slot, e.g. 8/4/70, holds two bins, one per sub-query)

5 Current Capabilities

The IIFS in its current form supports queries on the four basic attributes described above: temporal, spatial, source and content. The sphere quadtree stores the center point of each image and allows logarithmic access to those points. The hypercylinder data structure is currently being implemented and will enable the IIFS to return data sets that more accurately match the user's area of interest. The IIFS implements most of the structures discussed. Notable exceptions are all types of correlated queries except temporal, and the algorithm-related classes. The system uses the techniques described for distribution to segment the database internally, but no work has been performed on distributing the SQT over multiple CPU nodes. Instrumentation of the database is currently in progress to produce base figures for evaluating changes to algorithms and storage structures. Ingesting large amounts of random data into the database tests the scalability of the system. The system is implemented on a Hewlett Packard 735 workstation using ObjectStore from Object Design Incorporated. A graphical user interface developed in Tcl/Tk implements a front-end to the system for providing queries and displaying results. It communicates with the database using the s-expression language over an RPC connection. The IIFS also includes an interface using the World Wide Web as the front end, but it is not quite as flexible. Active research is progressing on techniques for extracting useful content information from remote sensing imagery to capitalize on the content searching capability of the system.

6 Conclusions and Future Work

The IIFS system is serving as an active development and research environment for testing scalability and evolvability goals. The base system is nearing completion, allowing for future studies of new data structures and algorithms as well as the effect of minor changes in the current implementation. The current prototype shows that the techniques implemented are sound and can provide the basis for a more complete system. Future work includes the extension of the spatial search structure to handle the differences in the granularity of the data stored in the database, additions to the content based search structure to increase its performance, and the examination of a number of techniques for handling multiple attribute queries. Furthermore, the system will examine and implement additional types of correlated queries.

Acknowledgments

We would like to thank Bill Campbell (NASA/GSFC), Nick Short (NASA/GSFC), Gyuri Fekete (Hughes), Larry Roelofs (GST), and Walt Ligon (Clemson U.) for their contributions to this work.

References

[1] A. J. Baroody and D. J. DeWitt. An object-oriented approach to database system implementation. ACM Transactions on Database Systems, 6(4):576-601, 1981.

[2] W. J. Campbell, S. E. Hill, and R. F. Cromp. Automatic labeling and characterization of objects using artificial neural networks. Telematics and Informatics, 6(3-4):259-271, 1989.

[3] E. Codd. A relational model for large shared data banks. Communications of the ACM, 13(6):377-387, 1970.

[4] R. F. Cromp, W. J. Campbell, and N. M. Short, Jr. An intelligent information fusion system for handling the archiving and querying of terabyte-sized spatial databases. In International Space Year Conference on Earth and Space Science Information Systems, pages 586-597. American Institute of Physics, 1993.

[5] R. F. Cromp and S. Crook. An intelligent user interface for browsing satellite data catalogs. Telematics and Informatics, 6(3/4):299-312, 1989.

[6] R. F. Cromp and E. Dorfman. A spatial data handling system for retrieval of images by unrestricted regions of user interest. Telematics and Informatics, 9(3/4):221-241, 1992.

[7] G. Fekete. Rendering and managing spherical data with sphere quadtrees. In Proceedings of the First IEEE Conference on Visualization, San Francisco, California, October 1990.

[8] M. Loomis. Object Databases: The Essentials. Addison-Wesley Publishing Co., 1995.

[9] N. M. Short, Jr., R. F. Cromp, W. J. Campbell, J. C. Tilton, J. L. LeMoigne, G. Fekete, N. S. Netanyahu, W. Ligon, and K. Wichmann. Mission to planet Earth: AI views the world. IEEE Expert, to appear.

[10] N. M. Short, Jr. and S. L. Wattawa. The second generation intelligent user interface for the Crustal Dynamics data information system. Telematics and Informatics, 5(3):253-268, 1988.

for the Crustal

REPORT Public

reporting

gathering collection Davis

burden

and

for

this

mainlaining

of

the

information,

Highway,

1204,

USE

of

needed,

including

Suite

1. AGENCY

collection data

DOCUMENTATION information and

suggestions

Arlington,

ONLY

for

VA

is

this and

blank)

to

and

reducing

22202-4302,

(Leave

estimated

compleling

to

average

reviewing

the

burden, the

to

Office

of

2. REPORT

AND

1

hour

per

collection

response, of

Washington

and

Budgel,

the

Send

Headquarters

Management

time

for

comments

Services,

reviewing

for

Reduction

TYPE

Conference

this

searching

burden

Information

Project

and

DATES

Management

any

data

other

sources,

aspect

Reports,

Washington,

of

1215

DC

this

Jefferson

20503.

COVERED

- October

S. FUNDING

Information

or

Operations

(0704-0188),

AND

existing

estimate

Publication

SUBTITLE

1995 Science

instructions,

regarding

Directorale

Paperwork

3. REPORT

1995

Approved No, 0704-0188

For_ OMB

including

information.

DATE

October 4. TITLE

PAGE

26-27,

1995

NUMBERS

and Data Compression

Workshop

Code

930

6. AUTHOR(S)

James

C. Tilton,

7. PERFORMING

Editor

ORGANIZATION

Goddard

Space

Greenbelt,

NAME(S)

Flight

95B00134

AGENCY

Aeronautics

Washington,

8. PEFORMING ORGANIZATION REPORT NUMBER

(ES)

20771

/ MONITORING

National

ADDRESS

Center

Maryland

9. SPONSORING

AND

NAME(S)

and Space

AND

ADDRESS

10. SPONSORING / MONITORING AGENCY REPORT NUMBER

(ES)

Administration NASA

DC 20546-0001

11. SUPPLEMENTARY

CP-3315

NOTES

Tilton: Goddard Space Flight Center, Greenbelt, Maryland. This workshop was organized and sponsored by the National Aeronautics and Space Administration (NASA) in cooperation with the Washington/Northern Virginia Chapter of the Institute of Electronics Engineers (IEEE) Geoscience and Remote Sensing Society. Support was received from the Office of Mi_inn 12a.

tn Planet

F, arlh_

DISTRIBUTION

Unclassified Subject

NASA

STATEMENT

12b. DISTRIBUTION

CODE

- Unlimited Category

Availability: 13. ABSTRACT

H_adalmrters.

/AVAILABILITY

59

NASA (Maximum

This document Workshop,"

CASI

200

(301)

621-0390.

words)

is tne proceedings

from the Science

Management

which

was held on October The Workshop

collection, missions.

ingestion, archival and retrieval of large quantities of data in future Earth and space science It consisted of fourteen presentations covering a range of information management and data

space

approaches

science

was organized

14. SUBJECT

Data

17.

by James

systems,

C. Tilton

integrated

or that hold promise

and Robert

F. Cromp

Space

Earth

Flight

approaches

into actual

or prototypical

for such an application.

of the NASA

Center,

for handling

Goddard

Space

Earth

Image

Compression,

Information

Management,

Flight

16. PRICE

18.

SECURITY CLASSIFICATION OF THIS PAGE

Unclassified

19.

SECURITY CLASSIFICATION OF ABSTRACT

or

Center.

OF PAGES

162

Space

Science

the

The Workshop

15. NUMBER

SECURITY CLASSIFICATION OF REPORT

7Kdn.n1-9_n-_OO

or have been

computational

TERMS

Unclassified '_1¢_'1_1

data information

Compression,

Science,

that are being

promising

Goddard

Compression

Maryland.

explored

1995, at the NASA

and Data

Greenbelt,

compression

26-27,

Information

CODE

20. LIMITATION

OF ABSTRAC1

UL

Unclassified |naara

Porto z_o

u-tev,

z-uuj