NASA Conference Publication 3315

1995 Science Information Management and Data Compression Workshop

Edited by
James C. Tilton
Goddard Space Flight Center
Greenbelt, Maryland

Proceedings of a workshop sponsored by the National Aeronautics and Space Administration in cooperation with the Washington/Northern Virginia Chapter of the Institute of Electrical and Electronics Engineers Geoscience and Remote Sensing Society and held at NASA Goddard Space Flight Center, Greenbelt, Maryland, October 26-27, 1995

National Aeronautics and Space Administration
Goddard Space Flight Center
Greenbelt, Maryland
1995
This publication is available from the NASA Center for AeroSpace Information, 800 Elkridge Landing Road, Linthicum Heights, MD 21090-2934, (301) 621-0390.
FOREWORD

The Science Information Management and Data Compression Workshop was held on October 26-27, 1995, at the NASA Goddard Space Flight Center, Greenbelt, Maryland. This NASA Conference Publication serves as the proceedings for the workshop. The workshop was organized by the Information Sciences Technology Branch, Space Data and Computing Division of the NASA Goddard Space Flight Center, and was supported by the Office of Mission to Planet Earth, NASA Headquarters. The workshop was held in cooperation with the Washington/Northern Virginia Chapter of the Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote Sensing Society.

The goal of the Science Information Management and Data Compression Workshop was to explore promising computational approaches for handling the collection, ingestion, archival and retrieval of large quantities of data in future Earth and space science missions. It consisted of fourteen presentations covering a range of information management and data compression approaches that are being or have been integrated into actual or prototypical Earth or space science data information systems, or that hold promise for such an application. Papers were selected from those submitted in response to a widely distributed Call for Papers. The fourteen papers were presented in three sessions, and discussion was encouraged by scheduling ample time for each paper. The workshop was organized by James C. Tilton and Robert F. Cromp of the NASA Goddard Space Flight Center.

Workshop Organizers

James C. Tilton
Mail Code 935
NASA GSFC
Greenbelt, MD 20771
phone: (301) 286-9510
FAX: (301) 286-1776
Internet: [email protected]

Robert F. Cromp
Mail Code 935
NASA GSFC
Greenbelt, MD 20771
phone: (301) 286-4351
FAX: (301) 286-1776
Internet: [email protected]
CONTENTS

A Spatially Adaptive Spectral Re-ordering Technique for Lossless Coding of Hyper-spectral Images .......... 1
    Nasir D. Memon, Northern Illinois University, DeKalb, IL, USA; and Nikolas Galatsanos, Illinois Institute of Technology, Chicago, IL, USA

A Comparison of Model-Based VQ Compression with other VQ Approaches .......... 13
    Mareboyana Manohar, Bowie State University, Bowie, MD, USA; and James C. Tilton, NASA Goddard Space Flight Center, Greenbelt, MD, USA

Remotely Sensed Image Compression Based on Wavelet Transform .......... 23
    Seong W. Kim, Heung K. Lee, Soon D. Choi, Korea Advanced Institute of Science and Technology, Taejon, Korea; and Kyung S. Kim, Korea Institute of Science and Technology, Taejon, Korea

A Packet Data Compressor .......... 35
    Mitchell R. Grunes, Allied Signal Technical Services Corp.; and Junho Choi, Naval Research Laboratory, Washington, DC, USA

Reduction and Coding of Synthetic Aperture Radar Data with Fourier Transforms .......... 45
    David G. Tilley, Elkridge, MD, USA

Selecting a General-Purpose Data Compression Algorithm .......... 55
    Gary Jason Mathews, NASA Goddard Space Flight Center, Greenbelt, MD, USA

Compression Research on the REINAS Project .......... 65
    Glen Langdon, Jr., Alex Pang, Craig M. Wittenbrink, Eric Rosen, William Macy, Bruce R. Montague, Carles Pi-Sunyer, Jim Spring, David Kulp, Dean Long, Bryan Mealy and Patrick Mantey, Baskin Center for Computer Engineering & Information Sciences, University of California-Santa Cruz, Santa Cruz, CA, USA

KRESKA: A Compression System for Small and Very Large Images .......... 75
    Krystyna W. Ohnesorge and Rene Sennhauser, MultiMedia Laboratory, University of Zürich, Zürich, Switzerland

Alternate Physical Formats for Storing Data in HDF .......... 91
    Mike Folk and Quincey Koziol, National Center for Supercomputing Applications, University of Illinois, Champaign-Urbana, IL, USA

Context-Based Retrieval of Remote Sensed Images Using a Feature-Based Approach .......... 103
    Asha Vellaikal and Son Dao, Information Sciences Lab, Hughes Research Laboratories, Malibu, CA, USA; and C.-C. Jay Kuo, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA, USA

Machine Learning for a Toolkit for Image Mining .......... 115
    Richard L. Delanoy, Massachusetts Institute of Technology / Lincoln Laboratory, Lexington, MA, USA

Dissemination of Compressed Satellite Imagery within the Navy SPAWAR Central Site Product Display Environment .......... 123
    Oleg Kiselyov and Paul Fisher, Computer and Information Sciences, Inc., Denton, TX, USA

Data Management and Scientific Integration within the Atmospheric Radiation Measurement Program .......... 131
    Deborah K. Gracio, Larry D. Hatfield, Kenneth R. Yates and Jimmy W. Voyles, Pacific Northwest Laboratory, Richland, WA, USA; Joyce L. Tichler, Brookhaven National Laboratory, Upton, NY, USA; Richard T. Cederwall, Mark J. Laufersweiler, Marion J. Leach, Lawrence Livermore National Laboratory, Livermore, CA, USA; and Paul Singley, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Object-Oriented Structures Supporting Remote Sensing Databases .......... 139
    Keith Wichmann, Global Science and Technology, Greenbelt, MD, USA; and Robert F. Cromp, NASA Goddard Space Flight Center, Greenbelt, MD, USA
N96-15447

A Spatially Adaptive Spectral Re-Ordering Technique for Lossless Coding of Hyper-spectral Images

Nasir D. Memon
Department of Computer Science
Northern Illinois University
DeKalb, IL 60115
[email protected]

Nikolas Galatsanos
Electrical Engineering
Illinois Institute of Technology
Chicago, IL
[email protected]
Abstract

In this paper, we propose a new approach, applicable to lossless compression of hyper-spectral images, that alleviates some limitations of linear prediction as applied to this problem. According to this approach, an adaptive re-ordering of the spectral components of each pixel is performed prior to prediction and encoding. This re-ordering adaptively exploits, on a pixel-by-pixel basis, the presence of inter-band correlations for prediction. Furthermore, the proposed approach takes advantage of spatial correlations, and does not introduce any coding overhead to transmit the order of the spectral bands. This is accomplished by using the assumption that two spatially adjacent pixels are expected to have similar spectral relationships. We thus have a simple technique to exploit spectral and spatial correlations in hyper-spectral data sets, leading to compression performance improvements as compared to our previously reported techniques for lossless compression. We also look at some simple error modeling techniques for further exploiting any structure that remains in the prediction residuals prior to entropy coding.

1 Introduction

Recent years have seen a tremendous increase in the generation, transmission and storage of multi-spectral images. With the advent of high spectral resolution images, known as hyper-spectral images, the need for efficient compression techniques that exploit the large number of spectral bands in such data has become increasingly important. For example, the High-Resolution Imaging Spectrometer (HIRIS), designed to acquire images within the Earth Observing System (EOS) scheduled to be launched by NASA, images 192 spectral bands at 30m spatial resolution, resulting in data rates of more than 200 Megabits per second. Although much work has been done on compression algorithms for 2-D images, algorithms that exploit the special nature of multi-spectral images have started emerging only recently [6, 13]. More specifically, as noticed by researchers in other image processing areas, the spectral correlation in such images cannot be assumed to be stationary in nature. In other words, the correlation between bands i and j cannot be assumed to be the same as that between bands i+k and j+k (see for example [4]). Hence compression techniques based on a stationarity assumption will yield sub-optimal results when applied to hyper-spectral data sets.

Very powerful tools, such as transform coding and vector quantization, have been used for exploiting spatial and spectral redundancies in lossy compression of multi-spectral images [13, 12, 5]. However, because of the difference in goals, techniques used for lossy compression are not directly applicable to lossless compression, and the approaches used for capturing spectral correlations in the lossy setting cannot be extended in a straight-forward manner to the lossless context. For lossless compression, linear predictive techniques and context based techniques are the most popular; for a survey of such techniques, see [11] and [8]. Context based techniques, unfortunately, are hard to extend to effectively capture both spatial and spectral correlations: the number of contexts needed grows exponentially with the size of the context, and given the large alphabet size of hyper-spectral data (10 to 16 bits), the implementation of such a scheme becomes prohibitively expensive. Linear predictive techniques face a similar problem, albeit admitting a more tractable implementation.

Due to the non-stationary nature of spectral correlations in hyper-spectral data, the question arises as to which band(s) can most effectively be used for predicting the intensity values in a given band. It has been observed in [6] that bands which are spectrally far apart can be highly correlated, so significant benefits can be obtained by re-ordering the spectral bands prior to prediction. The problem of computing an optimal ordering of the spectral bands can be formulated in a graph theoretic setting, admitting an O(n^2) solution for an n-band image (see, for example, Rao et al. [12] and [9, 14]). One major limitation of this approach is that it is two-pass: the optimal ordering and the corresponding optimal prediction coefficients are computed in one pass through the entire image, and then transmitted, along with the prediction residuals, as overhead. Another limitation is the implicit assumption that spectral relationships do not vary spatially, whereas in fact the spectral composition of an image changes spatially, depending on the objects being imaged. An ordering and a set of prediction coefficients optimized for the entire image will therefore be sub-optimal for individual regions. In [14], it was observed that significant improvements can be obtained by partitioning the image into blocks and computing an optimal ordering on a block-by-block basis. However, large blocks fail to capture the spatially changing nature of the spectral relationships, while small blocks increase the overhead needed to keep transmitting the orderings and coefficients; indeed, the compression obtained with a reasonably large block size was almost the same
to that obtained with a smaller block. Furthermore, they alsofound that using the previous band for prediction givesresults that are very closeto those obtained after computing an optimal ordering. In this paper, we proposea new approach, applicable to losslesscompressionof hyperspectral images.According to this approach,an adaptive re-orderingof the spectral components of eachpixel is performed prior to prediction and encoding. This re-ordering exploits, on a pixel-by-pixel basis,the presenceof inter-band correlationsfor prediction. Furthermore, the proposedapproachtakes advantageof spatial correlations, and doesnot introduce any coding overheadto transmit the order of the spectral bands. This is accomplishedby using the assumption that two spatially adjacent pixels are expected to have similar spectral relationships. We thus have a simple techniqueto exploit spectral and spatial correlations in hyper-spectral data sets, leading to significant performanceimprovementsas compared to our previously reported techniquesfor losslesscompression[10].
2 Spatially Adaptive Spectral Modeling

Given two spatially 'adjacent' pixels X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n) of a multi-spectral image, we assume that X and Y have similar spectral relationships. We scan the image in band-interleaved order, so that when the current pixel Y is to be encoded, its causal neighbors, including X, have already been encoded. The encoder can therefore compute from X a 'model' of the spectral relationships at X, use the model parameters to predict the spectral components of Y, and encode the prediction error at Y. Since the decoder can compute the same model from its own copy of X, no model parameters need to be transmitted. Continuing in this manner, we obtain a simple one-pass technique that effectively captures both spectral and spatial correlations, and which would be well suited for real-time applications.

Two questions remain: which model does one use for the spectral values of a pixel, and how does one estimate the parameters of the model to be applied at Y? The answer to these questions can be obtained by formulating the problem in a graph theoretic setting, in a manner similar to that employed in [3, 7] and in our earlier work [9, 14, 12]. Specifically, we construct a weighted directed graph over the image by assigning a vertex to each pixel and, using a 4-neighborhood or 8-neighborhood notion of adjacency, adding edges between each pair of spatially adjacent pixels X and Y. The weight on the edge going from X to Y represents the 'coding cost' incurred in encoding Y by using the model derived from X, and vice-versa for the edge from Y to X. It is easy to see that a minimum weight spanning tree of this weighted directed graph specifies an optimal adaptive encoding of the entire image: we begin at the root and traverse the tree in depth-first order, encoding each vertex by using the model derived from its parent. It is well known that a minimum weight spanning tree can be found in O(V log E + E) time for sparse directed graphs, where V is the number of vertices and E the number of edges; in our case, E = 8V. Different cost functions lead to different modeling schemes. For example, if the prediction residuals are treated as iid zero-mean Laplacian, the weight of an edge can be taken to be the sum of the absolute prediction errors, since minimizing this cost essentially minimizes the number of bits needed to encode the residuals [9].

Although tractable, the cost of constructing such a graph and computing a minimum weight spanning tree would be prohibitive for many applications, both in computation and in memory. What we have described up to this point should therefore be viewed as a bound on the compression performance that could be expected from any implementation of the proposed approach, although it may still be of interest for applications where processing resources are not a constraint. In the next section we describe a simple heuristic, based on spectral re-ordering, that uses these ideas at a small fraction of their cost.
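As a concrete sketch of the spanning-tree formulation, the fragment below applies it to the band-ordering variant discussed in the introduction (vertices are bands rather than pixels, which keeps the example small). The mean absolute inter-band difference is an illustrative stand-in for the true coding cost, and Prim's algorithm replaces the directed-MST machinery; none of the names below are from the paper.

```python
import heapq
import numpy as np

def band_prediction_order(image):
    """Sketch of a minimum-spanning-tree band ordering.
    image: (n_bands, h, w) array.  Edge weights approximate the cost of
    coding one band from another; here, mean absolute difference.
    Returns (band, parent) pairs in coding order; the root has parent None."""
    n = image.shape[0]
    flat = image.reshape(n, -1).astype(np.int64)

    def cost(a, b):
        return float(np.abs(flat[a] - flat[b]).mean())

    visited = [False] * n
    order = [(0, None)]                    # band 0 is sent as-is (the root)
    visited[0] = True
    heap = [(cost(0, b), 0, b) for b in range(1, n)]
    heapq.heapify(heap)
    while len(order) < n:                  # Prim's algorithm
        w, parent, b = heapq.heappop(heap)
        if visited[b]:
            continue
        visited[b] = True
        order.append((b, parent))
        for nb in range(n):
            if not visited[nb]:
                heapq.heappush(heap, (cost(b, nb), b, nb))
    return order
```

Traversing the returned pairs in order plays the role of the depth-first walk described above: each band is predicted from the parent it was attached to in the tree.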
3 Spectral Re-ordering

We now turn to the question of modeling the spectral components of a pixel. Recent studies [12] have shown that, in hyper-spectral data, the intensity values of bands which are spectrally far apart can nevertheless be highly correlated, with the intensities of such bands related by a simple multiplicative factor. That is, if I[i,j,k] is the intensity value at spatial location (i,j) in band k, then for bands r and s which are known to be correlated, I[i,j,r] = α_rs I[i,j,s], approximately. Keeping this fact in mind, we describe a simple band selection and modeling scheme that is based on spectral re-ordering.

In order to model the spectral components of a given pixel X, we first re-order its components into ascending order. In other words, given a pixel X = (x_1, x_2, ..., x_n) from an n-band image, let σ be a permutation on n points that sorts the components of X, and let the resulting vector be X̂ = (x̂_1, x̂_2, ..., x̂_n), where

    x̂_1 = x_σ(1) ≤ x̂_2 = x_σ(2) ≤ ... ≤ x̂_n = x_σ(n).

It is clear that the sorted vector X̂ is more amenable to approximation by a linear model than the original vector X. Furthermore, the sorting step, which costs only O(n log n) for an n-band pixel, brings spectrally correlated components closer together. Of course, the permutation σ is derived from the values of X itself, and transmitting it would incur a large coding overhead. Instead, we use the permutation obtained from spatially adjacent pixels whose components have already been encoded: since two spatially adjacent pixels are expected to have similar spectral relationships, the permutation derived from the neighbors is almost as good for re-ordering X, and the decoder can compute it without any side information.
[Figure 1 appears here: a plot of intensity (0 to 3000) versus band number (0 to 225) for a single pixel, with traces labeled 'original_order', 'ascending_order' and 'spectrally_re-ordered'.]

Figure 1: The spectral components of a pixel before and after re-ordering.
In Figure 1 we plot the spectral components of a random pixel selected from an AVIRIS image, in the original band order, in ascending order, and after re-ordering by the permutation derived from a neighboring pixel. It can be clearly seen from this figure that the re-ordered vector X̂ can be better approximated by a linear model using a given number of parameters as compared to the original vector X. For example, if we wish to approximate X̂ with a piecewise linear function f(.) at a prescribed error tolerance e, such that |f(i) - x̂_i| ≤ e for 1 ≤ i ≤ n, we would expect to do so with far fewer segments than would be needed for X. In recent work, Bhaskaran et al. [2] have used piecewise linear approximations of this kind for coding waveform segments.

The results reported below were obtained with a portion of an AVIRIS image of the Cuprite Mining District, Cuprite, Nevada, containing 200 lines, 200 samples and 224 spectral bands. They were obtained with the intention of testing the validity of our approach with simple prediction schemes; implementations with more sophisticated predictors are currently in progress. Table 1 shows the zero-order entropy of the prediction residuals produced by several schemes.

Table 1: Zero-order entropy of prediction residuals with different prediction schemes.

    Prediction Scheme                                         Entropy
    Best JPEG                                                  6.13
    Spectral differencing                                      8.19
    Spectral re-ordering and differencing                      5.52
    Spectral re-ordering, adaptive first order prediction      5.48
    Spectral re-ordering, adaptive second order prediction     5.36
    Prediction trees                                           6.27

The entry 'Best JPEG' refers to the zero-order entropy of the prediction residuals obtained by applying each of the eight lossless JPEG predictors on each band of the image and
selecting the best. Here no spectral information is being taken into account; the prediction is purely spatial. The entry 'Spectral differencing' gives the zero-order entropy of prediction residuals after taking differences among pixels in adjacent spectral bands but the same spatial location. In other words, a fixed, first-order spectral predictor is employed with coefficient 1. The third entry, 'Spectral re-ordering and differencing', gives the zero-order entropy of prediction errors after first re-ordering the spectral components by using the permutation that sorts the average of the horizontal, vertical and left-diagonal neighbors, and then taking differences as above. It is seen that a considerable reduction in entropy is attained by simply re-ordering the spectral components prior to differencing. The fourth entry, 'Spectral re-ordering, adaptive first order prediction', refers to a technique that re-orders the spectral components as above but uses the value α · x̂_i to predict the
value of the next spectral component x̂_{i+1} of the re-ordered pixel X̂ = (x̂_1, ..., x̂_n). Here the coefficient α was computed from the vector Y = (y_1, ..., y_n) of average intensity values of the vertical, horizontal and left-diagonal neighbors, arranged in ascending order. The entry 'Spectral re-ordering, adaptive second order prediction' refers to a technique identical to the one above, but using the previous two components of the re-ordered vector to estimate the next value. Finally, in order to make comparisons with earlier work, the entry 'Prediction trees' lists the entropy obtained with the spatially adaptive prediction technique we reported previously [10]. The new schemes improve on these results despite a reduction in computational complexity and memory usage, and despite using far less sophisticated prediction. They are also simple enough to be implemented in hardware and to process the image in band-interleaved order as it is acquired, which should make them attractive for real-time applications. Further improvements should be obtainable with more sophisticated prediction and/or error modeling.
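The figure of merit in Table 1 is the zero-order entropy of the residuals, i.e. the average number of bits per sample an ideal memoryless entropy coder would spend. A small self-contained sketch of that computation (the ramp signal is a toy example, not AVIRIS data):

```python
import numpy as np
from collections import Counter

def zero_order_entropy(samples):
    """Zero-order entropy in bits/sample of a sequence of symbols."""
    counts = Counter(samples)
    n = len(samples)
    p = np.array([c / n for c in counts.values()])
    return float(-(p * np.log2(p)).sum())

# Toy illustration: first-differencing a smooth signal collapses its
# symbol distribution, and the zero-order entropy drops accordingly.
signal = list(range(100))                       # all symbols distinct
raw_entropy = zero_order_entropy(signal)        # log2(100) ≈ 6.64 bits
diff = [signal[0]] + [b - a for a, b in zip(signal, signal[1:])]
diff_entropy = zero_order_entropy(diff)         # residuals nearly constant
```

Comparing such entropies across prediction schemes, as Table 1 does, measures how much structure each predictor removes before entropy coding.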
4 Modeling Prediction Errors

If the residual image that remains after prediction is treated as an iid source, it can be efficiently encoded using standard variable length coding techniques such as Huffman coding or arithmetic coding. Unfortunately, prediction residuals generally violate the iid assumption: the residual image has ample structure even after the most sophisticated prediction techniques have been applied. Hence, in order to encode the residual image efficiently, we need an error model that captures the structure which remains after prediction. To do this we can use a composite source model, a notion first described by Berger [1]. In composite source modeling, the data is treated as an interleaved sequence generated by multiple sub-sources, each with its own model and associated probability distribution.

Error modeling schemes that use a composite source model typically consist of two components:

1. A family of probability mass functions, one for each sub-source, used to entropy code the symbols assigned to that sub-source. The family could be static and the same for all images. Alternatively, we could scan the entire image, construct probability mass functions, and send them to the decoder as side information; this is referred to as an off-line, or forward adaptive, technique. Finally, for each sub-source, the probability mass function could be adapted based only on previously encoded information, which is available to both the encoder and the decoder; this is referred to as an online, or backward adaptive, technique.

2. A switching scheme that indicates which sub-source generated the current symbol. Again, the switching decisions could be transmitted to the decoder as side information (forward adaptive), or we could make the switching decisions based only on information available to both the encoder and the receiver/decoder (backward adaptive).

The results reported here are, as in the case of our previously reported work, all backward adaptive.

We constructed a composite source model in which each sub-source represents a different level of activity in a region of the image. Recall that in the active regions of an image the magnitudes of the prediction residuals will be large, while in the quasi-constant regions they will be smaller. The four different measures of activity level that were used to switch between sub-sources were:

• PPS: magnitude of the prediction residual at the spectral component of the current pixel that was encoded just before the current component.

• PC: magnitude of the prediction residual at the same spectral location but at a spatially neighboring pixel.

• AME: Average Magnitude of the prediction Error over the entire set of spectral components of a neighboring pixel.

• COMB: a model that makes switching decisions based on both PPS and PC.
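The backward-adaptive switching idea, choosing a sub-source from information the decoder already has, can be sketched as follows for a PPS-style measure. The thresholds, the number of activity levels, and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def conditional_entropy_pps(residuals, thresholds=(2, 8, 32)):
    """Backward-adaptive context switching, sketched for a PPS-style
    measure: each residual is assigned to a sub-source according to the
    magnitude of the *previous* residual, which the decoder also knows.
    Returns the average bits/sample when each sub-source is coded with
    its own zero-order model.  Thresholds are illustrative."""
    buckets = {}
    prev = 0
    for r in residuals:
        ctx = sum(abs(prev) > t for t in thresholds)  # 0..3: activity level
        buckets.setdefault(ctx, []).append(r)
        prev = r

    total_bits = 0.0
    for vals in buckets.values():
        _, counts = np.unique(vals, return_counts=True)
        p = counts / counts.sum()
        total_bits += float(-(p * np.log2(p)).sum()) * len(vals)
    return total_bits / len(residuals)
```

When residual magnitudes are locally correlated, as they are in active versus quasi-constant image regions, this conditional entropy falls below the single-source entropy, which is exactly the gain the composite source model is after.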
Table 2 gives the zero-order entropy obtained when the residual image is encoded as a composite source, using the different switching measures, rather than as a single iid source.

Table 2: Entropy of prediction residuals with sub-source switching.

    Switching measure    Entropy
    PPS                   5.19
    COMB                  5.18
    AME                   5.31

We see that error modeling with a composite source yields a further reduction in entropy over the best entry of Table 1, with the combined measure COMB performing marginally better than the others. Since all the measures are backward adaptive, this reduction is obtained with almost no side information: the decoder computes the same activity level from data it has already decoded, as opposed to using explicitly transmitted switching decisions. Tuning the number of activity levels and the manner in which the measures are quantized is the subject of ongoing work.
Context-Based Retrieval of Remote Sensed Images Using a Feature-Based Approach

Asha Vellaikal†, Son Dao† and C.-C. Jay Kuo‡

†Information Sciences Lab
Hughes Research Laboratories
Malibu, CA 90265
{asha,son}@isl.hrl.hac.com

‡Signal and Image Processing Institute
Department of Electrical Engineering-Systems
University of Southern California
Los Angeles, CA 90089-2564
[email protected]
ABSTRACT

A feature-based representation model for a remote sensed image database is described in this work. Features are extracted by clustering spatially local pixels, and the cluster features are expected to support the types of queries which occur frequently in the remote sensing domain. Preliminary experimental results show that the feature-based representation provides a very promising tool for content-based access.

1 INTRODUCTION

Efficiency of image data management has become a very important issue in remote sensing, since there will be a tremendous increase in the size of available image data over the coming few years. EOS, for example, is expected to produce data sets with sizes exceeding terabytes everyday once it becomes operational, and it is the intent of NASA to make this data public [1]. With databases of such size, the ability to retrieve images based on their content becomes critical. Current operational systems allow retrieval based on metadata such as the geographical location, the time at which the image was taken, and so on. This does not provide the flexibility needed to process the general types of queries a user may wish to pose. As an example, consider the following query, given by Dozier [2]:

• Find the Landsat Thematic Mapper image closest in time to April 1, 1992, for the Bubbs Creek area, which is cloud-free, and give the snow-covered area and the grain-size of the snow.

Answering such a query requires information which is extracted from the image data itself, that is, retrieval based on content. Content-based retrieval is currently an active research area in other fields such as multimedia databases, and it is expected that feature-based indexing will greatly enhance the flexibility and utility of access to remote sensed image databases [3, 4]. Apart from retrieval itself, a good feature-based representation can also aid storage and indexing, with the index formed at the time of insertion of an image into the database.
INTRODUCTION
to the image
does
in a sample
to process of remote
will be a tremendous
means
geographical
• Find
has
access
used
feature-based
will be downloaded
database
ard
are
To give an example,
flexible data.
time
sensed
there
everyday
allow very of available
The
flexibility
The representation
context
the
1 Efficiency
for content-based
retrieval
telemedicine features
of insertion.
Indexing
remote
is currently
Most
can aid
grain-size
of the
of images [3,4].
which
snow
Creek
of the
work
content-based
is done
on these
features
and
queries
In this paper, primitives
are first
of these
feature
primitives space.
work
given
from
the
area
the has
and
the
representation the
in Section
feature
Traditional attribute basic
unit
and
it easier
term
will differ
retrieval which analyze image which
each
on
of content sensed
image. in the
is explained
in terms
The
feature
multispectral
et al [5]. material sensed
are given
in Section
about
images.
Details
in Section
4.
feature-
3 and
Experimental
of
query
results
are
dealt
them
exactly.
is the pixel, on
correlation
with
2 BACKGROUND

2.1 Feature-Based Representation

Image data are unstructured, and content-based retrieval systems have to rely on content-descriptors computed from the data in order to answer queries [3]. There are several characteristics which content-based image representations should ideally possess. The representation should be compact so as to minimize storage, it should be computationally efficient, and it should be able to support a wide variety of queries. By using appropriate models of the image data, it should be possible to answer content-based queries from the representation itself, without direct access to the entire image. Since the semantic meaning of "content" depends on human perception and on the conditions in which a query is posed, and since queries will differ widely with the application domain, it is practically impossible to develop a single representation which can support all types of queries. In standard multimedia databases, content-descriptors based on features such as color and texture have been widely used [3]. A remote sensed image, in contrast, is just a measurement recorded by a sensor, and what constitutes its "content" is determined by the way in which the intensity data are interpreted for a particular application. In our approach, we extend these ideas by extracting feature primitives from the remote sensed images by clustering; the extracted primitives are then processed to form a representation from which content-based queries can be answered, which can be termed a feature-based representation.
Figure 1 shows a schematic of a feature-based representation model. The image data can be distinguished into physical image data and logical image data. The physical image data are the actual images, which are typically stored using compression methods in secondary or tertiary memory devices. The logical image data are derived from the physical images and are much more compact, so that they can be stored in primary memory. All access regarding content can be done on the logical image data, which makes query processing much easier and faster than accessing the physical images themselves. Information regarding the content of the images is calculated from the logical representation with the aid of knowledge bases or class models, and this content information can be dynamic: different models can be applied to the same logical data, depending on the knowledge available, and the derived content need not always be stored in the system. As opposed to content, the logical image data are static, as they are a permanent representation of the physical images.
Figure 1: Schematic of a feature-based representation model. The actual images reside in secondary storage; the feature-based representation is compact enough for primary memory; "content" is derived dynamically from it using knowledge bases or class models.

2.2 "Content" in Multispectral Remote Sensed Images

Images for earth resource applications are mainly gathered at several discrete wavelengths which typically lie in the visible/infrared through microwave ranges. The basic unit of an image is the pixel, and the intensity value of a pixel is a measurement of the brightness of the corresponding area on the ground in a particular wavelength band, quantized into a discrete number of levels determined by the radiometric resolution of the sensor. Figure 2(a) gives details of the four bands of the NOAA Advanced Very High Resolution Radiometer (AVHRR) image data which was used in our experiments. The images had a radiometric resolution of 12 bits which were calibrated to 8 bits. The latitudes and longitudes of the corner pixels (not shown), along with the map projection information, determine the co-ordinates of each individual pixel after georeferencing, so that every pixel in an image corresponds to a definite geographical location on earth.

Electromagnetic radiation at different wavelengths interacts differently with the materials present on the earth's surface, and this property makes the multispectral intensity values useful in distinguishing between different ground covers. The "content" of remote sensed images is typically analyzed and interrogated in two main ways, as follows:

1. Classification: Here the image is analyzed so as to categorize each pixel of the image into one of a number of ground classes. Classification is based mainly on the multispectral intensity values, although features such as texture, shape and size are also used to aid classification. One useful characteristic is that pixels belonging to the same class tend to cluster together in a subspace of the multispectral space. For example, Figure 2(b) shows the case in which the pixels of an AVHRR image can be classified as snow, land or cloud [6].
    Band    Wavelength (microns)
    1       0.58 - 0.68
    2       0.72 - 1.10
    3       3.55 - 3.93
    4       10.50 - 11.50

Figure 2: (a) Details of the wavelength bands for the NOAA AVHRR data. (b) Distribution of the intensity values of land, snow and cloud pixels in the multispectral space.

Quantitatively, methods such as maximum likelihood classification can be employed [7]. To explain briefly, let x denote the intensity vector of a pixel, where each component of x is the intensity value measured in a particular band. It is common to represent the distribution of the intensity vectors of class wi as a multivariate normal distribution

    p(x|wi) = (2*pi)^(-N/2) |Si|^(-1/2) exp{ -1/2 (x - mi)' Si^(-1) (x - mi) }

where mi is the mean vector and Si is the covariance matrix of class wi. Given K classes, the maximum likelihood decision rule can then be used to determine the class to which a pixel belongs, as follows:

    x belongs to wi,    if p(wi|x) > p(wj|x) for all j != i,

where p(wi|x) is the conditional probability of class wi given the intensity vector x. Even though the dominant classes show clear separation in Figure 2(b), the distributions of the different classes can overlap.

2. Derived Products: The use of remote sensed images for earth resource monitoring often involves secondary products whose numerical values are calculated directly from the intensity values of the pixels. Some examples of these products, typically referred to as level 3 data, are the Sea Surface Temperature (SST) and the Normalized Difference Vegetation Index (NDVI) calculated from NOAA AVHRR data.

It is important to note one more characteristic of remote sensed images: the pixels are geographically registered, so that the "content" of an image refers not only to what is present but also to a physical location on the earth. Thus queries in content-based retrieval of remote sensed images will often be regarding subportions of an image corresponding to a geographic location of interest, as opposed to queries regarding the image as a whole.
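The maximum likelihood rule above can be sketched in a few lines. The per-class statistics below are illustrative values loosely based on the band-1 and band-5 entries of Table 1 with diagonal covariances; they are assumptions for the example, not the classifier actually trained in the paper.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, cov):
    """log p(x|w_i) for a multivariate normal, dropping the (2*pi)^(-N/2)
    factor, which is common to all classes and cancels in the comparison."""
    diff = x - mean
    return -0.5 * np.log(np.linalg.det(cov)) - 0.5 * diff @ np.linalg.inv(cov) @ diff

def classify(x, classes):
    """Maximum likelihood rule: pick the class w_i maximizing p(x|w_i)
    (with equal priors this also maximizes p(w_i|x))."""
    return max(classes, key=lambda c: gaussian_log_likelihood(x, *classes[c]))

# Illustrative (mean, covariance) per class for two bands -- assumed values:
classes = {
    "snow":  (np.array([43.2, 3.18]), np.diag([481.4, 14.2])),
    "land":  (np.array([16.8, 4.60]), np.diag([119.5, 16.1])),
    "cloud": (np.array([58.0, 19.4]), np.diag([612.9, 78.4])),
}

print(classify(np.array([45.0, 4.0]), classes))  # prints "snow"
```

Because the determinant and quadratic-form terms are evaluated in log space, the comparison stays numerically stable even for bands with very different variances.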
3 FEATURE-BASED REPRESENTATION OF MULTISPECTRAL DATA

There is intrinsically a lot of redundancy in multispectral remote sensed data, due to the fact that nearby pixels tend to have similar intensity values and that pixels belonging to the same class have a high probability of having similar values in all the bands. Compression methods are widely used to exploit this redundancy so as to ease the storage and transmission problems. Schemes based on transform coding (e.g. JPEG, wavelets) achieve a reduction by exploiting the local redundancy, but the compressed representation is not useful for feature extraction or interpretation of content, as the files have to be decompressed before access to the image is possible [8]. A technique such as vector quantization (VQ), which is based on clustering, is naturally suited for joint compression and feature extraction. Tilton et al. [9,10] studied the use of VQ for multispectral image compression. However, the strength of VQ techniques with respect to content was not considered, as the emphasis was more on compression. In this section, we will develop a feature-based representation of the multispectral image based on clustering.

Let X = {xi : i = 1,2,...,n} be the set of n p-dimensional vectors of a multispectral image, where each vector consists of the intensity (irradiance) values of a pixel in the p bands. Clustering maps X onto a set of clusters C = {Ci : i = 1,2,...,N}, where N (n >> N) is the number of clusters; the number of clusters is typically chosen to be larger than the number of classes expected to be present in the image. Each cluster can be described by measures such as its mean, its covariance matrix and the number of vectors mapped to it, Ci = Ci(mean, covariance, number), and these cluster measures are the features which describe the image; typical features consist of the cluster means, variances or covariances, intercluster distances, and the number of pixels per cluster. The clusters are formed while trying to minimize some distortion measure J(X;C). In our work, the distortion measure is given by

    J(X;C) = (1/n) * sum_{i=1}^{n} || xi - Cj(m) ||^2

where xi is mapped to the cluster Cj with the nearest mean Cj(m), using the nearest neighbor rule [11]. The accuracy with which the clusters represent the image depends on the number of clusters, which must be chosen according to the accuracy required of the representation. Once the clustering is complete, an index map I can be obtained by mapping each pixel to the index of its nearest cluster; the cluster set C, along with the index map, can be used to encode the image in a manner similar to VQ-based compression. Note that the image can also be segmented by using the index map along with spatial information, to form spatially homogeneous areas having similar spectral values, which can serve to answer queries regarding spatial extent.
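The clustering step and the distortion J(X;C) can be illustrated with a minimal k-means loop; this is a stand-in for whatever clustering algorithm is used in practice, and the two-band toy data are invented for the example.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Cluster n p-dimensional pixel vectors X into k clusters, minimizing
    the distortion J(X;C) = (1/n) * sum_i ||x_i - C_j(m)||^2, where each
    x_i is mapped to its nearest cluster mean (nearest neighbor rule)."""
    # deterministic initialization: k means spread over the data order
    means = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # index map I: assign every pixel vector to its nearest cluster mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        idx = d.argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                means[j] = X[idx == j].mean(axis=0)
    J = np.mean(np.linalg.norm(X - means[idx], axis=1) ** 2)
    return means, idx, J

# Toy two-band "sector": two perfectly separable intensity groups
X = np.vstack([np.full((50, 2), 10.0), np.full((50, 2), 200.0)])
means, idx, J = kmeans(X, k=2)
print(J)  # 0.0 for perfectly separable groups
```

The returned cluster means and pixel counts correspond to the cluster features C, and `idx` is the index map I of the representation.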
Figure 3: A feature-based representation scheme for multispectral images. Each local subimage is clustered (FEATURE EXTRACTION) to produce cluster features and an index map assigning each pixel to a cluster index, which are then encoded to form the representation of the global multispectral image.

Figure 3 shows the feature-based representation scheme which we developed for multispectral images. The image is first divided into sectors, and the pixels of each sector are clustered separately to get the cluster feature set and the associated index map, which together form the logical representation of the image. There were several reasons for subdividing the image before clustering. First of all, clustering an entire image at once is computationally expensive, and for large remote sensing images it is impractical. Global clustering also tends to spread the clusters, while the clusters formed within a local subimage are tighter, as the local distributions have lower variance; for a given distortion tolerance, a subimage can thus be represented with a lower number of clusters. Also, queries will often involve only a subportion of an image, and local cluster features retain information which would be lost due to averaging over the entire image, so it is better to have the feature primitives describe local distributions. The image could instead be subdivided into spatially contiguous homogeneous regions, but such regions are expensive to form and their boundaries do not necessarily correspond to the subportions referred to in queries; we followed the approach of dividing the image into fixed sectors.

The feature-based representation scheme for multispectral data can be summarized as follows:

    database = {S, metadata},    S = (sk : k = 1,2,...,p)

where each sector s consists of the cluster set C along with the index map I and other metadata:

    s = {C, I, metadata}

Here the image-level metadata consists of information such as the source name, geographical location, sun angle, moisture, etc., while the sector metadata describes the individual sector.
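The sector structure summarized above could be sketched with plain data classes; all field names here are illustrative choices of ours, not identifiers from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Cluster:
    mean: List[float]        # per-band cluster mean
    variance: List[float]    # per-band cluster variance
    count: int               # number of pixels mapped to this cluster

@dataclass
class Sector:
    clusters: List[Cluster]  # cluster feature set C
    index_map: List[int]     # I: cluster index of each pixel in the sector
    metadata: Dict           # e.g. sector position within the image

@dataclass
class ImageRecord:
    sectors: List[Sector]    # S = (s_k : k = 1,...,p)
    metadata: Dict           # e.g. source name, geographical location, sun angle

rec = ImageRecord(
    sectors=[Sector(clusters=[Cluster([43.2, 3.2], [481.4, 14.2], 1200)],
                    index_map=[0] * 1200,
                    metadata={"tile": 0})],
    metadata={"source": "AVHRR"},
)
print(len(rec.sectors), rec.sectors[0].clusters[0].count)  # 1 1200
```

Because queries touch only the cluster features and metadata, a record like this can stay resident in primary memory while the physical images remain in secondary storage.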
4 QUERY PROCESSING USING MULTISPECTRAL FEATURE PRIMITIVES

Queries regarding content can be processed using the feature-based representation developed in Section 3. From a database perspective, there are two ways of processing queries regarding the content of an image. In the first case, particular queries are anticipated in advance, and the content attributes needed to answer them are calculated apriori and stored along with the logical image data; query processing can then be done with standard attribute-processing techniques, which are well defined and computationally cheap at run time. In the second case, ad hoc queries are analyzed at run time and the content information is derived from the logical image data itself. The former approach limits the flexibility of the queries which can be asked, while the latter is computationally more expensive. The feature-based representation of Section 3 can help in both cases: content attributes can be calculated from the cluster features without access to the physical image data, and the representation itself can allow very flexible ad hoc queries. In this paper, we consider the case in which queries are processed at run time directly on the feature primitives. Due to the lack of literature regarding standard content-based queries on remote sensed data, which will differ with the sensor and the application and for which no well defined benchmark exists, we restrict ourselves to queries regarding two of the ground classes (snow, cloud), and consider two forms of query separately, query by class and query by value.

Query by Class: Such queries would be regarding the presence of a particular class in an image, the extent of area covered by that class, and so on. We shall answer queries regarding the logical "content" of the image, by which we mean the derived class information; queries which refer to the actual sensor data, such as the grain size of the snow, will not be considered. The basic premise is that the clusters derived in Section 3 can help determine whether an image contains pixels very likely to belong to a given class. We assume that the characteristic values of each class c, namely the mean vector mc, the covariance matrix Sc and the standard deviation vector sc, are available, calculated apriori from data of known classification. From these, we formulate a decision rule to determine whether a cluster C qualifies as belonging to class c, as follows:

    (C(m) - mc)' Sc^(-1) (C(m) - mc) < (m*sc)' Sc^(-1) (m*sc)

where C(m) is the mean vector of the cluster and m is a parameter which determines how far from the class mean a cluster may lie and still qualify. The premise behind this rule is that as the mean of a cluster moves farther away from the class mean, the possibility of the cluster containing pixels belonging to that class diminishes, and that this distance between a cluster and a class should be measured with respect to the class distribution, which for a Gaussian distribution is characterized by mc and Sc. Note that the rule only uses the mean of a cluster and not its covariance matrix information. Obviously, as m increases, more clusters will qualify as belonging to the class. Once the qualifying clusters are determined, the pixels indexed to them give the extent of area of the class in the image, so that queries regarding both the presence of a class and the percentage of area it covers can be answered. Note that the decision rule above is very simplistic, as the distributions of several classes can overlap in the multispectral space, and more sophisticated decision functions might be needed in practice. Experimental results for this form of query are presented in Section 5.
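The Query by Class decision rule can be sketched as below; the snow statistics are illustrative values (band 1 and band 5, diagonal covariance), and the function name is ours.

```python
import numpy as np

def qualifies(cluster_mean, class_mean, class_cov, class_std, m):
    """A cluster qualifies as belonging to a class if its mean lies within
    m standard deviations of the class mean, measured through the class
    covariance: (C - m_c)' S_c^-1 (C - m_c) < (m*s_c)' S_c^-1 (m*s_c)."""
    inv = np.linalg.inv(class_cov)
    d = cluster_mean - class_mean
    lhs = d @ inv @ d
    rhs = (m * class_std) @ inv @ (m * class_std)
    return lhs < rhs

# Illustrative snow statistics for two bands (assumed values):
m_snow = np.array([43.2, 3.18])
cov_snow = np.diag([481.4, 14.2])
std_snow = np.sqrt(np.diag(cov_snow))

print(qualifies(np.array([45.0, 4.0]), m_snow, cov_snow, std_snow, m=1.0))
print(qualifies(np.array([90.0, 15.0]), m_snow, cov_snow, std_snow, m=1.0))
```

Raising m relaxes the rule, so a cluster rejected at m = 1 may qualify at m = 3; this is exactly the recall/efficiency trade-off examined in Section 5.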
Query by Value: As mentioned in Section 2.2, the numerical values of the derived products are typically dependent directly on the intensity values of the pixels. Thus queries about these products can be considered to be queries regarding the intensity values in the image. These queries can be directly processed using the feature-based representation, as the cluster features are directly representative of the intensity distributions in the image.

Data Mining: Data mining refers to algorithms which try to discover interesting concepts from data. Cromp and Campbell [12] discussed some important applications of data mining to remote sensing. We believe that our representation of remote sensed images can be applied to some aspects of data mining. For instance, anomaly detection is a method by which artifacts, such as outlier values from the standard statistical range, are discovered to be a manifestation of a previously unknown phenomenon. The cluster representation of the multispectral image is very concise and lends itself to statistical processing. We present some preliminary results towards this end.
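As a toy illustration of the anomaly-detection idea, outlier cluster features can be screened with a simple z-score test over the collection of cluster means; the threshold and data here are invented for the example, not the procedure used in the paper.

```python
import numpy as np

def flag_anomalies(cluster_means, z_thresh=3.0):
    """Flag cluster means lying more than z_thresh standard deviations from
    the mean of all cluster means (per band) -- a crude screen for outlier
    values that may signal artifacts or unknown phenomena."""
    mu = cluster_means.mean(axis=0)
    sigma = cluster_means.std(axis=0)
    z = np.abs(cluster_means - mu) / sigma
    return np.where(z.max(axis=1) > z_thresh)[0]

# 40 ordinary two-band cluster means plus one wild outlier (toy data)
means = np.vstack([np.random.default_rng(1).normal(50.0, 5.0, (40, 2)),
                   [[250.0, 250.0]]])
print(flag_anomalies(means))  # flags only the outlier at index 40
```

Because only cluster means are examined, the screen runs over the compact logical data without touching the physical images.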
5 EXPERIMENTAL RESULTS

The experiments were conducted on NOAA AVHRR imagery made available to us by the National Operational Hydrologic Remote Sensing Center (NOHRSC), where the imagery is used for operational snow monitoring. The images, taken over parts of Northern America, were already radiometrically calibrated and georegistered, and were around 1000 x 1000 pixels in size. We had five sets of images for bands 1, 4 and 5, which were used in our experiments. Each image was divided into 16 equal sectors, and each sector was treated as a separate image in our database. Clustering into 64 clusters was then carried out on each sector separately at the primary stage, as described in Section 3.

The images had already been classified into three classes, viz. snow, land and cloud, following the scheme suggested by Baglio and Holroyd [6], which was sufficient for our purposes since the queries regarding classes were restricted to these classes. For our experiments, we developed a maximum likelihood classifier for which we estimated the characteristic values for the three classes from the images which had already been classified. Table 1 gives the mean and the variances of each band for the different classes; these values were also used for answering the queries regarding the classes.

To get a measure of the clustering performance, the mean squared error (MSE) between the value of each pixel in the original image and the mean of the final cluster to which it was mapped was calculated for each sector. Figure 4(a)-(c) gives the MSE values of the sectors for bands 1, 4 and 5 respectively. The average MSE per band was measured to be 2.82, 2.08 and 2.22 for bands 1, 4 and 5 respectively. The MSE alone, however, is not a good indicator of clustering performance for our purpose, primarily because it does not take into account the differences between the bands.
Table 1: Mean and Variances for Different Classes.

    Band   Total Mean  Total Var.  Cloud Mean  Cloud Var.  Snow Mean  Snow Var.  Land Mean  Land Var.
    1      36.9        700.6       58.0        612.9       43.2       481.4      16.8       119.5
    2      34.7        468.1       50.0        436.1       38.4       267.6      18.3       103.7
    3      86.7        155.8       91.3        111.8       81.5       29.1       90.1       31.0
    4      79.1        157.4       71.9        110.6       78.3       27.5       85.6       34.7
    5      7.01        73.9        19.4        78.4        3.18       14.2       4.60       16.1
To compare the distortion between the different bands, the normalized MSE (NMSE) for band i is defined as

    NMSEi = MSEi / vi

where vi is the variance of band i. The NMSE for bands 1, 4 and 5 was measured to be 0.019, 0.075 and 0.122 respectively. As can be seen, the clustering performance in terms of the NMSE is considerably better for the band with the lower variance (band 1) than for the bands with the higher variances (bands 4 and 5). This implies that the distortion is dominated by clusters whose values lie farther away from the band mean. We had earlier reported clustering methods which give comparable performances across bands [13], and improved performance can certainly be achieved with different clustering approaches, but at the expense of higher computational complexity.

Query with Class: This experiment was carried out for the three different classes. The correct classification of each cluster is taken to be the class given by the maximum likelihood decision rule. Different values of m were tried, to see how the performance of the decision rule of Section 4 varies as more clusters are accepted as belonging to a class. Figure 4(d)-(f) shows the cluster means for bands 1, 4 and 5 along with the class means, from which it can be seen that the class distributions are, as expected, different but overlapping.

Figure 4: (a)-(c) MSE of the sectors for Band 1, Band 4 and Band 5. (d)-(f) Cluster mean values for Band 1, Band 4 and Band 5.
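The NMSE computation defined above is a one-liner per band; a minimal sketch with toy band values:

```python
import numpy as np

def nmse(original, reconstructed):
    """Normalized MSE for one band: the MSE divided by the variance of the
    band, making distortion comparable across bands of different spread."""
    mse = np.mean((original - reconstructed) ** 2)
    return mse / np.var(original)

band = np.array([10.0, 20.0, 30.0, 40.0])   # toy band values
recon = np.array([11.0, 19.0, 31.0, 39.0])  # toy cluster-mean reconstruction
print(nmse(band, recon))  # MSE 1.0 over variance 125.0 = 0.008
```

Normalizing by the band variance is what lets a low-variance band (like band 1 above) be compared fairly against high-variance bands.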
The covariances for the different classes vary considerably.

Figure 5: (a) Recall versus m. (b) Efficiency versus m. (c) Recall versus Efficiency.
To evaluate the retrieval performance, we used two measures which are widely accepted, recall and efficiency. If A represents the number of relevant images which were correctly retrieved, B the number of relevant images which were missed, and C the number of images which were falsely judged as relevant, then recall and efficiency are defined as

    recall = A / (A + B) * 100,        efficiency = A / (A + C) * 100.

Ideally we would like both measures to be 100. Figure 5(a)-(b) shows the recall and efficiency for queries regarding the three classes, as a function of the parameter m of the decision rule. As expected, the recall increases as m increases, since more clusters qualify as belonging to a particular class; however, this is at the cost of decreasing efficiency, since unwanted clusters whose values lie nearby in the multispectral space are also retrieved. Figure 5(c) gives the plot of recall versus efficiency. The retrieval performance was best for the class with low variance and worst for land, which could be attributed to its large spread.

Query by Value: Here, we experimented with queries regarding the percentages of pixels in an image whose intensity values lie in given ranges, which correspond directly to queries on the derived products. Since we had divided each of the original images into 16 sectors, the queries here were regarding the sectors. Each query consisted of a condition on the band values together with a condition on the percentage of pixels satisfying it, and a sector was retrieved if its cluster features indicated that both conditions were met. Unlike the class queries, these queries were checked directly on the cluster mean values: each cluster whose mean satisfied the condition on the band values contributed its pixels to the percentage for the sector. Table 2 gives the retrieval performance for different queries by value. As can be seen, the recall performance is in general much more stable than the efficiency, which decreases as the stringency of the query conditions increases and the value ranges are made narrower. One reason for this could be that the decision was made on the cluster means without taking the spread of each cluster into account: a cluster whose mean satisfies the condition will also contain pixels whose values do not, and vice versa. The decision rule used here is very simple, and it is suggested that more sophisticated rules, which also incorporate the variance information retained in each cluster feature, could be studied to improve the performance.
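The two measures can be computed as sketched below; the set-based bookkeeping is our own framing, while the formulas are those given above.

```python
def recall_efficiency(retrieved, relevant):
    """recall = A/(A+B)*100 and efficiency = A/(A+C)*100, where A = relevant
    items retrieved, B = relevant items missed, C = items falsely retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    A = len(retrieved & relevant)
    B = len(relevant - retrieved)
    C = len(retrieved - relevant)
    return 100.0 * A / (A + B), 100.0 * A / (A + C)

r, e = recall_efficiency(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
print(r, e)  # recall ~66.7, efficiency 50.0
```

Retrieving more aggressively (a larger m in the class queries) grows A and shrinks B, raising recall, while C grows and pulls efficiency down; the two measures move in opposite directions.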
Table 2: Retrieval Performance for Different Queries by Value. Each query combined a condition on the band values (such as Band5 >= 11 or 3 < Band5) with a condition on the percentage of qualifying pixels (such as >= 20, > 20 & <= 50, or <= 5). The recall was high for all the queries (100.0, 99.58 and 95.83 percent), while the efficiency was much lower (e.g. 20.06 percent); it was possible to increase the efficiency considerably (about four times) by relaxing the query conditions.

Query (Anomaly Detection): The cluster representation can in general be used by a database management system to screen for anomalies, using statistics such as the average and variance of the cluster features over the database; the answers to such queries can point to artifacts or to previously unknown phenomena.

ACKNOWLEDGMENTS

This work was supported in part by Grant ASC-9350309 and by a Presidential Fellowship. The authors would like to thank Tim Szeliga of NOHRSC for providing the AVHRR images used in the experiments and for some help with the classification presented.
REFERENCES

[1] "The Future of Remote Sensing from Space: Civilian Satellite Systems and Applications," OTA-ISC-558, U.S. Government Printing Office, Wash., D.C., July 1993.

[2] J. Dozier, "How Sequoia 2000 Addresses Issues in Data and Information Systems for Global Change," Sequoia 2000 Technical Report 92/14, 1992.

[3] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloustos, and G. Taubin, "The QBIC project: Querying images by content using color, texture and shape," Proceedings of SPIE Storage and Retrieval for Image and Video Databases, February 1993.

[4] M. Arya, W. Cody, C. Faloustos, J. Richardson, and A. Toga, "Qbism: Extending a dbms to support 3d medical images," Proceedings of the Tenth Int. Conference on Data Engineering (ICDE), pp. 314-325, 1994.

[5] J. Barros, J. French, W. Martin, P. Kelly, and M. White, "Indexing Multispectral Images for Content-Based Retrieval," Proceedings of the 23rd AIPR Workshop on Image and Information Systems, Wash., D.C., Oct. 1994.

[6] J.V. Baglio and E.W. Holroyd, "Methods for Operational Snow Cover Area Mapping Using the Advanced Very High Resolution Radiometer," Technical Report, National Operational Hydrologic Remote Sensing Center, March 1989.

[7] J.A. Richards, Remote Sensing Digital Image Analysis: An Introduction, Springer-Verlag, New York, 1993.

[8] A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, 1992.

[9] J.C. Tilton, D. Han, and M. Manohar, "Compression Experiments with AVHRR Data," Proceedings of the Data Compression Conference, pp. 411-420, 1991.

[10] M. Manohar and J.C. Tilton, "Progressive Vector Quantization of Multispectral Image Data Using a Massively Parallel SIMD Machine," Proceedings of the Data Compression Conference, pp. 181-190, 1992.

[11] B.S. Everitt, Cluster Analysis, Edward Arnold, London, 1993.

[12] R.F. Cromp and W.J. Campbell, "Data Mining of Multidimensional Remotely Sensed Images," Proceedings of the Int. Conference on Information and Knowledge Management, Wash., D.C., 1993.

[13] A. Vellaikal, C.-C. Kuo, and S. Dao, "Content-Based Retrieval of Remote Sensed Images Using Vector Quantization," Proceedings of SPIE Visual Information Processing IV, Orlando, pp. 471-480, 1995.
N96-15457

Machine Learning for a Toolkit for Image Mining*

Richard L. Delanoy
Massachusetts Institute of Technology / Lincoln Laboratory
244 Wood Street, Lexington, MA 02173-9108
(617) 981-2545
[email protected]
ABSTRACT

A prototype computer environment is described that enables a user with very limited computer skills to collaborate with a computer algorithm to develop search tools (agents) that can be used for image analysis, creating metadata for tagging images, searching for images in an image database on the basis of image content, or as a component of computer vision algorithms. Agents are learned in an ongoing, two-way dialogue between the user and the algorithm. The user points to mistakes made in classification. The algorithm, in response, attempts to discover which image attributes are discriminating between objects of interest and clutter. It then builds a candidate agent and applies it to an input image, producing an "interest" image highlighting features that are consistent with the set of objects and clutter indicated by the user. The dialogue repeats until the user is satisfied. The prototype environment, called the Toolkit for Image Mining (TIM), is currently capable of learning spectral and textural patterns. Learning exhibits rapid convergence to reasonable levels of performance and, when thoroughly trained, appears to be competitive in discrimination accuracy with other classification techniques.
1. INTRODUCTION Vast image
databases
ers to find particular
being
images.
accumulated
For example,
today much
are overwhelming
of the data currently
and civilian sensors are recorded and never accessed, image search. And the problem is getting worse, given planned
and deployed.
levels of computer debate [1,2]. Computer tadata
expertise
assistance
(key words
to locate
images While
these
data can be made
and hardware
will become
and computed during
resources
increasingly
indices
for some applications.
lected over Florida threshold.
How
very useful,
metadata
available
from military
of a continuing
to facilitate image
with varying
cloud index
and impractical
national
such searches.
as a whole),
one might ask for Landsat
a computed
is insufficient
to researchers
is the subject
necessary
For example,
obtained
of us-
in part because of the difficulty of the fleet of imaging satellites being
used to label each
June 1993 in which
the capabilities
images
was greater for many
Me-
can be used col-
than some other search
problems, such as finding images of moveable targets in military applications, finding tropical storms in weather satellite images, prospecting for mineral deposits, locating potential archeological
sites, or finding
a face in images
generated
by an airport
surveillance
camera.
* The work described here has been supported by the United States Air Force. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Air Force.
For these and many other applications, the only solution is to search the image contents directly. Until recently, the only means for searching through image data on the basis of content has been customized algorithms designed to search for a particular signature. While these are often effective, such algorithms are expensive to create and maintain, suitable only for limited applications, and require a computer vision expert to build and maintain.

An active area of current research and product development has been the construction of generalized database retrieval and classification tools with user-friendly environments for users with no experience in computer vision. These tools generally allow the user to provide a single example, be it a drawn shape or a box enclosing a subimage of pixels. The algorithm then attempts to characterize the example as a set of attributes (e.g., edges, spectral characteristics, eigenvalues) and compares these attributes with those computed for images in the database. A "distance" calculation comparing the attributes of the example with those of images in the database is computed and used as the criterion for selection. While capable of attention-getting demonstrations, these tools have exhibited only fair performance and have been applied only to small databases in limited domains. Examples of this class of tools are QBIC, developed by Niblack, et al. at IBM [3] and Multispec, developed by Landgrebe and Biehl at Purdue University [4].

The main shortcoming with this approach is that of using only a single example. The algorithm cannot decide which attributes or combinations of attributes are useful for discrimination and therefore uses them all. There is some chance that the attribute values associated with a single example are discriminating, and a user might get good discrimination results. But, when the user is not getting the desired results, there is no way to make the agent any better other than to try a different example. A more ideal algorithm would be one capable of learning from mistakes, incorporating not only examples of the object but also counter-examples of false alarms. From these, an algorithm can learn which attributes are discriminating.

Using tools of knowledge-based signal processing developed at Lincoln Laboratory [5,6,7], a prototype Toolkit for Image Mining (TIM) has been implemented that learns which attributes are discriminating from a user simply pointing to mistakes of classification. This paper describes the intended look and feel of a proposed environment, along with details of the underlying techniques of machine learning used to implement the TIM prototype. An evaluation of the technique and a discussion of possible applications is also presented.
2. THE TOOLKIT ENVIRONMENT

The premise of TIM is that a user, having found one image containing some object or region, would like to find other such images in a large database. An agent is generated in a collaborative process in which the user indicates mistakes of omission and commission to TIM, and TIM responds with a visual representation of its current understanding of the user's intentions. When finished, the agent can be used in a variety of ways, including autonomously searching through large databases.

Figure 1. Example session of the TIM prototype. On the left is one of 3 infrared bands of an image of a town. Frames labelled SW, MW and LW are the short wave, medium wave and long wave bands of the subimage at tile 40 (near the center of the image on the left), which contains two bridges crossing over a river. The frame marked INTEREST is the result of applying an agent constructed for the detection of water from the set of example and counter-example pixels supplied by the user. These pixels are marked on frame LW in blue and yellow, respectively (appearing as black pixels in the river and white pixels along the highway at lower left in the gray-scale rendering shown here).

The first step of an interactive session with TIM would be to fill in information about the sensor and the data available. This information can include filenames of images to be used for training. Menus would provide options for those attributes that the algorithm is to consider as candidates for learning. For example, if the data source is multispectral, then the value of each spectral band and the value of each ratio of bands might be considered as the set of candidate attributes. Following initialization, the main user/machine interface is a screen that would look similar to Figure 1. The screen presents an image consisting of several spectral bands (in this case long, medium and short wave infrared bands) and a feedback image called an interest image (right). With the left mouse button, the user clicks on pixels that are examples of the objects or regions to be sought. With the right mouse button, the user indicates counter-examples of objects or regions that should be ignored. With each input by the user, the algorithm builds a new template encoding the current understanding of the user's intentions, applies the template to the various bands of the input image, and displays the resulting interest image. The process repeats, using the same or other images, until the user is satisfied with the classification performance.

Once satisfied, the user can apply the agent to database searches. The user might specify existing metadata (time, place, or sensor) that might help constrain the search. In addition, the user would select from a menu some criterion for image retrieval. For example, the user might request that the agent retrieve any image for which the average interest value is greater than some threshold. The user explores the images retrieved by the agent to assess performance and to look for interesting object or region variants found by the agent. At this point, the user has the option of reentering TIM to further refine the agent.

3. MACHINE LEARNING
The core technology of TIM is a new machine learning algorithm, based on techniques of knowledge-based image processing, in particular functional template correlation (FTC), developed at Lincoln Laboratory. Functional templates are generalized matched filters, based on fuzzy set theory, that can be used to create customized, knowledge-based image processing operations [5,6].

FTC can most quickly be understood as a simple extension of autocorrelation. Where the kernel of autocorrelation consists of literal image values (essentially a subimage of the image to be probed), the kernel used in FTC is a set of indices that reference scoring functions. These scoring functions encode a mapping between image pixel values and scores to be returned for each element of the kernel. High scores are returned whenever the input image value falls within the fuzzy limits of expected values. Low scores are returned whenever the input value falls outside these limits. The set of scores from the scoring functions are averaged and clipped to the continuous range (0,1). As in autocorrelation, if the feature being sought can vary in orientation, the maximum average score computed across multiple orientations is returned as the match of the kernel.

Scoring functions allow the encoding of ranges of acceptable values, thereby encoding uncertainty. In addition, scoring functions can be used to encode physical properties (knowledge) of the object being sought. As a result, functional templates generally tend to be more powerful than their standard, generic analogs. Functional templates have been used in a variety of ways, including 2-D shape matching, edge detection, thin line detection, thinning, pattern matching, morphology, convolution, homotopic transformations, and data fusion. FTC has been successfully applied to several computer vision applications, including automatic target recognition [8,9] and the automatic detection of hazardous wind shears in Doppler weather radar images for the Federal Aviation Administration (FAA) [7,10,11].

In order to probe multispectral images, scoring functions are constructed for each spectral band. In this case, each functional template consists of a series of kernels, each having a single element with an associated, unique scoring function. Transformations of multiple bands into a single value, such as the ratio of two bands, might also be considered as inputs and likewise have a scoring function. In addition, texture can be decomposed into primitives (e.g., local variance, fractal dimension, contrast, periodicity, orientation) that can be encoded as pixel values representing a local neighborhood. The scoring functions for each spectral band, spectral transformations, and textural transformations would all be applied in tandem, producing a single average score (i.e., interest value) for each pixel of the output interest image.
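As an illustrative sketch of FTC as described above (not the Lincoln Laboratory implementation), the fragment below evaluates a kernel whose elements reference fuzzy scoring functions rather than literal pixel values, then averages the element scores and clips to the range (0,1). The trapezoidal scoring-function shape and all names here are assumptions made for the example:

```python
import numpy as np

def trapezoid_score(v, lo, hi, soft):
    """Fuzzy scoring function: 1.0 for values inside [lo, hi], ramping
    linearly to 0.0 within +/- `soft` of the limits (an illustrative
    choice of shape, not the paper's exact formulation)."""
    v = np.asarray(v, dtype=float)
    rise = np.clip((v - (lo - soft)) / soft, 0.0, 1.0)
    fall = np.clip(((hi + soft) - v) / soft, 0.0, 1.0)
    return np.minimum(rise, fall)

def ftc(image, kernel):
    """Functional template correlation over one band.  Unlike plain
    autocorrelation, each kernel element ((dy, dx), score_fn) references
    a scoring function instead of holding a literal image value."""
    out = np.zeros_like(image, dtype=float)
    for (dy, dx), score_fn in kernel:
        # shift so the pixel at offset (dy, dx) lines up with the origin
        shifted = np.roll(np.roll(image, -dy, axis=0), -dx, axis=1)
        out += score_fn(shifted)
    out /= len(kernel)              # average the element scores...
    return np.clip(out, 0.0, 1.0)   # ...and clip to the range (0,1)
```

Handling orientation variation would, per the text, take the maximum average score over rotated versions of the kernel; that loop is omitted here for brevity.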
Given this structure for spectral and textural functional template correlation, functional template learning is implemented by constructing scoring functions for each band value or computed attribute. Briefly, the approach we have taken is based on the construction of distribution histograms, one for example and another for counter-example values of each attribute. The scoring function is a mapping of the degree of overlap between examples and counter-examples at each bin of the two histograms. For example, high scores are returned for input values that have many examples and few or no counter-examples. Low scores are returned for values that have many counter-examples and few or no examples.

The N most discriminating scoring functions are collected by iteratively building test templates that are applied to the set of examples and counter-examples. First, the single scoring function is found that does the best job in discriminating between examples and counter-examples. Then, the second scoring function is found that, when combined with the first scoring function in a single functional template, achieves the best discrimination. Additional scoring functions are added until either N scoring functions have been collected or until the addition of any other scoring function fails to achieve an incremental gain in performance.

Figure 2. Input image and interest image with selected subimages resulting from the application of the functional template taught to recognize water.

4. EXAMPLES AND EVALUATION

Agent training and use are illustrated with two different examples. In the first, 3-band (long wave, medium wave, and short wave) infrared images of urban and suburban scenes were used. Images were divided into 50 by 50 pixel tiles and trained on one tile at a time. Figure 1 shows the end of training an agent (functional template) to detect water. Figure 2 shows the application of this agent to the overall image, resulting in the interest image on the right. The white boxes enclose tiles that have average interest values exceeding a selected criterion
threshold. In the same way that an agent might select tiles from a large image, the user might use the agent with a selection criterion to select images from an image database.

A more difficult task is the classification of vegetation types from Landsat Thematic Mapper (TM) images. Huber and Casler [13] have described work in which they evaluated a variety of simple classification metrics and formulas developed by wildlife biologists. Their area of study, a region in central Colorado, was classified pixel-by-pixel into 14 vegetation types. Actual ground data were collected and used for scoring. They concluded that "using remote sensing information in complex terrains can be a difficult and relatively inaccurate process". With no technique did the percentage of correct classifications exceed 50%. When they added digital elevation data to determine the direction in which terrain was facing, they reached nearly 70% detection, although with substantial numbers of false alarms.

Using the same TM data, Augusteijn et al. [14] used a Cascade-Correlation neural net (an enhanced version of back-propagation) [15] to classify homogeneous regions on the basis of spectrum and texture. The data chosen by them for training and testing consisted of 8X8 pixel boxes containing only one vegetation type. Of the 14 vegetation types, only 9 were available in boxes of this size. The task was to train the neural net on half of these boxes and then attempt to identify vegetation types in the remaining boxes. Using this approach, they were able to achieve 99% correct classifications using only spectral data.

The same TM image was used to perform a preliminary evaluation of functional template learning. Using the TIM environment, agents were trained to identify each of the vegetation types in 100 X 100 pixel subimages. Convergence to a solution was typically rapid, with reasonable detection performance (e.g., 80% detection with 5% false alarms) in as little as 50 pixel inputs. The average classification accuracy for 13 agents (1 vegetation type was not present in the training image) was 94% correct detections and 6% false alarms on the subimages used for training. The templates were then applied to adjacent images, the results were scored, and the process continued. After several images were used in this fashion, the agents reached an asymptote of average performance of 90% correct detections with 6% false alarms, although not every template was thoroughly trained.

These results are clearly better than those reported by Huber and Casler, even when they incorporated topographic data in addition to the 6 TM bands. It is more difficult to compare our results with those presented by Augusteijn et al. Where the agents generated using TIM were forced to classify every pixel, the previous study attempted only to classify homogeneous 8X8 pixel patches. Many, if not most, of the classification mistakes made by TIM were in the boundaries between regions. These boundaries often have transitional, ambiguous spectral characteristics that are likely to be misclassified by the human truthers as well as by TIM.
5. DISCUSSION

We have designed a way for a user and a machine to collaborate in the training of agents to recognize particular spectral or textural patterns. Being provided with a set of examples of the intended pattern and counter-examples of patterns to be ignored, the learning algorithm creates a functional template of the intended pattern. The trained agent would then be sent into a large database to autonomously search for other images containing the same pattern. As mistakes are made, the user can continue to refine the agent by adding more examples and counter-examples. Because functional templates have a simple, readily interpretable representation, it is possible to extract how each agent discriminates and to edit learned agents to achieve predictable changes in performance. It is believed that these techniques can be extended to the learning of shapes.

Customized algorithms dedicated to a single application may well outperform agents generated using TIM. However, this approach is expensive and therefore impractical for most potential users of image databases. The interactive training of agents using the TIM environment would allow anyone with an understanding of a particular image domain to build reasonably good search and classification tools.

Agents generated by TIM might be used for several different applications other than content-based image retrieval, including (1.) discovering equations for indices that can be used as metadata to tag database images, (2.) discovering which spectral bands are necessary for a particular task when bandwidth limitations prevent the transmission of all data collected, (3.) prioritization of data (e.g., should data be stored on primary, secondary, or tertiary storage devices) on the basis of image content, (4.) building components for more complex algorithms, and (5.) quantitative analysis and trend analysis for scientists and policy planners.

It is likely that a tool such as TIM, by enabling users with limited computer experience to inexpensively build useful computer vision tools, should facilitate the exploration and use of image databases.
REFERENCES

1. NASA's EOSDIS development approach is risky. U.S. General Accounting Office Report GAO/IMTEC-92-24, February 1992.

2. Broader involvement of the EOSDIS community is needed. U.S. General Accounting Office Report GAO/IMTEC-92-40, May 1992.

3. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. The QBIC project: Querying images by content using color, texture, and shape. Computer Science, February 1993.

4. D. Landgrebe and L. Biehl. An Introduction to MultiSpec. School of Electrical Engineering, Purdue University, 2.15.94 edition, February 1994.

5. R. L. Delanoy, J. G. Verly, and D. E. Dudgeon. Functional templates and their application to 3-D object recognition. In Proc. of the Intl. Conf. on Acoustics, Speech, and Signal Processing, San Francisco, CA, March 1992.

6. R. L. Delanoy and J. G. Verly. Computer apparatus and method for fuzzy template matching using a scoring function. U.S. Patent No. 5,222,155, June 1993.

7. R. L. Delanoy and S. W. Troxel. Machine intelligent gust front detection. Lincoln Laboratory Journal, 6(1):187-212, Spring 1993.

8. R. L. Delanoy, J. G. Verly, and D. E. Dudgeon. Machine intelligent automatic recognition of critical mobile targets in laser radar imagery. Lincoln Laboratory Journal, 6(1):161-186, Spring 1993.

9. J. G. Verly, R. L. Delanoy, and C. H. Lazott. Principles and evaluation of an automatic target recognition system for synthetic aperture radar imagery based on the use of functional templates. Automatic Target Recognition III, Proceedings of SPIE, Vol. 1960, Orlando, FL, 14-16 April 1993, pp. 57-71.

10. R. L. Delanoy and S. W. Troxel. Automated gust front detection using knowledge-based signal processing. In IEEE 1993 National Radar Conference, Boston, MA, April 1993.

11. M. M. Wolfson, R. L. Delanoy, M. C. Liepins, B. E. Forman, and R. G. Hallowell. The ITWS microburst prediction algorithm. Preprints, 5th Conference on Aviation Weather Systems, Vienna, VA, August 1993.

12. R. L. Delanoy and R. J. Sasiela. Machine Learning for a Toolkit for Image Mining. Technical Report 1017, MIT Lincoln Laboratory, March 1995.

13. H. P. Huber and K. E. Casler. Initial analysis of Landsat TM data for elk habitat mapping. Int. J. Remote Sensing, 11(5):907-912, 1990.

14. M. F. Augusteijn, L. W. Clemens, and K. A. Shaw. Performance evaluation of texture measurements for ground cover identification in satellite images by means of a neural network classifier. Technical Report EAS-CS-93-8, University of Colorado, Colorado Springs, CO, October 1993.

15. S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2. Morgan Kaufmann, San Mateo, CA, 1990.
N96-15458

Dissemination of compressed satellite imagery within the Navy SPAWAR Central Site Product Display environment*

Oleg Kiselyov and Paul Fisher
Computer and Information Sciences, Inc.
303 N. Carroll, Suite 108, Denton TX 76201
Phone: (817) 484-1165, FAX: (817) 484-0586
Email: [email protected], [email protected]
http://replicant.csci.unt.edu/~oleg/SIMDC95/Paper_Summary.html
Abstract

The paper presents a case study of the integration of compression techniques within the satellite image communication component of an actual tactical weather information dissemination system. The paper describes the history and requirements of the project, and discusses the information flow, request/reply protocols, error handling, and, especially, system integration issues: the specification of compression parameters and the place and time for compressor/decompressor plug-ins. A case for non-uniform compression of satellite imagery is presented, and its implementation in the current system is demonstrated. The paper gives special attention to the challenges of moving the system towards the use of standard, non-proprietary protocols (smtp and http) and new technologies (OpenDoc), and reports on the ongoing work in this direction.
I. History of the project
The Central Site Product Display (CSPD) system has been developed for the Space and Naval Warfare Systems Command (SPAWAR) for disseminating a whole range of weather and oceanographic information to a variety of clients afloat or ashore [1]. The clients (not necessarily related to the Navy) can request/receive data via serial (dial-up), Internet and VSAT connections. The available information includes pressures, temperatures and wind directions on and above land/ocean surfaces, barometric heights, temperatures on the ocean surface and at specific depths, precipitation, primary observational data (weather reports), storm warnings, and a number of satellite images of different kinds (IR, Visual, DMSP). The satellite images are usually 512x512 or 1024x1024 8-bit graylevel pictures. CSPD is being developed upon the foundation of a very popular software system, NODDS, built by the Fleet Numerical Meteorology and Oceanography Center (FNMOC) at Monterey, CA. NODDS has been in operation for more than 4 years. The server part of the system runs on a Cyber mainframe. It was written in FORTRAN, with modules dating as far back as 1987. The client portion of NODDS was written in QuickBASIC for PC-DOS. The hardware base of the old system is obviously in need of updating, as is the software. The new CSPD server has to work on modern-day workstations (HP 9000, Sun SPARC, Concurrent's MAXION RISCstation) while the clients are to run on almost any platform: UNIX workstations, Macintosh and PC computers. Furthermore, there is a clear need to expand the system to include more specialized weather products, to enhance display of received information, and to decrease communication time. Improving satellite imagery communication is of special priority, especially as far as flexibility of requesting and compressing images is concerned.
The new system should allow the user to browse and request all available satellite imagery (rather than a few pre-selected pictures), to choose a compression algorithm and its parameters, and to evaluate the anticipated image degradation when a lossy compression has been selected. The user should also be able to specify a non-uniform compression of certain areas/features that he considers particularly important. Again, the rationale for using compression is to reduce the communication time.

* This work was supported in part by US Navy SPAWAR Grant N00039-94-C-0013 "Compression of Geophysical Data" and by US Army Research Office TN 93-461, administered by Battelle Research Office.
Computer and Information Sciences, Inc. (CIS) has been working on this project as one of the contractors. CIS is solely responsible for handling of satellite imagery within CSPD. We have written the software to compress/decompress satellite images using technology developed at CIS, as well as code to convert satellite imagery from DEF to TIFF formats, to view images under CSPD and standalone, and to catalog imagery. We have also completely re-written the server part of CSPD in C++, and perform its maintenance and enhancement.
II. Request-delivery information flow and protocols

The current CSPD system is client-server, though there are capabilities for a broadcast-type dissemination. A NODDS server distributes weather and satellite information in response to queries from CSPD clients. The server uses NODDS/TEDS databases (managed by the Empress database manager) as well as a collection of flat files to look up the requested data. The net functionality of the server is as follows: reading a NODDS input (request) string, retrieving grid data from a TEDS database, encoding them using Navy encoding routines (rainform-gold or OTH-gold), and transmitting the encoded data to the requester. Text files, synoptic and upper air reports, and satellite image requests are handled as well.
To formulate a request, a user has to select a geographical area he would like to get weather information for, and pick the products he is interested in for the selected area. Satellite imagery is treated as a regular 2D product. A request for a satellite image always carries in it specifications for a compression method to use, and its parameters. Originally they are set to some "default" compression specifications, which can be evaluated and changed by the user using the compression manager (see below). Whenever the server comes across a request for a satellite image, it queries the database to see if an image of the corresponding type (IR, visual, etc.) covering the specified area (and valid for a specific period of time) is present in the database. If the image exists, the server obtains the name of the corresponding TIFF file and launches a compressor plug-in to compress the file according to the parameters specified in the request. If the compression was successful, the resulting binary code is included in the return packet. Otherwise the server sends a message detailing the error (as a matter of fact, the entire contents of stderr produced by the compressor). When the client receives the reply, it unpacks and preprocesses it. At this time, the client runs a decompressor plug-in, which decodes the received binary code into a PGM file. The latter is used by a display module to show the picture and, possibly, lay temperature, pressure, etc., isoplots and other information over the satellite image. It should be stressed that the compressor and decompressor are separately compiled and run executables (plug-ins), which are launched (spawned) whenever there is a need to encode/decode image data. The modules are available for a variety of platforms, and are rather versatile: they support a very simple and unified interface. The fact that request/reply parsing and compression/decompression are performed by separate executables has several advantages.
Indeed, one can modify the compression module (say, by adding more sophisticated compression techniques) without any need to recompile/relink client or server executables. Moreover, the compressor can be run by programs other than the CSPD server, as the functionality the plug-in provides - image compression - can be used in some other contexts, for example, within WWW environments. Another important advantage of having compression/decompression modules as separately run plug-ins is better error handling and improved robustness to error conditions. The root NODDS server module can only detect very simple errors, mainly related to the syntax of a request string. Most semantic errors - for example, when a requested image file cannot be found or read, when there is no space on device for the encoded file, when the specified compression parameters do not make sense, etc. - are detected when actually performing the compression. When the server root module launches the compressor through the system() function, it redirects the compressor's stdout/stderr into a temporary file. If the compressor succeeded, it writes "OK" into a special file and creates a file with the compressed image data. The latter is included by the NODDS server in its output stream, PACKED.DAT (following appropriate headers). UU-, Base64-, etc., encoding can also be performed if necessary at this stage. If the compressor failed, its standard error stream (which explains in detail the compressor's actions and the reasons of failure) is inserted into PACKED.DAT as a special message to the client. In any event, a problem with locating/compressing an image would never cause the NODDS server to crash, and the user will always get some answer to his query. As was mentioned before, we have operational compressor/decompressor plug-ins for a variety of platforms: Sun Sparc, HP 9000/7xx, Macintosh. The modules are rather versatile and independent: they can be run as "filters" that translate their input into output (that is, read a TIFF/PGM/XWD file and make a compressed output, or read the compressed image code and prepare a TIFF/PGM/XWD file). This makes it straightforward to integrate the plug-ins into various environments. For example, recently we have cast the decompressor modules as Web browser helpers, with the help of a very simple shell script. This made it possible to view the (compressed) satellite images retrieved from the NODDS/TEDS databases through Mosaic/Netscape. That is, clicking/linking on an item of type image/x-sowlp-gray (extension .slpg) causes the compressed image to be delivered, unpacked, and displayed.
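The spawn-and-capture pattern described above can be sketched in Python (the actual server invokes its plug-ins through the C system() call; the plug-in name and argument order used here are assumptions for illustration):

```python
import os
import subprocess

def run_compressor(cmd, src_path, dst_path):
    """Launch a compressor plug-in as a separately compiled executable
    and capture its stderr.  On success the compressed code is returned
    for inclusion in the reply packet; on failure the stderr text becomes
    the detailed error message that is sent back to the client, so a bad
    request never crashes the server."""
    proc = subprocess.run([cmd, src_path, dst_path],
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          text=True)
    if proc.returncode == 0 and os.path.exists(dst_path):
        with open(dst_path, "rb") as f:
            return True, f.read()   # compressed code for the reply packet
    return False, proc.stderr       # error message for the client
```

Because the plug-in is just an external executable, the same call works unchanged whether the compressor is invoked by the CSPD server, a Web server, or a mail handler.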
From the user's point of view, it looks no different than retrieving a JPEG picture. It is even possible to play with different compression ratios: one merely needs to type the value of a quality parameter in a specified slot of the web page [2] and click on the "Compress Now" button. Having received the request, an HTTP server compresses the image and sends the compression ratio, compression time, and the compressed data back to the client. The user thus can note how long it took to communicate the image, and see how good the decompressed picture turns out to be. This interactive compression demo is available from [2]; please contact the authors for decompression helpers. Since http and smtp both use the same MIME protocol to deal with non-textual data, the decompressor modules-helpers can be used just as they are to view encoded images sent through e-mail, that is, asynchronously. The same viewer that helps out Mosaic when an image/x-sowlp-gray stream comes in can be used to make sense of an e-mail message with an image/x-sowlp-gray "attachment". The mailers take care of ASCII encoding if necessary, and they do it transparently and automatically. Using electronic mail to send out satellite imagery has a tremendous advantage: one is no longer limited to TCP/IP or any other particular communication protocol or media; one can send mail from a FIDONET site to a mainframe, from a PC on a Novell network to a Mac, etc. One can also take full advantage of all e-mail frills and embellishments, like mailing lists, automatic broadcasting, mail aliases, mail exchangers, etc.
III. The compression manager
The NODDS server always compresses satellite images for transmission. The compression method and its parameters are taken from a satellite product request string received from a client. When the user first selects a satellite product in formulating a request, the corresponding request string receives some "default" compression specifications. It is the compression manager that lets the user customize the default settings. Thus the place of the compression manager can be illustrated by the following chart on Figure 1. The compression manager allows interactive selection of a compression method and its parameters. Note, the compression is almost always lossy. To give the user an idea how well the delivered
image would look like and whether the important features the user is looking for would be preserved, the compression manager loads up some (relatively small) sample image and displays it along with the compressed-and-decompressed picture, using the parameters the user has set up. Figure 2 shows a snapshot of the component manager screen. The new version of the compression manager supports non-uniform compression as well; see below. At present, the compression manager implements a few variants of a modified Laplacian pyramid compression method (multi-resolutional analysis with overcomplete frames) discussed in papers presented at Data Compression Conferences [3], [4]. See also [5] for a complete list of references.

IV. Non-uniform compression
The very nature of environmental images, or any image for that matter, suggests that not every detail of the picture is equally important to the observer. For example, the area of the hurricane eye on a satellite image should be of high resolution, while the tight cloud cover of the hurricane body is less informative and may be rendered with a lower resolution, though it cannot be completely discarded. In disseminating weather information over a fleet, a meteorologist at a particular ship needs very accurate data on the cloud cover, wind direction, temperature, etc., just in the vicinity of his ship. The information about what is going on outside that small area is used for prognosis and does not have to be of very high precision. Accordingly, the amount of loss and inaccuracy that can be tolerated during the communication varies not only from one user to another but also from one region of the image to another. This raises the problem of a non-uniform, lossy compression, i.e., compression where the loss varies with the location/features/frequencies, etc., and of tailoring such compression to a particular user and circumstances. Preserving the information during compression to the extent the user needs, but not more, helps to drastically reduce the amount of data that has to be transmitted. In the present implementation of non-uniform compression [3], the regions of special interest are specified by assigning to them a degree of relative importance, using an "alpha-channel", or a "mask". The mask is an 8-bit image of the size of the original one, with pixel values telling the relative weight of the corresponding area: the bigger the weight, the higher the importance. Note that the mask is necessary only for compression; a non-uniformly compressed image code has the same format and structure as a uniformly compressed one. That is, non-uniform compression does not require a special decompression; the same decoder restores both kinds of compressed imagery.
To simplify transmission of the mask in a request to the server, we, at present, have limited specification of the importance area to a single rectangular region. Therefore, only a few extra characters in the request string are necessary to solicit the non-uniform compression. For example, a specification [10,(1,2)-(127,255)] means that a rectangular area of an image with the upper left corner at (1,2) and the lower right corner at (127,255) is to be assigned a weight of 10 during the compression. The rest of the image (outside this rectangle) is assigned weight 1 by default. Thus, features within the rectangular area would incur 10 times less distortion than features outside the rectangle. Note, xmin, ymin, xmax, and ymax are specified in pixels of the image to compress; therefore, one has to have some idea of the size of the target image before specifying the non-uniform compression parameters.
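A parser for this request syntax might look like the following sketch; the function name is hypothetical, and only the bracketed format itself is taken from the text:

```python
import re

# Matches specifications such as "[10,(1,2)-(127,255)]",
# tolerating incidental whitespace inside the brackets.
SPEC = re.compile(
    r"\[\s*(\d+)\s*,\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)"
    r"\s*-\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*\]"
)

def parse_importance_spec(spec):
    """Return (weight, (xmin, ymin), (xmax, ymax)) from a request string."""
    m = SPEC.fullmatch(spec.strip())
    if m is None:
        raise ValueError("bad non-uniform compression spec: %r" % spec)
    weight, x0, y0, x1, y1 = map(int, m.groups())
    return weight, (x0, y0), (x1, y1)

print(parse_importance_spec("[ 10,( 1,2)-( 127,255)]"))
```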
Fig. 1. Compression Manager's place in the overall information flow

Fig. 2. Snapshot of the Compression Manager screen
Note that the region of low distortion is blended smoothly into the somewhat more distorted "background", without any sharp and conspicuous boundaries between the areas of different importance. Thus the picture restored after decompression does not have a "cut-and-paste", or "patched", look at all. These smooth transitions are a notable feature of the present non-uniform compression method.

The compression manager discussed above can also be used to select a region of special interest, assign a degree of importance to it, and preview the effect of the compression on a sample image:
Fig. 3. Snapshot of the Compression Manager - Change Compression Params screen, with preview of non-uniform compression
Note that the compression ratio shown on the picture is significantly underestimated. The sample satellite image is subsampled to be as small as 128x128. It is well known that smaller images are tougher to compress. In our experience, when the same method is applied to a full-size 512x512 image, the compression ratio would be at least 4 times as much.

V. Challenges and future directions
The new version of the CSPD clients and server is being phased in. However, the system, and especially its satellite image delivery part, is under further development. The challenges here are:

- Allowing interactive browsing of new imagery and letting a client know that the server has acquired a new image available for retrieving;

- Keeping local/customary configurations (profiles) of clients on the server, which would simplify the request protocol;

- Automatic delivery of weather products: each client receives (without an explicit request) periodical updates of information which it "usually" asks for. The information is also tailored to the current location of clients: for example, the region of special importance in non-uniform compression changes as the client moves;

- Binding of an image, that is, pinning the picture down to the exact area on the globe it pertains to;

- Overlay of relevant geographical and meteorological information on a full-gray-scale satellite image.
Another challenge is integration of the request/delivery of satellite imagery and other grid products within http/smtp protocols: developing weather-conscious Web servers/browsers. One possible scenario: the user starts with a map of the entire world or some predefined region, as in the Xerox Map server [6]. The user can zoom in/out, etc., until he nails down precisely the area he is interested in. Then he can select weather products from a list of products and/or a list of product groups. A click on the submit button brings, after some communication delay, a familiar "weather map" view. Note, the server may have already recognized the user (or his site), so it can present him with a hot-list of his personal requests, that is, requests which were previously defined/asked for. Thus the user can skip the area definition and communication, and simply say, "I want my usual". Also, it is very easy for the server to determine if there are new products added to the database since the last time the user talked to the server. The server then can make a list of the new products on the user's request. Since the server knows exactly whom it is talking to, it may also hide some products which are "for your eyes only".
It should be stressed that the scenario above looks very similar to the existing CSPD/NODDS interface, on the surface of it. The challenge now is the employing of standard tools such as Mosaic, Netscape and httpd, and standard protocols, instead of proprietary clients and proprietary communication engines. Another difference is that the product lists and other configuration information are now stored on the server, not on a client site. Therefore, the information always stays current, and there is no need to waste bandwidth broadcasting product updates to all clients, even those who have already received updates, or who are not interested. One more difference of the proposed scenario is that the defined areas are also stored on the server (as a site "profile"). This arrangement cuts down on the communication time: the only information sent between the client and the server during the initial dialog is just the names of the cataloged areas/profiles. There is also no need to send the whole list of products being requested to the server, very often all the same list, over and over again: the list of products (with all needed configuration information and preferences) is already on the server. Keeping it there also helps in billing (should such a feature be necessary) and accounting. Using standard Web browsers helps in one more respect: caching. Many Web browsers, Netscape for one, do their own caching, and do it quite efficiently and customizably. One can tell Netscape where to keep the cache, how often to purge it, and how long to keep data in it (for a single session only, or for a certain number of days, etc.). We are also looking into using a new generation of Web browsers based on the OpenDoc architecture, for example, CyberDog.

References
1. http://www.metoc.navy.mil/metoc/
2. http://mozart.compsci.com/pictures/sowlp_compress.html
3. Kiselyov, O. and P. Fisher, "Wavelet Compression with Feature Preservation and Derivative Definition," in Proc. DCC'92, 1992 Data Compression Conference, Snowbird, Utah, p.442, March 24-27, 1992.
4. Kiselyov, O. and P. Fisher, "Pyramidal Image Decompositions: A New Look," in Proc. DCC'93, 1993 Data Compression Conference, Snowbird, Utah, p.479, March 30-April 2, 1993.
5. ftp://replicant.csci.unt.edu/pub/oleg/README.html or http://replicant.csci.unt.edu/-oleg/ftp/
6. http://pubweb.parc.xerox.com/map/
N96-15459

Data Management and Scientific Integration within the Atmospheric Radiation Measurement Program

Deborah K. Gracio, Larry D. Hatfield, Kenneth R. Yates, and Jimmy W. Voyles
Pacific Northwest Laboratory, Richland, Washington

Joyce L. Tichler and Mark J. Laufersweiler
Brookhaven National Laboratory, Upton, New York

Richard T. Cederwall and Martin J. Leach
Lawrence Livermore National Laboratory, Livermore, California

Paul Singley
Oak Ridge National Laboratory, Oak Ridge, Tennessee

1. Introduction
The Atmospheric Radiation Measurement (ARM) Program has been designed by the U.S. Department of Energy with the goal to improve the predictive capabilities of General Circulation Models (GCMs) in their treatment of clouds and radiative transfer effects. To achieve this goal, three experimental testbeds were designed for the deployment of a variety of instrumentation. Each site, known as a Cloud and Radiation Testbed (CART), consists of a highly redundant data system that will collect observational data on atmospheric processes used to drive the GCMs. The first CART site was deployed in April 1992 in the Southern Great Plains (SGP), Lamont, Oklahoma, with the other two sites to follow in the Tropical Western Pacific in September 1995 and in 1997 on the North Slope of Alaska. Approximately 400 MB of data per day is transferred via the Internet from the SGP site to the ARM Experiment Center at Pacific Northwest Laboratory in Richland, Washington. The Experiment Center is central to the ARM data path and provides for the collection, processing, analysis and delivery of ARM data. Data are received from the CART sites from a variety of instrumentation, and from external data sources. A primary requirement of the ARM Experiment Center is to provide data streams and derived data products to the ARM Science Team in near real-time on a continuous basis, while providing a three-month running archive of the ARM data. Another requirement is to preserve and protect all data produced or acquired. This function is performed by the ARM Archive, developed at Oak Ridge National Laboratory, where leading edge technology is employed for the long-term storage of ARM data. The ARM Archive provides access to data for participants outside of the ARM Program.

The ARM Program involves a collaborative effort by teams from various DOE National Laboratories, providing multi-disciplinary areas of expertise. This paper will discuss the collaborative methods in which the ARM teams translate the scientific goals of the Program into data products. By combining atmospheric scientists, engineers, systems engineers, and software engineers, the ARM Program has successfully designed and developed an environment where advances in understanding the parameterizations of GCMs can be made.
2. History

Planning for the ARM Program began in the fall of 1989; a description of the initial program is available in the ARM Program Plan (U.S. Department of Energy 1990). The technical approach of the ARM Program and the design of the CART sites is discussed in more detail by Stokes and Schwartz (1994).
The design of the CART data systems was part of the initial program plan (Melton et al. 1991). The plan called for a distributed environment, the CART Data Environment (CDE), which included an Archive, an Experiment Center, and a collection of facilities for the collection and processing of data, distributed across all of the CART sites. The CDE was to be implemented "based on use of existing solutions wherever possible, evolutionary implementation of functionality, and parallel implementation of independent subsystems" (Melton et al. 1991). Figure 1 shows the flow of data through the CART Data Environment.

Figure 1: ARM data flow through the CART Data Environment
The Southern Great Plains (SGP) CART site was the first to become operational, with instrument deployment beginning in April 1992. Figure 2 is a map of the SGP site and the instruments deployed there. The site is slightly larger than a GCM grid cell, and is approximately 350 km (N-S) by 250 km (E-W). The central facility contains the greatest number and variety of instruments. The principal objective of these experimental testbeds is to quantitatively describe the spectral radiative energy balance profile within a wide range of meteorological conditions. Measurements are taken at a finer spatial resolution than GCMs utilize. The purpose for these measurements is to improve the parameterizations of sub-grid scale processes for use in GCMs. The sites will be operated for 7-10 years, continuously collecting measurements of atmospheric radiation and associated atmospheric and surface properties (Stokes and Schwartz 1994).
3. How the Technical Infrastructure Facilitates the Scientific Activities of the ARM Science Team

Interaction between the Data and Science Integration Team (DSIT) and the ARM Science Team (ST) is a primary mechanism that guides data collection and management within the ARM Program. This interaction leads to the translation of scientific goals and objectives into data requirements that define experiments for individual ST members. Figure 3 is a conceptual diagram of the science-related functions of the ARM technical infrastructure. The flow illustrated in the figure begins with a scientific dialogue between the ARM scientists,

Figure 2: SGP CART site with Instrument Locations

Figure 3: Scientific Functions of the ARM Infrastructure (science needs are translated into data requirements; ARM scientific objectives include relating radiative fluxes to temperature, water vapor, clouds and surface properties, and developing and testing GCM parameterizations for water vapor, clouds, and surface properties)
collectively or individually, and the ARM infrastructure to define specifically what data are needed to accomplish the stated scientific objectives. From this point, the infrastructure implements the actions necessary to provide the required data to the scientists. Scientific feedback from users of the data is a key feature for improving the quality and usefulness of the data.

ARM is unique in its approach of interacting proactively with funded scientists. The interactive process between the DSIT and the ST is facilitated by a liaison from the DSIT, who understands the ST member's planned research and assists in the identification and acquisition of required data. This process identifies data requirements not currently met by data within ARM. There are several actions that the ARM Program can take to obtain the required data. In some cases, the acquisition of required data can present difficulties that are not encountered in the management of routine ARM data. The required data may be available from a source outside ARM.
Several steps are taken to enhance the observational capability of the applicable ARM Cloud and Radiation Testbed sites when data requirements point to the need for more observations. This enhancement can sometimes be as simple as the implementation of an Intensive Observation Period (lOP), where the temporal resolution of the observations is increased (e.g., more frequent radiosonde launches). Enhancement may be gained through the acquisition and
deployment of additional instrumentation at existing facilities. Finally, the development and testing of new instruments may be required, and this is done within the Instrument Development Program, a component of ARM.

In many cases, data requirements can be satisfied by calculations from existing data or models. A calculated data stream may be the merger of two or more data streams, or an unobserved quantity may be defined and calculated (e.g., vertical velocity from the horizontal wind divergence). These algorithms are normally defined by scientists within or outside of ARM and developed or used within the ARM infrastructure.

One of the strengths of the Science Team concept is the potential for synergism through interaction and information exchange among ST members. This interaction happens most naturally around science issues and related data sets of common interest. To date, the two most active areas have been clear-sky instantaneous radiative flux (IRF) and single-column modeling (SCM). The DSIT is a catalyst for developing such interactive working groups and communicating the needs of the working group to the ARM infrastructure. The identification of "showcase" data sets of common interest helps focus the attention of the ST and infrastructure and stimulates progress that increases the benefit of the collected data in a way not possible by ST members working in isolation.

4. Data Sources and Products

ARM data are collected in an ongoing, continuous manner, punctuated by IOPs.
These two methods of collecting data complement each other. The "first-generation" data stream is the observations taken directly from the instrumentation in the field.
The quality of the data stream is assessed by using minimum, maximum and delta threshold checks. The ongoing nature of regular ARM operations requires an automatic approach in analyzing the data. To this end, the concept of Derived Products has been defined and documented.

A Derived Product is the definition of a procedure which creates a new data stream by taking existing ARM data streams or models as inputs and applying simple algorithms to them. The procedure is run automatically and continuously as long as there is input data, and the output (called a "derived product") becomes a new data stream. Prospective derived products may be identified by any part of the program, from instrument mentors to science team members, or any place in between.

There are two distinct types of derived products. The first type consists of data processing, including smoothing, interpolation, extrapolation, time synchronization of different data streams, and/or time averaging. Another example of this type of product would be a procedure that applies new calibrations to existing data, thus creating a "second-generation" data stream. These algorithms are designed to either reformat the data to make it easier for ST members to use, or to reprocess the original dataset to improve the quality of the data.
The second type of derived product generates new data streams, either from physical models driven by inputs from the ARM data streams, or from data quality comparisons. These algorithms come from ST members who are interested in the derived data and the information in the data streams.

Quality Measurement Experiments (QMEs) are a subset of the derived data products, designed specifically to perform ongoing, automated quality comparisons across data streams. A QME compares a derived data stream against a set of expectations. The expectations are based either on self-consistency within a single data stream, or on physical quantities calculated from other, somehow related, data streams. QMEs are part of a two-pronged effort to ensure that ARM data is of known and reasonable quality. The first approach focuses on the data at the time of processing; the second comes from the intercomparison of different ARM data streams. QMEs help to identify anomalies, such as inconsistent calibrations or incorrectly implemented algorithms, and using multiple methods it is possible to identify the root cause of anomalies.

To meet the scientific objectives of ARM, it is necessary to augment the data collected at the CART sites with data from external sources. Examples of external data streams which are currently being acquired to augment measurements at the Southern Great Plains site include surface meteorological observations from the Oklahoma Mesonet and from a Kansas mesonet operated by the University of Kansas, National Weather Service (NWS) surface and upper air data, RASS data, wind profiler data from the NOAA Wind Profiler Demonstration Network, Stage III precipitation data from the Arkansas River Basin Forecast Center, Eta and RUC model products from the NMC, and GOES and AVHRR satellite data products, including cloud coverage statistics, generated by a group at NASA Langley. Some of these data sets are obtained in near real-time, while others are delayed until the responsible organization has had the time to perform quality assurance on the data. Derived products that incorporate combinations of these external data sets with ARM data streams are either currently available or are planned.

It is anticipated that the need for external data will increase with the addition of the CART sites in the Tropical Western Pacific (TWP) and the North Slope of Alaska. Arrangements are already underway to acquire GMS satellite data for the TWP site.
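A Quality Measurement Experiment of the kind described above can be sketched as a comparison of a derived stream against expectation values taken from an independent source; the quantities and tolerance below are purely illustrative:

```python
def qme_compare(derived, expected, tolerance):
    """Return the indices of samples where the derived stream departs
    from the expectation by more than the allowed tolerance."""
    return [i for i, (d, e) in enumerate(zip(derived, expected))
            if abs(d - e) > tolerance]

# E.g., a water-vapor quantity derived from one instrument compared
# against coincident values from another (made-up numbers).
anomalies = qme_compare([2.1, 2.4, 3.9, 2.2],
                        [2.0, 2.5, 2.6, 2.3],
                        tolerance=0.5)
```

A real QME would run continuously as new input data arrive, and the anomaly indices would feed the problem-tracking process described in the next section.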
5. Challenges

The ARM DSIT team is made up of a diverse set of disciplines including electrical, mechanical, and software engineers, mathematicians, physicists, and atmospheric scientists. Members of the team are located at five different national laboratories, thus increasing the requirements for enhanced technological communications. Electronic mail, phone and video conferencing, and travel have become significant contributors to communication among team members.

The group responsible for designing and implementing the ARM SGP data system faced a number of challenges. Specifically, the implementation of a complex distributed system in a relatively short amount of time, with team members located in three separate time zones.
Furthermore, the SGP site was not close to any of the laboratories at which the team worked and, since it was located in the middle of farm land, the site was started with no communication facilities. The team was designing a system which was to run continuously for an estimated 10 years and be capable of near real-time delivery of data to its ultimate users, the ARM Science Team.
The ARM Program Management developed a general method of tracking and resolving problems by means of a subgroup, the Problem Review Board (PRB), which meets via weekly conference calls. A database was developed to assist the PRB by recording all reported problems via Problem Identification Forms (PIFs) and Corrective Action Reports (CARs). In addition, the quality of data streams and the resolution of problems are documented and stored in the database in the form of Data Quality Reports (DQRs).

6. Collaboration with Programs and Researchers outside ARM

Collaboration with programs and researchers outside ARM is an important ongoing part of ARM. Collaboration occurs on both the individual and program level. In many instances, scientists outside ARM conduct research related to ARM scientific objectives. The exchange of ideas, algorithms, and data is mutually beneficial to both parties. In some cases, ARM has strengthened these collaborations by identifying the scientists as adjunct ST members. In other cases, ARM has established funding arrangements for providing desired data to ARM (e.g., NASA satellite-derived cloud products). Since ARM addresses a specific part of the global climate research problem, combining the results of other research programs with those of ARM-funded research can provide scientific understanding, approaches, and data that enhance the overall observational approach. Also, by combining resources and meeting certain objectives, scientific goals can be achieved that individual programs could not achieve on their own (e.g., ARM and SHEBA interactions in the Arctic).

The density of instrumentation and the long-term period of data collection at the ARM CART sites have attracted several programs that wish to take advantage of the ARM sites as benchmarks. Examples of collaborations include the Global Energy and Water Experiment (GEWEX) program through its GCIP subprogram in the Southern Great Plains, and the TOGA-COARE program in the Tropical Western Pacific.

7. Future Direction

Translating science needs into data requirements and delivering data streams of known and reasonable quality are fundamental principles of the DSIT of the ARM Program. Maturity of our capability to realize these principles will enhance the scientific productivity of ARM Principal Investigators (PIs). To keep pace with the increasing capacity of ARM PIs to make progress toward the ARM programmatic objectives to improve GCMs and understand the atmospheric radiation-cloud
radiative feedback, the DSIT will maintain a vigorous policy of upgrading the software and hardware of our data system and optimizing the critical loop between our applied science and modeling efforts with those of the ARM ST. In essence, to make more high quality data, information, and derived products available to our customers. Focus groups within the DSIT are working to develop effective methods for designing derived products, our system development activities, and the operation and maintenance of our data environment. These focus groups, along with the infusion of new technologies, and the utilization
and re-use of previously developed techniques, form the basis of our approach to product development, operations, and maintenance, and will continue to have a positive impact. In particular, we are moving toward an open-architecture approach: object-oriented techniques, an integrated meta-data approach to manage data and information, standard relational database tracking of data streams, standard data access, analysis and display tools, and the use of the World Wide Web for cross-platform processing and delivery. In the future, we will keep the needs of our customers in view as we develop new system requirements and plan the integration of new technologies. Understanding how well we meet the scientific objectives of the ARM Program and the operational needs of the DSIT will come from measures of the effectiveness and characterization of data access at our archive.
References

Melton, R. B., Campbell, A. P., Edwards, D. M., Kanciruk, P., and Tichler, J. L. 1991. Design of the CART Data System for the U.S. Department of Energy's ARM Program. Proceedings of the American Meteorological Society.

Stokes, G. M., and Schwartz, S. E. 1994. The Atmospheric Radiation Measurement (ARM) Program: Programmatic Background and Design of the Cloud and Radiation Testbed. Bulletin of the American Meteorological Society, Vol 75, No. 7, 1201-1221.

U.S. Department of Energy. 1990. Atmospheric Radiation Measurement Program Plan. DOE/ER-0442.
N96-15460

Object-Oriented Structures Supporting Remote Sensing Databases

Keith Wichmann
Global Science & Technology, Inc.
Code 935, NASA/Goddard Space Flight Center
Greenbelt, MD 20771
kwichma@balmure.gsfc.nasa.gov

Robert F. Cromp
Code 935, NASA/Goddard Space Flight Center
Greenbelt, MD 20771
cromp@sauquoit.gsfc.nasa.gov
Abstract

Object-oriented databases show promise for modeling the complex interrelationships pervasive in scientific domains. To examine the utility of this approach, we have developed an Intelligent Information Fusion System based on this technology, and applied it to the problem of managing an active repository of remotely-sensed satellite scenes. The design and implementation of the system is compared and contrasted with conventional relational database techniques, followed by a presentation of the underlying object-oriented data structures used to enable fast indexing into the data holdings.
1 Introduction
In the late 1990s, NASA will launch a series of satellites to study the Earth as a dynamic system. The information system required to manage data from these satellites will be one of the world's largest, requiring a capacity about 1,000 times the size of the Library of Congress. The enormous size of the data holdings and the complexity of the information system pose several challenges to computer scientists, who must apply advanced techniques from software engineering, artificial intelligence, computer graphics, and signal processing if there is any practical chance of success. A data management system should provide users the ability to locate data pertinent to their needs. Limited computational resources, however, directly impact the level of detail users can employ in formulating their queries. In an extensive remote sensing repository, it is both impractical and unnecessary to manage every individual pixel of each satellite scene. Such an extreme level of granularity is best left to individual users who can apply a geographic information system to a restricted number of images. Rather, it is better to record limited information about each scene, and use this metadata to satisfy users' queries. For a number of years, researchers in the Information Sciences and Technology Branch, Code 935, at NASA/Goddard Space Flight Center, have strived to build a fully automated Intelligent Information Fusion System (IIFS) that enables unsophisticated users to access up-to-date satellite imagery [2, 4, 9]. To enable efficient search of large data holdings, the IIFS uses novel methods for extracting image content from the data, specialized data structures for storing data acquired from all over the world, and object-oriented representations to express the complex interrelationships pervasive throughout scientific domains. These advanced techniques are transparent to the users, yet allow for fast and easy access to data which meet the users' needs.
2 Rationale Behind an Object-Oriented Approach
Until recently, designers had no real choice but to apply a relational database when confronted with managing any sizable amount of data. In the late 1980s, however, commercial vendors made it viable for system builders to consider using object-oriented databases for limited purposes, most generally in an evaluative mode. At the same time, educational institutions introduced an object-oriented philosophy in their computer science classes, resulting in a new generation of computer scientists who have been weaned on this approach.
Seemingly, object-oriented databases are now an alternative to relational databases. However, as with any new technology, it is essential that a comparison be drawn with the older, established technology. A relational database management system (RDBMS) uses the relational algebra as described by Codd [3]. This was a calculated tradeoff from the flexibility that the CODASYL model provided. In return for a limitation on flexibility, RDBMSs provide an increase in performance because of the simplicity of the
relational algebra in query processing. In order to apply the relational algebra, the data must be normalized to a well-defined form. The manual process of applying normalization can be time consuming and error prone. Furthermore, normalization can also lead to performance problems during query processing when dealing with complex relationships [10]. In applications that have simple relationships, the relational database is a good choice. However, in the remote sensing domain, the model can be quite complex to fully express the functionality of the system. In cases like these, the object-oriented database is more likely to perform better. Baroody and DeWitt [1] have previously shown that the object-oriented approach can perform all functions of a relational database. In addition, by allowing the equivalent of pointers from a language such as C to be stored in the database, OODBMSs can store more complex structures and relationships than RDBMSs. This reduces the need for mapping data from their application storage to their database storage (sometimes referred to as the "semantic gap" or an "impedance mismatch").

In many cases, the OODBMS ties in to a host language (e.g., C++) such that there is a direct one-to-one mapping because the structures are identical. Naturally, there are tradeoffs here as well. The mapping causes an extra step whenever an object migrates into memory to set its pointers to something valid. Once an object is in memory, no additional database overhead is necessary until the transaction commits. The database can represent any in-memory data structure because of this representation strategy. While the relational database relies on associative lookups, joins and indices on key attributes, the object-oriented database can employ domain-specific structures for providing fast access directly to the object or objects of interest.
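The "semantic gap" argument can be illustrated with a small sketch: in an object store, relationships are held as direct references, where a relational mapping would decompose the same data into rows joined on foreign keys. The class names below are invented for illustration and are not from the IIFS schema:

```python
class Instrument:
    def __init__(self, name):
        self.name = name
        self.scenes = []  # one-to-many relationship held as plain references

class Scene:
    def __init__(self, scene_id, instrument):
        self.scene_id = scene_id
        self.instrument = instrument  # the "pointer" stored with the object
        instrument.scenes.append(self)

avhrr = Instrument("AVHRR")
s1 = Scene("orbit-001", avhrr)
s2 = Scene("orbit-002", avhrr)

# Navigational access: no join and no key lookup, just follow the reference.
names = [s.scene_id for s in avhrr.scenes]
```

An RDBMS would instead store a `scenes` table with an `instrument_id` foreign key and recover the same list with a join, which is the mapping step the OODBMS avoids.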
In essence, the database uses structures which allow a very quick, navigational search that only involves referencing a small number of objects. RDBMSs must adhere to the table structure and as such have difficulty representing spatial data and other complex structures [5]. It is necessary to write code to convert the in-memory structure to the relational structure and vice versa [8]. OODBMSs can represent the domain-specific search structures without changes to the database server itself. The OODBMS differs in the balance of responsibility afforded the client and the server. An RDBMS, in general, performs all processing on the server side and only communicates the results to the client. An OODBMS tends to perform all of its processing on the client side. This leaves the server to handle accessing the disk and transmitting the raw data to multiple clients, allowing the workload to be better distributed. As a compromise to the two models, hybrid databases improve over RDBMSs by storing complex objects in the nodes of a table, but do nothing to improve on the expressiveness of the relationships. Examples of hybrid databases include Illustra and Postgres. Hybrid models support SQL queries, a standard from the RDBMS community, and researchers/vendors in OODBMSs have begun to embrace this notion. While the hybrid database allows more complex objects, it does so with a cost in performance. It does provide an environment more familiar to RDBMS users, but it does not have the full capabilities of an OODBMS. The IIFS utilizes an OODBMS for several reasons. The remote sensing domain can be very complex. The need for the system to be scalable in the number of items of data that it can handle causes several problems that an OODBMS can solve. First, the table structure of relational data requires that more data be accessed than is absolutely necessary to carry out a search since all data in the table are grouped together on disk. 
As the number of objects in the database grows, the amount of data touched also increases. The OODBMS allows a search structure that facilitates accessing only the data that match the search attribute for that structure. Next, the OODBMS requires less coding to convert data into the proper form for storage, leading to fewer coding errors and less development time. It is necessary to introduce transactions, but the pointers map directly into the database. Finally, the distributed nature of most OODBMSs allows exploitation of CPU cycles spread across the network.
[Figure 1: The IIFS Object Model]

The IIFS employs the object model that is partially illustrated in Figure 1. The diagram is a minor diversion from the standard entity relationship diagrams in that it incorporates the notion of inheritance. Each rectangle represents a class. A superclass is indicated by a containing rectangle. The lines indicate the relationships between the classes, and the numbers indicate the cardinality of the relationship, with a star representing "many."

The platform, sensor, instrument and channel classes represent the source of the data. The platform is the unit on which the instrument is mounted:
* In space, a satellite or the space shuttle can be a platform.
* In the atmosphere, an airplane or a weather balloon can be a platform.
* On the ground, an in-situ instrument can be a platform.
An instrument is the atomic device that gathers the data (e.g., the Thematic Mapper), while a sensor is the combination of a specific instrument on a specific platform (e.g., the Thematic Mapper aboard LANDSAT-5). Of particular note, the same instrument can be carried on many different platforms, each combination instantiated as a different sensor. A channel gathers data over a specific frequency range; each sensor has a set of channels, possibly with overlapping frequency response. All of these classes have the ability to report back the set of elements that matches a particular instance.

There are four major attributes used when searching for the desired data: temporal, spatial, source and content. The temporal search uses hypercylinders, temporal index nodes and temporal index element objects. The spatial search employs sphere quadtrees (SQTs) and sphere quadtree nodes [6]. The content search uses the content index, content template and content object classes.

    (query (instrument LANDSAT-6 TM TM-4)
           (complexdate 1 1 4 85 2 4 85 4095 15 1 31))
    (query (containment (latlong 33 21 35 23))
           (content USGS (forest 20 40)))
    (dump Content)
    (ingest iifs@balmure:/mass/data LANDSAT-5 TM 5/4/90 33.2 -43.4)
    (add-template LandUse (water urban agriculture rangeland forest))

Figure 2: Sample s-expressions for communicating between modules of the IIFS
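Messages in the style of Figure 2 can be handled with a very small s-expression reader. The following is an illustrative Python sketch under assumed conventions (whitespace-separated atoms, no string literals), not the IIFS server code:

```python
# Minimal s-expression reader, sketched to show how a server could parse
# Figure 2 style messages; illustrative only, not the IIFS implementation.

def tokenize(text):
    # Pad parentheses with spaces so split() yields clean tokens.
    return text.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens):
    # Recursively build nested lists from the token stream.
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(read(tokens))
        tokens.pop(0)               # discard the closing ")"
        return lst
    return tok

expr = read(tokenize("(query (containment (latlong 33 21 35 23)) "
                     "(content USGS (forest 20 40)))"))
print(expr[0])                      # 'query'
```

The server can then dispatch on the head symbol (`query`, `ingest`, `add-template`) and walk the nested lists to build the corresponding query objects.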
Each of these search structures has the capability of reporting back the set of data that corresponds with the attribute value; they extract elements from the database that match the given attributes. The primary data structure in the IIFS is an element. Elements in the database are either observations directly from an instrument, or products derived from the original raw data (observation). Since algorithms produce the derived data, the stored information about the routines includes the machines it can run on, the accounts needed to access the data and executable program, any necessary parameters, and a record of their order of application. In this way, if a better algorithm develops, it is possible to regenerate data products that used the old algorithm.

The storage of algorithms allows the implementation of semantic caching. Since storage space will surely never be large enough to store everything, the database will evaluate the data to see what can be deleted safely. The IIFS can recalculate products when needed if the database stores all of the original data as well as all algorithms. Therefore, a good choice would be to remove the least recently used data product from the archive. The system then marks the element as not available. The history facilitates the regeneration of the data by reapplying the algorithms in order.

The IIFS also relies on a planner, scheduler and dispatcher to interact with the data stored in the database. The planner generates the necessary steps to produce data products that do not currently exist or to process newly ingested data sets to get the associated metadata, and in this vein it acts as a transaction manager. The browse product preferences in the object model list the algorithms that should run on all new data to generate the metadata. The planner incorporates policies that the user provides, such as "accuracy is preferable to shorter compute time," in order to generate the desired product. The scheduler accounts for the resources available in the system and arranges for simultaneous requests to be processed, producing an estimate of the time required. The dispatcher implements that schedule and monitors the progress of each process.

The entire implementation is in C++. An interface to the outside world is necessary so that all programs that utilize the database do not have to be clients of the database itself. A server exists that accepts a specialized language of s-expressions that describes the query and searches the database on behalf of the program. Remote procedure calls (RPCs) carry the s-expression language to the server and the results back to the client. The language allows querying of the elements in the database, ingesting of new data, ingesting of new content templates and the extraction of extent information. Multiple interfaces can share this common interface into the database. Figure 2 shows some examples of the s-expression language.

4 Query Objects
The object-oriented domain also allows a useful encapsulation of the query, as illustrated in Figure 3. In the IIFS, a query is simply a list of query components. A query component is an abstract class used to encapsulate the processing of a query and the retrieval of the representation of the query for passing to a query optimizer. It has subclasses that represent each attribute type used for a search, such as temporal, spatial, source and content searches. In addition, there is an abstract subclass called query connector used to provide logical connectives such as AND and OR. Each connector has a list of the query components that it is connecting, and any element of that list can be another query connector. A connector object can also be one of the items in the list of query components maintained by a different query connector, thus allowing arbitrary combinations of query components. A query assumes a logical AND connector for each of its elements.

[Figure 3: Query Objects]

The IIFS also allows specification of a query as a standing query. This is a query that executes automatically at some specified interval. The results are emailed to a list of users until the expiration date supplied with the original query. Each mail message also includes the information about the query and a unique identifier for specifying a query to remove from the system. The system requires this extra information at the creation of the query. The natural mapping of in-memory structures to database structures means that there is no additional work to store this in the database other than maintaining the list of all the standing queries.

Each query component can return a string that the OODBMS query optimizer can use as input for a search. The OODBMS generic query processor does not know how to use the navigational search structures built into the database, so it must rely on B-trees, hash tables and simple iteration over the existing objects to determine the query result. Each component can also return the set of data that matches the attributes that it describes by utilizing the navigational search structures described in detail below. This flexibility allows quick implementation of a variety of search techniques and becomes important in the section on multiple attribute searches.

4.1 Spatial Searches

The IIFS employs sphere quadtrees (SQTs) as the spatial search structure [7]. The SQT, in addition to providing a logarithmic search, better represents the globe by not introducing artifacts at the poles as
is the case with latitude/longitude-based structures. The SQT uses an icosahedron (a 20-sided polyhedron) with triangles for each face as its basis. Each face subdivides into four sub-triangles (trixels) recursively. To achieve a resolution of one kilometer over the face of the Earth requires building an SQT to 14 levels, implying that any cell of this tiling can be indexed by answering 14 four-way questions, or 28 yes-no questions. Furthermore, the resulting geometry uses all-integer math, which produces much faster results than the floating point calculations that might be encountered using latitudes and longitudes. Each trixel that has no children (a leaf trixel) maintains a collection of the elements that fall in its region. The trixel splits into four new sub-trixels when there are too many items in the bin. The system then inserts the original trixel's elements into the appropriate sub-trixel's bin, where the bin size limit can be a fixed number or can vary depending on the depth of the tree.
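The bin-splitting behavior of a trixel can be sketched as follows. This is a hypothetical Python illustration of the idea (the real SQT uses geometric point-location on the sphere; here a toy locator stands in for it):

```python
# Sketch of trixel bin splitting (hypothetical, not the SQT implementation):
# a leaf keeps elements until a bin limit is reached, then splits into four
# sub-trixels and reinserts its contents.
BIN_LIMIT = 2

class Trixel:
    def __init__(self, depth=0):
        self.depth = depth
        self.children = None         # None while this trixel is a leaf
        self.bin = []

    def insert(self, element, child_of):
        # child_of(element, depth) -> 0..3, the sub-triangle at this depth;
        # an integer-only stand-in for the real geometric containment test.
        if self.children is None:
            self.bin.append(element)
            if len(self.bin) > BIN_LIMIT:
                self.children = [Trixel(self.depth + 1) for _ in range(4)]
                for e in self.bin:   # reinsert into the proper sub-trixel
                    self.children[child_of(e, self.depth)].insert(e, child_of)
                self.bin = []
        else:
            self.children[child_of(element, self.depth)].insert(element, child_of)

# Toy locator: take successive base-4 digits of an integer identifier, echoing
# the "14 four-way questions" that address a cell with integer math only.
locate = lambda e, depth: (e >> (2 * depth)) & 3

root = Trixel()
for ident in range(5):
    root.insert(ident, locate)
print(root.children is not None)     # root split after the third insert: True
```

With a fixed limit of two, the third insertion forces a split, after which new elements descend directly to the matching sub-trixel's bin.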
[Figure 4: IIFS Content Index Section]

4.2 Temporal Searches
Each piece of data described in the IIFS has a reference in the temporal index as well. The structure consists of an indexed list of Julian days, each with a list of the elements that fall on that day. The IIFS creates the day object only at the insertion of the first element with that date. A B-tree indexes the days so that range searches perform very quickly. The system performs a variety of temporal searches down to the granularity of a day. Granularities less than a day require accessing the actual data.
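The day-list structure just described can be sketched as follows. This is an illustrative Python version under stated assumptions (a sorted list with binary search stands in for the B-tree; day numbers and element names are hypothetical):

```python
# Sketch of the temporal index described above (illustrative, not IIFS code):
# an ordered list of Julian day numbers stands in for the B-tree, each day
# holding the elements that fall on it; range searches use binary search.
import bisect

class TemporalIndex:
    def __init__(self):
        self.days = []               # sorted Julian day numbers
        self.elements = {}           # day -> list of element identifiers

    def insert(self, julian_day, element):
        if julian_day not in self.elements:  # day created on first insertion
            bisect.insort(self.days, julian_day)
            self.elements[julian_day] = []
        self.elements[julian_day].append(element)

    def range_search(self, first_day, last_day):
        # Binary search for the bounds, then gather each day's elements.
        lo = bisect.bisect_left(self.days, first_day)
        hi = bisect.bisect_right(self.days, last_day)
        return [e for d in self.days[lo:hi] for e in self.elements[d]]

index = TemporalIndex()
index.insert(2449840, "scene-A")     # hypothetical Julian day numbers
index.insert(2449842, "scene-B")
index.insert(2449900, "scene-C")
print(index.range_search(2449840, 2449850))   # ['scene-A', 'scene-B']
```

Because only the days inside the requested range are visited, a range search touches data proportional to the size of its answer, not the size of the archive.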
4.3 Content Searches

Searches based on content help fulfill two major requirements of the IIFS. First, the user should be able to search for data in terms that are familiar to their own domain. The system accomplishes the first by
requiring algorithms to interpret data into that domain. The second major requirement is that the IIFS must provide the capability to return the smallest possible group. As the number of objects in the system grows, the number of objects returned by a given query will also increase. If no additional attributes are available to search on, this requirement will fail. The IIFS system allows the introduction of various content templates. Each template consists of a name for the template and a name for each attribute of the template. The content object stores the data as a vector of numbers that correspond to the order of the attributes in the template, representing the percentage of that attribute in the data. For instance, a template might be used for land cover. Its attributes might be agriculture, urban, water, forest, rangeland and barren, with the vector representing what percentage of the data each of those classes covers. The database maintains a list of all templates. Each template lists all elements that have had its classification algorithm run.
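The template and content-vector scheme can be sketched as follows. This is an illustrative Python example with hypothetical scene names and percentages, not IIFS data:

```python
# Illustrative sketch of content templates and content vectors (hypothetical
# names and values): each vector entry is the percentage of the data that
# falls in the corresponding template attribute.
template = ("agriculture", "urban", "water", "forest", "rangeland", "barren")

scenes = {
    "scene-A": (10, 5, 20, 50, 10, 5),
    "scene-B": (40, 30, 5, 10, 10, 5),
}

def content_search(attribute, low, high):
    # Position of the attribute within the template fixes the vector slot.
    i = template.index(attribute)
    return sorted(s for s, vec in scenes.items() if low <= vec[i] <= high)

print(content_search("forest", 20, 60))       # ['scene-A']
```

A query such as `(content USGS (forest 20 40))` from Figure 2 reduces to exactly this kind of range test on one slot of each content vector.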
Another step in precomputing all possible queries would be to break up the range of values (typically 0-100 percent) into bins. The bin points to the data that fall in that range for each attribute in the template. This would require more storage, but would produce quicker results during a query. Figure 4 shows a partial example. It shows three of the attributes in a land use/land cover template divided into ten bins, each representing a ten percent range. There are entries pointing to some of the elements in the system that have the given properties. By dynamically adjusting the bin sizes and adding new bins as they reach a given limit, the search can always be logarithmic. This allows the addition of new templates at any time, but the system will reexamine old data sets to classify them using the new template. This could be a very costly process, but flexibility can be added to the system to dictate when that processing should occur.

4.4 Source Searches
The source in the IIFS models the platform and sensor that the data came from. This is an excellent example of where the complexity of the domain is apparent. There is a many-to-many relationship between platform and sensor that is expressed very easily in OODBMSs but requires a mapping to an extra table in RDBMSs. The platform describes the physical location of the instrument used to get the data. The sensor describes the physical properties of the device, including the wavelengths to which it responds. The same sensor can be on many different platforms, and one platform can have many sensors on it. The search structure is identical to this description. Since there are not many platforms, that part of the search executes very quickly, but a hash function minimizes the search time. Each sensor maintains a list of data that it produced. Each piece of data maintains a list of channels that are missing. This list is normally empty and searches of the list are not necessary. This implies that the sensor gathers all channels simultaneously.

4.5 Multiple Attribute Searches
A single attribute search is not expressive enough for most searches. The IIFS allows multiple attribute searches using any combination of attributes. Figure 5 shows the four major search structures with selected connections to the elements. Each attribute, as represented by a query component, does its own search, each of which is logarithmic. The system then combines the results conjunctively or disjunctively as declared. More elegant strategies are being implemented. The first strategy is to evaluate which attribute will produce the smallest return set. This implies that there exists a quick way to determine the approximate number of items returned, which is not an easy problem. Obviously, the search stops if any attribute returns zero elements. Furthermore, by performing the intersection on the two smallest sets first, the system performs the least amount of work. This also reduces the number of comparisons in the following stages. The second strategy involves employing the query mechanism of the database. The database can apply this in addition to the first strategy, but the first does not require it. Once the search produces a set that is smaller than some threshold value, the database's query mechanism iterates over each remaining element, testing each member for the remaining attributes simultaneously. Indices will not previously exist because the system creates the set dynamically. However, the database could create an index on the set if it is beneficial.
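The first strategy above can be sketched as follows. This is an illustrative Python example, not the IIFS code; the element identifiers are hypothetical:

```python
# Sketch of the conjunctive strategy described above (illustrative): evaluate
# each attribute, then intersect starting from the smallest result sets so
# the least work is performed; an empty set stops the search immediately.
def conjunctive_query(result_sets):
    ordered = sorted(result_sets, key=len)    # smallest sets first
    if not ordered or not ordered[0]:
        return set()                          # any empty attribute: no hits
    hits = set(ordered[0])
    for s in ordered[1:]:
        hits &= s
        if not hits:                          # short-circuit on empty result
            break
    return hits

temporal = {"e1", "e2", "e3", "e4"}
spatial  = {"e2", "e3", "e4"}
content  = {"e2", "e5"}
print(sorted(conjunctive_query([temporal, spatial, content])))   # ['e2']
```

Starting from the smallest set bounds every later intersection by its size, which is the "least amount of work" property the text describes.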
4.6 Correlated Queries
[Figure 5: IIFS Search Indices]

A correlated query is a query in which the qualification for membership in the result set is a relationship (correlation) to another element in the result set. This kind of query can never return a single hit because a relationship requires two elements. The returned elements will have a relationship with another element in the set. For instance, the query might be for data covering locations where two different sensors passed over in the same hour. This capability allows searches for relationships in the data, and is a very powerful addition to simple searches for attributes. The above search breaks down into three stages. The first stage is a query for the first set of eligible elements from the first platform. The second stage is for the eligible elements from the other platform. The third and final stage tests for the existence of the relationship between the two sets. The IIFS creates and fills in a special bin structure with two bins for each time slot. The reason for two bins is to keep hits from the different platforms separate. It could use a single bin, but then it would have to search to see if there are entries from both to determine if the contents of that bin are a hit. By separating them, two non-empty bins from the same time period constitute a hit and provide an ordering for the hits. Figure 6 illustrates the structure used. In the case of spatially-correlated queries, the sphere quadtree acts as the bin structure. The IIFS creates two sphere quadtrees and populates them with data from each query. Similarly, the database can correlate queries by source or content. Finally, the system might implement multiply correlated queries. One example would be a search for data from two different platforms that cover the same area and occur in the same period. Also, more complex relationships than equality or proximity would be useful. This would enable a query like two images taken three months apart over the same area.

4.7 Distributed Queries
Distribution of the data can increase the throughput of the system and decrease the response time for a given query. In the IIFS, the spatial search structure has some properties that lend themselves to distribution. Each of the twenty faces of the icosahedron can be distributed to separate CPU nodes. If any machine node approached a predefined saturation level, its data could be redistributed to other nodes. The IIFS maintains a master index so that the distribution system can pass queries to the appropriate machine for processing. Throughput increases because independent queries of physically separated areas execute simultaneously. Response time improves for queries that cover more than one face of the icosahedron because the search for the remaining attributes can occur in parallel.
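The face-to-node master index can be sketched as follows. This is a hypothetical Python illustration (node names and the round-robin assignment are assumptions, not the IIFS design):

```python
# Illustrative sketch (hypothetical, not IIFS code) of routing spatial
# queries by icosahedron face: a master index maps each of the twenty faces
# to a node, so queries touching different faces can proceed in parallel.
FACES = 20

class MasterIndex:
    def __init__(self, nodes):
        self.nodes = nodes
        # Initial distribution: assign faces round-robin across the nodes.
        self.face_to_node = {f: nodes[f % len(nodes)] for f in range(FACES)}

    def route(self, faces):
        # Which node handles each face a query touches.
        return {f: self.face_to_node[f] for f in faces}

    def migrate(self, face, node):
        # Redistribute a face when its current node nears saturation.
        self.face_to_node[face] = node

index = MasterIndex(["node-a", "node-b", "node-c"])
print(index.route([0, 1, 5]))    # {0: 'node-a', 1: 'node-b', 5: 'node-c'}
```

A query spanning several faces fans out to the nodes named in the routing table, and `migrate` captures the redistribution step the text describes.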
[Figure 6: Bin Structure for Correlated Query]

5 Current Capabilities
The IIFS in its current form supports queries on the four basic attributes described above: temporal, spatial, source and content. The sphere quadtree stores the center point of each image and allows logarithmic access to those points. The hypercylinder data structure is currently being implemented and will enable the IIFS to return data sets that more accurately match the user's area of interest. The IIFS implements most of the structures discussed. Notable exceptions are all types of correlated queries except temporal, and the algorithm-related classes. The system uses the techniques described for distribution to segment the database internally, but no work has been performed on distributing the SQT over multiple CPU nodes. Instrumentation of the database is currently in progress to produce base figures for evaluating changes to algorithms and storage structures. Ingesting large amounts of random data into the database tests the scalability of the system. The system is implemented on a Hewlett-Packard 735 workstation using ObjectStore from Object Design, Inc. A graphical user interface developed in Tcl/Tk implements a front end to the system for providing queries and displaying results. It communicates with the database using the s-expression language over an RPC connection. The IIFS also includes an interface using the World Wide Web as the front end, but it is not quite as flexible. Active research is progressing on techniques for extracting useful content information from remote sensing imagery to capitalize on the content searching capability of the system.
6 Conclusions and Future Work

The IIFS system is serving as an active development and research environment for testing scalability and evolvability goals. The base system is nearing completion, allowing for future studies of new data structures and algorithms as well as the effect of minor changes in the current implementation. The current prototype shows that the techniques implemented are sound and can provide the basis for a more complete system. Future work includes the extension of the spatial search structure to handle the differences in the granularity of the data stored in the database, additions to the content based search structure to increase its performance, and the examination of a number of techniques for handling multiple attribute queries. Furthermore, the system will examine and implement additional types of correlated queries.
Acknowledgments

We would like to thank Bill Campbell (NASA/GSFC), Nick Short (NASA/GSFC), Gyuri Fekete (Hughes), Larry Roelofs (GST), and Walt Ligon (Clemson U.) for their contributions to this work.
References

[1] A. J. Baroody and D. J. DeWitt. An object-oriented approach to database system implementation. ACM Transactions on Database Systems, 6(4):576-601, 1981.

[2] W. J. Campbell, S. E. Hill, and R. F. Cromp. Automatic labeling and characterization of objects using artificial neural networks. Telematics and Informatics, 6(3-4):259-271, 1989.

[3] E. Codd. A relational model for large shared data banks. Communications of the ACM, 13(6):377-387, 1970.

[4] R. F. Cromp, W. J. Campbell, and N. M. Short, Jr. An intelligent information fusion system for handling the archiving and querying of terabyte-sized spatial databases. In International Space Year Conference on Earth and Space Science Information Systems, pages 586-597. American Institute of Physics, 1993.

[5] R. F. Cromp and S. Crook. An intelligent user interface for browsing satellite data catalogs. Telematics and Informatics, 6(3/4):299-312, 1989.

[6] R. F. Cromp and E. Dorfman. A spatial data handling system for retrieval of images by unrestricted regions of user interest. Telematics and Informatics, 9(3/4):221-241, 1992.

[7] G. Fekete. Rendering and managing spherical data with sphere quadtrees. In Proceedings of the First IEEE Conference on Visualization, San Francisco, California, October 1990.

[8] M. Loomis. Object Databases: The Essentials. Addison-Wesley Publishing Co., 1995.

[9] N. M. Short, Jr., R. F. Cromp, W. J. Campbell, J. C. Tilton, J. L. LeMoigne, G. Fekete, N. S. Netanyahu, W. Ligon, and K. Wichmann. Mission to planet Earth: AI views the world. IEEE Expert, to appear.

[10] N. M. Short, Jr. and S. L. Wattawa. The second generation intelligent user interface for the Crustal Dynamics data information system. Telematics and Informatics, 5(3):253-268, 1988.
REPORT DOCUMENTATION PAGE (Form Approved OMB No. 0704-0188)

1. AGENCY USE ONLY: (Leave blank)
2. REPORT DATE: October 1995
3. REPORT TYPE AND DATES COVERED: Conference Publication - October 26-27, 1995
4. TITLE AND SUBTITLE: 1995 Science Information Management and Data Compression Workshop
5. FUNDING NUMBERS: Code 930
6. AUTHOR(S): James C. Tilton, Editor
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Goddard Space Flight Center, Greenbelt, Maryland 20771
8. PERFORMING ORGANIZATION REPORT NUMBER: 95B00134
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES): National Aeronautics and Space Administration, Washington, DC 20546-0001
10. SPONSORING / MONITORING AGENCY REPORT NUMBER: CP-3315
11. SUPPLEMENTARY NOTES: Tilton: Goddard Space Flight Center, Greenbelt, Maryland. This workshop was organized and sponsored by the National Aeronautics and Space Administration (NASA) in cooperation with the Washington/Northern Virginia Chapter of the Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote Sensing Society. Support was received from the Office of Mission to Planet Earth, NASA Headquarters.
12a. DISTRIBUTION / AVAILABILITY STATEMENT: Unclassified - Unlimited, Subject Category 59. Availability: NASA CASI, (301) 621-0390.
13. ABSTRACT (Maximum 200 words): This document is the proceedings from the "Science Information Management and Data Compression Workshop," which was held on October 26-27, 1995, at the NASA Goddard Space Flight Center, Greenbelt, Maryland. The Workshop explored promising computational approaches for handling the collection, ingestion, archival and retrieval of large quantities of data in future Earth and space science missions. It consisted of fourteen presentations covering a range of information management and data compression approaches that are being or have been integrated into actual or prototypical Earth or space science data information systems, or that hold promise for such an application. The Workshop was organized by James C. Tilton and Robert F. Cromp of the NASA Goddard Space Flight Center.
14. SUBJECT TERMS: Data Compression, Image Compression, Information Management, Earth Science, Space Science
15. NUMBER OF PAGES: 162
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
20. LIMITATION OF ABSTRACT: UL