Practical Aspects of Symbolisation and Subsequent Analysis of Weather Data

J. Walter Larson (1,2), Peter Briggs (3), and Michael Tobis (4)
1. ANU Supercomputer Facility, The Australian National University
2. Mathematics and Computer Science Division, Argonne National Laboratory
3. CSIRO Marine and Atmospheric Research
4. Institute for Geophysics, University of Texas

Presented at Complex’07, Surfers Paradise, Australia, 4 July 2007

Copyright Permission is granted for this material, presented at the 8th Asia-Pacific Complex Systems Conference (Complex’07), 2-5 July 2007, Surfers Paradise Marriott Resort, Queensland, to be available on the Complex’07 website to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced material, and notice is given that the copying is by permission of the author(s). To disseminate otherwise or to republish requires written permission from the author(s).

ARC Centre for Complex Systems School of ITEE | The University of Queensland | ST LUCIA QLD 4069 | AUSTRALIA T: +61 7 3365 1003 | F: +61 7 3365 1533 | E: [email protected]

www.complex07.org

Overview

• Motivation
• Weather and Climate Data
• Information Theory, Complexity, Block Entropy, and Entropy Rates of Convergence
• “Natural” Symbolisation Strategies
• Practical Limits
• Case Study: Rainfall Data
• Another Problem
• Conclusions / Future Work

Motivation

• Place meteorological and climate data in a form easily analysed using information theory
  • Verification and validation purposes
  • Objective intercomparison between
    • climate models (e.g., IPCC AR4 model outputs)
    • reanalysis data sets
• Explore dynamics using a symbolic approach
  • Noise reduction
  • More compact data representation (e.g., 50 years of 6-hourly data in as little as ~750 MB for a 2-D 2.5° field)
  • Amenable to emerging high-performance computing architectures such as FPGAs

Weather and Climate Data

• Meteorological station records
  • Time sampling typically daily (longest stretching back to 1722 for Central England), hourly (longest approx. 50+ years), or higher sampling rates (e.g., ARM at 1-minute) for shorter periods
• Climate model output
  • Time sampling user-controlled (typical datasets are monthly averages or daily history files that might include semi-daily sampling)
  • Examples include any of the models in the Intergovernmental Panel on Climate Change (IPCC) assessment, et cetera
• Reanalyses (produced by data assimilation)
  • Typical sampling time 6 hours
  • US National Centers for Environmental Prediction (NCEP) products NCEP-1 (1948-present) and NCEP/DOE AMIP-II (1979-present)
  • European Centre for Medium-Range Weather Forecasts ERA-40 analysis (1957-present)

Symbolisation

• What it is: the conversion of a set of observations from a continuum into a discrete, finite set of symbols, or an alphabet A
• How do we pick symbols?
  • Normally, this is done heuristically
  • Obviously, we want fewer symbols than observations!
  • We want some variety; i.e., not tremendously long repeats of the same symbol
  • For an excellent discussion, see Daw et al. (2003)
• Is a given symbolisation a generating partition? That is, is a trajectory from the continuum uniquely identified by its symbol sequence? Probably not.

“Natural” Climate Symbolisations

• Anything: partitioning based on the mean, the median, or quantiles
• Anything: partitioning based on the long-term mean after filtering (e.g., filtering out diurnal and annual cycles)
• Anything: a 3-symbol alphabet derived from seasonal forecasting, which often labels state information as “below normal,” “normal,” and “above normal”
• Rainfall: binary, ‘0’ for no rain, ‘1’ for rain
• Total cloud cover
  • Octas: 0 for no cloud, 8 for overcast (9 symbols)
  • Tenths: 0 for no cloud, 10 for overcast (11 symbols)
• Wind direction: wind roses (typically 8 symbols for the compass points N, NE, E, SE, S, SW, W, NW)
• Boundary layer: Pasquill stability classes (6 or 7 symbols)
• Any other useful state-space partitioning scheme

C(1) and C(2) Complexity

• Hierarchical complexity classification scheme defined by d’Alessandro and Politi (1990)
• Symbol set A with alphabet size |A|
• C(1) complexity: count the number of admissible words Na(L) for blocks of length L and examine the scaling behaviour in the limit of large L:

    C(1) = lim_{L→∞} log_|A| [Na(L)] / L

• C(2) complexity: count the number of irreducible forbidden words Nif(L) for blocks of length L and examine the scaling behaviour in the limit of large L:

    C(2) = lim_{L→∞} log_|A| [Nif(L)] / L

• C(1) is the topological entropy, and (for 2-D maps) C(2) ~ λ+
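For a finite symbol stream, the C(1) limit can only be approximated by counting the distinct blocks actually observed at a given L. A minimal sketch of that finite-L estimate (function name and examples are illustrative, not from the talk):

```python
# Hedged sketch: finite-L estimate of C(1) by counting admissible words
# (distinct observed blocks) of length L, per the d'Alessandro-Politi
# definition above. A finite stream gives an estimate, not the true limit.
from math import log

def c1_estimate(stream, L, alphabet_size=2):
    """log_|A| of the admissible-word count, divided by L."""
    words = {tuple(stream[i:i + L]) for i in range(len(stream) - L + 1)}
    n_admissible = len(words)
    return log(n_admissible) / log(alphabet_size) / L

# A periodic stream admits few words, so its C(1) estimate is small;
# a coin-toss-like stream would admit nearly all |A|^L words.
periodic = [0, 1] * 50
print(c1_estimate(periodic, 4))  # only 2 admissible words of length 4 → 0.25
```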

C(1) and C(2) Complexity, Cont’d

• First application to climate by Elsner and Tsonis (1993) to hourly precipitation station data (approx. 40 years)
• [Figures from Elsner and Tsonis (1993): station values of C(1) and C(2) plotted relative to the periodic and random limits]
• Also applied by Dickens and Larson (2004) to classify causes of packet loss in a high-speed UDP data transfer scheme

Block Entropy

• Symbol stream of Ns symbols
• Symbol set A with alphabet size |A|
• Compute the block Shannon entropy H(L) for blocks of length L:

    H(L) = − Σ_{x ∈ A^L} p(x) log p(x)

• Look at the scaling behaviour of H(L); for finitary processes,

    H(L) = E + hµ L,

  where E is the “excess entropy” and hµ is the “entropy rate”
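The H(L) definition above translates directly into code. A minimal sketch using empirical frequencies over overlapping blocks (the helper name is hypothetical):

```python
# Hedged sketch: block Shannon entropy H(L) from a symbol stream, using
# base-2 logs and empirical block frequencies p(x) over overlapping blocks.
from collections import Counter
from math import log2

def block_entropy(stream, L):
    blocks = [tuple(stream[i:i + L]) for i in range(len(stream) - L + 1)]
    counts = Counter(blocks)
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# For a fair-coin i.i.d. stream H(L) ≈ L bits; a constant stream gives 0.
print(block_entropy([0, 1] * 100, 1))  # symbols equally likely → 1.0
```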

Entropy Rates of Convergence

• Crutchfield and Feldman (2003): a hierarchy of discrete derivatives and their integrals to study the convergence properties of the block entropy w.r.t. block size L
• Discrete derivatives
  • Entropy gain ΔH(L):

    ΔH(L) = H(L) − H(L − 1) for L ≥ 1, with ΔH(0) ≡ log2 |A| and ΔH(1) = H(1)

  • Predictability gain Δ²H(L):

    Δ²H(L) = ΔH(L) − ΔH(L − 1) for L ≥ 2, with Δ²H(1) ≡ H(1) − log2 |A|

  • Higher-order derivatives: lim_{L→∞} ΔⁿH(L) = 0 for n > 1
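Given a computed block-entropy curve, the discrete derivatives above are simple differences. A minimal sketch (the function name is hypothetical; the input is a list h with h[L] = H(L) and H(0) = 0):

```python
# Hedged sketch: discrete derivatives of a block-entropy curve, following the
# definitions above: ΔH(0) = log2|A|, ΔH(L) = H(L) - H(L-1) for L >= 1, and
# Δ²H(L) = ΔH(L) - ΔH(L-1) for L >= 1 (so Δ²H(1) = H(1) - log2|A|).
from math import log2

def entropy_gains(h, alphabet_size):
    dH = [log2(alphabet_size)] + [h[L] - h[L - 1] for L in range(1, len(h))]
    d2H = [dH[L] - dH[L - 1] for L in range(1, len(dH))]
    return dH, d2H

# Fair-coin-like curve: H(L) = L, so ΔH(L) = 1 everywhere and Δ²H vanishes.
dH, d2H = entropy_gains([0.0, 1.0, 2.0, 3.0], alphabet_size=2)
print(dH)   # → [1.0, 1.0, 1.0, 1.0]
print(d2H)  # → [0.0, 0.0, 0.0]
```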

Entropy Rates of Convergence, Cont’d

• Discrete integrals
  • Excess entropy E (bits):

    E = Σ_{L=1}^{∞} [ΔH(L) − hµ]

  • Total predictability G (bits/symbol):

    G = Σ_{L=1}^{∞} Δ²H(L)

  • Transient information T (bits × symbols):

    T = Σ_{L=0}^{∞} [E + hµ L − H(L)]

• Relationship between G and the entropy rate:

    log2 |A| = |G| + hµ

Finite-L Estimators

• Excess entropy E(L):

    E(L) = H(L) − L ΔH(L),  with lim_{L→∞} E(L) = E and E(L) ≤ E

• Entropy rate hµ(L):

    hµ(L) = ΔH(L),  with lim_{L→∞} ΔH(L) = hµ and hµ(L) ≥ hµ

• Total predictability G: compute the finite-L sum and track the rate of change Δ²H(L)
• Transient information T: given assumed values of E and hµ, one can examine both the sum of the terms in T up to finite L and the values of the individual terms in that sum
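The two closed-form estimators above can be sketched directly from an H(L) curve. This is an illustrative helper (name and example data hypothetical), not code from the talk:

```python
# Hedged sketch: finite-L estimators. From H-values h[0..Lmax] (with h[0] = 0),
# hµ(L) = ΔH(L) = H(L) - H(L-1) and E(L) = H(L) - L·ΔH(L), as defined above.

def finite_L_estimators(h):
    out = []
    for L in range(1, len(h)):
        dH = h[L] - h[L - 1]   # entropy-rate estimate hµ(L)
        E = h[L] - L * dH      # excess-entropy estimate E(L)
        out.append((L, dH, E))
    return out

# Period-2 process (01)^∞: H(L) = 1 bit for every L >= 1, so the estimates
# converge to hµ = 0 and E = log2(2) = 1 bit, as expected for period p = 2.
for L, h_mu, E in finite_L_estimators([0.0, 1.0, 1.0, 1.0]):
    print(L, h_mu, E)
```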

What Good is This?

• [Figures from Crutchfield and Feldman (2003)]
• Above: coin toss, a random, independent, identically distributed (IID) process with E = 0
• Below: period-16 process (1010111011101110)∞ with

    E = log_|A| p = log2 16 = 4  and  hµ = 0

• Also capable of detecting hidden Markov models of finite order and infinitary processes (e.g., Thue-Morse)

Practical Limits

• The alphabet size |A| and the sample size Ns impose an upper practical limit on the block size L:

    L ≤ log_|A| Ns

• Define a quality ratio Q(L, Ns): the expected frequency of a given block of length L under an assumed uniform distribution,

    Q(L, Ns) = Ns / |A|^L

• One can think of Q(L) as a safety ratio or confidence ratio
  • View results with higher Q values as more likely to be correct than those with lower Q values

Number of Observations Ns Required to Support Q = 5

  |A| \ L |    2 |     4 |        8 |       12
  --------+------+-------+----------+----------
        2 |   20 |    80 |     1280 |    20480
        3 |   45 |   405 |    32805 |  2657205
        6 |  180 |  6480 |  8398080 | 1.09x10^10
        8 |  320 | 20480 | 8.39x10^7 | 3.44x10^11
       10 |  500 | 50000 |   5x10^8 |   5x10^12
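The table entries follow directly from inverting the quality ratio: Ns = Q·|A|^L. A one-line sketch that reproduces them (function name hypothetical):

```python
# Hedged sketch: sample size needed to support a quality ratio Q, from
# Q(L, Ns) = Ns / |A|^L  =>  Ns = Q * |A|^L, reproducing the table above.

def required_ns(alphabet_size, L, Q=5):
    return Q * alphabet_size ** L

print(required_ns(2, 12))  # → 20480
print(required_ns(3, 8))   # → 32805
```

The exponential growth in L is the point: even a binary alphabet at L = 12 needs ~20000 observations, and richer alphabets quickly become infeasible for station records.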

Case Study: Rainfall Data

Rainfall Data

• “Patched” daily station data from the Queensland Department of Natural Resources Patched Point Dataset (PPD)
• For stations, “patched” refers to interpolation to fill in missing days and to redistribute accumulations from weekends and holidays (in some cases using nearby station data as available)
• Data available 1 January 1890 to present, for a sample size of ~43200 observations with (thanks to patching) no missing values
• Station locations chosen for perfect or near-perfect records, based on BoM fact sheets for each station
• Coarse-grain the data: assign ‘1’ for positive daily precipitation values, ‘0’ for no measured precipitation
• Compute the C(1), C(2), H(L), hµ(L), Δ²H(L), and G(L) complexity and predictability metrics
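The coarse-graining step in the case study is a simple threshold at zero. A minimal sketch (the daily totals shown are made up for illustration):

```python
# Hedged sketch: the case study's coarse-graining - daily rainfall totals
# mapped to '1' for any measured precipitation and '0' for none.

def coarse_grain(rain_mm):
    return [1 if r > 0.0 else 0 for r in rain_mm]

daily = [0.0, 0.2, 5.4, 0.0, 0.0, 12.1, 0.0]   # hypothetical week of totals
print(coarse_grain(daily))  # → [0, 1, 1, 0, 0, 1, 0]
```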

1 Jan 1889 - 18 April 2007: 43206 observations

• Linear fit H(L) = E + hµ L yields E = 0.14895 and hµ = 0.87288
• Looks random, but E ≠ 0: this process is not random IID
• This fitted estimate of E at large L is even smaller than the lower bound for E from E(L)!

First Problem - Error Bars?

• The graphs you just saw have no error bars on the values of C(1), C(2), H(L), hµ(L), Δ²H(L), and G(L)
• For smaller values of L it is possible to subdivide the time series into Nb segments of length M = Ns/Nb, and compute Nb values of C(1), C(2), H(L), hµ(L), Δ²H(L), and G(L)
• The distribution of the individual values would allow one to construct confidence intervals associated with a desired value of Q
• A key question: are there scaling laws for these confidence intervals w.r.t. Q?
• A topic of future investigation
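The subdivision idea can be sketched as follows; the spread of per-segment estimates gives a crude error bar. All names are illustrative, and the block-entropy helper is redefined here so the snippet stands alone:

```python
# Hedged sketch: split the stream into Nb segments of length M = Ns/Nb and
# compute H(L) on each; the spread of the Nb values indicates uncertainty.
import random
from collections import Counter
from math import log2

def block_entropy(stream, L):
    blocks = [tuple(stream[i:i + L]) for i in range(len(stream) - L + 1)]
    n = len(blocks)
    counts = Counter(blocks)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def segment_entropies(stream, L, Nb):
    M = len(stream) // Nb
    return [block_entropy(stream[i * M:(i + 1) * M], L) for i in range(Nb)]

# Synthetic i.i.d. binary stream: each segment's H(2) should sit near 2 bits.
random.seed(0)
stream = [random.randint(0, 1) for _ in range(4000)]
vals = segment_entropies(stream, 2, Nb=4)
print(min(vals), max(vals))
```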

Another Problem - Data Quality

• Chief problem: missing days due to lack of observer diligence, i.e., missing observations on weekends (hence more frequent rainy Mondays than Saturdays or Sundays)
• Thus, some of our 0’s should be 1’s
• How sensitive are C(1), C(2), H(L), hµ(L), Δ²H(L), and G(L) to such errors?
• Could address uncertainty due to errors through Monte Carlo methods
  • Perturb the symbol stream by randomly changing a fraction of the symbols, creating an ensemble of “neighbouring” symbol streams
  • Compute confidence intervals for block entropy (and other symbol-based) quantities
• Also an area of future investigation
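The Monte Carlo perturbation step can be sketched directly; the helper below is an illustrative assumption (flipping a fixed fraction of a binary stream), not the talk's implementation:

```python
# Hedged sketch: create "neighbouring" symbol streams by flipping a random
# fraction of the symbols, then recompute symbol-based statistics per member.
import random

def perturb(stream, fraction, rng):
    out = list(stream)
    k = int(round(fraction * len(out)))
    for i in rng.sample(range(len(out)), k):  # k distinct positions
        out[i] = 1 - out[i]                   # binary alphabet: flip 0 <-> 1
    return out

rng = random.Random(42)
base = [0] * 90 + [1] * 10                    # hypothetical rain/no-rain stream
ensemble = [perturb(base, 0.05, rng) for _ in range(20)]
# Each member differs from the base in exactly 5 of 100 positions.
print(sum(a != b for a, b in zip(base, ensemble[0])))  # → 5
```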

Conclusions / Future Work

Conclusions
• Symbolic dynamics is a useful alternative approach to the analysis of climate and weather data
• Some symbolisations may be more useful than others
• Block entropy methods based on entropy rates of convergence are a promising dynamics classifier
• But block entropy methods are data-hungry, and explore only short-term predictability

Future Work
• Develop error models for both sample-size limitations and symbol errors
• Apply this enhanced method more widely

References

1. C.S. Daw, C.E.A. Finney, and E.R. Tracy, “A Review of Symbolic Analysis of Experimental Data,” Review of Scientific Instruments, 74(2), 915-930 (2003).

2. G. D’Alessandro and A. Politi, “Hierarchical Approach to Complexity with Applications to Dynamical Systems,” Physical Review Letters, 64(14), 1609-1612 (1990).

3. J.P. Crutchfield and D.P. Feldman, “Regularities Unseen, Randomness Observed: Levels of Entropy Convergence,” Chaos, 13(1), 25-54 (2003).

4. J.B. Elsner and A. Tsonis, “Complexity and Predictability of Hourly Precipitation,” Journal of the Atmospheric Sciences, 50(3), 400-405 (1993).

5. P.M. Dickens and J.W. Larson, “Classifiers for Causes of Data Loss Using Packet-Loss Signatures,” Proceedings of CCGrid2004, the Fourth IEEE/ACM International Symposium on Cluster Computing and the Grid (2004).

6. S.J. Jeffrey, J.O. Carter, K.B. Moodie, and A.R. Beswick, “Using Spatial Interpolation to Construct a Comprehensive Archive of Australian Climate Data,” Environmental Modelling and Software, 16(4), 309-330 (2001).

7. N.R. Viney and B.C. Bates, “It Never Rains on Sunday: The Prevalence and Relevance of Untagged Multi-day Rainfall Accumulations in the Australian High Quality Data Set,” International Journal of Climatology, 24, 1172-1192 (2004).
