Internship Report

Financial time series modelling with multivariate ARMA, ARCH, GARCH and extensions Application to the robustness of systematic strategies

Linda Chamakh Mathematics Applied to Finance MSc thesis

Training period: April 2017 - September 2017, BNP Paribas
Supervisor: M. Ahmed Bel Hadj Ayed
MSc Directors: M. Aurélien Alfonsi (ENPC), M. Rémi Rhodes (UPEMLV)
Jury Members: M. Bernard Lapeyre, M. Mohammed El Rhabbi

Acknowledgements

First of all, I would like to thank the Systematic Strategies and Hybrids research team for welcoming me for this end-of-studies internship. Within the Paris Research Department of BNP Paribas, this team offers the best environment to carry out research-oriented missions, in a financial field I was not familiar with before this internship. I would like to address special thanks to Ahmed Bel Hadj Ayed, who gave me the opportunity to join the team, supervised this internship and gave me good advice throughout it. I would like to thank Jean-Philippe Lemor, the manager of the team, for his interest in my mission and his help in my search for a CIFRE PhD. Finally, I particularly thank Alexandre Husson, who co-supervised my internship and helped me a lot with its implementation aspects. This internship would not have been as pleasant without the many interns who surrounded me. I am happy to have met Soumaya, Lucas, Felipe, Yang, Valentin, Pierre-Yves and Ken, the other Research Department interns, and the interns from other departments I had the chance to meet thanks to the networking events organised by the great BNP Global Market Early Careers service.

Abstract: Financial time series show statistical properties referred to as "stylized facts". Low autocorrelation, leptokurtic and asymmetric distributions, long-range dependence and heteroskedasticity are the most important ones. Time series modelling offers a wide range of models, from ARMA and GARCH to more sophisticated extensions of GARCH models. Their use can be descriptive or predictive. The purpose of this report is to study and implement these classical models, extend them to the multidimensional case in order to take into account the influence between assets, and rate these models with regard to how well they reproduce the stylized facts. The practical application of this modelling to systematic strategies that we will explore is measuring the robustness of these strategies under new simulations. Most systematic strategies rely on backtesting, and their profitability is measured via profitability indices such as the Sharpe ratio. A trend-following strategy will be considered for our tests. With the best model retained after our research, we will draw a Sharpe ratio distribution from Monte-Carlo simulations of the fitted model and compare it to the backtest Sharpe ratio. Trend-following strategies rely on strong autocorrelation; by perturbing the autocorrelation parameters, we will measure the impact on the expected returns.

Keywords: Time series, stylized facts, ARMA, ARCH, GARCH, EGARCH, systematic strategies, Markowitz, trend-following.

Contents

1 Introduction
2 Stylized facts in Daily Stock Returns
  2.1 Stylized empirical properties in Stock Returns
  2.2 Data description and notations
  2.3 Stylized facts illustration
    2.3.1 Low autocorrelation
    2.3.2 Heavy-tails
    2.3.3 Aggregational Gaussianity
    2.3.4 Volatility clustering
    2.3.5 Heteroskedasticity and long-memory of financial markets
    2.3.6 Leverage effect
  2.4 Conclusion
3 Autoregressive Moving-Average models
  3.1 Notations and definitions
  3.2 Unidimensional ARMA processes
    3.2.1 AutoRegressive Processes (AR)
    3.2.2 Moving-Average Processes (MA)
    3.2.3 AutoRegressive Moving Average Processes (ARMA)
  3.3 Multidimensional ARMA processes
  3.4 Maximum likelihood estimation
    3.4.1 Likelihood Maximization using gradient descent
    3.4.2 State Space representation
    3.4.3 EM algorithm with state-space model and Kalman recursion
    3.4.4 Fit and Simulations
  3.5 Regression estimation of (V)ARMA
    3.5.1 Mean, Autocovariance and Autocorrelations estimation
    3.5.2 The Hannan-Rissanen procedure
    3.5.3 Order Selection
    3.5.4 Fit and Simulations
  3.6 Conclusion and limits
    3.6.1 Conclusion on the algorithms
    3.6.2 Limits of (V)ARMA models in financial modelling
4 Heteroskedastic Models
  4.1 Generalized AutoRegressive Conditional Heteroskedasticity Model
    4.1.1 Heteroskedasticity
    4.1.2 ARCH and GARCH models
    4.1.3 Stationarity and moments existence
    4.1.4 Properties
    4.1.5 Conditional Mean Specification
  4.2 Fitting GARCH processes
    4.2.1 Likelihood Maximization under gaussian assumption using gradient descent
    4.2.2 Likelihood Maximization under generalized error distribution using gradient descent
    4.2.3 Other possibilities, explored or not
    4.2.4 Fit and Simulations
  4.3 Multidimensional GARCH processes
    4.3.1 The Flexible MGARCH
    4.3.2 Compatibility constraints
  4.4 Fitting MGARCH processes
    4.4.1 Ledoit et al. estimation method
    4.4.2 Preliminary estimation under gaussian assumption
    4.4.3 Transformation to satisfy the compatibility constraints
    4.4.4 Fit and Simulations
  4.5 Limits of (M)GARCH models in financial modelling
    4.5.1 Stationarity
    4.5.2 A symmetric model
5 Leveraged Heteroskedastic Models
  5.1 Exponential GARCH
  5.2 Fitting EGARCH models
    5.2.1 Likelihood Maximization under gaussian assumption using gradient descent
    5.2.2 Likelihood Maximization under generalized error distribution using gradient descent
    5.2.3 Fit and simulations
  5.3 Multidimensional EGARCH processes
    5.3.1 Parcimonious MEGARCH
    5.3.2 Fit and simulations
  5.4 Conclusion, limits of MEGARCH, opening
6 Assessing Robustness of Systematic Strategies
  6.1 Systematic strategies
    6.1.1 Markowitz optimal allocation theory
    6.1.2 Trend-following or Mean-reverting?
  6.2 Profitability indices
    6.2.1 Sharpe Ratio
    6.2.2 Drawdown
    6.2.3 Calmar Ratio
  6.3 Systematic strategies robustness test
    6.3.1 AR(1)-MEGARCH(1,1) underlying model
    6.3.2 AR(1)-MEGARCH(1,1) parameters perturbation
7 Conclusion
8 Appendix
  8.1 EM-algorithm under state-space representation and Kalman filtering
    8.1.1 Proof of maximization step optimal matrices
    8.1.2 Kalman recursions
  8.2 The Durbin-Levinson / Yule-Walker procedure
  8.3 Correlated Multivariate Gaussian Variable
    8.3.1 Case dimension N = 2
    8.3.2 Case dimension N > 2
    8.3.3 GARCH Gradient Descent algorithm pseudo-code
  8.4 Generalized Error Distribution
    8.4.1 E|z|
    8.4.2 Generating GED variable

1 Introduction

The question of financial time series modelling has been widely studied over the past decades. A financial model has to make a tradeoff between prediction quality and ease of use. Trying to reproduce the statistical properties verified by a series can lead to complicated models; conversely, overly parsimonious models can diverge from reality. The first historical stock models were mainly based on independent and identically distributed variables: Bachelier in 1900 proposed to model stocks with a simple Brownian motion martingale. In the sixties, Samuelson extended this theory to modern markets by applying Bachelier's model to stock returns, laying the foundation of the Black-Scholes model. Fama theorised the efficient capital market theory in 1970. All these theories hinge on the strong hypothesis of gaussian market returns. In the sixties, Mandelbrot studied stock returns and rejected the gaussian hypothesis. Indeed, the distribution of financial returns shows fatter tails than the gaussian one. He also derived statistical properties, known as "stylized facts", requiring the inclusion of correlation in models. In the eighties, with the increasing importance of markets, new classes of models arose on the econometricians' side. Autoregressive Moving Average (ARMA) models are a class of models incorporating the correlation between current and past values of financial yields. Engle and Bollerslev became interested in modelling time series volatility with a non-linear model, giving birth to the ARCH and GARCH models. These models are still investigated today, with extensions to the multidimensional case and variations. They offer the double advantage of being parsimonious and reproducing statistical properties.

This internship took place in the BNP Paribas quantitative research team for systematic strategies. Most systematic strategies are based on rebalancing strategies on equity underlyings. While most systematic trading strategies rely on backtesting, it is always a challenging question to know whether a promising backtest will lead to a successful trading strategy. The strategy put in place might have performed well on the historical sample used for the calibration, but fail to generate positive returns in the future. This refers to a statistical problem known as over-fitting or over-optimization.

After a review of the main characteristics of financial time series (namely, equity returns), often referred to as "stylized facts", we will have a global idea of the behaviors a good financial time series model has to reproduce. Then, we will study some compatible models, and implement and compare algorithms to fit such models on financial time series. Finally, we will select and implement in the BNP Paribas pricing language ("ADA") the best model among the studied classes, together with the associated fitting and simulation algorithms, and launch the "robustness tests" (defined later).


2 Stylized facts in Daily Stock Returns

2.1 Stylized empirical properties in Stock Returns

Stylized facts are statistical properties that are empirically observed on virtually all daily asset returns. We can enumerate the following "stylized facts":

1. Low autocorrelation: the autocorrelation of asset returns is often insignificant, which indicates low linear dependence.
2. Heavy tails: the return distribution shows fatter tails than the gaussian distribution and exhibits a power-law tail.
3. Aggregational gaussianity: when increasing the time scale over which returns are calculated, their distribution looks more and more like a gaussian one.
4. Volatility clustering: high volatility autocorrelation over days shows that volatility peaks tend to cluster.
5. Heteroskedasticity - slow decay of the autocorrelation of absolute returns: significant autocorrelation of absolute returns is a sign of non-linear dependence over time. The autocorrelation of absolute returns decays like a power law, which can be interpreted as a sign of long-range dependence.
6. Leverage effect: volatility is negatively correlated with the returns of the asset.

In the following parts of this chapter, we explain the statistical tests that show whether a stylized fact is verified or not, and apply them to representative financial time series.

2.2 Data description and notations

The data considered in this chapter are daily historical stock prices. Stylized facts will be illustrated on ten years of historical daily S&P 500 and EURO STOXX 50 returns.

S&P 500 The Standard & Poor's 500 is an American stock market index based on the market capitalizations of 500 large companies listed on the NYSE or NASDAQ. It is a good representation of the US equity market.

Euro Stoxx 50 The Euro Stoxx 50 is a European market index launched in 1998. It takes into account 50 highly capitalized companies, with its composition updated every year.

Notations We denote:

• S(t) the price level at time t of a financial asset
• X(t) the price return, defined as (S(t+1) − S(t)) / S(t)
• δ_τ(t) the price log-return of lag τ: ln(S(t+τ) / S(t))
• γ_t(h) = Cov(X(t), X(t − h)), h ∈ Z, the autocovariance function
• ρ_t(h) = γ_t(h) / Var(X_t), h ∈ Z, the autocorrelation function


Figure 1: Euro stoxx returns. Figure 2: S&P 500 returns.

2.3 Stylized facts illustration

2.3.1 Low autocorrelation

In this section, we examine the autocorrelation of financial time series. After plotting it, we will test for the presence of autocorrelation via the Portmanteau test.

Autocorrelation representation We plotted the empirical autocorrelation function of the Euro Stoxx and the empirical autocorrelation function of a gaussian white noise sample of the same size.

Figure 3: Autocorrelogram of Euro Stoxx returns and of a gaussian white noise process.

When plotting the empirical autocorrelation function, one can observe that it is not very different from the autocorrelogram of a white noise process, i.e. insignificant. This can be explained by the principle of no-arbitrage in liquid markets: if strong correlations existed, it would be easy to build arbitrage strategies taking advantage of them.

Testing the presence of autocorrelation: the Portmanteau test The Ljung-Box test, also called modified Q-statistic or Portmanteau test, is a test of autocorrelation. The test statistic is:

Q_LB = n(n + 2) Σ_{k=1}^{h} ρ̂_k² / (n − k)

where n is the sample size, ρ̂_k the sample autocorrelation at lag k and h the number of lags being tested. If we denote the hypotheses:

• H0: the data are independently distributed up to lag h, i.e. ρ(1) = ... = ρ(h) = 0
• H1: the data are not independently distributed and show serial autocorrelation,

then under H0, Q_LB follows a χ²(h) distribution. For significance level α, the critical region for rejection of the hypothesis of randomness is:

Q_LB > χ²_{1−α}(h)

This comes from the fact that, for large n, the sample autocorrelations of an i.i.d. sequence Y_1, ..., Y_n with finite variance are approximately i.i.d. with distribution N(0, 1/n). This test is mainly used to test the independence of model residuals. We applied it to our raw returns. Results are shown in table 1.

Lag           | 1     | 5     | 10
Euro Stoxx 50 | 2.34  | 30.42 | 34.54
S&P 500       | 29.11 | 43.93 | 57.74
χ²_{1−α}(h)   | 3.84  | 11.07 | 18.31

Table 1: Q-statistic and quantile of level α = 5%

→ The conclusion is that, even if weak and difficult to establish with statistical tests, sample autocorrelations are not nonexistent, and the data cannot be considered independent for large lags h.
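As an illustration, here is a minimal sketch of the Ljung-Box statistic on a return series; the use of numpy/scipy and the simulated input are assumptions, not the implementation used in this internship (which was done in the BNP pricing language).

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, h):
    """Ljung-Box Q-statistic for lags 1..h and the 5% chi-square threshold."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    # sample autocorrelations rho_hat(k), k = 1..h
    rho = np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, h + 1)])
    q_lb = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, h + 1)))
    return q_lb, chi2.ppf(0.95, df=h)

# usage on simulated white noise: Q should stay below the threshold most of the time
returns = np.random.normal(0.0, 0.01, size=2500)
q, threshold = ljung_box(returns, h=10)
print(f"Q_LB = {q:.2f}, chi2(10) 95% quantile = {threshold:.2f}")
```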

2.3.2 Heavy-tails

Common financial models assume that financial returns follow a gaussian law. However, in practice, financial return distributions show fatter tails than the gaussian one. We are going to illustrate this fact with different concepts and tests:

1. the tail index;
2. Skewness and Kurtosis estimation;
3. the Jarque-Bera test of gaussianity;
4. the gaussian autocorrelation confidence interval;
5. QQ-plots.

Extreme values, power law and tail index Financial return tail distributions generally behave like power-law distributions with a finite tail index, i.e. fat tails. The power-law exponent α is an indicator of how slowly a distribution decreases.

Definition 2.1. A power-law probability distribution is a distribution defined, for "x large", as:

p(x) = C x^{−α}   for x ≥ x_min

where C = (α − 1) x_min^{α−1}. The power-law complementary cumulative distribution is then:

F̄(x) = P(X ≥ x) = (x / x_min)^{−α+1}   (1)

• β = α − 1 is called the tail index. Tails are fatter when β is small.
• γ = 1/β is called the extreme value index.

β can also be defined as the maximum order up to which the moments of X are finite. For example, since their moments are finite at every order, the tail index of the gaussian and exponential laws is equal to infinity. More generally, a distribution is considered heavy-tailed if it has a finite tail index. By maximizing the corresponding log-likelihood with respect to α, one gets the following estimate of α:

α̂ = 1 + n (Σ_{i=1}^{n} ln(x_i / x_min))^{−1}

The complementary cumulative distribution function of Euro stoxx extreme values coincides with a power-law distribution of tail index equal to 3.17. With the S&P, we found an estimated α of 2.79.
→ It means that the Euro stoxx and S&P distributions show fat tails.

Skewness, Kurtosis and Jarque-Bera test Skewness is an asymmetry indicator, whereas Kurtosis is generally used as a heavy-tail indicator or as a gaussianity indicator. Financial time series generally show a Kurtosis higher than 3, which corresponds to fat tails, and a non-zero Skewness. The Jarque-Bera test is a normality test which is generally rejected for financial time series.

Definition 2.2. Let X be a random variable. Skewness is defined as the third centered moment normalized by the cubed standard deviation:

Skew(X) = E[(X − E[X])³] / E[(X − E[X])²]^{3/2}   (2)

Kurtosis is defined as the fourth centered moment normalized by the squared variance:

Kurt(X) = E[(X − E[X])⁴] / E[(X − E[X])²]²   (3)

For example, the centered reduced gaussian has a null Skewness and a Kurtosis equal to 3. The Jarque-Bera test is a normality test built on the Skewness and Kurtosis indicators. Under the gaussian hypothesis, the Jarque-Bera statistic follows a χ²(2) distribution.

JB(X) = (n/6) (Skew(X)² + (Kurt(X) − 3)² / 4)   (4)

where n denotes the sample size. We calculated this statistic for Euro stoxx and S&P 500 data, on different time ranges (2007-2017 and 1998-2017). Results are shown in table 2.

              | Mean     | Variance | Kurtosis | Skewness | JB statistic
Euro Stoxx 50 | 0.00298% | 0.0259%  | 8.6      | 0.12     | 3380
S&P 500       | 0.0232%  | 0.0168%  | 13.6     | -0.082   | 12167

Table 2: Description of data - 2007-2017

In table 2, the Jarque-Bera statistic is to be compared to the 95% quantile of the χ²(2) distribution, which is approximately equal to 6.
→ As the statistic is far higher than this quantile, we reject the gaussian assumption.
→ On each time range, we also have a Kurtosis far higher than 3 and a non-null Skewness, which means that the distributions have fat tails and are asymmetric.

Autocorrelation and gaussian confidence interval In the absence of autocorrelation, i.e. if ϵ_t is an i.i.d. white noise,

√n ρ̂(h) → N(0, 1)

The associated confidence interval at 95% is then ±1.96/√n. For n = 4888 (data from 1998 until today), this is equal to 3%.

Figure 4: Financial time series autocorrelation compared to the confidence interval of gaussian autocorrelation.

We see that the financial time series autocorrelations lie outside this interval for most time lags.
→ The autocorrelations do not show a gaussian-compatible behavior.

Empirical quantiles We compared the empirical quantiles of the financial time series returns to the gaussian ones via the QQ-plot technique.

Figure 5: S&P 500 QQplot.

Figure 6: Euro stoxx QQplot.

→ The graphs show that the financial returns have fatter tails than the gaussian distribution.
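A minimal sketch of the heavy-tail diagnostics described above (power-law tail-index MLE, skewness, kurtosis and the Jarque-Bera statistic); the threshold choice, the toy data and the use of numpy/scipy are assumptions.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def tail_index_mle(x, x_min):
    """MLE of the power-law exponent alpha on the observations exceeding x_min."""
    tail = np.asarray(x, dtype=float)
    tail = tail[tail >= x_min]
    n = len(tail)
    return 1.0 + n / np.sum(np.log(tail / x_min))

def jarque_bera(x):
    """Jarque-Bera statistic built on skewness and (non-excess) kurtosis."""
    n = len(x)
    s = skew(x)
    k = kurtosis(x, fisher=False)  # Pearson kurtosis, equal to 3 for a gaussian
    return n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)

# usage on the absolute negative returns, with an arbitrary 95th-percentile threshold
returns = np.random.standard_t(df=4, size=2500) * 0.01   # heavy-tailed toy data
losses = -returns[returns < 0]
alpha_hat = tail_index_mle(losses, x_min=np.quantile(losses, 0.95))
print("alpha_hat =", alpha_hat, "JB =", jarque_bera(returns))
```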

2.3.3 Aggregational Gaussianity

As mentioned in Rama Cont's paper [Con01], one noticeable stylized fact shared by financial time series is that lagged returns δ_τ(t), τ > 1, tend more toward the normal distribution as τ increases.

Figure 7: Lagged returns distributions.

→ As shown in figure 7, the distribution effectively tends toward a gaussian distribution.
→ For small lags (1 and 5), we can also notice an asymmetric effect between positive and negative returns: tails are fatter for negative values. This can be linked to the leverage effect we will present in section 2.3.6.

2.3.4 Volatility clustering

High values of |Xt | tend to be followed by high values, and conversely small values by small values. But the moves are not necessarily in the same direction.

Figure 8: S&P 500 returns (Jan-07 / Nov-09)

As high values of X²_{t−1} tend to be followed by high values of X²_t, the conditional variance of X_t given X_{t−1} seems to be time-dependent. This phenomenon is called "conditional heteroskedasticity".

2.3.5 Heteroskedasticity and long-memory of financial markets

Another important observation is the "long memory" of financial markets. Even if financial time series show weak autocorrelation, the autocorrelations of squared or absolute returns are non-negligible. We are going to see this via:

1. the autocorrelogram of absolute returns;
2. the Hurst exponent.

Autocorrelation of squared or absolute returns We plotted the autocorrelogram of the absolute value of the Euro stoxx and S&P 500 data.

Figure 9: Autocorrelogram of the absolute value of the returns.

→ The autocorrelations of the absolute or squared returns are significant.
→ The autocorrelation of the absolute or squared returns decays slowly as a function of the time lag. Empirical studies show that the autocorrelation of absolute returns decays roughly as a power law with an exponent β ∈ [0.2, 0.4]:

γ_abs(h) = Cov(|X_{t+h}|, |X_t|) ∼ 1 / h^β   (5)

This phenomenon is referred to as the "long memory" or "long-range dependence" of financial markets. Absolute or squared returns can be interpreted as a volatility measure. Long-range dependence means that high volatility has an impact on future volatility. This phenomenon is also called "heteroskedasticity".

Testing for long-range dependence: the Hurst exponent Inspired by Harold Edwin Hurst's study in hydrology, the Hurst exponent is used as a long-range dependence indicator in time series. The Hurst exponent, H, is defined as follows:

E[R(n) / S(n)] = C n^H   as n → ∞

where:
• R(n) is the range of the first n values, and S(n) their standard deviation
• n is the time span of the observation

Interpretation:
• 0.5 < H ≤ 1: long-term positive autocorrelation / trend-following behavior
• 0 ≤ H < 0.5: short-term negative autocorrelation and switching behavior / mean-reverting behavior

Both intervals indicate long-range dependence, i.e. a power-law decay of the absolute values of the autocorrelations. H = 0.5 can indicate an uncorrelated series or an exponential decay of the autocorrelations.

Figure 10: Hurst-exponent derivation on Euro Stoxx data (2007-2017).

After a rescaled range (R/S) analysis on Euro Stoxx and S&P 500 data, we get a Hurst exponent of 0.24 for the former and 0.26 for the latter.
→ 0 ≤ H < 0.5: we can reject the uncorrelated hypothesis. This advocates for a mean-reverting behavior.
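A minimal sketch of a rescaled-range (R/S) estimate of the Hurst exponent; the window sizes and the log-log regression details are assumptions, not necessarily the exact procedure used on the Euro Stoxx data.

```python
import numpy as np

def hurst_rs(x, windows=(16, 32, 64, 128, 256)):
    """Estimate H from the slope of log E[R/S] versus log n over several window sizes."""
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in windows:
        rs_values = []
        for start in range(0, len(x) - n + 1, n):
            chunk = x[start:start + n]
            dev = np.cumsum(chunk - chunk.mean())     # cumulative deviations from the mean
            r = dev.max() - dev.min()                 # range R(n)
            s = chunk.std(ddof=0)                     # standard deviation S(n)
            if s > 0:
                rs_values.append(r / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(rs_values)))
    slope, _ = np.polyfit(log_n, log_rs, 1)           # E[R/S] ~ C * n^H
    return slope

returns = np.random.normal(0.0, 0.01, size=2560)
print("H =", hurst_rs(returns))   # close to 0.5 for an uncorrelated series
```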

2.3.6 Leverage effect

On the equity market, there is an asymmetry between the impact of positive past returns and the impact of negative past returns on the current volatility level. Black (1976) noted a negative correlation between current returns and future volatility. Negative returns (price declines) lead to higher volatility than positive returns of the same magnitude. An economic explanation suggested by Black is the "leverage effect": a reduction in the equity value would raise the debt-to-equity ratio, hence raising the riskiness of the firm, which translates into an increase in future volatility. It can be empirically illustrated by looking at the correlation between X_t^+ = max(X_t, 0) and |X_{t+h}|, and the correlation between X_t^− = min(X_t, 0) and |X_{t+h}|, the latter being larger in magnitude than the former.

Figure 11: Correlation between positive or negative returns and absolute lagged returns, as a function of the lag.

One can see that the magnitude of the correlation is higher for decreasing price levels than for increasing price levels.
→ As seen in figure 11, the correlation between past negative returns and future absolute returns is higher in magnitude than the correlation between past positive returns and future absolute returns.
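A minimal sketch of this empirical check, comparing corr(X_t^+, |X_{t+h}|) and corr(X_t^-, |X_{t+h}|) over a range of lags; the lag range, the toy data and numpy are assumptions.

```python
import numpy as np

def leverage_correlations(x, max_lag=20):
    """Correlation of positive / negative return parts with future absolute returns."""
    x = np.asarray(x, dtype=float)
    x_pos, x_neg = np.maximum(x, 0.0), np.minimum(x, 0.0)
    corr_pos, corr_neg = [], []
    for h in range(1, max_lag + 1):
        future_abs = np.abs(x[h:])
        corr_pos.append(np.corrcoef(x_pos[:-h], future_abs)[0, 1])
        corr_neg.append(np.corrcoef(x_neg[:-h], future_abs)[0, 1])
    return np.array(corr_pos), np.array(corr_neg)

returns = np.random.normal(0.0, 0.01, size=2500)
cp, cn = leverage_correlations(returns)
# on real equity returns |corr_neg| is typically larger than |corr_pos| at short lags
print(cp[:5], cn[:5])
```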


2.4 Conclusion

In order to fit the "stylized facts" well, a good financial model has to reproduce:

1. Low autocorrelation
2. Heavy tails
3. Heteroskedasticity
4. Volatility clustering
5. Leverage effect

The non-absence of autocorrelation leads us to study an important class of time series models: the Autoregressive Moving Average processes. Yet, heteroskedasticity and leverage effects are not captured by this first class of models, so we will explore other classes of models formed by the Generalized Autoregressive Heteroskedastic models. The purpose of this report is to show in detail the calculation steps in the maximum likelihood estimation of these models (including gradient calculations that are never detailed in the literature). We will also extend the models to the multidimensional case. While multidimensional ARMA models are quite common, parsimonious multidimensional GARCH or multidimensional exponential GARCH models have not been widely explored yet. We propose a parsimonious multidimensional E-GARCH at the end of this report.


3 Autoregressive Moving-Average models

3.1 Notations and definitions

Linear expectation The linear expectation of X_t given its past, EL(X_t | X_{t−1}), is defined as the best least-squares prediction of X_t as a linear combination of the X_{t−k}, k ∈ N.

Linear innovation We call linear innovation ϵ_t the part of X_t that cannot be predicted linearly given the past:

ϵ_t = X_t − EL(X_t | X_{t−1})

If (X_t)_{t∈Z} is second-order stationary, ϵ_t is a white noise.

Partial autocorrelation We call partial autocorrelation of order h, r(h), the coefficient of X_{t−h} in the linear regression of X_t on 1, X_{t−1}, ..., X_{t−h}:

X_t = α_0 + Σ_{i=1}^{h−1} α_i X_{t−i} + r(h) X_{t−h} + ϵ_t

Lag or Backward operator We denote by B the backward operator, which transforms X_t into X_{t−1}: B X_t = X_{t−1}.

Properties:
• ∀h ∈ N, B^h X_t = X_{t−h}
• B^{−1} is the forward operator: B^{−1} X_t = X_{t+1}
• B polynomials: if Φ(B) = 1 − ϕ_1 B − · · · − ϕ_p B^p, then Φ(B) X_t = X_t − ϕ_1 X_{t−1} − · · · − ϕ_p X_{t−p}

Definition 3.1. A sequence (X_t)_{t∈Z} of uncorrelated random variables, each of zero mean and variance σ², is referred to as white noise (with mean 0 and variance σ²). This is indicated by the notation {X_t} ∼ WN(0, σ²). White noises are thus stationary processes, and every linear combination of white noises is stationary. Every IID(0, σ²) sequence is WN(0, σ²), but not conversely.

Theorem 3.1 (Wold representation theorem). Every covariance-stationary process Y_t can be written as the sum of two time series, one deterministic and one stochastic:

Y_t = Σ_{i=1}^{∞} b_i ϵ_{t−i} + η_t

where:
• ϵ_t is an uncorrelated process (a white noise)
• η_t is a deterministic time series
• b_i are (possibly infinitely many) Moving-Average parameters

This theorem can be seen as the theoretical basis of ARMA models.

3.2 Unidimensional ARMA processes

A particular case of autoprojective (or stochastic) models are the ARMA (AutoRegressive Moving Average) processes. They gained momentum in the 1980s.

3.2.1 AutoRegressive Processes (AR)

An AutoRegressive process of order p, denoted AR(p), is a second-order stationary process (X_t)_{t∈Z} that follows:

∀t ∈ Z, Φ(B) X_t = X_t − Σ_{i=1}^{p} ϕ_i X_{t−i} = ϵ_t

where ϕ_i ∈ R, ϕ_p ≠ 0 and (ϵ_t)_{t∈Z} is a white noise process.

Stationarity and causality
• if Φ(z) ≠ 0 ∀z, |z| = 1, then (X_t)_{t∈Z} is stationary
• if Φ(z) ≠ 0 ∀z, |z| ≤ 1, then (X_t)_{t∈Z} is invertible and there exists a unique causal representation X_t = Σ_{i=0}^{∞} ψ_i ϵ_{t−i}

Autocorrelogram
• ρ(h) = Σ_{i=1}^{p} ϕ_i ρ(h − i), h ≥ 1: non-null and globally non-increasing
• r(p) = ϕ_p ≠ 0 and r(h) = 0 ∀h ≥ p + 1

"Financial" interpretation The autoregressive model can be seen as the extension of the random walk (Xt = Xt−1 + ϵt ) that includes terms further back in time. The model considers its own past behavior as inputs and as such attempts to capture market participant effects, such as momentum and mean-reversion in stock trading. 3.2.2

Moving-Average Processes (MA)

A Moving-Average process of order q, denoted MA(q), is a second-order stationary process (X_t)_{t∈Z} that follows:

∀t ∈ Z, X_t = Θ(B) ϵ_t = ϵ_t + Σ_{i=1}^{q} θ_i ϵ_{t−i}

where θ_i ∈ R, θ_q ≠ 0 and (ϵ_t)_{t∈Z} is a white noise.

Stationarity and invertibility
• X_t is stationary for all θ_i
• if Θ(z) ≠ 0 ∀z, |z| ≤ 1, then (X_t)_{t∈Z} is invertible and ϵ_t is the linear innovation of the process. AR(∞) representation: X_t = ϵ_t + Σ_{i=1}^{∞} α_i X_{t−i}

Autocorrelogram
• ρ(q) = θ_q / (1 + θ_1² + · · · + θ_q²) and ρ(h) = 0 ∀h ≥ q + 1
• Partial autocorrelations are globally decreasing

"Financial" interpretation The MA model sees random white noise "shocks" directly at each current value of the model. This model is used to characterize "shock" information in a series, such as a surprise earnings announcement or an unexpected event.


3.2.3 AutoRegressive Moving Average Processes (ARMA)

An ARMA(p, q) process is a second-order stationary process (X_t)_{t∈Z} that follows:

∀t ∈ Z, Φ(B) X_t = Θ(B) ϵ_t

where ϕ_i, θ_j ∈ R, ϕ_p ≠ 0, θ_q ≠ 0 and (ϵ_t)_{t∈Z} is a white noise.

Stationarity, causality and invertibility
• The representation is not unique. We always consider the minimal representation (where Θ and Φ have no common root).
• if Φ(z) ≠ 0 ∀z, |z| = 1, then (X_t)_{t∈Z} is stationary, Φ is invertible and we have the MA(∞) representation X_t = Φ^{−1}(B) Θ(B) ϵ_t
• if Φ(z) ≠ 0 ∀z, |z| ≤ 1, only past values of ϵ_t appear in the MA(∞) representation (causal representation)
• if Θ(z) ≠ 0 ∀z, |z| ≤ 1, then (X_t)_{t∈Z} is invertible and ϵ_t is the linear innovation of the process. AR(∞) representation: Θ^{−1}(B) Φ(B) X_t = ϵ_t

Example: ARMA(1,1)

X_t − ϕ X_{t−1} = ϵ_t + θ ϵ_{t−1}

Stationarity and causal representation X_t = Φ(B)^{−1} Θ(B) ϵ_t, where Φ(z)^{−1} has the following series representation depending on the value of |ϕ|:

Φ(z)^{−1} = 1 / (1 − ϕz) = Σ_{i=0}^{∞} ϕ^i z^i if |ϕ| < 1, and = −Σ_{i=1}^{∞} ϕ^{−i} z^{−i} if |ϕ| > 1

We can infer that:

X_t stationary ⟺ |ϕ| ≠ 1 ⟺ Φ(z) ≠ 0 ∀z, |z| = 1
X_t causal ⟺ |ϕ| < 1 ⟺ Φ(z) ≠ 0 ∀z, |z| ≤ 1

If |ϕ| < 1, Φ(B) is invertible and the process has the causal representation:

X_t = ϵ_t + (ϕ + θ) Σ_{i=1}^{∞} ϕ^{i−1} ϵ_{t−i}

Using the uncorrelatedness of ϵ_t and ϵ_{t−i}, i > 0, and assuming Var(ϵ_t) = σ², we also have, for h ≥ 1:

Cov(X_t, X_{t−h}) = (ϕ + θ) ϕ^{h−1} σ² + (ϕ + θ)² ϕ^h σ² / (1 − ϕ²)
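As a sanity check of the closed form above, here is a sketch that simulates a causal ARMA(1,1) path and compares the sample autocovariance at lag h with (ϕ + θ)ϕ^{h−1}σ² + (ϕ + θ)²ϕ^h σ²/(1 − ϕ²); the parameter values and simulation length are arbitrary assumptions.

```python
import numpy as np

phi, theta, sigma, h = 0.5, 0.3, 1.0, 2      # arbitrary causal/invertible parameters
n = 200_000

eps = np.random.normal(0.0, sigma, size=n)
x = np.zeros(n)
for t in range(1, n):                        # X_t = phi X_{t-1} + eps_t + theta eps_{t-1}
    x[t] = phi * x[t - 1] + eps[t] + theta * eps[t - 1]

xc = x - x.mean()
gamma_emp = np.mean(xc[h:] * xc[:-h])        # sample autocovariance at lag h

gamma_th = (phi + theta) * phi ** (h - 1) * sigma ** 2 \
    + (phi + theta) ** 2 * phi ** h * sigma ** 2 / (1 - phi ** 2)

print(f"empirical gamma({h}) = {gamma_emp:.4f}, theoretical = {gamma_th:.4f}")
```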

3.3 Multidimensional ARMA processes

We can extend the definitions above to the multivariate case and introduce another useful class of multivariate stationary d-dimensional processes (Y_t)_{t∈Z}, by requiring that (Y_t)_{t∈Z} satisfies a set of linear difference equations with constant coefficients. (Y_t)_{t∈Z} is a multidimensional ARMA(p,q) (VARMA(p,q)) process if (Y_t)_{t∈Z} is stationary and if, for every t,

Y_t − Φ_1 Y_{t−1} − · · · − Φ_p Y_{t−p} = u_t + Θ_1 u_{t−1} + · · · + Θ_q u_{t−q}   (6)

where
• the Y_{t−i} are d-dimensional (d × 1 vectors)
• the u_{t−i} are d-dimensional white noises WN(0, Σ)
• the Φ_i and Θ_i are (d × d) matrices
• B denotes the backward operator (B^i X_t = X_{t−i})

If we denote Φ(B) = I_d − Φ_1 B − · · · − Φ_p B^p and Θ(B) = I_d + Θ_1 B + · · · + Θ_q B^q, the VARMA(p,q) equation can be written in the more compact form:

Φ(B) Y_t = Θ(B) u_t

Stationarity, causality and invertibility The stationarity, causality and invertibility conditions generalize the unidimensional case.
• The representation is not unique. We always consider the minimal representation (where Θ and Φ have no common eigenvector).
• if |Φ(z)| ≠ 0 ∀z, |z| = 1, then (Y_t)_{t∈Z} is stationary, Φ is invertible and we have the MA(∞) representation Y_t = Φ^{−1}(B) Θ(B) u_t
• if |Φ(z)| ≠ 0 ∀z, |z| ≤ 1, there exist matrices D_i such that Y_t = Σ_{i=0}^{∞} D_i u_{t−i}
• if |Θ(z)| ≠ 0 ∀z, |z| ≤ 1, then (Y_t)_{t∈Z} is invertible and u_t is the linear innovation of the process. AR(∞) representation: Θ^{−1}(B) Φ(B) Y_t = u_t

Figure 12: Simulation of ARMA(2,3) and VARMA(2,3) time series fitted on the S&P 500. Figure 13: Simulation of ARMA(2,3) and VARMA(2,3) time series fitted on the Euro stoxx.


There are plenty of ways of fitting ARMA and Vector-ARMA processes, from regression-based methods to maximum likelihood estimation. In the unidimensional case, Brockwell's book [BD02] gives an overview of the most used fitting methods. Lütkepohl [Lüt05] generalizes the maximum likelihood method and the State Space representation to the multidimensional case. After a state-of-the-art review, we focused on two main fitting methods: the Hannan-Rissanen procedure and the maximum likelihood estimation using the state-space EM algorithm detailed below. Kascha [Kas12] concluded that the Hannan-Rissanen technique was the best alternative to the highly non-linear maximum likelihood by gradient descent technique. Metaxoglou (2005) developed a state-space EM algorithm for likelihood maximization and concluded that this algorithm is more robust than seven other optimization techniques. In our implementation results, Hannan-Rissanen appears as the best algorithm.

3.4 Maximum likelihood estimation

3.4.1 Likelihood Maximization using gradient descent

Using the prediction variances v_{j−1} = E(X_j − X̂_{j−1})², where X̂_{j−1} denotes the best linear prediction of X_j given its past, and assuming a gaussian distribution for X_t, the likelihood reduces to:

L = (2π)^{−n/2} (v_0 · · · v_{n−1})^{−1/2} exp( −(1/2) Σ_{j=1}^{n} (X_j − X̂_{j−1})² / v_{j−1} )

The maximum likelihood estimators are the parameters Φ, Θ that maximize the log-likelihood ln L. The search for the maximum can be done with optimization techniques (gradient descent, genetic algorithms, ...). In what follows, we present a maximum likelihood method via state-space representation and Kalman filtering.

3.4.2 State Space representation

A state-space model consists of a transition equation on the state variables x_t:

x_{t+1} = B_t x_t + F_t z_t + w_t,   w_t ∼ WN(0, Σ_w)

and of a measurement (or observation) equation on the observations y_t:

y_t = H_t x_t + G_t z_t + v_t,   v_t ∼ WN(0, Σ_v)

(V)ARMA models can be associated with multiple state-space representations. Metaxoglou [MS07] introduced a useful state-space representation, transcribed below, for VARMA processes with MA order q of at least one. Using the fact that a VMA(q) plus a white noise remains a VMA(q) ([Pei88], Thm 2), we can write Θ(L) u_t ≡ F(L) x_t + ϵ_t, where

• x_t = (v_t, ..., v_{t−q})′
• v_t and ϵ_t denote white noise processes
• Θ = [I_d, Θ_1, ..., Θ_q]
• F = [I_d, Γ_1, ..., Γ_q]


Let Φ = (Φ_1, ..., Φ_p) denote the AR parameters and F = [I_d, Γ_1, ..., Γ_q] the "MA" parameters of the second MA representation. If Z_t = (Y_{t−1}, ..., Y_{t−p})′, we have:

Y_t − Φ Z_t = Θ(L) u_t = F x_t + ϵ_t

This is the observation equation. If T denotes the block matrix [0 0; I_{dq} 0], with d the dimension of the observation Y_t, we have the following transition equation:

x_t = T x_{t−1} + η_t,   η_t ∼ N(0, Σ_η)

where η_t = (v_t, 0, ..., 0)′ and Σ_η = [Σ_v 0; 0 0].

So the following equations form a state-space representation of a VARMA(p,q) process:

Y_t = Φ Z_t + F x_t + ϵ_t,   ϵ_t ∼ N(0, Σ_ϵ)
x_t = T x_{t−1} + η_t,   η_t ∼ N(0, Σ_η)   (7)

We can go from F back to Θ thanks to the following mapping (cf. [MS07]):

Θ_j = F T^{j−1} K   (8)

where K is the asymptotic Kalman filter gain and Σ_u the steady-state error covariance.

3.4.3 EM algorithm with state-space model and Kalman recursion

EM algorithm - general framework The Expectation-Maximization (EM) algorithm is an algorithm which maximizes the "completed" log-likelihood of the observations y and of additional information x. Consider:
• a sample of observations Y = (Y_t)_{t∈[0:T]},
• unobserved data X = (X_t)_{t∈[0:T]} (typically: the state variables),
• an unknown parameter θ such that θ parameterizes the law of Y.

Using Bayes' formula: f(Y | θ) = f(X, Y | θ) / f(X | Y, θ). Taking the log and the expectation under θ_0 given Y:

ℓ(θ; Y) = E_{θ_0}(log f(Y, X | θ) | Y) − E_{θ_0}(log f(X | Y, θ) | Y) = Q(θ | θ_0) − H(θ | θ_0)

where Q denotes the completed log-likelihood function. It can be shown that if there exists a θ that maximizes Q, then it increases the log-likelihood ℓ (and is a good parameterization of the law of Y). This algorithm is therefore convenient to use with a state-space representation, where the observations Y are given and the unobserved state variables x can be estimated via Kalman filtering. The algorithm alternates between computing the expected completed log-likelihood given the observations (E-step) and maximizing it with respect to θ (M-step):

• Expectation step: calculate Q(θ | θ^(i)) = E_{θ^(i)}(ℓ(θ; X, Y) | Y), where X is the unobserved data
• Maximization step: maximize Q(θ | θ^(i)) w.r.t. θ


E-step of the EM algorithm Given an estimate of the parameter values Φ_k, F_k, x_0 and Σ_{x,0}, the state variables x_t are evaluated with Kalman filtering. Under a state-space model, the Kalman filtering method offers a way to compute

x_{t|T} = E(x_t | y_0, ..., y_T)

depending also on the value of the matrices in the state-space representation. The Kalman recursions corresponding to our VARMA(p,q) state-space model are given in the appendix.
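For reference, here is a minimal sketch of one Kalman filter/prediction step for a generic linear gaussian state-space model of the form x_t = T x_{t−1} + η_t, Y_t = ΦZ_t + F x_t + ϵ_t. The exact recursions (and the smoothed x_{t|T}) used in this report are the ones given in the appendix, so this is only an illustrative assumption.

```python
import numpy as np

def kalman_step(x_pred, P_pred, y, z, T, Phi, F, Sigma_eta, Sigma_eps):
    """One update + prediction step for x_t = T x_{t-1} + eta_t, Y_t = Phi Z_t + F x_t + eps_t."""
    # innovation and its covariance
    innov = y - Phi @ z - F @ x_pred
    S = F @ P_pred @ F.T + Sigma_eps
    K = P_pred @ F.T @ np.linalg.inv(S)            # Kalman gain
    # filtered state x_{t|t} and covariance P_{t|t}
    x_filt = x_pred + K @ innov
    P_filt = P_pred - K @ F @ P_pred
    # one-step-ahead prediction x_{t+1|t}, P_{t+1|t}
    x_next = T @ x_filt
    P_next = T @ P_filt @ T.T + Sigma_eta
    return x_filt, P_filt, x_next, P_next
```

The E-step then runs these recursions over the whole sample, plus a backward smoothing pass, to obtain the x_{t|T}.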

M-step of the EM algorithm In this step, we want to maximize the completed log-likelihood in θ: E(log f(Y, X | θ) | θ_k). We are going to show that we have explicit formulas for Q(θ | θ_0) and for θ_{k+1} such that θ_{k+1} ∈ argmax_θ Q(θ | θ_k). In the state-space framework, the X values given θ_k were obtained by Kalman filtering. Assuming that ϵ_t ∼ N(0, Σ_ϵ) and v_t ∼ N(0, Σ_v), and using the observation and transition equations, we have that Y_t − Φ_k Z_t − F_k x_t ∼ N(0, Σ_ϵ) and x_{t|T} − T x_{t−1|T} ∼ N(0, Σ_{η,k}).

Proposition 3.2 (EM completed likelihood function and optimal solution).

• The completed log-likelihood E(log f(Y, X | θ) | θ_k) is given by:

ℓ(Σ_{v,k}, Φ_k, F_k) = ((T+1)/2) ln|Σ_ϵ^{−1}| + (T/2) ln|Σ_{v,k}^{−1}|
  − (1/2) Σ_{t=1}^{T} (x_t − T x_{t−1})′ Σ_{η,k}^{−1} (x_t − T x_{t−1})
  − (1/2) Σ_{t=0}^{T} ϵ_t′ Σ_ϵ^{−1} ϵ_t + cst

• The optimal Φ_{k+1} and F_{k+1} under the state-space representation are:

(Φ′_{k+1}; F′_{k+1}) = [ZZ′ Zx′; xZ′ xx′]^{−1} (ZY′; xY′)

where the blocks are the cross-product sums ZZ′ = Σ_t Z_t Z_t′, Zx′ = Σ_t Z_t x_{t|T}′, xx′ = Σ_t x_{t|T} x_{t|T}′, ZY′ = Σ_t Z_t Y_t′ and xY′ = Σ_t x_{t|T} Y_t′.

• The optimal Σ_v estimate is:

Σ_{v,k+1} = (1/T) Σ_{t=0}^{T−1} v_t v_t′

Proof is given in the appendix.

Conclusion: the iterations of the EM algorithm under the state-space model can be summarized as follows:

1. Consider initial values for Φ, F, Σ_v (given by a preliminary estimation)
2. Expectation step: using the Kalman recursions, compute (x_t | Y_0, ..., Y_T)
3. Maximization step: update Φ, F, Σ_v by replacing x_t with x_{t|T} in the optimal solution
4. Repeat the last two steps until convergence

The convergence criterion used is: min(‖θ_{k+1} − θ_k‖_∞, |ℓ_{k+1} − ℓ_k|) ≤ ϵ.

Remark.
• In practice, we make a first estimation of Φ and Γ with the Hannan-Rissanen algorithm and we set Θ = Γ. Σ_v is initialized as the covariance matrix of the noises estimated by Hannan-Rissanen.
• As F = [I_d, Γ_1, ..., Γ_q], it is better to differentiate with respect to G = [Γ_1, ..., Γ_q] (since I_d, the identity matrix, is fixed) and to split x_t between its d first components and the others. We get a result similar to the one in the proposition, replacing F by G and x_t by its q·d last components.
• In his paper, Metaxoglou proposes to assume that Σ_ϵ is a fixed diagonal matrix. We keep this assumption: Σ_ϵ = ϵ I_d.
• When plugging x_{t|T} and x_{t−1|T} into the log-likelihood and into the estimators, we are not ensured that η_{t|T} = x_{t|T} − T x_{t−1|T} is coherent with the initial state-space representation, i.e. that it has zero coefficients in its last d·q components. In practice, this condition is relaxed.

3.4.4 Fit and Simulations

VARMA(1,1) fit with Kalman recursion and optimal matrices update at each iteration We tried to fit the Euro stoxx / S&P 500 data using our state-space EM algorithm. First, we initialize Φ and F using the results given by Hannan-Rissanen (the principle of this algorithm is given later). We find the following matrices Φ and Θ:

Φ_HR = [-0.094 0.446; 0.00552 0.0979]   Θ_HR = [-0.201 0.0198; 0.0299 -0.236]

We set F_0 = [I_2, Θ_HR] and Φ_0 = Φ_HR. Second, we launch the EM algorithm:

Initialization: n = 0; x_{0|0} = (0, 0)′; Σ_x(0|0) = Cov(u_t, u_{t−1}); Σ_ϵ = ϵ I_2; Σ_η = [Cov(u_t) 0_2; 0_2 0_2]; F_0 = [I_2, Θ_HR] and Φ_0 = Φ_HR;
while the stopping criterion is not met (distance > ϵ and n < Max iteration) do
  Update x_{t|T} with the Kalman recursions;
  Update the ML estimators Φ, F and Σ_v;
  Compute the distance and the log-likelihood;
end

Algorithm 1: EM algorithm

We take as distance the L∞ norm of the difference between the current updated matrix and the last updated matrix:

dist(P, P_new) = max_{i,j} |P − P_new|_{i,j}

23

ϵ             | 0.01
Max iteration | 1000
Cov(u_t)      | [0.000211 0.000127; 0.00012673 0.000166]
ϵ in Σ_ϵ      | 0.00001

Table 3: Parameters

The covariance matrix of u_t is estimated on the residuals of the Hannan-Rissanen algorithm. The covariance matrix between u_t and u_{t−1} has components between 10^{-4} and 10^{-7}.

Φ_EM = [-1.68 0.795; -0.524 -0.486]   Θ_EM = [0.967 -5.23·10^{-4}; 3.21·10^{-3} 0.954]

Unfortunately, when looking at the evolution of the log-likelihood over the iterations, it is decreasing instead of being maximized.

Figure 14: Evolution of the matrix distance with iteration - EM algorithm.

Figure 15: Evolution of the log-likelihood with iteration - EM algorithm.

→ The log-likelihood is decreasing instead of increasing. This method does not work. It is possible that the solution found by differentiating the log-likelihood and setting the derivative to zero corresponds to a minimum instead of a maximum.

• VARMA(1,1) fit with Kalman recursion and matrices update with "gradient ascent" We tried another update of the matrices, based on the gradient method: update the matrices by adding a fraction of the partial derivative at each iteration:

Φ^n = Φ^{n−1} + α ∂ℓ/∂Φ
Γ^n = Γ^{n−1} + α ∂ℓ/∂Γ   (9)

To measure the stability of the EM algorithm, we applied the following procedure:
1. Fit the series to draw a set of parameters
2. Generate a number of Monte-Carlo times new (V)ARMA processes with the fitted parameters, then fit the generated series and record the parameters

1. Initial fit


With α = 10^{-5} and the initial matrices initialized by Hannan-Rissanen with a long-AR form of order 6, we find the following matrices:

Φ_EM = [-1.86·10^{-5} -6.22·10^{-6}; -1.03·10^{-5} -1.45·10^{-5}]   Θ_EM = [0.923 0.056; 0.0558 0.903]

Figure 16: Evolution of the matrix distance with iterations - EM algorithm with update using partial derivatives.
Figure 17: Evolution of the log-likelihood with iterations - EM algorithm with update using partial derivatives.

The results are more encouraging but still far from the Hannan-Rissanen results given later or from the R Studio results.

2. VARMA generation Once the parameters are fitted, we can generate new paths using the fitted parameters and the recursion equation of the corresponding (V)ARMA model:

Y_t = Φ̂_1 Y_{t−1} + ... + Φ̂_p Y_{t−p} + u_t + Θ̂_1 u_{t−1} + ... + Θ̂_q u_{t−q}

As initial values, the first p values of the true series and the first q estimated noises can be used:

Generate u_{q+1} ∼ N(0, Σ_u), then Y^{new}_{p+1} = Φ_1 Y_p + ... + Φ_p Y_1 + u_{q+1} + Θ_1 u_q + ... + Θ_q u_1

where Σ_u denotes the covariance matrix of the estimated noises. Here, Y^{new} is updated via the equation Y_{t+1} = Φ_1 Y_t + u_{t+1} + Θ_1 u_t.
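A minimal sketch of this generation step for a VARMA(1,1), drawing gaussian noises with an estimated covariance Σ_u; the parameter values below are placeholders, not the fitted ones.

```python
import numpy as np

def simulate_varma11(Phi1, Theta1, Sigma_u, y0, n_steps, seed=0):
    """Simulate Y_t = Phi1 Y_{t-1} + u_t + Theta1 u_{t-1} with gaussian noise u_t ~ N(0, Sigma_u)."""
    rng = np.random.default_rng(seed)
    d = len(y0)
    u = rng.multivariate_normal(np.zeros(d), Sigma_u, size=n_steps + 1)
    y = np.zeros((n_steps + 1, d))
    y[0] = y0                                   # first observed value used as initial condition
    for t in range(1, n_steps + 1):
        y[t] = Phi1 @ y[t - 1] + u[t] + Theta1 @ u[t - 1]
    return y[1:]

# placeholder parameters (not the fitted ones)
Phi1 = np.array([[-0.09, 0.45], [0.01, 0.10]])
Theta1 = np.array([[-0.20, 0.02], [0.03, -0.24]])
Sigma_u = np.array([[2.1e-4, 1.3e-4], [1.3e-4, 1.7e-4]])
paths = simulate_varma11(Phi1, Theta1, Sigma_u, y0=np.zeros(2), n_steps=2500)
```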

3. Stability of the parameters With the initial matrices initialized by Hannan-Rissanen with a long-AR form of order 6, we get the following means and standard deviations of the parameters:

Φ_simul,n=6 = [-1.90·10^{-5} ± 9.65·10^{-6}   -6.38·10^{-6} ± 3.23·10^{-6}; -1.05·10^{-5} ± 5.33·10^{-6}   -1.48·10^{-5} ± 7.51·10^{-5}]

Θ_simul,n=6 = [0.630 ± 0.0745   0.0258 ± 0.518; 0.0745 ± 0.0258   0.518 ± 0.309]

Additional result As shown later, the long-AR order n is more stable when taken equal to 2. With the initial matrices initialized by Hannan-Rissanen with a long-AR form of order 2, we get the following means and standard deviations of the parameters:

Φ_EM,n=2 = [-1.87·10^{-5} -6.36·10^{-6}; -1.05·10^{-5} -1.47·10^{-5}]   Θ_EM,n=2 = [0.923 0.0557; 0.0557 0.903]

Φ_simul,n=2 = [-1.90·10^{-5} ± 8.32·10^{-7}   -6.64·10^{-6} ± 7.35·10^{-7}; -1.07·10^{-5} ± 6.75·10^{-7}   -1.49·10^{-5} ± 6.22·10^{-7}]

Θ_simul,n=2 = [0.856 ± 0.0819   0.104 ± 0.0517; 0.104 ± 0.0517   0.696 ± 0.0132]

Comparison with R Studio On R Studio, with the marima package, the fitted parameters of a multidimensional ARMA(1,1) model are the following:

A_R-marima = [0.0656 0.9264; 0.1230 0.5383]   M_R-marima = [-0.3649 -0.4588; -0.0875 -0.6785]

Conclusion The EM algorithm is a theoretically interesting algorithm which pairs naturally with a state-space model. However, it is difficult to ensure the state-space model constraints (last q·d components of η_t equal to zero) with Kalman filtering. The results given by the algorithm are far from the results given by other benchmarks (R Studio for example). Moreover, this state-space representation applies only to VARMA models with an MA order q higher than zero, whereas we are going to show that the best order for our data corresponds to an AR(1).


The algorithm we are now going to present is the Hannan-Rissanen algorithm. It is often used as an initialization procedure for likelihood maximization algorithms; however, we are going to show that this algorithm is very stable by itself. Kascha [Kas12] showed that it was the best alternative to the maximum likelihood method. In the appendix, we added another regression-based algorithm, the Yule-Walker procedure, which applies only to autoregressive processes.

3.5 Regression estimation of (V)ARMA

3.5.1 Mean, Autocovariance and Autocorrelations estimation

In this sub-chapter, we present how we calculate the empirical moments, autocorrelations and partial autocorrelations of a time series.

• Mean estimation: X̄_n = (1/n)(X_1 + · · · + X_n). Remark: we subtract the mean before fitting ARMA models.
• Autocovariance estimation: γ̂_n(h) = (1/n) Σ_{t=1}^{n−|h|} (X_{t+|h|} − X̄_n)(X_t − X̄_n)
• Linear prediction of X_{n+h} given X_n, ..., X_1 and partial autocorrelation: a ∈ argmin_a S(a_0, ..., a_n), with S(a_0, ..., a_n) = E(X_{n+h} − a_0 − a_1 X_n − · · · − a_n X_1)², and a_1 = r(h) is the partial autocorrelation. Setting ∂S/∂a_i = 0 gives γ(h + i − 1) = Σ_{j=1}^{n} γ(i − j) a_j.

In matrix form: Γ_n a_n = γ_n(h), where a_n = (a_1, ..., a_n)′, Γ_n = [γ(i − j)]_{1≤i,j≤n} and γ_n(h) = (γ(h), γ(h+1), ..., γ(h+n−1))′.
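A minimal sketch of these empirical estimators; numpy and the toy input are assumptions.

```python
import numpy as np

def sample_autocovariance(x, h):
    """gamma_hat(h) = (1/n) * sum_{t=1}^{n-|h|} (x_{t+|h|} - mean)(x_t - mean)."""
    x = np.asarray(x, dtype=float)
    n, h = len(x), abs(h)
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:n - h]) / n

def sample_autocorrelation(x, h):
    return sample_autocovariance(x, h) / sample_autocovariance(x, 0)

x = np.random.normal(size=1000)
print(sample_autocorrelation(x, 1))   # close to 0 for white noise
```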

3.5.2 The Hannan-Rissanen procedure

The Hannan-Rissanen algorithm is more general than the Yule-Walker procedure, as it applies to general ARMA(p,q) processes and can easily be extended to the multidimensional case. This is the main algorithm we used to fit (V)ARMA processes. It consists of two sets of regressions:

• Step 1: Noise estimation û_t(n). Given a set X_1, ..., X_T of observations, take n "large" (> max(p, q)) and assume a long-AR form for X_t, t ∈ [n+1, T]:

X_t = Σ_{i=1}^{n} π_i X_{t−i} + u_t(n)

Regress {X_t}_{t∈[n+1,T]} on {X_{t−1}, ..., X_{t−n}}_{t∈[n+1,T]}. Minimizing the least-squares error gives:

π = (π_1, ..., π_n)′ = X Z′ (Z Z′)^{−1}

where X = (X_t)_{t∈[n+1,T]} and Z = (X_{t−1}, ..., X_{t−n})′_{t∈[n+1,T]}, so that U = (u_t(n))_{t∈[n+1,T]} = X − πZ.

Remark: if we want to fit a simple AR(p) process, we take n = p and stop at the first step.

• Step 2: Parameter estimation. Plug the estimated noises into the general VARMA(p,q) equation:

X_t = Σ_{i=1}^{p} ϕ_i X_{t−i} + Σ_{j=1}^{q} θ_j û_{t−j}

Regress {X_t}_{t∈[n+1+q,T]} on {X_{t−1}, ..., X_{t−p}, û_{t−1}, ..., û_{t−q}}_{t∈[n+1+q,T]}. Minimizing the least-squares error gives:

[Φ, Θ] = X Z′ (Z Z′)^{−1}

where X = (X_t)_{t∈[n+1+q,T]} and Z = (X_{t−1}, ..., X_{t−p}, û_{t−1}(n), ..., û_{t−q}(n))′_{t∈[n+1+q,T]}.

This algorithm remains the same in the multidimensional case, where the Xt and ut have more than one dimension.
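A minimal sketch of the two regression steps for a unidimensional ARMA(p,q); the least-squares solver and the handling of the first lags are assumptions rather than the exact implementation used in the report.

```python
import numpy as np

def hannan_rissanen(x, p, q, n_long):
    """Two-step Hannan-Rissanen estimation of a (demeaned) ARMA(p,q) series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    # Step 1: long-AR(n_long) regression to estimate the noises u_t(n)
    Z1 = np.column_stack([x[n_long - i:len(x) - i] for i in range(1, n_long + 1)])
    X1 = x[n_long:]
    pi_hat, *_ = np.linalg.lstsq(Z1, X1, rcond=None)
    u_hat = np.full(len(x), np.nan)
    u_hat[n_long:] = X1 - Z1 @ pi_hat
    # Step 2: regression of X_t on its own lags and on the estimated noises
    start = n_long + q
    regressors = [x[start - i:len(x) - i] for i in range(1, p + 1)]
    regressors += [u_hat[start - j:len(x) - j] for j in range(1, q + 1)]
    Z2 = np.column_stack(regressors)
    coeffs, *_ = np.linalg.lstsq(Z2, x[start:], rcond=None)
    return coeffs[:p], coeffs[p:]            # (phi_1..phi_p, theta_1..theta_q)

x = np.random.normal(size=3000)
phi_hat, theta_hat = hannan_rissanen(x, p=1, q=1, n_long=6)
```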


3.5.3 Order Selection

High values of the orders p and q can improve the fitting accuracy of the model, but too large orders can lead to overfitting and penalize the prediction likelihood. The idea is to introduce a penalty factor to discourage too many parameters (large p and q).

• General case: Akaike Information Criterion (AIC)

Kullback-Leibler discrepancy: for X an n-dimensional random vector with density in {f(·; ψ), ψ ∈ K}:

d(ψ | θ) = ∆(ψ | θ) − ∆(θ | θ), where ∆(ψ | θ) = E_θ(−2 ln f(X; ψ))

By Jensen's inequality, d(ψ | θ) ≥ 0, with equality iff f(x; θ) = f(x; ψ). Idea: minimize the Kullback-Leibler discrepancy given the true parameter θ. Assuming a gaussian distribution:

AIC(p, q) = E_θ(∆(θ̂ | θ)) = −2 ln(L_x) + 2n (p + q + 1) / (n − p − q − 2)

q\p |    0   |    1   |    2   |    3   |    4
 0  |        | -14274 | -14264 | -14272 | -14255
 1  | -14236 | -14234 | -14239 | -14248 | -14246
 2  | -14218 | -14216 | -14225 | -14235 | -14232
 3  | -14220 | -14218 | -14227 | -14236 | -14233
 4  | -14201 | -14199 | -14208 | -14218 | -14215

Table 4: AIC criteria for Euro stoxx time series fitted by Hannan-Rissanen as an ARMA(p,q) process, n=6.

q\p |    0   |    1   |    2   |    3   |    4
 0  |        | -15135 | -15126 | -15128 | -15109
 1  | -15098 | -15101 | -15101 | -15101 | -15100
 2  | -15082 | -15084 | -15082 | -15083 | -15084
 3  | -15083 | -15086 | -15084 | -15085 | -15086
 4  | -15064 | -15067 | -15064 | -15066 | -15067

Table 5: AIC criteria for S&P 500 time series fitted by Hannan-Rissanen as an ARMA(p,q) process, n=6.

Conclusion According to the AIC criterion, the best order for the Euro stoxx data corresponds to an AR(1) process. The conclusion is the same for the S&P 500 data.

• Multidimensional case: AIC criterion

The idea is the same as in one dimension: minimize the Kullback-Leibler discrepancy:

∆(θ) := E(−2 ln L_n(θ)) = n ln|Σ_ϵ| + n Tr(Σ_ϵ^{−1} S(θ))

The AIC criterion of Tsai & Hurvich is defined as:

AIC = n ln|Σ_ϵ| + nd (2nd + p + q) / (nd − p − q)

The order selection gives optimal orders of (1,0), so we focused on the (V)AR(1) case.
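A minimal sketch of this order-selection loop in the unidimensional case, scoring each (p, q) with the corrected AIC above; the log-likelihood evaluation through statsmodels' ARIMA is an assumption, the report's own fits being based on Hannan-Rissanen.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def corrected_aic(loglik, n, p, q):
    """AIC(p, q) = -2 ln L + 2n (p + q + 1) / (n - p - q - 2)."""
    return -2.0 * loglik + 2.0 * n * (p + q + 1) / (n - p - q - 2)

def select_order(x, max_p=4, max_q=4):
    n, best = len(x), None
    for p in range(max_p + 1):
        for q in range(max_q + 1):
            if p == q == 0:
                continue
            loglik = ARIMA(x, order=(p, 0, q)).fit().llf   # gaussian log-likelihood
            score = corrected_aic(loglik, n, p, q)
            if best is None or score < best[0]:
                best = (score, p, q)
    return best

x = np.random.normal(size=1000)
print(select_order(x, max_p=2, max_q=2))   # expected to favor small orders on white noise
```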


3.5.4 Fit and Simulations

To measure the stability of the Hannan-Rissanen algorithm, we applied the following procedure (a sketch is given after this list):
1. Fit the series to draw a set of parameters
2. Generate a number of Monte-Carlo times new (V)ARMA processes with the fitted parameters, then fit the generated series and record the parameters
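A minimal sketch of this fit-simulate-refit stability check for the AR(1) case, with the AR(1) coefficient estimated by least squares; the number of Monte-Carlo runs and the gaussian noise are placeholder assumptions.

```python
import numpy as np

def fit_ar1(x):
    """Least-squares estimate of alpha in X_t = alpha X_{t-1} + u_t (series assumed demeaned)."""
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

def stability_check(x, n_mc=1000, seed=0):
    rng = np.random.default_rng(seed)
    alpha0 = fit_ar1(x)
    sigma_u = np.std(x[1:] - alpha0 * x[:-1])       # noise level from the residuals
    refits = np.empty(n_mc)
    for i in range(n_mc):
        u = rng.normal(0.0, sigma_u, size=len(x))
        sim = np.zeros(len(x))
        for t in range(1, len(x)):                  # regenerate an AR(1) path
            sim[t] = alpha0 * sim[t - 1] + u[t]
        refits[i] = fit_ar1(sim)                    # re-fit and record the parameter
    return alpha0, refits.mean(), refits.std()

x = np.random.normal(size=2500)
print(stability_check(x, n_mc=200))
```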

• Study under the AR(1) framework

1. Initial fit As in the example of Brockwell [BD02], the "large n" of step 1 in the Hannan-Rissanen algorithm is taken equal to 6. It can be optimized by maximizing the log-likelihood of the process.

              | AR(1) coefficient | Prediction MSE
Euro Stoxx 50 | -0.03007          | 2.9142·10^{-9}
S&P 500       | -0.10603          | 1.0061·10^{-8}

Table 6: AR(1) fit and validation, N_MonteCarlo = 10^5

The "prediction MSE" is the mean-squared error between the real observations Xt and their prediction αXt−1 ∑T −1 2 t=1 (Xt − αXt−1 ) M SEprediction = T Remark. • The coefficients are negative which is in favour of a mean-reverting behavior. They are lower than 1 in magnitude, which allow invertibility and stationarity. • They are very closed to those obtained by Yule-Walker (-0.030 against -0.028 and -0.0106 against -0.109) • R Studio AR(1) fit (based on likelihood maximization with BFGS algorithm) gives quite similar results:

              | AR(1) coefficient | Log-likelihood
Euro Stoxx 50 | -0.03006          | 10807.76
S&P 500       | -0.1056           | 11237.22

Table 7: AR(1) fit - R Studio - package FitARMA

2. VARMA generation Once the parameters are fitted, we can generate new paths using the fitted parameters and the recursion equation of the corresponding (V)ARMA model:

X_t = Φ̂_1 X_{t−1} + ... + Φ̂_p X_{t−p} + u_t + Θ̂_1 u_{t−1} + ... + Θ̂_q u_{t−q}

As initial values, the first p values of the true series and the first q estimated noises can be used:

Generate u_{q+1} ∼ N(0, Σ_u), then X^{new}_{p+1} = Φ_1 X_p + ... + Φ_p X_1 + u_{q+1} + Θ_1 u_q + ... + Θ_q u_1

where Σ_u denotes the covariance matrix of the estimated noises. An important (and questionable) assumption is that the noises follow a gaussian distribution. In the multidimensional case, and in the unidimensional case when considering multiple assets, we simulate multidimensional gaussian variables with the covariance matrix estimated on the estimated noises. See the appendix for more details on how to simulate correlated multidimensional gaussian variables.


3. Stability of the parameters Using the asymptotic distribution of α̂, the approximate 95% confidence bounds for the AR(1) coefficient are given by ±1.96 ((1 − α̂²)/n)^{1/2}, where n is the sample size.

                  | Euro Stoxx 50  | S&P 500
True coefficient  | -0.03007       | -0.10603
Mean coefficient  | -0.03004       | -0.10598
Stdev coefficient | 3.1626·10^{-5} | 5.1803·10^{-5}
95% Confidence    | 0.03859        | 0.03843

Table 8: AR(1) fit and validation, N_MonteCarlo = 10^5

The estimation MSE is the sum of the variance of the estimator and the squared bias of the estimator, where Bias_α = E(α̂) − α. Here it is the sum of the squared standard deviation and the squared average difference, which gives approximately 3.8·10^{-4} in both cases. The variance is far higher than the bias term.

Figure 18: Parameter distribution of the autoregressive model of order 1 fitted by Hannan-Rissanen on S&P 500 data.

Figure 19: Parameter distribution of the autoregressive model of order 1 fitted by Hannan-Rissanen on Euro stoxx data.

• Study under the VAR(1) framework

1. Initial fit We make a first fit of the matrix A = [a00 a01; a10 a11] such that Y_t = A Y_{t−1} + ϵ_t, where Y_t = (y_t^1, y_t^2)′ and ϵ_t is a 2-dimensional white noise. The initial fitted coefficients are given in the first column of table 9. Here y^1 denotes the Euro stoxx returns and y^2 the S&P 500 returns.

2. VARMA generation Once the matrix A is initialized, we regenerate VAR(1) time series following the fitted equation N_MonteCarlo times.

3. Stability of the parameters

    | Initial coefficient | Mean coefficient | Stdev coefficient | Average difference
a00 | -0.25779            | -0.25759         | 0.02392           | 2.0329·10^{-4}
a01 | 0.43443             | 0.43439          | 0.028119          | -4.0907·10^{-5}
a10 | 0.01855             | 0.01841          | 0.02115           | -1.4229·10^{-4}
a11 | -0.11946            | -0.11917         | 0.02485           | 2.8121·10^{-4}

Table 9: VAR(1) fit and validation, N_MonteCarlo = 5·10^5

The prediction MSE, $\|Y_t - A Y_{t-1}\|_{L^2}$, is equal to 3.786 10^-4.

Figure 20: Parameter distribution of the multidimensional autoregressive model of order 1 fitted by Hannan-Rissanen on Euro stoxx and S&P 500 data.

Remark.
• The diagonal coefficients are negative, as in the unidimensional case. This is in favour of a mean-reverting behaviour.
• Comparison with the "mAr" package (function mAr.est, which fits an m-variate AR(p) by stepwise least squares), the "marima" package (repeated pseudo-regression procedure based on Spliid (1983)) and the "MTS" package (gaussian maximum likelihood) under RStudio:

                  a00           a01         a10          a11
mAr package       -0.25779312   0.4344356   0.01855211   -0.1194578
marima package    -0.2578       0.4344      0.0186       -0.1195
MTS package       -0.2577       0.434       0.0187       -0.120

Table 10: VAR(1) fit by RStudio - packages mAr, marima and MTS

• Study under VARMA(1,1) framework

We make a first fit of the matrices
$$A = \begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix} \quad M = \begin{pmatrix} m_{00} & m_{01} \\ m_{10} & m_{11} \end{pmatrix}$$
such that $Y_t = A Y_{t-1} + \epsilon_t + M \epsilon_{t-1}$, where $Y_t = \begin{pmatrix} y_t^1 \\ y_t^2 \end{pmatrix}$ and $\epsilon_t$ is a 2-dimensional white noise.
$$A_{init} = \begin{pmatrix} -0.0943 & 0.446 \\ 0.00552 & 0.0979 \end{pmatrix} \quad M_{init} = \begin{pmatrix} -0.201 & 0.0198 \\ 0.0299 & -0.236 \end{pmatrix}$$
Then we generate VARMA(1,1) processes following the equation $Y_t = A_{init} Y_{t-1} + \epsilon_t + M_{init}\epsilon_{t-1}$, we re-fit the simulated sample and we look at the variation of the re-fitted parameters. The matrices below give the mean parameters and standard deviations.

$$A_{simul} = \begin{pmatrix} -0.0958 \pm 0.108 & 0.454 \pm 0.355 \\ 0.00638 \pm 0.101 & 0.0924 \pm 0.322 \end{pmatrix} \quad M_{simul} = \begin{pmatrix} 0.00218 \pm 0.111 & -0.00862 \pm 0.356 \\ -0.000343 \pm 0.104 & 0.00479 \pm 0.322 \end{pmatrix}$$

Figure 21: Parameter distribution of the multidimensional autoregressive-moving average model of orders (1,1) fitted by Hannan-Rissanen on Euro stoxx and S&P 500 data, on 2000 simulations.

Remark. In our initial estimation, we took "n=6" as proposed by Lütkepohl [Lüt05]. We got the following estimations:
$$A_{HR,n=6} = \begin{pmatrix} -0.0943 & 0.446 \\ 0.00552 & 0.0979 \end{pmatrix} \quad M_{HR,n=6} = \begin{pmatrix} -0.201 & 0.0198 \\ 0.0299 & -0.236 \end{pmatrix}$$

On R studio, with the marima package, the fitted parameters of a multidimensional ARMA(1,1) model are the following:

$$A_{R-marima} = \begin{pmatrix} 0.0656 & 0.9264 \\ 0.1230 & 0.5383 \end{pmatrix} \quad M_{R-marima} = \begin{pmatrix} -0.3649 & -0.4588 \\ -0.0875 & -0.6785 \end{pmatrix}$$

It is strange that these results are so far from ours, as marima's method also uses a step-wise regression method. Actually, if we take "n=2" or "n=1" as the long-AR order in the Hannan-Rissanen step 1 estimation, we find closer estimators. With n=2, we get the following estimated matrices:
$$A_{HR,n=2} = \begin{pmatrix} 0.0595 & 0.897 \\ 0.0488 & 0.263 \end{pmatrix} \quad M_{HR,n=2} = \begin{pmatrix} -0.3564612 & -0.43058611 \\ -0.01202033 & -0.40226878 \end{pmatrix}$$

While looking at an estimation of the log-likelihood (given by the sum over t of the $-\epsilon_t' \Sigma_\epsilon^{-1} \epsilon_t$ terms minus $0.5\ln|\Sigma_\epsilon|$) as a function of the parameter n, we actually got an optimal log-likelihood for n = 2:

Figure 22: Log-likelihood as a function of the long AR order.

So we changed our n to 2. With the "MTS" package, based on gaussian multivariate log-likelihood maximization, we get the following estimations:
$$A_{R-MTS} = \begin{pmatrix} -0.0983 & 0.449 \\ -0.0563 & -0.139 \end{pmatrix} \quad M_{R-MTS} = \begin{pmatrix} 0.1960 & -0.01593 \\ -0.0913 & -0.00611 \end{pmatrix}$$

Contrary to the VAR(1) case, the VARMA(1,1) estimates vary a lot from one RStudio package to another.

3.6 Conclusion and limits

3.6.1 Conclusion on the algorithms

The Hannan-Rissanen algorithm appears more stable than the EM algorithm under state-space representation. Moreover, the state-space representation requires an MA form for the state-space equation, whereas we concluded from the Akaike criterion that the best model was the autoregressive model of order 1.

                                  Hannan-Rissanen             EM + state-space
(p,q) range of application        ∀p, q ∈ N                   ∀p, q ∈ N, q ≠ 0
Nb of iterations                  None (analytic formula)     Various (likelihood maximization)
Nb of chosen initial parameters   1 (the initial AR order)    5 (matrices initializations and iterations bounds)
Closeness to benchmark            Nearest                     Farthest

Table 11: Advantages and disadvantages of Hannan-Rissanen and EM algorithms

We have implemented these algorithms in Python 2.7 and plotted the above results using the Spyder interface. For the inclusion in the robustness test, we have to implement a VARMA fitter and generator in the ADA language. Only the Hannan-Rissanen algorithm will be translated, as it appears to be the best one.

3.6.2 Limits of (V)ARMA models in financial modelling

Experience with real-world data, however, soon convinces one that both stationarity and Gaussianity are fairy tales invented for the amusement of undergraduates. (Thomson 1994)

• The Gaussian assumption
Assuming that the noises follow a gaussian distribution amounts to assuming that the financial time series follows a gaussian distribution, which is not the case. As shown in the first chapter, the distribution of financial returns shows fatter tails than the gaussian distribution. Moreover, when fitting ARMA or VARMA models on financial returns, the noise distribution also shows fatter tails than the gaussian distribution, as shown by the QQplots below.

Figure 23: Gaussian QQPlot of remaining noises after Hannan-Rissanen regressions on S&P 500 data.

Figure 24: Gaussian QQPlot of remaining noises after Hannan-Rissanen regressions on Euro stoxx data.

The noise distribution is actually closer to a Student distribution.

Figure 25: Student law QQPlot of remaining noises after Hannan-Rissanen regressions on S&P 500 data.

Figure 26: Student law QQPlot of remaining noises after Hannan-Rissanen regressions on Euro stoxx data.

• "Global non-stationarity" Presence of seasonality, volatility and jumps are some of the non-stationary behaviors that may affect financial time series. To see if such effect may prevail in our time series, we fitted a VAR(1) model on a slidding time window of 5 and 7 years, where the entire available data spread over 10 years.

Figure 27: VAR(1) parameters fitted on a sliding 5-year window

Figure 28: VAR(1) parameters fitted on a sliding 7-year window

The parameters vary, but not too much (they keep the same sign and the same order of magnitude). If we fit the model on each year separately, the parameters vary more:

Figure 29: VAR(1) parameters fitted on a sliding window of 1 year

If these properties are observed, global stationarity is not verified. Indeed, the presence of volatility implies that the variance depends on time, and seasonality implies that the covariance varies over time too. ARMA models do not show any volatility clustering. The existence of volatility inside a real data set is generally modelled using a time-varying conditional variance, via the classical heteroskedastic models. This motivates the exploration of another class of time series models: GARCH models.

• Presence of GARCH effects in the residuals
In the absence of autocorrelation, i.e. if $\epsilon_t$ is an i.i.d. white noise,
$$\sqrt{n}\,\hat{\rho}(h) \rightarrow N(0, 1)$$
where $\hat{\rho}(h)$ denotes the empirical autocorrelation. In the presence of GARCH effects (i.e. heteroskedasticity, non-independence of the variance of the residuals over time), we have the following convergence result:
$$\sqrt{n}\,\hat{\rho}(h) \rightarrow N(0, \sigma^2_{\hat{\rho}(h)})$$
where $\sigma^2_{\hat{\rho}(h)}$ is estimated from $E(\epsilon_t^2 \epsilon_{t+h}^2)$. We can draw the following 95% confidence intervals (a sketch of this computation is given below):
$$|\hat{\rho}(h)| \leq 1.96\,\frac{\sigma_{\hat{\rho}(h)}}{\sqrt{n}}$$
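A minimal numpy sketch of these bands follows. The exact normalisation of the robust variance is an assumption on our part (we divide $E(\epsilon_t^2\epsilon_{t+h}^2)$ by $E(\epsilon_t^2)^2$, one standard convention); it is illustrative only.

# Sketch: residual autocorrelations with i.i.d. and GARCH-robust 95% bands.
import numpy as np

def acf_with_robust_bands(eps, max_lag=20):
    eps = np.asarray(eps, dtype=float)
    eps = eps - eps.mean()
    n = len(eps)
    gamma0 = np.mean(eps ** 2)
    rho, band_iid, band_robust = [], [], []
    for h in range(1, max_lag + 1):
        rho_h = np.mean(eps[:-h] * eps[h:]) / gamma0
        sigma2_h = np.mean(eps[:-h] ** 2 * eps[h:] ** 2) / gamma0 ** 2  # assumed normalisation
        rho.append(rho_h)
        band_iid.append(1.96 / np.sqrt(n))                  # i.i.d. white-noise band
        band_robust.append(1.96 * np.sqrt(sigma2_h / n))    # GARCH-robust band
    return np.array(rho), np.array(band_iid), np.array(band_robust)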


Figure 30: Autocorrelation and confidence intervals for Euro stoxx residuals after AR(1) regression.

Figure 31: Autocorrelation and confidence intervals for S&P 500 residuals after AR(1) regression.

To conclude, VARMA modelling (in the case of the Euro stoxx and S&P data, VAR(1) modelling) appears as a quite simple model, easy to implement, giving a good representation of the time series, as shown by the very low MSE of the (V)AR(1) fits. But it proves insufficient to reproduce every financial time series behaviour. It focuses on the autocorrelation of the process and assumes i.i.d. white-noise residuals, while financial processes do not show strong autocorrelation but do show auto-correlated residuals, as shown in figures 30 and 31, and volatility clustering (figure 8). VARMA modelling is also often based on the strong assumption that the residuals follow a gaussian distribution, which also corresponds to assuming a gaussian conditional distribution for the observations. As shown by QQPlots 5 and 6, the empirical distribution (10) or the Jarque-Bera test, financial returns show fatter tails than the gaussian distribution. We could have considered non-gaussian white noises and built other algorithms (for example, modifying the log-likelihood formula in the EM algorithm assuming a Student distribution for the residuals) to address this issue. Nevertheless, it would still be insufficient to reproduce the heteroskedastic behaviour of the time series. We summarize below the stylized facts and behaviours verified by VARMA models (for the special case of our data when coefficients are needed).

Stationarity               X
Invertibility              X
Mean-reverting             X
Fat tails                  X
Long-range dependence *
Heteroskedasticity
Leveraged effect

Table 12: Stylized facts verification - VARMA models

* : for an AR(1) process, $\rho(h) = -\alpha\rho(h-1)$, hence $\rho(h) = (-1)^h\alpha^h$, which is not compatible with (5).

4 Heteroskedastic Models

As we saw in the first chapter, financial time series show a heteroskedastic behaviour, i.e. a conditional variance that depends on time. In the last chapter, we saw that ARMA models were unable to reproduce this aspect. This is why we turn to another class of models: GARCH models.

4.1 Generalized AutoRegressive Conditional Heteroskedasticity Model

4.1.1 Heteroskedasticity

Classical ARMA-type models often turn out to be inappropriate for modelling financial time series. They are too focused on the autocovariance structure of the processes. Yet, from this point of view, financial returns barely differ from white noise. The fact that high values of squared returns are often followed by high values, independently of the sign of the return, is not compatible with a constant conditional variance. This phenomenon is referred to as conditional heteroskedasticity:
$$Var(\epsilon_t | \epsilon_{t-1}, \epsilon_{t-2}, ...) \neq constant$$
Time series can be both second-order stationary and exhibit conditional heteroskedasticity. Stationary GARCH processes are an example of such time series.

4.1.2 ARCH and GARCH models

ARCH models were introduced by Engle in 1982 [BEN94], then generalized by Bollerslev in 1986 [Bol87]. They are considered as the first stochastic volatility models. They handle the conditional heteroskedasticity by modelling the conditional variance as a function of past observations:
$$\begin{cases} X_t = \mu_t + \epsilon_t \\ \epsilon_t = \sigma_t \eta_t \end{cases}$$
$\mu_t = E(X_t|X_{t-1})$ is the conditional expectation of $X_t$ given the information available at time t. $\epsilon_t$ is referred to as the shock (or innovation) of an asset return; $\eta_t$, a white noise of mean 0 and unit variance independent from $\sigma_t$, is the standardized shock; and $\sigma_t$ is the square root of the volatility $\sigma_t^2$. $\epsilon_t$ has zero mean. The initial ARCH model of order p models the volatility as a linear function of the past squared returns:

$$\sigma_t^2 = w + \sum_{i=1}^{p}\alpha_i\epsilon_{t-i}^2 \qquad (10)$$

where $w \in \mathbb{R}$, $\alpha_i \in \mathbb{R}$, $\alpha_p \neq 0$. The generalized ARCH model (or GARCH) of orders p and q adds a linear part in the past volatility:
$$\sigma_t^2 = w + \sum_{i=1}^{p}\alpha_i\epsilon_{t-i}^2 + \sum_{j=1}^{q}\beta_j\sigma_{t-j}^2 \qquad (11)$$
where $w \in \mathbb{R}$, $\alpha_i \in \mathbb{R}$, $\alpha_p \neq 0$, $\beta_j \in \mathbb{R}$, $\beta_q \neq 0$. We thus have $Var(\epsilon_t|\epsilon_{t-1}, \epsilon_{t-2}, ...) = \sigma_t^2$. By the tower property and as $\epsilon_t$ has zero mean, $Var(\epsilon_t) = E(E(\epsilon_t^2|\epsilon_{t-1})) = E(\sigma_t^2)$.
Rem: in the literature, the squared volatility is often denoted $h_t$. The GARCH(p,q) equations system becomes:
$$\begin{cases} \epsilon_t = \sqrt{h_t}\,\eta_t \\ h_t = w + \sum_{i=1}^{p}\alpha_i\epsilon_{t-i}^2 + \sum_{j=1}^{q}\beta_j h_{t-j} \end{cases} \qquad (12)$$
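A minimal sketch of equation (12) for p = q = 1 follows, assuming numpy; the parameter values are only illustrative (taken in the range of the fits reported later in Table 16) and the function name is ours.

# Simulating a GARCH(1,1) path to visualise volatility clustering.
import numpy as np

def simulate_garch11(w, alpha, beta, n_steps, seed=None):
    rng = np.random.RandomState(seed)
    eps = np.zeros(n_steps)
    h = np.zeros(n_steps)
    h[0] = w / (1.0 - alpha - beta)            # start at the unconditional variance
    eps[0] = np.sqrt(h[0]) * rng.randn()
    for t in range(1, n_steps):
        h[t] = w + alpha * eps[t - 1] ** 2 + beta * h[t - 1]
        eps[t] = np.sqrt(h[t]) * rng.randn()
    return eps, h

eps, h = simulate_garch11(w=5.6e-6, alpha=0.094, beta=0.882, n_steps=2500, seed=0)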


4.1.3 Stationarity and moments existence

Conditions on parameters
To ensure that $\sigma_t^2$ remains positive, the parameters $\alpha_i$ and $\beta_j$ are required to be positive. To ensure stationarity, the variance of $\epsilon_t$ has to remain constant over time. As $Var(\epsilon_t) = E(\sigma_t^2)$, by taking the expectation in equation (11), we obtain:
$$Var(\epsilon_t) = \frac{w}{1 - \alpha_1 - ... - \alpha_p - \beta_1 - ... - \beta_q} \qquad (13)$$
Thus, we need $w > 0$ and $\alpha_1 + ... + \alpha_p + \beta_1 + ... + \beta_q < 1$.

Moment of order 4 and Kurtosis for GARCH(1,1)
Let's consider a GARCH(1,1) process:
$$\begin{cases} \epsilon_t = \eta_t\sigma_t \\ \sigma_t^2 = w + \alpha\epsilon_{t-1}^2 + \beta\sigma_{t-1}^2 \end{cases} \qquad (14)$$

Proposition 4.1. Let's denote $K_\eta = Kurt(\eta_t)$ the kurtosis of the $\eta_t$ residuals. $\sigma_t$ has a moment of order 4 if $\beta^2 + K_\eta\alpha^2 + 2\alpha\beta < 1$. The kurtosis of the residuals is then:
$$Kurt(\epsilon_t) = \frac{E(\epsilon_t^4)}{E(\epsilon_t^2)^2} = \frac{K_\eta[1 - (\alpha+\beta)^2]}{1 - (\alpha+\beta)^2 - (K_\eta - 1)\alpha^2} \qquad (15)$$
• If $\eta_t$ follows a reduced centered gaussian distribution, $K_\eta = 3$ and the GARCH(1,1) process kurtosis reduces to:
$$Kurt(\epsilon_t) = \frac{3[1 - (\alpha+\beta)^2]}{1 - (\alpha+\beta)^2 - 2\alpha^2}$$
The kurtosis is then always higher than 3, which means that the residuals distribution has fat tails.
• If $\eta_t$ follows a generalized error distribution, $K_\eta = \frac{\Gamma(5/\nu)\Gamma(1/\nu)}{\Gamma(3/\nu)^2}$.

4.1.4 Properties

ARMA representation
Let's denote $u_t = \epsilon_t^2 - E(\epsilon_t^2|\epsilon_{t-1}) = \epsilon_t^2 - \sigma_t^2$. $u_t$ is a martingale difference sequence. The GARCH model can be expressed as an ARMA model of the squared residuals $\epsilon_{t-i}^2$ and MDS terms $u_{t-i}$. Let's consider the GARCH(1,1) model (14). It has the following ARMA representation:
$$\epsilon_t^2 = w + (\alpha+\beta)\epsilon_{t-1}^2 + u_t - \beta u_{t-1} \qquad (16)$$
Let's denote $\sigma^2$ the unconditional variance of $\epsilon_t$ ($\sigma^2 = w/(1-\alpha-\beta)$). If $\alpha+\beta < 1$, we also have the following AR(∞) representation of the squared residuals:
$$\epsilon_t^2 = \sigma^2 + u_t + \alpha\sum_{i=1}^{\infty}(\alpha+\beta)^{i-1}u_{t-i} \qquad (17)$$

Explicit autocorrelation formulas can be drawn from this representation. For example, the autocovariance at lag 1 is:
$$Cov(\epsilon_t^2, \epsilon_{t-1}^2) = \sigma^4 + \alpha\left[\alpha + \beta + \frac{\alpha}{1-(\alpha+\beta)^2}\right]E(u_t^2) \qquad (18)$$

where $E(u_t^2) = E(\sigma_t^4)(K_\eta - 1)$ and $E(\sigma_t^4) = \sigma^4\,\frac{1-(\alpha+\beta)^2}{1-\beta^2-2\alpha\beta-K_\eta\alpha^2}$.

Half-life time
As $w = (1-(\alpha+\beta))\sigma^2$, (16) can be rewritten in mean-adjusted form as:
$$\epsilon_t^2 - \sigma^2 = (\alpha+\beta)(\epsilon_{t-1}^2 - \sigma^2) + u_t - \beta u_{t-1} \qquad (19)$$
The above equation iterated k times gives:
$$\epsilon_{t+k}^2 - \sigma^2 = (\alpha+\beta)^k(\epsilon_t^2 - \sigma^2) + v_{t+k} \qquad (20)$$
where $v_t$ is a moving-average process. Since $\alpha+\beta < 1$, $(\alpha+\beta)^k \to 0$ as $k \to \infty$. So $\epsilon_{t+k}^2 - \sigma^2$ approaches zero "on average" as k gets large; i.e. the volatility "mean-reverts" to its long-run level $\sigma^2$. The so-called half-life of a volatility shock, defined as $\ln(0.5)/\ln(\alpha+\beta)$, measures the average time it takes for $|\epsilon_t^2 - \sigma^2|$ to decrease by one half. The closer $\alpha+\beta$ is to one, the longer the half-life of a volatility shock. Conversely, if $\alpha+\beta > 1$ (non-stationarity), the volatility may explode to infinity as $k \to \infty$.

4.1.5 Conditional Mean Specification

We can estimate $\mu_t = E(X_t|X_{t-1})$ with an autoregressive process (i.e. a linear combination of the $X_{t-i}$, $i \in \mathbb{N}$). By order selection, and as chosen by ..., the AR(1) looks like the best autoregressive process to take. So $\mu_t$ is of the form $\mu_t = \alpha X_{t-1}$.

4.2 Fitting GARCH processes

In this section, we present the gradient descent method used to maximize the log-likelihood under the gaussian assumption for the standardized noises $\eta_t$. Even under this assumption, the GARCH process shows fat tails (as shown by the explicit kurtosis formula). Then, we widen the distribution class of $\eta$ to see whether the gaussian assumption is correct or not.

4.2.1 Likelihood Maximization under gaussian assumption using gradient descent

Gradient descent / Newton-Raphson algorithm
To maximize the GARCH log-likelihood, we use a "gradient-ascent" technique (which is equivalent to a gradient descent on the opposite of the function). Gradient descent is an optimization technique whose aim is to minimize a function. For a function h, the principle of gradient descent is to find a descent direction d such that $h(\gamma^n + sd)$ decreases with s:
$$0 > \frac{dh(\gamma^n + sd^n)}{ds}\Big|_{s=0} = \frac{\partial h(\gamma)}{\partial\gamma'}\Big|_{\gamma^n} d^n$$
A good choice for d is then $d^n = -D_n\frac{\partial h(\gamma)}{\partial\gamma}\big|_{\gamma^n}$ with $D_n$ positive definite. For $n \in \mathbb{N}$, the evaluation point is updated with $\gamma^{n+1} = \gamma^n + \alpha^n d^n$. As we want to maximize the log-likelihood, our iteration is of the form:
$$\theta^{n+1} = \theta^n + \alpha^n H(\theta^n)^{-1}\nabla\ell(\theta^n) \qquad (21)$$

Conditional log-likelihood under gaussian assumption
Assuming that $\eta_t$ follows a reduced centered gaussian distribution, the conditional distribution of $\epsilon_t$ given the past is a gaussian distribution of variance $\sigma_t^2$:
$$\epsilon_t|\epsilon_{t-1} \overset{(d)}{=} N(0, \sigma_t^2), \qquad p(\epsilon_t|\epsilon_{t-1}) = \frac{1}{\sqrt{2\pi\sigma_t^2}}\exp\left(-\frac{\epsilon_t^2}{2\sigma_t^2}\right)$$
As in Engle and Bollerslev [EB86], by the prediction error decomposition, the conditional log-likelihood is:
$$\ell = \ln\prod_{t=1}^{T}p(\epsilon_t|\epsilon_{t-1}) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln\sigma_t^2 - \frac{1}{2}\sum_{t=1}^{T}\frac{\epsilon_t^2}{\sigma_t^2} \qquad (22)$$

Log-likelihood Gradient

Proposition 4.2. Under the gaussian assumption for the standardized residuals, the conditional log-likelihood gradient of a GARCH(p,q) process is given by:
$$\begin{cases} \nabla\ell(\theta) = \frac{1}{2}\sum_{t=1}^{T}\left(\frac{\epsilon_t^2}{\sigma_t^4} - \frac{1}{\sigma_t^2}\right)\nabla_\theta\sigma_t^2 \\ \nabla_\theta\sigma_t^2 = (1, \epsilon_{t-1}^2, ..., \epsilon_{t-p}^2, \sigma_{t-1}^2, ..., \sigma_{t-q}^2)' + \sum_{i=1}^{q}\beta_i\nabla_\theta\sigma_{t-i}^2 \end{cases}$$

Proof. Let's denote $\ell_t = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma_t^2 - \frac{1}{2}\frac{\epsilon_t^2}{\sigma_t^2}$ and $\theta = (\alpha_0, \alpha_1, ..., \alpha_p, \beta_1, ..., \beta_q)$ the vector of unknown parameters.
By the chain rule:
$$\nabla_\theta\ell_t = \frac{\partial\ell_t}{\partial\sigma_t^2}\nabla_\theta\sigma_t^2 = \left(\frac{\epsilon_t^2}{2\sigma_t^4} - \frac{1}{2\sigma_t^2}\right)\nabla_\theta\sigma_t^2 \qquad (23)$$
As $\ell(\theta) = \sum_{t=1}^{T}\ell_t(\theta)$, the gradient of the log-likelihood $\ell$ is defined by:
$$\nabla\ell(\theta) = \frac{1}{2}\sum_{t=1}^{T}\left(\frac{\epsilon_t^2}{\sigma_t^4} - \frac{1}{\sigma_t^2}\right)\nabla_\theta\sigma_t^2 \qquad (24)$$
where $\nabla_\theta\sigma_t^2$ is defined, by definition of $\sigma_t^2$, by the following recurrence:
$$\frac{\partial\sigma_t^2}{\partial\theta} = (1, \epsilon_{t-1}^2, ..., \epsilon_{t-p}^2, \sigma_{t-1}^2, ..., \sigma_{t-q}^2)' + \sum_{i=1}^{q}\beta_i\frac{\partial\sigma_{t-i}^2}{\partial\theta} \qquad (25)$$

Hessian estimation
Using the BHHH algorithm, one can approximate the Hessian matrix using only first-derivative information:
$$H(\theta) \approx \sum_{t=1}^{T}\frac{\partial\ell_t}{\partial\theta}\frac{\partial\ell_t}{\partial\theta'}$$

Step optimization under parameters constraints
When updating the parameters, we want to keep positive parameters and the stationarity condition, i.e.:
$$\begin{cases} \theta_i^{n+1} = \theta_i^n + \alpha^n d_i^n > 0 & \text{for } i \in [1, 1+p+q] \\ \sum_{i=2}^{p+q+1}\theta_i^{n+1} = \sum_{i=2}^{p+q+1}\theta_i^n + \alpha^n\sum_{i=2}^{p+q+1}d_i^n < 1 \end{cases}$$
Based on these conditions and depending on the signs of the $d_i^n$, we can infer a maximum bound on $\alpha^n$ such that the positivity and stationarity conditions are preserved:
$$\begin{cases} \alpha^n < -\frac{\theta_i^n}{d_i^n} & \text{if } d_i^n < 0, \ i \in [1, 1+p+q] \\ \alpha^n < \frac{1 - \sum_{i=2}^{p+q+1}\theta_i^n}{\sum_{i=2}^{p+q+1}d_i^n} & \text{if } \sum_{i=2}^{p+q+1}d_i^n > 0 \end{cases}$$
The equations above allow us to draw a maximum admissible $\alpha_{max}^n$. In practice, we take $\alpha^n = \min(p_\%\,\alpha_{max}^n, \alpha_{min})$ as the n-th step, where $p_\%$ is a percentage, typically equal to 0.9, and $\alpha_{min}$ is a fixed maximum level, typically $10^{-5}$. By "forcing" the parameters to stay in the good intervals, we are "forcing" the stationarity of the process. Most papers about GARCH estimation do not discuss parameter stability inside the stationarity intervals, assuming that if the process is stationary, the parameters will converge to stationarity-compatible values. In our case, if we did not impose these conditions on the $\alpha^n$, we would not converge to stationarity-compatible parameters.


4.2.2 Likelihood Maximization under generalized error distribution using gradient descent

The gaussian assumption for the innovation error is still a questionable assumption. Indeed, as we observed in the first part of this report, financial returns show fatter tails than the gaussian distribution, and so do the innovation errors, as seen in figures 23 and 24. In "Quasi-Maximum Likelihood Estimation of GARCH Models With Heavy-Tailed Likelihoods" [FQX14], the authors try to generalize the quasi-maximum likelihood estimation to non-gaussian distributions with fatter tails. However, they highlight the fact that the GQMLE is consistent and asymptotically normal as soon as the innovations have finite fourth moments. In what follows, we propose a maximum likelihood estimation of the parameters on a larger class of distributions for $\eta_t$. The reduced centered generalized error distributions (GED) form the class of distributions whose density is of the form:
$$f_\nu(z) = \frac{\nu\exp(-\frac{1}{2}|z/\lambda|^\nu)}{\lambda\,2^{1+1/\nu}\,\Gamma(1/\nu)}, \quad -\infty < z < \infty, \quad 0 < \nu$$

On 2000 simulations:

                            w                         α               β                 MSE
Optimal Fitted Parameters   5.60 10^-6 (1.53 10^-6)   9.44% (1.33%)   88.211% (1.61%)   2.26 10^-7 (3.91 10^-7)

Notes: standard deviations are in parentheses.

Table 16: GARCH(1,1) fit on simulated data - average parameters

where the mean-squared error (MSE) is defined as:
$$MSE_{GARCH} = \frac{1}{T}\sum_{t=1}^{T}(\epsilon_t^2 - \sigma_t^2)^2 \qquad (26)$$
It is nothing less than the empirical variance of the martingale difference sequence $u_t = \epsilon_t^2 - \sigma_t^2$.

Figure 39: GARCH(1,1) parameter distribution - Euro stoxx data

• Illustration with the generalized error distribution
Plotting the evolution of the log-likelihood with α and the GED order ν, we notice that the optimal order is slightly higher than 2 (2.6 on the figure below) with a small α (0.005 here), the parameter β being initialized as a function of α and ν via the kurtosis formula and w being initialized with the stationarity relationship.

Figure 40: Log-likelihood as a function of the parameters α and ν - GARCH(1,1) with GED law on Euro stoxx data

The search for optimal parameters using gradient descent gives parameters close to the ones obtained by inspecting the log-likelihood function.

Conclusion
As the maximum likelihood method with the generalized error distribution class gives results close to the gaussian estimation, we prefer to select the fitter based on the gaussian assumption, which has one less parameter.

4.3 Multidimensional GARCH processes

4.3.1 The Flexible MGARCH

GARCH models can be extended to the multivariate case, in order to model the conditional covariance between two assets besides the conditional variances. General multivariate GARCH models have been exposed in Engle and Kroner (1995) and Kroner and Ng (1998). There are two main difficulties in the GARCH extension to the multivariate case:
• The "curse of dimensionality": the number of parameters increases quickly with the dimension of the system. Some simplifications have to be made.
• Keeping positive semi-definite matrices.
The "Flexible MGARCH" modelling method proposed by Ledoit, Santa-Clara and Wolf [LSCW03] is at once parsimonious and respectful of the positive semi-definiteness requirements.

General Model
$$E(x_{i,t}|\Omega_{t-1}) = 0, \quad i \in 1..N \qquad (27)$$
$$Cov(x_{i,t}, x_{j,t}|\Omega_{t-1}) = h_{ij,t} = c_{ij} + a_{ij}x_{i,t-1}x_{j,t-1} + b_{ij}h_{ij,t-1}, \quad i,j \in 1..N \qquad (28)$$
where $x_t$ denotes the residual vector of size N of a multivariate autoregressive model on the series of interest, and $\Omega_{t-1}$ denotes the filtration corresponding to the information set available at time t-1. The parameter values satisfy $a_{ij}, b_{ij} \geq 0$ and $c_{ij} > 0$.

Matrix representation
Following the notations of Ding and Engle (1994), let $C = [c_{ij}]_{i,j=1..N}$, $A = [a_{ij}]_{i,j=1..N}$ and $B = [b_{ij}]_{i,j=1..N}$ be the matrices of the parameters of the model. Let $H_t = [h_{ij,t}]_{i,j=1..N}$ denote the conditional covariance matrix at time t and $\Sigma_t = [x_{i,t}x_{j,t}]_{i,j=1..N}$ the matrix of cross-products of the variables at time t. Then equation (28) can be rewritten as:
$$H_t = C + A * \Sigma_{t-1} + B * H_{t-1} \qquad (29)$$

where the operator * denotes the Hadamard (element-wise) product of two matrices. Replacing equation (29) recursively into itself implies:
$$H_t = \sum_{k=0}^{\infty}B^{k} * C + \sum_{k=0}^{\infty}B^{k} * A * \Sigma_{t-k-1} = C/(1-B) + \sum_{k=0}^{\infty}B^{k} * A * \Sigma_{t-k-1}$$
where the powers $B^k$ and the division $C/(1-B)$ are taken element-wise.

4.3.2 Compatibility constraints

Positive semi-definiteness constraint
As $H_t$ represents the conditional covariance matrix of $X_t$, it has to be positive semi-definite. The following proposition tells us which matrices have to be positive semi-definite to ensure the positive semi-definiteness of $H_t$:

Proposition 4.4. If $C/(1-B)$, A and B are positive semi-definite, then the conditional covariance matrix is positive semi-definite.

For a demonstration of this proposition, one can refer to Ledoit, Santa-Clara and Wolf.

Covariance stationarity constraint
By the tower property, $E(x_{i,t}x_{j,t}) = E(h_{ij,t})$, so taking the expectation of equation (28) and assuming $h_{ij,t}$ stationarity, i.e. $E(h_{ij,t+k}) = E(h_{ij,t})$, $k \in \mathbb{Z}$, leads to:
$$c_{ij} = E(h_{ij,t})(1 - a_{ij} - b_{ij}), \quad i,j \in 1..N \qquad (30)$$
As in the unidimensional case, we get the following generalized covariance stationarity constraint:
$$a_{ij} + b_{ij} < 1, \quad i,j \in 1..N$$

During the fitting procedure, the following proposition will be useful:

Proposition 4.5. If A and B are positive semi-definite, and if $a_{ii} + b_{ii} < 1$ for all $i = 1..N$, then $a_{ij} + b_{ij} < 1$ for all $i,j = 1..N$.

4.4 Fitting MGARCH processes

4.4.1 Ledoit et al. estimation method

• Step 1: Diagonal coefficients estimation as in the unidimensional case • Step 2: Off-diagonal coefficients estimation • Step 3: Compatibility constraints

4.4.2 Preliminary estimation under gaussian assumption

Diagonal coefficients estimation as in the unidimensional case
Assuming conditional normality, the estimations $\hat{c}_{ii}$, $\hat{a}_{ii}$, $\hat{b}_{ii}$ are obtained by maximizing the following conditional likelihood, separately for each $i \in 1..N$:
$$L_{i,T}(c_{ii}, a_{ii}, b_{ii}) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\log h_{ii,t} - \frac{1}{2}\sum_{t=1}^{T}\frac{x_{i,t}^2}{h_{ii,t}}$$
s.t. $h_{ii,t} = c_{ii} + a_{ii}x_{i,t-1}^2 + b_{ii}h_{ii,t-1}$

In practice, we use the same algorithm implemented for the univariate gaussian case, taking into account the positivity and stationarity constraints on the parameters.

Off-diagonal coefficients estimation
Once the diagonal coefficients are estimated, we can derive an estimation of the off-diagonal coefficients. If we focus on the vector $X_{ij,t} = \begin{pmatrix} x_{i,t} \\ x_{j,t} \end{pmatrix}$ and assume conditional normality, denoting $H_{ij,t} = \begin{pmatrix} \hat{h}_{ii,t} & h_{ij,t} \\ h_{ij,t} & \hat{h}_{jj,t} \end{pmatrix}$:
$$X_{ij,t}|\Omega_{t-1} \overset{(d)}{=} N(0, H_{ij,t}), \qquad p(X_{ij,t}|\Omega_{t-1}) = \frac{\exp(-X_{ij,t}'H_{ij,t}^{-1}X_{ij,t}/2)}{(2\pi)^{N/2}|H_{ij,t}|^{1/2}} \qquad (31)$$
with $h_{ij,t} = c_{ij} + a_{ij}x_{i,t-1}x_{j,t-1} + b_{ij}h_{ij,t-1}$.
Denoting $\ell_{t,ij} = -\frac{1}{2}\ln|H_{ij,t}| - \frac{1}{2}X_{ij,t}'H_{ij,t}^{-1}X_{ij,t}$, we can expand $X_{ij,t}'H_{ij,t}^{-1}X_{ij,t}$ by calculating the inverse of $H_{ij,t}$:
$$X_{ij,t}'H_{ij,t}^{-1}X_{ij,t} = \frac{P_{ij,t}}{D_{ij,t}}, \quad P_{ij,t} = \hat{h}_{jj,t}x_{i,t}^2 + \hat{h}_{ii,t}x_{j,t}^2 - 2h_{ij,t}x_{i,t}x_{j,t}, \quad D_{ij,t} = |H_{ij,t}| = \hat{h}_{ii,t}\hat{h}_{jj,t} - h_{ij,t}^2$$
From that expression, it is easy to derive the gradient of $\ell_{t,ij}$ in order to do a gradient descent optimization of $c_{ij}$, $a_{ij}$, $b_{ij}$.

Remark. $H_{ij,t}$ is ensured to be positive semi-definite if we impose $|c_{ij}| \leq (\hat{c}_{ii}\hat{c}_{jj})^{1/2}$, $0 \leq a_{ij} \leq (\hat{a}_{ii}\hat{a}_{jj})^{1/2}$ and $0 \leq b_{ij} \leq (\hat{b}_{ii}\hat{b}_{jj})^{1/2}$, as Ding and Engle [DE01] show. It is easy to see that the sub-matrices $C_{ij}$, $A_{ij}$ and $B_{ij}$ are positive semi-definite (by the condition $x'Mx \geq 0$ for example).

4.4.3 Transformation to satisfy the compatibility constraints

The preliminary estimation provides us with matrices that are neither necessarily positive semi-definite nor stationary. A transformation has to be applied to the matrices to get these constraints satisfied.

Covariance stationarity constraint
If A and B are positive semi-definite and if $a_{ii} + b_{ii} < 1$, then $a_{ij} + b_{ij} < 1$ (which ensures covariance stationarity). As $a_{ii} + b_{ii} < 1$ (usual fit in the unidimensional case), keeping the diagonals of the matrices while transforming them into positive semi-definite matrices ensures covariance stationarity.

Positive semi-definiteness constraint
The authors propose to find the positive semi-definite matrices $C/(1-B)$, A and B closest to the estimated matrices $\hat{C}/(1-\hat{B})$, $\hat{A}$ and $\hat{B}$ in the sense of the Frobenius norm. We are trying to solve the following problem, where $\mathcal{P}^+$ denotes the space of positive semi-definite matrices:
$$\min_{X \in \mathcal{P}^+}\|M - X\|_F$$
where the Frobenius norm is defined as:
$$\|A\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{p}a_{ij}^2}$$
The problem of getting a positive semi-definite covariance matrix has been widely discussed in financial research. The method taken up by Ledoit et al. is the one of Houduo Qi and Defeng Sun [QS06]. The method is the following:

• Step 1: Estimate M, an initial positive semi-definite approximation of A, of the form $M = \begin{pmatrix} a_{11} & m' \\ m & M \end{pmatrix}$.
• Step 2: Given M, the new approximation $\tilde{M}$ is searched of the form
$$\tilde{M} = \begin{pmatrix} \rho^2 a_{11} + 2\rho x'm + x'Mx & \rho m' + x'M \\ \rho m + Mx & M \end{pmatrix}$$
by minimizing the distance under the constraint $\tilde{M}_{11} = a_{11}$:
$$\|A - \tilde{M}\|_F - \|A - M\|_F = 2\|a - (\rho m + Mx)\|_2^2 - 2\|a - m\|_2^2$$
under the constraint
$$\rho^2 a_{11} + 2\rho x'm + x'Mx = a_{11} \qquad (32)$$

D′ = Diag(max(ϵ, λi ), i = 1..N ) M ′ = P D′ P T

and

Then

√ mii , i = 1..N ) S = Diag( m′ii Mp = SP D′ P ′ S

is a semi-definite positive matrix that has the same diagonal than M. Proof. As M is symmetric and positive, there exist P orthogonal matrix such that M = P DP ′ . The matrix M ′ = P D′ P is positive definite as all its eigen values are non-negative. √ Then, to prove that Mp is positive definite, we can denote Ds = Diag(max(ϵ, λi ), i = 1..N ). We have ′ Ds = Ds and Ds xDs = D. As: Mp = (Ds P ′ S)′ x(Ds P ′ S) where Ds P ′ S is non singular as its determinant is non null, Mp is positive definite. Remark. This very intuitive method has also been theorized by McNeil, Frey and Embrechts, for the case of correlation matrices (with 1 as diagonal coefficients).

50

4.4.4 Fit and Simulations

Initial AR(1) fit
As explained above, the conditional mean is modelled with an AR(1) process. The (M)GARCH fits are done on the remaining noises.

Initial parameters - multidimensional case
The diagonal parameters are initialized the same way as in the unidimensional case. The off-diagonal parameters are initialized such that the sub-matrices are positive semi-definite. For example, we have tried to initialize the parameters as a function of the diagonal parameters: $m_{ij} = \rho\sqrt{m_{ii}m_{jj}}$.

– The conditions $\gamma + \theta > 0$ and $-\gamma + \theta < 0$ are equivalent to $-\gamma < \theta < \gamma$.
– The asymmetry is taken into account by the θ coefficient. If θ = 0, the volatility depends only on the modulus of $\eta_{t-j}$. If θ < 0 and $\ln\sigma_t^2 = w + \theta\eta_{t-1}$, then when $\eta_{t-1} < 0$, i.e. $\epsilon_{t-1} < 0$, $\ln\sigma_t^2$ will be above its mean, whereas it will be below when $\epsilon_{t-1} > 0$. We recover the asymmetry property of asset returns.
– The γ coefficient is also important: it says that high innovations, whatever their signs, are followed by high innovations. It reinforces the volatility clustering effect.
• $\ln\sigma_t^2$ is a strong ARMA since $g(\eta_t)$ is an i.i.d. white noise.

Stationarity
If $B(z) = 1 - \sum_{i=1}^{p}\beta_i z^i$ has its roots outside the unit circle, then $\sigma_t^2$ is stationary and invertible.

Invertibility
$\sigma_t^2$ is a complicated function of the $\epsilon_u$, $u < t$, obtained by replacing $\eta_{t-i}$ by $\epsilon_{t-i}/\sigma_{t-i}$.

Examples
To better understand the "effects" of the parameters, let's have a look at examples with different choices of θ and γ:

Figure 47: Simulation of an EGARCH process with a high θ parameter. Figure 48: Simulation of the associated high θ parameter AR-EGARCH process.

In the figures above, the volatility clustering effect is visible, and even too marked to be realistic. High θ and γ lead to impacts so strong that the underlying might fall to zero, as seen in the following figure:

Figure 49: Simulation of the associated high θ parameter AR-EGARCH process: underlying drops to zero.

Figure 50: Simulation of an EGARCH process with close θ and γ parameters. Figure 51: Simulation of the associated close θ and γ parameters AR-EGARCH process.

Figure 52: Simulation of an EGARCH process with low θ and high γ parameters. Figure 53: Simulation of the associated low θ and high γ parameters AR-EGARCH process.

The above figures were simulated assuming gaussian standardized errors $\eta_t$ and an EGARCH of orders (1,1). In what follows, we focus on EGARCH(1,1) processes.

5.2 Fitting EGARCH models

5.2.1 Likelihood Maximization under gaussian assumption using gradient descent

As in the GARCH model, assuming that $\eta_t$ follows a reduced centered gaussian distribution, the conditional log-likelihood is given by (22):
$$\ell = cste - \frac{1}{2}\sum_{t=1}^{T}\ln\sigma_t^2 - \frac{1}{2}\sum_{t=1}^{T}\frac{\epsilon_t^2}{\sigma_t^2} = \sum_{t=1}^{T}\ell_t \qquad (34)$$
where $\ell_t$ can be rewritten in the more useful form:
$$\ell_t = -\frac{1}{2}\epsilon_t^2\exp(-\ln\sigma_t^2) - \frac{1}{2}\ln\sigma_t^2 = -\frac{1}{2}\epsilon_t^2\exp\left(-(w + \beta\ln\sigma_{t-1}^2 + g(\eta_{t-1}))\right) - \frac{1}{2}\left(w + \beta\ln\sigma_{t-1}^2 + g(\eta_{t-1})\right) \qquad (35)$$

Proposition 5.1. Under the gaussian assumption for the standardized residuals, the conditional log-likelihood gradient of an EGARCH(1,1) process is given by:
$$\begin{cases} \nabla\ell(\theta) = \sum_{t=1}^{T}\left(\frac{1}{2}\eta_t^2 - \frac{1}{2}\right)\left(\frac{\partial\ln\sigma_t^2}{\partial w}, \frac{\partial\ln\sigma_t^2}{\partial\beta}, \frac{\partial\ln\sigma_t^2}{\partial\theta}, \frac{\partial\ln\sigma_t^2}{\partial\gamma}\right) \\ \left(\frac{\partial\ln\sigma_t^2}{\partial\theta_i}\right)_{i=1..4} = (1, \ln\sigma_{t-1}^2, \eta_{t-1}, |\eta_{t-1}| - E|\eta_{t-1}|) + \left(\frac{\partial\ln\sigma_{t-1}^2}{\partial\theta_i}\right)_{i=1..4} \end{cases}$$

Proof. Taking the same notations as in the GED case, we have:
$$\nabla\ell_t(\theta) = \frac{\partial\ell_t(\theta)}{\partial\ln\sigma_t^2}\nabla\ln\sigma_t^2(\theta)$$
where
$$\frac{\partial\ell_t(\theta)}{\partial\ln\sigma_t^2} = \frac{1}{2}\eta_t^2 - \frac{1}{2}$$
and
$$\nabla\ln\sigma_t^2(\theta) = (1, \ln\sigma_{t-1}^2, \eta_{t-1}, |\eta_{t-1}| - E|\eta_{t-1}|) + \nabla\ln\sigma_{t-1}^2(\theta)$$

5.2.2 Likelihood Maximization under generalized error distribution using gradient descent

Conditional log-likelihood under GED assumption
Assuming that $\eta_t$ follows a reduced centered generalized error distribution, the conditional distribution of $\epsilon_t$ given the past is a generalized error distribution of scale $\sigma_t$:
$$\epsilon_t|\epsilon_{t-1} \overset{(d)}{=} GED(\nu), \qquad p(\epsilon_t|\epsilon_{t-1}) = \frac{\nu\exp(-\frac{1}{2}|\epsilon_t/(\lambda\sigma_t)|^\nu)}{\lambda\sigma_t\,2^{1+1/\nu}\,\Gamma(1/\nu)}$$
Let's denote
$$\ell_t = -\frac{1}{2}\left|\frac{\epsilon_t}{\lambda\sigma_t}\right|^\nu - \frac{1}{2}\ln\sigma_t^2 = -\frac{1}{2}\left|\frac{\epsilon_t}{\lambda}\right|^\nu\exp\left(-\frac{\nu}{2}\ln\sigma_t^2\right) - \frac{1}{2}\ln\sigma_t^2 = -\frac{1}{2}\left|\frac{\epsilon_t}{\lambda}\right|^\nu\exp\left(-\frac{\nu}{2}(w + \beta\ln\sigma_{t-1}^2 + g(\eta_{t-1}))\right) - \frac{1}{2}\left(w + \beta\ln\sigma_{t-1}^2 + g(\eta_{t-1})\right) \qquad (36)$$

and $\theta = (w, \beta, \theta, \gamma, \nu)$ is the vector of unknown parameters. The conditional log-likelihood is then:
$$\ell(\theta) = cste + \sum_{t=1}^{T}\ell_t(\theta)$$

Log-likelihood Gradient
Let's denote $\theta = (w, \beta, \theta, \gamma, \nu)$ the vector of unknown parameters.

Proposition 5.2. Under the generalized error distribution assumption for the standardized residuals, the conditional log-likelihood gradient of an EGARCH(1,1) process is given by:
$$\begin{cases} \nabla\ell(\theta) = \sum_{t=1}^{T}\left[\left(\frac{\nu}{4}\left|\frac{\epsilon_t}{\lambda\sigma_t}\right|^\nu - \frac{1}{2}\right)\left(\frac{\partial\ln\sigma_t^2}{\partial w}, \frac{\partial\ln\sigma_t^2}{\partial\beta}, \frac{\partial\ln\sigma_t^2}{\partial\theta}, \frac{\partial\ln\sigma_t^2}{\partial\gamma}, 0\right) - \frac{1}{2}\left|\frac{\epsilon_t}{\lambda\sigma_t}\right|^\nu\ln\left|\frac{\epsilon_t}{\lambda\sigma_t}\right|(0,0,0,0,1)\right] \\ \left(\frac{\partial\ln\sigma_t^2}{\partial\theta_i}\right)_{i=1..4} = (1, \ln\sigma_{t-1}^2, \eta_{t-1}, |\eta_{t-1}| - E|\eta_{t-1}|) + \left(\frac{\partial\ln\sigma_{t-1}^2}{\partial\theta_i}\right)_{i=1..4} \end{cases}$$

Proof. The log-likelihood gradient is defined as:
$$\nabla\ell_t(\theta) = \left(\frac{\partial\ell_t}{\partial w}, \frac{\partial\ell_t}{\partial\beta}, \frac{\partial\ell_t}{\partial\theta}, \frac{\partial\ell_t}{\partial\gamma}, \frac{\partial\ell_t}{\partial\nu}\right) \qquad (37)$$
For i = 1..4, by the chain rule:
$$\frac{\partial\ell_t}{\partial\theta_i} = \frac{\partial\ell_t}{\partial\ln\sigma_t^2}\frac{\partial\ln\sigma_t^2}{\partial\theta_i} \quad \text{with} \quad \frac{\partial\ell_t}{\partial\ln\sigma_t^2} \equiv d\ell_t = \frac{\nu}{4}\left|\frac{\epsilon_t}{\lambda\sigma_t}\right|^\nu - \frac{1}{2}$$
and, by definition of $\ln\sigma_t^2$, $\left(\frac{\partial\ln\sigma_t^2}{\partial\theta_i}\right)_{i=1..4}$ is defined by the following recurrence:
$$\left(\frac{\partial\ln\sigma_t^2}{\partial\theta_i}\right)_{i=1..4} = (1, \ln\sigma_{t-1}^2, \eta_{t-1}, |\eta_{t-1}| - E|\eta_{t-1}|) + \left(\frac{\partial\ln\sigma_{t-1}^2}{\partial\theta_i}\right)_{i=1..4}$$
The partial derivative of $\ell_t$ with respect to ν is the same as the one under GARCH models with the GED assumption (4.2.2).


We estimate the hessian matrix by BHHH.

Step optimization under parameters constraints
When updating the parameters, we want to keep a positive β and the asymmetry conditions, i.e.:
$$\begin{cases} \beta^{n+1} = \beta^n + \alpha^n d_2^n > 0 \\ \theta^{n+1} = \theta^n + \alpha^n d_3^n < 0 \\ -(\gamma^n + \alpha^n d_4^n) < \theta^n + \alpha^n d_3^n \\ \theta^n + \alpha^n d_3^n < \gamma^n + \alpha^n d_4^n \end{cases}$$
Based on these conditions and depending on the signs of the $d_i^n$, we can infer a maximum bound on $\alpha^n$ such that the positivity and asymmetry conditions are preserved:
$$\begin{cases} \alpha^n < -\frac{\beta^n}{d_2^n} & \text{if } d_2^n < 0 \\ \alpha^n < -\frac{\theta^n}{d_3^n} & \text{if } d_3^n > 0 \\ \alpha^n < -\frac{\theta^n + \gamma^n}{d_3^n + d_4^n} & \text{if } d_3^n + d_4^n < 0 \\ \alpha^n < \frac{\gamma^n - \theta^n}{d_3^n - d_4^n} & \text{if } d_3^n - d_4^n > 0 \end{cases}$$

5.2.3 Fit and simulations

• We used an "approximate gradient" of the following form to have smoother results (a minimal sketch is given after the table below):
$$\nabla\ell \approx \frac{1}{\epsilon}\begin{pmatrix} \ell(w+\epsilon, \beta, \theta, \gamma) - \ell(w, \beta, \theta, \gamma) \\ \ell(w, \beta+\epsilon, \theta, \gamma) - \ell(w, \beta, \theta, \gamma) \\ \ell(w, \beta, \theta+\epsilon, \gamma) - \ell(w, \beta, \theta, \gamma) \\ \ell(w, \beta, \theta, \gamma+\epsilon) - \ell(w, \beta, \theta, \gamma) \end{pmatrix} \quad \text{with } \epsilon = 10^{-4}$$

Initial parameters
• As the S&P 500 shows more volatility than the Euro stoxx, we initialize the parameters with higher θ and γ coefficients for the S&P 500 than for the Euro stoxx. w is initialized with the stationarity relationship $w = E(\ln h)(1 - \beta)$, where $E(\ln h)$ is approximated as the average of the logarithm of the squared residuals.

Euro stoxx S&P 500

β 0.8 0.78

θ −1 10−3 −0.7 10−3

γ 5 10−2 9 10−2

lnˆh -10.9 −10.2

Table 18: EGARCH(1,1) initial coefficients

60

w −2.4 -2.05
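The forward-difference gradient mentioned above can be sketched as follows, assuming numpy; the function name is ours and the routine applies to any scalar log-likelihood taking a parameter vector.

# Forward-difference "approximate gradient" with step eps.
import numpy as np

def approx_gradient(ell, params, eps=1e-4):
    params = np.asarray(params, dtype=float)
    base = ell(params)
    grad = np.zeros_like(params)
    for i in range(len(params)):
        shifted = params.copy()
        shifted[i] += eps
        grad[i] = (ell(shifted) - base) / eps
    return grad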

1. Initial fit

Figure 54: Log-likelihood evolution with iteration - EGARCH(1,1) fit on Euro stoxx data.

Figure 55: EGARCH(1,1) parameters evolution with iteration - EGARCH(1,1) fit on Euro stoxx data

Euro stoxx S&P 500

w -2.04 -2.40

β 0.753 0.727

θ -0.00622 -0.0285

γ 0.0576 0.140

MSE 4.03 10−7 3.49 10−7

Table 19: EGARCH(1,1) fit

As expected, given the higher level of volatility and heteroskedasticity of the S&P compared to the Euro stoxx, the obtained heteroskedasticity-related parameters θ and γ are higher for the S&P.

2. Unidimensional EGARCH processes generation

Once the parameters are fitted, we can draw new EGARCH(1,1) processes using the fitted parameters and the recursion equation of the EGARCH(1,1) model (a minimal sketch is given below). Generate $\eta_t \sim N(0,1)$, then
$$\begin{cases} \ln\sigma_t^2 = w + \beta\ln\sigma_{t-1}^2 + \theta\eta_{t-1} + \gamma(|\eta_{t-1}| - E|\eta_{t-1}|) \\ \epsilon_t = \eta_t\sigma_t \end{cases}$$
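A minimal sketch of this recursion under gaussian $\eta_t$ (so $E|\eta_t| = \sqrt{2/\pi}$), assuming numpy; the parameter values below are simply the Euro stoxx fit of Table 19, used for illustration.

# Simulating an EGARCH(1,1) path from fitted parameters.
import numpy as np

def simulate_egarch11(w, beta, theta, gamma, n_steps, seed=None):
    rng = np.random.RandomState(seed)
    e_abs = np.sqrt(2.0 / np.pi)                 # E|eta| for a standard gaussian
    log_h = np.empty(n_steps)
    eps = np.empty(n_steps)
    log_h[0] = w / (1.0 - beta)                  # unconditional level of ln(sigma^2)
    eta_prev = rng.randn()
    eps[0] = eta_prev * np.exp(0.5 * log_h[0])
    for t in range(1, n_steps):
        log_h[t] = w + beta * log_h[t - 1] + theta * eta_prev \
                   + gamma * (abs(eta_prev) - e_abs)
        eta_prev = rng.randn()
        eps[t] = eta_prev * np.exp(0.5 * log_h[t])
    return eps, np.exp(log_h)

eps, h = simulate_egarch11(w=-2.04, beta=0.753, theta=-0.00622, gamma=0.0576,
                           n_steps=2500, seed=0)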

3. EGARCH(1,1) parameters stability

              w                    β                 θ                   γ
Euro stoxx    -2.25 (3.26 10^-3)   0.724 (0.00202)   -0.0138 (0.00183)   0.105 (0.00127)
S&P 500       -2.40 (2.28 10^-3)   0.721 (0.00198)   -0.0280 (0.0156)    0.140 (3.13 10^-3)

Table 20: EGARCH(1,1) fit on simulated data - average parameters - 10 000 simulations

Notes: standard deviations are in parentheses.

Remark (Results with R studio). Contrary to our results, the EGARCH fit using the rugarch package leads to a higher β and lower θ and γ for the S&P 500. As in the AR-GARCH fit, the AR parameter seems to be fitted simultaneously with the EGARCH parameters.

Euro stoxx S&P 500

AR -0.0288 -0.0563

w -1.17 -2.25

β 0.809 0.976

θ -0.928 -0.194

γ 0.916 0.149

Table 21: EGARCH(1,1) fit on R studio

Remark (Exponential GARCH under GED distribution). We did not have the time to optimize the algorithm of the EGARCH fit under the GED distribution. However, we reach a higher level of log-likelihood with a GED order higher than 2. Using the approximated gradient and the initial β, θ and γ parameters used in the gaussian case, we got the following log-likelihood evolution for different initial orders ν:

Figure 56: Log-likelihood evolution with different initial GED distribution order. Highest likelihood is obtained with ν = 3


5.3 Multidimensional EGARCH processes

5.3.1 Parsimonious MEGARCH

The exponential GARCH model has been extended to the multidimensional case only fairly recently, by Koutmos and Booth in 1995 and Tsay in 2005. We are going to consider a variant of Koutmos and Booth's model, which appears more parsimonious than the general M-EGARCH framework ($2N^2 + 2N$ parameters). Contrary to Koutmos, we are not going to consider the cross-effect of the standardized errors on the log conditional variances through an N x N g function. Denoting $E_t = (\epsilon_{1,t}, ..., \epsilon_{N,t})'$ and $Z_t = (\eta_{1,t}, ..., \eta_{N,t})'$, the considered multidimensional EGARCH model of order (1,1) follows the general multidimensional heteroskedastic framework:
$$E_t = H_t^{1/2}Z_t, \qquad H_t = Cov(E_t|\Omega_{t-1})$$
where
$$\begin{cases} H_{ii,t} = h_{i,t} & \text{for } i = 1...N \\ H_{ij,t} = \rho_{ij}\sqrt{h_{i,t}h_{j,t}} & \text{for } i,j = 1...N, \ i \neq j \end{cases}$$
and the $h_{i,t}$, for $i = 1...N$, follow the EGARCH process:
$$\ln h_{i,t} = w_i + \beta_i\ln h_{i,t-1} + g(\eta_{i,t})$$

Remark.
• $\rho_{ij}$ is the "constant conditional correlation" between the i-th and j-th residuals. The correlations are assumed constant, as in the multidimensional GARCH proposed by Bollerslev [Bol90]. Indeed, $|\rho_{ij}| < 1$.
• A matrix $H_t$ of this form is not necessarily positive semi-definite. Let's take the case N = 3. A positive determinant is a necessary condition for a matrix to be positive semi-definite. Denoting $D_t = Diag(\sqrt{h_{i,t}}, i=1...N)$, $H_t = D_tC_tD_t$ where $C_t$ is the "correlation matrix":
$$C_t = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{pmatrix}$$
The determinant of the product $D_tC_tD_t$ is the product of the determinants, and the determinant of $D_t$ is the product of the $\sqrt{h_{i,t}}$, $i=1...N$, so it is positive. Yet:
$$\begin{vmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{vmatrix} = 1 - \rho_{12}^2 - \rho_{13}^2 - \rho_{23}^2 + 2\rho_{12}\rho_{13}\rho_{23}$$
If, for example, $\rho_{12} = \rho_{13} = 0.9$ and $\rho_{23} = -0.9$, the determinant is negative and the matrix is not positive semi-definite.

Diagonal coefficients We want to estimate W = (wi )i=1...N , B = (βi )i=1...N , Θ = (θi )i=1...N and Γ = (γi )i=1...N , such that: ln Ht = W + B ∗ ln Ht−1 + Θ ∗ U + Γ ∗ (|U | − E|U |)

(38)

where * denotes the Hadamard product, ln Ht = (ln hi,t )i=1...N and |U | = (|ui |)i=1...N . As in the unidimensional case, the diagonal coefficients follow : ln hi,t = wi + βi ln hi,t−1 + g(ηt,i ) So we apply separately for all i = 1...N the gradient descent method proposed in the last subsection.


Off-diagonal coefficients
We are going to suppose a constant conditional correlation $\rho_{ij}$ such that $Cov(\epsilon_{i,t}, \epsilon_{j,t}|\Omega_{t-1}) = \rho_{ij}\sigma_{i,t}\sigma_{j,t}$ (where $\sigma_{\cdot,t} = \sqrt{h_{\cdot,t}}$). Focusing on the sub-vector $E_{ij,t} = \begin{pmatrix} \epsilon_{i,t} \\ \epsilon_{j,t} \end{pmatrix}$, and assuming that the standardized residual noises $\begin{pmatrix} \eta_{i,t} \\ \eta_{j,t} \end{pmatrix}$ follow a gaussian law, as in the multidimensional GARCH model, the conditional density is given by (31). To estimate $\rho_{ij}$, we want to maximize the scalar log-likelihood:
$$\ell(\rho_{ij}) = \sum_{t=1}^{T}\left(-\frac{1}{2}\ln|H_{ij,t}| - \frac{1}{2}E_{ij,t}'H_{ij,t}^{-1}E_{ij,t}\right)$$
where:
$$E_{ij,t}'H_{ij,t}^{-1}E_{ij,t} = \frac{P_{ij,t}}{D_{ij,t}}, \quad P_{ij,t} = \sigma_{j,t}^2\epsilon_{i,t}^2 + \sigma_{i,t}^2\epsilon_{j,t}^2 - 2\rho_{ij}\sigma_{i,t}\sigma_{j,t}\epsilon_{i,t}\epsilon_{j,t}, \quad D_{ij,t} = |H_{ij,t}| = \sigma_{i,t}^2\sigma_{j,t}^2(1-\rho_{ij}^2)$$

5.3.2 Fit and simulations

• Initial parameters
– $\rho_{ij}$ is initialized as the empirical correlation between the i-th and j-th asset residuals.
– β, θ and γ are initialized with the values indicated in the last section.
On the Euro stoxx and S&P 500 data: $\rho_0 = 0.642$.

1. Initial fit

The vectors W, B, Θ and Γ are filled with the parameters fitted in the unidimensional case:
$$W = \begin{pmatrix} -2.04 \\ -2.40 \end{pmatrix} \quad B = \begin{pmatrix} 0.753 \\ 0.727 \end{pmatrix} \quad \Theta = \begin{pmatrix} -0.00622 \\ -0.0285 \end{pmatrix} \quad \Gamma = \begin{pmatrix} 0.0576 \\ 0.140 \end{pmatrix}$$
The ρ parameter is optimized through the gradient descent technique:
$$\rho = 0.681 \qquad (39)$$

Figure 57: Evolution of the log-likelihood ℓ(ρ) with iteration.

Figure 58: Evolution of ρ with iteration.

2. EGARCH processes generation

To simulate an MEGARCH process $E_t = H_t^{1/2}Z_t$, once the parameters are estimated, the $H_t$ matrix can be calculated as follows (a minimal sketch is given below):
1. Initialization: $\ln H_0 = \left(\frac{1}{T}\sum_{t=1}^{T}\ln\epsilon_{i,t}^2\right)_{i=1..N}$
2. Recurrence:
• Generate $U \sim N(0, I_N)$
• $\ln H_t = W + B * \ln H_{t-1} + \Theta * U + \Gamma * (|U| - E|U|)$
• $H_t = D_tC_tD_t$ with $D_t = Diag(\exp(\frac{1}{2}\ln h_{i,t}), i=1..N)$ and $C_t = (\rho_{i,j})_{i,j=1..N}$
3. Positive definite transformation of $H_t$ (using the method proposed in proposition 4.6)
4. Cholesky decomposition to get $H_t^{1/2}$
5. Product $E_t = H_t^{1/2}Z_t$

) ) ( 0.767 ± 0.00311 −1.91 ± 0.00913 B= 0.744 ± 0.00245 −2.21 ± 0.0102

) ) ( 0.0510 ± 0.000229 −0.00116 ± 0.000328 Γ= 0.0908 ± 0.000231 −0.00096 ± 0.000328 ρ = 0.675 ± 0.0194

Parameters are slightly higher for w, theta and β.

5.4

Conclusion, limits of MEGARCH, opening

The MEGARCH model appears as the most complete model considered. It is better than GARCH model as it reproduces the leverage effect.

Stationarity Invertibility Mean-reverting Fat tails Long-range dependence Heteroskedasticity Leveraged effect



Table 22: Stylized facts verification - EGARCH models

Inherent difficulties • The global non-stationarity We have seen that fitting financial time series with model stable in presence of stationary time series can be difficult. Indeed, financial time series are not stationary in the long-term. Assuming stationarity and doing the regression in the AR(1) case, we got parameter lower than 1 in magnitude, and negative, which was in favour of a stationary behavior. But a shift in the time origin gave different parameters, which reject the stationarity on the long-term.

65

• Highly non-linear methods Log-likelihood maximization methods are highly non-linear and difficult to fit: falling in local minima is possible. Genetic algorithm (Nelder Mead for R studio) could have been used. Improvements • Convergence measures In the multidimensional case, we have not clearly defined convergence criteria for our estimators. We could have calculated Carmer Rao bounds for example. Other models that could have been explored It is still a quite simple model. We could have explored other classes of leveraged-effect compatible models, such as the T-GARCH, an assymetric GARCH extension modelling the conditional variance as: ht = w + α+ max(Xt−1 , 0) + α− min(Xt−1 , 0) + βht−1 AR-GARCH and AR-EGARCH only reproduce partially the long-range dependence and are not adapted to long time series. Other class of model could be more efficient to reproduce stylized facts. To quote some of them: • Multifractal models: they reproduce the long-range dependence, and are more efficient than GARCH models in high frequency(cf Vargas [Var10]) • Multifactor models: Ng and Rotschild [NER92], for example, propose a GARCH-based model with dynamic factors. They add macro-economic factors in addition to a market factor to model asset returns.

66

6 Assessing Robustness of Systematic Strategies

6.1 Systematic strategies

Systematic strategies (or systematic trading) are strategies that try to optimize the allocation of a portfolio under different risk restrictions. They rely strongly on backtesting. For example, a systematic strategy can have a target volatility on a certain period and aim to maximize its returns under this constraint.

6.1.1 Markowitz optimal allocation theory

Markowitz "theory" is a mathematical framework for portfolio allocation. It is based on a maximization of the expected returns under constraints on the variance. Thus, it proposes a tradeoff between risk and profitability, making it a useful tool for systematic strategies.

The mean-variance portfolio
Let us consider a universe of n risky assets. Let µ and Σ be the vector of expected returns and the covariance matrix of asset returns. We note r the risk-free rate. A portfolio allocation consists in a vector of weights $w = (w_1, ..., w_n)$ where $w_i$ is the percentage of the wealth invested in the i-th asset. We can add constraints on the weights, for example assuming full investment by setting the sum of weights to one, or assuming the portfolio is long-only by setting all the weights non-negative. Let us now define the quadratic utility function U of the investor, which only depends on the expected returns µ and the covariance matrix Σ of the assets:
$$U(w) = w'(\mu - r\mathbf{1}) - \frac{\phi}{2}w'\Sigma w$$
where φ is the risk tolerance of the investor. The mean-variance optimized (or MVO) portfolio w is the portfolio which maximizes the investor's utility. The optimization problem can be reformulated equivalently as a standard QP problem:
$$w^* = \arg\min\ \frac{1}{2}w'\Sigma w - \gamma w'(\mu - r\mathbf{1})$$
where $\gamma = \phi^{-1}$. Without any constraints (a minimal numerical sketch is given below):
$$w^* = \frac{1}{\phi}\Sigma^{-1}(\mu - r\mathbf{1})$$
In practice, µ and Σ are unknown, so the optimal investment can't be exactly reached: µ and Σ have to be estimated. There are plenty of ways of improving these estimations, from resampling methods to shrinkage approaches or penalization methods. Some of them are detailed in Lyxor's white paper [BGRR13]. Lakner [Lak98] gives explicit formulas for the optimal allocation under partial information using the Clark-Ocone theorem.
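A minimal numpy sketch of the unconstrained formula above; the two-asset inputs are made-up illustrative values, not estimated from our data.

# Unconstrained mean-variance optimal weights: w* = (1/phi) Sigma^{-1} (mu - r 1).
import numpy as np

def mvo_weights(mu, sigma, r, phi):
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    excess = mu - r * np.ones(len(mu))
    return np.linalg.solve(sigma, excess) / phi

w = mvo_weights(mu=[0.05, 0.07], sigma=[[0.04, 0.01], [0.01, 0.09]], r=0.01, phi=2.0)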

6.1.2 Trend-following or Mean-reverting?

Momentum strategies are strategies that aim to take advantage of stock movements. They can be separated into two main classes:
• Trend-following strategies: they are based on the assumption that high returns will be followed by high returns and vice versa. The idea is to buy when the stock rises and to sell when it goes down. This works if we assume an inertia effect of the financial markets. In practice, it can be shown that this is the case for most stock indices.
• Contrarian strategies: this kind of investment goes against the market and consists in buying poorly performing assets and selling them when they perform well. It is based on a mean-reverting behaviour assumption: if an asset goes up, there is a strong probability that it will go down to its long-term average.
The "momentum risk premium" is the premium associated with such investment strategies. It comes from the time delay/gap of reaction between markets and investment.

67

Figure 59: Average conditional returns on one month after returns of the same sign on last 3 months.

6.2 Profitability indices

6.2.1 Sharpe Ratio

The Sharpe ratio measures the profitability gap between a risky portfolio and a non-risky portfolio, divided by a risk indicator of the risky investment (typically the standard deviation of the portfolio). It roughly reduces to:
$$SR = \frac{\mu - r}{\sigma}$$
where:
• µ is the expected asset return;
• r is the risk-free rate;
• σ is the volatility of the asset.
The definition is of course adapted to the way µ, r and σ are estimated. It indicates how well the return of an asset compensates the investor for the risk taken. When comparing two assets versus a common benchmark, the one with the higher Sharpe ratio provides a better return for the same risk (or, equivalently, the same return for a lower risk).
• A ratio of 1 or better is considered good;
• 2 or better is very good;
• 3 or better is excellent.

6.2.2 Drawdown

A drawdown is defined as the maximum decline of an investment, fund or commodity over a specific recorded period. It is usually quoted as the percentage between the peak and the subsequent trough.

Figure 60: Drawdown illustration (source: The Practical Quant).

6.2.3 Calmar Ratio

The Calmar ratio is a performance benchmark commonly used to assess the risk efficiency of hedge funds. It is a downside risk-adjusted performance measure that employs the maximum drawdown to penalize risk. The Calmar ratio is equal to the compounded annual growth rate divided by the maximum drawdown:
$$CR = \frac{Compounded\ Annual\ Return}{Maximum\ Drawdown}$$
The maximum drawdown is the maximum peak-to-trough of the returns, and is typically measured over a three-year period. Fundamentally, the maximum drawdown indicates the greatest loss an investor could suffer if an investment is bought at its highest price and sold at its lowest. By using the maximum drawdown as a proxy for risk, the Calmar ratio is considered a good indicator of the emotional pain an investor could feel if the stock market suddenly goes down.
• 1 or higher is considered good;
• 3 or higher is considered very good;
• 5 or higher indicates excellent performance.
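A minimal sketch computing the three profitability indices on a series of portfolio values, assuming numpy; the simple daily annualisation convention (252 periods) is our own assumption, not the desk's exact one.

# Sharpe ratio, maximum drawdown and Calmar ratio from a value series.
import numpy as np

def profitability_indices(values, risk_free_rate=0.0, periods_per_year=252):
    values = np.asarray(values, dtype=float)
    rets = values[1:] / values[:-1] - 1.0
    ann_ret = np.mean(rets) * periods_per_year
    ann_vol = np.std(rets) * np.sqrt(periods_per_year)
    sharpe = (ann_ret - risk_free_rate) / ann_vol
    running_max = np.maximum.accumulate(values)
    max_drawdown = np.max(1.0 - values / running_max)          # peak-to-trough decline
    cagr = (values[-1] / values[0]) ** (periods_per_year / float(len(rets))) - 1.0
    calmar = cagr / max_drawdown if max_drawdown > 0 else np.inf
    return sharpe, max_drawdown, calmar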

6.3 Systematic strategies robustness test

Most systematic strategies rely on backtesting, i.e. historical data, to predict the future profitability of the strategy. In order to challenge the results obtained by backtesting, and to see how a strategy would react under different economic scenarios, one can fit and re-simulate the underlying trajectories under a model, perturb the parameters associated with the chosen model coherently with an economic scenario, and look at the profitability indices distributions. This work was partly done in the BNP SSH team, under a Black-Scholes model or an Ornstein-Uhlenbeck model for the drift of the underlying. For more information, see Bel Hadj Ayed [BHA16]. The purpose of this internship was to find a model that reproduces well the statistical properties of financial time series in order to build a new robustness test.

6.3.1 AR(1)-MEGARCH(1,1) underlying model

The idea is to simulate new trajectories of the underlyings fitted by the AR(1)-MEGARCH(1,1) model, and re-price the strategy to get new profitability indices.

6.3.2 AR(1)-MEGARCH(1,1) parameters perturbation

In order to highlight the importance of autocorrelation in systematic strategies, we are going to apply the following parameter perturbation:
1. multiplicative increase/decrease of the autocorrelation level: we increase (respectively decrease) the autocorrelation parameter by 50%.
For a trend-following strategy, we expect a higher Sharpe ratio under enhanced auto-correlation (perturbation 1). Why do we expect the sign and magnitude of the AR parameter to have an impact? We model stock returns as:
$$\frac{\Delta S_t}{S_t} = \mu + Y_t, \qquad Y_t = \alpha Y_{t-1} + \epsilon_t$$
where $Y_t$ is an AR(1) process and $\epsilon_t$ an EGARCH(1,1) process. So:
• $E(\frac{\Delta S_t}{S_t}|Y_{t-1}) = \mu + \alpha Y_{t-1}$
• $Var(\frac{\Delta S_t}{S_t}|Y_{t-1}) = \sigma_t^2 - 2\mu\alpha Y_{t-1} - \mu^2$
The conditional Sharpe ratio associated with the underlying $S_t$ is then:
$$SR_{t-1} = \frac{\mu + \alpha Y_{t-1}}{\sigma_t^2 - 2\mu\alpha Y_{t-1} - \mu^2}$$
And if µ = 0, it reduces to:
$$SR_{t-1} = \frac{\alpha Y_{t-1}}{\sigma_t^2}$$
So we are going to test these scenarios under a null and a non-null drift parameter. We expect an impact of the sign and magnitude of α on the overall Sharpe ratio.

For confidentiality purposes, we do not show in this report the results of the application of the perturbations on a true systematic strategy.

7 Conclusion

After a review of the global statistical properties verified by financial time series, we got interested in the time series models most used and most appropriate to reproduce these patterns. We extended them to the multivariate case, proposed fitting methods and tested the limits of the models and fitting algorithms. The MEGARCH model appears as the best model in terms of stylized facts reproduced. We proposed a parsimonious model and a gradient-descent method using an approximate gradient function to fit the model. The purpose of this model search was to implement a new "robustness test" in which we could highlight the impact of autocorrelation on trend-following strategies, while reproducing accurately the statistical properties of financial time series. The results given by our robustness test show that MEGARCH might overestimate the drawdown of a strategy. By enhancing the autocorrelation parameters of the underlyings, we improve the profitability indices. However, our conclusions are based on a small number of simulations (1000), which might distort our results. Future work includes:
• running more simulations of the scenarios presented in the last sections;
• exploring more scenarios with parameter perturbations (for example, perturbing the heteroskedasticity-related parameters w, β, θ, γ).

8 Appendix

8.1 EM-algorithm under state-space representation and Kalman filtering

8.1.1 Proof of maximization step optimal matrices

Proof. To prove this result, we first give the log-likelihood formula under the state-space representation, and second, using standard matrix differentiation rules, the partial derivatives of this function. Assuming that $\epsilon_t \sim N(0, \Sigma_\epsilon)$ and $v_t \sim N(0, \Sigma_\eta)$:
$$Y_t - \Phi Z_t - Fx_t \sim N(0, \Sigma_\epsilon) \quad \text{and} \quad x_t - Tx_{t-1} \sim N(0, \Sigma_\eta)$$
Using the Bayes formula:
$$L(\Sigma_\epsilon, \Sigma_\eta, \Phi, F) = f(x, y|\Sigma_\epsilon, \Sigma_\eta, \Phi, F) = \prod_{t=1}^{T}f(x_t|x_{t-1})f(y_t|x_t)$$
Using the gaussian assumption:
$$\ell(\Sigma_\epsilon, \Sigma_\eta, \Phi, F) = \frac{T}{2}\ln|\Sigma_\eta^{-1}| + \frac{T+1}{2}\ln|\Sigma_\epsilon^{-1}| - \frac{1}{2}\sum_{t=1}^{T}(x_t - Tx_{t-1})'\Sigma_\eta^{-1}(x_t - Tx_{t-1}) - \frac{1}{2}\sum_{t=0}^{T}\epsilon_t'\Sigma_\epsilon^{-1}\epsilon_t + cst$$

As:
• Tr(a) = a if a is a scalar
• Tr(AB) = Tr(BA)
$$\ell(\Sigma_\epsilon, \Sigma_\eta, \Phi, F) = \frac{T}{2}\ln|\Sigma_\eta^{-1}| + \frac{T+1}{2}\ln|\Sigma_\epsilon^{-1}| - \frac{1}{2}Tr\left(\Sigma_\eta^{-1}\sum_{t=1}^{T}(x_t - Tx_{t-1})(x_t - Tx_{t-1})'\right) - \frac{1}{2}Tr\left(\Sigma_\epsilon^{-1}\sum_{t=0}^{T}\epsilon_t\epsilon_t'\right) + cst$$

Expanding $\epsilon_t\epsilon_t' = (Y_t - \Phi Z_t - Fx_t)(Y_t - \Phi Z_t - Fx_t)'$:
$$\epsilon_t\epsilon_t' = Y_tY_t' - Y_tx_t'F' - Fx_tY_t' - \Phi Z_tY_t' - Y_tZ_t'\Phi' + Fx_tZ_t'\Phi' + \Phi Z_tx_t'F' + \Phi Z_tZ_t'\Phi' + Fx_tx_t'F'$$
So the partial derivatives of ℓ with respect to F and Φ are given by:
$$\begin{cases} \frac{\partial\ell}{\partial F} = -\frac{1}{2}\Sigma_\epsilon^{-1}\sum_{t=0}^{T}\frac{\partial\epsilon_t\epsilon_t'}{\partial F} = \Sigma_\epsilon^{-1}\sum_{t=0}^{T}\left(Y_tx_t' - \Phi Z_tx_t' - Fx_tx_t'\right) \\ \frac{\partial\ell}{\partial\Phi} = -\frac{1}{2}\Sigma_\epsilon^{-1}\sum_{t=0}^{T}\frac{\partial\epsilon_t\epsilon_t'}{\partial\Phi} = \Sigma_\epsilon^{-1}\sum_{t=0}^{T}\left(Y_tZ_t' - \Phi Z_tZ_t' - Fx_tZ_t'\right) \end{cases}$$
Setting the partial derivatives of ℓ to zero is the same as setting the partial derivatives of $\sum_t\epsilon_t\epsilon_t'$ to zero. This is true as we assume $\Sigma_\epsilon = I_d$. This gives the estimators of Φ and F. Deriving ℓ with respect to $\Sigma_v$ gives the second estimator.

8.1.2 Kalman recursions

Kalman recursions are composed of three steps:
(a) a prediction step: $x_{t|t-1} \leftarrow x_{t-1|t-1}$
(b) a correction step: $x_{t|t} \leftarrow x_{t|t-1}$
(c) a smoothing step: $x_{t|T} \leftarrow x_{t|t}$  (40)

• Kalman filtering

Initialization: $x_{0|0} := \mu_0$, $\Sigma_x(0|0) := \Sigma_0$

Prediction step ($1 \leq t \leq T$):
$$x_{t|t-1} = Tx_{t-1|t-1}, \quad \Sigma_x(t|t-1) = T\Sigma_x(t-1|t-1)T' + \Sigma_\eta$$
$$y_{t|t-1} = Fx_{t|t-1} + \Phi z_t, \quad \Sigma_y(t|t-1) = F\Sigma_x(t|t-1)F' + \Sigma_\epsilon$$

Correction step ($1 \leq t \leq T$):
$$x_{t|t} = x_{t|t-1} + P(y_t - y_{t|t-1}), \quad \Sigma_x(t|t) = \Sigma_x(t|t-1) - P\Sigma_y(t|t-1)P'$$
where $P := \Sigma_x(t|t-1)F'\Sigma_y(t|t-1)^{-1}$ (Kalman filter gain)

• Kalman smoothing: backward recursion
$$x_{t|T} = x_{t|t} + S(x_{t+1|T} - x_{t+1|t}), \quad \Sigma_x(t|T) = \Sigma_x(t|t) - S[\Sigma_x(t+1|t) - \Sigma_x(t+1|T)]S'$$
where $S := \Sigma_x(t|t)T'\Sigma_x(t+1|t)^{-1}$ (Kalman smoothing matrix)

8.2 The Durbin-Levinson / Yule-Walker procedure

Estimation
This algorithm aims to estimate the $\phi_i$ in a pure and causal AR(p) model. It takes advantage of the bi-representation of pure and causal AR(p) processes:
$$\begin{cases} X_t = \sum_{i=1}^{p}\phi_i X_{t-i} + \epsilon_t & \text{AR(p) representation} \\ X_t = \sum_{i=0}^{\infty}\psi_i\epsilon_{t-i} & \text{causal representation} \end{cases}$$
By taking the expectation of products with both representations, as $\psi_0 = 1$ and $\epsilon_t \sim N(0, \sigma)$ independent from $\epsilon_{t-j}$, we have:
$$E[(\text{AR(p) representation}) \cdot X_t] \ \Rightarrow \ \gamma(0) - \phi_1\gamma(1) - ... - \phi_p\gamma(p) = \sigma^2$$
$$E[(\text{AR(p) representation}) \cdot X_{t-j}], \ j=1..p \ \Rightarrow \ \gamma(j) - \phi_1\gamma(j-1) - ... - \phi_p\gamma(j-p) = 0$$
We deduce that:
$$\Gamma_p\phi = \gamma_p$$

and σ = γ(0) − ϕ γp where Γp = [γ(i − j)]i,j=1..p and ϕ = (ϕ1 , ...ϕp )′ Replacing γ(j) by its estimator γˆ (j) gives Yule-Walker estimators ϕˆY.W . Example in the AR(1) case 2

{

γ(0) − ϕ1 γ(0) = σ 2 γ(1) − ϕ1 γ(0) = 0

Yule-Walker estimators:  γˆ (1) 2   ) ) ˆ = γˆ (0)(1 − (  σ γˆ (0)   ϕˆ = γˆ (1)  1 γˆ (0)

                 γ̂(0)        γ̂(1)         σ̂²          φ̂1
Euro Stoxx 50    0.02324%    -0.00067%    0.02322%    -0.02878
S&P 500          0.01685%    -0.00184%    0.01665%    -0.10913

Table 23: AR(1) Yule-Walker estimation
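For reference, a short sketch of the Yule-Walker computation behind Table 23 (sample autocovariances, then the linear system $\Gamma_p\phi = \gamma_p$); the function name is ours and the data loading is left out:

```python
import numpy as np

def yule_walker_ar(x, p):
    """Yule-Walker estimates (phi_1..phi_p, sigma^2) for a causal AR(p)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # biased sample autocovariances gamma_hat(0), ..., gamma_hat(p)
    gamma = np.array([x[: n - h] @ x[h:] / n for h in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, gamma[1:])
    sigma2 = gamma[0] - phi @ gamma[1:]
    return phi, sigma2

# AR(1) case: phi_1 = gamma_hat(1) / gamma_hat(0), as in Table 23
# phi, sigma2 = yule_walker_ar(daily_returns, p=1)
```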

Large-Sample Distribution of Yule-Walker Estimators. As noted by Brockwell [BD02], for a large sample from an AR(p) process:
$$\hat\phi \approx \mathcal{N}(\phi,\ n^{-1}\sigma^2\Gamma_p^{-1}) \qquad (41)$$
We can draw the following confidence bound at 95%:
$$|\hat\phi - \phi| \le 1.96\, \frac{\sigma\,\Gamma_p^{-1/2}}{\sqrt{n}} \qquad (42)$$
For an AR(1) process, an estimation of this bound is:
$$1.96\, \frac{\hat\sigma}{\sqrt{n\,\hat\gamma(0)}}$$
It is equal to 0.05 % for the 2584 data points considered (Euro Stoxx 50 and S&P 500 from 22/05/07 until 22/05/17).

Conclusion. The Durbin-Levinson method is simple and easy to implement, but it applies only to AR processes. In what follows, we present a regression-based method that can be applied to general multidimensional ARMA(p,q) processes.

8.3 Correlated Multivariate Gaussian Variable

In this section, we show how to generate correlated Gaussian variables using the Cholesky factorisation principle.

8.3.1 Case dimension N = 2

For a random vector with $N = 2$ components $Y = (Y_k)_{k=1,\dots,N}$, the correlation is defined by:
$$\mathrm{Cor}(Y_1, Y_2) = \rho \qquad (43)$$
where $\rho$ is a constant between $-1$ and $1$. To generate a bivariate Gaussian random vector with a constant correlation $\rho$, let us take two independent Gaussian random variables $(X_0, X_1)$ following $\mathcal{N}(0,1)$, and define:
$$Y = \rho X_0 + \sqrt{1-\rho^2}\, X_1 \qquad (44)$$
$Y$ is a Gaussian variable as it is a linear combination of two independent Gaussian variables. We then verify that $Y$ satisfies the following conditions:
$$\mathbb{E}(Y) = \rho\,\mathbb{E}(X_0) + \sqrt{1-\rho^2}\,\mathbb{E}(X_1) = 0$$
$$\mathrm{Var}(Y) = \mathbb{E}\big((\rho X_0 + \sqrt{1-\rho^2}\,X_1)^2\big) = \rho^2\,\mathbb{E}(X_0^2) + (1-\rho^2)\,\mathbb{E}(X_1^2) = 1$$
$$\mathrm{Cor}(Y, X_0) = \mathrm{Cor}\big(\rho X_0 + \sqrt{1-\rho^2}\,X_1,\ X_0\big) = \rho\,\mathbb{E}(X_0^2) = \rho$$
The pair $(X_0, Y)$ is therefore a bivariate Gaussian vector with correlation $\rho$.

8.3.2 Case dimension N > 2

If the correlations of $(Y_k)_{k=1,\dots,N}$ are not all equal, that is to say:
$$\rho(Y_k, Y_l) = \rho_{k,l}, \quad \text{a constant depending on } k \text{ and } l \qquad (45)$$
then we get a correlation matrix $A$ with:
$$A_{i,i} = 1,\ 1 \le i \le N, \qquad A_{i,j} = A_{j,i} = \rho_{i,j},\ 1 \le i, j \le N \text{ and } i \neq j$$
We see that $A$ is symmetric positive semi-definite, so we can apply the Cholesky factorisation to $A$:
$$A = LL^T \qquad (46)$$


$L$ is a lower triangular matrix and $L^T$ the transpose of $L$. We take $X = (X_1, X_2, \dots, X_N)$ independent random variables following the distribution $\mathcal{N}(0,1)$; then we can generate $Y = (Y_1, Y_2, \dots, Y_N)$ by:
$$Y = LX \qquad (47)$$
The matrix $L$ has the following form:
$$L = \begin{pmatrix} l_{1,1} & 0 & \cdots & 0 \\ l_{2,1} & l_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{N,1} & l_{N,2} & \cdots & l_{N,N} \end{pmatrix} \qquad (48)$$

From the equality $A = LL^T$, we get that:
$$A_{i,j} = (LL^T)_{i,j} = \sum_{k=1}^{N} l_{i,k}\, l_{j,k} = \sum_{k=1}^{\min(i,j)} l_{i,k}\, l_{j,k}, \quad 1 \le i, j \le N \qquad (49)$$
since $L$ is a lower triangular matrix, so $l_{p,q} = 0$ for $1 \le p < q \le N$. $A$ being symmetric, it is enough for this relation to be verified for $i \le j$, i.e. the $l_{i,j}$ have to satisfy:
$$A_{i,j} = \sum_{k=1}^{i} l_{i,k}\, l_{j,k}, \quad 1 \le i \le j \le N \qquad (50)$$

For $i = 1$, we deduce:
$$A_{1,1} = l_{1,1}\, l_{1,1} \;\Rightarrow\; l_{1,1} = \sqrt{A_{1,1}} \qquad (51)$$
$$A_{1,j} = l_{1,1}\, l_{j,1} \;\Rightarrow\; l_{j,1} = \frac{A_{1,j}}{l_{1,1}}, \quad 2 \le j \le N \qquad (52)$$
We determine the $i$-th column of $L$, for $1 \le i \le N$, after having calculated the first $(i-1)$ columns:
$$A_{i,i} = \sum_{k=1}^{i} l_{i,k}\, l_{i,k} \;\Rightarrow\; l_{i,i} = \sqrt{A_{i,i} - \sum_{k=1}^{i-1} l_{i,k}^2} \qquad (53)$$
$$A_{i,j} = \sum_{k=1}^{i} l_{i,k}\, l_{j,k} \;\Rightarrow\; l_{j,i} = \frac{A_{i,j} - \sum_{k=1}^{i-1} l_{i,k}\, l_{j,k}}{l_{i,i}}, \quad i+1 \le j \le N \qquad (54)$$
Then we get $L$.
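A compact sketch of this procedure, relying on NumPy's Cholesky routine instead of the explicit recursion (51)-(54); the function name and the example correlation value are ours:

```python
import numpy as np

def correlated_gaussians(corr, n_samples, rng=None):
    """Draw n_samples vectors Y = L X, with X ~ N(0, I) and corr = L L^T."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(np.asarray(corr, dtype=float))  # lower-triangular factor
    X = rng.standard_normal((n_samples, L.shape[0]))
    return X @ L.T                                          # each row is one draw of Y

corr = np.array([[1.0, 0.6], [0.6, 1.0]])
Y = correlated_gaussians(corr, 100_000)
# np.corrcoef(Y.T)[0, 1] should be close to 0.6
```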

8.3.3 GARCH Gradient Descent algorithm pseudo-code

Data: AR(1) residuals
Result: GARCH(1,1) parameters w, a, b, ν

niter = 0;
h0 = Var(residuals);
ℓ = GARCHLikelihood(h0, w, a, b, (ν));
ℓnew = ℓ + ϵ;
for w ∈ winit, a ∈ ainit, b ∈ binit, (ν ∈ νinit) do
    while niter < maxiter and ℓnew > ℓ do
        ℓ = GARCHLikelihood(h0, w, a, b, (ν));
        ∇ℓ_n = GARCHGradientLikelihood(h0, w, a, b, (ν));
        Hℓ_n = GARCHHessianLikelihood(h0, w, a, b, (ν));
        D^n = H_n^{-1} ∇_n ℓ;
        α^n_max computation, α_n = min(p% α^n_max, α_min);
        w ← w + α_n D^n_1; b ← b + α_n D^n_2; a ← a + α_n D^n_3; ν ← ν + α_n D^n_4;
        ℓnew = GARCHLikelihood(h0, w, a, b, (ν));
        distance = max(α_n |D^n_i|, |ℓnew − ℓ|);
        if ℓnew < ℓ then
            D^n = ∇_n ℓ;
            w ← w + α_n D^n_1;
            ...
        end
        niter ← niter + 1;
    end
    if ℓnew > ℓlastnew then
        woptimal ← w; aoptimal ← a; boptimal ← b; νoptimal ← ν;
    end
    ℓlastnew ← ℓnew;
end

Algorithm 2: GARCH(1,1) fit algorithm pseudo-code

Remark. ℓlastnew and the optimal fitted parameters are initialized during the first fit on the grids winit, ainit, binit (and eventually νinit).
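For comparison, here is a simplified sketch of the likelihood that GARCHLikelihood evaluates, fitted with a generic optimiser (scipy) instead of the Newton-type descent of Algorithm 2. It is restricted to Gaussian innovations (no ν parameter), and all names are ours, not the internship implementation:

```python
import numpy as np
from scipy.optimize import minimize

def garch_neg_loglik(params, eps):
    """Gaussian GARCH(1,1) negative log-likelihood for zero-mean residuals eps."""
    w, a, b = params
    if w <= 0 or a < 0 or b < 0 or a + b >= 1:
        return np.inf                       # reject non-admissible parameters
    h = np.empty_like(eps)
    h[0] = np.var(eps)                      # h0 initialised as in Algorithm 2
    for t in range(1, len(eps)):
        h[t] = w + a * eps[t - 1] ** 2 + b * h[t - 1]
    return 0.5 * np.sum(np.log(h) + eps ** 2 / h)

def fit_garch(eps):
    """Small grid of starting points, then local optimisation of the likelihood."""
    eps = np.asarray(eps, dtype=float)
    best = None
    for a0 in (0.05, 0.10):
        for b0 in (0.80, 0.90):
            x0 = np.array([np.var(eps) * (1 - a0 - b0), a0, b0])
            res = minimize(garch_neg_loglik, x0, args=(eps,), method="Nelder-Mead")
            if best is None or res.fun < best.fun:
                best = res
    return best.x                           # fitted (w, a, b)
```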

8.4 Generalized Error Distribution

8.4.1 E|Z|

For a variable $Z$ following a generalized error distribution:
$$\mathbb{E}|Z| = \int_{-\infty}^{\infty} |z|\, \frac{\nu\, e^{-\frac{1}{2}\left|\frac{z}{\lambda}\right|^{\nu}}}{\lambda\, 2^{1+1/\nu}\, \Gamma(1/\nu)}\, dz = 2 \int_{0}^{\infty} z\, \frac{\nu\, e^{-\frac{1}{2}\left(\frac{z}{\lambda}\right)^{\nu}}}{\lambda\, 2^{1+1/\nu}\, \Gamma(1/\nu)}\, dz$$
By substituting $u = \frac{1}{2}\left(\frac{z}{\lambda}\right)^{\nu}$, and after some simplifications, we have:
$$\mathbb{E}|Z| = 2^{1/\nu}\, \frac{\lambda}{\Gamma(1/\nu)} \int_{0}^{\infty} u^{2/\nu - 1} e^{-u}\, du$$
where $\int_{0}^{\infty} u^{2/\nu-1} e^{-u}\, du$ is nothing but $\Gamma(2/\nu)$. As $\lambda = 2^{-1/\nu}\left(\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}\right)^{1/2}$, we finally get:
$$\mathbb{E}|Z| = \frac{\Gamma(2/\nu)}{\left(\Gamma(1/\nu)\,\Gamma(3/\nu)\right)^{1/2}}$$
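A quick numerical sanity check of this closed form (our own snippet, not part of the report's code): for $\nu = 2$ the GED reduces to the standard Gaussian, for which $\mathbb{E}|Z| = \sqrt{2/\pi} \approx 0.7979$.

```python
import numpy as np
from scipy.special import gamma

def ged_abs_mean(nu):
    """E|Z| for a unit-variance generalized error distribution."""
    return gamma(2.0 / nu) / np.sqrt(gamma(1.0 / nu) * gamma(3.0 / nu))

print(ged_abs_mean(2.0))   # Gaussian case: sqrt(2/pi) ~ 0.7979
print(ged_abs_mean(1.5))   # an example of a fatter-tailed case (nu < 2)
```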


8.4.2 Generating GED variables

If $U \sim \mathcal{U}[0,1]$ and $F_\nu^{-1}$ denotes the inverse cdf of the GED distribution, then $X = F_\nu^{-1}(U)$ follows a GED distribution. Indeed, $\mathbb{P}(F_\nu^{-1}(U) < x) = \mathbb{P}(U < F_\nu(x)) = F_\nu(x)$, so $F_\nu^{-1}(U)$ has cdf $F_\nu$.



x

Fν (x) =

fν (y)dy −∞

=

1 1 1 1 x (1 + sign(x) γ( , | |ν ) 2 Γ(1/ν) ν 2 λ

• Fν−1 Fν−1 (y)

{ =

λ[2γ −1 ( ν1 , 2Γ( ν1 )(y − 12 )]1/ν λ[2γ −1 ( ν1 , 2Γ( ν1 )( 21 − y)]1/ν

77

if y > if y
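A sketch of this inverse-cdf simulation in Python (the helper name is ours): scipy's gammaincinv inverts the regularized lower incomplete gamma function, so the $\Gamma(1/\nu)$ factor of the formula above cancels, and $\lambda$ is the unit-variance scale introduced in Section 8.4.1.

```python
import numpy as np
from scipy.special import gamma, gammaincinv

def ged_ppf(u, nu):
    """Inverse cdf F_nu^{-1} of the unit-variance GED, applied to u in (0, 1)."""
    lam = np.sqrt(2.0 ** (-2.0 / nu) * gamma(1.0 / nu) / gamma(3.0 / nu))
    # gammaincinv inverts the regularized incomplete gamma, hence the 2|u - 1/2| argument
    core = lam * (2.0 * gammaincinv(1.0 / nu, np.abs(2.0 * u - 1.0))) ** (1.0 / nu)
    return np.where(u > 0.5, core, -core)

rng = np.random.default_rng(0)
z = ged_ppf(rng.uniform(size=100_000), nu=1.5)   # GED sample with tail parameter nu
# np.mean(np.abs(z)) should be close to the closed-form E|Z| of Section 8.4.1
```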