Après les exposés des dernières séances, l’examen du cours MAT8595, **Copules et Valeurs Extrêmes** avait lieu hier matin. L’énoncé est en ligne, et j’ai aussi écrit quelques éléments de correction, là. En cas de désaccord, merci de me le faire savoir rapidement ! Sinon, bonne fin de session..

# Tag Archives: MAT8595

# Independence and correlation

A short post to get back on a property I gave briefly in the MAT8595 class in January, and again in the MAT8181 class this week (to illustrate the distinction between weak and strong white noises). Recall that (real-valued) random variables and are independent if for all , Another characterization, for integrable variable is that for all , which can be written, if variables are square integrable The idea to prove this characterization is to observe that if and are independent can be written Using a standard argument in integration theory, equality is valid for step functions (not only indicators), and then to positive measurable functions, and finally to integrable functions. Proving this result is not that difficult. Observe that Rényi (1959) – inspired by Gebelein (1947) – followed by Sarmanov (1958) introduced the concept of *maximal correlation*, that can be related to this result, where the maximum is taken over all functions and such that the correlation exist. Actually, it is possible to consider only transformations such that and (and similarly for , the idea is that we simple center and scale, which does not impact the correlation.Thus, and are independent if and only if Algorithm to estimate that coefficient are interesting. The problem can be written, equivalently And if the minimization is considered over , assuming that is fixed, then the optimal transformation is And similarly for . So using an iterative algorithm, it is possible to get and (see Breiman and Friedman (1985) for more details). Actually, those functions appear in nonlinear canonical analysis. As mentioned in Lancaster (1957), for a Gaussian random vector and in that case and are affine functions. This can be related to Hermite’s polynomial and to the expansion of the bivariate Gaussian density. I still hope that someone will go further for the project in the MAT8181 course.

# Bivariate Densities with N(0,1) Margins

This Monday, in the ACT8595 course, we came back on elliptical distributions and conditional independence (here is an old post on de Finetti’s theorem, and the extension to Hewitt-Savage’s). I have shown simulations, to illustrate those two concepts of dependent variables, but I wanted to spend some time to visualize densities. More specifically what could be the joint density is we assume that margins are distributions.

**The Bivariate Gaussian distribution**

Here, we consider a Gaussian random vector, with margins , and with correlation . This is the standard graph, with elliptical isodensity curves

r=.5 library(mnormt) S=matrix(c(1,r,r,1),2,2) f=function(x,y) dmnorm(cbind(x,y),varcov=S) vx=seq(-3,3,length=201) vy=seq(-3,3,length=201) z=outer(vx,vy,f) set.seed(1) X=rmnorm(1500,varcov=S) xhist <- hist(X[,1], plot=FALSE) yhist <- hist(X[,2], plot=FALSE) top <- max(c(xhist$density, yhist$density,dnorm(0))) nf <- layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE) par(mar=c(3,3,1,1)) image(vx,vy,z,col=rev(heat.colors(101))) contour(vx,vy,z,col="blue",add=TRUE) points(X,cex=.2) par(mar=c(0,3,1,1)) barplot(xhist$density, axes=FALSE, ylim=c(0, top), space=0,col="light green") lines((density(X[,1])$x-xhist$breaks[1])/diff(xhist$breaks)[1], dnorm(density(X[,1])$x),col="red") par(mar=c(3,0,1,1)) barplot(yhist$density, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE,col="light green") lines(dnorm(density(X[,2])$x),(density(X[,2])$x-yhist$breaks[1])/ diff(yhist$breaks)[1],col="red")

That was the simple part.

**The Bivariate Student-***t*distribution

Consider now another elliptical distribution. But we want here to *normalize *the margins. Thus, instead of a pair , we would like to consider the pair , so that the marginal distributions are . The new density is obtained simply since the transformation is a one-to-one increasing transformation. Here, we have

k=3 r=.5 G=function(x) qnorm(pt(x,df=k)) dg=function(x) dt(x,df=k)/dnorm(qnorm(pt(x,df=k))) Ginv=function(x) qt(pnorm(x),df=k) S=matrix(c(1,r,r,1),2,2) f=function(x,y) dmt(cbind(Ginv(x),Ginv(y)),S=S,df=k)/(dg(x)*dg(y)) vx=seq(-3,3,length=201) vy=seq(-3,3,length=201) z=outer(vx,vy,f) set.seed(1) Z=rmt(1500,S=S,df=k) X=G(Z)

Because we considered a nonlinear transformation of the margins, the level curves are no longer elliptical. But there is still some kind of symmetry.

**The Exchangeable Case with Conditionally Independent Random Variables**

We did consider the case where and with independent random variables, given , and that both variables are exponentially distributed, with parameter . As we’ve seen in class, it might be difficult to visualize that sample, unless we have log scales on both axis. But instead of a log transformation, why not consider a transformation so that margins will be . The only technical problem is that we do not have the (nonconditional) distributions of the margins. Well, we have them, but they are integral based. From a computational point of view, that’s not a bit deal… Computations might take a while, but we can visualize the density using the following code (here, we assume that is Gamma distributed)

a=.6 b=1 h=.0001 G=function(x) qnorm(ifelse(x<0,0,integrate(function(z) pexp(x,z)* dgamma(z,a,b),lower=0,upper=Inf)$value)) Ginv=function(x) uniroot(function(z) G(z)-x,lower=-40,upper=1e5)$root dg=function(x) (Ginv(x+h)-Ginv(x-h))/2/h H=function(xy) integrate(function(z) dexp(xy[2],z)*dexp(xy[1],z)* dgamma(z,a,b),lower=0,upper=Inf)$value f=function(x,y) H(c(Ginv(x),Ginv(y)))*(dg(x)*dg(y)) vx=seq(-3,3,length=151) vy=seq(-3,3,length=151) z=matrix(NA,length(vx),length(vy)) for(i in 1:length(vx)){ for(j in 1:length(vy)){ z[i,j]=f(vx[i],vy[j])}} set.seed(1) Theta=rgamma(1500,a,b) Z=cbind(rexp(1500,Theta),rexp(1500,Theta)) X=cbind(Vectorize(G)(Z[,1]),Vectorize(G)(Z[,2]))

There is a small technical problem, but no big deal.

Here, the joint distribution is quite different. Margins are – one more time – standard Gaussian, but the shape of the joint distribution is quite different, with an asymmetry from the lower (left) tail to the upper (right) tail. More details when we’ll introduce copulas. The only difference will be that the margins will be uniform on the unit interval, and not standard Gaussian.

# Bias of Hill Estimators

In the MAT8595 course, we’ve seen yesterday Hill estimator of the tail index. To be more specific, we did see see that if , with , then Hill estimators for are given by

for . Then we did say that satisfies some consistency in the sense that if , but not too fast, i.e. (under additional assumptions on the rate of convergence, it is possible to prove that ). Further, under additional technical conditions

In order to illustrate this point, consider the following code. First, let us consider a Pareto survival function, and the associated quantile function

> alpha=1.5 > S=function(x){ifelse(x>1,x^(-alpha),1)} > Q=function(p){uniroot(function(x) S(x)-(1-p),lower=1,upper=1e+9)$root}

The code here is obviously too complicated, since this power function can easily be inverted. But later on, we will consider a more complex survival function. Here are the survival function, and the quantile function,

> u=seq(0,5,by=.01) > plot(u,Vectorize(S)(u),type="l",col="red") > u=seq(0,99/100,by=.01) > plot(u,Vectorize(Q)(u),type="l",col="blue",ylim=c(0,20))

Here, we need the quantile function to generate a random sample from this distribution,

> n=500 > set.seed(1) > X=Vectorize(Q)(runif(n))

Hill plot is here

> library(evir) > hill(X) > abline(h=alpha,col="blue")

We can now generate thousands of random samples, and see how those estimators behave (for some specific ‘s).

> ns=10000 > HillK=matrix(NA,ns,10) > for(s in 1:ns){ + X=Vectorize(Q)(runif(n)) + H=hill(X,plot=FALSE) + hillk=function(k) H$y[H$x==k] + HillK[s,]=Vectorize(hillk)(15*(1:10)) + }

and if we compute the average,

> plot(15*(1:10),apply(HillK,2,mean)

we do get a series of estimators that can be considered as unbiased.

So far, so good. Now, recall that being in the max-domain of attraction of the Fréchet distribution does not mean that , with , but is means that

for some slowly varying function , not necessarily constant! In order to understand what could happen, we have to be slightly more specific. And this can be done only by looking at second order regular variation property of the survival function. Assume, here that there is some auxilary function such that

This (positive) constant is – somehow – related to the speed of convergence of the ratio of the survival functions to the power function (see e.g. Geluk *et al.* (2000) for some examples).

To be more specific, assume that

then, the second order regular variation property is obtained using , and then, if goes to infinity too fast, then the estimator will be biased. More precisely (see Chapter 6 in Embrechts *et al.* (1997)), if , then, for some ,

The intuitive interpretation of this result is that if is too large, and if the underlying distribution is not *exactly* a Pareto distribution (and we do have this second order property), then Hill’s estimator is biased. This is what we mean when we say

- if is too large, is a biased estimator
- if is too small, is a volatile estimator

(the later comes from properties of a sample mean: the more observations, the less the volatility of the mean).

Let us run some simulations to get a better understanding of what’s going on. Using the previous code, it is actually extremly simple to generate a random sample with survival function

> beta=.5 > S=function(x){+ ifelse(x>1,.5*x^(-alpha)*(1+x^(-beta)),1) } > Q=function(p){uniroot(function(x) S(x)-(1-p),lower=1,upper=1e+9)$root}

If we use the code above. Here, with

> n=500 > set.seed(1) > X=Vectorize(Q)(runif(n))

the Hill plot becomes

> library(evir) > hill(X) > abline(h=alpha,col="blue")

But it’s based on one sample, only. Again, consider thousands of samples, and let us see how Hill’s estimator is behaving,

so that the (empirical) mean of those estimator is

# Likelihood Based Methods, for Extremes

This week, in the MAT8595 course, we will start the section on inference for extreme values. To start with something simple, we will use maximum likelihood techniques on a Generalized Pareto Distribution (we’ve seen Monday Pickands-Balkema-de Hann theorem).

**Maximum Likelihood Estimation**

In the context of parametric models, the standard technique is to consider the maximum of the likelihood (or the log-likelihod).Let denote the parameter (with ). Given some – stnardard – technical assumptions, such as , or on some neighbourhood of , then

where denotes Fisher information matrix (see any textbook for *mathematical statistics* courses). Consider here some i.i.d. sample, from a Generalized Pareto Distribution, with parameter , so that

If we solve (numerically) the first order condition of the maximum likelihood, we get an estimator which satisfies

The idea of this asymptotic normality is the following : if the *true *distribution of the sample is a GPD with parameter , then, if is large enough, then will have a joint normal distribution. So if we generate a lot of sample (sufficently large, say 200 observations), then the scatterplot of the estimator should the same as the scatterplot of a Gaussian distribution,

> library(evir) > n=200 > param=matrix(NA,1000,2) > for(s in 1:1000){ + x=rgpd(n,xi=1/1.5,beta=1) + param[s,]=gpd(x,0)$par.ests + } > m=apply(param,2,mean) > S=var(param) > library(mnormt) > x=seq(min(param[,1])-.05,max(param[,1])+.05,length=101) > y=seq(min(param[,2])-.05,max(param[,2])+.05,length=101) > vx=rep(x,each=length(y)) > vy=rep(y,length(x)) > vz=dmnorm(cbind(vx,vy),m,S) > z=matrix(vz,length(y),length(x)) > COL=rev(heat.colors(100)) > image(x,y,z,col=COL) > points(param)

and to get a 3d representation

> x=seq(min(param[,1])-.05,max(param[,1])+.05,length=31) > y=seq(min(param[,2])-.05,max(param[,2])+.05,length=31) > vx=rep(x,each=length(y)) > vy=rep(y,length(x)) > vz=dmnorm(cbind(vx,vy),m,S) > z=matrix(vz,length(y),length(x)) > persp(x,y,t(z),shade=TRUE,col="green",theta=-30,phi=20,ticktype="detailed", + xlab="xi",ylab="sigma")

With 200 observations, if the true underlying distribution is a GPD, then, indeed, the joint distribution of seems to be normal. That would be interesting to generate some confidence intervals for instance, or define some tests.

To go further, see any standard textbook on statistical mathematics, e.g. Casella & Berger (2002).

**Delta Method**

Another important property is the so called delta-method (we’ve seen Monday in class that it was obtained easily using a first order Taylor expansion). The idea is that if is asymptotically normal, and if is sufficently smooth, then will also be asymptotically Gaussian. More precicely (see also the header of this blog)

From this property, we can get the normality of (which is another parametrization used in extreme value models), or on any quantile, . Let us run some simulation, one more time to check that we actually have a joint normality.

> library(evir) > n=200 > param=riskm=matrix(NA,1000,2) > for(s in 1:1000){ + x=rgpd(n,xi=1/1.5,beta=1) + param[s,]=gpd(x,0)$par.ests + xihat=param[s,1] + shat=param[s,2] + q=shat * (.01^(-xihat) - 1)/xihat + tvar=q+(shat + xihat * q)/(1 - xihat) + riskm[s,]=c(1/xihat,q) + } > m=apply(riskm,2,mean) > S=var(riskm) > library(mnormt) > x=seq(min(riskm[,1])-.05,max(riskm[,1])+.05,length=101) > y=seq(min(riskm[,2])-.05,max(riskm[,2])+.05,length=101) > vx=rep(x,each=length(y)) > vy=rep(y,length(x)) > vz=dmnorm(cbind(vx,vy),m,S) > z=matrix(vz,length(y),length(x)) > image(x,y,t(z),col=COL) > points(riskm)

As we can see bellow, with samples of size 200, we cannot use this asymptotical result: it looks like we do not have enought data. But if we run the same code with

> n=5000

We get the joint normality of and . This is what we can get from this result, called delta-method in statistical textbooks. See again Casella & Berger (2002) for more details.

**Profile Likelihood**

Another interesting tool is the concept of profile likelihood. This would be interesting here since the main interest is the tail index , being here some kind of auxilary parameter. See Venzon & Moolgavkar (1988) for more details. Here, we will plot

But more generally, it is possible to consider

where is the set of *interesting* parameters. Then (under standard suitable conditions) we can prove that

which can be used to derive confidence intervals. In the GPD case, for each , we have to find an optimal . We compute the (profile) likelihood i.e. . And we can compute the maximum of this profile likelihood. This two-stage optimization is, in general, not equivalent with the (global) maximization of the likelihood, as computed below

> n=500 > set.seed(1) > x=rgpd(n,xi=1/1.5,beta=1) > loglikelihood=function(xi,beta){ + sum(log(dgpd(x,xi,mu=0,beta))) } > XIV=(1:300)/100;L=rep(NA,300) > for(i in 1:300){ + XI=XIV[i] + profilelikelihood=function(beta){ + -loglikelihood(XI,beta) } + L[i]=-optim(par=1,fn=profilelikelihood)$value } > plot(XIV,L,type="l") > XIV[which.max(L)] [1] 0.67 > gpd(x,0)$par.ests xi beta 0.6730145 0.9725483

We are not far away. Actually, if we want to compute the maximum of the profile likelihood (and not only compute the values of the profile likelihood on a grid, as before), we use

> PL=function(XI){ + profilelikelihood=function(beta){ + -loglikelihood(XI,beta) } + return(optim(par=1,fn=profilelikelihood)$value)} > (OPT=optimize(f=PL,interval=c(0,3))) $minimum [1] 0.6731025 $objective [1] 822.5574

Observe that, indeed, we are not far away from the maximum likelihood estimator of (I believe that it’s mainly a computational issue here, and theat the two are similar, here… actually, I’d be glad to hear about cases where maximum of the profile likelihood is not the same as the maximum of the likelihood). The interesting point is that we can use this technique to compute a confidence interval, and even visualize it on a graph

> up=OPT$objective > abline(h=-up) > abline(h=-up-qchisq(p=.95,df=1),col="red") > I=which(L>=-up-qchisq(p=.95,df=1)) > lines(XIV[I],rep(-up-qchisq(p=.95,df=1),length(I)), + lwd=5,col="red") > abline(v=range(XIV[I]),lty=2,col="red")

The vertical lines are the lower and the upper bound of a 95% confidence interval for parameter .

To go further, see Murphy, S.A & van der Vaart, A.W. (2000). *On Profile Likelihood*.

# Central Limit Theorem

This week, in the MAT8595 course, before proving Fisher-Tippett theorem, we will get back on the proof of the Central Limit Theorem, and the class of stable distribution (in Lévy’s sense). In order to illustrate the problem of heavy tails, on the behavior of the mean, consider a sequence of i.i.d. Gaussian random variables ‘s. On top, we visualize the sequence, and below, we visualize the associate random walk

(the central limit theorem will give a limiting distribution for in the case where the variance of the ‘s is finite)

If we consider a sequence of i.i.d. random variables ‘s whith heavier tails (possibly with infinite variance), we can still define , but as we can see below, can be quite heratic.

As we will see this Thursday, the key to derive stable distribution for the central limit theorem, or possible limiting distributions for the maximum is Cauchy’s function equation. I strongly recommand to look at the proof.

# Copules et valeurs extrêmes, syllabus

Le plan de cours pour le cours MAT8595 *Copules et Valeurs Extremes** *est en ligne. L’entente d’évaluation sera signée au premier cours, ce lundi à 9:00 (salle **SH-2140**). D’autre billets seront mis en ligne dans les jours à venir, avec quelques exercices, et les articles qui serviront de base pour les projets, sur http://freakonometrics.hypotheses.org/courses/copulas-and-extremes.

# Graduate Course on Copulas and Extreme Values

This Winter, I will be giving a (graduate) course on extreme values, and copulas (more generally multivariate models and dependence), MAT8595. It is an ISM course, and even if it will probably be given in French, I will upload information here, in English. I will upload the (detailed) syllabus of the course during the Christmas holidays. But to give an overview, for those willing to register, the first part of the course will focus on **extreme value theory**. The references will be

- Beirlant, J, Goegebeur, Y., Segers, S. & Teugels, J. (2004)
*Statistics of Extremes*. Wiley. - Embrechts, P. Kluppelberg, C. & Mikosch, T. (1997)
*Modelling Extremal Events*. Springer Verlag. - Resnick, S. (1987).
*Extreme Value, Regular Variation and Point Processes.*Springer Verlag.

The second part of the course will be on **multivariate distributions**. The references will be

- Joe, H. (1997).
*Multivariate Models and Multivariate Dependence Concepts*. Chapman & Hall. - Nelsen, R. (1999).
*An Introduction to Copulas*. Springer Verlag.

Specific references and more details about the chapters will be given during the course. I will upload exercises this winter, as well as a list of articles that will be used for projects. Examples will be illustrated using R functions from dedicated packages.

Grades will be based on exercises (homework), report (based on a published paper) and final writen exam.