Default Priors for Robust Bayesian Inference
George Casella1 and Martin T. Wells Cornell University

BU-1124-MA

January 1992

Robust Bayesian inference involves examining the performance of Bayes rules from a class of prior distributions. However, given a specified class of priors, there is no mechanism for choosing a reasonable default option, that is, a robust prior that is somehow noninformative. We show that, by applying results of classical robustness theory, such priors can be easily defined. These priors display attractive robustness properties and also provide a means for "touring" through a class of priors.

1Research supported by National Science Foundation Grant #DMS9100839 and National Security Agency Grant No. 90F073.

1. Introduction
One attractive feature of the Bayesian approach to statistical inference is its simplicity. Once the loss function and prior distribution have been chosen the calculation of the Bayes rule is a rather straightforward procedure. This simplicity makes Bayesian statistics quite appealing. However, when there is little or no prior information available the Bayesian is at a loss. In this case a Bayesian sometimes constructs a default or "noninformative prior," that is, a prior that should contain no information about the parameter of interest. The problem of constructing noninformative priors has always attracted a fair amount of attention. For example, see Berger (1985), Box and Tiao (1973), Jeffreys (1961), and Zellner (1987). Perhaps most importantly Bernardo (1979) introduced the concept of reference priors, reviewed in Berger and Bernardo (1991).
A procedure for generating default priors was introduced by Jeffreys (1946). Assume that the probability measures in some class $\mathcal{P}$ are absolutely continuous with respect to a dominating measure $\mu$, and let $f(x \mid \theta)$ denote the density of $P_\theta \in \mathcal{P}$ at the observed value $x$. Jeffreys recommended the prior

$$\pi_J(\theta) = |I(\theta)|^{1/2}, \tag{1.1}$$

where

$$I(\theta) = E_\theta\left[\nabla \log f(x \mid \theta)\,\big(\nabla \log f(x \mid \theta)\big)'\right]$$

is the Fisher information matrix at $\theta$. Here $\nabla \log f(x \mid \theta)$ is the vector of first partial derivatives of $\log f(x \mid \theta)$ evaluated at $\theta$. It may be shown that $\pi_J$ has the desirable property of parameter invariance.

Jeffreys' original derivation of $\pi_J$ was based on its connection to the Kullback-Leibler divergence metric (Kullback, 1959)

$$D_K(P_\theta, P_\phi) = \int f(x \mid \theta)\,\log\{f(x \mid \theta)/f(x \mid \phi)\}\,d\mu(x)$$

as well as the squared Hellinger distance $D_H(P_\theta, P_\phi)$, where $\mu$ is a dominating measure. Essentially, Jeffreys' derivation consisted of showing that $D_K$ and $D_H$ behave locally like $I(\theta)$. This can be seen by observing that for $D = D_H$ or $D = D_K$

(1.2)

where $\nabla^2$ is the Hessian with respect to $\phi$. Since

(1.3)

it follows that $D_H$ and $D_K$ behave locally like Fisher information. Also, since $I(\theta)$ is positive definite, (1.2) and (1.3) imply that $\pi_J$ is the solution to a minimization problem.
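The local behavior described above can be checked numerically. The sketch below is only an illustration, not part of the original derivation: it assumes a single Bernoulli($\theta$) observation, the normalization $\sum(\sqrt{f}-\sqrt{g})^2$ for the squared Hellinger distance, and an arbitrary value $\theta = 0.3$. Under those assumptions the Kullback-Leibler divergence should approach $\tfrac12 I(\theta)(\phi-\theta)^2$ and the squared Hellinger distance should approach $\tfrac14 I(\theta)(\phi-\theta)^2$ as $\phi \to \theta$.

```python
import math

def kl_bernoulli(theta, phi):
    """Kullback-Leibler divergence D_K(P_theta, P_phi) for one Bernoulli trial."""
    return (theta * math.log(theta / phi)
            + (1 - theta) * math.log((1 - theta) / (1 - phi)))

def hellinger_sq_bernoulli(theta, phi):
    """Squared Hellinger distance, assuming the normalization sum (sqrt f - sqrt g)^2."""
    return ((math.sqrt(theta) - math.sqrt(phi)) ** 2
            + (math.sqrt(1 - theta) - math.sqrt(1 - phi)) ** 2)

def fisher_information(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1 - theta))

theta = 0.3  # illustrative parameter value (an assumption)
ratios = []
for step in (1e-1, 1e-2, 1e-3):
    phi = theta + step
    info = fisher_information(theta)
    # Ratios of each divergence to its quadratic approximation in the step.
    kl_ratio = kl_bernoulli(theta, phi) / (0.5 * info * step ** 2)
    hel_ratio = hellinger_sq_bernoulli(theta, phi) / (0.25 * info * step ** 2)
    ratios.append((step, kl_ratio, hel_ratio))
    print(f"step={step:g}  KL ratio={kl_ratio:.4f}  Hellinger ratio={hel_ratio:.4f}")
```

Both ratios move toward 1 as the step shrinks, which is the sense in which the two divergences "behave locally like Fisher information."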

Jeffreys' prior can also be derived as the solution to an asymptotic minimization problem, as shown by Clarke and Barron (1990). (See also Polson, 1988.) This development is similar in spirit to the reference prior approach of Bernardo (1979), and establishes the asymptotic expansion

$$E\,D_K\big(\pi(\theta \mid x), \pi(\theta)\big) = \frac{1}{2}\log\frac{n}{2\pi e} + \int \pi(\theta)\,\log\frac{|I(\theta)|^{1/2}}{\pi(\theta)}\,d\theta + o(1) \tag{1.4}$$

for the Kullback-Leibler divergence between the posterior and prior distribution (with $n$ the sample size). Maximizing (1.4) over all priors $\pi$ should yield a prior that imparts the least information, and Clarke and Barron (1990) show that a limiting form of (1.4) is maximized by $\pi_J = |I(\theta)|^{1/2}$. Extensions of this result to the case of nuisance parameters have been given by Clarke and Wasserman (1991).
These derivations of $\pi_J$ lead us to interpret it as a "noninformative" prior. However, in the absence of good prior information it is not enough to require only that a default prior be noninformative. Since the Bayesian decision rules are fundamentally dependent on the prior information there must be concern about the influence of the prior. Such concerns have prompted Bayesians to construct procedures which are robust with respect to the specification of the prior distribution. The groundwork for robust Bayes analysis was laid out by Berger (1985), and more recently surveyed by Berger (1990) and Wasserman (1991).
A natural means to investigate robustness with respect to a prior is to specify a class $\Gamma$ of plausible prior distributions and see how the choices among the priors in $\Gamma$ affect the constructed decision procedures. Let $\rho(\pi)$ be some posterior quantity of interest (such as a posterior mean, variance, or credible set probability). For the class of priors $\Gamma$ much research in robust Bayesian analysis has concentrated on evaluating

$$\rho^L = \inf_{\pi \in \Gamma} \rho(\pi) \quad\text{and}\quad \rho^U = \sup_{\pi \in \Gamma} \rho(\pi).$$

If the range $(\rho^L, \rho^U)$ is small enough that the prior is deemed to be noninfluential, the prior is called robust.
Perhaps the most common classes $\Gamma$ that have been considered are neighborhood classes, one example of which is the $\epsilon$-contamination class of priors. For a single elicited (baseline) prior $\pi_0$, the class is

$$\Gamma_\epsilon(\pi_0) = \{\pi : \pi = (1-\epsilon)\pi_0 + \epsilon q,\ q \in Q\}.$$

Here $\epsilon$ ($0 < \epsilon \le 1$) reflects the amount of uncertainty in the elicited prior $\pi_0$, and $Q$ is a class of densities determining the amount of contamination which is to be mixed with $\pi_0$. The choice of $Q$ usually reflects the researcher's notion of a reasonable prior. Possibilities for $Q$ include

$Q_A$ = {all distributions $q$},
$Q_U$ = {all distributions $q$ with mode $\theta_0$ (fixed)},
$Q_{SU}$ = {all symmetric unimodal distributions $q$ with mode $\theta_0$ (fixed)}.

The class with $Q_A$ is usually the easiest to work with since there are simple characterizations of $\rho^L$ and $\rho^U$ for various statistical functions (see Huber, 1973). Results for the classes $Q_U$ and $Q_{SU}$ are contained in Sivaganesan and Berger (1989).
Other common neighborhood classes are the Kolmogorov and Levy neighborhoods given by

$$\Gamma^{KS}_{\epsilon}(\pi_0) = \{P : P_0(\theta) - \epsilon \le P(\theta) \le P_0(\theta) + \epsilon, \text{ for all } \theta\},$$

$$\Gamma^{L}_{\epsilon,\delta}(\pi_0) = \{P : P_0(\theta - \delta) - \epsilon \le P(\theta) \le P_0(\theta + \delta) + \epsilon, \text{ for all } \theta\},$$

respectively, where $\pi_0 = dP_0/d\mu$. Although the Levy neighborhood structure is an important structure in robust estimation theory (the neighborhoods are based on the Levy distance, which metrizes the weak topology), calculations of $\rho^L$ and $\rho^U$ have not been carried out for these neighborhoods. The Kolmogorov and Levy neighborhood classes are related to each other because $\Gamma^{KS}_{\epsilon} = \Gamma^{L}_{\epsilon,0}$. Also, they are related to the density and distribution bounded class, which has been extensively studied by Lavine (1991a,b). Given two specified bounding functions $L$ and $U$, the density bounded class is given by

$$\Gamma_{L,U} = \{P : L(\theta) \le P(\theta) \le U(\theta), \text{ for all } \theta\}.$$

Note that for the choice $L(\theta) = P_0(\theta - \delta) - \epsilon$ and $U(\theta) = P_0(\theta + \delta) + \epsilon$, the Levy class is equal to the distribution bounded class. This identification reduces the complexity of the density bounded class dramatically. For details on other classes see the review articles by Berger (1990) and Wasserman (1991).
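As a small illustration of these neighborhood definitions, the sketch below checks membership on a finite grid. Everything concrete in it is an assumption made for the example: the baseline $P_0$ is taken to be standard normal, the candidate $P$ is the same distribution shifted by 0.1, and the function names are hypothetical. A grid check can only approximate the "for all $\theta$" requirement.

```python
from statistics import NormalDist

def in_levy_neighborhood(cdf, base_cdf, eps, delta, grid):
    """Check P0(t - delta) - eps <= P(t) <= P0(t + delta) + eps at every grid point."""
    return all(
        base_cdf(t - delta) - eps <= cdf(t) <= base_cdf(t + delta) + eps
        for t in grid
    )

def in_kolmogorov_neighborhood(cdf, base_cdf, eps, grid):
    """Kolmogorov neighborhood, treated as the Levy neighborhood with delta = 0."""
    return in_levy_neighborhood(cdf, base_cdf, eps, 0.0, grid)

base = NormalDist(0.0, 1.0).cdf       # assumed baseline P0
shifted = NormalDist(0.1, 1.0).cdf    # assumed candidate P
grid = [-8.0 + 0.01 * i for i in range(1601)]

ks_wide = in_kolmogorov_neighborhood(shifted, base, 0.05, grid)
ks_narrow = in_kolmogorov_neighborhood(shifted, base, 0.03, grid)
levy_ok = in_levy_neighborhood(shifted, base, 0.001, 0.1, grid)
print(ks_wide, ks_narrow, levy_ok)
```

The shifted normal differs from the baseline by about 0.04 in Kolmogorov distance, so it falls inside the Kolmogorov neighborhood for $\epsilon = .05$ but not for $\epsilon = .03$, while a Levy neighborhood with $\delta = 0.1$ absorbs the shift with almost no $\epsilon$.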
Now the question arises, "How can one be a robust Bayesian while at the same time trying to use a prior that is somewhat noninformative?" That is, given a class of priors $\Gamma$, what is the single prior $\pi \in \Gamma$ such that, among the class $\Gamma$, $\pi$ is the least informative? At first thought this seems like a difficult constrained optimization problem; however, we are able to apply results from the classical parametric robustness literature, specifically applications of results of Huber (1973), to solve these problems and construct solutions. In the next section we will survey the relevant results and give some applications to the construction of robust priors.
The remainder of the paper is organized as follows. In Section 2 we discuss methods for deriving robust default priors, and show connections with Bayesian decision theory. Section 3 gives details for many common classes of priors, and exhibits the form of a robust default prior and investigates some of the behavior. Section 4 has a concluding discussion.
2. Robust Default Priors
In this section we outline a strategy for constructing a robust default prior based on a Bayesian decision theoretic argument. That is, we look for priors that (approximately) maximize a Bayes risk. We also show that, in many common problems, the decision-theoretic answer can also be arrived at through classical robustness theory. These problems are ones whose information is constant in the parameter of interest, and include location and regression problems.


2.1 A Decision-Theoretic Approach

One possible method for generating a robust default prior is to find a prior that maximizes the Bayes risk of a procedure over some class, $\Gamma$, of priors. This is the $\Gamma$-minimax approach to robustness. (Berger (1985) gives a full review of the literature.) The main difficulty encountered in the $\Gamma$-minimax approach is that of solving the variational problem which produces the default prior. This problem is remedied by using an approximation of the Bayes risk due to Brown (1988) and Brown and Gajek (1990). Once this approximation is developed it yields a more straightforward $\Gamma$-minimax problem where the prior solves a second order linear differential equation restricted to the class $\Gamma$. Fortunately, solutions of these equations exist in the literature.

We will first develop the approximation to the Bayes risk. Let $X$ be an observable random variable with probability density $f(x \mid \theta)$ relative to some $\sigma$-finite measure $\mu$. Assume $\theta \in \Theta$, where $\Theta \subset \mathbb{R}$ is a possibly infinite interval. Suppose it is desired to estimate $\theta$ by $\delta \in \mathbb{R}$ under the loss $L(\theta, \delta) = m(\theta)(\delta - \theta)^2$, where $m > 0$ is a specified weight function. Let $R(\theta, \delta) = E_\theta L(\theta, \delta(X))$ denote the risk function of the non-randomized estimator $\delta$, and let $\pi$ be a prior probability density with respect to Lebesgue measure on $\Theta$. For any estimator $\delta$ define the integrated risk and Bayes risk as

$$r(\pi, \delta) = \int R(\theta, \delta)\,\pi(d\theta) \quad\text{and}\quad r(\pi) = \inf_{\delta} r(\pi, \delta),$$
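respectively. Let $V(\theta) = I(\theta)^{-1}$, the inverse of Fisher information. Brown and Gajek (1990) proved, under certain regularity conditions, that

$$r(\pi) \ge \frac{C^2}{C + D}, \tag{2.1}$$

where

$$C = \int V(\theta)h(\theta)\,d\theta, \qquad D = \int \frac{[(Vh)'(\theta)]^2}{h(\theta)}\,d\theta, \qquad h(\theta) = m(\theta)\pi(\theta).$$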

Simulation results indicate that this bound becomes sharp as the sample size tends to infinity. Note that this inequality is a function of the loss function, the prior and sampling density, three elements in any Bayesian decision theory problem. Also, the inequality is invariant under transformations on the sample space since the sampling density enters only through Fisher information. If $f(x \mid \theta)$ is in the exponential family this inequality is related to Brown's (1971) heuristic method of proving admissibility.
We shall proceed as follows. First, we will assume that the inequality in (2.1) is sharp enough to be a good approximation of the Bayes risk. Next we will find the prior that maximizes (2.1) over the class of priors of interest. This prior will give us an approximation of the $\Gamma$-minimax Bayes risk. Since this approach is approximate, we will need to examine the default priors to check for sensibility. In examples in the next section, the default priors turn out to be quite reasonable. Most have the properties that the posterior behaves like the base prior in the center of the distribution and like the likelihood in the tails. Also, by construction, the default priors will have the desired invariance property of a Jeffreys prior.
As an alternative to working with the approximation (2.1), we could use an asymptotic approximation of $r(\pi, \delta)$. The article by Ghosh et al. (1982) gives such an approximation as well as a review of the research in this direction. Examination of the rather complex expansion of Ghosh et al. will show that the prior enters there in the form of $C$ and $D$ in (2.1). Hence, this alternate route will yield the same results as the maximization of (2.1). This is related to the approach proposed by Clarke and Barron (1990).
The inequality in (2.1) has interesting implications for conjugate priors. Suppose $m = 1$ and $f(x \mid \theta)$ is an exponential family with expectation parameter $\theta$ and $\pi$ is a conjugate prior. Then (and only then), subject to mild regularity conditions, (2.1) is actually an equality. This follows since both the Bayes rule and $(V\pi)'/\pi$ are linear. Hence the Bayes risk attains its lower bound and is therefore minimized. If $m \ne 1$, then equality holds if and only if $\pi$ is proportional to a conjugate prior density. Again the Bayes risk attains its minimum. As we are searching for the prior that maximizes the Bayes risk to generate robust priors, the prior that minimizes the Bayes risk will lack robustness. The fact that conjugate priors lack robustness has been noted earlier by Berger (1984).
Example 2.1: To illustrate the proposed technique consider the problem of observing $X \sim$ binomial$(n, p)$, and estimating $\theta = \log(p/(1-p))$ under the normalized loss $L(\theta, \delta) = I(\theta)(\delta - \theta)^2$, where $I(\theta) = n e^{\theta}/(1 + e^{\theta})^2$ is the Fisher information. Maximization of the bound in (2.1) requires finding the prior $\pi$ which minimizes $D$. The solution to this calculus-of-variations problem follows by setting $\sqrt{\pi} = u$ and solving the Euler equation $(u'/I)' - \lambda u = 0$, where $\lambda$ is a Lagrange multiplier used to incorporate the constraint that $\pi$ is a proper density. The solution of the Euler equation, in terms of $\pi$, is $\pi(\theta) = 6e^{2\theta}/(1 + e^{\theta})^4$. This corresponds to a prior, in terms of $p$, of $\pi(p) = 6p(1-p)$. Note that the Jeffreys prior for this problem is $\pi_J(p) \propto (p(1-p))^{-1/2}$. The Jeffreys prior does not take the weighted loss function into account whereas the other prior does. It is interesting to note that $\pi(p)$ puts the most mass near one-half whereas $\pi_J(p)$ puts more mass near zero and one. Also $\pi(p)$ is symmetric and concave whereas $\pi_J(p)$ is symmetric and convex.
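The claims of Example 2.1 can be checked numerically. The sketch below is an illustration under stated assumptions: the sample size `N_TRIALS` is hypothetical, derivatives are taken by central differences, and the multiplier value $\lambda = -2/n$ is worked out here by substituting $u = \sqrt{6}\,p(1-p)$ into the Euler equation rather than quoted from the text. It verifies that the prior integrates to one on the $\theta$ scale, that the change of variables gives $6p(1-p)$, and that the Euler equation holds with that multiplier.

```python
import math

N_TRIALS = 10  # hypothetical binomial sample size n, chosen only for illustration

def prior_theta(theta):
    """Default prior on the log-odds scale: 6 e^{2 theta} / (1 + e^theta)^4."""
    e = math.exp(theta)
    return 6.0 * e ** 2 / (1.0 + e) ** 4

def fisher_info(theta, n=N_TRIALS):
    """Fisher information n e^theta / (1 + e^theta)^2 for the log-odds."""
    e = math.exp(theta)
    return n * e / (1.0 + e) ** 2

def u(theta):
    """Square root of the prior density."""
    return math.sqrt(prior_theta(theta))

def derivative(func, x, h=1e-3):
    """Central-difference approximation to func'(x)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

def euler_lhs(theta):
    """(u'/I)' evaluated by nested central differences."""
    return derivative(lambda t: derivative(u, t) / fisher_info(t), theta)

# 1. Normalization on the theta scale (equally spaced sum; the tails are negligible).
grid_step = 0.01
mass = grid_step * sum(prior_theta(-30.0 + grid_step * i) for i in range(6001))

# 2. Change of variables p = e^theta / (1 + e^theta), with dp/dtheta = p (1 - p).
change_of_variable_gap = max(
    abs(prior_theta(math.log(p / (1 - p))) / (p * (1 - p)) - 6 * p * (1 - p))
    for p in (0.1, 0.3, 0.5, 0.9)
)

# 3. Euler equation (u'/I)' - lambda u = 0 with the derived multiplier lambda = -2/n.
lam = -2.0 / N_TRIALS
euler_residual = max(abs(euler_lhs(t) - lam * u(t)) for t in (-2.0, 0.0, 1.5))

print(mass, change_of_variable_gap, euler_residual)
```

On the $p$ scale the resulting prior is the Beta(2, 2) density, which is the sense in which it concentrates mass near one-half.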
2.2 Connections with Classical Robustness

We now investigate the parallels between the problem at hand and Huber's minimax variance theory. Let $x_1, \ldots, x_n$ be a random sample from a population with distribution function $F(x - \theta)$, where $\theta$ is an unknown location parameter. It is assumed that $F$ is an unknown member of a specified convex, vaguely compact neighborhood, $\mathcal{F}$, of a fixed baseline (or ideal) distribution $G$. Suppose $\{T_n\}$ is a sequence of estimators of $\theta$ such that $\sqrt{n}(T_n - \theta)$ converges in distribution to the normal law with mean equal to zero and variance $V(T, F)$. Let $F_0$ be the distribution which minimizes Fisher information, $I(F)$, over all $F \in \mathcal{F}$. If $T_0$ denotes an estimator which is asymptotically efficient at $F_0$, that is, $V(T_0, F_0) = I^{-1}(F_0)$, then the minimum value of $\sup\{V(T, F) : F \in \mathcal{F}\}$ is $I^{-1}(F_0)$, which is attained at $T_0$. Thus the problem of finding the form of the minimax variance estimator corresponding to particular neighborhood models may be solved by finding the distribution, $F_0$, with minimum Fisher information in $\mathcal{F}$.
The link between classical robustness and Bayesian robustness is most easily seen in problems where the Fisher information is constant in $\theta$ (such as location and regression problems). The classical statistician is looking for the distribution with minimum Fisher information, which is the distribution that minimizes

$$I(F) = \int \frac{[f'(x)]^2}{f(x)}\,dx, \qquad f = F', \tag{2.2}$$

over a specified class of distributions. If $m(\theta) = 1$ in (2.1), then minimization of (2.2) is exactly equivalent to maximization of (2.1), the goal of Bayesian robustness. Thus, in these cases, we can use the minimizers of (2.2) as our robust default priors.
In the classical robustness literature many classes of distributions have been studied, and distributions with minimum Fisher information have been explicitly found in many cases. Perhaps the neighborhood class which has been most thoroughly studied is the $\epsilon$-contamination model, $\Gamma_\epsilon(G)$, where $G$ is a fixed distribution symmetric about zero. Huber (1981) gives a complete solution for strongly unimodal $G$. When the unimodality of $G$ is no longer assumed the results are no longer as complete. However, Collins and Wiens (1985) find the minimum information distribution $F_0$ for a quite general base distribution $G$.
Another neighborhood model which has received attention is the Kolmogorov neighborhood model $\Gamma^{KS}_{\epsilon}(G)$, where $G$ is a fixed symmetric distribution. The most complete results pertain to the special case where $G$ is the normal distribution. In this case the minimum information distribution may be found using results of Huber (1964) for $\epsilon < .0303$ and by Sacks and Ylvisaker (1972) for $\epsilon \ge .0303$. Calculations by Wiens (1986) may be used to obtain the minimum information distribution of nonnormal $G$, subject to various regularity conditions. The Kolmogorov neighborhood may be extended to cover the important Levy neighborhood structure, $\Gamma^{L}_{\epsilon,\delta}(G)$. The minimum information distribution is given in Collins and Wiens (1989).
In the next section we give several examples of prior distributions which maximize (2.1), hence are approximately $\Gamma$-minimax, for several classes of neighborhood priors. These will give examples of robust default priors.
When working in some class of priors one must distinguish between the central portion of the prior and the tail. Most Bayes procedures will be somewhat robust in the central portion of the prior, but it is rare that they are robust to changes in the tail areas. Furthermore, it is quite difficult to elicit prior information in the tails of the prior. If the likelihood gives most of its weight to the tails, this is strong evidence that the prior has probably been misspecified. Rubin (1977) also shows that risk robustness tends to be much worse if the tails are influential.

3. Examples

3.1 The ε-contamination class

As the first example consider the $\epsilon$-contamination class $\Gamma_\epsilon(\Phi)$, where $\Phi$ is the standard normal distribution. We can use the results of Huber (1973) to maximize (2.1) and hence find that the robust noninformative prior $\pi_\epsilon(\theta)$, for specified $\epsilon$, is given by

$$\pi_\epsilon(\theta) = \frac{1-\epsilon}{\sqrt{2\pi}} \begin{cases} \exp\{-\theta^2/2\}, & |\theta| \le k, \\ \exp\{k^2/2 - k|\theta|\}, & |\theta| > k, \end{cases} \tag{3.1}$$

where $k$ is a function of $\epsilon$ chosen so that $\pi_\epsilon(\theta)$ integrates to 1. (Table 4.5.1 of Huber (1981) gives values of $k$.) By varying $\epsilon$, we can examine a variety of these priors and effectively tour the class. Note that as $\epsilon$ increases the prior information becomes more diffuse and hence the procedure becomes more robust. These priors are illustrated in Figure 3.1 for a number of values of $\epsilon$.
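A minimal sketch of how such a prior could be computed follows. It assumes the usual Huber form of the density, with a standard normal center scaled by $(1-\epsilon)$ and exponential tails joined at $\pm k$; under that assumption the requirement that the density integrate to 1 reduces to $2\varphi(k)/k - 2\Phi(-k) = \epsilon/(1-\epsilon)$, which is solved here by bisection. The function names are hypothetical and the values of $\epsilon$ are those listed for Figure 3.1.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normalization_gap(k, eps):
    """2*phi(k)/k - 2*Phi(-k) - eps/(1 - eps); zero when the density integrates to 1."""
    return 2 * STD_NORMAL.pdf(k) / k - 2 * STD_NORMAL.cdf(-k) - eps / (1 - eps)

def solve_k(eps, lo=1e-6, hi=10.0, tol=1e-10):
    """Bisection for k with 0 < eps < 1; the gap is decreasing in k."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if normalization_gap(mid, eps) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def robust_prior(theta, eps, k):
    """Normal center and exponential tails, scaled by (1 - eps)/sqrt(2 pi)."""
    scale = (1 - eps) / math.sqrt(2 * math.pi)
    if abs(theta) <= k:
        return scale * math.exp(-theta ** 2 / 2)
    return scale * math.exp(k ** 2 / 2 - k * abs(theta))

def total_mass(eps, k, half_width=80.0, step=0.01):
    """Equally spaced sum over [-half_width, half_width]; the far tails are negligible."""
    n = int(2 * half_width / step)
    return step * sum(robust_prior(-half_width + step * i, eps, k) for i in range(n + 1))

k_values = {eps: solve_k(eps) for eps in (0.05, 0.25, 0.4)}
masses = {eps: total_mass(eps, k) for eps, k in k_values.items()}
for eps, k in k_values.items():
    print(f"eps={eps}: k={k:.3f}, total mass={masses[eps]:.5f}")
```

Larger $\epsilon$ gives a smaller $k$, so the exponential tails take over closer to the center, which matches the remark that the prior becomes more diffuse as $\epsilon$ increases.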
To demonstrate the robustness of these priors we consider the example treated by Berger and Berliner (1986) and Wasserman (1989). Berger and Berliner investigated the robustness of the HPD region using a normal prior, and Wasserman looked at the robustness of the likelihood region. In each case, robustness was investigated using the class $\Gamma_\epsilon(\pi_0)$, where $\pi_0$ = normal$(\mu, \tau^2)$. The robustness of an interval $C(x)$ is quantified by calculating

$$\delta(x) = \sup_{\pi \in \Gamma_\epsilon} P(\theta \in C(x) \mid x) - \inf_{\pi \in \Gamma_\epsilon} P(\theta \in C(x) \mid x). \tag{3.2}$$

Calculation of $\delta(x)$ is quite simple, as Huber (1973) provides a formula (also used by Wasserman, 1989).
We consider $x \sim$ normal$(\theta, \sigma^2)$ and $\theta \sim$ normal$(\mu, \tau^2)$, with $\sigma^2 = 1$, $\tau^2 = 2$ and $\mu = 0$. The following small table shows the robustness behavior of the three intervals.

Table 1
Comparison of Credible Regions (1 - α = .9, ε = .25)

                      x = 0.5                     x = 4.0
                Region            δ         Region           δ
HPD normal      (-1.01, 1.68)    0.246      (1.33, 4.01)    0.893
Likelihood      (-1.15, 2.15)    0.146      (2.36, 5.65)    0.756
HPD robust      (-1.03, 1.78)    0.224      (1.89, 5.18)    0.821
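The HPD normal row of Table 1 can be reproduced with a short calculation. The sketch below is an illustration under assumptions: it takes the contamination class with all distributions $q$ allowed, and it obtains the extreme posterior probabilities by placing the contaminating mass at the point of highest likelihood inside or outside the interval. Whether this coincides term by term with the formula of Huber (1973) is not checked here; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def normal_posterior(x, sigma2, mu, tau2):
    """Conjugate posterior of theta for x ~ N(theta, sigma2), theta ~ N(mu, tau2)."""
    mean = (tau2 * x + sigma2 * mu) / (sigma2 + tau2)
    var = sigma2 * tau2 / (sigma2 + tau2)
    return NormalDist(mean, math.sqrt(var))

def hpd_normal_region(x, sigma2, mu, tau2, level):
    """HPD interval of the normal posterior (symmetric about the posterior mean)."""
    post = normal_posterior(x, sigma2, mu, tau2)
    half = NormalDist().inv_cdf(0.5 + level / 2) * post.stdev
    return post.mean - half, post.mean + half

def sup_likelihood(x, sigma2, lo, hi, inside):
    """Supremum of the N(theta, sigma2) likelihood at x over [lo, hi] or its complement."""
    x_in = lo <= x <= hi
    if x_in == inside:
        theta = x  # the unconstrained maximizer theta = x is admissible
    else:
        theta = min((lo, hi), key=lambda end: abs(end - x))  # nearest endpoint
    return NormalDist(theta, math.sqrt(sigma2)).pdf(x)

def delta(x, sigma2, mu, tau2, eps, lo, hi):
    """sup minus inf of P(theta in [lo, hi] | x) over the eps-contamination class."""
    post = normal_posterior(x, sigma2, mu, tau2)
    p0 = post.cdf(hi) - post.cdf(lo)
    base = (1 - eps) * NormalDist(mu, math.sqrt(sigma2 + tau2)).pdf(x)  # (1-eps) m(x)
    s_in = sup_likelihood(x, sigma2, lo, hi, inside=True)
    s_out = sup_likelihood(x, sigma2, lo, hi, inside=False)
    inf_p = base * p0 / (base + eps * s_out)
    sup_p = (base * p0 + eps * s_in) / (base + eps * s_in)
    return sup_p - inf_p

SIGMA2, MU, TAU2, EPS, LEVEL = 1.0, 0.0, 2.0, 0.25, 0.9
results = {}
for x in (0.5, 4.0):
    lo, hi = hpd_normal_region(x, SIGMA2, MU, TAU2, LEVEL)
    results[x] = (lo, hi, delta(x, SIGMA2, MU, TAU2, EPS, lo, hi))
    print(f"x={x}: region=({lo:.2f}, {hi:.2f}), delta={results[x][2]:.3f}")
```

With these settings the computed values of $\delta$ agree with the HPD normal entries of Table 1 to three decimals, and the interval endpoints agree to within a unit in the last reported digit.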

The HPD robust region is calculated by finding the $1 - \alpha$ HPD region of the posteriors arising from the prior (3.1). Notice that, in each case, the likelihood region is the most robust, the HPD normal region is least robust, while the HPD robust region is in between. Moreover, the HPD robust region displays exactly the behavior that we want. When the data and prior agree ($x \approx 0$) the HPD robust region acts like the HPD normal region. However, when the data and prior disagree ($x$ far from 0) the HPD robust region behaves like the likelihood region (which is maximally robust, as shown by Wasserman, 1989). This behavior is illustrated in Figure 3.2, which shows the interval endpoints for a range of $x$. It can be seen that for $x$ near 0 the HPD robust region behaves similarly to the HPD normal region. However, as $x$ gets far from 0, the HPD robust region gets closer to the likelihood region, while the HPD normal region goes off on its own.
It is interesting to note that the HPD robust region will not converge to the likelihood region as $x \to \infty$. This is because the tails of $\pi_\epsilon(\theta)$, which are exponential, will always maintain some influence. This point is also made by Hwang and Casella (1991), where it is shown that if $C_E(x)$ is the HPD region using an exponential($\beta$) prior, then

where $Z$ is a standard normal random variable and $z_{\alpha/2}$ is the upper $\alpha/2$ cutoff. Upon reflection, such behavior is actually desirable if the prior is to be regarded as noninformative in any way. This is because if we truly are to have no information about $\theta$, then we should never allow the prior to be totally ignored. Even if $x \to \infty$, a noninformative prior should still leave doubts about the finiteness of $\theta$. If not, then it is somehow informative.
Extensions to other $\epsilon$-contamination models may also be developed, including multivariate and scale problems. When working on the multivariate problem one usually assumes that the distributions under consideration are spherically symmetric; that is, they have a density of the form $\pi(\theta) = \pi(|\theta|)$, where $\theta \in \mathbb{R}^d$ and $|\cdot|$ is the Euclidean norm. The problem may be transformed into a univariate problem by defining $\eta = |\theta|$. The density of $\eta$ is given by $\nu(\eta) = d\,C_d\,|\eta|^{d-1}\pi(\eta)$, where $C_d$ denotes the volume of the unit sphere in $\mathbb{R}^d$. When working on a scale problem the problem may also be transformed to a location problem by a logarithmic transformation.
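The radial transformation can be illustrated numerically. The sketch below makes several assumptions for the sake of the example: the dimension is 3, the spherically symmetric density is the standard multivariate normal, and $C_d$ is taken to be the volume of the unit ball, $\pi^{d/2}/\Gamma(d/2+1)$, so that $d\,C_d$ is the surface area of the unit sphere. It checks that the resulting density of the radius integrates to 1 and has the mean expected of a chi distribution with 3 degrees of freedom.

```python
import math

def unit_ball_volume(d):
    """Volume C_d of the unit ball in R^d (assumed meaning of C_d)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def radial_density(eta, d, profile):
    """nu(eta) = d * C_d * eta^(d-1) * profile(eta) for eta >= 0."""
    return d * unit_ball_volume(d) * eta ** (d - 1) * profile(eta)

def standard_normal_profile(d):
    """N(0, I_d) density written as a function of the radius only."""
    const = (2 * math.pi) ** (-d / 2)
    return lambda r: const * math.exp(-r * r / 2)

def integrate(func, lower, upper, n=40000):
    """Midpoint rule on [lower, upper]."""
    width = (upper - lower) / n
    return width * sum(func(lower + (i + 0.5) * width) for i in range(n))

DIM = 3  # illustrative dimension (an assumption)
profile = standard_normal_profile(DIM)
total_mass = integrate(lambda eta: radial_density(eta, DIM, profile), 0.0, 40.0)
mean_radius = integrate(lambda eta: eta * radial_density(eta, DIM, profile), 0.0, 40.0)
print(total_mass, mean_radius)
```

The same helper could be applied to any radial profile, so a univariate robust prior constructed for $\eta$ can be mapped back to a spherically symmetric prior on $\theta$.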

3.2 The Kolmogorov-Smirnov Class

Now we turn our attention to the Kolmogorov-Smirnov neighborhood structure, $\Gamma^{KS}_{\epsilon}$. The class $\Gamma^{KS}_{\epsilon}$ has not received much attention in Bayesian robustness despite the fact that it has great intuitive appeal. Recall that the base prior density in $\Gamma^{KS}_{\epsilon}(\pi_0)$ is $\pi_0$, which has distribution function equal to $P_0$. Define $\xi(\theta) = -\pi_0'(\theta)/\pi_0(\theta)$ and $J(\theta) = 2\xi'(\theta) - \xi^2(\theta)$. If $\pi_0(\theta)$ is symmetric, $J(\theta)$ is strictly decreasing on $[0, \infty)$ and continuously differentiable on $(0, \infty)$, then by applying the results of Wiens (1986) and Collins and Wiens (1989) it may be shown that for some $\epsilon_0(P_0)$ the robust noninformative prior for this class, with $\epsilon \in (\epsilon_0(P_0), \tfrac12)$, is given by

on [0, a] on [a,oo) ,

(3.3)

where $\lambda = \lambda_1 \tan(\lambda_1 a/2)$. The constants are determined by (i) $P_{KS}(a) = P_0(a) - \epsilon$, (ii) $P_{KS}(\infty) = 1$. The constant $a$ will also satisfy $-\pi_{KS}'(a)/\pi_{KS}(a) \le \xi(a)$. In Figures 3.3 and 3.4 these priors are graphed for a selection of values of $\epsilon$. Figure 3.3 has the normal as a root prior, while Figure 3.4 has the Cauchy as a root prior.

Based on the figures alone, it seems a difficult task to choose between (3.1) and (3.3) as the appropriate form for a noninformative prior. Since examination of the graphs gives little guidance, the experimenter must rely on the interpretation of the neighborhoods $\Gamma_\epsilon$ and $\Gamma^{KS}_{\epsilon}$. Fortunately, this interpretation is rather straightforward, and should allow the experimenter to make a somewhat informed choice of the appropriate robust prior based on the form and degree of contamination.
The density given in (3.3) is the "large $\epsilon$" solution of Collins and Wiens (1989), the "small $\epsilon$" solution containing another term. However, our computations indicate that "small $\epsilon$" is very small, and for most values of $\epsilon$ that will be considered (say $\ge .05$), the "large $\epsilon$" solution is appropriate.
The Levy neighborhood $\Gamma^{L}_{\epsilon,\delta}(\pi_0)$ is a generalization of the $\Gamma^{KS}_{\epsilon}$ neighborhood structure. In this class an additional location parameter, $\delta$, indexes the family. The necessary calculation for the construction of the robust noninformative prior is given in Collins and Wiens (1989). The form of these priors is similar to $\pi_{KS}(\theta)$; however, now the location parameter, $\delta$, is added.
3.3 The Quantile Class

Elicitation of prior information can sometimes involve specification of a finite number of features of the prior distribution. Suppose the parameter space is partitioned into intervals $\Theta_i = [\xi_i, \xi_{i+1}]$, $i = 0, \ldots, k$, where $-\infty = \xi_0 < \xi_1 < \cdots < \xi_k < \xi_{k+1} = \infty$. Let the prior probability assigned to $\Theta_i$ equal $p_i$, $i = 1, \ldots, k$. Now define the quantile class of prior distributions to be

$$\Gamma_Q = \Big\{\pi : p_i = \int_{\Theta_i} \pi(d\theta)\Big\}.$$
This is a natural class based on some elicitation mechanisms. See Berger (1990) for more on this class.
Consider estimating the mean of a normal($\theta$, 1) distribution under squared error loss and with prior information given by $\Gamma_Q$. Again we follow the proposed methodology and maximize the bound in (2.1) to find the default prior. The results of Huber (1974) may be extended to show that the default prior distribution function $P_0$ is such that

(i) $i = 0, 1, \ldots, k+1$;
(ii) $P_0$ is twice continuously differentiable;
(iii) the density $\pi_0 = P_0'$ is strictly positive, except that it vanishes on those intervals $[\xi_i, \xi_{i+1}]$ for which $p_i = p_{i+1}$;
(iv) on each interval $[\xi_i, \xi_{i+1}]$ the function $(\sqrt{\pi_0})''/\sqrt{\pi_0}$ is constant $= \lambda_i$, that is

$$h(\theta) = \begin{cases} a_i e^{\lambda_i \theta} + b_i e^{-\lambda_i \theta} & \text{if } \lambda_i > 0, \\ a_i \theta + b_i & \text{if } \lambda_i = 0, \\ a_i \cos|\lambda_i|\theta + b_i \sin|\lambda_i|\theta & \text{if } \lambda_i < 0. \end{cases}$$

In the case where $\theta$ is constrained to lie between two points, say $-L$ and $L$, one can define the class

$$\Gamma_L = \{\pi : |\theta| < L\}.$$

This is a very special case where the first and last constraint quantiles are set at $-L$ and $L$ respectively, and no other quantiles are constrained. In this case it can be shown that the distribution that maximizes (2.1) equals

$$\pi(\theta) = \frac{1}{L}\cos^2\Big(\frac{\pi\theta}{2L}\Big), \qquad |\theta| < L,$$

where the $\pi$ inside the cosine is the numerical constant. This is the solution of the calculus of variations problem solved by Bickel (1981) in the context of the bounded normal mean problem.
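A numerical illustration of the bounded case follows. It assumes that the maximizing density has the $\cos^2$ form usually associated with Bickel (1981), namely $L^{-1}\cos^2(\pi\theta/(2L))$ on $|\theta| < L$, and compares its location Fisher information $\int (\pi')^2/\pi$ with that of one other smooth density on the same interval (a biweight density, chosen arbitrarily for comparison). The bound `BOUND` and all function names are hypothetical. The information is computed as $4\int (u')^2$ with $u = \sqrt{\pi}$, which avoids dividing by a density that vanishes at the endpoints.

```python
import math

def fisher_information_location(sqrt_density_derivative, lower, upper, n=20000):
    """Midpoint-rule value of 4 * integral of (u')^2, where u is the root density."""
    width = (upper - lower) / n
    total = 0.0
    for i in range(n):
        theta = lower + (i + 0.5) * width
        total += sqrt_density_derivative(theta) ** 2
    return 4.0 * total * width

BOUND = 2.0  # hypothetical value of L

def cos_sq_root_derivative(theta, L=BOUND):
    """Derivative of u = cos(pi * theta / (2L)) / sqrt(L)."""
    a = math.pi / (2 * L)
    return -a * math.sin(a * theta) / math.sqrt(L)

def biweight_root_derivative(theta, L=BOUND):
    """Derivative of u = sqrt(15 / (16L)) * (1 - theta^2 / L^2)."""
    return -math.sqrt(15.0 / (16.0 * L)) * 2.0 * theta / L ** 2

info_cos = fisher_information_location(cos_sq_root_derivative, -BOUND, BOUND)
info_biweight = fisher_information_location(biweight_root_derivative, -BOUND, BOUND)
print(info_cos, math.pi ** 2 / BOUND ** 2, info_biweight)
```

The $\cos^2$ density attains information $\pi^2/L^2$, slightly below the $10/L^2$ of the biweight density, consistent with its role as the minimum-information member of the class.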

4. Discussion
In this article we considered the construction of default robust priors. These priors give the Bayesian a way to specify a particular class of priors and then, using an automatic mechanism, choose a noninformative member of that class from which to make inferences. Often in robust Bayesian analysis, one has an idea about the range of posterior decisions, but not about a particular prior with which to work. The present article takes a step toward remedying this.
The default robust prior has a number of advantages associated with it that make it a good candidate for a default prior. Firstly, it is a proper prior, which eliminates any possible incoherent inferences. Secondly, it has a meaningful interpretation in the context of a class of priors. Thirdly, it is robust in the sense that it tends to favor the data over the prior when there is disagreement between the two. This was illustrated in Figure 3.2, and is also seen in Figure 4.1. There we see that as prior information and data become more discrepant, the robust posterior moves toward the data and flattens, mimicking the likelihood function. However, as is also illustrated in Figures 3.2 and 4.1, when the data and prior agree, the robust posterior behaves like a conjugate posterior, providing narrow confidence intervals and precise inferences. As this happens, however, the default prior retains its noninformative interpretation by always retaining a slight amount of influence.
The methodology that started these investigations, that of Jeffreys outlined in the introduction, provides a method for constructing default priors when no partial prior information (in the form of prior classes) is available. It would then be possible to construct a Jeffreys prior from a class of sampling distributions. In this case one would need to solve for the distribution in the class which has the minimum Fisher information, as is usually done in classical robustness theory. Once this distribution has been found one can compute the Fisher information of this distribution and take the square root of the Fisher information as the Jeffreys prior. Note that in this discussion the class of interest is the class of sampling distributions. One could envision super-robustness methods where it is assumed that both the sampling distribution and the prior distribution are in classes. However, this would be extremely complicated to deal with analytically.
Lastly, one might be curious as to why we are using a frequentist measure (Bayes risk) to generate a Bayesian object (the prior). The Bayesian point of view is that one should condition on the data at hand and then choose the optimal decision procedure with respect to that data only. Unfortunately, this is a somewhat local perspective on the construction of actions. To study the global properties of a procedure one must find optimal actions over a class of possible actions. In this case the goodness of the procedure should be measured by the Bayes risk. Note that if a procedure has a large Bayes risk, the repeated use of the procedure will give poor long run risk properties.
This is not to say that Bayes risk is necessarily a good measure for a particular data set. As it is a frequentist assessment, its worth can only be judged when averaged over all data sets. However, frequentist measures have a role in robust Bayesian statistics. Looking at the behavior of a Bayes rule for a variety of data points (often extreme) may point out unsuspected and unacceptable features of the subjectively chosen prior. These extreme points may be the data points which one particularly wishes to guard against. Hence, considering the possibility of their existence will certainly increase the robustness of a procedure.
One goal of robust Bayesian statistics should be to develop Bayesian procedures which are automatically robust. This will help users of Bayesian methods who may not be sophisticated users, and may not be experienced in the use of the methods. Such users will surely not carry out a sensitivity analysis of the procedures. In such cases procedures which are automatically robust will be extremely helpful.

Bibliography
Berger, J.O. (1984). The Robust Bayesian Viewpoint (with discussion). In Robustness of Bayesian Analysis, J. Kadane (Ed.). North Holland, Amsterdam.
Berger, J.O. (1985). Statistical Decision Theory (2nd Edition). Springer-Verlag: NY.
Berger, J.O. (1990). Robust Bayesian analysis: sensitivity to the prior. J. Statist. Plann. Inference 25, 303-328.
Berger, J.O. and Berliner, M. (1986). Robust Bayes and empirical Bayes analysis with ε-contaminated priors. Ann. Statist. 14, 461-486.
Berger, J.O. and Bernardo, J.M. (1991). On the Development of the Reference Prior Method. To appear in the Proceedings of the Fourth Valencia International Meeting on Bayesian Statistics, Peniscola, Spain, April 1991.
Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. 41, 113-147.
Bickel, P.J. (1981). Minimax estimation of a mean of normal distribution when the parameter space is restricted. Ann. Statist. 9, 1301-1309.
Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Addison Wesley, Reading, MA.
Brown, L.D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary-value problems. Ann. Math. Statist. 42, 855-903.
Brown, L.D. (1988). The Differential Inequality of a Statistical Estimation Problem. In Statistical Decision Theory & Related Topics III, S.S. Gupta and J. Berger (Eds.). Springer-Verlag, New York.
Brown, L.D. and Gajek, L. (1990). Information Inequalities for the Bayes Risk. Ann. Statist. 18, 1578-1594.
Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36, 453-471.
Clarke, B. and Wasserman, L. (1991). Noninformative priors and nuisance parameters. Carnegie Mellon Stat. Tech Report #535.
Collins, J.R. and Wiens, D.P. (1985). Minimax variance M-estimators in £-contamination models. Ann. Statist. 13, 1078-1096.
Collins, J.R. and Wiens, D.P. (1989). Minimax properties of M-, R-, and L-estimators of location in Levy neighborhoods. Ann. Statist. 17, 327-336.
Ghosh, J.K., Sinha, B.K. and Joshi, S.N. (1982). Expansions for posterior probability and integrated Bayes risk. In Statistical Decision Theory & Related Topics III, S.S. Gupta and J. Berger (Eds.). Academic Press, New York.
Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101.
Huber, P.J. (1973). The use of Choquet capacities in statistics. Bull. Inst. Internat. Statist. 45, 181-191.
Huber, P.J. (1974). Fisher information & spline interpolation. Ann. Statist. 2, 1029-1034.
Huber, P.J. (1981). Robust Statistics. Wiley, New York.
Hwang, J.T. and Casella, G. (1991). Frequentist Priors. Technical report BU-924-MA, Biometrics Unit, Cornell University.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. London Ser. A. 186, 453-461.
Jeffreys, H. (1961). Theory of Probability (3rd Edition). Oxford, Oxford University.
Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
Lavine, M. (1991a). Sensitivity in Bayesian Statistics: The Prior and the Likelihood. J. Amer. Statist. Assoc. 86, 396-399.
Lavine, M. (1991b). An Approach to Robust Bayesian Analysis for Multidimensional Parameter Spaces. J. Amer. Statist. Assoc. 86, 400-403.
Polson, N. (1988). Bayesian perspectives on statistical modeling. Ph.D. dissertation, Department of Mathematics, University of Nottingham.
Rubin, H. (1977). Robust Bayesian estimation. In Statistical Decision Theory and Related Topics II, S.S. Gupta and D.S. Moore (Eds.). Academic Press, New York.
Sacks, J. and Ylvisaker, D. (1972). A note on Huber's robust estimation of a location parameter. Ann. Math. Statist. 43, 1068-1075.
Sivaganesan, S. and Berger, J.O. (1989). Ranges of posterior measures for priors with unimodal contaminations. Ann. Statist. 17, 868-889.
Wasserman, L. (1989). A robust Bayes interpretation of likelihood regions. Ann. Statist. 17, 1387-1393.
Wasserman, L. (1991). Recent Methodological Advances in Robust Bayesian Inference. To appear in the Proceedings of the Fourth Valencia International Meeting on Bayesian Statistics, Peniscola, Spain, April 1991.
Wiens, D. (1986). Minimax variance M-estimators of location in Kolmogorov neighborhoods. Ann. Statist. 13, 724-732.
Zellner, A. (1987). An Introduction to Bayesian Inference in Econometrics. Krieger, Florida.

Figure 3.1: Prior distributions from equation (3.1), using a normal root prior, for ε = 0 (solid line), .05 (long dashes), .25 (dots) and .4 (short dashes).
[Plot not reproduced; horizontal axis: Theta, from -8 to 8.]

Figure 3.2: Upper and lower endpoints of the .9 credible interval from a normal prior (dotted line), the robust prior of (3.1) with ε = .25 (solid line) and the likelihood interval (dashed line).
[Plot not reproduced; horizontal axis: x, from -8 to 8.]

Figure 3.3: Prior distributions from equation (3.3), using a normal root prior, for ε = 0 (solid line), .05 (long dashes), .25 (dots) and .4 (short dashes).
[Plot not reproduced; horizontal axis: Theta, from -8 to 8.]

Figure 3.4: Prior distributions from equation (3.3) using a Cauchy root prior, for ε = 0 (solid line), .05 (long dashes), .1 (dots) and .25 (short dashes).
[Plot not reproduced; horizontal axis: Theta, from -8 to 8.]

Figure 4.1: Posterior distributions, for x = 0, 2.5, 5, using a normal prior (dashed lines) and the robust prior of (3.1) with ε = .25 (solid lines).
[Plot not reproduced; horizontal axis: Theta; vertical axis: Posterior.]