On the Completeness of Full-Text Search Languages for XML

Chavdar Botev Cornell University cbotev@cs.cornell.edu

Sihem Amer-Yahia AT&T Labs–Research sihem@research.att.com

Jayavel Shanmugasundaram Cornell University jai@cs.cornell.edu

Abstract
We study formal properties of full-text search languages for XML. Our main contribution is the development of a formal model for full-text search based on the positions of tokens in XML nodes. Building on this model, we deﬁne a full-text calculus based on ﬁrst-order logic, and a full-text algebra based on the relational algebra. We show that the full-text calculus and algebra are equivalent even in the presence of arbitrary position-based predicates, such as distance predicates and phrase matching. This suggests a notion of completeness for full-text languages. None of the full-text search languages that we are aware of are complete under the above characterization. We propose a new full-text language that is complete and naturally generalizes existing full-text languages. Our formalization in terms of the relational model can also serve as the basis for (a) the joint optimization of structured and full-text search queries, and (b) ranking full-text search query results by leveraging existing work on probabilistic relational models.
1 Introduction
The emergence of XML has resulted in the rapid growth of semi-structured XML repositories. Examples of such publicly available repositories are the Library of Congress (LoC) documents [15], the IEEE INEX data collection [14], Shakespeare’s plays, DBLP and SIGMOD Record. A common characteristic of these repositories is that they have a mix of structured and unstructured data. For example, the LoC XML documents describe congressional bills and contain structured information such as bill date and sponsors and unstructured text such as bill description. Querying the content of such documents requires powerful full-text search primitives, in addition to structured query operations.
While XML query languages such as XPath and XQuery [20] support sophisticated structured query operators such as selections, joins and aggregations, they can only support very rudimentary full-text search. For instance, full-text search in XQuery is expressed using the function: contains($e, tokens) which returns true iff the XML node bound to the variable $e contains all the tokens in tokens. While this function is sufﬁcient for simple sub-string matching, it is woefully inadequate for more complex searches. As an illustration, consider the following example from the XQuery Full-Text Use Cases Document [21].
Example 1 (Use Case 10.4): Consider an XML document that contains book and article elements. Find the book elements that contain the keywords “efﬁcient” and the phrase “task completion” in that order with at most 10 intervening tokens.
The above query includes phrase matching, order speciﬁcations, and distance predicates. The contains function in XQuery cannot express and compose all these full-text primitives. XML full-text languages that we are aware of are also unable to express the above query as well as other full-text operations speciﬁed in the XQuery Full-Text Use Cases Document, such as general Boolean conditions and paragraph scoping. We believe that one reason for the current state of affairs is that there is no adequate formal model for full-text

1

search in XML. This makes it hard to understand the full scope of full-text search in XML and develop appropriate languages that can be tightly integrated with structured XML query languages.
One of the main contributions of this paper is the development of a formal model for full-text search in XML. Our model is based on the notion of positions of tokens within an XML node. Explicitly modeling the positions in an XML node, instead of simply treating nodes as a “bag of words” as in traditional information retrieval (IR), allows us to formally capture position-based predicates such as distance and ordered predicates, in addition to traditional Boolean operators. Building on the above formal model, we deﬁne a full-text calculus based on ﬁrst-order logic, and a full-text algebra based on the relational algebra. The calculus and algebra are powerful enough to model arbitrary position-based predicates. We also show that for any given set of position-based predicates, the full-text calculus and algebra have equivalent expressive power. This suggests a notion of completeness for full-text search languages.
Given the above notion of completeness, we show the incompleteness of existing full-text search languages that we are aware of. We also describe a new full-text search language based on the full-text calculus, which is complete and naturally generalizes existing Boolean full-text search languages. This new language is the formal basis for the TeXQuery full-text language proposal [1], which we have submitted to the W3C Full-Text Task Force, whose charter is to extend XQuery with full-text search capabilities.
Although not discussed in detail in this paper, our full-text algebra based on the relational model can serve (a) as the foundation for the joint optimization of structured and full-text search queries, and (b) as the basis for scoring and ranking full-text search results based on the probabilistic relational model developed for information retrieval [11].
Finally, although our formalism is used to model full-text search in XML documents, our results also apply to “ﬂat” documents, where we simply consider each ﬂat document to be an XML node with no internal structure. However, one of the main beneﬁts of formalizing full-text search for XML is that it provides the formal foundation for tightly integrating full-text search with structured queries.
2 Modeling Full-Text Search in XML
At its core, a full-text search speciﬁcation for XML has two components: the search context, which speciﬁes the set of XML nodes (i.e., the context nodes) over which the full-text search is to be performed and, the full-text condition, which speciﬁes the condition that should be evaluated on each context node. Only the context nodes that satisfy the condition qualify as answers. In Example 1 in the introduction, the search context is the set of book elements (and not article elements) that satisfy the full-text condition: contains the keyword “efﬁcient” and the phrase “task completion” in that order with at most 10 intervening tokens.
In order to specify a full-text search query over XML documents, we need (1) a language to specify the context nodes and, (2) a language to specify the full-text search condition (the full-text search language). For (1), we can use a structured XML query language such as XQuery, which is relationally complete and has a well-deﬁned formal semantics [20]. We thus focus on the full-text search language in this paper..
One important issue in full-text search is scoring and ranking the context nodes based on how well they satisfy the full-text search condition. While we do not explicitly consider speciﬁc scoring schemes in our formalism, in Section 3.3, we shall brieﬂy describe how a scoring scheme based on the probabilistic relational model naturally ﬁts into our full-text algebra formalism.
2.1 Modeling Context Nodes
To reason about full-text search languages, we need a formal model for the context nodes. A context node can be any XML node (e.g., element and attribute nodes). Existing models for context nodes are insufﬁcient for expressive full-text search. For instance, the XQuery data model for the book element in Figure 1
2

<book(1) id(2)="1000(3)">

<author(4)>Elina(5) Rose(6)</author(7)>

<content(8)>

<p(9)> Usability(10) of(11) a(12) software(13) measures(14) how(15) well(16)

the(17) software(18) provides(19) support(20) for(21) quickly(22)

achieving(23) specified(24) goals(25).

</p(26)>

</content(27)>

</book(28)>

Figure 1: Positions Example

treats all the text under an element as a single text node (ignore the numbers in parentheses for now). This

model is enough to identify sub-strings in the text and evaluate queries such as ﬁnd author nodes containing

’Elina’. However, it is insufﬁcient to answer queries such as ﬁnd books that contain the tokens ’usability’

and ’testing’ with at most three intervening tokens. Traditional IR models solve part (but not all) of this

problem by tokenizing the entire document (or equivalently, context node in our setting), and representing

each token separa ¢t¡¤e£¥l£§y.¦©¨In¨the§§ex¨!a#m"%p$&le£¢'¢in¨!(0F)1ig324u%re¨!561,£§7¢th8§e¨9te9x9A@ t in the context node would be modeled as the

“bag of tokens”

. However, this model still cannot capture the

distance between tokens (we note that some IR languages do, however, support restricted forms of distance

predicates; see Section 4.2 for a more detailed comparison with such languages).

In this paper, we explicitly model the position of a token in a context node. We argue that this model,

although simple, is powerful enough to capture the semantics of existing full-text search languages. Further,

it serves as the formal basis for deﬁning position-based predicates such as proximity distance and order

predicates. In Figure 1, we have used a simple numeric position (within parenthesis) for each token.

It is important to note that our proposed model does not dictate any speciﬁc implementation of positions

(such as numeric positions, as shown above). More expressive positions that capture the notions of lines,

sentences and paragraphs can be used, and this will enable more sophisticated predicates on positions than

simple distance (such as whether two search tokens appear in the same line or sentence). Our language

formalisms are extensible with respect to the set of predicates, and more complex position identiﬁers will

enable the use of more complex predicates.

2.2 Model Deﬁnition

We now deﬁne our formal model. B is the set o£§f73c$Fon£¢t2Ge7Ixt H nodes, C is the set of positions, and D is the

(possibly inﬁnite) set of tokens. The functio£§n¦V8E 2WH

BQPSR1T maps a context node to the set of

positions in the context node. The function U

CXPYD maps e2ach positi£§o7n3$Fto£¢2Gth7ae`12ctobekedfn  #s¥to¨ re¨d9g9ga9g¨t tha@ t

po£§s¦Vit8io2in`p.qIbrndsth¡¤e£¥e£§x¦ amp£§le¦V8in2iF` igbrudtre1 , if the context node is denoted by , then E

R R§h ,

U , U R , and so on.

3 Full-Text Calculus and Algebra
We deﬁne a calculus for full-text search based on ﬁrst-order logic that captures the key notions of search context, positions, and position-based predicates, which were introduced in the previous section. We then deﬁne a relation-based full-text algebra that is equivalent to the full-text calculus. We also outline the potential beneﬁts of our formalism.
3.1 Full-Text Calculus
The full-text calculus deﬁnes the following predicates to model basic full-text primitives.

3

 ¢¡ ¤£ ¦¥ ¨§ ©8#' & 6£¢2 $8 %$`124£¥a8¢b

24£¥a8

is true iff

B (recall that B is the set of context nodes).

    ©&7 £§7a`124£¥a8§¨ ©£§7¥b

©£§7 £§73$F£¢2G7a`124£¥a8¢b

E

is true iff

E

. This predicate explicitly captures the notion

   of positions in an XML node.

&7 £§¦V82i` ©£§7 ¨$p£§¦b

$p£§¦ d £§¦V82i` ©£§7¥b

U

is true iff

U

. This predicate captures the relationship between

tokens and the positions in which they occur.

' 8a7

A full-text language may also wish to specify an additional set of position-based predicates, E

,

¥depending on user needs. T#h' e c£§a7 lculus is general enough to suppo6rt£¢a2Gr7b$i7trary position-based predicates.

    £  £"! # $© £  £"!%©Speciﬁcally, given a set E %o'f 8p©o` siti¨o9gn9g9g¨ va0ria¨ bqle¨s9g,9g9g¨an¢db a set ¨9g9g9 of cons#ta' nts£§,7 the calc¨u9gl9gu9g¨ s can

¥ &£    '   ¤  's6u£¢p2Gpo7r$t7 any predicate of the form:

' 8a7 df q F7$p#2 8#` ©, £§w7 h¨e©re£§7 ¢¨! F7$!b ¨!£¢'§a8' 8©E ` ©£§7 aqn¨ d©£§7 qb ¨

)( 0  ¤  ' 2131)  ¤  ' &£  4  '7q 8 ©#.'§Fo` ©r £§ex7 am¨ ©p£§l7 eq,b w¨!e cou©ld£§7ad` e©ﬁ£§7ne¨ E©£§7 qb!@

 F7$p#2 8#` ©£§7 ¨ ©£§7 ¥¨! F7$!b

 5  '  ¤  '  4 F7$

©£§7 . H©er£§e7 , £¢'§a8' 8©` ©£§7 ¨ ©£§7 b

return©s£§t7 rue iff there are

 ' )( 0  ¤  '  4  6' 2131)  ¤  '©at£§m7 os7qt 8 ©i#n'§ter`v©e£§n7ing¨ ©t£§o7 keb ns between©£§7 and ;

©£§7 is  true©i£§ff7a` ©£§7 o¨ ©cc£§u7 rsb before

 ¤  '©£§;7 ©£§7

is true iff is in the same paragraph as ;

is true

iff and are different positions.

3.1.1 Full-Text Calculus Queries
57 ¡ ¤£ ¦¥ ¨§ 8@9 6A B§4 24£¥a8 9 6A B§4 9 6A B§4A full-text calculu24s£¥qau8 ery is of the form:

8#' & 6£¢2 $8 %$`124£¥a8¢b " 8' ( %'`124£¥a8¢b!@ " 8' ( %'`124£¥a8¢b " 8' .I( ntu%it'iv`124el£¥ya, 8¢b

the query returns s that are in the search context, and that satisfy . ,

hereafter24c£¥aal8led the query expression, is a ﬁrst-order logic expression that speciﬁes the full-text search con-

dition.

is the only free variable in the query expression. The structure of the query expression is

recursively deﬁned as follows.

   DC  ¤C©&7 £§7a`124£¥a8§¨ ©£§7 b

24£¥a8 ©£§7

E is a query expression where is the free variable and

#' £§7 E.

   DC  ECF© G©H¥&7 £§¦V82i` ©£§7 F¨$p£§¦b

©£§7

U is a query expression, where

#' £§7

$p£§¦ 6£¢2G7$7

E and

.

    ¤  D £  £"!  I©  EC©%' 8©` ©£§7 ¨9g9g9g¨ ©£§7 ¨ ¨9g9g9g¨ ¢b

%' 8

£QPB©H¥ 6£¢2G7$7

is a query expression, where

' 8a7 ©£§7 E,

#' £§7 E and

.

  6 D' 4 6F8 D' 6T D'8 8

88

R R SR R R R RIf and are query expressions, ,

88 , and

   EC  DC D8  DC  DC XW8 ©£§7 p`A&7 £§7a`124£¥a8§¨ ©£§7 b

R U  EC© R VIf8¢b is a query expression, then ©£§7

E #' £§7

R are query expressions, where

E.

8 are query expressions.
8¢b ©£§7 p`A&7 £§7a`124£¥a8§¨ ©£§7 3b , and E

A full-text calculus query has the conventional semantics of ﬁrst-order logic. The form of the quantiﬁ-

cation in the last bullet guarantees that the query expression in the full-text calculus can be evaluated using

only the positions and tokens in the context node, without having to look at other positions. This notion is

similar to the notion of safety for the relational calculus.

We now provide some examples of full-text calculus queries. The following query returns the context

57 ¡ ¤£ ¦¥ ¨§ #8  ¤  ¤ &8  ¤ Y 2Y Q8node s24t£¥haa8t co8n#ta' in& t6he£¢2 to$k8 e%n$s`124’t£¥eas8¢t’b and ©’u£§s7 ab`Ai&lity7 ’. £§7a`124£¥a8§¨ ©£§7 b &7 £§¦V82i` ©£§7 ¨ ¤$8q7$ b

 '  ' &8 U  ' Y 2A5Y©£§7 ¥`A&7 £§7a`124£¥a8§¨ ©£§7 qb &7 £§¦V82i` ©£§E 7 ¢¨ " 7q¡ )13$ bbb!@

U

UE U

In the subsequent examples, we only show the query expression since the rest of the query is the same. The

following query returns the context nodes that contain the token ’test’ and the token ’usability’ with at most

 ¤  ¤ &8  ¤ Y 2Y 38  '  ' Q85 inte©rv£§e7 ni`An&g7tok£§en7as`124. £¥a8§¨ ©£§7 b &7 £§¦V82i` ©£§7 ¨ $8q7$ b U UE U

©£§7 ¥`A&7 £§7a`124£¥a8§¨ ©£§7 qb E

4

 ' Y 2A5Y &8 &£  ¤  '&7 £§¦V82i` ©£§7 ¢¨ " 7q¡ )13$ b  F7$p#2 8#` ©£§7 ¨ ©£§7 ¢¨¡  bbb U

The following query returns the context nodes that contain two occurrences of the token ’test’ and do not

 ¤  ¤ Q8  ¤ Y 2Y Q8  '  ' Q8  ' Y 2Yconta©in£§7the`A&tok7 en£§’7au`1s24a£¥bail8§it¨ y©’£§. 7 b e&7 £§¦V82i` ©£§7 ¨ ¤$8q7$ b U8 2131)  ¤  ' 38   W U  Y 2A5Y  ©£§7a` E©£§7 ¨ ©£§7 b ©£§7£¢§`A&U 7 £§7a`124£¥a8§¨ ©£§7£¢qb V SE

©£§7 ¥`A&7 £§7a`124£¥a8§¨ ©£§7 qb e&7 £§¦V82i` ©£§7 ¥¨ ¤$8q7$ b &7 £§¦VE 82i` ©£§7£¢¢¨ " 7q¡ )13$ bbU bb U

3.2 Full-Text Algebra

We now deﬁne our full-text relations an¨d a"lg!¤e¨b9g9gr9ga¨o pe#%r$ators. The underlying data model for our algebra is

a full-text relation of 'th& e form ¤¦¥ §©¨

where the domain of §©¨ is B (context nodes),

and the domain of is C (positions). ¤ satisﬁes the following properties.

  ¤ has always at least the attribute §©¨ . This captures the context node for full-text search. The

remaining attributes in ¤ capture the essence of full-text search, which is to manipulate positions.
    B© ©£§7  ©£§7 ¥Ea£§c7h3$Ftu£¢p2Gle7a`1$¤9i)n a( f£¥uall8¢-tb ext relation should satisfy the condition that for all the positions in ,
E . The intuition is that a full-text search query can only manipulate positions

within a single context node.

A full-text algebra expression is based on the following full-text relations that characterize the search

context nodes, their positions, and the tokens at these positions.

  ©10



243£5

§

76



8



`
§©¨

b 24£¥a8 : This relation contains a tuple (

24£¥a8 ) for each

B.

  8 © @9

A%B

AV` ¨ 4!!b

24£¥a8

© £§7 §©¨£§ 73$F£¢2G7a:`124T£¥hais8¢brelation contains a tuple for each (

©£§7 ,

) pair that satisﬁes:

24£¥a8

©

BE

. Intuitively, this relation relates context nodes to their positions.

  ©` ¨ 4! b

$p£§¦V82

5QPSRUTWVFX

 B© 8  B© 48 ¤C¡DFE¡GIH §©¨24£¥a 8 ©£§7 : This is a family 24of£¥are8 lations, o©n£§e7 for e£§ac7h3$F£¢2G7a`124£¥aD8¢b . $p£§¦V82 cd onta£§in¦Vs82ia` t©u£§p7¥leb

for each ( 5YPSRU,TWVFX ) pair that satisﬁes:

B $p£§¦V82 E

U.

Intuitively,

contains positions that contain

, and is similar to an inverted list in IR.

We note that while each ¤`C¡DFE¡GIH relation is ﬁnite, there number of such relations will be inﬁnite if D is

inﬁnite. However, this does not lead to a problem in deﬁning the algebra because each algebra expression is

ﬁnite, and can only refer to a ﬁnite set of such relations. Also, physically instantiating the potentially inﬁnite

set of ¤C¡DFE¡GIH relations is not a problem because only a ﬁnite sub-set of these relations will be non-empty

(because the search context is ﬁnite), so only this ﬁnite set of relations will have to be explicitly stored. This

is in fact what happens in current implementations of inverted lists.

' 8a7

In addition, as in the calculus, we have a set of position-based predicates E

.

3.2.1 Full-Text Algebra Operators and Queries
The full-text algebra operators are similar to the relational operators, but with two important differences. First, full-text algebra operators only operate on full-text relations (as deﬁned above), and not on arbitrary relations, due to the nature of full-text search. Second, full-text algebra operators implicitly enforce that each operation only manipulates positions within a single node, and not across nodes. These two properties ensure that the full-text algebra is equivalent to the full-text calculus in characterizing full-text search. A full-text algebra expression is deﬁned recursively as follows.
 10  243£5 § 76  8  is an algebra expression that returns the tuples in the full-text relation 0  243£5 § 76  8  .

5

 @9

A%B



A

is an algebra expression that returns the tuples in the full-text relation 9

A%B



A
.

  ©$p£§¦V82

¤C¡DFE¡GIH is an algebra expression that returns the tuples in the ¤'C¡DFE¡GIH relation, where

D.

  B§4 ¤ B§4 ¤ B§4 )( If

%'

is an algebra expression,!  ¢¡¤£¡D¦¥ G¨§ ©UC C¤§    § ©UC C ` (

%' b ( is an algebra expression. If

%'

et hv¡¤ea£¡lD¦ru¥eaG¨st§u©UelC stCftu§o llt§ -©UhtC eeCx ft` u¤ rle!ll-b at,etwixothneraererlae tiroeinsnat¤hmee,tdrtahtdoeithfiaouvnllea-tlceroxentlsareteicoluanttaiiovlneprcoojerrc&et’sisop.noNnoodptieenrgtahtaototr. 

the new expression is: The attribute names of always has to include

§©¨ in the full-text algebra - this enforces the property that full-text search is always scoped within

a single context node.

  B§4 ¤ B§4 ' B§4 ) B§4 '( %'

( %'

` ( %'  ( %' b

B§4 ¤ B§4 '  'I( f %' an(d %' are algebr5a expres5sions, then

is an algebra expression, If

the

and new expression

evalu!at!e
is: ¤

to
¦ ¡¤£¡D¦¥

G" an# d ¡¤£¡D¦¥

G)r¤%ep$ ,ecwtihveerley,&th¦e n¡¤£¡D¦th¥ Ge"

fu# l ¡¤l-£¡tD¦e¥ xG t

relation corresponding to is the traditional relational

equi-join operation on the §©¨ attribute. The duplicate §©¨ attribute is p ro& jected out in the result full-text relation, and the position attributes are renamed to be consecutive ’s. Note again how the

full-text algebra does not allow operations across nodes because the only predicate that is permitted

in the join is equality between the attributes §©¨ of the two relations.

  B§4 ¤ ! B§4 ¤(  © B§4 4 If

%' %'

8i s

an

alg' e8bar7 a

exp(ress%io' n,

then

')(

V¦021 ©U5C C¦§    § ©UC C432§ 5¤4§    § 546¦7 ` (

%' ¤b is an algebra expression,

where
!expression is: '

E
(

V¦021

©UC

.
C¦§

If
   § ©UC

C43¨§

5¤¦§







§ e54v687 a` l¤ u!abt,eswhtoere

'

,

the full-text relation corresponding to the new is the traditional relational selection operator.

  B§4 ¤ B§4 ' B§4 ) B§4 ' B§4 ¤ B§4 ' B§4 ¤( %'

( %'

` ( %' !9 ( %' b ( %' A@ ( %'

( %' CB

B§4 'I( f %' and

are algebra expres9 sio@ n, thenB

, , and

are algebra expressions. These , and operators have the same semantics as in traditional

relational algebra.

A full-text algebra query is a full-text algebra expression that produces a full-text relation with a single

attribute (this attribute has to be §©¨ by deﬁnition). The set of nodes in the result full-text relation deﬁnes

the result of a full-text algebra query.

We now provide some examples of full-text algebra queries that correspond to the calculus example in

S e¡¤£¡cD¦ti¥ oG n` ¤3C .G1DF.C 1. T¤FhE2eD¦©¦fG o&¦lH¡l&oCwI bing query returns the context nodes that contain the token ’test’ and ’usability’:

CThe
within a

fdoislltoawnciengofq5u:er ¢y¡¤£¡rD¦e¥ tGu`r'ns0 QPthPQR eXTS cVo1 (TnU t§e(WxV t§ X7n` o¤dC eGsDFC

th at¤FcE2oD¦©¦nG t&¦aH¡i&nCI thbbe

token

’test’

and

the

token

’usability’

CThe following query
d o¡¤£¡nD¦¥ oG t` ¤FcoE2D¦n©¦tGa&¦iH¡n& CtIhbebb token

returns the ’usability’:

context
 Y¡¤£¡D¦¥ G

`n` ' o0dQe`as`

(

tR hP a1bRt P

PcU o§ RnP tP aV i7 n` ¤tC wGDFoC

occu¤rCreGDFnC cbebcs of

`th0 e

t24ok3£e5 n
§

’te st’
76 8

dan9 d

3.3 Equivalence of Calculus and Algebra and Its Applications

' 8a7

Theorem 1 Given a set of position-based predicates E

, the full-text calculus and the full-text algebra

are equivalent in expressive power.

The proof is in Appendix A, and is similar to the equivalence proof for the relational calculus and algebra.

The equivalence of the full-text calculus and algebra suggests a notion of completeness for full-text search languages. This provides a formal basis for comparing the expressive power of different query languages, as we shall do in the next section. To the best of our knowledge, this is the ﬁrst attempt to formalize the expressive power of full-text search languages, either for ﬂat documents or for XML documents. Developing

6

a full-text algebra in terms of relations also provides a foundation for tightly integrating, optimizing and evaluating structured (relational or XML) queries with full-text search.
The full-text algebra also enables us to rank query results by leveraging existing work on the probabalistic relational model developed in the context of IR [11, 23]. Speciﬁcally, the probabilistic relational model includes a probability attribute for each tuple that speciﬁes its relevance to the result relation. A tuple with a high probability is very relevant to the result relation, while a tuple with low probability is not. In addition, the model deﬁnes how these probabilities are propagated through traditional relational operators. In our context, we simply need to add a new probability attribute to our full-text relations. We can then rely on these techniques to propagate this attribute through the algebra operators, and produce ranked results.

4 Incompleteness of Existing Languages and a Complete Language
In this section, we show the incompleteness of existing full-text languages with respect to the algebra and calculus. We then deﬁne a complete full-text language based on the full-text calculus that naturally generalizes existing languages.

4.1 Incompleteness of Boolean Full-Text Search Languages

Boolean full-text search languages are commonly used in IR, and have also been proposed for XML full-

text search [10, 19]. A typical syntax for such languages, which we shall call BOOL, is given below. The

simplest query is a search token, which can either be a string literal (such as ’test’) or the keyword ANY,

which matches any token in a node. In addition, the query can be composed with Boolean operators.
     Query := Token NOT Query Query AND Query Query OR Query  Token := StringLiteral ANY

  8  Y 3YWe can recursively deﬁne the semantics of BOOL in terms of ou`A&r c7 alc£§u7alu`1s2e.¨ Ifb tIhe&qu7 ery£§¦Vis82ia` GSt¨ r¤i$pn£§g¦VL8i2 tebrab l

 U ’token’, it is equivalent to the calculus query express`Ai&on7 £§7a`12e¨ E bb

U

.

B§4 B§4 UIf the query is ANY, it is equ( iva%le' nt to the( ex%p'ression

E

. If the query is of the form NOT

S B§4 8 B§4 B§4 B§4Query, it is equivalent to

, where

is( the%c'Valculu( s e%x' pression fo( r Q%u'Ve ry. I(f th%e' query is of the

form Query1 AND Query2, it is equivalent to

R , where

and R are calculus

  & D8expressions for Query1 and Query2 respectively. OR is deﬁned similarly. As an eqx`Aa&m7ple£§, 7ath`12ee ¨ qu¤eb ry

& Y 2Y &8 ¦' ¦' 38 ¦' Y 2A5Y U’&t7es£§t¦V’82iA` ND¨ $N8qO7$Tbb ’usa` bil&it7 y’£§7ais`12eeq¨ uqivb alen&tt7 o th£§e¦V8c2ial`cu¢lu¨ s" q7que¡ ry)13e$ xpbrb ession:

E

S UU

EU

.

¢¡Obviously, BOOL cannot express position-'b8aase7 dd predicates. However, we now show that even if we

disallow such predicates in the calculus (i.e., E

), BOOL is still incomplete if D is inﬁnite.

¢¡Th' e8oar7 emd 2 If D is inﬁnite, there exists a full-text query that can be expressed in the full-text calculus with
E , but which cannot be expressed by BOOL.

  ¨8   Pro`Ao&fS7 ke£§tc7ah`1:2eW¨ eb shal&ls7how£§¦Vt8h2iat` Gn¨o$ qbub ery in BOOL can express the following calculus query:

$

U @© S$ E U (i.e., ﬁnd context nodes that contain at least one token that is not ),

£ ¥ where

D . The proof is by contradiction. Assume that there exists a query in BOOL that can)e( xpress

¥ ' ¥  ¥¤  ¥ ' £ " Q' © t)he( cal)cu( lus query. Let D be the se$ t of)to( kens that appear in $ . We construct two con$ text no9 de` s @  $ an@¥db

Q' ¦¤. con$ tains only the token . contains the token and one other token D D

¥  ¥ ' £()su( ch a token always exists because D is inﬁn)it( e and is ﬁnite). By the construction, we can see that

¥  ¥ ' £)( doe)s( not satisfy the calculus query, while

does. We will now show that either returns both

or or neither of them; since this contradicts our assumption, this will prove the theorem.

¥¨§ ¥§ ¥§ £ ¥ G ¥ ' ¥©§Let be the calculus expression equivalent to . We show by induction on th)e( structure)o( f that

every sub-expression of (and hence ) returns the same Boolean value for

and . If the

7

  F8  ¥ `A&7 £§7a`12e¨ b &7 £§¦V82i` G¨$p£§¦V82cbb

)(

¥ '  U ¡  " ¥ % ¥ 's)u(b-expr$pe£§s¦Vs8io2Ind is $ of the form $p£§¦V82 E d $

U )( , i)t r( eturns true for b$p£§o¦Vth82 and

  ¥  ¥ 'if , and false if

(by`A&co7 nst£§ru7ac`12etio¨ nbob f and - recall )th(at )a(ppears

£ B§4 U ¥  ¥ 'in ). If the sub-expression is of th( e f%o'rm

E

, it returns true for both )( and ).( If

B§4 Sthe sub-e( xpr%e'ssion is of the form

, then it returns the same Boolean value for both

and

because

returns the same Boolean value (by induction). A similar argument can also be made for the

8 T ¢ ¢¡and Boolean operators.

' 8a7 d

If we limit D to be ﬁnite, however, we can prove that BOOL is complete with E

.

¢¡' 8a7 d

Theorem 3 If D is ﬁnite, every query that can be expressed in the full-text calculus with E

can be

expressed in BOOL.

  8  "The proo`Af&is7 pre£§s7ae`1n2et¨edb in Ap&pe7 ndi£§x¦V8A2i. ` TGh¨$e mbb ain intuition is that, if D is ﬁnite, we can express queries

 U Ss$ uch as:

E

U in BOOL by explicitly listing all the tokens that are not

6. Although BOOL is complete under this assumption, it is not always practical bec$ ause even for simple

queries such as the one above, we need to explicitly list all possible tokens other than in the query.

4.2 Incompleteness of Existing Predicate-Based Full-Text Search Languages

We now consider full-text languages that have position-based predicates in addition to Boolean operators [2,

4]. A typical syntax for such a language, which we shall call DIST, is given below.
       Query := Token NOT Query Query AND Query Query OR Query dist(Token,Token,Integer)  Token := StringLiteral ANY

&£The semantics of DIST is the same as B OF7O$pL#2, ex8 cept for the addition of dist(Token,Token,Integer).

This construct is the equivalent of the

predicate introduced in the calculus (Section 3.1), and spec-

" 0' " 0'iﬁes that the num$ be$ r o f intervening token$ s shou$ ld be less than the s peciﬁed integer. More formally, the

& & 68 &  68 ¦' ¦' 68 ¦' 0' 68 &£ & ¦'semq`Aa&nti7cs o£§f7ad`12eis¨ t( ¤b , &, )7 for£§¦Vso8m2i`e to¨k$ enb s a§n`Ad&7 an£§d7a`1s2eo¨meqb int&ege7 r £§i¦Vs8g2iiv` e¥n¨b$ yb the Fc7a$pl#c2ulu8#s` exq¨pre¢s¨!sibobnb :

U  0' U$ $ E

U

E U &7 £§¦V82

.

If or is ANY instead of a string literal, then the corresponding U

predicate is omitted in the

semantics. We now show that DIST is incomplete with respect to the calculus so long as D is not trivially

small. We can also prove similar incompleteness results for other position-based predicates.
7 &£ 7 £ & ¦'Th' e8oar7 emdf q4 FIf7$p#D2 8#` R ¨ , t¥h¨!erb!e@ exists a full-text query that can be expressed in the full-text calculus with
E , but which cannot be expressed by DIST.

& & 8 ¦' ¦' 8 &  8 ¦' 0' 8 &£ & ¦'Proqo`A&f Sk7 etc£§h7a:`12eW¨ e ¤sbhall s§h`Ao&w7tha£§t 7an`1o2eq¨ uerb yii&nD7 IS£§T¦V8c2ian` ex¨p$ ¤reb sis&th7e fo£§l¦Vlo8w2ii` ng¥¨c$ albculusFq7u$p#er2 y:8#` ¨ ¢¨!#bbb

U © 0' © U ¤  0' ¨ S 0'$ E $

$ Ed $

U

U $$

,

where D , D and

(i.e., ﬁnd context nodes where the tokens and do not appear next

¥  ¥ ' £ ¥ to each other at least once). The proof is by contradiction. Assume that )th( ere exist)s ( a query in D)IS( T

 Q' " ¥ '  Q'that can express the$ calculus query.$ We now constr$ uct)tw( o context nodes a$ nd as fol$lows.

 Q' ¥ Gcon$ tains the tokens$ followed by followed by .

)co( ntains the tokens followed by followed

¥ 'b)y( followed by . By the construction, we can see that

does not satisfy the calculus query, while

¥ % ¥ ' £ £does. Using i)nd( uction)o( n the structure of similar to the proof of Theorem 2, we can show that

¢either returns both or or neither of them. This is a contradiction.

4.3 A Complete Full-Text Query Language
We now present a new language COMP based on the full-text calculus that is complete even in the presence of arbitrary position-based predicates. COMP shares the same syntax as BOOL for simple Boolean queries,

8

but naturally generalizes BOOL with position variables to achieve completeness. Thus, simple queries retain

the same conventional syntax, while new constructs are only required for more complex queries.
           Query := Token NOT Query Query AND Query Query OR Query SOME Var Query EVERY Var Query Preds      Token := StringLiteral ANY Var HAS StringLiteral Var HAS ANY    Preds := distance(Var,Var,Integer) diffpos(Var,Var) ...

The main additions to BOOL are the HAS construct in Token, and the SOME, EVERY and Preds constructs in

Query (the semantics of the other constructs remain unchanged from BOOL). The HAS
  )explicitly bind position variab$pl£§e¦s (Var) to positions whe&re7 tok£§e¦Vn8s2io` cc#u' r.¨T$ph£§e¦sb emantics

fcoorn’s  tr#u' ct

allow$ps£§u¦ s HAS#' ’

to in

terms
4ANY’

of is:

t&he7

cE al£§c7au`1l2eus¨ , 

w#' heb re is . While

a StringLiteral is: the HAS construct

U allows

us

to

  explicitly

. The semantics for ’  bind position variables

to

HAS token

¤ U 4 5 ¤ 8 B§4 V 4 B§4 ¤ GW B§4 B§4pf’SoorOsiQMtiEoune  sry#,.' thTeQhSueeOsrMyemE’ iasann¡tdi  c#sE' VoqEf`A&R’EY7VEcEo£§Rn7aYs`1tr2e u¨#c  t' s#'

allb ow(s u%s '#tob ,
Query’ is ¢ 

quantif(y w#h' e¢r`Ae&7

o%v'er these positions. The semantics of

E

£§7ai`1s2et¨h  e#c' albculus

(exp%re'#sb sion sema( nti%c's , where

is the calculus expression semantics for
 ¤ £  £  £  £%ar' b8i©tr`a  r#y' po¨s9i9t9qio¨ n  -#b' a( s¨ed¨p9r9e9qd¨ica£ tb e.s. The

Qseumearyn.ticFsinoafllay,prthedeicParteeds’pcroedn(s tr#u' ct

,.a.l.l,o  w#' s(

for the , ,... £

deﬁnition of )’, is simply:

As an illustration of the power of COMP, the following two queries express the calculus queries used to

& & prove the incompleteness of$ BOOL and DIST in Theorems 2 and 4, respectively.

& ¦' &  ¦' 0' # ¦'SOME (NOT HAS ) $

$

SOME SOME ( HAS AND HAS AND NOT distance( , ,0))

We can prove that COMP is complete (the proof is in the appendix).

' 8a7

Theorem 5 Every query th'a8tac7 an be expressed in the full-text calculus using predicates E

can be

expressed by COMP using E

.

5 Related Work
Most of IR research [2][18] has focused on methods for relevance estimation and efﬁcient evaluation of keyword queries. In this context, full-text languages have been developed to implement speciﬁc primitives, but their formal properties such as expressive power and completeness have not been studied. This observation also applies to XML full-text search languages such as XQuery/IR [3], XIRQL [10], XSEarch [8], XRANK [12], XXL [19] and Niagara [22]. Existing work on probabilistic relational databases [11, 23] could be used to extend our algebra to support relevance scoring of query results. Several other works have used relational systems to store inverted lists and translate keyword queries to SQL [5, 9, 13, 16, 17, 22]. Such implementations do not provide a formal basis for completeness and could beneﬁt from our formalization. Finally, our study of completeness for full-text languages is similar to the that for the relational algebra and calculus [7], and goal is to provide a similar formal basis for full-text querying so that it can be tightly integrated with structured queries.

6 Conclusion
This paper presents a simple, yet powerful formalization of full-text search for XML. While this paper makes an initial step in characterizing full-text query languages for XML, there are multiple directions we can explore to build on this work. First, we wish to explore how the formal characterization of fulltext search in terms of the relational model can enable the seamless integration of full-text languages with structured query languages, in terms of both query optimization and query evaluation. Second, we wish

9

to incorporate primitives such as stemming, thesaurus and stop-words in our framework. We believe that quantiﬁcation over tokens adds additional expressive power, and would enable these additional features.
References
[1] S. Amer-Yahia, C. Botev, J. Robie, J. Shanmugasundaram. TeXQuery: A Full-Text Search Extension to XQuery. http:/www.cs.cornell.edu/database/TeXQuery/.
[2] R. Baeza-Yates, B. Ribiero-Neto. Modern Information Retrieval. Addison-Wesley, 1999. [3] J. M. Bremer, M. Gertz. XQuery/IR: Integrating XML Document and Data Retrieval. WebDB 2002. [4] E. W. Brown. Fast Evaluation of Structured Queries for Information Retrieval. SIGIR 1995. [5] T. T. Chinenyanga, N. Kushmerick. Expressive and Efﬁcient Ranked Querying of XML Data. WebDB 2001. [6] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13(6): 377-387 (1970). [7] E.F. Codd. Relational Completeness of Database Sublanguages. In R. Rustin (ed.), Database Systems, Prentice-
Hall, 1972. [8] S. Cohen et al. XSEarch: A Semantic Search Engine for XML. VLDB 2003. [9] D. Florescu, D. Kossmann, I. Manolescu. Integrating Keyword Search into XML Query Processing. WWW
2000. [10] N. Fuhr, K. Grossjohann. XIRQL: An Extension of XQL for Information Retrieval. SIGIR 2000. [11] N. Fuhr, T. Ro¨lleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database
Systems. ACM TOIS 15(1), 1997. [12] L. Guo, F. Shao, C. Botev, J. Shanmugasundaram. XRANK: Ranked Keyword Search over XML Documents.
SIGMOD 2003. [13] V. Hristidis, L. Gravano, Y. Papakonstantinou. Efﬁcient IR-Style Keyword Search over Relational Databases.
VLDB 2003. [14] Initiative for the Evaluation of XML Retrieval. http://www.is.informatik.uni-duisburg.de/projects/inex03/. [15] Library of Congress. http://lcweb.loc.gov/crsinfo/xml/. [16] J. Melton, A. Eisenberg. SQL Multimedia and Application Packages (SQL/MM). SIGMOD Record 30(4), 2001. [17] A. Salminen. A Relational Model for Unstructured Documents. SIGIR 1987. [18] G. Salton, M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983. [19] A. Theobald, G. Weikum. The Index-Based XXL Search Engine for Querying XML Data with Relevance
Ranking. EDBT 2002. [20] The World Wide Web Consortium. XQuery 1.0: An XML Query Language. W3C Working Draft.
http://www.w3.org/TR/xquery/. [21] The World Wide Web Consortium. XQuery and XPath Full-Text Use Cases. W3C Working Draft.
http://www.w3.org/TR/xmlquery-full-text-use-cases/. [22] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, G. Lohman. On Supporting Containment Queries in Relational
Database Management Systems. SIGMOD 2001. [23] E. Zimanyi. Query Evaluations in Probabilistic Relational Databases. Theoretical Computer Science, 1997.
10

A Proofs of Theorems

Theorem 1: Equivalence of the Full-Text Calculus and Algebra

Le' m8am7 a 1. For every full-text algebra expression that only uses position-based predicates from the set

E , there ex' i8sats7 an equivalent full-text calculus expression that only uses position-based predicates from

B§4the same set E

.

  )¢¡a( %'

¥  0' £ ¥ £ B§4  P5 r`o)of( S£¥kae8§tc¨!h#:$F$ W¨!#e$Fw$ ¢i¨ll9g9g9gp¨!r#o$Fv$UeT b th¦at for every algebra expression

, which6eav) al( uat%e's `1t2eo¨ ar¨e9gl9ga9g¨t'ioT nb

F  #  7 ¡ ¤£ ¦¥ ¨§ #8 & #8 ¤8 2e¨ ¨9g9g9g¨ ',T#@

, ther ae`12eex¨ ists¨9ga9g9g¨ c'aT lcb ulus8q#u' er& y6e£¢x2 p$r8e%ss$i`1o2cnb &7 £§7a`12e¨ ¤b 9g9g9

 &8 ¥ £ B§4 & w&it7 h fr£§e7ae`12eva¨ 'riT¥abbles6a) ( %'`12e¨ ,s¨u9g9gc9gh¨ 'tThb!a@6t d 5

E

B§4E   . )¢¡a( %'

The proof is by induction on the structure of

.

  B§4 ¥ £ B§4   8  T  ¥S U   Q8  ¥ £ B§4 U 7 ¡ ¤£ ¦¥ ¨§ Q8If`

)¢¡a( &7

%'Xd 0 £§7a`12e¨

b

243£5
e&7 §


7£§6 7a`12e8 ¨

6a) ,bbthe6na) (

( %'`12cbId %'`12cb

`

&7 £§7a`12e¨ b &7 £§7a`12e¨ bb E  2 8#' E & 6£¢2 $8 %$`12cb

£ B§4 ¡ ¤£ ¦¥ ¨§6a) ( %'E `12cb!@ E .

8#' is& a6lw£¢a2 y$s8 %tr$ue; therefore,

is equal to the full-text relation

by its deﬁnition.

  B§4 ¥ £ B§4 # & & 7 ¡ ¤£ ¦¥ ¨§ Q8  )¢¡a( %' &I&f7 £§7a`12e¨

d9 ¤b!@

A%B



A 6a) , then

(

%'`12e¨ ¤beds&7 £§7a`12e¨ 7 E £§7

b  a`12e¨ , i.e.

b

8#' & 6£¢2 $8 %$`12cb

E is equal to the full-text relation £ E by its deﬁnition.

  B§4 ¥ £ B§4  Y 3Y & 7 ¡ ¤£ ¦¥ ¨§ Q8  )¢¡a( %' d

6a) ( %'`12cbeds&7 £§¦V82i` G¨ ¤$p£§¦V82 b  a`12e¨ b 8#' & 6£¢2 $8 %$`12cb

& &8 & Y 3YI&f7 £§7a`12e¨ ¤b ¤C¡DF&E¡GIH 7 , th£§e¦Vn82i` ¨ $p£§¦V82 b!@

U

. 5QPSRUTWVFX

EU

is equal to the full-text relation

by its deﬁnition.

  B§4 5Y B§4 ¤Y  )¢¡a( ¥ £ B§4 )Y &  B§4 6YIf

%' d

 ¡¤£¡D¦¥

G¨§ ©UC

C¤§    § ©UC

C

`¥¤§¦©¨

2 gb   )¢¡a( %' 86 a) ,( wh%e'reA`12e¨ ¨9g9g9g¨

isb a full-  te)¢x¡at( alg%e' bra expression whose

Y ¥  0' 2 ( £ ¥ £ B§4 # C C # C # #8e5 qAu` i)va( le£¥nat8§c¨!a#lc$F$ulu¨!#s$Fq$ u¥e¨r9gy9g9g¨!e#x$Fp$ resb sion sis 

6a) ( %'`12e¨ ¨9g9ga9g¨ ndFb d

 &eva7 lua£§te7as`12eto¨  ¤b

U   &8 ¥ £ B§4 6Y &  & C 7 ¡ ¤U £ ¦¥ ¨§ Q89g9g9 &7 £§7a`12e¨ b 6a) ( , %' `12e¨, the¨n9g9g9g¨ b  a`12e¨ ¨9g9g9g¨ Fb 8#' & 6£¢2 $8 %E $`12cb

P  C  P Q8X¥ £ B§4 & C &  7 ¡ ¤£ ¦¥ ¨§ Q8 P    P &8 ¥ £ B§4 6Y &  Y

" "

§§ §§

&E 7 £§7a`12e¨ ¢b &7E £§7a`12e¨ qb
E

6a) ( 6a)

(

%'`12e¨ %' A`12e¨

¨9g9g9g¨ Fb!@ ¨9g9g9g¨

.
6=b! @6 d



R¤0 V  

§ (TU R¤§ 0


V

§ ("! § (TU

 a`12e¨
§    § ("!

`

¨9g9g9g¨ 5b
.

b

8#' & 6£¢2 $8 %$`12cb

  B§4 B§4 4 B§4 '  )¢¡a( C ¥  2If

%'

d#¤§¦©¨

8

24!
5

!$` )¤§( ¦©¨£¥ a88§ ¨!2#$$F,$

 
w¨9gh9ge9g¨!r#e$F$

)¢¡a( b

%'   )¢¡a( ed an¥d¨

%'

are full-text algebra ex-

¥ £ B§4 C &  ¥ £ B§4 #  ¦pressions that eva6luaa) te( to%' `12e¨

¥ £ B§4 &  Q8X¥ £ B§4  #  ¦ &  ¦ 7 ¡ ¤£ ¦¥ ¨§ Q8e6xap)re(ssio%n's`12ear¨ e ¨9g9g9g¨

P   ¦  P 8 ¥ £ B§4 &  8 ¥ £ B§4  #  ¦  '5

" W§  5   § U 

V &7 E

£§7a`1U 2eb ¨

6a) ¢b

¨9g9g9g¨ b

(

%'`12e!¨ 6a) (

for
%'`12eU  ¨

 d ! ¥fo¨ r

¨9g9g9g¨ ¨9g9g9g¨

R
U



U

R6, aa)nd(
,tb hVebn.6 aa`12e) ¨ (

th¨%9ge%'9gi9g'r`1¨ 2e`1e2e¨q¨uU i vU a¨ l9gVe9g9gn¨¨t9g9g9g9gc9g9g8¨¨a#lc' uUUl&u s6q£¢VV u2@b e$rdd8 y%$`12cb

.

  B§4 5Y B§4 ¤Y  )¢¡a( BY ¥ £ B§4 5Y & If

%'Id

'§%'& G¤¥ 1 ©UC 5 C§    § ©UC C43§ 5¤¦§    § 546 7 `( 8 2

b( , where

%' is a full-tex6t aa) lg( ebr%a' ex`12epr¨ ess¨io9g9gn9g¨ 'thT ab t

¥ £ B§4 &  ¥ £ B§4 6Y &  Q8  2 £  £ &  7evalu6ataes) t( o th%e'`1r2eel¨aton¨9g9g9g¨

¡ ¤£ ¦¥ ¨§ Q8 P   P Q8X¥ £ B§4 6Y &  Q8  2 £  £th8e#n' & ! Y' ( V¦021bR P

6£¢2 $8 %$`12cb 
P U §    § R P P) § S U §    § S10

7

5

"

'aT nbrdd th6e ae)qu( iva%le' nAt`12ec¨alcu¨l9gu9g9gs¨ 'qTubery%'e8x©pr` e#s$Fs$ io¨n9g9g9gi¨!s#$F$
§    § T &7 E £§7a`12e¨ ¢b 6a) ( %' `12e¨ ¨9g9g9g¨ 'T b %'
.

¨ 8©`

#¨$F$9g9g9g¨¨ 9g9g£ 9g¨!b .#$F a$ `12e¨¨

¨9g9g,9g¨ 'T b ¨9g9g9g¨ £ b!@6d

  B§4 B§4 4 B§4 '  )¢¡a( CLet

@%' d2¤§¦©¨

24! 3¤§¦©¨

2

  )¢¡a(

$8 5  d 8¥ ¨ , where

%'

  )¢¡a( and

%'

are full-text algebra ex-

¥ £ B§4 DC &  ¥ £ B§4 #  ¥ £ B§4 ¤ &  Tp6reas) si( ons%'tha`12et e¨ val¨u9ga9g9gt¨e'tT ob fo r d ¥¨ R and th6eair) e( qui%va'l`1e2en¨t ca¨lc9g9gu9 'luT sb qd uer6y ae)xp( res%si' o¢n`1s2ea¨ re¨9g9g9 'T b

¥ £ B§4 ' &  &  7 ¡ ¤£ ¦¥ ¨§ 38 P   P Q86a) ( %' `12e¨ ¨9g9g9 'T b  afo`12er ¨ ¨9g9g9g¨ 'R T, b then8#' & 6£¢2 $8 %$`12cb ¥ £ B§4 ¤ &  QT ¥ £ B§4 ' &   '` 6a) ( %' ¢`12e¨ ¨9g9g9 'T .b 6a) ( %' §`12e¨ ¨9g9g9 'T bb!@6d 5 @ 5



"

§    § T &7 E £§7a`12e¨ ¢b

.

  B§4 B§4 4 B§4 '  )¢¡a( CLet

B%' d2¤§¦©¨

24! 3¤§¦©¨

2

  )¢¡a(

$8 5  d 8¥ ¨ , where

%'

  )¢¡a( and

%'

are full-text algebra ex-

pressions that evaluate to for

R and their equivalent calculus query expressions are

11

¥ £ B§4 DC &  ¥ £ B§4 #  ¥ £ B§4 ¤ &  86a) ( %' `12e¨ ¨9g9g9g¨ 'T b  d ¥¨

6a) ( %'`12e¨ ¨9g9g9 'T b d 6a) ( %' ¢`12e¨ ¨9g9g9 'T b

¥ £ B§4 ' &  &  7 ¡ ¤£ ¦¥ ¨§ &8 P   P6a) ( %' `12e¨ ¨9g9g9 'T b  afo`12er ¨ ¨9g9g9g¨ 'R T, b then8#' & 6£¢2 $8 %$`12cb  8 ¥ £ B§4 ¤ &  38 ¥ £ B§4 ' &   '` 6a) ( %' ¢`12e¨ ¨9g9g9 '.T b 6a) ( %' §`12e¨ ¨9g9g9 'T§bb!@ d 5 AB 5

"

§    § T &7 E £§7a`12e¨ qb

.

  B§4 B§4 4 B§4 '  )¢¡a( CLet

%' d

9¤§¦©¨ 24! ¤§¦©¨ 2

  )¢¡a(

$8 5

W8 d ,¥¨where

%'

  )¢¡a( and

%'

are full-text algebra

¥ £ B§4 DC &  ¥ £ B§4 #  ¥ £ B§4 ¤ &  8e6xap)re(ssio%n' s`1t2eh¨at e¨v9ga9g9gl¨ u'aT teb to  d for¥¨

R 6aan) d( the%i'r `1e2eq¨ uiv¨a9gl9ge9 'nTt b cad lcu6luas) q( uer%y' ¢ex`12ep¨ress¨i9go9gn9 'sT bare

S ¥ £ B§4 ' &  &  7 ¡ ¤£ ¦¥ ¨§ 38 P   P Q86a) ( ¥ £ B§4 ¤ &  &8 ¥ £ B§4 ' &   '` 6a) (

%' §`12e¨ ¨9g9g9 'T§b fo ar`12e¨ ¨9g9g9gR ¨ ', T thb en 8#' & 6£¢2 $8 %$`12cb 
%' ¢`12e¨ ¨9g9g9 'T b . 6a) ( %' §`12e¨ ¨9g9g9 'T§bb!@ d 5 !9 5

"

§    § T &7 E £§7a`12e¨ qb

S.

¥ ¥ £ B§4This completes th)e ( str£¥uac8 tural induction. The requirement that fu6ll-at) ex( t a%lg' ebra queries evaluate to a

7 ¡ ¤£ ¦¥ ¨§ 38 ¥ £ B§4relation with a sing2le

attr i2bute8e#n' su& r6es£¢t2 h$a8 t%th$e`12ccb orre6spao) n(din%g'`12cb!@

expression will have only

¢one free variable - . Therefore,

is a valid calculus query.

Le' m8am7 a 2. For every full-text calculus expression that only uses position-based predicates from the set

E , there ex' i8sats7 an equivalent full-text algebra expression that only uses position-based predicates from

¥ £ B§4 X the same set E

.

6a) ( %'`12e¨ ¨9g9g9g¨ 'T b

#  £ B§4Proof Ske tc2eh¨ : W¨9ge9g9g¨w'iT#ll@ p¦rove  that for every query calculus expres  si)¢o¡an( %'

with free

¥  0' #  7 ¡ ¤£ ¦¥ ¨§ ¤8 P   P ¤8v5 a`r)iab( le£¥sa8§¨!#$F$ ¥ £ B§4 & 6a) ( %'`12e¨

¨!#$F$ ¢¨9g9g,9g¨!#$F$UT b , there exis ats`12ean¨ alg¨9ge9gb9g¨ r'aT eb xpre8ss#i' on& ¨9g9g9g¨ 'T b!@6d 5 , such that

6£¢2

$8

%$`1,2cwb hic h

e" va§ l u§ T a&tes7

to E

£§a7ar`1e2ela¨ tioqbn

¥ £ B§4. 6a) ( %'

  ¥ £ B§4   B§4The proof is by induction on the structure of

.

6a) ( If

%'`12e¨ b d &7 £§7a`12e¨ b

  )¢¡a(

E , then

%' d 9 A%B  A . The proof of the equivalence is the

  ¥ £ B§4   Y 3Y B§4same as the analogous case from Lemma 1.

6a) ( %'`12e¨ beds&7 £§¦V82i` G¨ ¤$p£§¦V82 b

  )¢¡a( %'0d

If

U

, then

¤C¡DFE¡GIH . The proof of the equivalence

is the same as the analogous case from Lemma 1.

  ¥ £ B§4 &   &  £  £ B§46a)

¡ ¤£ ¦¥ ¨§ 38 C  C &8 &  £  £   7I` f9

A%B
8#'

 &

( %'`12e¨ ¨9g9g9g¨

A
6£¢2

9g9g9% 9 A%B
$8 %$`12cb



'T brd Ab

%' 8©`

¨9g9g9g¨ 'Ta¨

q¨9g9g9g¨

£ b ,¦ then   )¢¡a(

%'05fd df'§ a%'`1& 2eG¤¥ ¨

 " , w§  h  e§ T re&th7 eE n£§u7am`12eb¨er 3ob f jo%in' s8©is` .¨O9g9g9gb¨ v'iTao¨usly¨,9g9g9g¨ £ b!@ .

1

%

§¨ 9g9g§ 9g% ¨ ¡ '§ 5¤T ¦b§  

 § 54687

  ¥ £ B§4 &  Y Y Y Y Y Y ¥ £ B§4 ¤ &  Y Y Q86a) ( %'`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ ¨ ¨9g9g9g¨ S brd 6a) ( %' q`12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ b

¥ £ B§4 ' &  YR Y RY Y R R ( £ ¥ £ B§4 4 R ¥ R £ B§4 'I6f a) ( %' `12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ S b

¦sd )¥¤ ¦¤ 6a) ( %'

6a) ( %'

R R B§4  B§4 ', where

,
 

)¢¡a(

%'

, an  d)¢¡a( %'

are cal-

 ¥  2Y 2Y ' ¥  2Y Y 2Y Yculus qu5 er¢y` )ex( pr£¥eas8§si¨!o#n$Fs$ w¨9gi9gt9gh¨!#e$Fq$UuTai¨!v#a$Fle$ n¨t 9ga9g9gl¨!g#e$Fb$ ra b expres5 sio`n)s ( £¥a8§¨!#$F$ ¨a9gn9g9gd¨!#$F$§¢3¨!#$F$ ¨9g9g,9g¨!w#h$F$ iSchb eval-

B§4u  a)¢t¡ae( to%' d `¥¤§¦©¨ 24!  &  Y Y Y Y Y Y 7 &  Y Y ©  8 &  Y Y Y Y © ' a`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g8¨  ¨

¨  9g¡¤9g9g£¡¨ D¦¥

S

G¨b§

¨©

©© § `1 2e § ¨ ¨

©

©

¤§¦©¨
¨9g9g9g¨

2
8£ ¢3¨

a$ nb d¨B 9g9g9g` ¨

 

¡¤£¡D¦b ¥

G¨§ ¨5©

§







§¨

3©

¤§¦©¨
`12e¨

&24! $¤§¦©¨ 8¨ 9g9g9g¨ £¢3¨ ¨9g9g9g8¨ 

2
S

b$

b , 5 thed n . 5 ¢@

&  R Y R Y R Y Y R Y Y 7¤¥ £ B§4 ¤ R & R  Y Y Q8 R R a`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ ¨ ¨9g9g9g¨ S b 6a) ( %' ¢`12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ b ¥ £ B§4 ' &R   R Y YR Y YR R R=6a) ( %' `12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ S b!@
R R , which is exactly what we wnated to show.

  ¥ £ B§4 &  Y Y Y Y Y Y ¥ £ B§4 ¤ &  Y Y QT6a) ( %'`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ ¨ ¨9g9g9g¨ S b 6a) ( %' ¢`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ b

¥ £ B§4 ' &  YR Y RY Y R R ( £ ¥ £ B§4 4 R ¥ R £ B§4 'I6f a) ( %' `12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ S b

¦sd = )¥¤ ¦¤ 6a) ( %'

6a) ( %'

R R B§4  B§4 ', where

,
 

)¢¡a(

%'

, an  d)¢¡a( %'

are cal-

 ¥  2Y 2Y ' ¥  2Y Y 2Y Yculus qu5 er¢y` )ex( pr£¥eas8§si¨!o#n$Fs$ w¨9gi9gt9gh¨!#e$Fq$UuTai¨!v#a$Fle$ n¨t 9ga9g9gl¨!g#e$Fb$ ra b expres5 sio`n)s ( £¥a8§¨!#$F$ ¨a9gn9g9gd¨!#$F$§¢3¨!#$F$ ¨9g9g,9g¨!w#h$F$ iSchb eval-

B§4u  a)¢t¡ae( to%' d `¥¤§¦©¨ 24!  &  Y Y Y Y Y Y 7 &  Y Y ©  T &  Y Y Y Y © ' a`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g8¨  ¨

¨  9g¡¤9g9g£¡¨ D¦¥

S

G¨b§

¨©

©© § `1 2e § ¨ ¨

©

©

¤§¦©¨
¨9g9g9g¨

2
8£ ¢3¨

a$ nb d¨@ 9g9g9g` ¨

 

¡¤£¡D¦b ¥

G¨§ ¨5©

§







§¨

3©

¤§¦©¨
`12e¨

&24! $¤§¦©¨ 8¨ 9g9g9g¨ £¢3¨ ¨9g9g9g8¨ 

2
S

b$

b , 5 thed n . 5 ¢@

&  R Y R Y R Y Y R Y Y 7¤¥ £ B§4 ¤ R & R  Y Y QT R R a`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ ¨ ¨9g9g9g¨ S b 6a) ( %' ¢`12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ b ¥ £ B§4 ' &R   R Y YR Y YR R R=6a) ( %' `12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ S b!@
R R , which is what we wanted to show.

12

  ¥ £ B§4 F  ¥ £ B§4 6Y &  ¥ £ B§4 ¤Y6a) ( %'`12e¨ ¨9g9g9g¨ 'T b d 6a) ( %' A`12e¨ ¨9g9g9g¨ 'T b

6a) ( %'

S B§4 YLet us consirder the case

  )¢¡a( %,' where

is
Y &  ¡  F B§4  7 ¡ ¤£ ¦¥ ¨§ 8 C  B§4 ¤Y C 8to

a5

cAa`1l2ecu¨ lus¨9gq9gu9g¨ e'rTyb

expre¦ssion . If ¦ 5

t hat is eq  u)¢iv¡aa( len%t'ftod d , t ah`1e2en¨ ¨9g9g9g¨ 'T b

the8` #9a'lgA%e& Bb6raA£¢2e x$p8 9gr%9ge9%$s`1s2cio9b nA%B



¥ £ B§4 6Y & th6e an)um( be%r' oA`1f2ejo¨ ins¨9gi9gs9g¨ 'T. b!@



A¥b
"

9  §§T

,)¢w¡a(hich%' evaluates &7 £§7a`12e, ¨wh3bere
E

S B§4 Y 7 ¡ ¤£ ¦¥ ¨§ 8¦sd 

  )¢¡a(

¡ ¤£ ¦¥ ¨§ 38 ¥ £ B§4 6YIf8#' & 6,£¢2th$e8 n%$`12cb

, %' d
6a)

which is what

0
(



243£5
%' A`12c§ b!7@ 6



wew9 an¤§t¦©ed¨
8

to show.
2
8 and

5

d

 2

8#' & 6£¢2 $8 %$`12cb

  ¥ £ B§4 & S  #  # Q8X¥ £ B§4 6Y &  # ¥ £ B§4 ¤Y6a) ( %'`12e¨ ¨9g9g9g¨ 'T brd 'T  &7£( £¥a8#`12e¨ 'T  b 6a) ( %' A`12e¨ ¨9g9g9g¨ 'T  b

6a) ( %'

U B§4 YIf   )¢¡a( %' , where

Y B§4 Y &  7 ¡ ¤£ ¦¥ ¨§ 8is

a

calcu5 lus

query  

U  # &  # © Y &  7 ¡ ¤£ ¦¥ ¨§ Q8  # ¥ £ B§4 6Y &  #ua'tTes to`12e¨ ,¨th9g9ge9gn¨ 'T 

e)¢¡ax(pre%ss' iond b 5@

t ah a`1¡¤2et£¡¨D¦i¥sG¨e§ %¨q¤9gu9g§ 9gi¨ v'§ %aT¡ l¤eb ntanto8d#t'5he&

dalge ab`12era¨ ex¨p9gr9ge9g¨ s'siT ob n 6£¢2 $8 %$`12cb 'T 

8#' & 6,£¢w2 $h8 ic%h$`1e2cvb al6a) ( %' A`12e¨ ¨9g9g9g¨ 'T 

b!@

U= .

  ¥ £ B§4 &   #  # W ¥ £ B§4 6Y &  #6a) ( %'`12e¨ ¨9g9g9g¨ 'T b d 'T  &7£( £¥a8#`12e¨ 'T  b

6a) ( %' A`12e¨ ¨9g9g9g¨ 'T  b

¥ £ B§4 6Y V B§4 YL6eat) ( %'

  )¢¡a,( wh%e' re

¥ £ B§4 #   # ¥ £ B§4 6Y &  #is a calc6ulau)s ( que%r'y`12eex¨ pre¨s9gs9gi9go¨ 'nT tbrhdat is 'eqT u ivale6nat ) to( th%e' al`1g2ee¨ bra¨9ge9gx9g¨ p'rTessibon

.

SXU SWe use the equation

and apply the

previous case..

¥2
For every calculus query, its query expression has only one fre)e ( va£¥riaa8 ble, , therefore the equivalent

¢algebra query evaluates to a relation that contains a single column,
query.

. Therfore, it is a valid algebra

The above two Leammas prove the equivalence of the full-text calculus and algebra.

¢

Theorem 3: Completeness of BOOL when is ﬁnite

£ 7 ¡ ¤£ ¦¥ ¨§ F8d  2 8#' & 6£¢2 $8 %$`12cb `12cb!@

Proof Sketch: Let

( E be a calculus query expression. We will prove

6 ¦' that there exists an equivalent Query expression in BOOL. Without loss of generality,¨ w¥e¨9ga9g9gs¨ sume that

£every quantiﬁed variable `1i2cnb has a unique name. Let these position variable names be

.

We ﬁrst normalize E using the sequence of equivalence transformations presented below.

#C C&7 £§7a`12e¨ Ab

&7 £§¦V82i` p¨$!b

1. (Sink Negations) Move all negations down to the predicates
  &8   

E

&7

£§7aa`1n2ed¨ b

U  

.

  W SXS   W SXU   38Replace

&any7

re£§p7ae`1t2eiti¨ veb

negations  

w&ith7

£§. 7aIn`12eve¨ rtb

quantiﬁers:  

E &7 £§7a`12ei¨ s rb eplac  ed

V S S&V U Swith E

and E

is replaced with

E

.

3C C&7 £§¦V82i` F¨$!b

&7 £§¦V82i` ¨$!b

 C S2. (Group) Move every expression of the form U

and U &7 £§¦V8o2 ut of the scope

of any quantiﬁer over a variable different from . This is possible because U

is applied on

&P ¥¤§¦©¨ 

¦¤  P © ¤ © 68 "T ¦  P £only

one

position
 

variab  le.¨

F@ormally  ,

th¨ ie @transformation

is

a

repeated

application

of

P

8 £ T £ U V FCwhere , , and has no free variable . Use the co&mm7 ut£§a7ati`1v2eit¨ y 3ob f

C C C C C ©and to group the above predicate expressions next to each oth er`1a2en¨d r3ib ght after E   ¨ @ .

£ £ U VWe get a propositional formula with propositions of the form

where

.

FC C W&7 £§7a`12e¨ 3b

 C C 8  V3. (Remove unive&rsa7 l qu£§a7an`12etiﬁ¨ c3ab tion) Remove any universal quantifers by replacing

E

C SXU C 38 ¦ C C Sw&ith7 £§7a`12e¨ 3bE `12e¨ b

. We get a propositional formula over propositions of the form

UE .

¦ C C`12e¨ 3b

4. (Local DNF) Convert each

to DNF.

  8   T   Y Y 8  Y T&7 £§7a`12e¨ b ` `12e¨ b

`12e¨ bb

` &7 £§7a`12e¨ b

`12e¨ bb

Y Y UY Y 8 Y Y C C 8 ¦ C U C ¦ C C5. (` Splirt)&R7epl£§a7ace`12e¨ b E `12e¨ bb

&7 £§7a`12e¨ 3b wit`1h2e¨ 3b

E

`12e¨ 3b

U  UE

to¨9g9g9g¨ T E

for every disjunct in

.

P P 38 ¥P R P R ¥PLet the ne&w7pos£§it7aio`12en¨ vaqrb iable#s`1b2ee¨ qb

. We get a propositional formula over propositions of the

U R R Rform

E

where is a conjunction.

13

¥P P £ U R P R P Q86. (G#l`1o2eb¨ alqDb NF) Consider to be a propositional formula over propositions of the form
R . Convert it to a DNF.

&7 £§7a`12e¨ qb E

96( ` b

£ 7 £We ¡ ¤£ ¦¥ ¨§ 8  XC8 P ¥ £C P ¡ ¤£ ¦¥ U¨§R F8 C  P X8 C P¥ R C P 78#W' e&

deﬁne

for a calculus query exprd ess i2 on

o6b£¢s2er$v8 e%$th`12catb

after
¢

&7 £§7a`12e¨ b

t§ ¥h@e,

normalization w`12eh¨ erb e each ¢

§

is either

8a#s'

th&e

6eq£¢u2 iv$8a%le$n`1t2cqb uer`¡y  in &7 £§7a`12e¨

B£Ob ¢ OL§ ¢.b!@

d¥¤ `12e¨ b

of the form

E

§¦  2 or of the

SXU 9 R 9   R £ C 7 ¡ ¤£ ¦¥ ¨§ F8 C £form

E

, as in ste¨ pd Gl o2 balDN8#F' f&ro6m£¢2 th$e8 %n$o`1r2cmb aliz¢ ati¨¥o@ n. Therefore, can be

9  ! 9  !decompo6se( d`
£ £ C P £ £ £see that

inbto=th` 6e (ca`lcul§ ubs

expressions AND ... AND

6(

`

§

U

bb

` 6( OR ... OR

`

P§

b

AND

...

aAnNdDitQiEs (noPt§

d© ibfﬁb cult to . Thus,

C P £we ! !  ! £  U    Q8 !  !  SXU   Q8

can focus only A§s`1s2ee¨enb above,

on converting each § . each§`12e¨ § bis of&th7e fo£§r¦Vm82i`

&7 G¨$!b

E

£§7a`12e¨ &7

b  §`12e¨ £§¦V82i£` G¨$!b

b or of the form

&7 £§7a`12e¨ b E

!  S£

, where £

¥`12eis¨ b U

or U

. In either case, we can consider there

£ C P U   38 are

no duplicates among Let us ﬁrst consider

£
the

case

w; hoethreerw§ised ,

we

c&an7

sim£§7apl`1y2e¨ E

elib min atP e
£

thP e`12em¨ .

b

.

  4 ' !    !   0''    0' C PI$ f thd er$ e exists

' and

such that £

U `12e¨

bids&7 U

£§¦V82i` G¨$

b ,£

9 £ C P " " £empty

, the6n (the` set.

co§ ¢nb d=itiAonNY”onAeNtDokeNnOTpe(r$

p6os5 iti9go9g9 n6”5

(S$ S ection 3.1)
)), where

V
is D

`12e¨ bid &7 £§¦V82i` G¨$ b U , and
vid ola t$ed¨.9g9gT9g¨h$ eS r@ efore, § is the
is the set of all

tokens. Intuitively, this query returns the empty set because it requires the result nodes to contain a

token that is not in D , which is impossible.

  4 ' !   !  ' C P 9 SIf there exists

' and

such that £

U `12e¨ b

 £ £ C Pthen th$ is6i5 s 9ga9gn9 6o5 b$vS ious contradiction and

d
§

&7 £§¦V82i` G¨$!b U and £

V `12e¨

bd 6(

is the empty set. We deﬁne

`

&7 £§¦V82i` G¨$!b

§ ¢b

U = ANY

AND

NOT(

)) as above.

  4 6' !   ! ' '  !   0YLe&tt7her£§e¦Ve8x2iis` tGs¨$!b and there does not exist

such§`1t2eha¨ t b£

U `12e¨

beds&7 U

£§¦V82i` G¨$!b &7 £§¦V82ia`nGd¨$£

b

V

`12e¨

bed

S  2Y    38  £ C P S 7 ¡ ¤£ ¦¥ ¨§ 8U $
9 C Ptoke&n7
U £E

d$ £§7a`12e.¨

. Then we can ignore any £ Tb he&lat7 ter£§a¦Vr8e2it` rGiv¨$!iab!l@ ly satisﬁed.

In

which contains
this case, § =

 2 6(

U , which is exactly the semantics for

U `

8#' & 6£¢2 $8fo%r$`1s2cobme

§ ¥b

$ =.

  8 £  C P 2C 7 ¡ ¤£ ¦¥ ¨§ 8 U   @8 S  2C @8T9g9gh9 e

las&tc7 as£§e¦Vi8s2i` G§¨$

d ¢b

 2

8#' & 6£¢2 $8 %$`12cb

&7 £§7a`12e¨ b E

&7 U2

£§¦V82i` G¨$

Ub

S 2C 2C P #T ¤T P P  P 2C £ C P 2C 7 ¡ ¤£ ¦¥ ¨§ &8  P P  &8ken9
! ¢7 ¡ ¤£ ¦¥ ¨§ &8 U   &8D`A&7  P 9 C P P U P&7
£U

fU  r£§o$ £§¦Vm¦VU 8¨82it9g2iUh9g` 9gGe`¨G$¨c¨$  o$ @ mU.b!b@pDleume9g9ge9ton.tth&T eh$ 7iﬁsU U ¨ne9g£§i9gxt¦V9gep¨8n$r2ieess` sGs@ i¨ooo$ nff#D c ba$,b!n@ U

¨b9ge9g9g¨i$nt e@ rpreted as the

§d ¤

 2  2

with8#r' eg& a6rd£¢s2 8#' & 6£¢2

= 6( `

. The latter is trivially equivalent to the query

condition

t$o8 %th$e`12csbet

$8 %$`12cb

§ ¢b

$ =

U

thatH D OR

& $ c7 oU n¨t£§9ga9g7a9gi¨n`1$2es¨ a@
&7 E £§7a`12e¨ E$
... OR

tod b b

.

£ C P SXU   8  9 £ C P 9 S £ C P S £ C PIn case § d

&7 £§7a`12e¨ b E

 P P `12e¨ b

6( `

£ , then

§ ¢b

6( = NOT

`

§ ¢b , where

§ is

transformed as in the previous case.

Theorem 5: Completeness of COMP
¥ E(39 6A6£ ¥ £ B§4   Proof Sketch: We will prove that every calculus query can6bea) re( pre%se'n`1t2ee¨d by¨9ga9g9g¨ C'OT bMP query

" 8' .

We use induction on the structure of the query expression

.

  ¥ £ B§4   ¥ E(39 6A  ¥ £ B§46a) ( %'`12e¨ beds&7 £§7a`12e¨ b

6£

If E , then

" 8' 6a) ( %' = HAS ANY. This is equivalent to

by deﬁnition.

  ¥ £ B§4   Y 3Y ¥ E(39 6A 6a) ( %'`12e¨ brd &7 £§¦V82i` G¨ $p£§¦V82 b

6£

¥ £ B§4If 6a) ( %'

U

, then

" 8' $p£§¦V82 = HAS ’ ’. This is equiv-

alent to

by deﬁnition.

14

  ¥ £ B§4 &   &  £  £ ¥ E(39 6A6a) ( ¥ £ B§4If

%'`12e¨

¨9g9g9g¨ 'T brd %' 8©` 6a) ( %'

¨9g9g9g¨ 'Ta¨

¨9g9g9g¨

£ b , then 6£

" 8'

d



2



`


!¨9g9g9g¨
E

¨W3

!¨9g9g9g¨W3

¨

b .

This is equivalent to

by deﬁnition.

  ¥ £ B§4 &  Y Y Y Y Y Y ¥ £ B§4 ¤ &  Y Y Q86a) ( %'`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ ¨ ¨9g9g9g¨ S b 6a) ( %' ¢`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ b

¥ £ B§4 ' &  YR Y RY Y R R ( £ ¥ £ B§4 4 R ¥ R £ B§4 'I6f a) ( %' `12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ S b

¦sd = )¥¤ ¦¤ 6a) ( %'

6a) (

R R ¥ E(39 6A¦ ¥ E(39 6A4', where

6, £ " 8' , and 6£

¥ E(39 6A ¥ E(39 6A  ¥ E(39 6A4' ¥ £ B§4c6u£lus que" r8y' expr6es£ sions w" 8it' h equivalen6t£ COMP" 8q' ueries be

6anad) ( %'

%' " 8' are cal-
, then

=

AND

. This is equivalent to

by deﬁnition.

  ¥ £ B§4 &  Y Y Y Y Y Y ¥ £ B§4 ¤ &  Y Y QT6a) ( %'`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ ¨ ¨9g9g9g¨ S b 6a) ( %' ¢`12e¨ ¨9g9g9g¨ £¢¨ ¨9g9g9g¨ b

¥ £ B§4 ' &  YR Y RY Y R R ( £ ¥ £ B§4 4 R ¥ R £ B§4 'I6f a) ( %' `12e¨ ¨9g9g9g¨ £¢3¨ ¨9g9g9g¨ S b

¦sd = )¥¤ ¦¤ 6a) ( %'

6a) (

R R ¥ E(39 6A¦ ¥ E(39 6A4', where

6, £ " 8' , and 6£

¥ E(39 6A ¥ E(39 6A  ¥ E(39 6A4' ¥ £ B§4c6u£lus que" r8y' expr6es£ sions w" 8it' h equival6en£ t COM" P8' queries be

6aan) d( %'

%' " 8' are cal-
, then

=

OR

. This is equivalent to

by deﬁnition.

  ¥ £ B§4 &  ¥ £ B§4 6Y &  ¥ £ B§4 ¤Y6a) ( %'`12e¨ ¨9g9g9g¨ 'T b d 6a) ( %' `12e¨ ¨9g9g9g¨ 'T b

6a) (

S ¥ E(39 6A Y ¥ E(39 6A ¥ E(39 6A YIf 6£ " 8' , where 6£

¥ £ B§4pression that is equiva6lean) t t( o th%e' COMP query

, then

%'

" 8' is a calcul6us£ query" e8x' -

= NOT

.

This is equivalent to

by deﬁnition.

  ¥ £ B§4 &   #  # Q8X¥ £ B§4 6Y &  # ¥ £ B§4 ¤Y6a) ( %'`12e¨ ¨9g9g9g¨ 'T b 'T  &7£( £¥a8#`12e¨ 'T  b 6a) ( %' A`12e¨

U ¥ E(39 6A¦Y ¥ E(39 6AIf =

6£

 # ¥ E(39 6A5Y ¥ £ B§4is a calcul'uTs query e6xp£ ressio" n8t'hat is equivalent to the COM6P aqu) e( ry %'

¨9g9g9g¨ 'T  b

6a) (

" 8' , w6he£ re " 8'

, then

%'

= SOME

(

). This is equivalent to

by deﬁnition.

  ¥ £ B§4 &   #  # W ¥ £ B§4 6Y &  #6a) ( %'`12e¨ ¨9g9g9g¨ 'T b d 'T  &7£( £¥a8#`12e¨ 'T  b

6a) ( %' `12e¨ ¨9g9g9g¨ 'T  b

¥ £ B§4 6Y V ¥ E(39 6A¦YI6fa) ( %'

6£ " 8' , where

¥ E(39 6A  # ¥ E(39 6A5Y ¥ £ B§46£ " 8' is a calculus q"uT e ry exp6re£ssion t"ha8t' is equivalent to the COMP qu6eary) ( %'

, then

¢ = EVERY

(

) . This is equivalent to

by deﬁnition.

15