Version: 0.9.5/2006-02-08
Status: Pending discussion
While discussing a possible MARC format focused context set, the current definition of proximity was raised as a possible solution to performing searches that made use of the MARC field and subfield structure. The ZeeRex context set already makes use of this for discovering combinations of context set and index name, for example.
The issues that were raised fall into the following broad groups:
Proximity is CQL's answer to structured searching. The prime use case is for the user to search for words within a given distance of each other, for example cat within 4 words of hat. It is expressed in terms of the distance between the terms and a relation for that distance, the unit for the distance, and whether the order of the terms is significant. Some examples:
- "cat" prox/distance=1/unit=word "in"
- "cat" prox/distance>2/unit=word/ordered "hat"
- "data" prox/distance=0/unit=paragraph "computer"
The distance relation may be one of: <, <=, =, <>, >=, >
The distance value is a non-negative integer; the default is 1 for words and 0 for the other units.
The unit must be one of 'word', 'sentence', 'paragraph' and 'element'.
The ordering is either 'ordered' or 'unordered', with the latter being the default.
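The modifier syntax described above can be sketched as a small parser. This is only an illustration, not part of the CQL definition: the names Prox and parse_prox are hypothetical, and the default unit of 'word' is an assumption taken from the equivalence this document gives between "a prox b" and its fully spelled-out form. No default relation is assumed; it stays unset when no distance modifier is given.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The four units of the current definition, as listed above.
UNITS = {"word", "sentence", "paragraph", "element"}


@dataclass
class Prox:
    relation: Optional[str] = None  # None: no distance modifier was given
    distance: Optional[int] = None
    unit: str = "word"              # assumption: 'word' is the default unit
    ordered: bool = False           # 'unordered' is the stated default


def parse_prox(text: str) -> Prox:
    """Parse e.g. 'prox/distance>2/unit=word/ordered' into a Prox record."""
    head, *modifiers = text.strip().split("/")
    if head != "prox":
        raise ValueError(f"not a proximity operator: {text!r}")
    prox = Prox()
    for mod in modifiers:
        if mod in ("ordered", "unordered"):
            prox.ordered = mod == "ordered"
        elif mod.startswith("unit="):
            unit = mod[len("unit="):]
            if unit not in UNITS:
                raise ValueError(f"unknown unit: {unit!r}")
            prox.unit = unit
        else:
            match = re.fullmatch(r"distance(<=|>=|<>|<|>|=)(\d+)", mod)
            if match is None:
                raise ValueError(f"unknown modifier: {mod!r}")
            prox.relation, prox.distance = match.group(1), int(match.group(2))
    if prox.distance is None:
        # Default distance: 1 for words, 0 for the other units.
        prox.distance = 1 if prox.unit == "word" else 0
    return prox
```

For instance, parse_prox("prox/distance>2/unit=word/ordered") yields relation '>', distance 2, unit 'word' and ordered set, while parse_prox("prox/unit=sentence") falls back to distance 0.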
CQL defines the units for proximity. This limits what sort of structural or text focused searching can be carried out with CQL, and we do not want to limit implementations in this way. Nor do we want to have to change the CQL definition every time an implementor adds a new unit to their particular product.
The proposed solution, like the rest of CQL, is to allow the units to be profilable in context sets.
We can use different sets with the same names, for example if we wanted to find people who lived within three streets of each other:
- "cat" prox/unit=sentence "hat"
- "Rob" prox/my.unit=street/distance<4 "Jane"
Currently, the CQL specification determines the default values for the proximity modifiers. This means that there is no way to allow a server to treat "a prox b" differently to the query "a prox/unit=word/distance>=1/unordered b" even if a and b are not single word terms, or are discovered in non-word indexes.
The proposed solution is to allow servers to use their own defaults when the values are not specified, but that a recommended best practice is to follow the CQL 1.1 defaults unless there is a particular reason not to.
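One way to picture the proposed behaviour is as layered lookups: values given in the query win, then the server's own defaults, then the CQL 1.1 defaults. The sketch below is an assumption about how a server might organise this; the dictionary keys and the function name resolve are hypothetical, and only the defaults stated in this document (word, unordered, distance 1 for words and 0 otherwise) are encoded.

```python
from collections import ChainMap

# CQL 1.1 defaults as described in this document.
CQL_DEFAULTS = {"unit": "word", "ordering": "unordered"}


def resolve(specified: dict, server_defaults: dict) -> dict:
    """Fill in unspecified proximity modifiers.

    Lookup order: query-specified values, then server defaults,
    then the CQL 1.1 defaults.
    """
    merged = dict(ChainMap(specified, server_defaults, CQL_DEFAULTS))
    if "distance" not in merged:
        # The default distance depends on the unit finally chosen.
        merged["distance"] = 1 if merged["unit"] == "word" else 0
    return merged
```

A server with no defaults of its own, resolve({}, {}), ends up with the CQL 1.1 behaviour, while a server that sets {"unit": "sentence"} would treat a bare "a prox b" as a same-sentence search.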
Although there are only four units in CQL 1.1, they do not have a description of what they mean. They were drawn from Z39.50's proximity for 1.0 and the definitions were not copied along with them. With the transfer from CQL defined to context set defined semantics, this is a perfect opportunity to put this failing to rest.
The following proposal attempts to cover the use cases and issues brought up during the discussion.
The most commonly implemented type of proximity is word based. Even ignoring the special case of adjacency (phrase searching) it is still relatively easy to understand and implement word based proximity.
CQL and SRU allow the server to define the exact details of what a word is. For example, they do not specify whether 'book-case' is one word or two. In abstract terms, a word should be treated as a single linguistic unit, as appropriate for the language of the text.
The use cases for word based proximity are relatively clear -- there may be many similar ways of describing something, and hence the user needs to find words together, rather than words spread out in the document.
For example, both "association rule mining" and "mining of association rules" could be discovered by the query:
mining prox/unit=word/distance<4 association prox/unit=word/distance<4 mining
but the query would not discover a document where the three words were widely separated.
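A single word-based prox step can be sketched as follows. The tokenizer (lowercase, split on whitespace) and the function name word_prox are assumptions made for illustration; as noted above, the server is free to decide what a word is. Distance is taken to be the difference between word positions, so adjacent words are at distance 1.

```python
import operator

# The six distance relations of the proximity definition.
RELATIONS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "<>": operator.ne, ">=": operator.ge, ">": operator.gt,
}


def word_prox(text, left, right, relation, distance, ordered=False):
    """True if some occurrence of `left` and `right` satisfies the relation."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    lefts = [i for i, tok in enumerate(tokens) if tok == left]
    rights = [i for i, tok in enumerate(tokens) if tok == right]
    compare = RELATIONS[relation]
    for i in lefts:
        for j in rights:
            if ordered and j < i:
                continue  # ordered: the right term must not precede the left
            if compare(abs(j - i), distance):
                return True
    return False
```

With this sketch, 'mining' and 'association' are within distance<4 of each other in both "association rule mining" and "mining of association rules", regardless of which comes first, unless the search is ordered.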
When distance=0, it refers to the same unit. For example, distance=0 and unit=sentence means that the two words need to be within the same sentence. Thus distance=0 and unit=word means that the terms need to be within the same word. One might conclude that this is useless, but this is not necessarily the case (and hence distance=0/unit=word should not be an error).
For example, the query:
txt.prefix=in prox/unit=txt.word/distance=0 txt.suffix=ly
This would match 'indistinguishably', 'indestructibly', 'interminably' but not 'intellectually', because there 'in' is not a prefix; it is part of 'intellect'. This cannot be achieved with term masking ("in*ly"), as that would match many more words than wanted, including 'intellectually'.
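The difference between masking and a same-word proximity search can be illustrated with a toy example. Everything here is an assumption for illustration: fnmatchcase stands in for CQL term masking, and the PREFIXES and SUFFIXES tables are a hand-written stand-in for whatever morphological analysis a server's txt.prefix and txt.suffix indexes would be built from.

```python
from fnmatch import fnmatchcase

WORDS = ["indistinguishably", "interminably", "intellectually"]

# Hypothetical morphological analysis: which prefixes and suffixes
# each word is considered to carry.
PREFIXES = {
    "indistinguishably": {"in"},
    "interminably": {"in"},
    "intellectually": set(),  # 'in' is part of 'intellect', not a prefix
}
SUFFIXES = {word: {"ly"} for word in WORDS}


def masked(pattern):
    """Words matching a masked term such as 'in*ly'."""
    return [w for w in WORDS if fnmatchcase(w, pattern)]


def same_word(prefix, suffix):
    """Words carrying both the prefix and the suffix (distance=0, unit=word)."""
    return [w for w in WORDS if prefix in PREFIXES[w] and suffix in SUFFIXES[w]]
```

Here masked("in*ly") returns all three words, whereas same_word("in", "ly") leaves out 'intellectually'.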
Textual Structure units:
Physical Structure units:
The structure of most records can be thought of as a tree. XML is naturally tree-like. It's not much of a stretch to think of the MARC structure as a tree. While graphs are not necessarily tree-like, many properties are similar. Other structures may vary.
There are two dimensions in a tree. You can move up and down the parent/child hierarchy, or you can move backwards and forwards through sibling nodes. Searches should be able to express combinations of these two dimensions as well. This is an immediate problem as we can only express one dimension with 'distance'. The current proposal expresses searches that include both dimensions only as complete sub-trees for this reason.
The units defined in the Structural context set:
Example tree structure:
Example tree instantiated in XML:
<a>
bla bla introduction
<b>
<e>
bla bla association rule mining</e>
<f>
bla bla classification bla bla indexing</f>
bla bla data mining
</b>
<c>
bla bla text mining bla bla linguistics</c>
<d>
<g>
bla bla text indexing</g>
</d>
</a>
An element is a single structural unit akin to a word. The unit is not necessarily atomic, and may contain other sub-structures. The scope of element is thus dependent entirely on the structure of the data, and the location of the matching term within that structure. The base structure is the one in which the left hand term appears. The structural components should be treated the same, regardless of the type of the component. For example, an XML attribute is in the same scope as a child element. An indicator in MARC is at the same level as a subfield. Element distances are horizontal, stepping through siblings rather than parents/children.
In the example all of A through G are elements.
The query:
association prox/unit=tree.element/distance=0 mining
would find a match in element E.
The query
association prox/unit=tree.element/distance=1 classification
would match the text in elements E and F as they are siblings with a distance of 1.
The query
data prox/unit=tree.element/distance>0 classification
would not match anything, because 'data mining' appears in B, and 'classification' appears in F, which is a vertical distance, rather than a horizontal distance.
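The three element queries can be checked against the example XML with a short sketch. It is an interpretation rather than a definition: a term is taken to belong to the node that directly contains its text, the element distance is the number of sibling steps between two nodes with the same parent (0 for the same node), and the helper names are hypothetical.

```python
import operator
import xml.etree.ElementTree as ET

XML = """<a>bla bla introduction
<b><e>bla bla association rule mining</e>
<f>bla bla classification bla bla indexing</f>
bla bla data mining</b>
<c>bla bla text mining bla bla linguistics</c>
<d><g>bla bla text indexing</g></d>
</a>"""

RELATIONS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
             "<>": operator.ne, ">=": operator.ge, ">": operator.gt}

root = ET.fromstring(XML)
# ElementTree keeps no parent pointers, so build a child -> parent map.
PARENT = {child: parent for parent in root.iter() for child in parent}


def own_words(node):
    """Words directly inside node: its text plus the tails of its children."""
    parts = [node.text or ""] + [child.tail or "" for child in node]
    return " ".join(parts).split()


def nodes_with(word):
    return [node for node in root.iter() if word in own_words(node)]


def element_distance(x, y):
    """Sibling steps between x and y; None if they are not on one level."""
    if x is y:
        return 0
    px, py = PARENT.get(x), PARENT.get(y)
    if px is None or px is not py:
        return None
    siblings = list(px)
    return abs(siblings.index(x) - siblings.index(y))


def element_prox(left, right, relation, distance):
    """Pairs of node names (left node, right node) satisfying the relation."""
    compare = RELATIONS[relation]
    hits = []
    for x in nodes_with(left):
        for y in nodes_with(right):
            d = element_distance(x, y)
            if d is not None and compare(d, distance):
                hits.append((x.tag, y.tag))
    return hits
```

Under these assumptions, element_prox("association", "mining", "=", 0) finds only node e, element_prox("association", "classification", "=", 1) finds the pair e and f, and element_prox("data", "classification", ">", 0) finds nothing.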
We can also search vertically in the tree using the 'parent' or 'child' proximity units. Parent searches upwards in the tree, child searches downwards. Note that there is not a default ordering (unless the server defaults the search to ordered), meaning that the two units function identically for unordered searches. This will be covered in the examples below, but the recommended best practice would be to default parent/child proximity searches to ordered. Distance=0 is the same as for element -- within the same node -- but the recommended best practice would be that distance is always 1 or greater, and default to 1.
The query:
linguistics prox/unit=tree.parent/distance=1 introduction
would match because node C (linguistics) has the parent A (introduction).
The query:
introduction prox/unit=tree.child/distance>1 indexing
would match A and F, as well as A and G.
The query:
mining prox/unit=tree.child/distance=1/unordered data
would match E (mining) and B (data), even though B is the parent of E, because the query says that it should not care which order the terms appear in.
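The parent and child examples can be sketched in the same way. The code below assumes that a term belongs to the node directly containing its text, that 'parent' looks for the right-hand term above the left-hand one and 'child' below it, and that an unordered search accepts either direction. The function names are hypothetical, and the default of ordered=True follows the recommended best practice above.

```python
import operator
import xml.etree.ElementTree as ET

XML = """<a>bla bla introduction
<b><e>bla bla association rule mining</e>
<f>bla bla classification bla bla indexing</f>
bla bla data mining</b>
<c>bla bla text mining bla bla linguistics</c>
<d><g>bla bla text indexing</g></d>
</a>"""

RELATIONS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
             "<>": operator.ne, ">=": operator.ge, ">": operator.gt}

root = ET.fromstring(XML)
PARENT = {child: parent for parent in root.iter() for child in parent}


def nodes_with(word):
    """Nodes whose own text (text plus child tails) contains the word."""
    def own_words(node):
        parts = [node.text or ""] + [child.tail or "" for child in node]
        return " ".join(parts).split()
    return [node for node in root.iter() if word in own_words(node)]


def steps_up(node, ancestor):
    """Number of parent steps from node up to ancestor, or None."""
    steps = 0
    while node is not None:
        if node is ancestor:
            return steps
        node = PARENT.get(node)
        steps += 1
    return None


def vertical_prox(left, right, unit, relation, distance, ordered=True):
    """Pairs (left node, right node) at a matching vertical distance."""
    compare = RELATIONS[relation]
    hits = []
    for x in nodes_with(left):
        for y in nodes_with(right):
            # parent: right term sits above the left term; child: below it.
            d = steps_up(x, y) if unit == "parent" else steps_up(y, x)
            if d is None and not ordered:
                # Unordered: accept the opposite direction as well.
                d = steps_up(y, x) if unit == "parent" else steps_up(x, y)
            if d is not None and compare(d, distance):
                hits.append((x.tag, y.tag))
    return hits
```

Under these assumptions the unordered mining/data query pairs e with b, while the same query with ordering applied finds nothing, which illustrates why defaulting parent/child searches to ordered matters.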
We may also want to search vertically and horizontally through the tree at the same time.
For example, we might wish to find association (E) in the same subtree as linguistics (C). In this case, the top node A is the common parent of those nodes, but in a larger tree, A might be a child of another node. In order to accommodate this search, we have another proximity unit 'subTree'.
The distance is the vertical distance at which a common ancestor must be found.
So the query:
association prox/unit=tree.subtree/distance<3 linguistics
would match a subTree with A as the common parent.
The query
association prox/unit=tree.subtree/distance=2 linguistics
would not match, as the distance from linguistics (C) to A is 1, not 2.
However the query
indexing prox/unit=tree.subtree/distance=2 mining
would match nodes E and G, both of which have a distance of 2 to their common parent A.
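One reading of the subTree examples that is consistent with all three of them is: find the lowest common ancestor of the two nodes, and require the vertical distance from each node to that ancestor to satisfy the relation. The sketch below implements that reading only; it is an assumption, not a definition, and other readings (for example, accepting any common ancestor) would give different results for some pairs. The function names are hypothetical.

```python
import operator
import xml.etree.ElementTree as ET

XML = """<a>bla bla introduction
<b><e>bla bla association rule mining</e>
<f>bla bla classification bla bla indexing</f>
bla bla data mining</b>
<c>bla bla text mining bla bla linguistics</c>
<d><g>bla bla text indexing</g></d>
</a>"""

RELATIONS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
             "<>": operator.ne, ">=": operator.ge, ">": operator.gt}

root = ET.fromstring(XML)
PARENT = {child: parent for parent in root.iter() for child in parent}


def nodes_with(word):
    """Nodes whose own text (text plus child tails) contains the word."""
    def own_words(node):
        parts = [node.text or ""] + [child.tail or "" for child in node]
        return " ".join(parts).split()
    return [node for node in root.iter() if word in own_words(node)]


def ancestors(node):
    """The node itself followed by its ancestors, nearest first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    return chain


def subtree_prox(left, right, relation, distance):
    """Triples (left node, right node, common ancestor) that match."""
    compare = RELATIONS[relation]
    hits = []
    for x in nodes_with(left):
        for y in nodes_with(right):
            chain_y = ancestors(y)
            for dx, node in enumerate(ancestors(x)):
                dy = next((i for i, n in enumerate(chain_y) if n is node), None)
                if dy is not None:  # lowest common ancestor reached
                    # Assumption: both vertical distances must satisfy
                    # the relation.
                    if compare(dx, distance) and compare(dy, distance):
                        hits.append((x.tag, y.tag, node.tag))
                    break
    return hits
```

Under this reading, association and linguistics match at distance<3 with a as the common ancestor but not at distance=2, and indexing and mining match at distance=2 through nodes g and e.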
MARC
The two indicators and subfields are the proximity elements, as they are the structure in which the data will be found.
In XML, an 'element' is an element or an attribute.
(
(
txt.object = cancer
prox/unit=txt.clause/distance=0
txt.verb = kill
)
prox/unit=txt.sentence/distance<2
txt.fulltext = seminoma
)
prox/unit=tree.subtree/distance<3
txt.fulltext = chemotherapy
Find a single clause that has the verb 'kill' with the object containing 'cancer', within the same or adjacent sentences as the word 'seminoma', all of which must be within a subtree distance less than 3 of the word 'chemotherapy'. (!)