Index Modifiers

Version: 0.6/2006-02-16
Status: Pending discussion

Introduction

The final version of the sort proposal for CQL 1.2 uses modifiers that are attached to an index instead of a relation or boolean. This raises the obvious question of whether these should be allowed in the main part of the query. These queries would look like:

 dc.title/mods.alternative any fish
 marc.008/substring="0,6" eq "200776"  
 dc.description/pos.noun any ram

Points of Discussion

Non-proliferation of Indexes

One modifier can replace several indexes by being combined with the base index. For example, we could create a 'role' modifier and qualify the 'dc.creator' index in several ways:

    dc.creator/role=illustrator any name
    dc.creator/role=editor any name
    dc.creator/role=photographer any name
As the role modifier further clarifies the semantics of the creator index, it removes the need to create the indexes:
    foo.illustrator any name
    foo.editor any name
    foo.photographer any name
However, indexes are free. There is no limit to the number of indexes that we can support, because there is no global pool from which they must all be drawn. This is unlike the BIB1 attribute set, where each index consumes a number. Since a new index costs nothing, replacing indexes with modifiers brings no real gain.

Ease of Documentation

Instead of having to document each new index, with index modifiers you can document the modifier once and allow it to be applied wherever appropriate.

However, this appropriateness needs to be documented.

The burden of documentation is not reduced; it is simply moved around. This will create more confusion, as the documentation is split up rather than attached to the index.

Dynamic Index Names

We already have the notion of a context set where the documentation does not specify all possible indexes; instead, it specifies how those index names can be constructed. The MARC context set is a perfect example.

By using this convention, we can have many of the advantages of index modifiers without the added complexity for developers and users. Instead of

    marc.100/subfield=a
it is easier for everyone to have
    marc.100$a
even if the documentation does not enumerate all the possible field and subfield combinations.
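
As a rough illustration of how a server might resolve such constructed names, the sketch below splits an index name of the form marc.100$a into its field and subfield. The pattern (a three-digit field, optionally followed by '$' and a one-character subfield code) and the function name parse_marc_index are assumptions made for this example, not part of the MARC context set definition.

```python
import re

# Assumed naming convention for this sketch only: "marc." + three-digit
# field, optionally followed by "$" and a one-character subfield code.
_MARC_INDEX = re.compile(r"marc\.(\d{3})(?:\$([a-z0-9]))?")


def parse_marc_index(name):
    """Return (field, subfield) for a dynamic MARC index name, or None."""
    match = _MARC_INDEX.fullmatch(name)
    if match is None:
        return None
    # group(2) is None when no subfield was given.
    return match.group(1), match.group(2)
```

No enumeration of field and subfield combinations is needed; one rule covers every name that follows the convention.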

ISO11179

The recommendation on how to construct compound index names using the ISO 11179 conventions is sufficient. During the discussion that led to the best practice document, splitting the index up into the 11179 components was discussed but discarded, because it was easier to simply concatenate the components together into the index name. It is still easier to do this.

Explain

It is possible to include in ZeeRex that an index supports a relation modifier, and in the same way it would be possible to explain the support for an index modifier. However, this works only for simple modifiers that have just a type, with no comparison symbol or value. For example:

    <index>
      <map><name set="dc">title</name></map>
      <configInfo>
        <supports type="indexModifier">pos.noun</supports>
      </configInfo>
    </index>
By contrast, it would not be possible to explain that role=editor was supported.

Because the index name is not sufficient to perform the search, the complexity of transforming an explain document into a search interface is greatly increased. If the modifiers can take values specific to a context set or, worse, to an index, this becomes impossible without prior knowledge of the context sets in question. As we cannot know all context sets (they can be created at will, without registration), such prior knowledge cannot be assumed.
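
To show what a client could and could not learn from such a record, here is a minimal sketch that reads the index fragment from the example above. The 'indexModifier' supports type is the hypothetical one used in that example, the function name is invented, and XML namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The <index> fragment from the example above (namespaces omitted).
FRAGMENT = """
<index>
  <map><name set="dc">title</name></map>
  <configInfo>
    <supports type="indexModifier">pos.noun</supports>
  </configInfo>
</index>
"""


def supported_index_modifiers(index_xml):
    """Return (qualified index name, [modifier names]) for one <index>."""
    index = ET.fromstring(index_xml)
    name = index.find("map/name")
    qualified = "{}.{}".format(name.get("set"), name.text)
    # Only the modifier name is available; there is no slot for a
    # comparison symbol or for the values a modifier would accept.
    modifiers = [
        supports.text
        for supports in index.findall("configInfo/supports")
        if supports.get("type") == "indexModifier"
    ]
    return qualified, modifiers
```

A client can discover that dc.title accepts pos.noun, but nothing in the record would tell it that a role modifier accepts 'editor'.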

Semantic Explicitness

By constraining the semantics of an index, we enable the search to fall back easily to the base index if the more explicit combination is not supported. For example, we could search dc.creator if dc.creator/role=illustrator is not supported.

This could not be done by the server without an extension. That extension could also specify in the request which index(es) to 'back off' to. The client could not automatically just repeat the search without the modifier as some modifiers would be critical to the search, for example the substring modifier applied to field 008 in the marc context set. There is not as great a gain here as it appears at first glance.
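
The difficulty with automatic back-off can be sketched as follows. The set of 'critical' modifiers is purely hypothetical; knowing which modifiers belong in it is exactly the information a generic client lacks. The splitting on '/' is deliberately naive and ignores quoted values that contain a slash.

```python
# Hypothetical list of modifiers that change what is matched and so cannot
# be dropped safely. A real list would have to come from the context sets.
CRITICAL_MODIFIERS = {"substring"}


def back_off(index_with_modifiers):
    """Return the bare index to retry with, or None if back-off is unsafe."""
    base, *modifiers = index_with_modifiers.split("/")
    for modifier in modifiers:
        # Keep only the modifier name, dropping any "=value" part.
        name = modifier.split("=", 1)[0]
        if name in CRITICAL_MODIFIERS:
            return None
    return base
```

Here dc.creator/role=illustrator backs off to dc.creator, while marc.008/substring="0,6" yields no safe fallback.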

Implementation

CQL parsers will need to be updated for sort, so they can be updated at the same time to parse modifiers attached to any index, not just those following 'sortBy'. As there is a very limited number of CQL parser implementations (as far as is known) and most of them are developed by the people writing the spec, or are ports of those parsers to other languages, the effort involved will not be significant.

However, the individual server and client implementations still need additional work to use the results of the updated parsers. It likely complicates the mapping from CQL into the internal search format. Even if the server uses CQL internally, there must at some point be a single list of terms to search, and getting from index to identifier for a term list is easier than getting from index with modifiers to that identifier.
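
The difference in mapping effort might look like the sketch below. All term-list identifiers and both table layouts are invented for illustration; no particular server is known to work this way.

```python
# Hypothetical term-list identifiers, invented for this sketch.
FLAT_TERM_LISTS = {
    "dc.creator": "creator_terms",
    "foo.illustrator": "illustrator_terms",
}

# With modifiers, the key must combine the index with a normalised set of
# (modifier, value) pairs so that modifier order does not matter.
MODIFIED_TERM_LISTS = {
    ("dc.creator", frozenset()): "creator_terms",
    ("dc.creator", frozenset({("role", "illustrator")})): "illustrator_terms",
}


def term_list_for_flat(index):
    """Flat index names: a single dictionary lookup."""
    return FLAT_TERM_LISTS.get(index)


def term_list_for_modified(index, modifiers):
    """Index plus a dict of modifiers: build a composite key first."""
    key = (index, frozenset(modifiers.items()))
    return MODIFIED_TERM_LISTS.get(key)
```

The second lookup is not hard, but every server must now decide how to normalise modifiers and what to do with combinations it has no term list for.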

Although currently there are only a few CQL parsers, it is to be hoped that there will be more implementations in the future. By adding index modifiers, we raise the bar on that implementation, reducing the likelihood of new parsers, and hence reducing the likelihood (albeit minimally) of new SRU implementations.

Relation Modifiers

By introducing a second place to put a modifier in a search clause, we introduce confusion as to what should be a relation modifier and what should be an index modifier. For example, if you want to search for 'fish' as a noun in the title, is it:

    dc.title/pos.noun any fish
or:
    dc.title any/pos.noun fish

It would be possible to specify a query term format and a different document term format:

    dc.title/string any/number 16
might search for 'sixteen'. But if this was desirable, it could be implemented as two relation modifiers:
    dc.title any/docFormat=string/queryFormat=number 16
and no requirement for this has been raised anyway.

Anything that could be specified as an index modifier could be specified as a relation modifier, although less clearly than a new index:

    dc.title any/mods.alternative fish
However this does have some utility for the modifiers discussed in examples:
    dc.title any/pos.noun fish
    marc.008 any/substring="0,6" 200776

Some rule would need to be created to determine what should be an index modifier, and what should be a relation modifier. It seems very likely that some current relation modifiers would fail this test and need to be shifted, creating backwards incompatibilities.

Proximity

Many of the use cases for semantic qualification of indexes can be supported already by proximity. If there were some reason not to create new flat indexes, for example if there were too many dynamic values to be feasible, you could implement it as same-element proximity.

    dc.subject/authority=lcsh any music
could become:
    dc.subject any music prox/unit=element/distance=0 authority = lcsh

While more verbose and potentially more difficult to implement, the capability to express the search is still available.
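
The rewrite above is mechanical, as this small sketch suggests. It assumes a single modifier with an unquoted value, and it assumes that the modifier name ('authority' here) is also usable as an index name, as in the example; the function name is invented.

```python
def to_proximity(index_with_modifier, relation, term):
    """Rewrite index/name=value REL term as a same-element prox query."""
    base, modifier = index_with_modifier.split("/", 1)
    name, value = modifier.split("=", 1)
    # distance=0 with unit=element means both clauses must match
    # within the same element.
    return "{} {} {} prox/unit=element/distance=0 {} = {}".format(
        base, relation, term, name, value
    )
```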

Syntactic Ambiguity

Possibly the most damning of all objections is that allowing modifiers on an index when it is part of a search clause introduces ambiguity into the language. Is:
    dc.title/role = editor
a complete search clause (= is a relation, editor is the term) or an index with a modifier (= is the comparison symbol for the modifier, editor is the modifier value)? Such ambiguity is to be avoided at all costs.
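
To make the two readings concrete, the toy function below enumerates both interpretations of such a clause. It is not a CQL parser: it handles only three whitespace-separated tokens of the form index/modifier = value, and the dictionary layout is an assumption of this sketch.

```python
def readings(clause):
    """Return both interpretations of 'index/modifier = value', if any."""
    tokens = clause.split()
    if len(tokens) != 3 or tokens[1] != "=" or "/" not in tokens[0]:
        return []
    base, modifier = tokens[0].split("/", 1)
    value = tokens[2]
    return [
        # Reading 1: a complete search clause; the modifier has no value,
        # "=" is the relation and the last token is the term.
        {"index": base, "modifiers": [(modifier, None)],
         "relation": "=", "term": value},
        # Reading 2: only an index so far; "=" is the modifier's comparison
        # symbol and the relation and term are still to come.
        {"index": base, "modifiers": [(modifier, value)],
         "relation": None, "term": None},
    ]
```

Both results are structurally valid for dc.title/role = editor, and nothing in the input lets a parser choose between them without lookahead or extra rules.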

It would be possible to create a new special character token, for example ':', and use this to separate index modifiers from their values. For example:

    dc.title/role:editor
However, the above would already be valid CQL if we allow index modifiers: it has a single modifier called 'role:editor' with no value.

New special characters should also be avoided if possible. If we created one new character to replace '=', we should also create new special tokens for <=, <, >, >= and <>. This makes the language considerably more complicated and less intuitive. It also means that many more search terms would need to be quoted.

Recommendation

There are two appropriate Mike-Taylor-isms:

Don't Do That Then.
Just Say No.

If there is any advantage to allowing index modifiers outside of the sort keys, it is well and truly negated by the disadvantages. While it would be syntactically possible to allow only single named modifiers, the advantage over treating such a modifier as a relation modifier seems minimal, if any. The other style leads to syntactic ambiguity or a radical change to the language, increasing complexity for little gain.

In the sort keys, there is no ambiguity, as they are completely separate from the main query. There is no reason to change the sort specification, but there are reasons not to allow index modifiers in the main part of the query.