Index Search Server Configuration

The following describes the configuration changes required to enable Weave indexes on the server in order to allow fast indexed search.



For the client to have something to search for, indexes must be configured and entities must be indexed on the server. To provide the performance required for interactive searching, the indexes must be pre-generated and used for the search, rather than searching through the database directly each time the user performs a search. To do this, Weave creates a free text index from information that you provide from at least one data definition.

To understand how the index needs to be set up, it helps to describe how Weave performs the search. When the user types some search terms into the search input box, the client waits a fraction of a second, sends the contents of the input box to the server and then waits for the list of matches to be returned for display. It's what happens when the input gets to the server that we're interested in for now.

When the text arrives at the server, the server breaks it into individual words by splitting on the spaces in the text. It's these words, individually called "keywords", that Weave actually uses to search through the index. This leads us to the first important part of building our index, which is generating a list of keywords for each entity.

As part of building the index, Weave creates what's referred to as a "document" for each and every entity that you want to be searchable (an index is just a collection of documents). Each document contains the unique id of the entity, so the server knows what's been found, and a list of keywords associated with that entity. It's these keywords, matched against what the user types in, that determine what's returned as the results of the search.

The contents of the keyword fields in each document are obtained by Weave from a data definition that you have to set up. This data definition links each entity id to a list of columns in the database that contain the keywords to attach to the document, and it's up to you to know which columns are appropriate for the search. For example, you could attach registration numbers to dogs, owner names to properties, names to roads or business names to local businesses. Weave and the index builder don't particularly care what content you include in the keywords (apart from some smarts that we'll look at later). The builder just adds the text to the documents, and later, when performing the search, Weave uses its smarts to match the user-supplied keywords with those included in each document.

So we've seen how the keywords the user supplies, and those the index builder associated with each document, tell Weave what's been searched for. After that we need to display the results to the user, and to do this we use 'display' fields from the database. These are set up in basically the same way as the keywords, but the contents of the display fields aren't searched. Instead, they're just returned to the client and used in the input box drop-down list to display the results to the user. This is again done by associating a data definition with the index.

Before we have a look at an actual index definition, the last piece of information we need to supply is the entity that the index is being generated for, which is done by adding an 'entity' tag to the index definition. Telling the index which entity it's associated with also provides the other piece of information it needs to get started: the geometry that goes along with each document. Using the spatial mapper that's associated with the entity, the index obtains the extent and centroid of each entity. These are stored in the document along with the entity id, list of keywords and display fields (and some other supporting fields), so that most of the information the user requires is quickly available after the search.

If we look at an example of creating a simple index for roads, where the user can search for a road in a suburb, we need some supporting information. Firstly, we need the entity that we're going to be searching for and the spatial mapper that provides its geometry, which would be something like:

<!-- Create our roads entity -->
<entity:entity id="roads">
  <label>Roads</label>
</entity:entity>

<!-- Link the roads entity to the ROADS layer in our spatial engine (not shown) -->
<mapper:mapper id="roads.mapper">
  <spatialEngine>spatialEngine</spatialEngine>
  <mapping>
    <entity>roads</entity>
    <table>ROADS</table>
    <key>ROAD_ID</key>
  </mapping>
</mapper:mapper>

Next, we need a data definition that supplies the keywords and display fields, which could be separate data definitions, but we'll use the same one. An example of what our data definition may look like is as follows:

<!-- Provide a road name, type and suburb based on ROAD_ID from the ROADS table -->
<data:datadefinition id="dd_index_roads">
  <datasourcedataconnection datasource="datasource" key="ROAD_ID">
    <prefix>DISTINCT</prefix>
    <from table="ROADS"/>
    <parameter name="name" column="NAME"/>
    <parameter name="type" column="TYPE"/>
    <parameter name="suburb" column="SUBURB"/>
  </datasourcedataconnection>
</data:datadefinition>

Using the above information a simple index definition would look something like this:

<index:entity id="index.roads">
  <entity>roads</entity>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>

So what we end up with here is an index called 'index.roads' which indexes 'roads' entities based on the road 'name', 'type' and 'suburb', and displays the road and suburb details to the user.

Sorting

Further to this we can now (as of version 1.1.0 of the index bundle) also sort the results.
To do this you add a <sort> tag to the configuration to tell the index builder what information to attach to each indexed document that's used to determine the sort order.

Sorting does not affect which results are returned to the user; the weighting of the individual documents still does that. Sorting just determines the order in which the results are shown to the user.

When adding sorting to your indexes it's best to either add sort information to all indexes or none if you're searching across all entities (if you've configured the search functionality on the client to only search the active entity then this doesn't apply).

This is because the search operation is different when sorting is involved. If you're searching over multiple indexes (or entities), then so as not to impose the overhead of searching twice (once for sorted indexes, once for non-sorted), Weave will search only sorted indexes or only non-sorted indexes. When performing a search across all entities the server first looks for all indexes that have sorting configured and uses just those for the search, unless there are no indexes configured for sorting, in which case it uses all of the indexes (and assumes that none are sorted). What this means is that if you're searching across all entities and only some of your indexes have sorting configured, then only those indexes will be searched and none of the entities in your non-sorted indexes will be found.

Anyway, adding sorting to an index is the same as adding display and keywords, but is limited to a single level. When you add a sort to an index, an extra processing step iterates over the data definition configured for the sort and constructs a sort field for each indexed document. That field is then used during the search to order the returned results (as opposed to the default order, which is based on the weighting of the found documents).

So if we wanted our roads sorted so they're ordered by the suburb they're in followed by the road name then we could do the following:
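A minimal sketch of such a sort, reusing the dd_index_roads data definition from the earlier example. The child tags of the sort section are assumed to mirror those of display and keywords, as described above, with only level1 allowed:

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
  <!-- Assumed layout: order by suburb first, then by road name -->
  <sort>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${suburb} ${name}</level1>
  </sort>
</index:entity>
```

After changing the definition the index needs to be rebuilt before the new order shows up in search results.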

Since the level in the sort works the same as the display and keywords, we can add additional text to the sort, and use this to ensure that the different types of entities found are displayed to the user in a certain order. For example, if we search-enabled suburbs, roads and property addresses, we could use the sort field to ensure that suburbs are always listed first, followed by roads and then properties (regardless of how "well" the actual results match the search). To do this with our previous example we could add a prefix of 0020 to the start of the level1 tag of the sort, and then make sure that our sort for suburbs had the level prefix set to 0010 and properties set to 0030. This way the suburbs, roads and properties will always be returned in that order, and the original sorting we specified (suburb and road name in the previous example) will be used within those groups.
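The prefixed sort for the roads index could be sketched as follows. The 0020 value follows from the 0010 and 0030 prefixes given for suburbs and properties; the tag layout is the same assumed layout as for display and keywords:

```xml
<!-- Roads index: sorts after suburbs (0010) and before properties (0030) -->
<sort>
  <datadefinition>dd_index_roads</datadefinition>
  <level1>0020 ${suburb} ${name}</level1>
</sort>
```

Using fixed-width prefixes such as 0010, 0020 and 0030 keeps the comparison stable and leaves room to slot other entity types in between later.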

A further example of this can be seen if you have address data that stores house numbers as a separate field. When creating the sort field for your property addresses you would add the house number to the end of the sort field to ensure that the properties are returned in house number order (but remember to left pad the field in the data definition with 0's to ensure the sort is performed numerically rather than alphabetically).
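As a rough sketch of that idea, the sort for a property index might look like the following. The dd_index_properties data definition and its suburb, street and number parameters are hypothetical names used only for illustration, and the number parameter is assumed to already be left padded by the data definition (for example 00012 rather than 12):

```xml
<!-- Hypothetical property index sort; all names here are placeholders -->
<sort>
  <datadefinition>dd_index_properties</datadefinition>
  <level1>0030 ${suburb} ${street} ${number}</level1>
</sort>
```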

Detail

In detail, what happens with this information when it comes time for Weave to build the index is that the index builder will iterate over each and every feature returned by the spatial mapper associated with the entity indicated by the index. For each feature it finds it creates a new document in the index and to that document it attaches the entity type, the entity id, the entity centroid and the entity extent.

After it's created a document for each available entity, it processes the display definition to add the fields that will be displayed to the user. It iterates over the data definition set in the display configuration and, for level1 and level2, replaces each ${} marker with the matching parameter from the data definition, then stores the resulting text as the content of the field in the document. As you can see from the example above, any text can be included in the display configuration, including HTML; "Road: " and "Suburb: " are examples of additional text that will be sent to the user.

A display configuration is limited to two level tags, that is, you can only specify level1 and level2.

The index builder then processes the keyword fields using the same process it used for the display fields, but in this case, there can be up to 5 levels set (we only use 2 in the example above). Again, you can add your own text to the keywords (which hasn't been done in the example above) but it doesn't make sense to include HTML in there, since this is the text that's going to be searched for a match. This could be done for example to add the word 'ROAD' to the keywords index to allow the user to help narrow down the search to just roads if there were other indexes setup for other entities by including 'road' in the search field (I know that's a bad example since road type will include 'road' anyway but you get the idea).
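A sketch of the 'ROAD' keyword idea, based on the roads example. It assumes that a level may consist of literal text only; if that turns out not to be accepted, the same word could instead be appended to an existing level:

```xml
<keywords>
  <datadefinition>dd_index_roads</datadefinition>
  <level1>${name} ${type}</level1>
  <level2>${suburb}</level2>
  <!-- Literal text: every roads document also gets the keyword ROAD,
       at a lower weighting than the name, type and suburb -->
  <level3>ROAD</level3>
</keywords>
```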

Finally, an additional run-through is performed if sorting is configured for the index.

Weighting

The levels in the keywords give us our introduction to weighting in the index. Weighting allows you to set higher priority to some database fields compared to others, and it does this by giving lower-numbered levels a higher weighting than higher-numbered levels. That is if a document is found that has a match in the level1 field it will be returned higher in the list than another document that may have the same value but in the level2 field, which will be returned before another document that has the same value in the level3 field, and so on.

From our example above we can see that road name and road type will be given the same weighting, the highest, and suburb will be given a slightly lower weighting. It's also possible to attach a weighting explicitly to the index as a whole by adding a 'weight' tag to the index definition. The tag contains a number that's used to multiply the weighting, with the default being 1.0. So by setting a weight of 2.0 the documents in the index will be twice as likely to be returned as those in another index that has the same values but the default weighting of 1.0, while setting the weight to 0.5 will halve the chance of those documents being returned.
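For illustration, a weight of 2.0 on the roads index might be written like this. The position of the weight tag directly inside the index definition is an assumption:

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <!-- Multiplies the weighting of every document in this index (default 1.0) -->
  <weight>2.0</weight>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>
```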

As of version 1.8.17 of the com.cohga.server.index bundle you can also specify weight values for individual records. The administrator can define a weights element inside the index, comprising a data definition and a value. The value names the parameter in the data definition that supplies the weight (for example, a parameter called weight). The weight is applied to the feature to increase its score when searching through the index.
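A sketch of how per-record weights might be configured. The child tag names (datadefinition and value) are inferred from the description above rather than confirmed, and the dd_index_roads_weight data definition and its WEIGHT column are hypothetical:

```xml
<!-- Hypothetical data definition supplying one weight per road -->
<data:datadefinition id="dd_index_roads_weight">
  <datasourcedataconnection datasource="datasource" key="ROAD_ID">
    <from table="ROADS"/>
    <parameter name="weight" column="WEIGHT"/>
  </datasourcedataconnection>
</data:datadefinition>

<!-- Placed inside the index definition; tag names are assumptions -->
<weights>
  <datadefinition>dd_index_roads_weight</datadefinition>
  <value>weight</value>
</weights>
```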

Keyword Smarts

As mentioned before, there are some smarts built into the index builder when processing the keywords: the first is synonyms and the second is number range expansion.

Synonyms

Synonyms allow you to specify a text file that provides alternate keywords for those found in the database. This can be done for simple things like including "STREET", "STR" and "ST" as keywords when the database provides just "ST" (or just "STR" or just "STREET"), so the user can use any of those values to search for streets. Weave supplies a text file with a list of common street type synonyms as part of the indexer installation that can be used with any index.

Synonyms can also be used to provide completely different words as synonyms, for example, 'pharmacist' can be set as a synonym for 'chemist', unlikely to be of much use in our roads example (unless you wanted to set up a synonyms list of the street names) but handy if you want people to be able to search for business types and want to catch the different business types that a business could be.

There are two formats of synonym file. The first is for when each of the synonyms refers to the same word, for example in our street abbreviation file, where ST, STR and STREET are abbreviations for the same word. The second is for when the words are alternatives in one direction but may not apply in reverse; for example, 'revoke' and 'abandon' could be synonyms for 'vacate', but 'revoke' shouldn't be a synonym for 'abandon'.

The first format has a single line for each group of synonyms with the alternatives separated by a comma, for example:

And the second format has the original word followed by an equals sign and a space-separated list of alternatives, for example:

To add synonyms to an index you must add at least one 'synonyms' tag to the index and rebuild the index.
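A sketch of what this might look like. It assumes the body of the synonyms tag names the synonym file, and street_types.txt is a placeholder rather than the name of the file shipped with Weave. The comments show what lines in each of the two file formats described above would plausibly look like, with exact spacing being a guess:

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <!-- Format 1, one group per line, e.g.:  ST,STR,STREET -->
  <!-- Format 2, one-directional, e.g.:     vacate=revoke abandon -->
  <synonyms>street_types.txt</synonyms>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>
```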

Number Range Expansion

When adding keywords to a document the index builder will look for keywords that look like number ranges and expand those to include the individual numbers within the range. This way if the database contains "11-14" as one of the fields then the index builder will include "11-14" as one of the keywords, but it will also include "11", "12", "13" and "14" as separate keywords (at the same weighting as the original keyword). This is done to help the user find those likely matches when they search for part of the number range or a number within the range.

In fact the range expansion is more complex than that and can handle a wide range of different formats, including the following examples:

Original    Additional
12A         12
1/12        12
1/12A       12A 12 1/12
1A/12A      1/12A 12A 12 1/12
1A/12       1/12 12
10-14       10 14 11 12 13
1/10-14     10-14 10 14 11 12 13 1/10 1/14 1/11 1/12 1/13
1A/10-14    1/10-14 10-14 10 14 11 12 13 1/10 1/14 1/11 1/12 1/13 1A/10 1A/14 1A/11 1A/12 1A/13

Number range expansion is automatic and does not require any changes to the index definition to be enabled, which also means that at the moment it can't be disabled, but that may change in the future.

Stop words

Lucene, the document search technology behind the Weave index search, uses a pre-defined list of "stop words". Stop words typically include common words like "the", "and", "in", "is" and so on, which appear frequently in a language but don't carry significant meaning on their own. As of Weave 2.6.9 stop words are no longer used when building an index. Stop words are generally useful when indexing large pieces of text, whereas Weave indexes are created on small pieces of text, where these words should be included.

To restore the previous list of stop words, or set your own, you can create a new configuration item that lists the stop words that should be used.

Scheduling Updates

Because the index is built from the data that's available at the time it's built, it may become stale over time and require rebuilding. This can be done manually at the OSGi console (more on that later) or set up in the index definition using a schedule defined in a format similar to the Cron format.

By adding a schedule tag you can indicate to Weave when the index can be rebuilt down to the millisecond, and have it rebuilt at certain times each day or on certain days of the week (or a combination of these).

Unit            Range
milliseconds    0-999
seconds         0-59
minutes         0-59
hours           0-23
day of week     1-7 (sunday-saturday)
day of month    1-31
month           1-12



Schedule            Description
0 0 30 2            will run at 2:30am each day
0 0 30 2,14         will run at 2:30am and 2:30pm each day
0 0 30 2,8,14,20    will run at 2:30am, 8:30am, 2:30pm and 8:30pm each day
0 0 30 2 3          will run at 2:30am each Tuesday
0 0 30 2 * 1        will run at 2:30am on the first of each month (do not use*)
0 0 30 2 * 1 2      will run at 2:30am on the first of February each year (do not use*)
0 0 30 2 5 * 2      will run at 2:30am on each Thursday of February each year (do not use*)
0 0 30 2 4 1 2      will run at 2:30am on each Wednesday and on the first of February each year
0 0 15,45           will run every half hour at quarter past and quarter to

0 0 15,45

will run every half hour at quarter past and quarter to

*It appears that using * for the day of the week or the day of the month may cause issues (like the index continually being built).

Note that since building the indexes can be CPU intensive you should stagger the rebuilding so that you don't try and rebuild more than one index at a time.

Scheduling an index build for 2:30am each day
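A sketch of the schedule this caption describes, assuming the schedule tag sits directly inside the index definition and takes the schedule string as its body. The fields are given in the order of the unit table above (milliseconds, seconds, minutes, hours):

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <!-- 0 ms, 0 s, 30 min, hour 2: rebuild at 2:30am each day -->
  <schedule>0 0 30 2</schedule>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>
```

With several indexes, giving each a different minute or hour value is one way to stagger the builds.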

Command Line

The indexing in Weave provides a number of commands that can be used at the OSGi prompt to work with the indexes.

Command    Parameters                                                          Description
is                                                                             return a list of all indexes
ib         [<index#>|<indexId>]                                                rebuild an index or all indexes if no index is specified
ik         [<index#>|<indexId>]                                                update keyword fields for an index or all indexes if no index is specified
id         [<index#>|<indexId>]                                                update display fields for an index or all indexes if no index is specified
ig         [<index#>|<indexId>]                                                update geometry field for an index or all indexes if no index is specified
io         [<index#>|<indexId>]                                                update sort field for an index or all indexes if no index is specified
iu         [<index#>|<indexId>]                                                unlock an index or all indexes if no index is specified
ir         [<index#>|<indexId>]                                                remove an index or all indexes if no index is specified
it         "<search terms>"|id:<entityKey> [<entityId>|<indexId>] [<limit>]    test index

<> = substitute, [] = optional, | = alternate

<index#> is the value listed in the "Index" column of the is command, and is used to indicate which index to perform the operation on

<indexId> is the value listed in the "Id" column of the is command, and is used to indicate which index to perform the operation on

"<search terms>" is the text to search for, enclosed in double quotes if it contains spaces

id:<entityKey> is the text id: followed by the key value of a specific entity (not a type of entity), e.g. id:45142

<entityId> is the id (or type) of the entity to search for, and is the value listed in the "Entity" column of the is command

Update: The ib, ik, id, ig, iu, io and ir commands now also accept a space-separated list of index ids, or no index id parameter at all to perform the operation on all indexes.

Also, if multiple commands are submitted at once they'll be queued up so that only one command is performed at a time (this also goes for commands that are triggered through a schedule). This is to ensure that the server isn't overloaded with building indexes (you can imagine if you had 10 indexes and happened to type ib in the console and triggered the concurrent build of 10 indexes).

At the OSGi console you can use 'is' to see what indexes are currently registered in Weave

From this we can see that the 'idx.roads' index has not actually been built, from here we can use the 'ib' command to build the index (assuming we haven't setup a schedule that would build the index for us)

And then we can actually test our index without having to start the client



Manually updating indexes

The server status page contains links for initiating an index update, and the 'build' link in that page can be accessed by an external application to start an index build.
The link to start an index build would be http://hostname:8080/weave/server/index/build/<indexid>

Troubleshooting

The OSGi console can perform index searches using the 'it' (index test) command. It shows more detail about the results and runs at a slightly lower level than the client.
You should run any test you do through this command to see what's actually going on.
You may want to check out http://lucene.apache.org/core/old_versioned_docs/versions/2_9_4/queryparsersyntax.html for details on the search syntax when using the 'it' command.
Note: By default the Weave index search adds an * to the end of the search term, so to replicate what the client is doing you should also include an * at the end of the search term when using the it command.

There's also a standalone tool you can download http://www.getopt.org/luke/ that will allow you to open and look at the index directly.
Note: One thing to keep in mind is to change the Analyzer to the StandardAnalyzer from the KeywordAnalyzer, in the Analysis tab under the Search tab if you do any searches.
The Documents tab is handy because it allows you to cycle through the stored documents directly and see exactly what's stored for each one.

You need to make sure you're using synonyms if you're storing things like street types where the database contains 'RD' but the user may type in 'road' or 'rd'.

Also, punctuation can cause problems if used in a keyword field, meaning the user typing 'road' won't necessarily match the keywords if it's derived from the value '123 Main Road, Smallville'.

Updates for Weave 2.5

As of Weave 2.5 the way indexes are built has changed.
Previously they were generated based on the geometry first; now they're generated based on the attributes first, but only if the data definitions used for the keywords, display, sort and fields are the same.

To switch to the older method of generating indexes you can set

it as an option within the index config (and ensure all data definitions are the same). This flag was introduced in Weave 2.5.4.

Additionally, since all data definitions are the same they can be set once at the top level of the index config, rather than duplicated within each section.
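A sketch of the roads index with the shared data definition declared once. The name of the top-level tag is assumed to be the same datadefinition tag used inside the individual sections:

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <!-- Assumed: one data definition shared by display and keywords -->
  <datadefinition>dd_index_roads</datadefinition>
  <display>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>
```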

The index builder doesn't like it if there is a large change in the number of items in the index between one build and the next, since it normally indicates some sort of error, so rather than replacing the index with "bad" content it aborts the update. There are however situations where this is not an error, in which case you can turn off the check by adding <check>false</check> to the index definition.

If the key column values are unique for each entity in the spatial table(s) you can add <unique>true</unique> to the index config to provide a hint to the index builder that this is the case and it can optimize the building of the indexes.
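Both options might be combined as in the sketch below. The tags and values come from the text above; their placement directly inside the index definition is an assumption, and the unique hint should only be set if ROAD_ID really is unique in the spatial table:

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <!-- Skip the item-count sanity check between builds -->
  <check>false</check>
  <!-- Hint that the key column is unique per entity in the spatial table -->
  <unique>true</unique>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>
```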

Auto-building indexes

As of Weave 2.5.29, indexes will be built when they're created if their content does not already exist. Also, indexes that aren't built will be built when Weave starts. This can be disabled by setting the system property weave.index.autobuild to false or adding <autobuild>false</autobuild> to the index config.
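For a single index, the per-index form could be sketched like this (placement inside the index definition is assumed). The weave.index.autobuild system property would be the choice when auto-building should be turned off for every index at once:

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <!-- Do not build this index automatically on creation or at startup -->
  <autobuild>false</autobuild>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>
```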

Coordinate Reference System

You can specify what CRS to use for the stored centroid and envelope by setting a crs tag in the index config. If you don't specify a crs it will be the same as the original geometry.
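A sketch of a crs setting. The EPSG:4326 value and its "authority:code" form are assumptions about what the tag accepts, so use whatever CRS identifier format the rest of your Weave configuration uses:

```xml
<index:entity id="index.roads">
  <entity>roads</entity>
  <!-- Assumed value format; centroid and envelope stored in this CRS -->
  <crs>EPSG:4326</crs>
  <display>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>Road: ${name} ${type}</level1>
    <level2>Suburb: ${suburb}</level2>
  </display>
  <keywords>
    <datadefinition>dd_index_roads</datadefinition>
    <level1>${name} ${type}</level1>
    <level2>${suburb}</level2>
  </keywords>
</index:entity>
```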