From: Howe, David [mailto:David.Howe@auspost.com.au]
Sent: 07 March 2018 09:56
To: solr-user@lucene.apache.org
Subject: Highlighter throwing InvalidTokenOffsetsException for field with large number of synonyms
Hi all,
We are using Solr for indexing address data and one of the fields that we have contains the locality (e.g. suburb, town) with synonyms for the surrounding localities. This has to handle multi-word synonyms as the original locality may have one word but the surrounding locality may contain two words. We have found that when we have a large number of surrounding localities, the highlighting breaks with the exception:
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.lucene.search.highlight.InvalidTokenOffsetsException"],
"msg":"org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wail exceeds length of provided text sized 258",
"trace":"org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wail exceeds length of provided text sized 258\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:648)\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingOfField(DefaultSolrHighlighter.java:480)\n\tat
This only started happening after we realised that the StandardTokenizer's default maximum token length is 255 characters, which meant that not everything was being indexed, so we increased it using maxTokenLength on the StandardTokenizerFactory. Prior to this, only a limited number of surrounding localities were being processed, but highlighting was working with no errors. Once we increased the length so that all surrounding localities were loaded, this error started appearing when we ran our automated test suite, without us making any other changes.
These surrounding localities are stored in a database, so we have written our own token filter to handle building the synonyms. When we build the index, the localities are in a token that looks like:
lcx__balmoral__cannum__clear_lake__lower_norton
so the StandardTokenizer keeps this as a single token. Our filter looks for tokens that start with “lcx__” and then creates synonyms from the following data. For the above, we end up with tokens being output as follows:
Position 1                Position 2
balmoral                  lake
cannum (SYNONYM)          norton (SYNONYM)
clear (SYNONYM)
clearlake (SYNONYM)
lower (SYNONYM)
lowernorton (SYNONYM)
As you can see, we also combine two words into a single word as a synonym. I have attached the full output from the Solr analyser for this example at the end of this email. The definition of the field type for this field is:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
    "name":"localitySynonymType2",
    "class":"solr.TextField",
    "indexAnalyzer": {
      "tokenizer":{
        "class":"solr.StandardTokenizerFactory",
        "maxTokenLength": 4000
      },
      "filters": [
        {
          "class":"solr.LowerCaseFilterFactory"
        },
        {
          "class":"au.com.auspost.postal.ame.solr.LocalityTokenFilterFactory"
        }
      ]
    },
    "queryAnalyzer":{
      "tokenizer":{
        "class":"solr.StandardTokenizerFactory"
      },
      "filters": [
        {
          "class":"solr.LowerCaseFilterFactory"
        }
      ]
    }
  }
}' http://localhost:8983/solr/address/schema
echo "Creating surroundingLocalityNamesSynonym field"
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"surroundingLocalityNamesSynonym",
    "type":"localitySynonymType2",
    "stored":true,
    "indexed":true
  }
}' http://localhost:8983/solr/address/schema
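To make the intended output of the custom filter concrete, the lcx__ expansion described earlier can be modelled roughly as follows. This is an illustrative Python sketch, not the actual LocalityTokenFilterFactory code (which is Java and is not shown here). The names Token and expand_localities are hypothetical, and the rule that the first token at each position is typed <ALPHANUM> while the rest are typed SYNONYM is an assumption inferred from the table and the analysis output.

```python
from typing import List, NamedTuple

PREFIX = "lcx__"  # marker that identifies a packed locality token


class Token(NamedTuple):
    term: str
    position: int  # 1-based, as in the Position 1 / Position 2 table
    type: str      # "<ALPHANUM>" or "SYNONYM"


def expand_localities(raw: str) -> List[Token]:
    """Expand one packed lcx__ token into stacked tokens (hypothetical model)."""
    if not raw.startswith(PREFIX):
        return [Token(raw, 1, "<ALPHANUM>")]

    emitted = []  # (position, term) pairs in the order they are generated
    for locality in raw[len(PREFIX):].split("__"):
        words = locality.split("_")
        for index, word in enumerate(words):
            emitted.append((index + 1, word))
            if index == 0 and len(words) > 1:
                # Multi-word localities also get a joined single-word form.
                emitted.append((1, "".join(words)))

    # Stable sort: group by position, keep generation order within a position.
    emitted.sort(key=lambda pair: pair[0])

    tokens = []
    positions_seen = set()
    for position, term in emitted:
        # Assumption: first token at a position is the "original", rest are synonyms.
        token_type = "SYNONYM" if position in positions_seen else "<ALPHANUM>"
        positions_seen.add(position)
        tokens.append(Token(term, position, token_type))
    return tokens


for token in expand_localities("lcx__balmoral__cannum__clear_lake__lower_norton"):
    print(token.position, token.term, token.type)
```

The sketch only models terms, positions and types. It says nothing about the start and end offsets, which are what the highlighter actually checks.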
The term mentioned in the error message above is “wail”, which is the 48th locality in the list. In another test the offending term is the 64th locality in the list, so I think it has something to do with the length of the synonyms (as evidenced by the fact that if we remove the maxTokenLength from the StandardTokenizerFactory for these fields then everything goes back to working). It also appears to work without problem for addresses when the locality list is short.
I’m not sure where the length of 258 is coming from in the error message, as it doesn’t match up with anything that I can see.
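As far as I understand the Lucene Highlighter, this exception is raised when a token's start or end offset is greater than the length of the text handed to the highlighter, so 258 would presumably be the length of that text rather than anything in the schema. A minimal Python model of that check is below. The function name tokens_exceeding_text and the sample token tuples are hypothetical and only illustrate the comparison; they are not taken from the Lucene source.

```python
from typing import Iterable, List, Tuple

# (term, start_offset, end_offset), mirroring the offsets shown by the analyser
OffsetToken = Tuple[str, int, int]


def tokens_exceeding_text(text: str, tokens: Iterable[OffsetToken]) -> List[OffsetToken]:
    """Return tokens whose offsets point past the end of the text to highlight."""
    limit = len(text)
    return [t for t in tokens if t[1] > limit or t[2] > limit]


value = "lcx__balmoral__cannum__clear_lake__lower_norton"
tokens = [("cannum", 0, 47), ("lake", 0, 47)]

# Offsets fit when the highlighter sees the full 47-character value.
print(tokens_exceeding_text(value, tokens))

# If the highlighter were given a shorter text, the same offsets would overflow.
print(tokens_exceeding_text(value[:40], tokens))
```

Running a check like this over the analysed tokens and the stored value of a failing document might show which side (the offsets or the text length) is not what we expect.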
I have attached the full analysis for one of the data values that is causing the problem. In building our token filter, I have tried to follow what the standard Solr synonym filter produces as an example but I may have missed something.
Does anybody have any ideas about what might be causing this?
Thanks,
David
Field Value (Index)
lcx__balmoral__cannum__clear_lake__lower_norton

text         raw_bytes                           start  end  positionLength  type        termFrequency  position  keyword
cannum       [63 61 6e 6e 75 6d]                 0      47   1               SYNONYM     1              1         false
clear        [63 6c 65 61 72]                    0      47   1               SYNONYM     1              1         false
clearlake    [63 6c 65 61 72 6c 61 6b 65]        0      47   1               SYNONYM     1              1         false
lowernorton  [6c 6f 77 65 72 6e 6f 72 74 6f 6e]  0      47   1               SYNONYM     1              1         false
lake         [6c 61 6b 65]                       0      47   1               <ALPHANUM>  1              2         false
norton       [6e 6f 72 74 6f 6e]                 0      47   1               SYNONYM     1              2         false
David Howe