Sunday, April 1, 2018

FW: WordDelimiterGraphFilter expected behaviour ?

-----Original Message-----
From: Kelvyn Scrupps [mailto:Kelvyn.Scrupps@alliescomputing.com]
Sent: 30 March 2018 01:18
To: solr-user@lucene.apache.org
Subject: WordDelimiterGraphFilter expected behaviour ?

Hi

First posting to the list, so here goes.

I'm using WordDelimiterGraphFilter on a field and came across a curious
additional positional "hole" generated by the filter while playing with the
analysis tool.
For input "wibble , wobble" (space either side of the comma so it's a
separate token), the output introduces an additional positional hole after
the comma, i.e.

Term      Position
Wibble    1
,         2
Wobble    4 *

The positionLength for each token is 1, so there is no obvious graph span going on.
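
For reference, the positions shown by the analysis tool are the running sum of each token's position increment, so a hole appears as an increment larger than 1. A minimal Python sketch of that bookkeeping follows. The increments are inferred from the positions in the table above rather than read out of Lucene, and both helper names are made up for illustration:

```python
from itertools import accumulate

def positions_from_increments(increments):
    """Return token positions as the running sum of position increments."""
    return list(accumulate(increments))

def holes(positions):
    """Return the positions skipped between consecutive tokens."""
    skipped = []
    for prev, cur in zip(positions, positions[1:]):
        skipped.extend(range(prev + 1, cur))
    return skipped

# Increments inferred from the table (assumption): wibble=1, ","=1, wobble=2.
observed = positions_from_increments([1, 1, 2])
print(observed)         # [1, 2, 4]
print(holes(observed))  # [3]
```

With increments of 1 throughout, the same helpers would give positions 1, 2, 3 and no holes, which is what I was expecting to see.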

It's not just the comma; any punctuation will do, e.g. "wibble ! wobble".

I know it's a bit contrived, and it doesn't break anything in production but
it just puzzled me.

The question is: is this by design? It's not the behaviour of the old
WordDelimiterFilter.

Setup:

Solr 6.6.3

Field:
<fieldType name="text_en_allies" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" splitOnNumerics="0" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
            preserveOriginal="1" stemEnglishPossessive="1"/>
    ...
  </analyzer>
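
To reproduce this outside the admin UI, the field analysis handler can be queried directly. A rough Python sketch that only builds the request URL follows. The host, port and core name (`mycore`) are placeholders rather than my actual setup, and it assumes the `/analysis/field` handler is enabled on the core:

```python
from urllib.parse import urlencode

def analysis_url(base_url, core, field_type, text):
    """Build a field-analysis request URL for the given field type and input text."""
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    })
    return f"{base_url}/{core}/analysis/field?{params}"

# Placeholder host and core name (assumptions); the field type is the one above.
url = analysis_url("http://localhost:8983/solr", "mycore",
                   "text_en_allies", "wibble , wobble")
print(url)
```

Fetching that URL should return the token stream after each analysis stage, including the position of every token, which makes the hole easier to compare across Solr versions.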

Thanks for any insight.

Kelvyn Scrupps
Developer for Allies Computing

 
 
 


