Sunday, April 1, 2018

FW: WordDelimiterGraphFilter expected behaviour ?

-----Original Message-----
From: Kelvyn Scrupps [mailto:Kelvyn.Scrupps@alliescomputing.com]
Sent: 02 April 2018 04:32
To: solr-user@lucene.apache.org
Subject: RE: WordDelimiterGraphFilter expected behaviour ?

It's been a holiday here in the UK, hence the delay, but thank you for your
far more prompt response.

It makes sense that the filter is removing the punctuation-only term, and
that it only looks odd when alongside the original with
preserveOriginal=true. Fortunately it was just a curio that came up while I
was testing a downstream (and typically flaky) custom filter I'm working on
that gets its own positional increments in a twist, otherwise I don't think
I'd have noticed it. We don't - or shouldn't - actually send
punctuation-only tokens, so it's not really a production concern.

Thanks for the reminder about FlattenGraphFilterFactory too btw.

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org]
Sent: 29 March 2018 22:59
To: solr-user@lucene.apache.org
Subject: Re: WordDelimiterGraphFilter expected behaviour ?

On 3/29/2018 1:48 PM, Kelvyn Scrupps wrote:
> I'm using WordDelimiterGraphFilter on a field and came across a curious
> additional positional "hole" generated by the filter while playing with the
> analysis tool.
> For input "wibble , wobble" (space either side of the comma so it's a
> separate token), the output introduces an additional positional hole after
> the comma, i.e.
>
> Term position
> Wibble 1
> , 2
> Wobble 4 *
>
> The positionlength for each is 1, so no obvious graph-span going on.
>
> It's not just comma, any punctuation would do, e.g. "wibble ! wobble"

The wrinkle here is enabling preserveOriginal at the same time that you have
a term which is completely removed by the filter (in this case, the comma). 
If preserveOriginal is disabled, they both behave the same.  I don't know if
this is a bug or not.  My instinct is to say it's a bug, but it's possible
that this is expected.

Having a term that's just a punctuation character in the index is generally
not very useful ... but there are OTHER situations with this filter where
preserveOriginal *is* the behavior you want.  I would imagine that as long
as you don't have terms that completely disappear when the filter runs, it
would behave correctly.  Try replacing the "," with "x," to see what I mean.

Also, FYI, when using a Graph filter, the index analysis chain must also
have this filter (but not the query analysis):

        <filter class="solr.FlattenGraphFilterFactory"/>

Adding that didn't seem to fix the behavior that concerns you, but the docs
do say it's required on the index analysis whenever using a Graph filter.
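
As a rough sketch of that arrangement (the field type name "text_wdg" and
the whitespace tokenizer are assumptions for illustration, not taken from
the original schema), the index chain carries the flatten filter after the
graph filter and the query chain leaves it out:

        <!-- hypothetical field type; name and tokenizer are assumed -->
        <fieldType name="text_wdg" class="solr.TextField">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterGraphFilterFactory"
                    preserveOriginal="1"/>
            <!-- index side only: flatten the token graph -->
            <filter class="solr.FlattenGraphFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterGraphFilterFactory"
                    preserveOriginal="1"/>
          </analyzer>
        </fieldType>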

Thanks,
Shawn


______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service
(http://www.symanteccloud.com) for Allies Computing Ltd
______________________________________________________________________

