Sunday, April 1, 2018

FW: querying vs. highlighting: complete freedom?

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 26 March 2018 22:05
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: querying vs. highlighting: complete freedom?

Arturas:

Thanks for the "atta boy's", but I have to confess I poked a developer's
list and the person (David Smiley) who, you know, like understands the
highlighting code replied, and I passed it on ;)

I have great respect for the SO forum, but don't post to it since there's
only so much time in a day, so please feel free to put that explanation over
there.

As for the rest, I'll have to pass today, the aforementioned time
constraints are calling....

Best,
Erick

On Mon, Mar 26, 2018 at 12:12 AM, Arturas Mazeika <mazeika@gmail.com> wrote:
> Hi Erick,
>
> Adding a field qualifier to the hl.q parameter solved the issue. My
> excitement is going through the roof! What a thorough answer: the
> explanation of Solr's behavior, and of how it tries to interpret what
> I mean when I supply a keyword without the field qualifier. Very
> impressive. Would you care to (re)post this answer on Stack Overflow?
> If that is too much of a hassle, I'll do it in a couple of days myself
> on your behalf.
>
> I am impressed by how well, thoroughly, and quickly the question was
> answered.
>
> Stefan's hint pushed me further in this direction: he suggested using
> the query part of Solr to filter and sort out the relevant answers in
> the first step, and highlighting all the keywords with CTRL+F (in the
> browser or some alternative viewer) in the second step. This brought
> me to the next question:
>
> How can one match query terms against documents that have gone through
> the analysis chain, in an efficient and distributed manner? My current
> understanding of how to achieve this is the following:
>
> 1. Get the list of ids (contents) of the documents that match the
>    query.
> 2. Use http://localhost:8983/solr/#/trans/analysis to re-analyze the
>    document and the query.
> 3. Use the matching of the substrings from the original text to the
>    last filter/tokenizer/analyzer in the analysis chain to map the
>    terms of the query.
> 4. Emulate CTRL+F highlighting.
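
The four steps above can be sketched in a few lines of Python. This is only a toy illustration, assuming a stand-in "analysis chain" that lowercases and strips diacritics; it is not Solr's highlighter, and the names normalize and highlight are hypothetical.

```python
import re
import unicodedata


def normalize(token: str) -> str:
    """Toy stand-in for an analysis chain: lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def highlight(text: str, query: str, pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap each word of the original text whose normalized form matches a query term."""
    wanted = {normalize(term) for term in re.findall(r"\w+", query)}
    pieces, last = [], 0
    for match in re.finditer(r"\w+", text):
        if normalize(match.group()) in wanted:
            # Offsets refer to the original text, so the umlaut survives in the output.
            pieces.append(text[last:match.start()])
            pieces.append(pre + match.group() + post)
            last = match.end()
    pieces.append(text[last:])
    return "".join(pieces)


print(highlight("Die Kündigung kam zur falschen Zeit.", "kundigung zeit"))
```

The point of the sketch is that matching happens on the normalized tokens while the markup is applied at the offsets of the original text, which is what a browser's CTRL+F cannot do once the query has been rewritten.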
>
> The Solr web interface offers quite a bit towards this goal.
> If one fires this request:
>
> * analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955)
> was a German-born theoretical physicist[5] who developed the theory of
> relativity, one of the two pillars of modern physics (alongside
> quantum mechanics).&
> * analysis.query=reletivity theory
>
> to one of the cores of solr, one gets the steps 1-3 done:
>
> http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
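
Rather than percent-encoding such a URL by hand, it can be assembled programmatically. The following is a minimal sketch that only builds the URL and sends nothing; the helper name analysis_url is hypothetical, the field value is shortened for readability, and the parameter names are the ones used in the request above.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def analysis_url(base: str, core: str, field_type: str, field_value: str, query: str) -> str:
    """Build a field-analysis request URL like the one quoted in the thread."""
    params = {
        "wt": "xml",
        "analysis.showmatch": "true",
        "analysis.fieldvalue": field_value,
        "analysis.query": query,
        "analysis.fieldtype": field_type,
    }
    # quote_via=quote encodes spaces as %20, matching the original URL.
    return f"{base}/{core}/analysis/field?{urlencode(params, quote_via=quote)}"


url = analysis_url(
    "http://localhost:8983/solr",
    "trans_shard1_replica_n1",
    "text_en",
    "who developed the theory of relativity",  # shortened sample text
    "reletivity theory",
)
print(url)
# Round-trip check: decoding the query string recovers the original values.
print(parse_qs(urlsplit(url).query)["analysis.query"])
```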
>
> Questions:
>
> 1. Is there a way to "load-balance" this? In the above url, I need to
> specify a specific core. Is it possible to generalize it, so the core
> that receives the request is not necessarily the one that processes
> it? Or is this already distributed, in the sense that the receiving
> core and the processing cores are never the same?
>
> 2. The document has already been through the analysis chain. Is it
> possible to store this information so that one does not need to
> re-analyze it?
>
> Cheers
> Arturas
>
> On Fri, Mar 23, 2018 at 9:15 PM, Erick Erickson
> <erickerickson@gmail.com> wrote:
>
>> Arturas:
>>
>> Try to field-qualify your hl.q parameter. That looks like:
>>
>> hl.q=trans:Kundigung
>> or
>> hl.q=trans:Kündigung
>>
>> I saw the exact behavior you describe when I did _not_ specify the
>> field in the hl.q parameter, i.e.
>>
>> hl.q=Kundigung
>> or
>> hl.q=Kündigung
>>
>> didn't show all highlights.
>>
>> But when I did specify the field, it worked.
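
As a rough sketch of the fix, the request from the original question can be rebuilt with a field-qualified hl.q. The helper name highlight_request is hypothetical; the collection, field, and parameter values are taken from the URLs quoted at the bottom of the thread, and nothing is actually sent to Solr here.

```python
from urllib.parse import quote, urlencode


def highlight_request(base, collection, q, hl_field, hl_term, snippets=3, rows=1):
    """Build a select URL whose hl.q is qualified with the highlighted field."""
    params = {
        "q": q,
        "hl": "true",
        "hl.fl": hl_field,
        # Field-qualify the highlight query so it is analyzed with that field's chain.
        "hl.q": f"{hl_field}:{hl_term}",
        "hl.snippets": snippets,
        "wt": "xml",
        "rows": rows,
    }
    # safe=":" keeps the field separator readable; everything else is percent-encoded.
    return f"{base}/{collection}/select?{urlencode(params, safe=':', quote_via=quote)}"


url = highlight_request("http://localhost:8983/solr", "trans", "trans:Zeit", "trans", "Kündigung")
print(url)
```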
>>
>> Here's what I think is happening: Solr uses the default search field
>> when parsing an un-field-qualified query. I.e.
>>
>> q=something
>>
>> is parsed as
>>
>> q=default_search_field:something.
>>
>> The default field is controlled in solrconfig.xml with the "df"
>> parameter, you'll see entries like:
>> <str name="df">my_field</str>
>>
>> Also when I changed the "df" parameter to the field I was
>> highlighting on, I didn't need to specify the field on the hl.q
>> parameter.
>>
>> hl.q=Kundigung
>> or
>> hl.q=Kündigung
>>
>> The default field is usually "text", which knows nothing about the
>> German-specific filters you've applied unless you changed it.
>>
>> So in the absence of a field-qualification for the hl.q parameter
>> Solr was parsing the query according to the analysis chain specified
>> in your default field, and probably passed ü through without
>> transforming it. Since your indexing analysis chain for that field
>> folded ü to just plain u, it wasn't found or highlighted.
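
The suspected mismatch can be illustrated with a toy model. This sketch assumes two simplified "analysis chains": one that folds German umlauts (a rough stand-in for the German normalization in the text_de type) and one that only lowercases (a stand-in for a default field without German filters). Neither function reproduces Solr's real filters; they only show why the tokens fail to line up.

```python
# Simplified folding table; the real German normalization filter handles more cases.
GERMAN_FOLDING = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def analyze_german(text):
    """Toy chain for the highlighted field: lowercase, then fold umlauts."""
    return [token.lower().translate(GERMAN_FOLDING) for token in text.split()]


def analyze_default(text):
    """Toy chain for a default field: lowercase only, umlauts pass through."""
    return [token.lower() for token in text.split()]


indexed_terms = set(analyze_german("Die Kündigung kam heute"))

# hl.q=Kündigung parsed against the default field: the term keeps its umlaut.
unqualified = set(analyze_default("Kündigung"))
# hl.q=trans:Kündigung parsed against the German field: the term is folded.
qualified = set(analyze_german("Kündigung"))

print(unqualified & indexed_terms)  # no overlap, so nothing to highlight
print(qualified & indexed_terms)    # overlap on the folded term
```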
>>
>> On the surface, this does seem like something that should be changed,
>> I'll go ahead and ping the dev list.
>>
>> NOTE: I was trying this on Solr 7.1
>>
>> Best,
>> Erick
>>
>> On Fri, Mar 23, 2018 at 12:03 PM, Arturas Mazeika <mazeika@gmail.com>
>> wrote:
>> > Hi Erick,
>> >
>> > Thanks for the update and the info. Your post shed quite a bit of
>> > light on the picture, and now I understand quite a bit more about
>> > what you are saying. Your explanation makes sense and can be quite
>> > useful in certain scenarios.
>> >
>> > What struck me in your description is that you are saying that
>> > the analyzer chain needs to be applied to the highlighting queries
>> > as well. The tragedy is that I am not able to get this to work for a
>> > German collection: if only the query is set (no explicit
>> > highlighting query), the highlighting is correct. It is also
>> > correct if I replace the umlauts with the corresponding Latin
>> > characters. Getting the analyzer chain applied to the highlighting
>> > terms remains the challenge.
>> >
>> > Could you have a look at the following Stack Overflow link?
>> > Maybe something comes to mind...
>> >
>> > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
>> >
>> > Cheers,
>> > Arturas
>> >
>> > On Fri, Mar 23, 2018, 17:43 Erick Erickson
>> > <erickerickson@gmail.com> wrote:
>> >
>> >> bq: this is not a typical case that one searches for a keyword but
>> >> highlights something else
>> >>
>> >> This isn't really an unusual case; apparently I misled you.
>> >>
>> >> What I was trying to convey is that the analysis chain used is
>> >> firmly attached to a particular _field_. There's no way to say
>> >> "use one analysis chain for the query and another for highlighting
>> >> on the _same_ field".
>> >>
>> >> You can use two different fields with different analysis chains,
>> >> one for each purpose. So something like
>> >>
>> >> q=f1:something&hl.fl=f2,f3&hl.q=other
>> >>
>> >> is certainly reasonable. It'll search for "something" in f1, and
>> >> highlight "other" in f2 and f3
>> >>
>> >> Each field processes its input with the analysis chain defined for
>> >> it in the schema.
>> >>
>> >> The rest about stored="true" can be ignored, it's just me
>> >> wandering off into the weeds about an optimization that only
>> >> stores the data once rather than redundantly in multiple fields.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Fri, Mar 23, 2018 at 4:37 AM, Arturas Mazeika
>> >> <mazeika@gmail.com> wrote:
>> >> > Hi Matheis (Stefan),
>> >> >
>> >> > Thanks for the questions. This made me look at the problem from
>> >> > a distance and re-frame the situation. Good questions indeed.
>> >> >
>> >> > To approach it from another angle: consider a user who describes
>> >> > herself as a BMW fan, convinced (for the sake of argument) that
>> >> > every BMW needs to be the blackest color possible. She would like
>> >> > to search and later browse the entries in a discussion forum (not
>> >> > everything, of course, only BMWs of the blackest color), and what
>> >> > interests her are the snippets that contain "understood",
>> >> > "craziest", or similar keywords (because she is looking for a
>> >> > dozen discussions that she saw before).
>> >> >
>> >> > What I have not been able to achieve so far is: (i) combining
>> >> > query terms for filtering and highlighting, and (ii) using the
>> >> > analyzer chain of the attribute (field) to rewrite the highlight
>> >> > query (or defining one in the search).
>> >> >
>> >> > The CTRL+F technique is a very powerful one, indeed. It works
>> >> > most of the time. The difficulties with it are query rewriting,
>> >> > enriching, etc.
>> >> >
>> >> > Cheers,
>> >> > Arturas
>> >> >
>> >> > On Fri, Mar 23, 2018 at 11:29 AM, Stefan Matheis
>> >> > <matheis.stefan@gmail.com> wrote:
>> >> >
>> >> >> Perhaps we try it the other way round... what's your use case
>> >> >> for this? I'm trying to think of a situation where I'd need this
>> >> >> as a user.
>> >> >>
>> >> >> The only reason I see myself doing this is CTRL+F in a page
>> >> >> when the search result is not immediately visible to me ;)
>> >> >>
>> >> >> On Mar 23, 2018 9:41 AM, "Arturas Mazeika" <mazeika@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> > Hi Erick et al,
>> >> >> >
>> >> >> > From your answer I understand that it is not a typical case
>> >> >> > that one searches for one keyword but highlights something
>> >> >> > else. Since we have two parameters (q vs. hl.q), I thought
>> >> >> > they were freely combinable. From your answer I understand
>> >> >> > that this is not really the case. My current understanding
>> >> >> > came from [1], which says:
>> >> >> >
>> >> >> > hl.q
>> >> >> >
>> >> >> > A query to use for highlighting. This parameter allows you to
>> >> >> > highlight different terms than those being used to retrieve
>> >> >> > documents.
>> >> >> >
>> >> >> > What I hear from you is something different: that it is not
>> >> >> > enough just to combine q with hl.q, and that there are caveats
>> >> >> > to achieving the task (multiple fields, FastVectorHighlighter).
>> >> >> >
>> >> >> > Your information is very helpful.
>> >> >> >
>> >> >> > Cheers,
>> >> >> > Arturas
>> >> >> >
>> >> >> > [1]
>> >> >> > https://lucene.apache.org/solr/guide/7_2/highlighting.html
>> >> >> >
>> >> >> > On Thu, Mar 22, 2018 at 4:07 PM, Erick Erickson
>> >> >> > <erickerickson@gmail.com> wrote:
>> >> >> >
>> >> >> > > Basically you need to use a copyField, but in several variants:
>> >> >> > >
>> >> >> > > If you use the field _exclusively_ for highlighting, then
>> >> >> > > store the raw content there and have the field use whatever
>> >> >> > > analyzer you want. You do _not_ need to have indexed="true"
>> >> >> > > set for the field if you're highlighting on the fly. So
>> >> >> > > you're searching against field1 (which has indexed="true"
>> >> >> > > stored="false" set) but highlighting against field2 (which
>> >> >> > > has indexed="false" stored="true" set). Of course any time
>> >> >> > > you want to return the contents in a doc your fl needs to
>> >> >> > > specify field2...
>> >> >> > >
>> >> >> > > The above does not bloat your index at all since the cost
>> >> >> > > of stored="true" indexed="true" is the same as if you use
>> >> >> > > two fields, each with only one option turned on.
>> >> >> > >
>> >> >> > > The second approach, if you want to use
>> >> >> > > FastVectorHighlighter or the like, is simply to index both
>> >> >> > > fields.
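
The first variant (search on field1, highlight on field2) could be expressed as a set of schema commands. The sketch below only builds the JSON payload and does not contact Solr. It assumes the "add-field" and "add-copy-field" commands of Solr's Schema API, which would normally be POSTed to the collection's /schema endpoint; the function name is hypothetical, field1 and field2 come from the message above, and text_de is the field type from the original question.

```python
import json


def two_field_schema_commands(search_field, highlight_field, field_type):
    """Build a Schema API payload: one indexed-only field, one stored-only field."""
    return {
        "add-field": [
            # Searched but not returned: indexed, not stored.
            {"name": search_field, "type": field_type, "indexed": True, "stored": False},
            # Highlighted and returned: stored, not indexed.
            {"name": highlight_field, "type": field_type, "indexed": False, "stored": True},
        ],
        # Copy the raw input of the search field into the highlight field.
        "add-copy-field": {"source": search_field, "dest": [highlight_field]},
    }


commands = two_field_schema_commands("field1", "field2", "text_de")
print(json.dumps(commands, indent=2))
```

For the second variant mentioned above, the same payload would presumably set indexed to True on both fields; check the highlighter's own requirements before relying on that.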
>> >> >> > >
>> >> >> > > Best,
>> >> >> > > Erick
>> >> >> > >
>> >> >> > > On Thu, Mar 22, 2018 at 2:18 AM, Arturas Mazeika
>> >> >> > > <mazeika@gmail.com> wrote:
>> >> >> > > > Hi Solr-Users,
>> >> >> > > >
>> >> >> > > > I've been playing with a German collection of documents,
>> >> >> > > > where I tried to search for one word (q=Tag) and highlight
>> >> >> > > > another (hl.q=Kundigung). Is this a "legal" use case? My
>> >> >> > > > key question is: how can I tell Solr which query analyzer
>> >> >> > > > to use for highlighting? Strictly speaking, I should use
>> >> >> > > > hl.q=Kündigung to conceptually look for relevant
>> >> >> > > > information, but in this case no highlighting is returned
>> >> >> > > > (as all umlauts are left out in the index).
>> >> >> > > >
>> >> >> > > > Additional infos:
>> >> >> > > >
>> >> >> > > > solr version: 7.2
>> >> >> > > > urls to query:
>> >> >> > > >
>> >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1
>> >> >> > > >
>> >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.snippets=3&wt=xml&rows=1
>> >> >> > > >
>> >> >> > > > Managed-schema:
>> >> >> > > >
>> >> >> > > > <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
>> >> >> > > >   <analyzer>
>> >> >> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
>> >> >> > > >     <filter class="solr.LowerCaseFilterFactory"/>
>> >> >> > > >     <filter class="solr.StopFilterFactory" format="snowball"
>> >> >> > > >             words="lang/stopwords_de.txt" ignoreCase="true"/>
>> >> >> > > >     <filter class="solr.GermanNormalizationFilterFactory"/>
>> >> >> > > >     <filter class="solr.GermanLightStemFilterFactory"/>
>> >> >> > > >   </analyzer>
>> >> >> > > > </fieldType>
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > Other additional infos:
>> >> >> > > > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
>> >> >> > > >
>> >> >> > > > Cheers,
>> >> >> > > > Arturas
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>
