Sunday, April 1, 2018

FW: Routing a subquery directly to the shard a document came from

-----Original Message-----
From: Jeff Wartes [mailto:jwartes@whitepages.com]
Sent: 28 March 2018 04:30
To: solr-user@lucene.apache.org
Subject: Routing a subquery directly to the shard a document came from


I have a large 7.2 index with nested documents and many shards.
For each result (parent doc) in a query, I want to gather a relevance-ranked
subset of the child documents. It seemed like the subquery transformer would
be ideal:
https://lucene.apache.org/solr/guide/7_2/transforming-result-documents.html#TransformingResultDocuments-_subquery_

(the [child] transformer allows for a filter, but the results have an
effectively random sort)

So maybe something like this:
q=<something>
fl=id,subquery:[subquery]
subquery.q=<something>
subquery.fq={!cache=false} +{!terms f=_root_ v=$row.id}

This actually works fine, but there's a lot more work going on than
necessary. Say we have X shards and get N documents back:

Query http requests = 1 top-level query + X distributed shard-requests
Subquery http requests = N rows + N * X distributed shard-requests

So with N=10 results and X=50 shards, that is: 1+50+10+500 = 561 http requests
through the cluster.

Some of that is unavoidable, of course, but it occurs to me that all the
child docs are indexed in the same shard (segment) as the parent doc.
Meaning that if you know the parent doc id, (and I do) you can use the
document routing to know exactly which shard to send the subquery request
to. This would save 490 of the http requests in the scenario above.
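
The arithmetic above can be checked with a short Python sketch. The function
name and the "routed" flag are illustrative only; the routed case assumes each
subquery still costs one request for the row plus exactly one shard request.

```python
def http_requests(shards: int, rows: int, routed: bool = False) -> int:
    """Count HTTP requests for one top-level query plus one subquery per row."""
    top_level = 1 + shards              # 1 top-level query + X shard requests
    fanout = 1 if routed else shards    # shards contacted by each subquery
    subqueries = rows + rows * fanout   # N subqueries + their shard requests
    return top_level + subqueries


current = http_requests(shards=50, rows=10)              # 1 + 50 + 10 + 500
routed = http_requests(shards=50, rows=10, routed=True)  # 1 + 50 + 10 + 10
print(current, routed, current - routed)                 # 561 71 490
```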

Is there any form of query that allows for explicitly following the document
routing rules for a given document ID?

I'm aware of the "distrib=false" and "shards=foo" parameters, but using
those would require me to recreate the document routing in the client.
There's also the "fl=[shard]" thing, but that would still require me to
handle the subqueries in the client.
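
For reference, the client-side workaround I'd like to avoid might look roughly
like the sketch below. It assumes the top-level query was run with
fl=id,[shard] and that the value returned under "[shard]" can be passed back
unchanged in the "shards" parameter. The helper name, the host name, and the
child query string are hypothetical.

```python
from urllib.parse import urlencode


def child_query_params(parent_doc: dict, child_q: str, rows: int = 10) -> dict:
    """Build params for one child query pinned to the parent's shard."""
    return {
        "q": child_q,
        # Restrict to children of this parent; skip the filter cache.
        "fq": "{!terms f=_root_ cache=false}" + parent_doc["id"],
        # Assumption: the [shard] value is directly usable as a shards value.
        "shards": parent_doc["[shard]"],
        "rows": rows,
    }


# Hypothetical parent doc as returned by the top-level query.
parent = {"id": "p1", "[shard]": "host1:8983/solr/coll_shard3_replica_n1"}
params = child_query_params(parent, "text:example")
query_string = urlencode(params)  # ready to append to a /select URL
```

That is N extra round trips from the client, which is exactly the bookkeeping
I was hoping the subquery transformer could do for me.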
