Saturday, March 17, 2018

FW: Why are cursor mark queries recommended over regular start, rows combination?

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 15 March 2018 00:24
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Why are cursor mark queries recommended over regular start,
rows combination?

I'm pretty sure you can use Streaming Expressions to get all the rows back
from a sharded collection without chewing up lots of memory.

Try:
search(collection,
q="id:*",
fl="id",
sort="id asc",
qt="/export")

on a sharded SolrCloud installation, I believe you'll get all the rows back.
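
For readers who want to try the expression above from a client, here is a minimal Python sketch. It assumes a Solr node at http://localhost:8983 and a collection literally named "collection"; both are placeholders, and the helper names are made up for illustration. It relies on the /stream handler accepting the expression in the `expr` parameter and returning tuples under `result-set.docs`, terminated by an EOF tuple.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: local Solr node; adjust host, port and collection name.
SOLR_BASE = "http://localhost:8983/solr"

# The expression from the message above, on one line.
EXPR = 'search(collection, q="id:*", fl="id", sort="id asc", qt="/export")'


def stream_url(collection, expr, base=SOLR_BASE):
    """Build the /stream request URL for a streaming expression."""
    return f"{base}/{collection}/stream?{urlencode({'expr': expr})}"


def tuples_from_response(body):
    """Return the tuples of a /stream JSON response, dropping the EOF marker."""
    docs = json.loads(body)["result-set"]["docs"]
    return [d for d in docs if not d.get("EOF")]


def fetch_all(collection, expr):
    """Send the expression and return every tuple (needs a running Solr)."""
    with urlopen(stream_url(collection, expr)) as resp:
        return tuples_from_response(resp.read())
```

Calling `fetch_all("collection", EXPR)` would return the list of `{"id": ...}` tuples. Note that this sketch buffers the whole response for simplicity; a client exporting millions of rows would want to parse the response incrementally instead.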

NOTE:
1> Some while ago you couldn't _stop_ the stream part way through.
Down in the SolrJ world you could read from a stream for a while and call
close on it, but that would just spin in the background until it reached EOF.
Search the JIRA list if you need to (can't find the JIRA right now; 6.6 IIRC
is OK and, of course, 7.3).

This shouldn't chew up memory since the streams are sorted, so what you get
in the response is the ordered set of tuples.

Some of the join streams _do_ have to hold all the results in memory, so
look at the docs if you wind up using those.


Best,
Erick

On Wed, Mar 14, 2018 at 9:20 AM, S G <sg.online.email@gmail.com> wrote:
> Thanks everybody. This is a lot of good information.
> And we should try to update this in the documentation too to help
> users make the right choice.
> I can take a stab at this if someone can point me to how to update the
> documentation.
>
> Thanks
> SG
>
>
> On Tue, Mar 13, 2018 at 2:04 PM, Chris Hostetter
> <hossman_lucene@fucit.org>
> wrote:
>
>>
>> : > 3) Lastly, it is not clear the role of export handler. It seems that the
>> : > export handler would also have to do exactly the same kind of thing as
>> : > start=0 and rows=1000,000. And that again means bad performance.
>>
>> : <3> First, streaming requests can only return docValues="true"
>> : fields. Second, most streaming operations require sorting on something
>> : besides score. Within those constraints, streaming will be _much_
>> : faster and more efficient than cursorMark. Without tuning I saw 200K
>> : rows/second returned for streaming, the bottleneck will be the speed
>> : that the client can read from the network. First of all you only
>> : execute one query rather than one query per N rows. Second, in the
>> : cursorMark case, to return a document you and assuming that any field
>> : you return is docValues=false
>>
>> Just to clarify, there is a big difference between the /export handler
>> and "streaming expressions".
>>
>> Unless something has changed drastically in the past few releases, the
>> /export handler does *NOT* support exporting a full *collection* in
>> solr cloud -- it only operates on an individual core (aka:
>> shard/replica).
>>
>> Streaming expressions is a feature that does work in Cloud mode, and
>> can make calls to the /export handler on a replica of each shard in
>> order to process the data of an entire collection -- but when doing
>> so it has to aggregate *ALL* the results from every shard in
>> memory on the coordinating node -- meaning that (in addition to the
>> docvalues caveat) streaming expressions requires you to "spend" a lot
>> of ram usage on one node as a trade off for spending more time &
>> multiple requests to get the same data from cursorMark...
>>
>> https://lucene.apache.org/solr/guide/exporting-result-sets.html
>> https://lucene.apache.org/solr/guide/streaming-expressions.html
>>
>> An additional perk of cursorMark that may be relevant to the OP is
>> that you can "stop" tailing a cursor at any time (ie: if you're post
>> processing the results client side and decide you have "enough"
>> results) but a similar feature isn't available (AFAICT) from streaming
>> expressions...
>>
>> https://lucene.apache.org/solr/guide/pagination-of-results.html#tailing-a-cursor
>>
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
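
The cursorMark paging discussed in the quoted message, including the ability to stop early, can be sketched in Python as follows. This is an illustration under assumptions, not code from the thread: `fetch` is a hypothetical callable that sends the given parameters to /select and returns the parsed JSON response, and the sort assumes `id` is the collection's uniqueKey (a cursor sort must include the uniqueKey field).

```python
from typing import Callable, Dict, Iterator


def cursor_docs(fetch: Callable[[Dict[str, str]], dict],
                q: str = "*:*",
                sort: str = "id asc",
                rows: int = 1000) -> Iterator[dict]:
    """Lazily yield documents, one /select request per page of `rows` docs.

    `fetch` is a hypothetical helper: it takes the request parameters
    and returns Solr's JSON response as a dict.
    """
    cursor = "*"  # cursorMark=* starts a new cursor
    while True:
        rsp = fetch({"q": q, "sort": sort, "rows": str(rows),
                     "cursorMark": cursor})
        yield from rsp["response"]["docs"]
        next_cursor = rsp["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark: no more results
            return
        cursor = next_cursor
```

Because this is a generator, a caller that has seen "enough" results can simply break out of its loop. No further requests are sent after that point, which is the "stop at any time" property mentioned above.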
