Saturday, March 17, 2018

FW: Navigation/Paging

-----Original Message-----
From: Sebastian Riemer []
Sent: 14 March 2018 13:26
Subject: RE: Navigation/Paging

Dear Shawn,

thank you so much for taking the time for this detailed answer! It helps me
very much and I'm very grateful.

1) As you've suggested, we already load the data for detail pages from our
relational db, just using the documentId from Solr to look it up.
2) Our index size won't ever reach millions of records, as is common in
other users' scenarios. 60000 documents in a search result is currently
the maximum a single client can ever get when not specifying _any_ filter.

-> I'll have to think about whether to prevent the user from deep paging
into big search results, or just take a possible performance hit (as you've
pointed out, usually a typical user won't page further than a couple of
pages). The same goes for jumping to the very end of a search result.
Currently I kind of like this feature so I'll try to keep it in.

For retrieving the previous/next documentId when I'm at the start/end of the
current page, I'll use the approach you (and Rick) suggested. Thanks!

Best wishes,


-----Original Message-----
From: Shawn Heisey []
Sent: Wednesday, 14 March 2018 00:19
Subject: Re: Navigation/Paging

On 3/13/2018 10:26 AM, Sebastian Riemer wrote:
> However, now we want to introduce a similar navigation in our detail
> views, where only ever one document is displayed. Again, the navigation bar
> looks like this:
> << First < Prev 1 - 15 of 62181 Next
> But now, Prev / Next shall open up the previous / next _document_ instead
> of the next page. The same goes for First and Last, it shall open the first
> / last _document_ not the page.
> Our first approach to this was to simply add the param "fl=id" so we only
> get the IDs of documents and set page size to ALL (i.e. no restriction on
> param "rows"). That way, it was easy to extract the current document id from
> the result list, and check which id was preceding and succeeding the current
> id, as well as getting the very first id and the very last id, in order to
> render the navigation bar.
> This led to Solr being heavily under load since it must load 62181
> documents (in this example) in order to return the ids. I somehow thought
> this would be easy for Solr to do, but it isn't.

This will indeed be very slow.  And you only have 62181 documents in your
result set, which is pretty easy for Solr to handle.  For a search that has
100 million results, this approach is *impossible*.  I do have searches like
this on my index, and my index is not all that big compared to some of the
indexes that the community has built.

> Our second approach was to simply keep the same value for params "start"
> and "rows" since the user is always selecting a document from the list -
> thus the selected document already is within the page. However, the edge
> cases are when the selected document is the very first on the page or the
> very last one, thus the previous or next document id is not within the page
> result from Solr -> I guess we could handle this by simply checking and
> sending a second query where the param "start" would be adjusted.

Detail pages often include information that you do not want to store in
Solr.  A well-tuned Solr install will have responses that contain everything
that the application needs to build a search result grid, but for really
detailed information, the application should probably be using the id
information received from Solr to go to the main data repository and
retrieve full details.

Additionally, you should not allow the user to navigate to the last page or
to navigate to the last document, or even a page/document anywhere near the
end of the resultset.  The reason for this is that really high start values
are a serious performance killer.  61K is definitely a start value high
enough to see performance drops.  If the user tries to page too deeply into
results, your application should simply refuse to go any further.  For
comparison purposes -- the last time I checked how deeply Google would let
me go into a search result, I could get to page 39, but no further.  The
number of results for my search was MILLIONS, but Google wouldn't let me
view them all.  The performance issues for deep paging are universal for
search engines, especially when it is possible to jump to an arbitrary page

I recommend limiting how many results a user can page through to about
5000 or 10000.  If there are 50 results per page, this allows them to get to
at least page 99.  In general, most users of search engines will never go
deeper than about page 3.  There are some kinds of applications where a
typical user might visit the first few dozen pages ... but anything deeper
is NOT common.  If you have an atypical user, they are probably prepared for
large page numbers to take a lot longer to load. The main reason you should
be limiting how deep users can go is that when one user is going thousands
of documents into a result set, performance of the other queries on the
system CAN drop dramatically.
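
A minimal sketch of such a limit on the application side, in Python. The
function name, the 1-based page numbering and the cap of 10000 are assumptions
chosen for illustration; only "start" and "rows" are actual Solr parameters.

```python
MAX_RESULTS = 10000  # assumed cap, taken from the 5000-10000 range suggested above


def paging_params(page, rows=50, max_results=MAX_RESULTS):
    """Return Solr start/rows for a 1-based page, or None if the page is too deep."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * rows
    if start >= max_results:
        # Refuse to go any further instead of sending a high start value to Solr.
        return None
    # Trim the final page so the user never sees more than max_results documents.
    return {"start": start, "rows": min(rows, max_results - start)}
```

The application would render "no further results" (or hide the Next/Last
links) whenever the function returns None.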

> However I would not know how to retrieve the id of the very first
> document and the very last document (except for executing separate
> queries with I guess start=0, rows=1 and start=62181 and rows=1)

When you display a page of results, your application already has the N
document IDs that Solr returned for that page.  Using that
information, you can navigate through the documents one at a time. Then if
you reach the end of what you have on that page, you can issue another query
for the next page or the previous page.  If you are restricting how deep a
user can go, the performance of this approach should be pretty good.
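
A rough sketch of that navigation in Python. Here fetch_ids(start, rows) is a
hypothetical helper, not a Solr API: it is assumed to run the user's current
query and sort with fl=id and return the list of ids for that window. In a
real application the first call would normally be replaced by the ids already
held for the current page.

```python
def adjacent_ids(fetch_ids, start, rows, current_id):
    """Return (prev_id, next_id) for current_id; None at either end of the results."""
    page = fetch_ids(start, rows)
    i = page.index(current_id)  # raises ValueError if the document is not on this page

    if i > 0:
        prev_id = page[i - 1]
    elif start > 0:
        # First document on the page: ask for the one just before the window.
        before = fetch_ids(start - 1, 1)
        prev_id = before[0] if before else None
    else:
        prev_id = None  # very first document of the result set

    if i < len(page) - 1:
        next_id = page[i + 1]
    else:
        # Last document on the page: ask for the one just after the window.
        after = fetch_ids(start + len(page), 1)
        next_id = after[0] if after else None
    return prev_id, next_id
```

Note that "start" is a zero-based offset, so with 62181 results the last
document sits at start=62180, not start=62181.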

> For any query and a documentId (which is known to be within the query
> result), what is a simple and efficient enough way to get the
> following navigational information:
> - Previous document Id
> - Next document id
> - First document id
> - Last document id

Having this information available is nearly impossible.

The values for each document will depend on the sort you use.  Change the
sort, and all the values will be wrong.  And if you delete documents or add
documents, those values will likely change, and the values for an individual
document could change several times per second.  Solr cannot automatically
provide this information, and it is pretty much impossible to have accurate
and up-to-date information if you calculate it at index time and add it to
the index.

Side note:  When sorting by relevance score, which is the default sort
order, changing the query also changes the sort.


Note that there *is* a Solr solution for the performance problems of deep
paging ... but cursorMark (the name of the feature) does not support jumping
directly to an arbitrary page number.  If you want page 25000 when using
cursorMark, you have to retrieve the first 24999 pages
before you will have the cursor value for page 25000.  But once you HAVE
that value, retrieving page 25000 will be just as fast as page 1, which is
definitely not the case when using start/rows to get pages.
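
For illustration, a cursorMark loop might look roughly like this in Python.
The search(params) callable is a hypothetical stand-in for whatever client
code sends the parameters to the select handler and returns the parsed JSON
response, and the sort assumes the uniqueKey field is called "id".  The parts
that come from Solr itself are the cursorMark parameter, its initial value
"*", the nextCursorMark value in the response, and the requirement that the
sort includes the uniqueKey field as a tie-breaker.

```python
def iter_pages(search, query, rows=50, sort="score desc, id asc"):
    """Yield successive pages of documents by following Solr's cursorMark."""
    cursor = "*"  # initial value that asks Solr to start a new cursor
    while True:
        result = search({
            "q": query,
            "rows": rows,
            "sort": sort,          # must include the uniqueKey field
            "cursorMark": cursor,  # used instead of a start offset
        })
        docs = result["response"]["docs"]
        if docs:
            yield docs
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:
            break  # the cursor did not move, so there are no more results
        cursor = next_cursor
```

Because each cursor value only comes from the previous response, this fits
"next page" style navigation but not a jump to an arbitrary page.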

Newer versions of Solr also have things like the export handler and
streaming expressions, which are designed to provide REALLY large result
sets without putting major load on the server.  Very large result sets do
still take a lot of TIME, so they're only usable for offline activities like
research and data mining, not live usage in an application.  But they won't
kill the server when they are used.  I do not know how to use these
features, but information is available in the Solr Reference Guide.

