Saturday, March 17, 2018

FW: Recommendations for non-narrative data

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 16 March 2018 21:51
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Recommendations for non-narrative data

For an index that size, you have a lot of options. I'd completely ignore any
discussion that starts with "but our index will be bigger if we do that"
until it's proven to be a problem. For reference, I commonly see 200G-300G
indexes so....

OK, on to your problem.
Your update rate is very low, so don't worry about it. In this case I'd set
my autocommit interval to as long as you can tolerate (say 15 seconds? 5
seconds?). If you can batch up your updates it'll help (e.g., if you update
your Solr index once a minute, collect all of the records that have changed
in the last minute, batch them up in a single request, and send it).
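
For instance, one batched request in Solr's XML update format might look
something like this, POSTed to your collection's /update handler (the id
values are made up, and I'm assuming an "id" uniqueKey plus the username
and email fields you mention):

    <!-- one <add> carrying every document changed in the last minute -->
    <add>
      <doc>
        <field name="id">1001</field>
        <field name="username">cschultz</field>
        <field name="email">chris@example.com</field>
      </doc>
      <doc>
        <field name="id">1002</field>
        <field name="username">eerickson</field>
        <field name="email">erick@example.com</field>
      </doc>
    </add>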

If your update pattern _is_ something like the above, it really doesn't
matter what your autocommit interval is, since it'll only be triggered every
minute in my example. At this size/rate I wouldn't worry about soft commits
at all; just leave them out or set the interval to -1 (never fires).
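
In solrconfig.xml that would look something like this (15 seconds is just
the example number from above; note that with soft commits disabled, the
hard commit has to open a new searcher or your changes never become
visible):

    <autoCommit>
      <!-- hard commit at most every 15 seconds -->
      <maxTime>15000</maxTime>
      <!-- open a new searcher so committed docs become searchable -->
      <openSearcher>true</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- -1 means never fires -->
      <maxTime>-1</maxTime>
    </autoSoftCommit>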

As for your use-cases, pre- and postfix wildcards are tricky. In the naive
case where you just index the fields regularly, they're quite expensive,
since to find the matching terms you must enumerate all terms in a field.
However, at this size it's the first thing I'd try; it might be fast enough.
If it's not, the trick is to use ngrams (say bigrams). So if I'm indexing
"erick", it becomes "er" "ri" "ic" "ck". Now a search for *ric* becomes
simpler, as it's a phrase search for "ri" followed by "ic". Again, at your
size I'd guess the index size increase is not a problem.
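
The naive version is just a plain analyzed field queried with wildcards;
a minimal sketch (the field and type names here are made up):

    <fieldType name="text_plain" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="username" type="text_plain" indexed="true" stored="true"/>

    <!-- naive substring search: q=username:*ric* -->

If that's fast enough on 5M docs, stop there.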

So StandardTokenizer + LowerCaseFilter + NGramFilter is where I'd start.
You'll find the admin/analysis page _extremely_ valuable for understanding
how these interact.
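
Here's a sketch of that field type. It's a common variant of the ngram
trick above: the gram sizes are my guesses, and with maxGramSize large
enough to cover whole usernames, the query side doesn't need to gram at
all, because any query term up to that length was itself indexed as a gram:

    <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- index every 2- to 20-character substring of each token -->
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="20"/>
      </analyzer>
      <analyzer type="query">
        <!-- no ngramming at query time: "ric" matches because "ric" is an indexed gram -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With that in place, q=username:ric does the substring match as a plain term
query, no wildcards needed.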

Do be careful to try edge cases, particularly ones involving punctuation.
You'll discover, for instance, that switching to something like
WhitespaceTokenizer suddenly stops removing punctuation...
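
For example, feeding something like chris@example.com through the
admin/analysis page should show the difference (my expectation; verify on
your version):

    <!-- StandardTokenizerFactory:   "chris@example.com" -> [chris] [example.com] -->
    <!-- WhitespaceTokenizerFactory: "chris@example.com" -> [chris@example.com]
         (punctuation survives, so a query for just "chris" no longer matches) -->
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>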

Best,
Erick

On Fri, Mar 16, 2018 at 6:46 AM, Christopher Schultz
<chris@christopherschultz.net> wrote:
> All,
>
> I'm using Solr to index and search a database of user data (username,
> email, first and last name), so there aren't really "terms" in the
> data to search for, like you might search for words that describe
> products in a catalog, for example.
>
> I have set up my schema to include plain-old text fields for each of
> the data mentioned above, plus I have a copy-field called "all" which
> includes everything all together, plus I have a first + last field
> which uses a phonetic index and query analyzer.
>
> Since I don't need things such as term-replacement (spanner ==
> wrench), stemming (first name 'chris' -> 'chri'), and possibly other
> features that I don't know about, I'm wondering what might be a
> recommended set of tokenizer(s), analyzer(s), etc. for such data.
>
> We will definitely want to be able to search by substring (to find
> 'cschultz' as a username with 'schultz' as input) but some substrings
> are probably useless (such as @gmail.com for email addresses) and
> don't need to be supported.
>
> What are some good options to look at for this type of data?
>
> In production, we have fewer than 5M records to handle, so this is
> more of an academic exercise than an actual performance requirement
> (since Solr is at least an order of magnitude faster than our current
> RDBMS-searching implementation).
>
> If it makes any difference, we are trying to keep the index up-to-date
> with all user changes made in real time (okay, maybe delayed by a few
> seconds, but basically realtime). We have a few hundred new-user
> registrations per day and probably half as many changes to user
> records as that, so perhaps 2 document-updates per minute on average
> (during ~12 business hours in the US on weekdays).
>
> Thanks for any advice anyone may have, -chris
>
