Saturday, March 17, 2018

FW: Recommendations for non-narrative data

-----Original Message-----
From: Christopher Schultz [mailto:chris@christopherschultz.net]
Sent: 16 March 2018 19:16
To: solr-user@lucene.apache.org
Subject: Recommendations for non-narrative data

All,

I'm using Solr to index and search a database of user data (username, email,
first and last name), so there aren't really "terms" in the data to search
for, the way you might search for words that describe products in a catalog.

I have set up my schema with plain text fields for each of the fields
mentioned above, a copy field called "all" that combines all of them, and a
first + last name field that uses a phonetic index and query analyzer.

Since I don't need things such as term replacement (synonyms, e.g. spanner ==
wrench), stemming (first name 'chris' -> 'chri'), and possibly other features
that I don't know about, I'm wondering what might be a recommended set of
tokenizer(s), analyzer(s), etc. for such data.

We will definitely want to be able to search by substring (to find
'cschultz' as a username with 'schultz' as input) but some substrings are
probably useless (such as @gmail.com for email addresses) and don't need to
be supported.
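
To make the substring requirement concrete, here is a minimal sketch of
n-gram indexing, the idea behind Solr's NGramFilterFactory. It is plain
Python, not Solr code. The function names, the gram sizes (3 to 8), and the
sample users are all assumptions made up for illustration. Email domains are
dropped before indexing, so '@gmail.com' is never searchable.

```python
def ngrams(text, min_n=3, max_n=8):
    """Return every lowercase substring of text with length min_n..max_n."""
    text = text.lower()
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


def local_part(email):
    """Keep only the part before '@' so the domain is never indexed."""
    return email.split("@", 1)[0]


def build_index(users):
    """Map each n-gram to the set of user ids whose fields contain it."""
    index = {}
    for user_id, fields in users.items():
        values = [
            fields["username"],
            local_part(fields["email"]),
            fields["first"],
            fields["last"],
        ]
        for value in values:
            for gram in ngrams(value):
                index.setdefault(gram, set()).add(user_id)
    return index


def search(index, query, min_n=3, max_n=8):
    """Exact lookup of the query as one gram.

    Limitation of this sketch: queries shorter than min_n or longer
    than max_n return nothing.
    """
    query = query.lower()
    if not (min_n <= len(query) <= max_n):
        return set()
    return set(index.get(query, set()))


# Hypothetical sample data, for illustration only.
users = {
    1: {"username": "cschultz", "email": "chris@gmail.com",
        "first": "Chris", "last": "Schultz"},
    2: {"username": "jdoe", "email": "jdoe@gmail.com",
        "first": "Jane", "last": "Doe"},
}
index = build_index(users)
```

With this index, search(index, "schultz") finds user 1 through the username
'cschultz', while search(index, "gmail") finds nothing. The trade-off is
index size: every value contributes many grams, which is usually acceptable
for short fields such as names and usernames.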

What are some good options to look at for this type of data?

In production, we have fewer than 5M records to handle, so this is more of
an academic exercise than an actual performance requirement (since Solr is
at least an order of magnitude faster than our current RDBMS-searching
implementation).

If it makes any difference, we are trying to keep the index up-to-date with
all user changes made in real time (okay, maybe delayed by a few seconds,
but basically real time). We have a few hundred new-user registrations per
day and probably half as many changes to user records as that, so perhaps 2
document-updates per minute on average (during ~12 business hours in the US
on weekdays).

Thanks for any advice anyone may have,
-chris
