Saturday, March 17, 2018

FW: Defining a phonetic analyzer and searcher via the schema API

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 12 March 2018 23:05
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Defining a phonetic analyzer and searcher via the schema API

Chris:

LGTM, except maybe ;).....

You'll want to look closely at your admin UI/Analysis page for the field (or
fieldType) once it's defined. Uncheck the "verbose" box the first time you
look; it'll be less confusing. That'll show you _exactly_ what the
results are and whether they match your expectations. "right" is such an
existential question after all...

When you're using that page, think outside the box. For instance, I can't
say offhand whether the phonetic filter you chose gives different results
when words are capitalized or not. What about when they have numbers? Put
some punctuation in. Try an e-mail address.
Etc. etc. etc.

For instance, if you swap out StandardTokenizer for WhitespaceTokenizer,
you'll now have punctuation in the mix. Most people don't notice if they
have WordDelimiterGraphFilterFactory in the analysis chain too....
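
The same checks can be scripted instead of typed into the Analysis page one
at a time. This is only a rough sketch: it assumes the field analysis handler
is reachable at /analysis/field with the analysis.fieldtype and
analysis.fieldvalue parameters, and the host, the collection name
"mycollection", the sample values and the helper name analysis_url are all
made up for illustration.

```python
from urllib.parse import urlencode

# Inputs worth trying: mixed case, digits, punctuation, an e-mail address.
# These values are illustrative only.
SAMPLE_VALUES = ["Schultz", "SCHULTZ", "schultz2", "O'Brien-Smith", "chris@example.com"]


def analysis_url(base_url, collection, field_type, value):
    """Build a field-analysis request URL for one sample value."""
    params = urlencode({
        "analysis.fieldtype": field_type,   # the fieldType under test
        "analysis.fieldvalue": value,       # text to run through the index analyzer
        "wt": "json",
    })
    return f"{base_url}/{collection}/analysis/field?{params}"


if __name__ == "__main__":
    # Print one URL per sample; fetch them with a browser or curl and compare
    # the tokens that come out of each stage of the chain.
    for value in SAMPLE_VALUES:
        print(analysis_url("http://localhost:8983/solr", "mycollection", "phonetic", value))
```

Printing the URLs rather than fetching them keeps the sketch runnable without
a live Solr; the responses still have to be read by a human.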

bq: Actually, I have the script that builds the schema in VCS, so it's
roughly the same.

We're on the same page here. I don't particularly care how the schema gets
saved, as long as I can back up to the last known good schema and start
over....
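
If the schema itself (rather than the script that builds it) is what gets
saved, something along these lines could pull it through the Schema API and
write it out in a diff-friendly form. Treat it as a sketch under assumptions:
the host, collection name, output path and the helper names are hypothetical,
and sorting keys is just one way to keep VCS diffs stable.

```python
import json
from urllib.request import urlopen


def schema_url(base_url, collection):
    """URL of the Schema API endpoint that returns the whole schema as JSON."""
    return f"{base_url}/{collection}/schema?wt=json"


def stable_dump(schema):
    """Serialize with sorted keys so two snapshots diff cleanly."""
    return json.dumps(schema, indent=2, sort_keys=True) + "\n"


def snapshot_schema(base_url, collection, path):
    """Fetch the current schema and write it to a file for check-in."""
    with urlopen(schema_url(base_url, collection)) as resp:
        schema = json.load(resp)
    with open(path, "w", encoding="utf-8") as out:
        out.write(stable_dump(schema))


# Example call (needs a running Solr; names are placeholders):
# snapshot_schema("http://localhost:8983/solr", "mycollection", "schema-snapshot.json")
```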

I'll mention in passing that there's no problem whatsoever with using the
"classic" schema. The managed stuff is cool, and enables spiffy front-ends
etc. Personally I'm comfortable enough with hand-editing the schemas that I
find it faster, so I usually use the classic one.

BTW, bin/solr has a set of commands that allow you to upload/download
configs, try "bin/solr zk -help".....
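
For a script that wraps those commands, a sketch like the following builds
the argument list for the upconfig and downconfig subcommands. It assumes the
-z, -n and -d options as listed by the help output; the ZooKeeper address,
config name, directory and the helper name zk_config_cmd are placeholders.

```python
import subprocess


def zk_config_cmd(action, zk_host, config_name, config_dir, solr_bin="bin/solr"):
    """Return the argv for 'bin/solr zk upconfig' or 'bin/solr zk downconfig'."""
    if action not in ("upconfig", "downconfig"):
        raise ValueError(f"unsupported action: {action}")
    return [solr_bin, "zk", action,
            "-z", zk_host,        # ZooKeeper connection string
            "-n", config_name,    # name of the configset in ZK
            "-d", config_dir]     # local directory to read from / write to


def run_zk_config(action, zk_host, config_name, config_dir):
    """Run the command and raise if bin/solr reports a failure."""
    subprocess.run(zk_config_cmd(action, zk_host, config_name, config_dir), check=True)


# Example (placeholders; needs a Solr install and a reachable ZooKeeper):
# run_zk_config("downconfig", "localhost:2181", "myconf", "/tmp/myconf-backup")
```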

Walter:

"I don't usually test my code, but when I do it's in production".

These young whipper-snappers don't appreciate how _very_ many ways things
can go wrong ;)

My tongue-in-cheek way to distinguish novice from "veteran" programmers:

Novice: The code compiles and she's surprised when it doesn't work the first
time.

Veteran: The code ran perfectly the first time. She immediately goes over it
with a fine-tooth comb to see whether it's still running canned test cases.

Best,
Erick


On Mon, Mar 12, 2018 at 10:14 AM, Christopher Schultz
<chris@christopherschultz.net> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Erick,
>
> On 3/12/18 1:00 PM, Erick Erickson wrote:
>> bq: which you aren't supposed to edit directly.
>>
>> Well, kind of. Here's why it's "discouraged":
>> https://lucene.apache.org/solr/guide/6_6/schema-api.html.
>>
>> But as long as you don't mix-and-match hand-editing with using the
>> schema API you can hand edit it freely. You're then in charge of
>> pushing it to ZK and reloading your collections that use it yourself
>> however.
>
> No Zookeeper (yet), but I suspect I'll end up there. I'm mostly
> toying-around with it right now, but it won't be long before I'll want
> to go live with it and having a single Solr instance isn't going to
> help me sleep well at night. I'm sure I'll end up with two instances
> to begin with, which requires ZK, right?
>
>> As a side note, even if I _never_ hand-edited it I'd make it a
>> practice to regularly pull it from ZK and put it in some VCS system
>> ;)
>
> Actually, I have the script that builds the schema in VCS, so it's
> roughly the same.
>
> As for the schema modifications... did I get those right?
>
> Thanks,
> - -chris
>
>> On Mon, Mar 12, 2018 at 9:51 AM, Christopher Schultz
>> <chris@christopherschultz.net> wrote:
>>
>> All,
>>
>> I'd like to add a new synthesized field that uses a phonetic analyzer
>> such as Beider-Morse. I'm using Solr 7.2.
>>
>> When I request the current schema via the schema API, I get a list of
>> existing fields, dynamic fields, and analyzers, none of which appear
>> to be what I'm looking for.
>>
>> Conceptually, I think I'd like to do something like this:
>>
>> add-field: { name: phoneticname, type: phonetic, multiValued: true }
>>
>> ... but how do I define what type of data "phonetic" should be?
>>
>> I can see the example XML definition in this document:
>> https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#FilterDescriptions-Beider-MorseFilter
>>
>> But I'm not sure how to add an analyzer to the schema using the
>> schema API:
>> https://lucene.apache.org/solr/guide/7_2/schema-api.html
>>
>> Under "Add a new field type", it says that new analyzers can be
>> defined, but I'm not entirely sure how to do that ... the API docs
>> refer to the field type definitions page[1] which just shows what XML
>> you'd have to put into your schema XML -- which you aren't supposed
>> to edit directly.
>>
>> When looking at the JSON version of my schema, I can see for example
>> this:
>>
>> "fieldTypes":[{
>>     "name":"ancestor_path",
>>     "class":"solr.TextField",
>>     "indexAnalyzer":{
>>       "tokenizer":{
>>         "class":"solr.KeywordTokenizerFactory"}},
>>     "queryAnalyzer":{
>>       "tokenizer":{
>>         "class":"solr.PathHierarchyTokenizerFactory",
>>         "delimiter":"/"}}},
>>
>> So should I create a new field type like this?
>>
>> "add-field-type" : {
>>   "name" : "phonetic",
>>   "class" : "solr.TextField",
>>   "analyzer" : {
>>     "tokenizer": { "class" : "solr.StandardTokenizerFactory" },
>>     "filters" : [{
>>       "class": "solr.BeiderMorseFilterFactory",
>>       "nameType": "GENERIC",
>>       "ruleType": "APPROX",
>>       "concat": "true",
>>       "languageSet": "auto" }]
>>   }
>> }
>>
>> Then, use copy-field as "usual":
>>
>> "add-field":{ "name":"phonetic", "type":"phonetic",
>> "multiValued":true, "stored":false },
>>
>> "add-copy-field":{ "source":"first_name", "dest":"phonetic" },
>>
>> "add-copy-field":{ "source":"last_name", "dest":"phonetic" },
>>
>> This seems to work but I wanted to know if I was doing it the right
>> way.
>>
>> Thanks, -chris
>>
>> [1]
>> https://lucene.apache.org/solr/guide/7_2/field-type-definitions-and-properties.html#field-type-definitions-and-properties
>>
> -----BEGIN PGP SIGNATURE-----
> Comment: GPGTools - http://gpgtools.org
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQJRBAEBCAA7FiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlqmtY4dHGNocmlzQGNo
> cmlzdG9waGVyc2NodWx0ei5uZXQACgkQHPApP6U8pFhdIA/9GkZ/yimVmkwB725L
> uS4kcy4YJowyYw+eMtvurpIq/ZV/U8H4hFJY/ddsT+bdrjeZMsTdc7B9Tdlha8xt
> dmuj1VcvDn3uyIUGooTOob6ZvZwjeJEZIJrbwUM5gNq7uJW8xpCU0/3+iP6Km7OY
> 1Nia5uCuwarLWcsRFdtjCvR3M7ZppBYHec3kVGGOUL637AC6ISgpxhuzOnuTHAss
> wCjuR1y6AdTjRbHpis3MJdiVIjEENfyzGpEnqvumsu1e+0F/A0DNbhU9nAPv+73d
> aOLfOW9Fs6jjnq96qzIBAkHLWkqU1GHKYNYHql7/59x8rFcjGkGC7ziSY69lKc+f
> ivrIEqLH1Go7kawz+1og3dPyl/n0CFWE3UK+wj5QeTY5XLduq0x6EmFKW6D790BS
> ywmFuqr4cmvKbs3N6BbxHz5QVbjgRsWO4jp4kJi3KDCepd8vKW+2xwHfX/zAcBKY
> rSDuVkM3KtxQal8xgm4tsvyU3g1dXpNEVa7PFXYJzd3uA2yij9OU6s83NS9LHK3N
> 2zssPfNDj7QddAEhYan0O4r4wSUN2UNT9nMhBVXXYRpoD6WzrhC5TdRUDh66rkOB
> AvhAUKsV0rfjct+MUBpQA9W+SUG7i911wNSBJJmB58MYbyxMAJb8NKGk1yEs1MyH
> FQHEgiEEFRCD9ZFd/fqwfuPyKQo=
> =Vqz6
> -----END PGP SIGNATURE-----
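
For reference, the schema changes quoted in the original message could be
sent to the Schema API in a single POST. What follows is a minimal sketch,
not the script Chris actually uses: the host, the collection name and the
function names are assumptions, and the two copy-field commands are sent in
the array form so the payload stays a valid Python dict (JSON objects with
repeated keys cannot be built that way).

```python
import json
from urllib.request import Request, urlopen


def phonetic_schema_commands():
    """Schema API commands mirroring the field type, field and copy fields above."""
    return {
        "add-field-type": {
            "name": "phonetic",
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [{
                    "class": "solr.BeiderMorseFilterFactory",
                    "nameType": "GENERIC",
                    "ruleType": "APPROX",
                    "concat": "true",
                    "languageSet": "auto",
                }],
            },
        },
        "add-field": {
            "name": "phonetic",
            "type": "phonetic",
            "multiValued": True,
            "stored": False,
        },
        # Array form: one entry per source field (assumed equivalent to
        # repeating the "add-copy-field" command).
        "add-copy-field": [
            {"source": "first_name", "dest": "phonetic"},
            {"source": "last_name", "dest": "phonetic"},
        ],
    }


def build_schema_request(base_url, collection):
    """Prepare (but do not send) the POST to <base_url>/<collection>/schema."""
    body = json.dumps(phonetic_schema_commands()).encode("utf-8")
    return Request(f"{base_url}/{collection}/schema", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


def apply_schema(base_url, collection):
    """Send the request and return Solr's JSON response as a dict."""
    with urlopen(build_schema_request(base_url, collection)) as resp:
        return json.load(resp)


# Example (needs a running Solr; names are placeholders):
# print(apply_schema("http://localhost:8983/solr", "mycollection"))
```

Keeping the request construction separate from the sending makes it easy to
print and review the payload before it touches a live schema.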
