Saturday, March 17, 2018

FW: Scoping SolrCloud setup

-----Original Message-----
From: Scott Prentice [mailto:sp14@leximation.com]
Sent: 15 March 2018 06:12
To: solr-user@lucene.apache.org
Subject: Re: Scoping SolrCloud setup

Walter...

Thanks for the additional data points. Clearly we're a long way from needing
anything too complex.

Cheers!
...scott


On 3/14/18 1:12 PM, Walter Underwood wrote:
> That would be my recommendation for a first setup. One Solr instance per
> host, one shard per collection. We run 5 million document cores with 8 GB of
> heap for the JVM. We size the RAM so that all the indexes fit in OS
> filesystem buffers.
>
> Our big cluster is 32 hosts, 21 million documents in four shards. Each
> host is a 36-processor Amazon instance. Each host has one 8 GB Solr process
> (Solr 6.6.2, java 8u121, G1 collector). No faceting, but we get very long
> queries, average length is 25 terms.
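A setup like the one Walter describes is usually configured through `solr.in.sh`; here is a minimal sketch matching his numbers (8 GB heap, G1 collector). The GC flag values are illustrative assumptions, not tuned recommendations:

```shell
# Sketch of a solr.in.sh fragment: 8 GB heap, G1 collector.
# The pause-time target is illustrative; tune against your own GC logs.
SOLR_HEAP="8g"
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"
```

Remember that the heap is only part of the story: per Walter's sizing rule, the host needs enough RAM beyond the heap for the OS to cache the whole index.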
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Mar 14, 2018, at 12:50 PM, Scott Prentice <sp14@leximation.com> wrote:
>>
>> Erick...
>>
>> Thanks. Yes. I think we were just going shard-happy without really
>> understanding the purpose. I think we'll start by keeping things simple ..
>> no shards, fewer replicas, maybe a bit more RAM. Then we can assess the
>> performance and make adjustments as needed.
>>
>> Yes, that's the main reason for moving from our current non-cloud Solr
>> setup to SolrCloud .. future flexibility as well as greater stability.
>>
>> Thanks!
>> ...scott
>>
>>
>> On 3/14/18 11:34 AM, Erick Erickson wrote:
>>> Scott:
>>>
>>> Eventually you'll hit the limit of your hardware, regardless of VMs.
>>> I've seen multiple VMs help a lot when you have really beefy
>>> hardware, as in 32 cores, 128G memory and the like. Otherwise it's
>>> iffier.
>>>
>>> re: sharding or not. As others wrote, sharding is only useful when a
>>> single collection grows past the limits of your hardware. Until that
>>> point, it's usually a better bet to get better hardware than shard.
>>> I've seen 300M docs fit in a single shard. I've also seen 10M strain
>>> pretty beefy hardware.. but from what you've said multiple shards
>>> are really not something you need to worry about.
>>>
>>> About balancing all this across VMs and/or machines. You have a
>>> _lot_ of room to balance things. Let's say you put all your
>>> collections on one physical machine to start (not recommending, just
>>> sayin'). 6 months from now you need to move collections 1-10 to
>>> another machine due to growth. You:
>>> 1> spin up a new machine
>>> 2> build out collections 1-10 on those machines by using the
>>> Collections API ADDREPLICA.
>>> 3> once the new replicas are healthy, DELETEREPLICA on the old hardware.
>>>
>>> No down time. No configuration to deal with, SolrCloud will take
>>> care of it for you.
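The steps above map onto two Collections API calls. This is a dry-run sketch that only prints the request URLs (they need a live cluster to actually run); the host `newhost`, collection `collection1`, and replica name `core_node1` are hypothetical, and the real replica name would come from CLUSTERSTATUS:

```shell
# Dry-run sketch of the migration steps: print the Collections API
# URLs rather than issuing them. All names are hypothetical.
SOLR="http://newhost:8983/solr"
COLL="collection1"

# 2> build out the collection on the new machine with ADDREPLICA:
echo "${SOLR}/admin/collections?action=ADDREPLICA&collection=${COLL}&shard=shard1&node=newhost:8983_solr"

# 3> once the new replica reports active, drop the old one:
echo "${SOLR}/admin/collections?action=DELETEREPLICA&collection=${COLL}&shard=shard1&replica=core_node1"
```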
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Mar 14, 2018 at 9:32 AM, Scott Prentice <sp14@leximation.com> wrote:
>>>> Emir...
>>>>
>>>> Thanks for the input. Our larger collections are localized content,
>>>> so it may make sense to shard those so we can target the specific
>>>> index. I'll need to confirm how it's being used, if queries are
>>>> always within a language or if they are cross-language.
>>>>
>>>> Thanks also for the link .. very helpful!
>>>>
>>>> All the best,
>>>> ...scott
>>>>
>>>>
>>>>
>>>>
>>>> On 3/14/18 2:21 AM, Emir Arnautović wrote:
>>>>> Hi Scott,
>>>>> There is no definite answer - it depends on your documents and
>>>>> query patterns. Sharding does come with an overhead but also
>>>>> allows Solr to parallelise search. Query latency is usually
>>>>> something that tells you if you need to split a collection into
>>>>> multiple shards or not. If you are ok with the latency, there is
>>>>> no need to split. Another scenario where shards make sense is when
>>>>> routing is used in the majority of queries, which enables you to
>>>>> query only a subset of documents.
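Solr's composite-ID routing is one way to get this subset behavior. A sketch, assuming documents are indexed with a language prefix in their ids (e.g. `en!doc42`) so that a `_route_` parameter restricts the query to the shard(s) holding that language; host and collection names are hypothetical, and the URL is printed rather than fetched:

```shell
# Dry-run sketch of composite-ID routing: with ids like "en!doc42",
# _route_=en! limits the query to the shard(s) holding English docs.
SOLR="http://localhost:8983/solr"
PREFIX="en"
echo "${SOLR}/techdocs/select?q=title:install&_route_=${PREFIX}!"
```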
>>>>> Also, there is an indexing aspect where sharding helps: when high
>>>>> indexing throughput is needed, having multiple shards will spread
>>>>> the indexing load across multiple servers.
>>>>> It seems to me that there is no high indexing throughput
>>>>> requirement, so the main criterion should be query latency.
>>>>> Here is another blog post talking about this subject:
>>>>> http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
>>>>>
>>>>> Thanks,
>>>>> Emir
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
>>>>> Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>
>>>>>
>>>>>
>>>>>> On 14 Mar 2018, at 01:01, Scott Prentice <sp14@leximation.com> wrote:
>>>>>>
>>>>>> We're in the process of moving from 12 single-core collections
>>>>>> (non-cloud
>>>>>> Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge,
>>>>>> ranging in size from 50K to 150K documents with one at 1.2M docs.
>>>>>> Our max query frequency is rather low .. probably no more than
>>>>>> 10-20/min. We do update frequently, maybe 10-100 documents every 10
>>>>>> mins.
>>>>>>
>>>>>> Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and
>>>>>> we've got each collection split into 2 shards with 3 replicas
>>>>>> (one per VM). Also, Zookeeper is running on each VM. I understand
>>>>>> that it's best to have each ZK server on a separate machine, but
>>>>>> hoping this will work for now.
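For reference, a co-located 3-node ensemble like this is declared identically in each server's zoo.cfg; a minimal sketch with hypothetical host names (each server also needs a matching `myid` file in its dataDir):

```
# zoo.cfg fragment for a 3-node ensemble on the three VMs (hosts are
# hypothetical). 2888/3888 are the peer and leader-election ports.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=vm1:2888:3888
server.2=vm2:2888:3888
server.3=vm3:2888:3888
```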
>>>>>>
>>>>>> This all seemed like a good place to start, but after reading
>>>>>> lots of articles and posts, I'm thinking that maybe our smaller
>>>>>> collections (under 100K docs) should just be one shard each, and
>>>>>> maybe the 1.2M collection should be more like 6 shards. How do you
>>>>>> decide how many shards is right?
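Whatever count you settle on is chosen at CREATE time (though SPLITSHARD can divide a shard later). A dry-run sketch of what the two kinds of collection might look like; the names, shard counts, and configset name are illustrative assumptions, and the URLs are printed rather than issued:

```shell
# Dry-run sketch: a one-shard collection for the small (<100K doc)
# sets and a multi-shard one for the larger set. All names illustrative.
SOLR="http://localhost:8983/solr"
echo "${SOLR}/admin/collections?action=CREATE&name=smallcoll&numShards=1&replicationFactor=3&collection.configName=myconf"
echo "${SOLR}/admin/collections?action=CREATE&name=bigcoll&numShards=6&replicationFactor=3&collection.configName=myconf"
```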
>>>>>>
>>>>>> Also, our current live system is separated into dev/stage/prod
>>>>>> tiers; now, all of these tiers are together on each of the cloud
>>>>>> VMs. This bothers some people, who think that it may make our
>>>>>> production environment less stable. I know that in an ideal
>>>>>> world, we'd have them all on separate systems, but with the
>>>>>> replication, it seems like we're going to make the overall system
>>>>>> more stable. Is this a correct understanding?
>>>>>>
>>>>>> I'm just wondering if anyone has opinions on whether we're going in
>>>>>> a reasonable direction or not. Are there any articles that
>>>>>> discuss these initial sizing/scoping issues?
>>>>>>
>>>>>> Thanks!
>>>>>> ...scott
>>>>>>
>>>>>>
>
