Sunday, April 1, 2018

FW: CDCR performance issues

-----Original Message-----
From: Tom Peters [mailto:tpeters@synacor.com]
Sent: 23 March 2018 22:54
To: solr-user@lucene.apache.org
Subject: Re: CDCR performance issues

Thanks for responding. My responses are inline.

> On Mar 23, 2018, at 8:16 AM, Amrit Sarkar <sarkaramrit2@gmail.com> wrote:
>
> Hey Tom,
>
>> I'm also having issues with replicas in the target data center. They will
>> go from recovering to down. And when one of my replicas goes down in
>> the target data center, CDCR will no longer send updates from the
>> source to the target.
>
>
> Were you able to figure out the issue? As long as the leaders of each
> shard in each collection are up and serving, CDCR shouldn't stop.

I cannot replicate the issue I was having. In a test environment, I'm able
to knock one of the replicas into recovery mode and can verify that CDCR
updates are still being sent.
>
>> Sometimes we have to reindex a large chunk of our index (1M+ documents).
>> What's the best way to handle this if the normal CDCR process won't
>> be able to keep up? Manually trigger a bootstrap again? Or is there
>> something else we can do?
>>
>
> That's one of the limitations of CDCR: it cannot handle bulk indexing.
> The preferable way to do it is:
> * stop cdcr
> * bulk index
> * issue manual BOOTSTRAP (it is independent of stop and start cdcr)
> * start cdcr
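
The four steps above could be scripted roughly as follows. This is a minimal sketch, not a tested procedure: the host names and the core URL are hypothetical, START and STOP are the documented CDCR API actions, and the BOOTSTRAP request with a masterUrl parameter is an assumption about how a manual bootstrap is issued (the thread does not spell it out), so check it against your Solr version first. The code only builds the request URLs; it does not send them.

```python
from urllib.parse import urlencode


def cdcr_url(base_url, collection, action, **params):
    """Build a CDCR request URL for one collection."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/solr/{collection}/cdcr?{query}"


def bulk_reindex_plan(source, target, collection, source_leader_core_url):
    """Return the suggested steps as (description, url) pairs.

    A url of None marks a step done outside the CDCR API.
    """
    return [
        ("stop CDCR on the source", cdcr_url(source, collection, "STOP")),
        ("bulk index into the source collection", None),
        # ASSUMPTION: BOOTSTRAP with masterUrl is not confirmed by the thread.
        ("bootstrap the target from the source leader",
         cdcr_url(target, collection, "BOOTSTRAP",
                  masterUrl=source_leader_core_url)),
        ("start CDCR on the source", cdcr_url(source, collection, "START")),
    ]


# Hypothetical hosts and core name, for illustration only.
plan = bulk_reindex_plan(
    source="http://source-solr:8080",
    target="http://target-solr:8080",
    collection="mycollection",
    source_leader_core_url="http://source-solr:8080/solr/mycollection_shard1_replica_n1",
)
for description, url in plan:
    print(f"{description}: {url or '(manual step)'}")
```

Keeping the plan as data makes it easy to review the exact requests before anything is sent to either data center.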

I plan on testing this, but if I issue a bootstrap, will I run into the
https://issues.apache.org/jira/browse/SOLR-11724 bug, where the bootstrap
doesn't replicate to the replicas?

>> 1. Is it accurate that updates are not actually batched in transit
>> from the source to the target and instead each document is posted separately?
>
>
> The batchSize and schedule settings regulate how many docs are sent to the
> target. This has more details:
> https://lucene.apache.org/solr/guide/7_2/cdcr-config.html#the-replicator-element
>

As far as I can tell, I'm not seeing batching. I'm using tcpdump (and a
script to decode the JavaBin bytes) to monitor what is actually being
sent, and I'm seeing documents arrive one at a time.

POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199&wt=javabin&version=2 HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902502068224]):null]]}
----------
POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199&wt=javabin&version=2 HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902600634368]):null]]}
----------
POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199&wt=javabin&version=2 HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902698151936]):null]]}

>
>
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Tue, Mar 13, 2018 at 12:21 AM, Tom Peters <tpeters@synacor.com> wrote:
>
>> I'm also having issues with replicas in the target data center. They
>> will go from recovering to down. And when one of my replicas goes
>> down in the target data center, CDCR will no longer send updates from
>> the source to the target.
>>
>>> On Mar 12, 2018, at 9:24 AM, Tom Peters <TPeters@synacor.com> wrote:
>>>
>>> Anyone have any thoughts on the questions I raised?
>>>
>>> I have another question related to CDCR:
>>> Sometimes we have to reindex a large chunk of our index (1M+ documents).
>>> What's the best way to handle this if the normal CDCR process won't
>>> be able to keep up? Manually trigger a bootstrap again? Or is there
>>> something else we can do?
>>>
>>> Thanks.
>>>
>>>
>>>
>>>> On Mar 9, 2018, at 3:59 PM, Tom Peters <TPeters@synacor.com> wrote:
>>>>
>>>> Thanks. This was helpful. I did some tcpdumps and I'm noticing that
>>>> the requests to the target data center are not batched in any way. Each
>>>> update comes in as an independent update. Some follow-up questions:
>>>>
>>>> 1. Is it accurate that updates are not actually batched in transit
>>>> from the source to the target and instead each document is posted separately?
>>>>
>>>> 2. Are they done synchronously? I assume yes (since you wouldn't
>>>> want operations applied out of order)
>>>>
>>>> 3. If they are done synchronously, and are not batched in any way,
>>>> does that mean that the best performance I can expect would be roughly how
>>>> long it takes to round-trip a single document? I.e., if my average ping
>>>> is 25ms, then I can expect a peak performance of roughly 40 ops/s.
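
The arithmetic behind that estimate can be written out as a small sketch. It assumes strictly synchronous requests on a single connection and ignores server-side processing time and any parallel replicator threads, so it gives an upper bound for one sender rather than a prediction. The function name and the docs_per_request parameter are illustrative only.

```python
def peak_ops_per_second(round_trip_ms, docs_per_request=1):
    """Upper bound on throughput when each request must wait for its reply.

    One request completes per round trip, so at most 1000 / round_trip_ms
    requests fit into a second; each request carries docs_per_request docs.
    """
    return docs_per_request * 1000.0 / round_trip_ms


# One document per request over a 25 ms round trip.
print(peak_ops_per_second(25))
# The same link if 128 documents travelled in each request.
print(peak_ops_per_second(25, docs_per_request=128))
```

The second call shows why batching matters on a high-latency link: the round-trip cost is paid once per request, not once per document.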
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C] <daniel.davis@nih.gov> wrote:
>>>>>
>>>>> These are general guidelines. I've done loads of networking, but
>>>>> may be less familiar with SolrCloud and CDCR architecture. However, I
>>>>> know it's all TCP sockets, so general guidelines do apply.
>>>>>
>>>>> Check the round-trip time between the data centers using ping or
>>>>> TCP ping. Throughput tests may be high, but if Solr has to wait for a
>>>>> response to a request before sending the next action, then just like
>>>>> any network protocol that does that, it will get slow.
>>>>>
>>>>> I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also
>>>>> check whether some proxy/load balancer between data centers is causing it
>>>>> to be a single connection per operation. That will *kill* performance.
>>>>> Some proxies default to HTTP/1.0 (open, send request, server sends
>>>>> response, close), and that will hurt.
>>>>>
>>>>> Why you should listen to me even without SolrCloud knowledge:
>>>>> check out the paper "Latency performance of SOAP Implementations". Same
>>>>> distribution of skills: I knew TCP well, but Apache Axis 1.1 not so
>>>>> well. I still improved the response time of Apache Axis 1.1 by 250ms
>>>>> per call with one line of code.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tom Peters [mailto:tpeters@synacor.com]
>>>>> Sent: Wednesday, March 7, 2018 6:19 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: CDCR performance issues
>>>>>
>>>>> I'm having issues with the target collection staying up-to-date
>>>>> with indexing from the source collection using CDCR.
>>>>>
>>>>> This is what I'm getting back in terms of OPS:
>>>>>
>>>>> curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
>>>>> {
>>>>>   "responseHeader": {
>>>>>     "status": 0,
>>>>>     "QTime": 0
>>>>>   },
>>>>>   "operationsPerSecond": [
>>>>>     "zook01,zook02,zook03/solr",
>>>>>     [
>>>>>       "mycollection",
>>>>>       [
>>>>>         "all",
>>>>>         49.10140553500938,
>>>>>         "adds",
>>>>>         10.27612635309587,
>>>>>         "deletes",
>>>>>         38.82527896994054
>>>>>       ]
>>>>>     ]
>>>>>   ]
>>>>> }
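
The OPS response above nests its data as flat lists that alternate name and value (target ZooKeeper address, then collection, then per-operation rates). A minimal sketch of turning that shape into dictionaries, assuming the layout shown in this one sample holds in general; the helper names are made up for illustration:

```python
import json

# Sample copied from the OPS response shown above.
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 0},
  "operationsPerSecond": [
    "zook01,zook02,zook03/solr",
    ["mycollection",
      ["all", 49.10140553500938,
       "adds", 10.27612635309587,
       "deletes", 38.82527896994054]]
  ]
}
"""


def pairs_to_dict(flat):
    """Turn [k1, v1, k2, v2, ...] into {k1: v1, k2: v2, ...}."""
    return dict(zip(flat[0::2], flat[1::2]))


def parse_ops(response):
    """Return {zk_host: {collection: {"all": .., "adds": .., "deletes": ..}}}."""
    result = {}
    by_target = pairs_to_dict(response["operationsPerSecond"])
    for zk_host, collections in by_target.items():
        result[zk_host] = {
            name: pairs_to_dict(stats)
            for name, stats in pairs_to_dict(collections).items()
        }
    return result


ops = parse_ops(json.loads(SAMPLE))
for zk_host, collections in ops.items():
    for name, stats in collections.items():
        print(f"{name} -> {zk_host}: "
              f"{stats['adds']:.1f} adds/s, {stats['deletes']:.1f} deletes/s")
```

In this sample most of the forwarded operations are deletes, which is worth keeping in mind when reading the overall rate.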
>>>>>
>>>>> The source and target collections are in separate data centers.
>>>>>
>>>>> Doing a network test between the leader node in the source data
>>>>> center and the ZooKeeper nodes in the target data center shows decent
>>>>> enough network performance: ~181 Mbit/s
>>>>>
>>>>> I've tried playing around with the "batchSize" value (128, 512,
>>>>> 728, 1000, 2000, 2500) and the changes haven't made much of a difference.
>>>>>
>>>>> Any suggestions on potential settings to tune to improve the
>>>>> performance?
>>>>>
>>>>> Thanks
>>>>>
>>>>> --
>>>>>
>>>>> Here's some relevant log lines from the source data center's leader:
>>>>>
>>>>> 2018-03-07 23:16:11.984 INFO (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>>>>> 2018-03-07 23:16:23.062 INFO (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
>>>>> 2018-03-07 23:16:32.063 INFO (cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>>>>> 2018-03-07 23:16:36.209 INFO (cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
>>>>> 2018-03-07 23:16:42.091 INFO (cdcr-replicator-207-thread-2-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
>>>>> 2018-03-07 23:16:46.790 INFO (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>>>>> 2018-03-07 23:16:50.004 INFO (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
>>>>>
>>>>>
>>>>> And what the log looks like in the target:
>>>>>
>>>>> 2018-03-07 23:18:46.475 INFO (qtp1595212853-26) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067896487950&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.500 INFO (qtp1595212853-25) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067896487951&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.525 INFO (qtp1595212853-24) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536512&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.550 INFO (qtp1595212853-3793) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536513&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.575 INFO (qtp1595212853-30) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536514&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.600 INFO (qtp1595212853-26) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536515&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.625 INFO (qtp1595212853-25) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536516&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.651 INFO (qtp1595212853-24) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536517&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.676 INFO (qtp1595212853-3793) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536518&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
>>>>> 2018-03-07 23:18:46.701 INFO (qtp1595212853-30) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536519&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
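
The spacing of the target log entries above can be measured directly. The sketch below uses the ten timestamps copied from those entries and computes the gap between consecutive requests. It only describes this short excerpt; it does not by itself establish why the requests are spaced this way.

```python
from datetime import datetime, timedelta

# Timestamps copied from the ten target log entries above.
TIMESTAMPS = [
    "2018-03-07 23:18:46.475",
    "2018-03-07 23:18:46.500",
    "2018-03-07 23:18:46.525",
    "2018-03-07 23:18:46.550",
    "2018-03-07 23:18:46.575",
    "2018-03-07 23:18:46.600",
    "2018-03-07 23:18:46.625",
    "2018-03-07 23:18:46.651",
    "2018-03-07 23:18:46.676",
    "2018-03-07 23:18:46.701",
]


def gaps_ms(stamps):
    """Milliseconds between each pair of consecutive timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]
    one_ms = timedelta(milliseconds=1)
    return [(later - earlier) / one_ms
            for earlier, later in zip(times, times[1:])]


gaps = gaps_ms(TIMESTAMPS)
mean_gap = sum(gaps) / len(gaps)
rate = 1000.0 / mean_gap
print(f"gaps (ms): {gaps}")
print(f"mean gap: {mean_gap:.1f} ms, about {rate:.1f} requests/s")
```

For this excerpt the gaps come out at 25 or 26 ms each, which is in the same range as the OPS figures reported earlier in the message.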
>>>>>
>>>>>
>>>>>
>>>>> This message and any attachment may contain information that is
>> confidential and/or proprietary. Any use, disclosure, copying,
>> storing, or distribution of this e-mail or any attached file by
>> anyone other than the intended recipient is strictly prohibited. If
>> you have received this message in error, please notify the sender by
>> reply email and delete the message and any attachments. Thank you.




