-----Original Message-----
From: Rahul Singh [mailto:rahul.xavier.singh@gmail.com]
Sent: 20 March 2018 20:10
To: solr-user@lucene.apache.org
Subject: RE: Question liste solr
Parallelizing the load in any form will help, including Spark with a DFS
like S3 or HDFS. Your three machines could end up being the bottleneck, and
you may need more nodes.
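
To make "parallel" concrete, here is a minimal sketch of one way to do it:
read the CSV once, cut it into batches, and post the batches to Solr's JSON
update endpoint from a few worker threads. It uses only the Python standard
library in place of ConcurrentUpdateSolrClient, so treat it as an
illustration of the idea and not as your actual setup. The URL, collection
name, batch size, worker count and all function names are assumptions made
up for the example.

```python
import csv
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Assumed values for illustration only; none of them come from the thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"
BATCH_SIZE = 1000
WORKERS = 4


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(docs, url=SOLR_UPDATE_URL):
    """POST one batch to Solr as a JSON array of documents.

    No commit parameter is sent, so commits are left to server-side autoCommit.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_csv(csv_file, send=post_batch, batch_size=BATCH_SIZE, workers=WORKERS):
    """Send the rows of an open CSV file in parallel batches.

    Returns the number of batches sent. Batches are submitted in windows of
    `workers * 2` so that only a bounded part of the file is held in memory.
    """
    reader = csv.DictReader(csv_file)  # first CSV line is taken as field names
    sent = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for window in batches(batches(reader, batch_size), workers * 2):
            # pool.map re-raises any exception from a worker when iterated.
            for _ in pool.map(send, window):
                sent += 1
    return sent
```

Usage would be something like `with open("export.csv", newline="") as f:
index_csv(f)`, where `export.csv` is a placeholder name. Passing a different
`send` function lets you try the batching logic without a running Solr.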
On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext
<mariano.lopez-cortes-ext@pole-emploi.fr>, wrote:
> The CSV file is approx. 5 GB for 29 million rows.
>
> As you say, Christopher, at the beginning we thought that reading chunk
> by chunk from Oracle and writing to Solr was the best strategy.
>
> But from our tests we have observed:
>
> CSV creation via PL/SQL is really fast: 40 minutes for the full dataset
> (with bulk collect).
> Multiple SELECT calls from Java slow down the process. I think Oracle is
> the bottleneck here.
>
> Any other ideas/alternatives?
>
> Some other points to note:
>
> We are going to enable autoCommit every 10 minutes / 10000 rows, with no
> commit from the client.
> During indexing, we always go through a front-end load balancer that
> redirects calls to the 3-node cluster.
>
> Thanks in advance!!
>
> ==> Great mailing list and really awesome tool!!
>
> -----Original Message-----
> From: Christopher Schultz [mailto:chris@christopherschultz.net]
> Sent: Monday, 19 March 2018 18:05
> To: solr-user@lucene.apache.org
> Subject: Re: Question liste solr
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Mariano,
>
> On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> > Hello
> >
> > We have a Solr index with 3 nodes, 1 shard and 2 replicas.
> >
> > Our goal is to index 42 million rows. Indexing time is important.
> > The data source is an Oracle database.
> >
> > Our indexing strategy is:
> >
> > * Reading from Oracle to a big CSV file.
> >
> > * Reading from 4 files (big file chunked) and injection via
> > ConcurrentUpdateSolrClient
> >
> > Is this the optimal way of injecting such a mass of data into Solr?
> >
> > For information, the estimated time for our solution is 6 hours.
>
> How big are the CSV files? If most of the time is taken performing the
> various SELECT operations, then it's probably a good strategy.
>
> However, you may find that using the disk as a buffer slows everything
> down because disk-writes can be very slow.
>
> Why not perform your SELECT(s) and write directly to Solr using one of the
> APIs (either a language-specific API, or through the HTTP API)?
>
> Hope that helps,
> - -chris
> -----BEGIN PGP SIGNATURE-----
> Comment: GPGTools - http://gpgtools.org
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQJRBAEBCAA7FiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlqv7aEdHGNocmlzQGNo
> cmlzdG9waGVyc2NodWx0ei5uZXQACgkQHPApP6U8pFgJrg//RushznZlTg60TxdE
> s/XKK+69s9c0+DwZ/IrU366j2ZOcJl8Osu9TpzaCSEpdWuulFG8qCSYThTngaijH
> I02YCqnK9Ey4+6B7u9QECWNXjdlQXoeINjCnRLVENWzkSmht/U2nW3WTFEPKOvQ3
> 6ISTPATFnfo6Wt4VYrVefqO/yCCiR5bGL5LsSZYwvqlh9egR8K/wtf4sQ5kji3z+
> r2Z0gYpR9igE3ZCIByf6QGq0Ftku90oFCG+kCVNOdgfqwkUaMdc7krv92oTSH4o5
> BH+trc2jPf3HKFmp/ywRAPEhAfA5BwbT8vB9gwl/6vuT6efAot7xrLqduF3h7jG6
> ffPtkEBbD/ld3inIVta6/hnUwxX9O1fBtJrZegD14cezLV9QcEWFJ8/lUfgGOTdX
> ZuvwxBFhmCXE9EMWLlpdUOWK9iVBsZoQZxawoqw9xQauBp/Adg29fdeXmEkUssey
> 85HGDv/x33Bcr1xPGa8nOygWcZRUgGFCh871qStg9GeTNx3C/mSk0wxdKeUDRePg
> GEuL0p803yCJYAddyF66nnx676LfFeDaocBJelx5UbiteNT23xut7jWP/COyOvoy
> tpq3c9UfIkobgcA7bZ3IL2Og+hExgo+tLQXiOx6bf2TD1Jk2UOWWk1TAUspuUybD
> VH6PlwgqcrO28Jx799mJvpIotoE=
> =aMPk
> -----END PGP SIGNATURE-----