Siddhast lab

Tuesday, March 6, 2018

FW: Nested documents vs. flattening document structure?

-----Original Message-----
From: Dc Tech [mailto:dctech1000@gmail.com]
Sent: 06 March 2018 20:53
To: solr-user@lucene.apache.org
Subject: Re: Nested documents vs. flattening document structure?

Thank you Erick.
That was my instinct as well.

On Tue, Mar 6, 2018 at 10:05 AM, Erick Erickson <erickerickson@gmail.com>
wrote:

> Flattening the nested documents is usually preferred if at all
> possible. Nested documents do, indeed, have a series of restrictions
> that often make them harder to work with than flattened docs.
>
> Best,
> Erick
>
> On Tue, Mar 6, 2018 at 6:48 AM, Dc Tech <dctech1000@gmail.com> wrote:
> > We are evaluating using nested documents vs. simply flattening the
> > document.
> >
> > Looking through the documentation, it is not very clear to me if the
> > nested documents are fully mature, and support the full richness of SOLR
> > (streaming, mature faceting) etc...
> >
> > Any opinions or guidance on that?
> >
> >
> > For *flattening*, we are thinking of setting up three groups of fields:
> > 1. Fields for search - 3-4 groups of fields that glom together the
> > document fields in order of boosting priority (e.g. f1 has just the
> > title, f2 has title+authors....)
> > 2. Fields for faceting if needed
> > 3. and Fields for display (or the original document fields) e.g.
> > author_name|author_unique_id...
>

FW: Solr dih extract text from inline images in pdf

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 06 March 2018 20:52
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Solr dih extract text from inline images in pdf

It's often much easier to approach this by running Tika separately.
Here's a blog on both the reasoning and sample code:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

Among other things, you have a lot more control over how Tika operates.

Best,
Erick

On Tue, Mar 6, 2018 at 12:36 AM, lala <labishahla@gmail.com> wrote:
> Hi,
>
> I am working with solr7, indexing multilingual files existing in a
> folder, using DIH (FileListEntityProcessor for the basic entity, &
> TikaEntityProcessor for the child entity in configuration file).
>
> My problem lies here: I want to extract text from images inside PDF
> files. That works fine with the /update/extract request handler, where
> I set the "parseContext.config" attribute to an XML file, let's say
> "context.xml", where I set the property "extractInlineImages" for the
> entry [PDFParserConfig] to true. But I have no idea how to set
> parseContext.config in the DIH configuration.
>
> I tried these approaches, none of them worked:
>
> - set the tikaConfig attribute in the DIH config file to my "context.xml";
> obviously this won't work, since the Tika config is different.
> - set the parseContext.config attribute on my "\dataImport"
> requestHandler; didn't work
>
> I googled a lot with no result... I would really appreciate any help
> here!
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

FW: Copying a SolrCloud collection to other hosts

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 06 March 2018 20:48
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Copying a SolrCloud collection to other hosts

This is part of the "different replica types" capability: there are NRT (the
only type available prior to 7.x), PULL and TLOG, which would have different
names. I don't know of any way to switch it off.

As far as moving the data, here's a little-known trick: use the replication
API to issue a fetchindex, see:
https://lucene.apache.org/solr/guide/6_6/index-replication.html
As long as the target cluster can "see" the source cluster via HTTP, this
should work.
This is entirely outside SolrCloud and ZooKeeper is not involved. This would
even work with, say, one side being stand-alone and the other being
SolrCloud (not that you want to do that, just illustrating it's not part of
SolrCloud)...

So you'd specify something like:
http://target_node:port/solr/core_name/replication?command=fetchindex&masterUrl=http://source_node:port/solr/core_name

"core_name" in these cases is what appears in the "cores" dropdown on the
admin UI page. You do not have to shut Solr down at all on either end to use
this, although last I knew the target node would not serve queries while
this was happening.
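
The URL above can also be assembled programmatically, one call per core. This is only a minimal sketch: the helper name `fetchindex_url`, the port 8983, and the core names in the usage line are assumptions for illustration, not something stated in the thread. `urlencode` percent-encodes the `masterUrl` value so it survives as a single query parameter.

```python
from urllib.parse import urlencode


def fetchindex_url(target_base, target_core, source_base, source_core):
    """Build a replication-API URL that asks the target core to pull the
    index from the source core (hypothetical helper, not part of Solr)."""
    master_url = f"{source_base}/solr/{source_core}"
    query = urlencode({"command": "fetchindex", "masterUrl": master_url})
    return f"{target_base}/solr/{target_core}/replication?{query}"


# Host names, port and core names below are placeholders.
url = fetchindex_url(
    "http://target_node:8983", "main_index_shard1_replica_n1",
    "http://source_node:8983", "main_index_shard1_replica1",
)
print(url)
```

The resulting URL would then be requested once per target core with any HTTP client.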

An alternative is to not hard-code the names in your copy script, but rather
to look up the source and target information in ZooKeeper. You could do this
by using the CLUSTERSTATUS Collections API call.
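
As a rough sketch of that lookup: the function below takes an already-parsed CLUSTERSTATUS response (for example from /solr/admin/collections?action=CLUSTERSTATUS&wt=json) and maps each shard to the core that hosts it. The function name `shard_to_core` and the trimmed sample response are assumptions for illustration; the sketch also assumes one replica per shard, as in the setup described in this thread.

```python
def shard_to_core(cluster_status, collection):
    """Map shard name -> (base_url, core_name) using the first replica
    listed for each shard (hypothetical helper)."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    mapping = {}
    for shard_name, shard in shards.items():
        replica = next(iter(shard["replicas"].values()))
        mapping[shard_name] = (replica["base_url"], replica["core"])
    return mapping


# Hand-written, heavily trimmed sample; a real response has more keys.
sample_status = {
    "cluster": {
        "collections": {
            "main_index": {
                "shards": {
                    "shard1": {"replicas": {"core_node3": {
                        "core": "main_index_shard1_replica_n1",
                        "base_url": "http://node1:8983/solr",
                    }}},
                    "shard2": {"replicas": {"core_node5": {
                        "core": "main_index_shard2_replica_n2",
                        "base_url": "http://node1:8984/solr",
                    }}},
                }
            }
        }
    }
}

cores = shard_to_core(sample_status, "main_index")
```

Running the same lookup against the source and the target cluster gives, per shard, the two core names needed to pair them, whatever replica suffix each side uses.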

Best,
Erick

On Tue, Mar 6, 2018 at 6:47 AM, Patrick Schemitz <ps@solute.de> wrote:
> Hi List,
>
> so I'm running a bunch of SolrCloud clusters (each cluster is: 8
> shards on 2 servers, with 4 instances per server, no replicas, i.e. 1
> shard per instance).
>
> Building the index afresh takes 15+ hours, so when I have to deploy a
> new index, I build it once, on one cluster, and then copy (scp) over
> the data/<main_index>/index directories (shutting down the Solr
> instances first).
>
> I could get Solr 6.5.1 to number the shard/replica directories nicely
> via the createNodeSet and createNodeSet.shuffle options:
>
> Solr 6.5.1 /var/lib/solr:
>
> Server node 1:
> instance00/data/main_index_shard1_replica1
> instance01/data/main_index_shard2_replica1
> instance02/data/main_index_shard3_replica1
> instance03/data/main_index_shard4_replica1
>
> Server node 2:
> instance00/data/main_index_shard5_replica1
> instance01/data/main_index_shard6_replica1
> instance02/data/main_index_shard7_replica1
> instance03/data/main_index_shard8_replica1
>
> However, while attempting to upgrade to 7.2.1, this numbering has changed:
>
> Solr 7.2.1 /var/lib/solr:
>
> Server node 1:
> instance00/data/main_index_shard1_replica_n1
> instance01/data/main_index_shard2_replica_n2
> instance02/data/main_index_shard3_replica_n4
> instance03/data/main_index_shard4_replica_n6
>
> Server node 2:
> instance00/data/main_index_shard5_replica_n8
> instance01/data/main_index_shard6_replica_n10
> instance02/data/main_index_shard7_replica_n12
> instance03/data/main_index_shard8_replica_n14
>
> This new numbering breaks my copy script, and furthermore, I'm worried
> as to what happens when the numbering is different among target clusters.
>
> How can I switch this back to the old numbering scheme?
>
> Side note: is there a recommended way of doing this? Is the
> backup/restore mechanism suitable for this? The ref guide is kind of
> terse here.
>
> Thanks in advance,
>
> Ciao, Patrick

FW: Nested documents vs. flattening document structure?

-----Original Message-----
From: Dc Tech [mailto:dctech1000@gmail.com]
Sent: 06 March 2018 20:19
To: solr-user@lucene.apache.org
Subject: Nested documents vs. flattening document structure?

We are evaluating using nested documents vs. simply flattening the document.

Looking through the documentation, it is not very clear to me if the nested
documents are fully mature, and support the full richness of SOLR
(streaming, mature faceting) etc...

Any opinions or guidance on that?

For *flattening*, we are thinking of setting up three groups of fields:
1. Fields for search - 3-4 groups of fields that glom together the document
fields in order of boosting priority (e.g. f1 has just the title, f2 has
title+authors....)
2. Fields for faceting if needed
3. and Fields for display (or the original document fields) e.g.
author_name|author_unique_id...
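
To make the three groups concrete, here is a minimal sketch of flattening one nested record before indexing. Only `f1`, `f2` and the `author_name|author_unique_id` pattern come from the message above; the input shape, the function name `flatten` and the field names `author_facet` and `author_display` are assumptions for illustration.

```python
def flatten(doc):
    """Turn a nested record into search, facet and display fields
    (hypothetical field layout)."""
    authors = doc.get("authors", [])
    names = [a["name"] for a in authors]
    return {
        # 1. search fields, in order of boosting priority
        "f1": doc["title"],
        "f2": " ".join([doc["title"], *names]),
        # 2. facet field
        "author_facet": names,
        # 3. display field: name and unique id joined with "|"
        "author_display": [f'{a["name"]}|{a["id"]}' for a in authors],
    }


nested = {
    "title": "Example Title",
    "authors": [
        {"name": "A. Author", "id": "a1"},
        {"name": "B. Writer", "id": "b2"},
    ],
}
flat = flatten(nested)
```

The per-field boosts would be applied at query time, not in this step.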

FW: Copying a SolrCloud collection to other hosts

-----Original Message-----
From: Patrick Schemitz [mailto:ps@solute.de]
Sent: 06 March 2018 20:18
To: solr-user@lucene.apache.org
Subject: Copying a SolrCloud collection to other hosts

Hi List,

so I'm running a bunch of SolrCloud clusters (each cluster is: 8 shards on 2
servers, with 4 instances per server, no replicas, i.e. 1 shard per
instance).

Building the index afresh takes 15+ hours, so when I have to deploy a new
index, I build it once, on one cluster, and then copy (scp) over the
data/<main_index>/index directories (shutting down the Solr instances
first).

I could get Solr 6.5.1 to number the shard/replica directories nicely via
the createNodeSet and createNodeSet.shuffle options:

Solr 6.5.1 /var/lib/solr:

Server node 1:
instance00/data/main_index_shard1_replica1
instance01/data/main_index_shard2_replica1
instance02/data/main_index_shard3_replica1
instance03/data/main_index_shard4_replica1

Server node 2:
instance00/data/main_index_shard5_replica1
instance01/data/main_index_shard6_replica1
instance02/data/main_index_shard7_replica1
instance03/data/main_index_shard8_replica1

However, while attempting to upgrade to 7.2.1, this numbering has changed:

Solr 7.2.1 /var/lib/solr:

Server node 1:
instance00/data/main_index_shard1_replica_n1
instance01/data/main_index_shard2_replica_n2
instance02/data/main_index_shard3_replica_n4
instance03/data/main_index_shard4_replica_n6

Server node 2:
instance00/data/main_index_shard5_replica_n8
instance01/data/main_index_shard6_replica_n10
instance02/data/main_index_shard7_replica_n12
instance03/data/main_index_shard8_replica_n14

This new numbering breaks my copy script, and furthermore, I'm worried as to
what happens when the numbering is different among target clusters.
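
One way a copy script could tolerate both schemes is to key on the shard name and ignore the replica suffix. A minimal sketch, assuming directory names follow the two patterns listed above; the `_t` and `_p` variants in the pattern are an assumption about how other replica types might be named.

```python
import re

# Matches e.g. main_index_shard1_replica1 and main_index_shard6_replica_n10.
CORE_DIR = re.compile(
    r"^(?P<collection>.+)_(?P<shard>shard\d+)_replica(?:_[ntp])?\d+$"
)


def shard_of(dirname):
    """Return the shard part of a core directory name, or None if the
    name does not look like a core directory."""
    match = CORE_DIR.match(dirname)
    return match.group("shard") if match else None


old_style = shard_of("main_index_shard1_replica1")
new_style = shard_of("main_index_shard6_replica_n10")
```

Source and target directories can then be paired whenever `shard_of` returns the same value for both.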

How can I switch this back to the old numbering scheme?

Side note: is there a recommended way of doing this? Is the backup/restore
mechanism suitable for this? The ref guide is kind of terse here.

Thanks in advance,

Ciao, Patrick

FW: Analytics component exception

-----Original Message-----
From: solrdj@seznam.cz [mailto:solrdj@seznam.cz]
Sent: 06 March 2018 19:01
To: solr-user@lucene.apache.org
Subject: Analytics component exception

I would like to use the Analytics component. I configured it following
https://lucene.apache.org/solr/guide/7_2/analytics.html.
When I try to send a query to Solr, an exception is thrown.

Reason: Server Error

Caused by:
java.lang.IllegalAccessError: tried to access field org.apache.solr.handler.component.ResponseBuilder._isOlapAnalytics from class org.apache.solr.handler.component.AnalyticsComponent
    at org.apache.solr.handler.component.AnalyticsComponent.prepare(AnalyticsComponent.java:46)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:269)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)

FW: Solr Cloud: query elevation + deduplication?

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
Sent: 06 March 2018 17:38
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud: query elevation + deduplication?

Hi,

I would not use ID (uniqueKey) as the signature field; query elevation would
never work properly with such a setup: change a document's content, and it
will get a new ID.

If I remember correctly, this factory still deletes duplicates if
signatureField is not uniqueKey.
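
The point about IDs can be illustrated outside Solr. The sketch below is not Solr's signature implementation; it just hashes the content fields with MD5 to show that a content-derived signature changes on every edit, while a URL used as the key stays stable for elevation. The function name `content_signature` and the field names are assumptions.

```python
import hashlib


def content_signature(doc, fields):
    """Hash the chosen fields into a hex digest (illustrative only)."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        digest.update(b"\x00")  # separator so field boundaries matter
    return digest.hexdigest()


before = {"id": "http://example.org/page", "body": "first version"}
after = {"id": "http://example.org/page", "body": "second version"}

sig_before = content_signature(before, ["body"])
sig_after = content_signature(after, ["body"])
```

Here `before["id"]` and `after["id"]` are equal while the two signatures differ, which is why an elevation list keyed on the signature would go stale after an edit.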

Regarding SOLR-3473, nobody seems to be working on that.

Regards,
Markus

-----Original message-----
> From: Ronja Koistinen <ronja.koistinen@helsinki.fi>
> Sent: Monday 5th March 2018 15:32
> To: solr-user@lucene.apache.org
> Subject: Solr Cloud: query elevation + deduplication?
>
> Hello,
>
> I am running Solr Cloud 6.6.2 and trying to get query elevation and
> deduplication (with SignatureUpdateProcessor) working at the same time.
>
> The documentation for deduplication
> (https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does
> not specify if the signatureField needs to be the uniqueKey field
> configured in my schema.xml. Currently I have my uniqueKey set to the
> field containing the url of my documents.
>
> The query elevation seems to reference documents by the uniqueKey in
> the "id" attributes listed in elevate.xml, so having the uniqueKey be
> the url would be beneficial to my process of maintaining the query
> elevation list.
>
> Also, what is the status of this issue I found?
> https://issues.apache.org/jira/browse/SOLR-3473
>
> --
> Ronja Koistinen
> University of Helsinki
>
>
