Sunday, April 1, 2018

FW: Three Indexing Questions

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org]
Sent: 30 March 2018 05:56
To: solr-user@lucene.apache.org
Subject: Re: Three Indexing Questions

On 3/29/2018 3:59 PM, Terry Steichen wrote:
> First question: When indexing content in a directory, Solr's normal
> behavior is to recursively index all the files found in that directory
> and its subdirectories.  However, it turns out that when the files are
> of the form *.eml (email), Solr won't do that.  I can use a wildcard to
> get it to index the current directory, but it won't recurse.

At first I had no idea what program you were using.  I may have figured it
out; see below.

> I note this message that's displayed when I begin indexing: "Entering
> auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,
> ots,rtf,htm,html,txt,log

That looks like the simple post tool included with Solr.  If it is, type
"bin/post -help" and you will see that there is a -filetypes option that
lets you change the list of extensions that are considered valid.

Note that the post tool included with Solr is a SIMPLE post tool.  It's
designed as a way to get your feet wet, not for heavy production usage. It
does not have extensive capability.  We strongly recommend that you graduate
to a better indexing program.  Usually that means that you're going to have
to write one yourself, to be sure that it does everything YOU want it to
do.  The one included with Solr probably can't do some of the things that
you want it to do.
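
As a rough illustration of the first piece such a custom indexer needs, here
is a minimal Python sketch that walks a directory tree and picks up files by
extension, including *.eml.  The function name find_files and its parameters
are hypothetical, chosen only for this example; they are not part of Solr or
of the post tool.

```python
import os


def find_files(root, extensions=("eml",)):
    """Yield paths under root, recursively, whose extension is wanted.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    # Normalize "eml", ".eml" and "EML" to the same form: ".eml"
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in wanted:
                yield os.path.join(dirpath, name)
```

Because os.walk descends into every subdirectory, the recursion does not
depend on which extensions are requested.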

Also, indexing files using the post tool is going to run Tika extraction
inside Solr.  Tika is a separate Apache project.  Solr happens to include a
subset of Tika's capability that can run inside Solr.  Tika is known to
sometimes behave explosively when it processes documents.  If an
explosion happens in Tika and it's running inside Solr, then Solr itself
might crash.  Running Tika outside Solr, usually in a program that you write
yourself, is highly recommended.  Doing this will also give you access to
the full range of Tika's capabilities.
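
The idea of doing extraction outside Solr can be sketched as follows.  This
is only an illustration under several assumptions: it uses Python's standard
email parser in place of Tika (so it handles *.eml files only), the field
names id, subject, sender and body are hypothetical and would have to exist
in your schema, and the Solr base URL and collection name are placeholders.
Documents are sent to Solr's JSON update handler; a file that fails to parse
is recorded and skipped instead of taking anything else down with it.

```python
import json
import urllib.request
from email import policy
from email.parser import BytesParser


def eml_to_doc(doc_id, raw_bytes):
    """Parse one raw email into a flat dict (field names are assumptions)."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "id": doc_id,
        "subject": str(msg["subject"] or ""),
        "sender": str(msg["from"] or ""),
        "body": body.get_content() if body is not None else "",
    }


def build_update_request(solr_url, collection, docs):
    """Build a POST to the JSON update handler; solr_url is a placeholder
    such as http://localhost:8983/solr."""
    url = f"{solr_url}/{collection}/update?commit=true"
    data = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def index_files(paths, solr_url, collection):
    """Extract every file locally, then send the good documents in one batch."""
    docs, failed = [], []
    for path in paths:
        try:
            with open(path, "rb") as fh:
                docs.append(eml_to_doc(path, fh.read()))
        except Exception as exc:  # a bad file is logged here, not inside Solr
            failed.append((path, exc))
    if docs:
        request = build_update_request(solr_url, collection, docs)
        with urllib.request.urlopen(request) as response:
            response.read()
    return len(docs), failed
```

A real indexer would batch large document sets and commit less often, but
the separation is the point: parsing failures stay in the client program.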

Here's an example of a program that uses both JDBC and Tika to index to
Solr:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

If you search Google for "tika index solr" (without the quotes), you'll find
some other examples of custom programs that use Tika to index to Solr.
There may be better searches you can do on Google as well.

Thanks,
Shawn
