Sunday, April 1, 2018

FW: Three Indexing Questions

-----Original Message-----
From: Terry Steichen [mailto:terry@net-frame.com]
Sent: 30 March 2018 03:29
To: solr-user@lucene.apache.org
Subject: Three Indexing Questions

First question: When indexing content in a directory, Solr's normal behavior
is to recursively index all the files found in that directory and its
subdirectories.  However, it turns out that when the files are of the form
*.eml (email), Solr won't do that.  I can use a wildcard to get it to index
the current directory, but it won't recurse.

I note this message that's displayed when I begin indexing: "Entering auto
mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log"

Is there a way to get it to recurse through files with other extensions,
such as .eml?  When I manually add all the subdirectory content, Solr seems
to parse the content very well, recognizing all the standard email
metadata.  I just can't get it to do the indexing recursively.

Second question: if I want to index files from many different source
directories, is there a way to specify these different sources in one
command? (Right now I have to issue a separate indexing command for each
directory - which means I have to sit around and wait till each is
finished.)

Third question: I have a very large directory structure that includes a
couple of subdirectories I'd like to exclude from indexing.  Is there a way
to index recursively, but exclude specified directories?
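
To make the three requirements concrete, here is a minimal sketch of the
file selection being asked for: walk several source directories in one
pass, recurse into subdirectories, keep only files with chosen extensions
(such as .eml), and skip specified directories.  It is not part of Solr;
the function name collect_files and its parameters are hypothetical, and it
only builds the list of paths that would then be handed to an indexing
command.

```python
import os
from typing import Iterable, Iterator


def collect_files(roots: Iterable[str],
                  extensions: Iterable[str],
                  excluded_dirs: Iterable[str]) -> Iterator[str]:
    """Yield files under any of `roots` whose extension is in `extensions`,
    never descending into a directory listed in `excluded_dirs`.

    Hypothetical helper for illustration only; not a Solr API.
    """
    # Normalize extensions to the ".eml" form, case-insensitively.
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    # Compare excluded directories by absolute, normalized path.
    skip = {os.path.normpath(os.path.abspath(d)) for d in excluded_dirs}

    for root in roots:
        if os.path.normpath(os.path.abspath(root)) in skip:
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune in place so os.walk does not enter excluded directories.
            dirnames[:] = sorted(
                d for d in dirnames
                if os.path.normpath(
                    os.path.abspath(os.path.join(dirpath, d))) not in skip
            )
            for name in sorted(filenames):
                if os.path.splitext(name)[1].lower() in wanted:
                    yield os.path.join(dirpath, name)
```

Calling collect_files with several roots, ["eml"], and the directories to
skip yields one combined list of paths, so a single run can cover all the
sources instead of one command per directory.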
