Tuesday, March 6, 2018

FW: Solr dih extract text from inline images in pdf

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 06 March 2018 20:52
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Solr dih extract text from inline images in pdf

It's often much easier to approach this by running Tika separately.
Here's a blog on both the reasoning and sample code:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

Among other things, you have a lot more control over how Tika operates.
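
For illustration, here is a minimal sketch of what setting that option in code might look like when Tika runs on its own. It assumes Tika 1.x (tika-core and tika-parsers) on the classpath; the class name PdfInlineImageExtractor and the helper methods are made up for this example. Recognizing text in the extracted images also depends on an OCR engine such as Tesseract being installed, which this sketch does not check.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch, not the code from the linked blog post.
public class PdfInlineImageExtractor {

    // Builds a ParseContext that mirrors the "extractInlineImages" entry
    // the original poster put in context.xml for /update/extract.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // Register the parser itself so embedded images are parsed recursively.
        context.set(Parser.class, parser);
        return context;
    }

    // Returns the plain text Tika extracts from one file.
    static String extractText(Path file) throws Exception {
        Parser parser = new AutoDetectParser();
        // -1 disables the default limit on the number of characters collected.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata, buildContext(parser));
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfInlineImageExtractor /path/to/file.pdf
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

The returned string could then be put into a document field and sent to Solr from the same client program, which is the approach the blog describes.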

Best,
Erick

On Tue, Mar 6, 2018 at 12:36 AM, lala <labishahla@gmail.com> wrote:
> Hi,
>
> I am working with Solr 7, indexing multilingual files in a folder
> using DIH (FileListEntityProcessor for the base entity and
> TikaEntityProcessor for the child entity in the configuration file).
>
> My problem lies here: I want to extract text from images inside PDF
> files. That works fine with the /update/extract request handler, where
> I set the "parseContext.config" attribute to an XML file, let's say
> "context.xml", in which I set the property "extractInlineImages" for
> the entry [PDFParserConfig] to true. But I have no idea how to set
> parseContext.config in the DIH configuration.
>
> I tried these approaches, none of them worked:
>
> - set the tikaConfig attribute in the DIH config file to my
> "context.xml"; obviously this won't work, since the Tika config
> format is different.
> - set the parseContext.config attribute on my "/dataImport"
> requestHandler; this didn't work either.
>
> I googled a lot with no result... I would really appreciate any help
> here!
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
