Running Apache Tika in Server Mode

We are using Apache Tika for plain-text extraction from PDF files. Tika does a good job here, except that it takes quite long to get results. As an example, extracting the text from a 234-slide PDF presentation takes about 3.5 seconds on my laptop. This becomes a performance problem if you want to extract the text not just from a single file but from, say, 12,000 files: at that rate, the whole batch would take more than 11 hours.

Here is the command I used to measure how long it takes to extract the content of a single document (note that -h requests HTML output; -t, as used for the server further down, requests plain text):

$ time java -jar tika-app-1.3.jar -h some.pdf
[...]
real	0m2.935s
user	0m4.640s
sys	0m0.178s

Tika can also be run in server mode, where a single long-running process listens on a port and handles one document per connection. Here is the command to start Tika as a server:

$ java -jar tika-app-1.3.jar -t --server --port 12345

You can now send your (PDF) documents to that server (e.g. with netcat) and get the extracted text back. As you can see, things become a lot faster, most likely because the JVM and Tika are started only once instead of once per file:

$ time nc 127.0.0.1 12345 < some.pdf
real	0m0.386s
user	0m0.003s
sys	0m0.015s
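
For a whole directory of files, the same nc call can simply be wrapped in a loop. The following is only a minimal sketch, assuming the server from above is already running; the names TIKA_HOST, TIKA_PORT and extract_dir are made up for this example and are not Tika options:

```shell
#!/usr/bin/env bash
# Sketch: send every PDF in a directory to an already running Tika server
# and store the returned text next to the PDF as <name>.txt.
# TIKA_HOST, TIKA_PORT and extract_dir are illustrative names only.

TIKA_HOST="${TIKA_HOST:-127.0.0.1}"
TIKA_PORT="${TIKA_PORT:-12345}"

extract_dir() {
    local dir="$1" pdf
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory contains no PDFs.
        [ -e "$pdf" ] || continue
        # One connection per document, exactly like the single-file call.
        if nc "$TIKA_HOST" "$TIKA_PORT" < "$pdf" > "${pdf%.pdf}.txt"; then
            echo "ok: $pdf"
        else
            echo "failed: $pdf" >&2
        fi
    done
}
```

After loading these definitions into your shell, a call like extract_dir /path/to/pdfs processes the files one after another. Whether several connections in parallel would help further is something I have not measured.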

So if you have the same performance problems with Tika as we had, this might be a solution!


 
Content © Michael Knoll 2009-2017