Running Apache Tika in Server Mode
We use Apache Tika for plain-text extraction from PDF files. Tika does a good job, except that it takes quite long to produce results. For example, extracting the text from a 234-slide PDF presentation takes about 3.5 seconds on my laptop. This becomes a performance problem if you want to extract the text not from a single file but from, say, 12,000 files: at that rate, the whole batch would take more than eleven hours.
Here is the command I used to measure how long it takes to get the plain text of a document:
$ time java -jar tika-app-1.3.jar -h some.pdf
[...]
real 0m2.935s
user 0m4.640s
sys 0m0.178s
Tika can also be run in server mode. Here is the command to start it as a server:
$ java -jar tika-app-1.3.jar -t --server --port 12345
You can now pass your (PDF) documents to that server (e.g. with netcat) and get your results as before. As you can see, this is a lot faster, most likely because the JVM and Tika are already running and no longer have to start up for every file:
$ time nc 127.0.0.1 12345 < some.pdf
real 0m0.386s
user 0m0.003s
sys 0m0.015s
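If you would rather call the server from a script than from netcat, the same exchange can be sketched in a few lines of Python. This is only a minimal sketch under a few assumptions: the server was started as shown above on port 12345, it reads its input until the client closes the sending side of the connection (which is what nc does when its input ends), and the reply is UTF-8 text. The function name `extract_text` and its parameters are my own and not part of Tika.

```python
import socket
from pathlib import Path


def extract_text(pdf_path, host="127.0.0.1", port=12345,
                 timeout=60.0, encoding="utf-8"):
    """Send one file to a running Tika server and return its reply as text."""
    data = Path(pdf_path).read_bytes()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(data)
        # Half-close the connection so the server sees the end of the input
        # (assumption: this mirrors what nc does once stdin is exhausted).
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # the server closed the connection: reply complete
                break
            chunks.append(chunk)
    # Assumption: the reply is UTF-8; undecodable bytes are replaced.
    return b"".join(chunks).decode(encoding, errors="replace")
```

Calling `extract_text("some.pdf")` should then return the same text that the nc command prints.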
So if you have the same performance problems with Tika as we had, this might be a solution!
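For the many-files case mentioned at the beginning, a simple loop over a directory might look like the following. Again, this is just a sketch and not a tested production script: it assumes a Tika server is already listening on 127.0.0.1:12345, that it treats the client closing its sending side as the end of the document, and that processing the files one after another is good enough. The name `convert_directory` and the choice to write a `.txt` file next to each PDF are hypothetical.

```python
import socket
from pathlib import Path


def convert_directory(src_dir, host="127.0.0.1", port=12345, timeout=60.0):
    """Write a .txt file next to every .pdf in src_dir.

    Returns the list of PDF paths that could not be processed.
    """
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        try:
            # One connection per document, just like one nc call per file.
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(pdf.read_bytes())
                # Signal end of input (assumption: the server waits for this).
                sock.shutdown(socket.SHUT_WR)
                # Read until the server closes the connection.
                reply = b"".join(iter(lambda: sock.recv(65536), b""))
            # Store the raw bytes so no output encoding has to be guessed.
            pdf.with_suffix(".txt").write_bytes(reply)
        except OSError:
            # Connection refused, timeout, unreadable file, and so on.
            failed.append(pdf)
    return failed
```

Returning the failed paths instead of raising on the first error means that one broken file does not abort a run over thousands of documents; the failures can be retried afterwards.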