How to parse and reindex all archived mail

Article Details
URL: http://www.mailspect.com/esupport/index.php?_m=knowledgebase&_a=viewarticle&kbarticleid=54
Article ID: 54
Created On: 12 May 2009 03:28 PM

Answer
In May 2009 we improved our email indexing and added support for some MS Office documents. This article describes how to install the latest indexing script and how to add support for doc and docx files. Our indexing scheme has two components: a parsing script (fetchdata.pl) and an indexer (Sphinx). It is assumed that Sphinx is installed and running.
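
Because the rest of this article assumes a working Sphinx installation, a quick pre-flight check along the following lines may help. This is only a sketch: the function name check_sphinx is made up for illustration, the /usr/local/sphinx prefix is taken from the commands used later in this article, and the check only confirms that the files exist, not that searchd is actually running.

```shell
#!/bin/sh
# Hypothetical pre-flight check: confirm the Sphinx files this article relies on exist.
# check_sphinx is an illustrative name, not part of MPP or Sphinx.
check_sphinx() {
    prefix=${1:-/usr/local/sphinx}
    missing=0
    # Both binaries are used in the rebuild procedure.
    for bin in indexer searchd; do
        if [ ! -x "$prefix/bin/$bin" ]; then
            echo "missing executable: $prefix/bin/$bin"
            missing=1
        fi
    done
    # The same config file is passed to indexer and searchd.
    if [ ! -r "$prefix/etc/sphinx.conf" ]; then
        echo "missing config: $prefix/etc/sphinx.conf"
        missing=1
    fi
    return "$missing"
}

if check_sphinx /usr/local/sphinx; then
    echo "Sphinx files found"
else
    echo "Sphinx does not appear to be installed under /usr/local/sphinx"
fi
```

If your Sphinx installation lives elsewhere, pass that prefix as the first argument instead.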

You will need the latest version of fetchdata.pl; step 5 of the rebuild procedure below gives the download location.

Adding support for .doc and .docx files.

1) MPP will use antiword to process DOC documents (Word 97-2003) if the "antiword" application exists in PATH.
To install Antiword:
wget -c http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
tar xzvf antiword-0.37.tar.gz
cd antiword-0.37
sudo make -f Makefile.Linux
sudo make -f Makefile.Linux global_install


2) MPP will use docx2txt to process DOCX documents (Word 2007) if the "docx2txt" application exists in PATH.
To install Docx2txt:
wget -c http://garr.dl.sourceforge.net/sourceforge/docx2txt/docx2txt-0.3.tgz
tar xzvf docx2txt-0.3.tgz
cd docx2txt-0.3
sudo make install

3) MPP will use pdftotext to process PDF documents if the "pdftotext" application exists in PATH.
Pdftotext is part of poppler (http://poppler.freedesktop.org/); install the appropriate binaries for your OS.

4) OpenOffice documents can be processed using the OpenOffice::OODoc Perl module.
On RHEL/Fedora/CentOS systems where the RPMForge repository is in use, it can be installed with: yum install perl-OODoc

Installing from CPAN is also possible:
perl -MCPAN -e shell
install OpenOffice::OODoc

Note: OS X users should use /usr/local/mppbase/bin/perl -MCPAN -e shell
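
Since each converter is only used when it can be found, it may be worth confirming what is actually visible before re-indexing. The following is a minimal sketch, assuming the parser looks up the three binaries by exactly these names in PATH and that the Perl module is spelled OpenOffice::OODoc as on CPAN; have_tool is an illustrative helper name, not part of MPP.

```shell
#!/bin/sh
# List which optional converters fetchdata.pl could pick up from PATH.
# have_tool is an illustrative helper name, not part of MPP.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

for tool in antiword docx2txt pdftotext; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found in PATH"
    fi
done

# OpenOffice support depends on a Perl module rather than a binary.
# On OS X, substitute /usr/local/mppbase/bin/perl for perl (see the note above).
if have_tool perl && perl -MOpenOffice::OODoc -e 1 >/dev/null 2>&1; then
    echo "OpenOffice::OODoc: installed"
else
    echo "OpenOffice::OODoc: not installed (or perl not in PATH)"
fi
```

Run it as the same user that runs the fetchdata.pl cron job, because cron often uses a shorter PATH than an interactive shell.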

How to rebuild your email archive index:

Warning: Re-indexing can take many hours for a large database. Full-text search will not be available during this period, but other services will not be affected. The process is CPU-intensive.

1) Stop the Sphinx searchd daemon:
killall searchd

2) Remove the existing index files:
rm -f /usr/local/sphinx/var/data/mpp*

3) Delete all data from the content_index and content_counter tables of the MPP Archive DB:
mysql -uroot -p
use mppdb;
truncate content_counter;
truncate content_index;

4) Temporarily disable the cron jobs for fetchdata and the indexer.
Use crontab -e and comment out the following lines:
#5 * * * * /usr/local/MPP/scripts/fetchdata.pl >/dev/null 2>&1 </dev/null
#45 * * * * /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf mppdeltaindex --rotate >/dev/null 2>&1 </dev/null

5) Download the latest fetchdata.pl from ftp://ftp.messagepartners.com/pub/mpp4/scripts/fetchdata.pl and install it as /usr/local/MPP/scripts/fetchdata.pl:
cd /usr/local/MPP/scripts/
mv fetchdata.pl fetchdata.pl.old
wget -c ftp://ftp.messagepartners.com/pub/mpp4/scripts/fetchdata.pl
chmod 755 fetchdata.pl

6) Edit the MySQL credentials in fetchdata.pl to match your DB setup, and set the $metadata variable to 0 or 1 depending on your setup. Set $metadata = 1 if you are using MySQL only for metadata.

7) Run the fetchdata.pl parser (this can take some time if there are many messages in the DB):
perl /usr/local/MPP/scripts/fetchdata.pl

8) Index the parsed data:
/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all

9) Start the Sphinx searchd daemon:
/usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf

10) Re-enable the cron jobs for parsing and indexing.
Use crontab -e and uncomment the following lines:
5 * * * * /usr/local/MPP/scripts/fetchdata.pl >/dev/null 2>&1 </dev/null
45 * * * * /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf mppdeltaindex --rotate >/dev/null 2>&1 </dev/null
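
The cron edits in steps 4 and 10 can also be done without an interactive editor. The following is a rough sketch rather than a supported MPP tool: the two function names are made up for illustration, and the filters simply match any cron entry that mentions fetchdata.pl or sphinx/bin/indexer. The demonstration at the end only prints sample text and does not touch the real crontab.

```shell
#!/bin/sh
# Filters that comment out / restore the two MPP cron entries.
# Both read a crontab on stdin and write the edited crontab to stdout.
# The function names are illustrative; they are not part of MPP.
disable_mpp_jobs() {
    # Leave existing comments alone, prefix matching entries with "#".
    awk '/^[ \t]*#/ { print; next }
         /fetchdata\.pl|sphinx\/bin\/indexer/ { print "#" $0; next }
         { print }'
}

enable_mpp_jobs() {
    # Only uncomment lines that look like schedule entries (digit or * after "#").
    awk '/^[ \t]*#[ \t]*[0-9*].*(fetchdata\.pl|sphinx\/bin\/indexer)/ { sub(/^[ \t]*#[ \t]*/, "") }
         { print }'
}

# Demonstration on sample text; nothing is installed into cron here.
sample='0 3 * * * /usr/bin/true
5 * * * * /usr/local/MPP/scripts/fetchdata.pl >/dev/null 2>&1 </dev/null'
printf '%s\n' "$sample" | disable_mpp_jobs
printf '%s\n' "$sample" | disable_mpp_jobs | enable_mpp_jobs
```

On systems where crontab accepts "-" to read from standard input, step 4 would then be "crontab -l | disable_mpp_jobs | crontab -" and step 10 "crontab -l | enable_mpp_jobs | crontab -". Check the result with crontab -l afterwards.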