fuzzyocr for Debian
-------------------

--- config file

The main config file is installed in
  /etc/spamassassin/FuzzyOcr.cf.real

When the package is installed, there is a symlink
  FuzzyOcr.cf -> FuzzyOcr.cf.real
(so, when the package is removed but not purged, the symlink is
absent, and spamassassin does not try to initialize the plugin).

--- spamc/spamd

In the main config file, the settings for focr_logfile and focr_digest_db
do not make sense when a user is using spamc/spamd (as I do).
Both are therefore currently disabled. This way, FuzzyOcr works
out-of-the-box with spamc/spamd.

It is still possible, though, for a user to use those features;
for example, I added into /home/debdev/.spamassassin/user_prefs

  focr_verbose 2
  focr_logfile /home/debdev/var/FuzzyOcr.log
  focr_enable_image_hashing 1
  focr_digest_db /home/debdev/var/FuzzyOcr.hashdb

 -- A Mennucc1, Sun, 28 Sep 2008 09:26:50 +0200


The following is an upstream introduction to FuzzyOcr:

FuzzyOcr is a plugin for SpamAssassin which is aimed at unsolicited bulk
mail (also known as "Spam") containing images as the main content carrier.
Using different methods, it analyzes the content and properties of images
to distinguish between normal mails (Ham) and spam mails.
The methods mainly are:

 * Optical Character Recognition using different engines and settings
 * Fuzzy word matching algorithm applied to OCR results
 * Image hashing system to learn unique properties of known spam images
 * Dimension, size and integrity checking of images
 * Content-Type verification for the containing email

For a brief description of features, resource aspects and scalability,
see the detailed list below:

 * Matching and learning techniques
    o Flexible Optical Character Recognition interface
       + Official support for gocr and ocrad
       + Generic support for Tesseract and other upcoming engines
         (planned for 3.5)
    o Fuzzy word matching algorithm applied to OCR results
    o Recognition of duplicate (already processed) or similar images
      using feature vectors (Hashing)
       + Efficient MLDBM database
       + MySQL support (planned for 3.5)
    o Dimension, size and integrity checking
    o Content-Type checking of containing email
 * Resource saving techniques
    o Only scan mails which were not yet recognized as Ham or Spam by
      other SpamAssassin rules or plugins (using score thresholds)
    o Optional skip of other scanning facilities once one of them already
      reaches a given score threshold (planned for 3.5)
    o Mail skipping based on direct feature analysis (dimensions and
      file size) (planned for 3.5)
 * Safety measures
    o Configurable timeout against Denial of Service attacks against the
      third party tools
    o Context based word sets instead of simple lists to prevent false
      positives (planned for 3.5)
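
As an illustration of the "fuzzy word matching applied to OCR results"
idea listed above, here is a minimal Python sketch. It is not FuzzyOcr's
actual code (the plugin itself is a Perl module for SpamAssassin); the
function names, the sample word list and the 0.25 threshold are all
assumptions chosen for this example. The sketch counts a word as matched
when its edit distance to some same-length window of the OCR text is
small relative to the word length.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def fuzzy_hits(ocr_text: str, words: list[str],
               threshold: float = 0.25) -> list[str]:
    """Return the words that approximately occur in ocr_text.

    Hypothetical helper: 'threshold' is the maximum allowed ratio of
    edit distance to word length (an assumption, not a FuzzyOcr setting).
    """
    # Drop whitespace and punctuation, since OCR output often splits or
    # decorates words.
    text = "".join(ch for ch in ocr_text.lower() if ch.isalnum())
    hits = []
    for word in words:
        w = word.lower()
        if not w:
            continue
        n = len(w)
        # Slide a window of the word's length over the text and keep the
        # smallest distance found.
        best = min(edit_distance(w, text[i:i + n])
                   for i in range(max(1, len(text) - n + 1)))
        if best / n <= threshold:
            hits.append(word)
    return hits
```

With these assumptions, fuzzy_hits("Buy v1agra n0w!", ["viagra", "stock"])
returns ["viagra"]: the obfuscated "v1agra" is one substitution away from
the listed word, while "stock" has no close match in the text.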