Corpora used for CEAS-04 Testing

For the CEAS conference due to be held at the end of July 2004, I co-wrote a paper with Brendon Whateley, looking at SpamBayes (mostly things that had already been discussed on spambayes-dev). There isn't really room in the paper to go into the details of the corpora or the results (the paper uses the results to confirm various things rather than examining the results themselves). I would like to have the raw results available, though, so that they can be checked, or other facets can be examined, and that is the function of this page.

Five corpora are examined in the paper. The three that are my (or my wife's) personal email cover January to April 2004 - I wanted to use reasonably recent mail, and include all mail (previous testing has excluded mailing list traffic), and I only really started keeping all mail around the start of 2004. There are consequences of this, of course, most particularly that SpamBayes likes a reasonably balanced corpus, and these are not balanced (none of my personal streams are close to 1::1 ham::spam).

  1. My wife's email. For various reasons, she gets a lot of spam, and not much ham at all (although much of the ham looks (to the human eye) quite spammy). She's on a couple of mailing lists that churn out about one message per day - total mail is probably about 20 messages per day, around half spam.
  2. My work email. The line between my work email and personal email is quite thin - what I actually did was separate based on the email address the mail was delivered to. The work email has very little mailing list traffic (just a few internal lists), and not much spam (although it's increasing - somebody got infected with a virus and now all the addresses are out there). A false positive in this corpus would be much worse than one in the personal mail corpus - however, there are a couple of messages that are 'spam with one-line comment' (support people sending "don't trust that last message" type thing). One of these is a false positive with bigrams, and some are unsures either way - I don't really care about these - even if they were false positives.
  3. My personal email. This includes most of my daily mailing list traffic, and so is much higher volume, and has a low percentage of spam because in many cases the mailing lists are filtering it out before it gets to me. There are a few discussions about spam that makes it to the various lists in here, too - again, that really is off-topic (except for the spambayes list, of course), and I don't care if it gets misclassified.
  4. Brendon's mail. I believe this is his entire incoming mail stream, not split up, and covering a longer period. The total number of messages is higher than any of the other corpora, and the ham::spam ratio is much more even, which SpamBayes would like.
  5. The SpamAssassin Public Corpus. This is as close as I've seen to a standard testing corpus, which is why it was used. The page is a little unclear, but I believe that there are five sets (starting with '2003'), and all five of those were used (to make one ham corpus and one spam corpus). This is a fairly small corpus, although it is reasonably balanced. I did not check this for any misclassifications, although I presume that they would have been found by other users of the corpus by now. Unfortunately, because the corpus is made up of collections of messages from here and there and now and then, it is really ill-suited to incremental testing (the most recent 100 or so days do look ok, but it's dubious as to whether the results can be relied on). The BruceG archive might be nicer to use, but (like many corpora) it doesn't have any associated ham, and it's risky using ham from one source and spam from another.
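
As a rough illustration of how the work and personal corpora above could be separated by delivery address, and how the balance of a corpus might be checked, here is a minimal Python sketch. It is not the script that was actually used: the function names are hypothetical, and it assumes the delivery address can be read from the To and Cc headers, which the description above does not specify.

```python
from email import message_from_string
from email.utils import getaddresses


def split_by_recipient(raw_messages, work_addresses):
    """Partition raw message strings into (work, personal) lists.

    A message counts as 'work' if any To/Cc address matches one of
    work_addresses (case-insensitive). Using To/Cc is an assumption;
    a Delivered-To or envelope header may be more reliable in practice.
    """
    work_addresses = {a.lower() for a in work_addresses}
    work, personal = [], []
    for raw in raw_messages:
        msg = message_from_string(raw)
        # Gather every recipient header value, then parse out the addresses.
        headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = {addr.lower() for _, addr in getaddresses(headers)}
        if recipients & work_addresses:
            work.append(raw)
        else:
            personal.append(raw)
    return work, personal


def ham_spam_ratio(n_ham, n_spam):
    """Return the number of ham messages per spam message."""
    if n_spam == 0:
        raise ValueError("no spam messages; ratio is undefined")
    return n_ham / n_spam
```

A result near 1.0 from ham_spam_ratio would indicate the kind of balanced corpus that SpamBayes prefers; values far from 1.0 correspond to the unbalanced personal streams described above.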

When I find time, I'll discuss the results from the paper here in a bit more detail, but until then, that's it - or you can mail me or post to the SpamBayes-dev list.

This work is licensed under a Creative Commons License.
Created by Tony Meyer, 16th April 2004.
Last modified 21st April 2004.