Corpora used for CEAS-04 Testing
For the CEAS conference, due to be held at the end of July 2004, I co-wrote
a paper with Brendon Whateley, looking at SpamBayes (mostly things that had
already been discussed on spambayes-dev).
There isn't really room in the paper to go into the details of the corpora
or the results (the paper uses the results to confirm various things,
rather than examining them in their own right). I would like to have the
raw results available, though, so that they can be checked or other facets
examined; that is the function of this page.
Five corpora are examined in the paper. The three that are my personal
(or my wife's personal) email cover January to April 2004 - I wanted to use
reasonably recent mail and to include all mail (previous testing has
excluded mailing list traffic), and I only really started keeping all mail around the
start of 2004. There are consequences of this, of course, most
particularly that SpamBayes likes a reasonably balanced corpus, and these
are not balanced (none of my personal streams are close to 1::1 ham::spam).
- My wife's email. For various reasons, she gets a lot
of spam, and not much ham at all (although much of the ham looks quite
spammy to the human eye). She's on a couple of mailing lists that churn
out about one message per day - total mail is probably about 20 messages
per day, around half spam.
- My work email. The line between my work email and personal email is
quite thin - what I actually did was separate based on the email address
the mail was delivered to. The work email has very little mailing list
traffic (just a few internal lists), and not much spam (although it's
increasing - somebody got infected with a virus and now all the addresses
are out there). A false positive in this corpus would be much worse
than one in the personal mail corpus - however, there are a couple of
messages that are 'spam with one-line comment' (support people sending a
"don't trust that last message" type of note). One of these is a false
positive with bigrams, and some are unsures either way. I don't really
care about these, even if they were false positives.
- My personal email. This includes most of my daily mailing list
traffic, and so is much higher volume, and has a low percentage of spam
because in many cases the mailing lists are filtering it out before it
gets to me. This corpus also includes a few discussions about spam that
made it onto the various lists; that really is off-topic (except for the
spambayes list, of course), and again I don't care if it gets misclassified.
- Brendon's mail. I believe this is his entire incoming mail
stream, not split up, and covering a longer period. The total number of
messages is higher than any of the other corpora, and the ham::spam
ratio is much more even, which SpamBayes would like.
- The SpamAssassin Public Corpus. This is as close as I've seen to a
standard testing
corpus, which is why it was used. The page is a little unclear, but
I believe that there are five sets (starting with '2003'), and all five
of those were used (to make one ham corpus and one spam corpus). This
is a fairly small corpus, although it is reasonably balanced. I did
not check this for any misclassifications, although I presume that they
would have been found by other users of the corpus by now. Unfortunately,
because the corpus is made up of collections of messages from here and
there and now and then, this is really ill-suited to incremental testing
(the most recent 100 or so days do look ok, but it's dubious whether the
results can be relied on). The BruceG archive might be nicer to use, but
(like many corpora) it doesn't have any associated ham, and it's risky
using ham from one source and spam from another.
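To illustrate the balance point made for several of the corpora above
(SpamBayes likes roughly even amounts of ham and spam), here is a minimal
Python sketch of merging several labelled sets into one ham corpus and one
spam corpus and then reporting the ham::spam ratio. It is an illustration
only, under assumed names: `merge_sets`, `ham_spam_ratio`, the set labels
and the message identifiers are all hypothetical, and the counts are made
up rather than taken from any of the corpora described here.

```python
def merge_sets(sets):
    """Merge (label, messages) pairs into one ham list and one spam list.

    `sets` is an iterable of (label, messages) tuples, where label is
    either "ham" or "spam". The layout is an assumption for this sketch.
    """
    merged = {"ham": [], "spam": []}
    for label, messages in sets:
        if label not in merged:
            raise ValueError("unknown label: %r" % (label,))
        merged[label].extend(messages)
    return merged


def ham_spam_ratio(corpus):
    """Return (ham_count, spam_count, ham_per_spam).

    ham_per_spam is None when there is no spam, since the ratio is
    undefined in that case.
    """
    ham = len(corpus["ham"])
    spam = len(corpus["spam"])
    ratio = ham / spam if spam else None
    return ham, spam, ratio


if __name__ == "__main__":
    # Hypothetical message identifiers, not real corpus contents.
    example_sets = [
        ("ham", ["h1", "h2", "h3"]),
        ("spam", ["s1", "s2"]),
        ("ham", ["h4"]),
    ]
    corpus = merge_sets(example_sets)
    ham, spam, ratio = ham_spam_ratio(corpus)
    print("ham=%d spam=%d ham::spam=%s" % (ham, spam, ratio))
```

A ratio close to 1 corresponds to the balanced case; what counts as
"close enough" is a judgment call, so the sketch deliberately leaves any
threshold out.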
When I find time, I'll discuss the results from the paper here in
a bit more detail, but until then, that's it - or you can
mail me or post to the SpamBayes-dev list.