Luka on Fri, 14 Feb 2003 02:42:57 +0100 (CET)



<nettime> a plan for spam


hi

this text was written by paul graham, lisp guru. it proposes filtering 
spam with the help of bayes' rule. it says spam should be recognized in 
the user's mail reader, by keeping a corpus of each user's 'personal' 
spam and non-spam messages. new messages are compared to the statistics 
of previous mails and automatically classified as spam or non-spam.

after being featured on slashdot, it generated an enormous response and 
many software implementations have since been developed based on these 
ideas. if you care, read more at http://www.paulgraham.com/antispam.html

LF

*************************************************

A Plan for Spam

by Paul Graham

August 2002


I think it's possible to stop spam, and that  content-based filters are 
the way to do it. The Achilles heel of the spammers is their message. 
They can circumvent any other barrier you set up.  They have so far, at 
least.  But they have to deliver their message, whatever it is.  If we 
can write software that recognizes their messages, there is no way they 
can get around that.

_ _ _

To the recipient, spam is easily recognizable.  If you hired  someone 
to read your mail and discard the spam, they would have little trouble 
doing it.  How much do we have to do, short of AI, to automate this 
process?

I think we will be able to solve the problem with fairly simple 
algorithms.  In fact, I've found that you can filter present-day spam 
acceptably well using nothing more than a Bayesian combination of the 
spam probabilities of individual words.  Using a slightly tweaked (as 
described below) Bayesian filter, we now miss less than 5 per 1000 
spams, with 0 false positives.

The statistical approach is not usually the first one people try when 
they write spam filters.  Most hackers' first instinct is to try to 
write software that recognizes individual properties of spam.  You look 
at spams and you think, the gall of these guys to try sending me mail  
that begins "Dear Friend" or has a subject line that's all uppercase 
and ends in eight exclamation points.  I can filter out that stuff with 
about one line of code.

And so you do, and in the beginning it works.  A few simple rules will 
take a big bite out of your incoming spam.  Merely looking for the word 
"click" will catch 79.7% of the emails in my spam corpus, with only 
1.2% false positives.

I spent about six months writing software that looked for individual 
spam features before I tried the statistical approach.  What I found 
was that recognizing that last few percent of spams got very hard, and 
that as I made the filters stricter I got more false positives.

False positives are innocent emails that get mistakenly identified as 
spams. For most users, missing legitimate email is an order of 
magnitude worse than receiving spam, so a filter that yields false 
positives is like an acne cure that carries a risk of death to the 
patient.

The more spam a user gets, the less likely he'll be to notice one 
innocent mail sitting in his spam folder.  And strangely enough, the 
better your spam filters get, the more dangerous false positives 
become, because when the filters are really good, users will be more 
likely to ignore everything they catch.

I don't know why I avoided trying the statistical approach for so long. 
  I think it was because I got addicted to trying to identify spam 
features myself, as if I were playing some kind of competitive game 
with the spammers.  (Nonhackers don't often realize this, but most 
hackers are very competitive.) When I did try statistical analysis, I 
found immediately that it was much cleverer than I had been. It 
discovered, of course, that terms like "virtumundo" and "teens" were 
good indicators of spam.  But it also discovered that "per" and "FL" 
and "ff0000" are good indicators of spam.  In fact, "ff0000" (html for 
bright red) turns out to be as good an indicator of spam as any 
pornographic term.

_ _ _

Here's a sketch of how I do statistical filtering.  I start with one 
corpus of spam and one of nonspam mail.  At the moment each one has 
about 4000 messages in it.  I scan the entire text, including headers 
and embedded html and javascript, of each message in each corpus. I 
currently consider alphanumeric characters, dashes, apostrophes, and 
dollar signs to be part of tokens, and everything else to be a token 
separator.  (There is probably room for improvement here.)  I ignore 
tokens that are all digits, and I also ignore html comments, not even 
considering them as token separators.

I count the number of times each token (ignoring case, currently) 
occurs in each corpus.  At this stage I end up with two large hash 
tables, one for each corpus, mapping tokens to number of occurrences.
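
The tokenizing and counting steps described above might be sketched in 
Common Lisp roughly as follows.  This is an illustration, not the code 
the article describes: the names token-char-p, tokenize and 
count-tokens are made up for the example, html comments are not 
stripped, and each message is assumed to arrive as a single string.

```lisp
;; Sketch under stated assumptions; all function names are hypothetical.

(defun token-char-p (ch)
  "True for characters treated as part of a token: alphanumerics,
dashes, apostrophes and dollar signs."
  (or (alphanumericp ch) (member ch '(#\- #\' #\$))))

(defun all-digits-p (token)
  "True if every character of TOKEN is a digit."
  (every #'digit-char-p token))

(defun tokenize (text)
  "Split TEXT into lowercase tokens, dropping tokens that are all digits."
  (let ((tokens '())
        (start nil))
    (flet ((finish (end)
             ;; Close the token that began at START, if there is one.
             (when start
               (let ((token (string-downcase (subseq text start end))))
                 (unless (all-digits-p token)
                   (push token tokens)))
               (setf start nil))))
      (dotimes (i (length text))
        (if (token-char-p (char text i))
            (unless start (setf start i))
            (finish i)))
      (finish (length text)))
    (nreverse tokens)))

(defun count-tokens (messages)
  "Return a hash table mapping each token to its number of occurrences
in MESSAGES, a list of strings treated as one long stream of text."
  (let ((counts (make-hash-table :test #'equal)))
    (dolist (message messages counts)
      (dolist (token (tokenize message))
        (incf (gethash token counts 0))))))
```

Calling count-tokens once on the spam messages and once on the nonspam 
messages would give the two tables the text calls bad and good.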

Next I create a third hash table, this time mapping each token to the 
probability that an email containing it is a spam, which I calculate as 
follows [1]:


(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and 
bad are the hash tables I created in the first step, and ngood and nbad 
are the number of nonspam and spam messages respectively.

I explained this as code to show a couple of important details. I want 
to bias the probabilities slightly to avoid false positives, and by 
trial and error I've found that a good way to do it is to double all 
the numbers in good. This helps to distinguish between words that 
occasionally do occur in legitimate email and words that almost never 
do.  I only consider words that occur at least five times in total 
(actually, because of the doubling, occurring three times in nonspam 
mail would be enough).  And then there is the question of what 
probability to assign to words that occur in one corpus but not the 
other.  Again by trial and error I chose .01 and .99.  There may be 
room for tuning here, but as the corpus grows such tuning will happen 
automatically anyway.

The especially observant will notice that while I consider each corpus 
to be a single long stream of text for purposes of counting 
occurrences, I use the number of emails in each, rather than their 
combined length, as the divisor in calculating spam probabilities.  
This adds another slight bias to protect against false positives.
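
To see the effect of the doubling and the clamping, here is the same 
calculation wrapped in a function and applied to a few made-up counts.  
The name token-probability and the counts are hypothetical; the corpus 
size of 4000 is the figure mentioned earlier.

```lisp
;; Sketch only: TOKEN-PROBABILITY is a hypothetical wrapper around the
;; formula in the text, taking raw counts instead of hash tables.

(defun token-probability (good-count bad-count ngood nbad)
  "Spam probability for a token seen GOOD-COUNT times in the nonspam
corpus and BAD-COUNT times in the spam corpus, or NIL if it is too rare.
NGOOD and NBAD are the numbers of messages and must be positive."
  (let ((g (* 2 good-count))            ; nonspam counts are doubled
        (b bad-count))
    (unless (< (+ g b) 5)
      (let ((good-freq (min 1 (/ g ngood)))
            (bad-freq (min 1 (/ b nbad))))
        (max .01 (min .99 (float (/ bad-freq (+ good-freq bad-freq)))))))))

;; With 4000 messages in each corpus:
;; (token-probability 10 30 4000 4000) is about .6
;;     (30 / (20 + 30); without the doubling it would be 30 / 40 = .75)
;; (token-probability 0 20 4000 4000) is .99, clamped from 1
;; (token-probability 3 0 4000 4000)  is .01, clamped from 0
;; (token-probability 1 2 4000 4000)  is NIL, since 2 + 2 < 5
```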

When new mail arrives, it is scanned into tokens, and the most 
interesting fifteen tokens, where interesting is measured by how far 
their spam probability is from a neutral .5, are used to calculate the 
probability that the mail is spam.  If probs is a list of the fifteen 
individual probabilities, you calculate the combined probability thus:


(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                     probs)))))

One question that arises in practice is what probability to assign to a 
word you've never seen, i.e. one that doesn't occur in the hash table 
of word probabilities.  I've found, again by trial and error, that .4 
is a good number to use.  If you've never seen a word before, it is 
probably fairly innocent; spam words tend to be all too familiar.
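
Putting these pieces together, scoring a message might look something 
like the sketch below.  The function names and the *unknown-probability* 
variable are invented for the example, the table is assumed to be an 
EQUAL hash table from token strings to probabilities, and counting each 
distinct token only once is an assumption the text does not spell out.

```lisp
;; Sketch under stated assumptions; all names are hypothetical.

(defparameter *unknown-probability* .4
  "Probability used for a token that is not in the table.")

(defun interestingness (p)
  "Distance of probability P from a neutral .5."
  (abs (- p .5)))

(defun most-interesting (tokens table &optional (n 15))
  "Return the probabilities of the N distinct TOKENS farthest from .5."
  (let* ((unique (remove-duplicates tokens :test #'string=))
         (probs (mapcar (lambda (token)
                          (gethash token table *unknown-probability*))
                        unique))
         (sorted (sort probs #'> :key #'interestingness)))
    (subseq sorted 0 (min n (length sorted)))))

(defun combine (probs)
  "Combined spam probability, using the formula given in the text."
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar (lambda (x) (- 1 x)) probs))))))

(defun spam-probability (tokens table)
  "Spam probability of a message given as a list of TOKENS."
  (combine (most-interesting tokens table)))

;; Two tokens rated .97 and .99 combine to about .9997:
;; (combine '(.97 .99))
```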

There are examples of this algorithm being applied to actual emails in 
an appendix at the end.

I treat mail as spam if the algorithm above gives it a probability of 
more than .9 of being spam.  But in practice it would not matter much 
where I put this threshold, because few probabilities end up in the 
middle of the range.

_ _ _

One great advantage of the statistical approach is that you don't have 
to read so many spams.  Over the past six months, I've read literally 
thousands of spams, and it is really kind of demoralizing.  Norbert 
Wiener said if you compete with slaves you become a slave, and there is 
something similarly degrading about competing with spammers.   To 
recognize individual spam features you have to try to get into the mind 
of the spammer, and frankly I want to spend as little time inside the 
minds of spammers as possible.

But the real advantage of the Bayesian approach, of course, is that you 
know what you're measuring.  Feature-recognizing filters like 
SpamAssassin assign a spam "score" to email.  The Bayesian approach 
assigns an actual probability.  The problem with a "score" is that no 
one knows what it means.  The user doesn't know what it means, but 
worse still, neither does the developer of the filter.  How many points 
should an email get for having the word "sex" in it?  A probability can 
of course be mistaken, but there is little ambiguity about what it 
means, or how evidence should be combined to calculate it.  Based on my 
corpus, "sex" indicates a .97 probability of the containing email being 
a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, 
equally unambiguous, says that an email containing both words would, in 
the (unlikely) absence of any other evidence, have a 99.97% chance of 
being a spam.

Because it is measuring probabilities, the Bayesian approach considers 
all the evidence in the email, both good and bad. Words that occur 
disproportionately rarely in spam (like "though" or "tonight" or 
"apparently") contribute as much to decreasing the probability as bad 
words like "unsubscribe" and "opt-in" do to increasing it.  So an 
otherwise innocent email that happens to include the word "sex" is not 
going to get tagged as spam.

Ideally, of course, the probabilities should be calculated individually 
for each user.  I get a lot of email containing the word "Lisp", and 
(so far) no spam that does.  So a word like that is effectively a kind 
of password for sending mail to me.  In my earlier spam-filtering 
software, the user could set up a list of such words and mail 
containing them would automatically get past the filters.  On my list I 
put words like "Lisp" and also my zipcode, so that (otherwise rather 
spammy-sounding) receipts from online orders would get through.  I 
thought I was being very clever, but I found that the Bayesian filter 
did the same thing for me, and moreover discovered a lot of words I 
hadn't thought of.

When I said at the start that our filters let through less than 5 spams 
per 1000 with 0 false positives, I'm talking about filtering my mail 
based on a corpus of my mail.  But these numbers are not misleading, 
because that is the approach I'm advocating: filter each user's mail 
based on the spam and nonspam mail he receives.  Essentially, each user 
should have two delete buttons, ordinary delete and delete-as-spam. 
Anything deleted as spam goes into the spam corpus, and everything 
else goes into the nonspam corpus.

You could start users with a seed filter, but ultimately each user 
should have his own per-word probabilities based on the actual mail he 
receives.  This (a) makes the filters more effective, (b) lets each 
user decide their own precise definition of spam, and (c) perhaps best 
of all makes it hard for spammers to tune mails to get through the 
filters.  If a lot of the brain of the filter is in the individual 
databases, then merely tuning spams to get through the seed filters 
won't guarantee anything about how well they'll get through individual 
users' varying and much more trained filters.

Content-based spam filtering is often combined with a whitelist, a list 
of senders whose mail can be accepted with no filtering. One easy way 
to build such a whitelist is to keep a list of every address the user 
has ever sent mail to.  If a mail reader has a delete-as-spam button 
then you could also add the from address of every email the user has 
deleted as ordinary trash.
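
A whitelist of this kind needs very little machinery.  The sketch below 
assumes addresses are plain strings and uses invented names 
(make-whitelist, note-address, whitelisted-p); how addresses are pulled 
out of sent or deleted mail is left out.

```lisp
;; Sketch only; the function names are hypothetical.

(defun make-whitelist ()
  "An empty whitelist.  EQUALP makes address lookup case-insensitive."
  (make-hash-table :test #'equalp))

(defun note-address (whitelist address)
  "Record ADDRESS, e.g. a recipient of sent mail or the sender of a
message the user deleted as ordinary trash."
  (setf (gethash address whitelist) t))

(defun whitelisted-p (whitelist address)
  "True if mail from ADDRESS can skip content filtering."
  (values (gethash address whitelist)))
```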

I'm an advocate of whitelists, but more as a way to save computation 
than as a way to improve filtering.  I used to think that whitelists 
would make filtering easier, because you'd only have to filter email 
from people you'd never heard from, and someone sending you mail for 
the first time is constrained by convention in what they can say to 
you. Someone you already know might send you an email talking about 
sex, but someone sending you mail for the first time would not be 
likely to.  The problem is, people can have more than one email 
address, so a new from-address doesn't guarantee that the sender is 
writing to you for the first time. It is not unusual for an old friend 
(especially if he is a hacker) to suddenly send you an email with a new 
from-address, so you can't risk false positives by filtering mail from 
unknown addresses especially stringently.

In a sense, though, my filters do themselves embody a kind of whitelist 
(and blacklist) because they are based on entire messages, including 
the headers.  So to that extent they "know" the email addresses of 
trusted senders and even the routes by which mail gets from them to me. 
And they know the same about spam, including the server names, 
mailer versions, and protocols.

_ _ _

If I thought that I could keep up current rates of spam filtering, I 
would consider this problem solved.  But it doesn't mean much to be 
able to filter out most present-day spam, because spam evolves. Indeed, 
most antispam techniques so far have been like pesticides that do 
nothing more than create a new, resistant strain of bugs.

I'm more hopeful about Bayesian filters, because they evolve with the 
spam.  So as spammers start using "c0ck" instead of "cock" to evade 
simple-minded spam filters based on individual words, Bayesian 
filters automatically notice.  Indeed, "c0ck" is far more damning 
evidence than "cock", and Bayesian filters know precisely how much more.

Still, anyone who proposes a plan for spam filtering has to be able to 
answer the question: if the spammers knew exactly what you were doing, 
how well could they get past you?  For example, I think that if 
checksum-based spam filtering becomes a serious obstacle, the spammers 
will just switch to mad-lib techniques for generating message bodies.

To beat Bayesian filters, it would not be enough for spammers to make 
their emails unique or to stop using individual naughty words.  They'd 
have to make their mails indistinguishable from your ordinary mail.  
And this I think would severely constrain them.  Spam is mostly sales 
pitches, so unless your regular mail is all sales pitches, spams will 
inevitably have a different character.  And the spammers would 
also, of course, have to change (and keep changing) their whole 
infrastructure, because otherwise the headers would look as bad to the 
Bayesian filters as ever, no matter what they did to the message body.  
I don't know enough about the infrastructure that spammers use to know 
how hard it would be to make the headers look innocent, but my guess is 
that it would be even harder than making the message look innocent.

Assuming they could solve the problem of the headers, the spam of the 
future will probably look something like this:


Hey there.  Thought you should check out the following:
http://www.27meg.com/foo

because that is about as much sales pitch as content-based filtering 
will leave the spammer room to make.  (Indeed, it will be hard even to 
get this past filters, because if everything else in the email is 
neutral, the spam probability will hinge on the url, and it will take 
some effort to make that look neutral.)

Spammers range from businesses running so-called opt-in lists who don't 
even try to conceal their identities, to guys who hijack mail servers 
to send out spams promoting porn sites.  If we use filtering to whittle 
their options down to mails like the one above, that should pretty much 
put the spammers on the "legitimate" end of the spectrum out of 
business; they feel obliged by various state laws to include 
boilerplate about why their spam is not spam, and how to cancel your 
"subscription," and that kind of text is easy to recognize.

(I used to think it was naive to believe that stricter laws would 
decrease spam.  Now I think that while stricter laws may not decrease 
the amount of spam that spammers send, they can certainly help filters 
to decrease the amount of spam that recipients actually see.)

All along the spectrum, if you restrict the sales pitches spammers can 
make, you will inevitably tend to put them out of business.  That word 
business is an important one to remember.  The spammers are 
businessmen.  They send spam because it works.  It works because 
although the response rate is abominably low (at best 15 per million, 
vs 3000 per million for a catalog mailing), the cost, to them, is 
practically nothing.  The cost is enormous for the recipients, about 
5 man-weeks for each million recipients who spend a second to delete 
the spam, but the spammer doesn't have to pay that.

Sending spam does cost the spammer something, though. [2] So the lower 
we can get the response rate-- whether by filtering, or by using 
filters to force spammers to dilute their pitches-- the fewer 
businesses will find it worth their while to send spam.

The reason the spammers use the kinds of  sales pitches that they do is 
to increase response rates. This is possibly even more disgusting than 
getting inside the mind of a spammer, but let's take a quick look 
inside the mind of someone who responds to a spam.  This person is 
either astonishingly credulous or deeply in denial about their 
sexual interests.  In either case, repulsive or idiotic as the spam 
seems to us, it is exciting to them.  The spammers wouldn't say these 
things if they didn't sound exciting.  And "thought you should check 
out the following" is just not going to have nearly the pull with the 
spam recipient as the kinds of things that spammers say now. Result: if 
it can't contain exciting sales pitches, spam becomes less effective as 
a marketing vehicle, and fewer businesses want to use it.

That is the big win in the end.  I started writing spam filtering 
software because I didn't want to have to look at the stuff anymore. But 
if we get good enough at filtering out spam, it will stop working, and 
the spammers will actually stop sending it.

_ _ _

Of all the approaches to fighting spam, from software to laws, I 
believe Bayesian filtering will be the single most effective.  But I 
also think that the more different kinds of antispam efforts we 
undertake, the better, because any measure that constrains spammers 
will tend to make filtering easier. And even within the world of 
content-based filtering, I think it will be a good thing if there are 
many different kinds of software being used simultaneously.  The more 
different  filters there are, the harder it will be for spammers to 
tune spams to get through them.



Appendix: Examples of Filtering

Here is an example of a spam that arrived while I was writing this 
article.  The fifteen most interesting words in this spam are:


qvp0045
indira
mx-05
intimail
$7500
freeyankeedom
cdo
bluefoxmedia
jpg
unsecured
platinum
3d0
qves
7c5
7c266675

The words are a mix of stuff from the headers and from the 
message body, which is typical of spam.  Also typical of spam is that 
every one of these words has a spam probability, in my database, of 
.99.  In fact there are more than fifteen words with probabilities of 
.99, and these are just the first fifteen seen.

Unfortunately that makes this email a boring example of the use of 
Bayes' Rule.  To see an interesting variety of probabilities we have to 
look at this actually quite atypical spam.

The fifteen most interesting words in this spam, with their 
probabilities, are:

madam           0.99
promotion       0.99
republic        0.99
shortest        0.047225013
mandatory       0.047225013
standardization 0.07347802
sorry           0.08221981
supported       0.09019077
people's        0.09019077
enter           0.9075001
quality         0.8921298
organization    0.12454646
investment      0.8568143
very            0.14758544
valuable        0.82347786

This time the evidence is a mix of good and bad.  A word like 
"shortest" is almost as much evidence for innocence 
as a word like "madam" or "promotion" is for guilt.  But still the case 
for guilt is stronger.  If you combine these numbers according to 
Bayes' Rule, the resulting probability is .9027.
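
The figure can be checked by feeding the fifteen probabilities listed 
above into the combining formula from the main text.  In this sketch 
combine is simply that formula given a name, and *example-probs* is a 
made-up variable holding the numbers from the table.

```lisp
;; Sketch only: COMBINE and *EXAMPLE-PROBS* are hypothetical names.

(defun combine (probs)
  "Combined spam probability, using the formula from the main text."
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar (lambda (x) (- 1 x)) probs))))))

(defparameter *example-probs*
  '(0.99 0.99 0.99 0.047225013 0.047225013 0.07347802 0.08221981
    0.09019077 0.09019077 0.9075001 0.8921298 0.12454646 0.8568143
    0.14758544 0.82347786)
  "The fifteen word probabilities from the table above.")

;; (combine *example-probs*) comes out at about .903, in line with the
;; .9027 quoted in the text.
```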

"Madam" is obviously from spams beginning "Dear Sir or Madam."  They're 
not very common, but the word "madam" never occurs in my legitimate 
email, and it's all about the ratio.

"Republic" scores high because it often shows up in Nigerian scam 
emails, and also occurs once or twice in spams referring to Korea and 
South Africa. You might say that it's an accident that it thus helps 
identify this spam.  But I've found when examining spam probabilities 
that there are a lot of these accidents, and they have an uncanny 
tendency to push things in the right direction rather than the wrong 
one. In this case, it is not entirely a coincidence that the word 
"Republic" occurs in Nigerian scam emails and this spam. There is a 
whole class of dubious business propositions involving less developed 
countries, and these in turn are more likely to have names that specify 
explicitly (because they aren't) that they are republics.[3]

On the other hand, "enter" is a genuine miss.  It occurs mostly in 
unsubscribe instructions, but here is used in a completely innocent 
way.  Fortunately the statistical approach is fairly robust, and can 
tolerate quite a lot of misses before the results start to be thrown 
off.

For comparison,  here is an example of that rare bird, a spam that gets 
through the filters.  Why?  Because by sheer chance it happens to be 
loaded with words that occur in my actual email:


perl       0.01
python     0.01
tcl        0.01
scripting  0.01
morris     0.01
graham     0.01491078
guarantee  0.9762507
cgi        0.9734398
paul       0.027040077
quite      0.030676773
pop3       0.042199217
various    0.06080265
prices     0.9359873
managed    0.06451222
difficult  0.071706355

There are a couple pieces of good news here.  
First, this mail probably wouldn't get through the filters of someone 
who didn't happen to specialize in programming languages and have a 
good friend called Morris.  For the average user, all the top five 
words here  would be neutral and would not contribute to the spam 
probability.

Second, I think filtering based on word pairs  (see below) might well 
catch this one:  "cost effective", "setup fee", "money back" -- pretty 
incriminating stuff.  And of course if they continued to spam me (or a 
network I was part of), "Hostex" itself would be recognized as  a spam 
term.

Finally, here is an innocent email. Its  fifteen most interesting words 
are as follows:


continuation  0.01
describe      0.01
continuations 0.01
example       0.033600237
programming   0.05214485
i'm           0.055427782
examples      0.07972858
color         0.9189189
localhost     0.09883721
hi            0.116539136
california    0.84421706
same          0.15981844
spot          0.1654587
us-ascii      0.16804294
what          0.19212411

Most of the words here indicate the mail is an innocent one. There are 
two bad smelling words, "color" (spammers love 
colored fonts) and "California" (which occurs in testimonials and also 
in menus in forms), but they are not enough to outweigh obviously 
innocent words like "continuation" and "example".

It's interesting that "describe" rates as so thoroughly innocent.  It 
hasn't occurred in a single one of my 4000 spams.  The data turns out 
to be full of such surprises.  One of the things you learn when you 
analyze spam texts is how narrow a subset of the language spammers 
operate in.  It's that fact, together with the equally characteristic 
vocabulary of any individual user's mail, that makes Bayesian filtering 
a good bet.



Appendix: More Ideas

One idea that I haven't tried yet is to filter based on word pairs, or 
even triples, rather than individual words. This should yield a much 
sharper estimate of the probability. For example, in my current 
database, the word "offers" has a probability of .96.  If you based the 
probabilities   on word pairs, you'd end up with "special offers" and 
"valuable offers" having probabilities of .99 and, say, "approach 
offers" (as in "this approach offers") having a probability of .1 or 
less.

The reason I haven't done this is that filtering based on individual 
words already works so well.  But it does mean that there is room to 
tighten the filters if spam gets harder to detect. (Curiously, a filter 
based on word pairs would be in effect a Markov-chaining text generator 
running in reverse.)
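
Generating the pairs is the easy part.  As a rough sketch, assuming the 
message has already been split into a list of token strings, a 
hypothetical word-pairs function could produce strings that are then 
counted exactly like single tokens:

```lisp
;; Sketch only: WORD-PAIRS is a hypothetical name.

(defun word-pairs (tokens)
  "Return a list of strings, one for each adjacent pair in TOKENS."
  (loop for (a b) on tokens
        while b
        collect (concatenate 'string a " " b)))

;; (word-pairs '("this" "approach" "offers"))
;; gives ("this approach" "approach offers")
```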

Specific spam features (e.g. not seeing the recipient's address in the 
to: field) do of course have value in  recognizing spam.  They can be 
considered in this algorithm by treating them as virtual words.  I'll 
probably do this in future versions, at least for a handful of the most 
egregious spam indicators. Feature-recognizing spam filters are right 
in many details; what they lack is an overall discipline for combining 
evidence.
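
One way to picture a virtual word: a feature test returns an extra 
token that is appended to the message's ordinary tokens and counted 
like any other.  The sketch below is only an illustration; the name 
feature-tokens and the token string are invented, and the asterisks are 
there so the virtual word cannot collide with a token made only of 
alphanumerics, dashes, apostrophes and dollar signs.

```lisp
;; Sketch only: FEATURE-TOKENS and the token name are hypothetical.

(defun feature-tokens (recipient to-field)
  "Return a list of virtual tokens for features of a message.  Only one
feature is checked here: RECIPIENT not appearing in TO-FIELD."
  (let ((features '()))
    (unless (search recipient to-field :test #'char-equal)
      (push "*recipient-not-in-to*" features))
    features))
```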

Recognizing nonspam features may be more important than recognizing 
spam features.  False positives are such a worry that they demand 
extraordinary measures.  I will probably in future versions add a 
second level of testing designed specifically to avoid false positives. 
  If a mail triggers this second level of filters it will be accepted 
even if its spam probability is above the threshold.

I don't expect this second level of filtering to be Bayesian. It will 
inevitably  be not only ad hoc, but based on guesses, because the 
number of false positives will not tend to be large enough to notice 
patterns. (It is just as well, anyway, if a backup system doesn't rely 
on the same technology as the primary system.)

Another thing I may try in the future is to focus extra attention on 
specific parts of the email.  For example, about 95% of current spam 
includes the url of a site they want you to visit.  (The remaining 5% 
want you to call a phone number, reply by email or to a US mail 
address, or in a few cases to buy a certain stock.)   The url is in 
such cases practically enough by itself to determine whether the email 
is spam.

Domain names differ from the rest of the text in a (non-German) email 
in that they often consist of several words stuck together.  Though 
computationally expensive in the general case, it might be worth 
trying to decompose them.  If a filter has never seen the token 
"xxxporn" before it will have an individual spam probability of .4, 
whereas "xxx" and "porn" individually have probabilities (in my corpus) 
of .9889 and .99 respectively, and a combined probability of .9998.
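
A naive decomposition could try to split an unknown token into pieces 
that are already in the probability table.  The sketch below is one 
possible approach, not something the article specifies: decompose is a 
made-up name, the table is assumed to be an EQUAL hash table keyed by 
token strings, and the search is exponential in the worst case.

```lisp
;; Sketch only: DECOMPOSE is a hypothetical name.

(defun decompose (token table)
  "Return a list of keys of TABLE that concatenate to TOKEN, or :NONE
if there is no such split.  Longer leading pieces are tried first."
  (if (string= token "")
      '()
      (loop for end from (length token) downto 1
            for head = (subseq token 0 end)
            do (when (nth-value 1 (gethash head table))
                 (let ((tail (decompose (subseq token end) table)))
                   (unless (eq tail :none)
                     (return (cons head tail)))))
            finally (return :none))))

;; With "xxx" and "porn" in the table,
;; (decompose "xxxporn" table) gives ("xxx" "porn")
```

The probabilities of the pieces could then be used in place of the .4 
that the unknown token would otherwise get.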

I expect decomposing domain names to become more important as spammers 
are gradually forced to stop using incriminating words in the text of 
their messages.  (A url with an ip address is of course an extremely 
incriminating sign, except in the mail of a few sysadmins.)

It might be a good idea to have a cooperatively maintained list of urls 
promoted by spammers.  We'd need a trust metric of the type studied by 
Raph Levien to prevent malicious or incompetent submissions, but if we 
had such a thing it would provide a boost to any filtering software.   
It would also be a convenient basis for boycotts.

Another way to test dubious urls would be to send out a crawler to look 
at the site before the user looked at the email mentioning it.  You 
could use a Bayesian filter to rate the site just as you would an 
email, and whatever was found on the site could be included in 
calculating the probability of the email being a spam.  A url that led 
to a redirect would of course be especially suspicious.

One cooperative project that I think really would be a good idea would 
be to accumulate a giant corpus of spam.  A large, clean corpus is the 
key to making Bayesian filtering work well.  Bayesian filters could 
actually use the corpus as input.  But such a corpus would be useful 
for other kinds of filters too, because it could be used to test them.

Creating such a corpus poses some technical problems.  We'd need trust 
metrics to prevent malicious or incompetent submissions, of course.  
We'd also need ways of erasing personal information (not just 
to-addresses and ccs, but also e.g. the arguments to unsubscribe urls, 
which often encode the to-address) from mails in the corpus.  If anyone 
wants to take on this project, it would be a good thing for the world.



Appendix: Defining Spam

I think there is a rough consensus on what spam is, but it would be 
useful to have an explicit definition.  We'll need to do this if we 
want to establish a central corpus of spam, or even to compare spam 
filtering rates meaningfully.

To start with, spam is not unsolicited commercial email. If someone in 
my neighborhood heard that I was looking for an old Raleigh three-speed 
in good condition, and sent me an email offering to sell me one, I'd be 
delighted, and yet this email would be both commercial and unsolicited. 
  The defining feature of spam (in fact, its raison d'etre) is not that 
it is unsolicited, but that it is automated.

It is merely incidental, too, that spam is usually commercial. If 
someone started sending mass email to support some political cause, for 
example, it would be just as much spam as email promoting a porn site.

I propose we define spam as unsolicited automated email. This 
definition thus includes some email that many legal definitions of spam 
don't.  Legal definitions of spam, influenced presumably by lobbyists, 
tend to exclude mail sent by companies that have an "existing 
relationship" with the recipient.  But buying something from a company, 
for example, does not imply that you have solicited ongoing email from 
them. If I order something from an online store, and they then send me 
a stream of spam, it's still spam.

Companies sending spam often give you a way to "unsubscribe," or ask 
you to go to their site and change your "account preferences" if you 
want to stop getting spam.  This is not enough to stop the mail from 
being spam.  Not opting out is not the same as opting in.  Unless the 
recipient explicitly checked a clearly labelled box (whose default was 
no) asking to receive the email, it is spam.

In some business relationships, you do implicitly solicit certain kinds 
of mail.   When you order online, I think you implicitly solicit a 
receipt, and notification when the order ships. I don't mind when 
Verisign sends me mail warning that a domain name is about to expire 
(at least, if they are the actual  registrar for it).  But when 
Verisign sends me email offering a FREE Guide to Building My E-Commerce 
Web Site, that's spam.




Notes:

[1] The examples in this article are translated into Common Lisp for, 
believe it or not, greater accessibility. The application described 
here is one that we wrote in order to test a new Lisp dialect called 
Arc that is  not yet released.

[2] Currently the lowest rate seems to be about $200 to send a million 
spams. That's very cheap, 1/50th of a cent per spam. But filtering out 
95% of spam, for example, would increase the spammers' cost to reach a 
given audience by a factor of 20.  Few can have margins big enough to 
absorb that.

[3] As a rule of thumb, the more qualifiers there are before the name 
of a country, the more corrupt the rulers.  A country called The 
Socialist People's Democratic Republic of X is probably the last place 
in the world you'd want to live.

Thanks to Sarah Harlin for reading drafts of this; Daniel Giffin (who 
is  also writing the production Arc interpreter) for several good ideas 
about filtering and for creating our mail infrastructure; Robert 
Morris, Trevor Blackwell and Erann Gat for many discussions about spam; 
Raph  Levien for advice about trust metrics;  and Chip Coldwell  and 
Sam Steingold for advice about statistics.

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: [email protected] and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: [email protected]