<nettime> The Breaking of Cyber Patrol® 4 [part 2 of 2]

nettime's_dusty_archivist on Mon, 20 Mar 2000 07:32:44 +0100 (CET)
To: [email protected]
Subject: <nettime> The Breaking of Cyber Patrol® 4 [part 2 of 2]
From: "nettime's_dusty_archivist" <[email protected]>
Date: Mon, 20 Mar 2000 00:01:02 -0500
Sender: [email protected]
[orig from <http://hem.passagen.se/eddy1/reveng/cp4/cp4break.html>]

[part 2 of 2]

   If we were to go another step back we would get a record like this:
 0x4348, 0x00000000, 0x00030103

   This clashes with the structure as we know it, and so we assume that
   there are only three records, the data before them having some other
   structure. Looking backwards again, we notice that the word just
   before the first table entry is 0x0003, which could well be a count
   of the number of tables. Checking against another file with the same
   structure, hotlist.not, confirmed that this assumption was correct.
   
   The little bit left of the header is not as important as locating the
   table entries and their count, but it seems like the 0x2A at offset
   0x02 is the header size, assuming the header starts at 0x02 and the
   two bytes in front of it are unrelated to it. The "CH" seems to be a
   marker; hotlist.not contains "HH" instead. Without more files to
   compare to, or time-consuming debugging of the executable, the few
   bytes left unaccounted for will remain a "mystery".
   
   We learned several important things from the newsgroups list. First,
   Microsystems likes putting length bytes on things. Second, the
   blocking mask 0x000E (corresponding to "Partial Nudity", "Full
   Nudity", and "Sexual Acts / Text") is the most popular one. It appears
   that that's the generic "porn" label which they slap on everything
   that looks like it might be porn, whether it technically applies or
   not. Both these facts were useful in attacking the other two tables in
   cyber.not.
   
   The first table mentioned in the header is the biggest one. At over
   half a megabyte, it makes up most of the bulk of the cyber.not file.
   As our previous measurements indicated, this table includes a lot of
   repeats at a distance of six or seven bytes. Character frequency
   counts revealed that the top three characters in table 1 are:
    1. 0x00 (106280 times)
    2. 0x0E (65483 times)
    3. 0x07 (25212 times)
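
   Both measurements used here (byte frequency counts and repeats at a
   fixed distance) are easy to reproduce. The following is a minimal
   Python sketch, not the program actually used; the function names and
   the synthetic SAMPLE data are invented for illustration.

```python
from collections import Counter


def byte_frequencies(data: bytes, top: int = 3):
    """Return the `top` most common byte values as (value, count) pairs."""
    return Counter(data).most_common(top)


def repeats_at_distance(data: bytes, distance: int) -> int:
    """Count positions i where data[i] equals data[i + distance]."""
    return sum(1 for a, b in zip(data, data[distance:]) if a == b)


# Synthetic stand-in for real table data (an assumption, not file content):
# fifty six-byte records in which only the fourth byte changes.
SAMPLE = b"".join(
    bytes([0xCF, 0xCC, 0xAE, i, 0x0E, 0x00]) for i in range(0x10, 0x42)
)

if __name__ == "__main__":
    print(byte_frequencies(SAMPLE))
    for distance in (6, 7):
        print(distance, repeats_at_distance(SAMPLE, distance))
```

   On data built from fixed-size records, the repeat count peaks at the
   record size, which is the kind of signal described above.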
       
   We know that they like using blocking mask 0x000E, and the bytes
   making up that number are the top two most frequent bytes in the
   table. We know they like length bytes, we know there's some kind of
   structure in here with a size of seven bytes, and 0x07 is the third
   most frequent byte value. This looks promising. Let's look at a hex
   dump. This dump was generated with the Linux od -Ax -txC command;
   offsets are from the start of table 1 as specified in the cyber.not
   header.
000000 53 44 0a 00 03 c7 00 00 07 0e 00 99 37 55 67 00
000010 0a 0a 0a 0a 0e 0c 0b 67 73 76 00 00 07 0e 00 51
000020 b1 f1 6d 00 0c 0a 79 c8 0e 00 0c 0a 9e 09 00 00
000030 0b 01 00 89 84 e0 4e 55 9e 53 d8 00 0c 0a bd 05
000040 00 00 07 0e 00 71 aa 8a 2a 00 0c 0b b8 18 00 00
000050 0b 08 00 ea 1e da d8 d4 fc d4 20 00 0c 0b b8 1a
000060 00 00 07 00 04 e0 3d c1 be 07 08 00 7b 75 fd b7
000070 07 00 04 87 0b 1e ef 00 0c 0b b8 1f 0e 00 0c 0b
000080 b8 2b 08 00 0c 0b b8 2c 0e 00 0c 0b b8 36 08 00
000090 0c 0d 78 02 00 00 07 0e 00 13 53 03 e2 00 0c 0d
0000a0 79 06 00 04 0c 0d ab 97 00 00 07 06 00 31 75 fc
0000b0 80 00 0c 0d 13 5a 0e 00 0c 0e c7 33 0e 00 0c 0e
0000c0 c8 02 00 00 07 0e 00 22 39 82 eb 00 0c 0e e1 0d
0000d0 00 00 07 01 00 0d b0 59 21 00 0c 0e e8 32 00 00
0000e0 07 20 00 7c d3 df f8 00 0c 0f 87 cd 00 00 07 0e
0000f0 00 88 35 ae 33 00 0c 0f c1 72 0e 00 0c 10 a0 d8

   This may appear quite formidable to someone unaccustomed to reading
   hex dumps, but careful examination reveals some interesting things.
   First of all, the sequence "0e 00" occurs quite frequently. It's
   reasonable to suppose that that might be the blocking mask for a page
   or site. Another common one is "07 0e 00". When that one occurs, there
   are often four more bytes and then those three again. These patterns
   are easier to see when one examines more of the dump than the short
   sample here.
   
   It's reasonable to guess that the 07 is a length byte, just like in
   the newsgroup list. But that doesn't explain why we get so many
   repeats at distance six. The byte value 0x06 is only the 39th most
   common value in table 1, even though there are far more repeats at
   distance six than seven. So not everything can be tagged with a length
   byte, or there's something else we don't understand.
   
   Further skimming of the hex dump revealed inspirational passages like
   this one:
037b50 5c b7 08 6f 00 cf cc ae 13 0e 00 cf cc ae c8 0e
037b60 00 cf cc ae c9 0e 00 cf cc ae ca 0e 00 cf cc ae
037b70 cc 0e 00 cf cc ae cd 0e 00 cf cc ae d0 0e 00 cf
037b80 cc ae 15 0e 00 cf cc ae d8 0e 00 cf cc ae 16 0e
037b90 00 cf cc ae 18 0e 00 cf cc ae 1b 0e 00 cf cc ae
037ba0 1d 0e 00 cf cc ae 1e 0e 00 cf cc ae 1f 0e 00 cf
037bb0 cc ae 21 1e 00 cf cc ae 23 0e 00 cf cc ae 24 0e
037bc0 00 cf cc ae 27 0e 00 cf cc ae 28 0e 00 cf cc ae
037bd0 30 0e 00 cf cc 13 ea 0e 00 cf cc c4 c4 0e 00 cf
037be0 cc d0 a0 0e 00 cf cc d0 f8 0e 00 cf cc d2 64 1f
037bf0 00 cf cc d2 a0 00 00 07 0e 00 e5 b0 e3 10 00 cf
037c00 cc d2 18 0e 00 cf cc d2 19 0e 00 cf cc d2 1e 0e

   The pattern may be clearer if we look at the bytes six at a time:
037b54 00 cf cc ae 13 0e
037b5a 00 cf cc ae c8 0e
037b60 00 cf cc ae c9 0e
037b66 00 cf cc ae ca 0e
037b6c 00 cf cc ae cc 0e
037b72 00 cf cc ae cd 0e
037b78 00 cf cc ae d0 0e
037b7e 00 cf cc ae 15 0e
037b84 00 cf cc ae d8 0e
037b8a 00 cf cc ae 16 0e
037b90 00 cf cc ae 18 0e
037b96 00 cf cc ae 1b 0e
037b9c 00 cf cc ae 1d 0e
037ba2 00 cf cc ae 1e 0e
037ba8 00 cf cc ae 1f 0e
037bae 00 cf cc ae 21 1e
037bb4 00 cf cc ae 23 0e
037bba 00 cf cc ae 24 0e
037bc0 00 cf cc ae 27 0e
037bc6 00 cf cc ae 28 0e
037bcc 00 cf cc ae 30 0e

   Here we've obviously got our generic porn mask of 0x000E, alternating
   with four unknown bytes, the last of which often seems to be
   incrementing - but not always. Scanning across the table, we saw that
   when this kind of six-byte structure occurred, the four mystery bytes
   seemed to more or less increment smoothly from the start of the table
   to the end. But it was always the last byte that incremented first,
   and then the second-to-last, and so on. In other words, the field is
   being stored in "big endian" byte order, the exact opposite of the
   "little endian" byte order conventional on PCs. Why would a PC
   software package bother doing something in big endian when it's
   running on a CPU designed for little endian?
   
   At this point we had to depend on intuition. There is one thing that's
   32 bits long and big endian everywhere, even on a PC: that is an IP
   address. Some computers like big endian and some like little endian,
   but it is standard for all Internet protocols to use big endian
   regardless of what kind of system they're running on - so that they'll
   all be able to talk to each other. An added bit of evidence is that
   the actual values of this four-byte field seem to be distributed the
   way one would expect IP addresses to be distributed. Lots of them
   start with bytes like 0xCF, which puts them right in the popular part
   of the Class C IP address space. So, let's write the decimal
   equivalents of the supposed IP addresses next to the hex dump:
037b54 00 cf cc ae 13 0e   207.204.174.19
037b5a 00 cf cc ae c8 0e   207.204.174.200
037b60 00 cf cc ae c9 0e   207.204.174.201
037b66 00 cf cc ae ca 0e   207.204.174.202
037b6c 00 cf cc ae cc 0e   207.204.174.204
037b72 00 cf cc ae cd 0e   207.204.174.205
037b78 00 cf cc ae d0 0e   207.204.174.208
037b7e 00 cf cc ae 15 0e   207.204.174.21
037b84 00 cf cc ae d8 0e   207.204.174.216
037b8a 00 cf cc ae 16 0e   207.204.174.22
037b90 00 cf cc ae 18 0e   207.204.174.24
037b96 00 cf cc ae 1b 0e   207.204.174.27
037b9c 00 cf cc ae 1d 0e   207.204.174.29
037ba2 00 cf cc ae 1e 0e   207.204.174.30
037ba8 00 cf cc ae 1f 0e   207.204.174.31
037bae 00 cf cc ae 21 1e   207.204.174.33
037bb4 00 cf cc ae 23 0e   207.204.174.35
037bba 00 cf cc ae 24 0e   207.204.174.36
037bc0 00 cf cc ae 27 0e   207.204.174.39
037bc6 00 cf cc ae 28 0e   207.204.174.40
037bcc 00 cf cc ae 30 0e   207.204.174.48

   Notice that these are not in numerical order; 216 is not normally
   considered to come between 21 and 22. However, considered as decimal
   representations, these addresses are in strict alphabetical order.
   This list is the kind of thing you might get if you took a text list
   of URLs and passed it through a sort utility designed for text. A
   little examination reveals that these six-byte structures in table 1
   are strictly in this "text IP" order across the entire table. As a
   final confirmation that these numbers are intended to represent IP
   addresses, just point a Web browser to a few. Almost all are porn
   sites.
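
   The "text order" observation can be checked mechanically. The sketch
   below is an illustration in Python with invented function names. It
   takes five consecutive records from the dump above, splits each into
   a four-byte big-endian address followed by a two-byte mask (the mask
   is read as little-endian, so that "0e 00" gives 0x000E), and compares
   string ordering with numeric ordering.

```python
def decode_ip_records(raw: bytes):
    """Split raw data into (dotted-quad, mask) pairs, six bytes per record."""
    records = []
    for off in range(0, len(raw) - 5, 6):
        ip = ".".join(str(b) for b in raw[off:off + 4])
        mask = int.from_bytes(raw[off + 4:off + 6], "little")
        records.append((ip, mask))
    return records


def in_text_order(records) -> bool:
    """True if the addresses are sorted as plain strings."""
    ips = [ip for ip, _ in records]
    return ips == sorted(ips)


def in_numeric_order(records) -> bool:
    """True if the addresses are sorted by their numeric octets."""
    keys = [tuple(int(part) for part in ip.split(".")) for ip, _ in records]
    return keys == sorted(keys)


# Five consecutive records copied from the hex dump above.
SAMPLE = bytes.fromhex(
    "cf cc ae d0 0e 00"
    "cf cc ae 15 0e 00"
    "cf cc ae d8 0e 00"
    "cf cc ae 16 0e 00"
    "cf cc ae 18 0e 00"
)

if __name__ == "__main__":
    for ip, mask in decode_ip_records(SAMPLE):
        print(f"{ip:<16} {mask:04X}")
    print("text order:", in_text_order(SAMPLE_RECORDS := decode_ip_records(SAMPLE)))
    print("numeric order:", in_numeric_order(SAMPLE_RECORDS))
```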
   
   At this point we had figured out that there were a lot of blocking
   masks interspersed with IP addresses in the table, and also a lot of
   seven-byte structures starting with a length byte and a blocking mask.
   But the remaining four bytes of those seven-byte structures were
   apparently not sorted, nor IP addresses, and there were still some
   bytes that didn't fit into either kind of structure. So we wrote a
   Perl program to dump out the known structures and label the unknown
   parts.
   
   The next step was simply to stare at the output and look for patterns.
   We saw that the six-byte and seven-byte records often occurred in
   blocks of lots of the same kind all together. The unknown part often
   seemed to consist of the byte 0x0B followed by a blocking mask and
   eight bytes of garbage. We guessed that that might be a third record
   type, so we added it to the dumper program, and noticed that the
   remaining unknown sequences often seemed to consist of 0x0F, a
   blocking mask, and then twelve bytes of garbage. From this we inferred
   a general pattern: a length byte (always 3 plus a multiple of 4), a
   blocking mask, and then some amount of garbage, always a multiple of
   four bytes.
   
   Between this and the six-byte IP/mask pattern, almost all the contents
   of table 1 fit some kind of structure. But there were still a bunch of
   zero bytes hanging around. A reasonable guess was that these signalled
   some kind of "end of structure" condition. It only took a little more
   intuition to realise that of the "length byte" records and the "IP
   address" records, one logically went inside the other. Unfortunately,
   we guessed that the "IP address" records went inside the "length byte"
   records, and that confused us for quite a while. Here's part of the
   output from our dumping program at this stage:
07 0E 00 0F 25 6B BF
07 0E 00 C8 87 B1 C1 (0501)(0800)(0800)(0000)
0B 02 00 B9 53 9A 71 6A BE 88 54
0B 00 08 B9 53 9A 71 3D 5F E2 F4
0B 00 08 B9 53 9A 71 38 16 1A 41
0B 08 00 B9 53 9A 71 07 B3 CA 02 (000E)(0000)
07 08 00 2F 31 2A 45 (000E)(000E)(0000)
07 0E 00 37 71 0F 71 (000E)(000E)(0008)(0000)
0B 01 00 88 B4 92 0E A6 53 2E 7F (000E)(0000)
07 98 04 08 B0 DD FB
07 08 00 0F E8 F5 82 (0000)
07 09 00 4F DE 86 ED (0000)
07 0E 00 79 1F 36 41
07 0E 00 63 C8 51 C4 (0000)
07 02 00 0A E2 34 93 (000E)(0000)
07 08 00 31 2D E5 BA (000A)(000E)(0800)(0020)(0000)

   In this dump, the four-digit numbers in parentheses are abbreviations
   for "IP address" records, showing only the blocking mask part. We had
   already figured out, although it's a break with the tradition set
   elsewhere in the file, that in the six-byte IP address records, the
   blocking mask comes at the end instead of the start. Not shown in this
   dump is the enormous variability in the number of IP addresses
   apparently associated with each "length byte record"; some had dozens,
   many had none at all.
   
   Also, although it looks okay in this fragment, there's a critical
   problem of how to recognize which records are which. The dumping
   program would guess what looked like a plausible IP address, but it
   sometimes guessed wrong and produced junk until it happened to
   randomly re-synchronize. It appeared that IP records with a blocking
   mask of 0x0000 helped signal "OK, length byte records coming now", and
   a length byte of 0x00 (not shown here) signalled the start of a list
   of IP address records, but these things raised problems because it
   appeared that in a list of IP addresses, there would always be one
   more address than there were blocking masks. Where would the blocking
   mask for the last IP address come from?
   
   Late one night, under the influence of a couple bowls of MSG-saturated
   Korean instant noodles ("kimchee" flavour), we realised what we should
   have seen all along. The "IP address" records are actually the major
   records, and the other records go inside them, as children of a parent
   IP address. This makes more logical sense, given the purpose of the
   file; the package blocks either an entire IP address, or one or more
   subsections of an IP address. Then the rest of the structure fell out
   easily.
   
   The basic record contains an IP address and a blocking mask. If the
   blocking mask is nonzero, it applies to that entire IP address. If the
   blocking mask is zero, then there are a number of subrecords, each
   consisting of a length byte, a blocking mask, and one or more
   four-byte unknown fields. A length byte of 0x00 terminates the list of
   subrecords and signals a new IP address.
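
   The record layout just described can be walked with a few lines of
   code. This is a hedged Python sketch, not the Perl dumper mentioned
   in the text; parse_url_table is an invented name, and reading the
   four-byte unknown fields as big-endian integers is purely a display
   assumption. The sample bytes are offsets 0x24 to 0x49 of the first
   hex dump of table 1.

```python
def parse_url_table(data: bytes):
    """Return a list of (ip, mask, subrecords) tuples.

    Each subrecord is (mask, [four-byte fields as integers]).
    """
    entries = []
    pos = 0
    while pos + 6 <= len(data):
        ip = ".".join(str(b) for b in data[pos:pos + 4])
        mask = int.from_bytes(data[pos + 4:pos + 6], "little")
        pos += 6
        subrecords = []
        if mask == 0:
            # A zero mask means length-prefixed subrecords follow.
            while pos < len(data) and data[pos] != 0:
                length = data[pos]  # counts itself, the mask and the fields
                sub_mask = int.from_bytes(data[pos + 1:pos + 3], "little")
                body = data[pos + 3:pos + length]
                fields = [
                    int.from_bytes(body[i:i + 4], "big")  # display assumption
                    for i in range(0, len(body), 4)
                ]
                subrecords.append((sub_mask, fields))
                pos += length
            pos += 1  # skip the 0x00 that terminates the subrecord list
        entries.append((ip, mask, subrecords))
    return entries


SAMPLE = bytes.fromhex(
    "0c 0a 79 c8 0e 00"
    "0c 0a 9e 09 00 00"
    "0b 01 00 89 84 e0 4e 55 9e 53 d8"
    "00"
    "0c 0a bd 05 00 00"
    "07 0e 00 71 aa 8a 2a"
    "00"
)

if __name__ == "__main__":
    for ip, mask, subs in parse_url_table(SAMPLE):
        print(f"{ip} mask={mask:04X}")
        for sub_mask, fields in subs:
            print("   ", f"{sub_mask:04X}", " ".join(f"{f:08X}" for f in fields))
```

   With this reading there is no resynchronization problem: every byte
   of the excerpt belongs to exactly one record, subrecord or terminator.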
   
   Now, what about those subrecords? Well, they obviously represent some
   kind of subdivision of an IP address - like, for instance, a directory
   full of Web pages. Here's an entry from table 1, decoded by a more
   sophisticated Perl program that also incorporated reverse lookups of
   the IP addresses:
 207.34.139.253 (pii300.bc1.com):
    000E  D2A152F4 23AC865E
    0002  D2A152F4 9ECA24AB
    000E  D2A152F4 4337DDA1
    001E  D2A152F4 F1909EA3
    000E  D2A152F4 8532C8E2

   This particular entry stood out partly because bc1.com is an ISP local
   to one of us. We have friends with pages on that system (although not,
   as far as we could tell, at the particular URLs blocked by Cyber
   Patrol). It also stood out because all the subrecords start with the
   same four-byte sequence. That's a pattern that appears in lots of
   other entries, too; there will often be a site where several
   subrecords start with the same four-byte sequence. Here's a good
   example (it's long, so we've left out part):
 158.43.192.14 (twister.dial.pipex.net):
 [...]
   000E  86AC9240
   000E  4603712B
   0002  D7E769CA
   001E  0B01848F
   000E  8A1266F1
   000E  6DA218B8 957FF449 607AB5ED
   000E  6DA218B8 957FF449 E90B0308
   000E  6DA218B8 957FF449 D5D0798C
   0002  6DA218B8 6A96D698 5F78E699
   000E  6DA218B8 6A96D698 CCA4ED77
   000E  6DA218B8 118AA2D3 5B69B41C
   000E  6DA218B8 3CEC7FA9 48E41B10
   000E  6DA218B8 3CEC7FA9 09ED716A
   001E  6DA218B8 9B826D61 9BEC198D
   000E  6DA218B8 9B826D61 8EF51A8C
   000E  6DA218B8 1A7E65EE 8E16AE15

   Notice how the four-byte values seem to be grouped together in a
   hierarchical structure. Just like directories... It seemed a
   reasonable guess that in fact, that's what they were. If they wanted
   to block a URL like http://www.foo.com/bar/baz/, maybe they'd do it by
   creating a record with the IP address of www.foo.com, and a subrecord
   with some representation of the strings "bar" and "baz".
   
   We said "some representation of the strings". What, exactly, does that
   mean? Well, it would be quite reasonable to suppose that these
   four-byte fields are hashes, similar in nature to the password hashes.
   They could feed each URL component into a hash function, store only
   the hashes, and then have enhanced security as well as various
   efficiency advantages.
   
   We figured out the exact nature of the hash function with the aid of
   the bc1.com entry. As you can see above, every subrecord for that
   server starts with the hash value 0xD2A152F4. If you look on the
   corresponding Web site, you find that it's an ISP's server for user
   home pages, all of which are stored in a "users" subdirectory. And it
   just so happens that in the nonstandard CRC32 variant that was used as
   half of the HQ password hash, the hash of the string "users" is
   0xD2A152F4. Problem solved. We've designated this structure
   TNotURLEntry.
   
   Above we explain the cryptanalysis of CRC32 in considerable detail,
   and we show how to construct, in negligible time, an input that will
   generate any output of our choice. As with the passwords, Cyber Patrol
   doesn't use any salt for its URL hashes, so we can recognize where
   there are duplicate directory names even without reversing the hashes,
   and get extra value for each hash we reverse because the same reversal
   will be valid for all other occurrences of that hash.
   
   Unfortunately, there is what might be called an "information
   theoretic" problem with reversing these hashes. There are many
   possible directory names that could generate the same CRC. We can
   never be absolutely sure which of several equivalent (same CRC) URLs
   was actually meant to be blocked. In the case of the HQ password, we
   could use the other half of the hash output to recognize which one was
   correct, but here, that doesn't work. In a perverse way, shortening
   the hash has actually increased its security. But one good thing for
   us as attackers is that of the many possible strings, only a few will
   be meaningful. Given the choice between "sex" and "dkbgl~3.a7df", few
   would argue with our choice of "sex". For the small number of hashes
   which are hashes of very short strings, we can guess that the short
   strings are really correct - there are so few possible strings of five
   or fewer characters, that they're almost certainly right.
   
   But for most hash values, the CRC32 reversal isn't really very
   helpful. For any given hash it generates a long list of possibilities,
   most of which are garbage. Instead of sorting through them, we fell
   back on the old reliable dictionary attack. We took a list of words
   and hashed them all, and then started modifying them by tacking tildes
   onto the start (to make it look like user home directories), adding
   letters to the start and end, adding ".htm" and ".html" to the end,
   and so on.
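
   The shape of such a dictionary attack is sketched below in Python.
   This is not cndecode.c. The hash here is the standard CRC-32 from
   zlib, used only as a stand-in, since the nonstandard variant is not
   reproduced in this section. The variant rules are a small subset of
   those listed above, and all names are invented for illustration.

```python
import zlib


def stand_in_hash(component: str) -> int:
    """ASSUMPTION: standard CRC-32, not Cyber Patrol's nonstandard variant."""
    return zlib.crc32(component.encode("ascii")) & 0xFFFFFFFF


def variants(word: str):
    """Yield the word plus a few of the modifications described in the text."""
    yield word
    yield "~" + word          # looks like a user home directory
    yield word + ".htm"
    yield word + ".html"


def build_table(words):
    """Hash every variant once; map hash value -> first string that produced it."""
    table = {}
    for word in words:
        for candidate in variants(word):
            table.setdefault(stand_in_hash(candidate), candidate)
    return table


def attack(hashes, table):
    """Look up each target hash; None means the dictionary had no match."""
    return {h: table.get(h) for h in hashes}


if __name__ == "__main__":
    table = build_table(["users", "omega", "index"])
    targets = [stand_in_hash("~omega"), stand_in_hash("index.html")]
    for h, guess in attack(targets, table).items():
        print(f"{h:08X} -> {guess}")
```

   Because the hashes are unsalted, the table is built once and then
   checked against every hash in the file.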
   
   The source file "cndecode.c" implements this attack on the cyber.not
   file, as well as incorporating decryption code, some prettier output
   formatting, and (for systems where this works) reverse DNS lookups. It
   uses a hash table, and remembers the reversal of each hash for use on
   future occurrences of that hash, in an effort to be as efficient as is
   reasonable, although the prime emphasis was on expediency in
   programming over squeezing out the last CPU cycles.
   
   As a last resort, if it can't find a hash in the dictionary, the
   cndecode program goes through all the possible reverse-CRC values up
   to a configurable limit, assigning scores to them based on how
   plausible they seem, and then chooses the best. That takes a
   relatively long time (a significant fraction of a second) per hash, and
   it doesn't really work very well, but it does catch a few that aren't
   caught by the dictionary attack. Here's a sample of the output:
 ************************************************************************
   www1.iastate.edu
 = 129.186.1.22

 0006 http://129.186.1.21/.wmdnl/
 000E http://129.186.1.21/~blak/
 0008 http://129.186.1.21/~cwhipple/
 0820 http://129.186.1.21/~ejackson/
 0010 http://129.186.1.21/~ipdpfid/
 0001 http://129.186.1.21/2kihan/
 000E http://129.186.1.21/~omega/
 0008 http://129.186.1.21/~roymeo/
 0800 http://129.186.1.21/s(ettk/
 0001 http://129.186.1.21/~thinker/

 0001:  Violence / Profanity
 0006:  Partial Nudity, Full Nudity
 0008:  Sexual Acts / Text
 000E:  Partial Nudity, Full Nudity, Sexual Acts / Text
 0010:  Gross Depictions / Text
 0800:  Alcohol & Tobacco
 0820:  Intolerance, Alcohol & Tobacco
 ************************************************************************

   As this shows, URLs tend to be sorted within a given IP address. The
   ones that aren't in sorted order are probably ones for which the
   reverse-CRC didn't guess the right reversal. A more sophisticated
   version might attempt to detect the sorted order, and force the
   reverse-CRC to choose a reversal which would fit into the sorted
   order, but the amount of work involved would probably be more than
   it's worth.
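
   Detecting the suspicious entries, as opposed to repairing them, is
   cheap. The following Python sketch flags every component that sorts
   below the highest one accepted so far. It assumes plain byte-wise
   string ordering, which may not be the ordering the vendor actually
   used, and the function name is invented.

```python
def out_of_order(components):
    """Return the components that break ascending byte-wise order."""
    flagged = []
    high_water = None
    for comp in components:
        if high_water is not None and comp < high_water:
            flagged.append(comp)      # sorts below an earlier entry
        else:
            high_water = comp         # accepted; raise the high-water mark
    return flagged


# Path components from the sample output above, in file order.
SAMPLE = [
    ".wmdnl", "~blak", "~cwhipple", "~ejackson", "~ipdpfid",
    "2kihan", "~omega", "~roymeo", "s(ettk", "~thinker",
]

if __name__ == "__main__":
    print(out_of_order(SAMPLE))
```

   A wrong reversal that happens to land in order (as may be the case
   for ".wmdnl" at the head of the list) is not caught by this check.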
   
   This entry also shows something else we haven't talked about yet -
   "alias" IP addresses, which are the apparent purpose of the one
   remaining table in cyber.not. The structure can be seen in the
   TNotIPEntry. These aliases are just that. Each entry consists of a
   root IP and one or more aliases to that one. The root IP corresponds
   to entries in the URL table, and any resource banned under the root IP
   will also be banned under its aliases. These aliases may or may not
   resolve to the same machine; the assumption here is that these IPs are
   serving the same pages.
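
   A reader for this table might look like the Python sketch below. It
   follows the TNotIPEntry layout (a four-byte root IP, a one-byte
   count, then that many four-byte aliases). It assumes the addresses
   are big-endian as in the URL table. The sample bytes are synthetic,
   modelled on the iastate.edu entry above, and the names are invented.

```python
def dotted(raw: bytes) -> str:
    """Format four bytes as a dotted-quad string."""
    return ".".join(str(b) for b in raw)


def parse_alias_entries(data: bytes):
    """Return a list of (root_ip, [alias_ip, ...]) tuples."""
    entries = []
    pos = 0
    while pos + 5 <= len(data):
        root = dotted(data[pos:pos + 4])
        count = data[pos + 4]
        pos += 5
        aliases = [
            dotted(data[pos + 4 * i:pos + 4 * i + 4]) for i in range(count)
        ]
        pos += 4 * count
        entries.append((root, aliases))
    return entries


def alias_to_root(entries):
    """Invert the table so a lookup on any alias yields its root IP."""
    return {alias: root for root, aliases in entries for alias in aliases}


# Synthetic example (not bytes from the real file): one root, one alias.
SAMPLE = bytes([129, 186, 1, 21, 1, 129, 186, 1, 22])

if __name__ == "__main__":
    print(parse_alias_entries(SAMPLE))
```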
   
   Let's talk briefly about hash collisions. The chance that any two
   randomly chosen URL components will happen to have the same hash is
   one in 2**32, which is not very likely. This is true even with the
   uneven distribution of URLs, because CRC32 is a reasonably good hash
   just as a hash, for all its cryptographic weakness. So at first
   glance, it doesn't seem like there'll be a big problem of different
   URLs having the same hash.
   
   But the birthday paradox comes into play, too. With 2**32 possible
   hash values, there starts to be a serious chance of collisions as soon
   as the number of hashes gets past 2**16, which is 65536. It's
   certainly easy to imagine that a large ISP could have more than that
   many user home pages at the same location in their URL tree. Then two
   or more different sites would have the same URL as far as Cyber Patrol
   is concerned, and any block on one such page would hit the others.
   Given the current size of the Net and the size of cyber.not, there
   probably aren't any real examples of this kind of problem in the
   cyber.not file. But there is very little safety margin. A 64-bit hash
   would remove any suggestion of collision risks, at the cost of a
   considerable increase in filesize.
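
   To put rough numbers on this, the usual birthday approximation gives
   the chance of at least one collision among n uniformly distributed
   b-bit hashes as about 1 - exp(-n(n-1) / 2**(b+1)). The Python sketch
   below evaluates it; uniformity is an assumption, and the function
   name is invented.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance that n uniform `bits`-bit hashes contain a collision."""
    x = n * (n - 1) / 2 / 2 ** bits
    return -math.expm1(-x)   # 1 - exp(-x), accurate even for tiny x


if __name__ == "__main__":
    for bits in (32, 64):
        p = collision_probability(65536, bits)
        print(f"{bits}-bit hash, 65536 names: {p:.3g}")
```

   For 65536 names under one directory, the 32-bit figure comes out a
   little under 40 percent, while the 64-bit figure is vanishingly small.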
   
   Of course, using a 64-bit hash would improve our ability to attack the
   cyber.not file too, by reducing the number of possible URLs for each
   hash value. Remember how having the second half of the HQ password
   hash made it so much easier to unambiguously reverse the hash?
   Information theory makes this tradeoff unavoidable: the fewer possible
   collisions, the easier and more unambiguous dictionary attacks will
   necessarily become. Given that bytes in cyber.not are somewhat
   expensive (because the file has to be transferred to all the users in
   updates all the time), the choice of a 32-bit hash is probably
   reasonable, even though it has some small risk of creating false
   blocks.
   
   A more practical security measure would be to salt the URL hash. In
   the section on the HQ password we described how salting that hash
   would make dictionary attacks on the password much harder. With the
   URL hashes that becomes all the more significant, because with the URL
   hashes we aren't attacking just one hash value. We're attacking a few
   tens of thousands of hash values all at once. So anywhere we can
   recognize that two hashes are the same, that's a win, and any time we
   hash a dictionary word, we can easily check it against all the hash
   values in cyber.not all at once.
   
   If every URL in cyber.not had been hashed with a different salt value,
   then we would have to hash an entire dictionary for every URL instead
   of just hashing one dictionary for the entire file. That would raise
   our time for a dictionary attack from a few CPU minutes to a few CPU
   months - we could still do it, possibly by recruiting a network of
   volunteers to compute cooperatively, but not as easily as the present
   attacks.
   
   They wouldn't even need to make cyber.not any bigger to get the
   benefit of salted hashing - they could just use the offset of each URL
   in the cyber.not file as its salt value. Salt doesn't have to be
   random or secret, it just has to be different for each hash. They
   would also have to upgrade the hash function to one that isn't linear
   like CRC32; with CRC32, we could simply figure out the hash of the
   salt, XOR it out, and then have an unsalted hash to attack normally. A
   much more secure approach, which wouldn't make cyber.not any bigger,
   would be to take the offset and the URL, hash them together with SHA1,
   and then take the bottom 32 bits of the result.
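
   Both points can be illustrated in a few lines of Python. The first
   function checks the XOR structure that makes a CRC a poor salted
   hash, using the standard CRC-32 from zlib rather than the Cyber
   Patrol variant. The second shows one possible reading of "offset plus
   URL through SHA1, bottom 32 bits". The four-byte big-endian offset
   encoding and the choice of the last four digest bytes are
   assumptions, as are all the names.

```python
import hashlib
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def crc_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """Check crc(a^b^c) == crc(a)^crc(b)^crc(c) for equal-length inputs."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("inputs must have equal length")
    combined = xor_bytes(xor_bytes(a, b), c)
    return zlib.crc32(combined) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)


def salted_hash(offset: int, component: str) -> int:
    """SHA-1 over (file offset, component), truncated to 32 bits.

    ASSUMPTION: offset encoded as 4 big-endian bytes; "bottom 32 bits"
    taken as the last four bytes of the digest.
    """
    data = offset.to_bytes(4, "big") + component.encode("ascii")
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[-4:], "big")


if __name__ == "__main__":
    print(crc_is_affine(b"users", b"~blak", b"index"))
    for offset in (100, 106):
        print(offset, f"{salted_hash(offset, 'users'):08X}")
```

   With the salted version the same directory name hashes differently at
   each offset, so one dictionary pass no longer covers the whole file.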
   
   But even that wouldn't raise the difficulty of attack above the level
   of competent amateurs, and indeed, there is no way to make this kind
   of hashing scheme any more secure. There just aren't enough possible
   URLs on the Web; it's too easy for attackers to guess all possible
   URLs and test them to see which ones would be blocked. Unix sysadmins
   accept the fact that attackers can test passwords offline, and attempt
   to educate their users to choose hard-to-guess passwords, but
   censorware companies cannot ask all objectionable Web sites to choose
   hard-to-guess URLs. So they ultimately cannot defend themselves
   against this form of attack. With salt in the hashes, though, they
   could make it a lot harder for us.
   
   Next, the cyber.yes file contains "positive option" URLs; when the
   software is configured to its strictest setting, only these URLs will
   be permitted. There is also a list of newsgroups at the end that seems
   to be in identical format to the one in cyber.not. A quick scan of the
   decrypted file with a text lister showed that it's full of fragments
   of ASCII text, like this (dump generated, amusingly enough, by Richard
   E. Morris's good old DOS-based HEXEDIT program):
000880:  0B 01 00 7E 63 68 69 6E 6F 6F 6B 00 81 80 3D 11   |...~chinook...=.|
000890:  00 00 06 08 00 77 73 69 00 81 80 44 0A 00 00 15   |.....wsi...D....|
0008A0:  10 00 7E 77 61 6E 69 67 61 72 2F 73 70 61 63 65   |..~wanigar/space|
0008B0:  6C 69 6E 6B 00 81 0D 0A 64 00 00 10 09 00 7E 74   |link....d.....~t|
0008C0:  68 67 72 69 65 73 2F 64 69 73 63 00 81 89 C2 89   |hgries/disc.....|
0008D0:  00 02 81 89 21 25 02 40 81 0F 02 5A 00 00 07 40   |....!%.@...Z...@|
0008E0:  00 6F 75 70 64 19 48 00 7E 6E 77 73 2F 73 70 6F   |.oupd.H.~nws/spo|
0008F0:  74 74 65 72 67 75 69 64 65 2E 68 74 6D 6C 00 81   |tterguide.html..|
000900:  A4 28 6C 10 02 81 A4 28 DF 80 00 81 A4 28 E1 10   |.(l....(.....(..|
000910:  82 81 B1 0C 0C 00 00 0F 40 02 70 65 6F 70 6C 65   |........@.people|

   These look like URL fragments, but they also look sort of haphazard.
   In fact we theorized at one point that they might be stray garbage
   from memory allocation calls. However, they do have a purpose, and
   once we had the format of the cyber.not file, the cyber.yes file
   became easy to figure out.
   
   The same correlation-counting program that we ran on cyber.not showed
   similar results on cyber.yes, with strong correlation at a distance of
   six characters, but unlike cyber.not, no sharp peak at seven
   characters. This suggested that the format for the main table in
   cyber.yes would be very similar to that of cyber.not. Examination of
   the hex dump showed similar stretches of six-byte repeats with a field
   incrementing in big endian.
   
   A little trial and error revealed that the format is essentially
   identical: records with IP addresses and two-byte "mask-like" fields.
   We say mask-like because it's not clear that they serve the same
   function as the mask fields in cyber.not. When the mask-like field is
   zero, there follows some number of variable-length URL records,
   terminated by a zero byte. There are two significant differences in
   the subrecord format. First, the URL is in plain text instead of being
   hashed. As a result, the variable length can assume a less restricted
   set of values. Second, the "mask" field appears to have a different
   significance. Here is a sample record from cyber.yes:
202.231.128.32:
   0802 "home/dbec1"
   5A8A "home/kazoo"
   5A8A "home/kiboc"
   5A8A "home/kimin"
   5A8A "home/sanyohs"
   7ACA "home/terada"
   7AEA "home/tomoy"
   7AEA "home/tomoyuki"
   7BFA "home/ueno"
   7BFA "home/warp"

   The hexadecimal column is the field that in cyber.not would be the
   blocking mask. Here, it's not clear what it is. It could be some kind
   of anti-blocking mask, of categories NOT to block, but then it's
   surprising that it would be in sorted order (a pattern that persists
   in other records too), especially when the URLs are also in
   alphabetical sorted order. Other possibilities for this field include
   some kind of time stamp, a serial number, an index pointer, an
   authentication token or hash, or random memory garbage. The
   "mask-like" fields on IP addresses similarly show little apparent
   design, except that (just as in cyber.not) a zero value indicates the
   presence of URL subrecords. The newsgroup list has mask-like fields
   too, and there's no immediately obvious meaning to the data in them.
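
   The same walk used for cyber.not works here with the hash fields
   replaced by text. This Python sketch is an illustration with an
   invented function name. It reads the "mask-like" fields as
   little-endian by analogy with cyber.not, which is an assumption. The
   sample bytes are offsets 0x88C to 0x8D1 of the hex dump above.

```python
def parse_yes_table(data: bytes):
    """Return a list of (ip, field, [(sub_field, url_text), ...]) tuples."""
    entries = []
    pos = 0
    while pos + 6 <= len(data):
        ip = ".".join(str(b) for b in data[pos:pos + 4])
        field = int.from_bytes(data[pos + 4:pos + 6], "little")
        pos += 6
        urls = []
        if field == 0:
            # Zero field: length-prefixed plaintext subrecords follow.
            while pos < len(data) and data[pos] != 0:
                length = data[pos]  # counts itself, the field and the text
                sub_field = int.from_bytes(data[pos + 1:pos + 3], "little")
                text = data[pos + 3:pos + length].decode("ascii", errors="replace")
                urls.append((sub_field, text))
                pos += length
            pos += 1  # skip the terminating zero byte
        entries.append((ip, field, urls))
    return entries


SAMPLE = bytes.fromhex(
    "81 80 3D 11 00 00"
    "06 08 00 77 73 69"
    "00"
    "81 80 44 0A 00 00"
    "15 10 00 7E 77 61 6E 69 67 61 72 2F 73 70 61 63 65 6C 69 6E 6B"
    "00"
    "81 0D 0A 64 00 00"
    "10 09 00 7E 74 68 67 72 69 65 73 2F 64 69 73 63"
    "00"
    "81 89 C2 89 00 02"
)

if __name__ == "__main__":
    for ip, field, urls in parse_yes_table(SAMPLE):
        print(f"{ip} field={field:04X}")
        for sub_field, text in urls:
            print("   ", f"{sub_field:04X}", repr(text))
```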
   
   At this point we should note the overall file structure of cyber.yes.
   Unlike cyber.not which had an elaborate header, the header on
   cyber.yes consists of just three bytes: one version number (or
   possibly encryption key fixup), and two bytes giving the length of the
   URL table. We discovered this by working backwards from the URL table
   until we found that all the bytes in the file except the first three
   made sense as part of the URL table. The newsgroup list follows
   immediately after the URL table and continues until the end of the
   file, in the same format as the cyber.not newsgroup list except with
   unknown data where the blocking mask would go. Unlike the tables in
   cyber.not, both tables in cyber.yes are just bare data, with no "SD"
   and "ED" delimiters.
   
   This file structure is interesting because it seems stripped down or
   simplified from the structure of cyber.not. It would be reasonable to
   guess that the cyber.yes format was a quick hack retrofitted onto the
   product subsequent to the more carefully-designed cyber.not table.
   It's also possible that the cyber.not format proved too complicated
   and cyber.yes is an example of a "leaner and meaner" file format,
   still keeping to the same design principles as cyber.not and likely
   re-using a lot of code originally written for cyber.not.
   
   Following are the relevant structure tables. This concludes the
   section on reversing the file formats.
   
   
    6.1 Structure tables
    
   TNotHeader
   Offset  Size  Description
   0x0000  2     Filetype? (0x00FC)
   0x0002  2     Header size (0x002A)
   0x0004  2     Header id ('CH' or 'HH')
   0x0006  2     unknown ( 00 00 )
   0x0008  2     unknown ( 00 00 )
   0x000A  2     unknown ( 03 01 )
   0x000C  2     Count of TNotHeaderEntries (0x0003)
   
   Immediately followed by one or more of these:
   
   TNotHeaderEntry
   Offset  Size  Description
   0x0000  2     Table type ( 4x 00)
   0x0002  4     Absolute offset
   0x0006  4     Size (in bytes)
   
   The problem here is the Table Type field which we have too little data
   to fill in with any certainty. We can build the following table from
   the files we have analysed so far, built around the types that have
   occurred and the type of data they pointed to.
   
   TNotTableType
   Value   Binary     Description
   0x0041  0100 0001  Points to TNotIPEntries in cyber.not
   0x0047  0100 0111  Points to TNotNewsEntries in hotlist.not
   0x0049  0100 1001  Points to TNotURLEntries in cyber.not and hotlist.not
   0x004E  0100 1110  Points to TNotNewsEntries in cyber.not and hotlist.not
   0x004F  0100 1111  Points to TNotURLEntries in hotlist.not
   
   We can make no detailed conclusions from so little data.
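
   To make the two header tables concrete, here is a minimal Python
   sketch of reading them. It is an illustration under stated
   assumptions, not the code used in CPHack: fields are assumed to be
   little-endian, the input is assumed to be an already decoded
   cyber.not image, and the names parse_not_header, NotHeader and
   TableEntry are made up for this example.

```python
import struct
from typing import NamedTuple

# Assumption: all multi-byte fields are little-endian ("<").
# Field order follows the TNotHeader / TNotHeaderEntry tables.
HEADER = struct.Struct("<HH2s6sH")  # filetype, header size, id, 6 unknown bytes, count
ENTRY = struct.Struct("<HII")       # table type, absolute offset, size in bytes


class TableEntry(NamedTuple):
    table_type: int
    offset: int
    size: int


class NotHeader(NamedTuple):
    filetype: int
    header_size: int
    header_id: bytes
    entries: list


def parse_not_header(data: bytes) -> NotHeader:
    """Parse the fixed header and the table entries that follow it."""
    filetype, header_size, header_id, _unknown, count = HEADER.unpack_from(data, 0)
    entries = [
        TableEntry(*ENTRY.unpack_from(data, HEADER.size + i * ENTRY.size))
        for i in range(count)
    ]
    return NotHeader(filetype, header_size, header_id, entries)


def table_bytes(data: bytes, entry: TableEntry) -> bytes:
    """Slice out the raw table that a header entry points to."""
    return data[entry.offset:entry.offset + entry.size]
```

   With three entries, the 12 header bytes counted from offset 0x02 plus
   3 x 10 bytes of table entries come to 42 bytes, that is 0x2A, which
   fits the guess that the header-size word is measured from offset 0x02.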
   
   TNotIPEntry
   Offset  Size  Description
   0x0000  4     IP
   0x0004  1     Count of additional IP addresses (typically 1-23)
   0x0005  *     IP x count
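
   The TNotIPEntry layout translates into a short loop. This is only a
   sketch: it assumes the table has already been decoded and cut out of
   the file without any delimiters, and that the four address bytes are
   stored in the usual dotted-quad order, neither of which is stated
   above. The name parse_ip_entries is hypothetical.

```python
import ipaddress


def parse_ip_entries(table: bytes):
    """Yield (primary_ip, [additional_ips]) for each TNotIPEntry in a raw table."""
    pos = 0
    while pos + 5 <= len(table):
        # Assumption: address bytes appear in network (dotted-quad) order.
        primary = ipaddress.IPv4Address(table[pos:pos + 4])
        count = table[pos + 4]          # one-byte count of additional addresses
        pos += 5
        end = pos + 4 * count
        if end > len(table):
            raise ValueError("truncated IP entry")
        extra = [ipaddress.IPv4Address(table[i:i + 4]) for i in range(pos, end, 4)]
        pos = end
        yield primary, extra
```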
   
   TNotURLEntry
   Offset  Size  Description
   0x0000  4     IP Address
   0x0004  2     Category blocking mask, or 0x0000 to indicate a subrecord
                 follows
   Subrecord
   0x0000  1     Subrecord size
   0x0001  2     Category blocking mask
   0x0003  *     URL hash
   
   In the case where there are one or more subrecords, the list is
   terminated by a zero byte.
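
   A TNotURLEntry with its optional subrecord list could be walked as in
   the following sketch. Two things here are guesses rather than facts
   from the tables: that the fields are little-endian, and that the
   subrecord size byte counts the bytes that follow it (mask plus hash).
   The second guess is exposed as a parameter so the other reading can
   be tried. parse_url_entry is a made-up name.

```python
import struct


def parse_url_entry(buf: bytes, pos: int = 0, size_includes_itself: bool = False):
    """Parse one TNotURLEntry at pos.

    Returns (ip_bytes, mask, subrecords, next_pos), where subrecords is a
    list of (mask, hash_bytes). The list is empty when the entry carries
    its own non-zero mask.
    """
    ip = buf[pos:pos + 4]
    (mask,) = struct.unpack_from("<H", buf, pos + 4)   # assumed little-endian
    pos += 6
    subrecords = []
    if mask == 0:                                      # 0x0000: subrecords follow
        while buf[pos] != 0:                           # a zero byte ends the list
            size = buf[pos]
            # Assumption: size counts the bytes after the size byte itself.
            body_len = size - 1 if size_includes_itself else size
            body = buf[pos + 1:pos + 1 + body_len]
            (sub_mask,) = struct.unpack_from("<H", body, 0)
            subrecords.append((sub_mask, body[2:]))
            pos += 1 + body_len
        pos += 1                                       # skip the terminator
    return ip, mask, subrecords, pos
```

   Calling the function repeatedly with the returned next_pos steps
   through a whole table.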
   
   TNotNewsEntry
   Offset  Size  Description
   0x0000  1     Record size
   0x0001  2     Category blocking mask
   0x0003  *     Newsgroup string
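
   The newsgroup table is the simplest of the three: a run of
   length-prefixed records. The sketch below assumes, without
   confirmation from the table, that the record size counts the bytes
   after the size byte, that the mask is little-endian, and that the
   string is plain 8-bit text. parse_news_entries is a hypothetical
   name.

```python
import struct


def parse_news_entries(table: bytes):
    """Return [(mask, newsgroup_string)] from a raw TNotNewsEntry table."""
    entries = []
    pos = 0
    while pos < len(table):
        size = table[pos]     # assumed: number of bytes after the size byte
        if size < 2 or pos + 1 + size > len(table):
            raise ValueError(f"bad record size {size} at offset {pos}")
        (mask,) = struct.unpack_from("<H", table, pos + 1)
        text = table[pos + 3:pos + 1 + size].decode("latin-1")
        entries.append((mask, text))
        pos += 1 + size
    return entries
```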
   
   Now, for the cyber.yes:
   
   TYesHeader
   Offset  Size  Description
   0x0000  1     Filetype? (0xFB)
   0x0001  2     Count of TYesURLEntries
   
   This is the only record-type of the cyber.yes:
   
   TYesURLEntry
   Offset  Size  Description
   0x0000  4     IP Address
   0x0004  2     Unknown, or 0x0000 to indicate a subrecord follows
   Subrecord
   0x0000  1     Subrecord size
   0x0001  2     Unknown
   0x0003  *     URL as plaintext
   
   Same as for the TNotURL-entries, in the case where there are one or
   more subrecords, the list is terminated by a zero byte.
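
   The three-byte cyber.yes header is small enough to read in one call.
   Note that the description earlier calls the two-byte word the length
   of the URL table while TYesHeader calls it a count of entries; the
   sketch below reads the word without taking sides, and the helper
   split_yes_tables only applies under the "length in bytes" reading.
   Little-endian byte order and both function names are assumptions.

```python
import struct


def parse_yes_header(data: bytes):
    """Return (first_byte, word, url_table_start) for a cyber.yes image."""
    first_byte, word = struct.unpack_from("<BH", data, 0)  # assumed little-endian
    return first_byte, word, 3


def split_yes_tables(data: bytes):
    """Split into (url_table, newsgroup_list), treating the word as a byte length.

    Assumption: this follows the 'length of the URL table' reading only.
    """
    _, length, start = parse_yes_header(data)
    return data[start:start + length], data[start + length:]
```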
   
   
  7 Observations
  
   With all these technical things resolved, let's look at the data
   itself. First a table of statistics pulled from two different CyberNOT
   files:
   
   Cyber Patrol URL Database Statistics
   Bit Category                          1999-04-29  2000-02-20  Change
   0   Violence / Profanity                    1201        1407    +206 (17%)
   1   Partial Nudity                         46538       72236  +25698 (55%)
   2   Full Nudity                            45013       70248  +25235 (56%)
   3   Sexual Acts / Text                     47769       74009  +26240 (54%)
   4   Gross Depictions / Text                 1414        2273    +859 (61%)
   5   Intolerance                              259         337     +78 (30%)
   6   Satanic or Cult                          129         197     +68 (53%)
   7   Drugs / Drug Culture                     197         306    +109 (55%)
   8   Militant / Extremist                     187         204     +17 (9%)
   9   Sex Education                            201         270     +69 (34%)
   A   Questionable / Illegal & Gambling       1347        1928    +581 (43%)
   B   Alcohol & Tobacco                        783        1155    +372 (48%)
   C   Reserved 4                                48           3     -45 (-94%)
   D   Reserved 3                                 0           0       0 (0%)
   E   Reserved 2                                 0           0       0 (0%)
   F   Reserved 1                                 0           0       0 (0%)
       Total URL masks                        52315       79899  +27584 (52%)
   
   We can see that of the roughly 80000 entries, about 90% fall into one
   or more of the pornography categories. The Learning Company have a
   page on their site describing their criteria for categorizing entries.
   At the end it states: "Note: Web sites which post "Adult Only" warning
   banners advising that minors are not allowed to access material on the
   site are automatically added to the CyberNOT list in their appropriate
   category.". This may give the impression that sites are automagically
   added as soon as they appear on the web, which certainly isn't the
   case. They are most probably using a web spider to pick these up.
   These spidered sites probably make up the bulk of the URLs flagged in
   all of categories 1, 2 and 3, which is the dominant set of flags by
   far. By monitoring these statistics for a longer period of time one
   could deduce how effective the spider is in finding new sites. The
   oldest cyber.not we have available is dated 1999-04-29. By comparison
   it contains only 52315 entries, but the ratio of "porn" rated sites is
   the same, about 89%, with 46538, 45013 and 47769 entries flagged for
   categories one, two and three respectively. Most of the other
   categories are up by between a hundred and three hundred entries, but
   the porn categories, suspected mostly to consist of spidered sites,
   are up by about 25000 entries each for the period (about 42 weeks).
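
   The arithmetic behind the Change column and the length of the period
   is easy to recompute. The following sketch uses a handful of rows
   copied from the statistics table; the function names change and
   weeks_between are ours, and percentages are taken relative to the
   older count.

```python
from datetime import date

# (1999-04-29 count, 2000-02-20 count), copied from the statistics table.
COUNTS = {
    "Violence / Profanity": (1201, 1407),
    "Partial Nudity": (46538, 72236),
    "Full Nudity": (45013, 70248),
    "Sexual Acts / Text": (47769, 74009),
    "Reserved 4": (48, 3),
    "Total URL masks": (52315, 79899),
}


def change(old: int, new: int):
    """Return (absolute change, percent change relative to the old count)."""
    delta = new - old
    return delta, 100.0 * delta / old


def weeks_between(start: date, end: date) -> float:
    """Elapsed time between two file dates, in weeks."""
    return (end - start).days / 7


for name, (old, new) in COUNTS.items():
    delta, pct = change(old, new)
    print(f"{name:22} {old:6} {new:6} {delta:+6} ({pct:+.0f}%)")

print(f"{weeks_between(date(1999, 4, 29), date(2000, 2, 20)):.1f} weeks")
```

   The two file dates are 297 days apart, a little over 42 weeks, and
   the drop in Reserved 4 from 48 to 3 entries is a change of about
   -94%.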
   
   There is a function in CP where a user can use a form to report new
   URLs for consideration of inclusion into the CyberNOT. It would be
   interesting to know how many of the URLs added come in this way. It
   would be possible for users to team up and exchange URLs on their own,
   bypassing The Learning Company, which is charging for these CyberNOT
   updates. By patching the CP executable it could be made so that this
   report form is posted to another server, which could also host updated
   CyberNOT lists. It would take a little work to set up, but not too
   much. The most difficult aspect would probably be to reach out to
   active Cyber Patrol users and convince them that this would be
   worthwhile, especially since it would need a certain amount of
   momentum to be of any use at all. With this threat, it's logical to
   assume that The Learning Company and other censorware vendors will use
   even more security-through-obscurity in future products, to deter the
   threat of having one of their sources of income bypassed.
   
   Near the start of this essay we mentioned the "reserved" blocking
   categories. Cyber Patrol, in addition to the twelve documented
   blocking categories, has an additional four (labelled "Reserved 1"
   through "Reserved 4") which are greyed out. Reserved 3 and Reserved 4
   are selected by default, and so cannot be disabled - even by the
   administrator.
   
   Any sites placed in one of those two categories will be blocked no
   matter what. We found three examples on the now current CyberNOT list.
   All three are in Japanese. They were each blocked in Reserved 4 and no
   other categories; we could not find any examples of blocks on other
   reserved categories.
     * http://133.205.62.133/~coga/, which appears to say something like
       "This domain has moved".
     * http://202.26.1.170/~mcqueen/, which is mostly in Japanese but
       includes the English text "The page you requested was not found".
     * Tsutomu Notani's home page, which based on the pictures appears to
       include some content about horse racing, and thus (presumably)
       gambling. No other blockable content is immediately apparent.
       
   There are a few entries in the CyberNOT list that are blocked under
   all non-reserved categories. For instance, the anti-censorware site of
   Peacefire is listed as containing "Violence / Profanity, Partial
   Nudity, Full Nudity, Sexual Acts / Text, Gross Depictions / Text,
   Intolerance, Satanic or Cult, Drugs / Drug Culture, Militant /
   Extremist, Sex Education, Questionable / Illegal & Gambling, Alcohol &
   Tobacco". That's not such a surprise; blocking Peacefire has become
   traditional among censorware manufacturers.
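
   The category lists quoted in this section are just a 16-bit blocking
   mask spelled out. As a sketch, assuming the Bit column of the
   statistics table maps directly onto bit positions in the mask (which
   agrees with 0x000E meaning the three porn categories), a mask can be
   turned into names like this. decode_mask and the two constants are
   names invented for the example.

```python
# Index in this list = bit number from the statistics table.
CATEGORIES = [
    "Violence / Profanity",               # bit 0
    "Partial Nudity",
    "Full Nudity",
    "Sexual Acts / Text",
    "Gross Depictions / Text",
    "Intolerance",
    "Satanic or Cult",
    "Drugs / Drug Culture",
    "Militant / Extremist",
    "Sex Education",                      # bit 9
    "Questionable / Illegal & Gambling",
    "Alcohol & Tobacco",
    "Reserved 4",                         # bit 0xC
    "Reserved 3",
    "Reserved 2",
    "Reserved 1",                         # bit 0xF
]


def decode_mask(mask: int) -> list:
    """Return the category names whose bits are set in a blocking mask."""
    return [name for bit, name in enumerate(CATEGORIES) if mask & (1 << bit)]


ALL_DOCUMENTED = 0x0FFF                       # all twelve non-reserved categories
ALL_BUT_SEX_ED = ALL_DOCUMENTED & ~(1 << 9)   # "everything except sex education"

print(decode_mask(0x000E))
print(hex(ALL_BUT_SEX_ED))
```

   Under the same assumption, the frequently seen combination "Violence
   / Profanity, Militant / Extremist, Questionable / Illegal & Gambling"
   would correspond to the mask 0x0501.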
   
   The other sites blocked under all categories seem to be translation
   and anonymizer services; any site where you can type in a URL and it
   will present you a copy of that page. That's probably no big surprise
   either, because such sites can be used to circumvent censorware. So it
   may be reasonable that sites like anonymizer.com should be blocked
   under all categories; potentially, they do make available the entire
   range of human thought. Not all these blocks are carefully applied,
   however; the "STOP KITTY PORN" page (which features a picture of a
   very bored-looking house cat) is blocked under all categories
   apparently just for containing a link to anonymizer.com. Here, as
   elsewhere, the blocking list doesn't seem to be updated very
   frequently. The server at 207.55.200.2 (whose reverse-DNS resolves to
   "www.live4u.com", although that doesn't resolve in the forward
   direction) seems to be an ordinary portal site, with no obvious
   translation service, but it's blocked for everything except sex
   education.
   
   Of course, the most interesting things we could find on the blocking
   list would be sites about political or social issues. Other censorware
   packages have gotten in a lot of trouble, for instance, by blocking
   sites like the National Organization for Women, and a great many gay
   and lesbian sites. The CyberNOT list seems relatively free of that
   kind of political agenda, which could be a good or a bad thing
   depending on your point of view. If the software is to be installed in
   public libraries, it's good that it won't block these
   politically-important sites. Of course, it would be better if it
   didn't block any sites at all. On the other hand, if you were a parent
   who considered feminism or homosexuality to be unimaginably horrid
   subjects, then you might feel ripped off by Cyber Patrol's not
   blocking the high-profile sites.
   
   Let's take a closer look at the "Intolerance" category. While they do
   block smaller sites, such as this one on atheism, which we feel is
   relatively benign, they also block such a high-profile site as
   www.godhatesfags.com and part of the American Family Association,
   whose views on homosexuality cannot be described as anything but
   intolerant. The AFA is one organization pushing for the installation
   of censorware in US libraries. One can only assume they'd prefer one
   of Cyber Patrol's competitors.
   
   Some other sites in this category:
     * Matthew R. Galloway's homepage. Contains the word "Voodoo" in a
       reference to voodoo-cycles.com, and a pretty famous joke file
       entitled Top 10 Reasons Why Beer Is Better Than Jesus. No #1 being
       "If you've devoted your life to Beer, there are groups to help you
       stop.", BTW.
     * Misha Verbitsky's old homepage. Seems perfectly ordinary. Some
       papers, a couple of usenet archives. Note that this page was
       frozen several years back, so whatever it was censored for is
       still there.
     * Church of the SubGenius. Banned in every category except sex-ed.
       The Church is a spoof of fundamentalist Christianity, consumer
       culture, and other things.
     * joc.mit.edu/cornell/. This link is for the archive containing
       files relevant to:
       
     The Justice on Campus Project's mission is to preserve free
     expression and due process rights at universities. Our online
     archive includes reports on disciplinary charges, speech codes, and
     censorship on college campuses around the country. The Project was
     one of 20 plaintiffs in the ACLU's successful challenge of the
     Communications Decency Act.
       How very intolerant of them to be working for free speech, huh?
       
   How about some examples from the category "Satanic or Cult"?
     * Mega's Metal Asylum. Miika "Mega" Kuusinen's page of Metal music.
       Articles, links. Perfectly ordinary. Tagged as militant, too.
       Well, we all know how metal music is the devil's work.
     * This site contains nothing but the text "Welcome!". If that's
       enough to be branded a "Satanist", we can expect a rapid growth in
       bans. If nothing else, this is another example of how the bans
       grow outdated as time goes by, but The Learning Company doesn't
       seem to care much.
     * webdevils.com - "Experiments with sound", a site which has nothing
       to do with religion, or lack of it. Guess the hostname was enough
       in this case.
       
   There is one political issue the CyberNOT list doesn't shy away from:
   that of nuclear disarmament. All sites relating in any way to war,
   bombs, explosives, or fireworks, both for and against, seem to be
   eligible for blocking as "Militant / Extremist". Most are also classed
   as "Violence / Profanity" and "Questionable / Illegal & Gambling",
   whether those categories seem to apply or not. For instance:
     * The Nuclear Control Institute. From the blocked page:
       
     Founded in 1981, the Nuclear Control Institute (NCI) is an
     independent research and advocacy center specializing in problems
     of nuclear proliferation. Non-partisan and non-profit, we monitor
     nuclear activities worldwide and pursue strategies to halt the
     spread and reverse the growth of nuclear arms. No Bomb! In
     particular, we focus on the urgency of eliminating atom-bomb
     materials ---plutonium and highly enriched uranium---from civilian
     nuclear power and research programs.
       Is that an extremist position?
     * A personal site including a lot of different material, apparently
       blocked for something called "The Nazism Exposed Project". From
       the blocked page:
       
     Nazism, fascism and extreme nationalism are today at its highest
     peak since the destruction of Hitler's dictatorship in 1945. Today,
     all over the world, fascists and extreme nationalists win millions
     of votes on their simple racist solutions to very complex problems
     of the society. In the streets, Nazi boneheads are spreading fear
     by using murderous violence and terror. These fascist groups blame
     the cultural and ethnic minorities for the problems in our society.
     These individuals, and their political leaders, are a threat to our
     democracy, and to everything that is decent.
       Blocked as "Violence / Profanity, Militant / Extremist,
       Questionable / Illegal & Gambling".
     * Anti-nuclear-bomb articles from the Tri-City Herald newspaper,
       blocked as "Violence / Profanity, Militant / Extremist,
       Questionable / Illegal & Gambling".
     * One page in this directory (URL hash not fully reversed) on the
       City of Hiroshima Web site, blocked as "Violence / Profanity,
       Militant / Extremist, Questionable / Illegal & Gambling".
     * Jim Lippard's home page, which contains some anti-Scientology
       material and a link (not text) to this Salon article about the
       Littleton shootings, which everyone ought to read.
     * Cheesehead Central, a personal home page, which contains a few
       links relating to fireworks displays and therefore, apparently,
       qualifies as "Violence / Profanity, Militant / Extremist,
       Questionable / Illegal & Gambling".
     * The former location of the American Airpower Heritage Museum - an
       apparently-legitimate museum of US combat aircraft. Blocked as
       "Violence / Profanity, Militant / Extremist, Questionable /
       Illegal & Gambling".
       
   Some sites that may be blockable under a few categories are also
   blocked under a great many other categories. For instance:
     * Teen Babe of the Month; it's a porn site, but it appears to be a
       perfectly ordinary porn site. Blocked under all categories except
       sex education.
     * http://www.xs4all.net/~stones/, a link (not the actual site
       itself) pointing at a warez search engine. That would presumably
       qualify as "Questionable / Illegal", but it's flagged for
       everything except sex education.
     * http://www.danland.engelholm.se/, a personal home page. Some
       content relating to warez, but nothing else blockworthy is
       immediately apparent. Blocked for everything except sex education.
     * The Marston Family Home Page, with the usual round of pictures of
       Mom, Dad, the kids, the dog, etc. Entire directory blocked for
       "Militant / Extremist, Questionable / Illegal & Gambling",
       apparently just because of this paragraph in young Prescott's
       section:
       
     In school they teach me about this thing called the Constitution
     but I guess the teachers must have been lying because this new law
     the Communications Decency Act totally defys [sic] all that the
     Constitution was. Fight the system, take the power back, WAKE
     UP!!!!!
       You go, boy.
       
   It is obvious on examining the list that many entries haven't been
   updated or checked in a long time. Many sites that are blocked now
   give 404 not found errors, or redirects to new locations that are not
   blocked. Changes to Web sites may also account for some of the
   inappropriate category labelling. Here are some samples of sites that
   seem inadequately reviewed:
     * an empty page blocked in all categories except sex education, and
       a 404 not found page blocked in all categories including sex
       education. There are many others like these.
     * A student home page at utexas.edu, blocked for "Violence /
       Profanity, Partial Nudity, Full Nudity, Sexual Acts / Text,
       Militant / Extremist, Questionable / Illegal & Gambling" content.
       It consists mostly of (clothed) photos of the author's baby son,
       with no blockable content immediately apparent.
     * Another student home page at imsa.edu, blocked as "Violence /
       Profanity, Militant / Extremist, Questionable / Illegal &
       Gambling". Consists solely of a link to the author's resume, which
       is perfectly ordinary.
     * A personal home page at world.std.com. The part about his wife is
       nauseatingly sweet, but doesn't really fit most people's
       definitions of "Gross Depictions / Text, Militant / Extremist,
       Questionable / Illegal & Gambling", which is what it's blocked
       for.
     * A sheet-music publisher, blocked as "Violence / Profanity,
       Militant / Extremist, Questionable / Illegal & Gambling" for no
       apparent reason.
       
   These are just a few examples of sites that Cyber Patrol is banning,
   or was. It is not unthinkable that they might lift a few after this is
   published. We've only scratched the surface as far as checking on the
   sites that are banned. Going through even a few hundred takes a lot of
   time, and with almost 80,000 bans in effect, the work required to
   check them all would be enormous. We don't have time to do it, but
   since The Learning Company is making money from the supposed
   correctness of the list, they ought to be able to find resources to
   check the list from time to time.
   
   We know they are banning 80,000 or so URLs, but most censorware
   packages also have a database of words that are not allowed to exist
   in incoming pages, because it's the only way to really approach being
   effective in banning new pages on the ever evolving and growing
   Internet. Cyber Patrol doesn't do that, and so its IP and URL bans are
   its only real line of defence. If you can find a site that The
   Learning Company have not, then there's very little stopping you from
   browsing it. There is the function that can filter a site based on
   substrings in the URL itself, but that is it.
   
   Cyber Patrol is actually fairly efficient in blocking sites if you
   don't know how to search effectively. If you simply search one of the
   major search engines then you will probably draw a blank, because it's
   very likely that that is the exact kind of search used by The Learning
   Company to bait their web-spiders. However, finding a few pages with
   obscene banners and thumbnail pictures is no big problem. We could
   locate this one and this one in short order. One somewhat effective
   method is to search for non-English language pages. The spider might
   not be effective in locating and parsing these for automatic inclusion
   in the CyberNOT. You could for instance look for a Swedish site, and
   locate www.smygis.com, which is not - as this is written - blocked in
   any way. If you really want porn, Cyber Patrol might slow you down a
   little, but it won't cut you off entirely.
   
   
    7.1 Rogue deinstallation
    
   Apart from checking for "unauthorized" modifications to cyberp.ini,
   CP's "advanced anti-hacker security" consists of a new
   %windir%\system\system.drv that checks for the existence of the
   modules PROGIC, PROGICS and TS. These are represented by the files
   IC.EXE, ICFIRE.EXE and TS.DLL, all in the %windir%. The original
   system.drv is cleverly hidden away as %windir%\system.386.
   
   The modules are loaded in two ways: first there is a load entry in the
   win.ini file, and second, there's an entry in the registry at
   HKCU\Software\Microsoft\Windows\CurrentVersion\Run called
   "FltProcess", which will load %windir%\system\msinet.exe, which in
   turn will load the Cyber Patrol modules. After replacing system.drv
   (the CP version will halt the loading of Windows if it doesn't find
   its modules, and ask you to call their support number), you can
   safely do away with the registry entry, the load-key in the win.ini
   and any of the numerous binaries. Because of the many files CP
   installs to your system, we suggest you use the normal uninstaller
   instead. Not that it does a very good job of removing its system
   files, but there you go.
   
   Optionally, if you come across an installation running unregistered,
   you can use the backdoor password omed to uninstall, or simply to gain
   administrator access.
   
   
  8 Source and binaries
  
   We have developed a set of software for getting around Cyber Patrol.
   People oppressed by Cyber Patrol will want to take a look at CPHack, a
   Win32 binary which will decode the userlist for you, and also let you
   browse the different banlists.
   
   Also available is C source for two command-line programs illustrating
   the cryptographic attacks on cyber.not (cndecode.c) and the HQ
   password hash (cph1_rev.c). These programs were written under Linux
   and are not guaranteed to work anywhere else.
   
   A complete package with this essay, the binaries, and various sources
   and related files is available as cp4break.zip (~360Kb).
   
   
    8.1 CPHack documentation
    
   This tool is not particularly hard to use, but some comment on its use
   could be in order. First of all the author would like to state that
   this is a hack(1), which is reflected both in the state of the source
   and the user interface. The basic functionality is to let you load and
   browse the information of a Cyber Patrol .not file and/or the user
   information contained in a cyberp.ini file. Simply select which you
   want to load using the file menu. Also in the file menu are functions
   for importing and exporting hosts. By importing hosts you are reading
   a text file containing lines of IPs and their corresponding hostnames
   into the treeviews. Export, of course, does the opposite.
   
   Continuing we have the functions "Export dictionary" which will
   traverse the treeviews and write out all words that have been assigned
   to URL-hashes. "Export unresolved IPs" does just that; it could be
   used to distribute the work of doing reverse-lookups. The final export
   function is "Export URL hashes", which will export any hash that has
   not been assigned a word, the logical inverse of the "Export
   dictionary" function.
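
   Distributing the reverse-lookup work could look something like the
   sketch below. It is not taken from CPHack: the function names are
   invented, the "IP hostname" line format is only a guess at what the
   import and export functions use, and the resolver is passed in as a
   parameter so the code can be tried without network access.

```python
import socket


def reverse_lookup(ips, resolver=socket.gethostbyaddr):
    """Split ips into ({ip: hostname}, [unresolved_ips])."""
    resolved, unresolved = {}, []
    for ip in ips:
        try:
            hostname, _aliases, _addrs = resolver(ip)
        except OSError:      # socket.herror and socket.gaierror are subclasses
            unresolved.append(ip)
        else:
            resolved[ip] = hostname
    return resolved, unresolved


def format_hosts(resolved):
    """Render one 'IP hostname' line per entry (assumed file layout)."""
    return "".join(f"{ip} {host}\n" for ip, host in sorted(resolved.items()))


def chunk(items, n):
    """Split a work list into n roughly equal shares, one per volunteer."""
    return [items[i::n] for i in range(n)]
```

   Each volunteer would run reverse_lookup on one share of the exported
   list and send back the formatted result for import.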
   
   Maybe the most useful functions are the last ones: "Generate report",
   which will output an HTML document reflecting the data you have loaded
   (be sure to check out the "Configuration" tab before doing that
   though), and the somewhat mysterious "Cull dictionary by hash". The
   last function will take the main dictionary (as defined in the
   configuration tab), and create a new dictionary containing only the
   words with hashes contained in a .not file you have loaded. A bit of
   explanation on this: It was thought by the author that a lazy
   dictionary attack would be enough. This lazy approach is what you get
   if you select one of the attacks available by right-clicking a node.
   However, this proved quite slow when used with large dictionaries
   (15Mb or so), as it only looks at one URL at a time.
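
   The difference between the lazy attack and culling can be sketched
   in a few lines. This is an illustration only: placeholder_hash is a
   stand-in built on CRC-32 and is NOT Cyber Patrol's URL hash, and the
   function names are made up.

```python
import zlib


def placeholder_hash(word: str) -> int:
    # Stand-in only: this is not the hash used in the CyberNOT list.
    return zlib.crc32(word.encode("latin-1")) & 0xFFFFFFFF


def lazy_attack(target_hash, dictionary, hash_fn=placeholder_hash):
    """Per-node attack: scan the whole dictionary for a single hash."""
    for word in dictionary:
        if hash_fn(word) == target_hash:
            return word
    return None


def cull_dictionary(dictionary, database_hashes, hash_fn=placeholder_hash):
    """Keep only the words whose hash occurs somewhere in the database."""
    wanted = set(database_hashes)
    return [word for word in dictionary if hash_fn(word) in wanted]
```

   Culling hashes every dictionary word once and does a set lookup,
   while the lazy attack rescans the dictionary for each node; running
   the lazy attack against the culled list is therefore much cheaper.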
   
   The problem here is that CPHack will try - for each node - lots of
   words from the dictionary with hashes that don't exist in the
   database at all. As a quick hack on the hack, this function was
   implemented, which will take all the hashes in the database and attack
   them all at once. The downside is that no references are kept as to
   which exact nodes the found hashes belong to, so you will only get a
   new optimized dictionary to use in the lazy attack; you won't get an
   instant update to the treeview. While desirable, it would take too
   much time and effort - at this point - to implement correctly. A good
   implementation would traverse the nodes you have selected, creating an
   ordered list of unique hashes, attached to which would be lists of all
   associated nodes. When the hash of a word is found in this ordered
   list of hashes, the correct chain of tree nodes could be quickly
   traversed and nodes updated to reflect the hit. Until this is fixed,
   you should cull the dictionary first, and use the output with the lazy
   attack, to "assign" all words into the database.
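
   The "good implementation" described above might be sketched as
   follows. For brevity a Python dict stands in for the ordered list of
   unique hashes; both give a fast lookup from a hash to its nodes.
   Nodes are plain dicts here, the names are invented, and
   placeholder_hash is again a CRC-32 stand-in, not the real URL hash.
   The lock_found flag mimics the "Lock found URLs" option.

```python
import zlib
from collections import defaultdict


def placeholder_hash(word: str) -> int:
    # Stand-in only: this is not the hash used in the CyberNOT list.
    return zlib.crc32(word.encode("latin-1")) & 0xFFFFFFFF


def build_index(nodes):
    """Map each unique hash to the list of nodes that carry it."""
    index = defaultdict(list)
    for node in nodes:
        index[node["hash"]].append(node)
    return index


def attack_all(dictionary, index, hash_fn=placeholder_hash, lock_found=True):
    """One pass over the dictionary; every hit updates all matching nodes."""
    hits = 0
    for word in dictionary:
        for node in index.get(hash_fn(word), ()):
            if lock_found and node.get("word") is not None:
                continue              # keep the first word that matched
            node["word"] = word
            hits += 1
    return hits
```

   Each dictionary word is hashed exactly once, and a hit reaches every
   node sharing that hash without another scan of the tree.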
   
   The main interface contains the five sections "Users", "Newsgroups",
   "URL database", "IP Aliasing" and "Configuration". A quick rundown
   follows.
   
   If you load a cyberp.ini the "Users" tab will display the names and
   passwords of the users therein, including the passwords of the innate
   administrator and deputy accounts.
   
   After loading a CyberNOT file, the "Newsgroups" tab will display all
   filters defined therein. To the right is a panel of checkboxes which
   you cannot operate, but which will reflect the masks applied to the
   newsgroup entry you select in the listview.
   
   Next we have the "URL database" tab, which contains a treeview where
   you can browse the database. It should be noted that the relatively
   long loading time of a CyberNOT file is due to the way the treeview
   works, with insertion into a branch - apparently - being O(n) and not
   about O(1) in regard to the number of siblings of a new node. Anyway,
   you can browse the view in the normal manner of things. There are
   three different types of nodes, the first being called internally a
   "net node". This is simply a root node containing all entries for IPs
   of an "A net". Below these are "IP nodes" which are the IPs that are
   banned by the database. Some of these have children of their own,
   being "URL nodes" which contain the hashes of specific paths and
   resources being banned. You can right-click on any one of these three
   types of nodes
   for additional context sensitive functionality, such as "Open",
   "Lookup" and "Dictionary attack". As with the newsgroups tab, there is
   a panel of checkboxes which will reflect the masking status of the IP
   or URL you select. At the bottom is a quick search bar where you can
   do case sensitive string searches.
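
   The three-level tree can be modelled with nested dictionaries, as in
   this sketch. It assumes that an "A net" simply means the first octet
   of the address, which the text does not spell out, and build_tree is
   a hypothetical name rather than anything from the CPHack source.

```python
def build_tree(entries):
    """Group (ip_string, [url_hashes]) pairs as {net: {ip: [hashes]}}."""
    tree = {}
    for ip, hashes in entries:
        net = ip.split(".")[0]      # assumed meaning of "A net": first octet
        tree.setdefault(net, {}).setdefault(ip, []).extend(hashes)
    return tree
```

   Dictionary insertion does not depend on the number of siblings
   already present, in contrast to the O(n) behaviour observed with the
   treeview control.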
   
   There's not much to say about the "IP Aliasing" tab, but here too you
   can right-click for additional functionality.
   
   Finally we have the configuration tab where you define the different
   dictionaries you want to use, and a number of other things which are
   self-explanatory, except maybe for the "Lock found URLs". This
   function, if enabled, makes sure that once a word has been found to
   match a hash and been attached to it in the treeview, then it will
   never get replaced even if another possible candidate is found.
   
   This program is entirely self contained. It will not write to the
   registry, and it will not create files anywhere but in its own
   path, unless you say it can.
   
   The source is included, and you can do whatever you want with it.
   
   
  9 Conclusion
  
   On the good side, we note that Cyber Patrol is - technically -
   somewhat better than NetNanny and CyberSitter, the two other
   censorware packages we have intimate knowledge of, but there is still
   far too much 16-bit code for it to be really stable and earn a good
   grade.
   
   We see no evidence of a clear political or religious agenda behind
   Cyber Patrol, though as citizens of highly secularized countries we
   might feel that many of the bans in the "Satanic or Cult" category are
   unreasonable. Their criteria document says "Satanic material is
   defined as: Pictures or text advocating devil worship, an affinity for
   evil, or wickedness." and "A cult is defined as: A closed society
   [...] Common elements may include: [...] influences that tend to
   compromise the personal exercise of free will and critical thinking."
   LaVey Satanism - for instance - isn't about any of the things in the
   full definition, and atheism certainly isn't, but such sites are
   included in the CyberNOT.
   
   The evidence points to the CyberNOT list not being properly updated to
   remove old and outdated entries. As many as 50% of the IPs in the list
   don't even resolve! When evaluating a product with a ban list, you
   should not look at the number of entries, but the number of current
   entries. Simply collecting new entries, and using the ever growing
   (but outdated) list of bans as an argument in the sales game, is much
   easier than actually putting in work to ensure the list is up to date
   and accurate.
   
   The old classic tactic of entering critics into the banlist continues,
   with the banning of Peacefire in almost every category available. When
   the producers are knowingly banning a site in clearly the wrong
   categories, then what kind of trust can you put in them and their
   products? None. We must continue to reverse-engineer these products so
   that consumer rights can be protected. Will we ever find a censorware
   company who are not lying to us with these false bans?
   
   The absence of filtering based on content keywords is surprising, but
   welcome. The technology does not exist to make content-based filtering
   really functional. The problem of recognizing content and making
   choices based on context is a hard one, suitable for research by the
   AI-labs. But it is a two-edged sword. The price of leaving this
   error-prone functionality out is that it makes Cyber Patrol less
   effective in blocking pages not previously processed by The Learning
   Company.
   
   After all this, the feeling is that CP is just another censorware
   package. It tries hard to come across as effective - the magical
   technical solution to a non-technical problem - but when push comes to
   shove, it yields to the power of the human mind. If you thought
   putting this between your children and the Internet would protect them
   from "dangerous" ideas, then you'd better think again.
   
   
    9.1 Thanks
    
   We would like to thank all the fine men and women working for civil
   liberties all over the world.
   
   Matthew would like to thank: the goddess Pele for favours received,
   and the Canadian government for supporting my cryptographic interests
   in several ways. Greetings to all the people I hang out with in
   sci.crypt, alt.kids-talk, talk.bizarre, and the VLUG and Voynich
   mailing lists.
   
   Eddy would like to thank: Robert Risberg, Kristoffer Andergrim,
   Mattias Aspman, Gunnar Rettne, and all of my friends around the world.
   Special regards to all the intelligent, knowledgeable and humorous
   folks of R20 of the Fidonet - you know who you are.
   
   All cryptanalysis done by Matthew Skala. Reverse Engineering done by
   Eddy L O Jansson and Matthew Skala. Feel free to contact the authors
   with your comments and/or questions.
   
   This essay was first published at Eddy's homepage on 2000-03-11. You'll
   find Matthew's homepage here.
   
   You are allowed to mirror this document and the related files anywhere
   you see fit.
   
   
  10 References
  
   [DFR98] Saruman and Bobban, "The Penetration of CyberSitter'97", Apr
   1998.
   [DFR99] Saruman, "The Reversal of NetNanny", Aug 1999.
   [ACLU96] American Civil Liberties Union, "FCC V. Pacifica Foundation",
   1996.
   [RNW93] Ross N. Williams, "A painless guide to CRC error detection
   algorithms", Aug 1993.
   [JRG00] Raphael Finkel, Eric S. Raymond, et al., "The on-line hacker
   Jargon File, version 4.2.0", Jan 2000.
   
   (c)2000 Eddy L O Jansson and Matthew Skala. All rights reserved. All
   trademarks acknowledged.

[END]

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: [email protected] and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: [email protected]