<nettime> CyberWire Dispatch: Semantic Forests

nettime's_roving_reporter on Thu, 2 Dec 1999 05:27:31 +0100 (CET)
To: [email protected]
Subject: <nettime> CyberWire Dispatch: Semantic Forests
From: "nettime's_roving_reporter" <[email protected]>
Date: Wed, 1 Dec 1999 18:59:21 -0500
Sender: [email protected]

CyberWire Dispatch // (c) Copyright 1999 // November 30

Jacking in from the "Sticks and Stones" Port:

By Suelette Dreyfus
Special Correspondent
CyberWire Dispatch

"Semantic Forests" doesn't mean much to the average person. But if
you say it in concert with the words "automatic voice telephone
interception" and "U.S. National Security Agency" to a
computational linguist, you might just witness the physical
manifestations of the word "fear."

Words are funny things, often so imprecise. Two people can have a
telephone conversation about sex, without ever mentioning the
word. And when the artist formerly known as Prince sang a song
about "cream," he wasn't talking about a dairy product.

All this linguistic imprecision has largely protected our voice
conversations from the prying ears of governments. Until now.

Or, more particularly, it protected us until 15 April, 1997 - the
date the NSA lodged a secret patent application at the US Patent
Office. Of course, the content of the NSA patent was not made
public for two years, since the Patent Office keeps patent
applications secret until they are approved, which in this case
was August 10, 1999.

What is so worrying about patent number 5,937,422?  The NSA is
believed to be the largest and by far the best-funded spy agency
in the world, a Microsoft of Spookdom. This document provides the
first hard evidence that the NSA appears to be well on its way to
creating eavesdropping software capable of listening to millions
of international telephone calls a day.  Automatically.

Patents are sometimes simply ambit claims, legal handcuffs on what
often amounts to little more than theory. Not in this case. This
is real. The U.S.  Department of Defense has developed the NSA's
patent ideas into a real software program, called "Semantic
Forests," which it has been lab testing for at least two years.

Two important reports to the European Parliament, in 1998 and
1999, and Nicky Hager's 1996 book "Secret Power" reveal that the
NSA intercepts international faxes and emails. At the time, this
revelation upset a great number of people, no doubt including the
European companies which lost competitive tenders to American
corporations not long after the NSA found its post-Cold War "new
economy" calling: economic espionage.

Voice telephone calls, however, well, that is another story. Not
even the world's most technically advanced spy agency has the
ability to do massive telephone interception and automatically
massage the content looking for particular words, and presumably
topics. Or so said a comprehensive recent report to the European
Parliament.

In April 1999, a report commissioned by the Parliament's Office of
Scientific and Technological Options Assessment (STOA), concluded
that "effective voice 'wordspotting' systems do not exist" and
"are not in use".

The tricky bit there is "do not exist". Maybe these systems
haven't been deployed en masse, but it is looking increasingly
like they do actually exist, probably in a form closer to the
more powerful topic spotting.

Do The Math
============

There are two new pieces of evidence to support this, and added
together, they raise some fairly explosive questions about exactly
what the NSA is doing with the millions of international phone
calls it intercepts every day in its electronic eavesdropping web
commonly known as Echelon.

First. The NSA's shiny new patent describes a method of
"automatically generating a topic description for text and sorting
text by topic." Sound like a sophisticated web search engine?
That's because it is.

This is a search engine designed to trawl through "machine
transcribed speech," in the words of the patent application. Think
computers automatically typing up words falling from human lips.
Now think of a powerful search engine trawling through those
words.

Now sweat...

Maybe the spy agency only wants to transcribe the BBC Radio World
News, but I don't think so. The patent contains a few more
linguistic clues about the NSA's intent - little golden Easter
eggs buried in the legal long grass.  The "Background to the
Invention" section of every patent application is the place where
the intellectual property lawyers desperately try to wave away
everyone else's right to claim anything even remotely touching on
the patent.

In this section, the NSA attorneys observed there has been
"growing interest" in automatically identifying topics in
"unconstrained speech."

Only a lawyer could make talking sound so painful. "Unconstrained
speech" means human conversation. Maybe it's been "unconstrained"
by the likelihood of being automatically transcribed for real time
topic searching.

Here's the part where the imprecision of words - particularly
spoken words - comes in. Machine transcribed conversations are
raw, and very hard to analyze automatically with software. Many
experts thought the NSA couldn't go driftnet fishing in the
content of everyone's international phone calls because the
technology to transcribe and analyze those calls was too young.

However, if the NSA didn't have the technology to do automatic
transcription of speech, why would it have patented a sifting
method  which, by its very own words, is aimed at transcripts of
human speech?

As Australian cryptographer Julian Assange, who discovered the
DoD and patent papers while investigating NSA capabilities,
observed: "Why make tires if you don't have a car? Maybe we
haven't seen the car yet, but we can infer that it exists by all
the tires and roads."

One of the top American cryptographers, Bruce Schneier, also
believes the NSA already has machine transcription capability.
"One of the Holy Grails of the NSA is the ability to automatically
search through voice traffic," Schneier said.  "They would have
expended considerable effort on this capability, and this research
indicates at least some of it has been fruitful."

Second, two Department of Defense academic papers show the U.S.
developed a real  software program, called "Semantic Forests," to
implement the patented method.

Published as part of the Text REtrieval Conference (TREC) in 1997
and 1998, the Semantic Forest papers show the program has one main
purpose: "performing retrieval on the output of automatic
speech-to-text (speech recognition) systems."  In other words, the
U.S. built this software *specifically* to sift through
computer-transcribed human speech.

If that doesn't send a chill down your spine, read on.

The DoD's second prime purpose for Semantic Forests was to
"explore rapid prototyping" of this information retrieval system.
That statement was written in 1997.

There's also an unambiguous link between Semantic Forests and the
NSA patent. It's human, and its name is Patrick Schone.

Schone appears on the NSA patent documents as an inventor and on
the Semantic Forests papers as an author, and he works at Ft.
Meade, NSA's headquarters.

Specifically, he works in the DoD's "Speech Research Branch" which
just happens to be located at, you guessed it, Ft. Meade.

Very Clever Fish
================

The NSA and the DoD refused to comment on the patent and Semantic
Forests, respectively. Not surprising, really, but no matter, since
the Semantic Forest papers speak for themselves. The papers reveal
a software program which, while somewhat raw a year ago, was
advancing quickly in its ability to fish relevant data out of
various document pools, including those based on speech.

For example, in one set of tests, the scientists increased the
average precision rate for finding relevant documents per query
from 19% to 27% in just one year, from 1997 to 1998. Tests in 1998
on another set of documents, the "Spoken Document Retrieval"
pool, turned up similar figures of around 20-23%. The team
also discovered that a little hand-fiddling in the software reaped
large rewards.

According to the 1998 TREC paper: "When we supplemented the topic
lists for all the queries (by hand) to contain additional words
from the relevant documents, our average precision at the number
of relevant documents went from 28% to 50%."
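The measure quoted here, "average precision at the number of
relevant documents," reads like what the retrieval literature
calls R-precision: if a query has R relevant documents, look at
the top R results the system returns and ask what fraction of
them are relevant, then average that over all queries. A minimal
Python sketch of the arithmetic follows. The function names and
the toy document IDs are made up for illustration, and none of
this is taken from the Semantic Forests code.

```python
def r_precision(ranked_doc_ids, relevant_doc_ids):
    """Precision over the top R results, where R = number of relevant docs."""
    relevant = set(relevant_doc_ids)
    r = len(relevant)
    if r == 0:
        return 0.0
    top_r = ranked_doc_ids[:r]
    hits = sum(1 for doc_id in top_r if doc_id in relevant)
    return hits / r


def mean_r_precision(runs):
    """Average R-precision over (ranking, relevant set) pairs, one per query."""
    scores = [r_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries, each with a ranking and its relevant set.
runs = [
    (["d3", "d7", "d1", "d9"], {"d3", "d1"}),              # 1 of top 2 relevant
    (["d2", "d4", "d8", "d5"], {"d4", "d8", "d6", "d5"}),  # 3 of top 4 relevant
]
print(mean_r_precision(runs))  # 0.625
```

On this reading, a jump from 28% to 50% means that, on average,
half of the top R documents returned for a query were relevant
once the topic lists had been padded by hand.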

The truth is that Schone and his colleagues have created a truly
clever invention. They have done some impressive research. What a
shame all this creativity and laborious testing is going to be
used for such dark, Orwellian purposes.

Let's work on the mental image of that dark landscape.  The NSA
sucks down phone calls, emails - all sorts of communications - to
its satellite bases.  Its computers sift through the data looking
for information which might interest the U.S. or, if the Americans
happen to be feeling generous that day, their allies.

Now, whenever NSA agents want to find out about you, they pull up
a slew of details about you on their database. And not just the
run-of-the-mill gumshoe detective stuff like your social security
number and address, but the telephone number of every person you
call regularly, and everything you have said when making those
calls to 1-900-Lick-Me from your hotel room on those stopovers in
Cleveland.

And here's the real scary stuff:

The NSA likely already has a file on many of us. It's not a
traditional manila file with your name typed neatly on the front.
It's the ability to reference you, or anyone who matches your
patterns of behavior and contacts, in the NSA's databases. Now, or
in the near future, this file may not just include who you are,
but what you *say*.

British Member of the European Parliament Glyn Ford is one of the
few politicians around who is truly concerned with the
individual's right to privacy. A driving force behind the European
Parliament's STOA panel's two-year investigation into electronic
communications, Ford is worried that the NSA possesses
technologies that are "potentially very dangerous" to privacy,
yet faces no controls over its activities.

The Australian Aboriginal activist and lawyer Noel Pearson once
said that the British gave three great things to the world:
tea, cricket and common law. If unchecked, the NSA and its sister
spy agencies in the UKUSA agreement may use this technology to
lead an assault on the most important of those gifts, and the
common law tenet "innocent until proven guilty" may be the first
casualty.

How ironic: one Blair wrote '1984' as fiction, and another is
helping to make it fact.

= = = = = = = = = = = = = = = =

An Australian-American writer, Suelette Dreyfus was educated in
the UK and US, studied at Oxford University and Columbia
University in New York, where she won the prestigious Teichmann
Prize for excellence and originality in writing. She is the author
of Underground, the first book about Australian computer hacking,
available at

= = = = = = = = = = = = = = = = =

EDITOR'S NOTE:  CyberWire Dispatch, with an Internet circulation
estimated at more than 600,000, is now developing plans for a
once-a-week e-mail publication.  Every week, one of five well-known
investigative reporters will file for CWD.  If you think your company
or organization would be interested in more information about
establishing a sponsorship relationship with CyberWire Dispatch,
please contact Lewis Z. Koch at [email protected].

===================
To subscribe to CWD, send a message to:
         [email protected]
No subject needed.

In the first line of the message put:
         Subscribe CWD

To remove yourself from this list, send a message to:
         [email protected]
No subject needed.

In the first line of the message put:
         Unsubscribe CWD

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: [email protected] and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: [email protected]