Douglas Bagnall on Fri, 27 May 2016 16:11:09 +0200 (CEST)
Re: <nettime> artfcity: Turbulence.org Going Offline
As it happens, according to some measures[1][2], I am the reigning world champion of amateur author identification, a hobby I picked up a couple of years ago in response to a particular political situation in New Zealand.

On 26/05/16 03:22, t byfield wrote:

> Stylometry tries to measure a tiny handful of aspects of the parts we
> think we understand.

And for this reason it failed badly, and the term became an embarrassment. Trying to pick stylistic features (because we think we understand them) is a ridiculous practice.

There is really very little information in text -- you can count the bytes, and they are sparsely packed -- so the trick is to avoid discarding it prematurely. Use a model that can find the patterns where they lie, and which makes sense in terms of information theory. That will send you toward the right asymptote.

Even so, questions about authorship will remain uncertain: most of the very little information available in text is actually devoted to conveying and contextualising the "message", and not to leaking identity.

> But Anonymouth is just one of many stylometric projects. Many will
> result in libraries, and many of those libraries will be included in the
> neatly packaged-up software tools -- mainly for identifying speakers and
> attributing utterances. Over time, and probably not very much of it,
> this 'many eyes' effect will outstrip the artisanal editorial skills of
> the kind I mentioned. So, on a certain level, the orientation,
> sophistication, and quality of Anonymouth is immaterial to the fact that
> writing is becoming biometric. And this issue will be a properly
> informational problem, not in the simplistic sense of "we have 'vast
> amounts' of data" but in the more classical sense of measuring the
> reduction of uncertainty -- in this case the uncertainty of whether X
> wrote Y or Y was written by X (which are completely different
> questions).

Yes. Vast amounts of data can fall two ways.
If you have a lot of text from the target sources (deciding, say, whether $FAMOUS_AUTHOR wrote $ANONYMOUS_BOOK) the evidence piles up nicely. But if you are trying to figure out which of these million suspects is running a naughty Twitter account, the combinatorics are against you -- even if you are really, really big and clever. If you narrow those suspects down to a handful, the game is back on.

Thus my opsec advice to Phineas Phisher and the Panama Papers entity would be to cut back on the manifestos. They can't be picked out of the multitude, but it won't help if they ever make the shortlist.

All the money seems to be in "author profiling", which is of course that boring stuff about targeting ads without fixing identity.

It turned out in the case of the scandal that led me into this field that I could find patterns of deception, but this was not nearly as convincing as the already available facts (you know, leaked emails). And, of course, I was months late.

cheers,

Douglas

[1] http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-papers-final/pan15-authorship-verification/stamatatos15-overview.pdf
[2] http://arxiv.org/abs/1506.04891

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: http://mx.kein.org/mailman/listinfo/nettime-l
#  archive: http://www.nettime.org contact: [email protected]
#  @nettime_bot tweets mail w/ sender unless #ANON is in Subject: