Michael Wojcik on Thu, 7 Jan 2010 11:28:51 +0100 (CET)
Re: <nettime> fast-changing propaganda website archiving tools?
Flick Harrison wrote:
> Hey nettimers,
>
> I'm trying to archive some government propaganda websites for a
> research project. I'm on mac but could access linux or PC tools in a
> pinch.
>
> All the various things I've tried have failed to maintain the full
> interactivity / flash linking within the kind of page I'm wanting.

It would help to know what you've tried, then. You mention "things like Fink and Wget". What would those "things" include?

If you haven't tried HTTrack (WinHTTrack for Windows, WebHTTrack for UNIX and Linux), I'd suggest that. It's free, open-source, and reasonably easy to use, configure, and automate. I used WinHTTrack to record changes to US presidential candidate websites in 2007-2008, for a visual-rhetoric project, and it did the job.

http://www.httrack.com/

Note that in general, though, there are any number of ways that people make websites difficult to successfully copy and archive. Basic honor-system methods like robots.txt (which the Wayback Machine respects, for example) and client sniffing are easy to bypass - you just ignore or spoof them (and HTTrack has an option for that). But techniques like traffic shaping, keying served content to ephemeral session cookies, and scripts that inspect document URLs require considerably more finesse.

While it's axiomatic that anything served can be saved, the work factor for saving something can be made pretty high - often higher than the content in question is worth to the person trying to save it. (That's what security is all about, of course: making the work factor for the attacker high enough to invert the economics of the attack, without doing the same to the work factor for authorized parties.)
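To illustrate why robots.txt counts as an honor-system method, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt contents, the "ExampleArchiver" agent name, the example.org URLs, and the is_allowed helper are all made up for illustration; nothing here reflects any particular site or how HTTrack works internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /press/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This check runs entirely on the client side. The server has no way
    to force a crawler to run it, which is what makes robots.txt an
    honor-system mechanism rather than access control.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/press/release.html",
        "http://example.org/index.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleArchiver", url))
```

A polite crawler calls something like is_allowed before each request and skips disallowed URLs; a crawler that never makes the call fetches everything, and the server is none the wiser from robots.txt alone. Client sniffing is similar in spirit, since the User-Agent header is whatever the client chooses to send.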
--
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: http://mail.kein.org/mailman/listinfo/nettime-l
#  archive: http://www.nettime.org contact: [email protected]