Simfish/InquilineKea's Thoughts


March 11, 2010, 11:28 pm
Filed under: Uncategorized

Which methods of archival do you use with respect to deleted websites?

And with deleted topics/posts/threads/anything else?

I have lots of techniques: Google cache, the Firefox cache (make sure to go into offline mode!) [or even use multiple instances of Reload Every, although I don't do that], BoardReader, the caches from MSN and Yahoo (though they rarely work), sometimes archive.org, Lazarus Form Recovery, and ClipCache (I often copy/paste things). And I've gotten used to the practice of saving things instead of bookmarking them, which I am so glad of doing whenever I look at, say, the bookmarks I made in 2006.
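
For the archive.org route, the lookup can even be scripted. Here is a minimal sketch (Python, standard library only) that asks the Wayback Machine's availability API for the snapshot closest to a given date; the target URL in the example is just a placeholder.

```python
# Minimal sketch: ask the Wayback Machine whether it holds a snapshot of a dead URL.
# The target URL below is a placeholder; swap in whatever page you're trying to recover.
import json
import urllib.parse
import urllib.request

def find_wayback_snapshot(dead_url, timestamp="2010"):
    """Query archive.org's availability API for the snapshot closest to `timestamp`."""
    query = urllib.parse.urlencode({"url": dead_url, "timestamp": timestamp})
    with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(find_wayback_snapshot("http://example.com/some-deleted-thread"))
```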

Of course, if I anticipate that a site may be deleted soon, I quickly use DownThemAll or WinHTTrack (whichever is more convenient). Though robots.txt is much more common than it used to be, which makes WinHTTrack useless for a lot of sites. But then I created a program that concatenates topic indexes (after using the batch-processing feature of DownThemAll to save an entire list of topic index pages), so I can run DownThemAll on the entire concatenated forum and save every thread that has ever existed.
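
The concatenation step could look roughly like this (a sketch of the idea rather than the actual program; it assumes the index pages were saved as index*.html and that threads live under viewtopic.php):

```python
# Rough sketch of the index-concatenation idea (not the original program):
# read every saved forum index page, pull out the thread links, and dump them
# into one big HTML page that DownThemAll can then grab links from.
import glob
import re

# Assumption: index pages were saved as index*.html and threads live under "viewtopic.php".
thread_link = re.compile(r'href="(viewtopic\.php\?[^"]+)"')

links = set()
for page in glob.glob("saved_indexes/index*.html"):
    with open(page, encoding="utf-8", errors="ignore") as f:
        links.update(thread_link.findall(f.read()))

with open("all_threads.html", "w", encoding="utf-8") as out:
    out.write("<html><body>\n")
    for link in sorted(links):
        out.write(f'<a href="http://example-forum.com/{link}">{link}</a><br>\n')
    out.write("</body></html>\n")
```

The resulting all_threads.html can then be opened locally and handed to DownThemAll, which will happily grab every link on it.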

Of course, renaming masks are important, but *name*.*ext* and *text*.*ext* have seemed sufficient for everything so far.

As of 2010, archival through iMacros is also possible. It can click through the JavaScript links that HTTrack/DownThemAll can't automatically follow, and it can also get through password-protected webpages that HTTrack is usually bad at handling. The remaining issue is deep mirroring, though (it can probably mirror pages one at a time, but not simultaneously).
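
iMacros itself is scripted inside the browser, but the password-protected half of the problem can also be sketched outside it. Here is a rough Python version of the "log in first, then save each page" idea using the requests library; the login URL, form field names, and thread URLs are all made up. Note that this handles the login but not the JavaScript clicking, which is exactly where iMacros earns its keep.

```python
# Sketch of "log in first, then save pages" outside the browser, via requests.
# The login URL, form field names, and thread URL pattern are all invented --
# a real forum will differ.
import requests

session = requests.Session()
session.post("http://example-forum.com/login.php",
             data={"username": "me", "password": "secret"})

for thread_id in range(1, 50):
    url = f"http://example-forum.com/viewtopic.php?t={thread_id}"
    page = session.get(url)
    if page.ok:
        with open(f"thread_{thread_id}.html", "w", encoding="utf-8") as f:
            f.write(page.text)
```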



March 4, 2010, 10:08 pm
Filed under: Uncategorized

The Fourth Paradigm: Data-Intensive Scientific Discovery – Microsoft Research

So basically, the two main paradigms used to be experiment and theory. Then in the 1950s came simulations, and now we have data-intensive scientific discovery. Some people have recently written programs that can derive physical formulas from massive amounts of data. Such methods can produce true results without an a priori hypothesis, which runs counter to the traditional scientific method.
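
As a toy illustration of the idea (not one of those real systems, which do full symbolic regression), here is a sketch that "discovers" the free-fall law from noisy data simply by letting least-squares choose between two candidate formulas:

```python
# Toy illustration of "deriving a formula from data": generate noisy free-fall
# measurements, then let least-squares pick between a linear and a quadratic law.
# A crude stand-in for real symbolic-regression systems, not one of them.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 5, 200)
d = 0.5 * 9.8 * t**2 + rng.normal(0, 2.0, t.size)  # "measured" distances

candidates = {"linear d = a*t + b": 1, "quadratic d = a*t^2 + b*t + c": 2}
for name, degree in candidates.items():
    coeffs = np.polyfit(t, d, degree)
    residual = np.mean((np.polyval(coeffs, t) - d) ** 2)
    print(f"{name:32s} mean squared error = {residual:8.2f}")
# The quadratic wins by a wide margin, and its leading coefficient comes out near g/2 = 4.9.
```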

As the title of an interesting article puts it:
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

This is not to say that the scientific method is irrelevant. It isn't, and data-intensive scientific discovery still needs a priori hypotheses to get efficient algorithms that don't take forever to run. But it does mean that it may become easier for people to develop weaker a priori hypotheses that can serve as the basis of algorithms. Or one could, say, run a simulation on a smaller data set to produce a conclusion that can then serve as the hypothesis for a run on the larger data set.
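
A sketch of that last workflow, with made-up toy data and candidate models: rank the hypotheses on a small random subsample, then fit and test only the winner on the full data set.

```python
# Sketch of "use a small run to pick the hypothesis, then test it on the full data":
# rank candidate models on a small random subsample, then evaluate only the winner
# on everything. Purely illustrative; the data and models are toys.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100_000)
y = 3.0 * np.log1p(x) + rng.normal(0, 0.3, x.size)

subset = rng.choice(x.size, 500, replace=False)   # the "smaller data set"

def fit_error(transform, xs, ys):
    """Least-squares fit of y ~ a*transform(x) + b; return the mean squared error."""
    A = np.column_stack([transform(xs), np.ones_like(xs)])
    coeffs, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return np.mean((A @ coeffs - ys) ** 2)

candidates = {"linear": lambda v: v, "sqrt": np.sqrt, "log1p": np.log1p}
best = min(candidates, key=lambda name: fit_error(candidates[name], x[subset], y[subset]))
print("hypothesis chosen on the subsample:", best)
print("error of that hypothesis on the full data:", fit_error(candidates[best], x, y))
```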

Anyways, statisticians have long been known to extract small amounts of signal from large amounts of noise. I've talked to a number of professors about this, and they all seem to agree that it comes down to statistical technique. As a New York Times article says, http://www.nytimes.com/2009/08/06/te…y/06stats.html.
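
A tiny illustration of that signal-from-noise point: an effect ten times smaller than the measurement noise is invisible in any single measurement, but the mean of N measurements recovers it, with the uncertainty shrinking like 1/sqrt(N).

```python
# A 0.1-unit effect buried under noise of 1.0 is invisible in any single measurement,
# but averaging many measurements recovers it; the standard error falls like 1/sqrt(N).
import numpy as np

rng = np.random.default_rng(2)
true_signal, noise_sd = 0.1, 1.0

for n in (10, 1_000, 100_000):
    samples = true_signal + rng.normal(0, noise_sd, n)
    print(f"N={n:>7}: estimated signal = {samples.mean():+.3f} "
          f"(theoretical std error = {noise_sd / np.sqrt(n):.3f})")
```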

Anyways, the science of the future involves sensors with very high resolution, the ability to distribute those sensors so that they capture representative samples (along with an exponential increase in the number of sensors), exponential growth in storage capacity per hard drive, and exponential growth in processing speed following Moore's Law (although this growth will certainly asymptote in the next few decades). This, of course, allows for the possibility of a data deluge. Then come the algorithms. Regular science will not become obsolete, but will rather be supplemented. Its scope might possibly change.

Then there are neural networks and artificial intelligence. Technically, neural networks are a subset of artificial intelligence, but I like to separate the two. I think one point of discussion in the future will be this: how different is data mining/machine learning from AI? And are the current statistical heuristics merely primitive forms of artificial intelligence? In fact, crowdsourcing (giving the “masses” a means of knowledge creation/discovery – Wikipedia is an excellent example, as is anything user-created) is a sort of artificial intelligence – a sort of distributed artificial intelligence.

So the point of this thread? I'm thinking that the great scientific discoveries of the future will be disproportionately influenced (relative to the past) by data mining/pattern recognition. Algorithms in particular will be important (those, too, are probably just a subset of AI). So the aspiring scientist may be wisest to study those fields, and then s/he may be well prepared for any particular field.

===

Some more links:

http://stackoverflow.com/questions/897695/hottest-areas-in-computer-science-research

http://www.nytimes.com/external/readwriteweb/2010/05/31/31readwriteweb-the-coming-data-explosion-13154.html

“We are in the midst of a generational shift in research, and research funding opportunities, driven by new ‘disruptive’ technologies. The rapid emergence of a new world of science driven by very large scale data, next generation sensors, and advanced robotic instruments, in a host of disciplines from the environmental, physical and other sciences and engineering through public health and medicine, requires research universities to make a new set of high-level specialty faculty and technical skills and resources available to research endeavors and proposals in order for them to remain competitive.”