Tuesday, November 27, 2007

The edge of the known

Will serious scholars always require libraries and books or will they one day be supplanted by online documents? The World Wide Web is the closest thing in history to a universal mind dump, a download of the sum total of our knowledge. But can we ever map its extent? Or will our knowledge always grow faster than our ability to catalogue it?

The Library of Congress, which is larger than the New York Public Library, contains about 11 terabytes of information. That’s a huge amount of information. Yet it is dwarfed by the amount of information already accessible online through search engines, about 167 terabytes. This is about fifteen times as much as the Library of Congress, a figure which even Grafton admits is impressive. But the information available through search engines like Google in turn shrinks to a mere dot compared to the material for which no ready directory exists: the so-called Deep Web. The Deep Web is that part of the Internet for which there is no street map. The University of California, Berkeley estimates the Deep Web to be 91,000 terabytes in size, 545 times larger than all the material indexed by search engines and 8,150 times larger than the holdings of the Library of Congress.
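For readers who want to check the arithmetic, here is a minimal sketch that recomputes the ratios from the rounded terabyte figures quoted above. The variable names and the `ratio` helper are illustrative only. With these rounded inputs the last ratio comes out closer to 8,270 than 8,150, so the published multiple presumably rests on less rounded source figures; that is an assumption, not something the estimate itself states.

```python
# Sizes as quoted in the post, in terabytes (rounded figures, not source data).
LIBRARY_OF_CONGRESS_TB = 11
SEARCHABLE_WEB_TB = 167
DEEP_WEB_TB = 91_000


def ratio(larger_tb, smaller_tb):
    """Return how many times the smaller collection fits into the larger one."""
    return larger_tb / smaller_tb


web_vs_loc = ratio(SEARCHABLE_WEB_TB, LIBRARY_OF_CONGRESS_TB)
deep_vs_web = ratio(DEEP_WEB_TB, SEARCHABLE_WEB_TB)
deep_vs_loc = ratio(DEEP_WEB_TB, LIBRARY_OF_CONGRESS_TB)

print(f"Searchable web vs. Library of Congress: {web_vs_loc:.1f}x")
print(f"Deep Web vs. searchable web: {deep_vs_web:.1f}x")
print(f"Deep Web vs. Library of Congress: {deep_vs_loc:.1f}x")
```

The first two results agree with the "about fifteen times" and "545 times" figures in the text once rounded.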

Read the rest of my article at Pajamas Media. Nothing follows.

28 Comments:

Blogger F said...

". . .which is larger than the New York Public Library. . ." nope, no local bias in this column. New York does it again. Why would someone expect the Library of Congress to be smaller than the New York Public Library (unless that someone were from New York?) F

11/27/2007 07:55:00 AM  
Blogger geoffb said...

Your article reminds me of this story I read long ago.
http://jubal.westnet.com/hyperdiscordia/library_of_babel.html

11/27/2007 08:11:00 AM  
Blogger Phoenix_Blogger said...

I recall the Lexis database was projected to use 96-97 terabytes around 2000-2001, but I have no idea how much data is stored there now. (Full discl: I work for Lexis in support:)

11/27/2007 09:34:00 AM  
Blogger Buddy Larsen said...

Great PJM piece. Adding the TS Eliot--a glimpse into the heart of being suddenly lighting up a data-storage techno-rumination, hmm, izzat what good writing is? I think so.

11/27/2007 09:38:00 AM  
Blogger always right said...

It is equivalent to "witness the birth of the universe".

First, there was none. Then Boom! somehow it existed (not with a Big Bang), and continued to expand at extra light speed....

Gives me the shivers.

11/27/2007 10:46:00 AM  
Blogger ZZMike said...

The problem with the 91 petabytes is first, the obvious: finding something of value in that morass. Which illuminates the second: finding something of value. It's very much like Borges' Library of Babel (thanks for the link - I almost forgot about that one).

Also, there's something permanent about books. If I pick up a copy of a Shakespeare folio, I know that that's what he wrote, and that there haven't been lesser lights mucking about with the text.

Which brings me to Wikipedia. The good news is that anybody can submit and edit, but that's also the bad news. I like the Encyclopedia Britannica system, where authors were identified. (The entry on guerilla warfare was written by T. E. Lawrence, and there isn't a long comment thread of "yes, but"s.)

My wife complains of the loss of card catalogs at libraries. Next to one card will be others, somehow related, that point to books on other shelves.

11/27/2007 11:03:00 AM  
Blogger Fat Man said...

The role of the internet. Older SciFi fans (such as I) may remember one of Isaac Asimov's best short stories, "The Last Question."

It is about the world-encompassing computer called MultiVAC and the one question it cannot answer until last. Some anonymous hacker saw the analogy between the story and the internet:

If you're reading this, and you don't know me, chances are you're here because you've reached the same conclusion that I have: the Internet is the MultiVAC Isaac Asimov envisioned in his 1956 short story The Last Question.

No, it doesn't quite take the same form as Asimov's giant centralized computer. It's considerably more diffuse -- even more so than the network of microvacs and planetary AC's described in the story.

11/27/2007 12:18:00 PM  
Blogger Charles said...

imho the net makes finding info sufficiently easy that you can get into weird coffee shop disputes and resolve them satisfactorily by popping open your laptop and doing a fast look up.

argument winners tend to be epistemological rather than rhetorical.

This is a good thing imho.

11/27/2007 12:18:00 PM  
Blogger Mitch H. said...

Yes the ocean of data is vast, but even the accessible surface is a grand expanse of meaningless salt water interspersed with the little wriggling bits of information which you're looking for. In comparison, a proper and well-maintained library is a dockside warehouse, packed to the rafters with fish. The open sea might be mystic and wondrous and poetic, but you're less likely to starve in the fishmonger's cold-storage unit.

11/27/2007 01:20:00 PM  
Blogger Towering Barbarian said...

I really do have to agree with both Mitch and Mike on this one (And Mitch, you expressed it much better than I ever could have in a hundred years!). There was enough mopery and dopery that went into the publication of the Shakespeare Folios that I winced when that was used as an example of lesser lights not mucking with what's been done, but the fact remains that hard copy really is harder to tamper with than its electronic counterparts and is therefore more trustworthy. Also of importance is the fact that paper is easier on the eyes than a computer screen and I do not look for that to change for a while yet to come.

11/27/2007 01:50:00 PM  
Blogger Buddy Larsen said...

I'm afraid that you might find that ''easier on the eyes'' advantage tends to flip over somewhere in your mid 50s.

11/27/2007 02:38:00 PM  
Blogger Wretchard said...

In Umberto Eco's The Name of the Rose the essential problem in the story was to find a piece of missing information in the labyrinthine library. That library, as those who read the novel may recall, contained a "Darkweb" -- a private network of information distribution. It also contained a Deepweb: books that were not on the catalogue. It was a fine setting for a murder story.

William of Baskerville's problem was twofold: he had to deduce the existence of unindexed information and find it despite the fact it was behind a literal firewall. But it was only after he found the hidden piece of information that the puzzle arranged itself into a solution.

Today we literally live atop a mountain of data; some proprietary, some secret, some criminal. Within the "library" or graph we call the Internet is an entire ecosystem. The hunter and the hunted; the cybercounterterrorist and the cyberjihadi. It is a largely unmapped continent crisscrossed by huge dark rivers of encrypted data.

It is a nontrivial problem to describe the state of the system -- the state of the Internet -- in anything approaching real time. In a sense we don't know what we know. Sometimes we guess, as William of Baskerville did, that the Truth is in there somewhere, like a nugget in a riverbed full of mud, even when the riverbed itself is not miles underground.

11/27/2007 02:55:00 PM  
Blogger eggplant said...

ZZMike said:

"Which brings me to Wikipedia. The good news is that anybody can submit and edit, but that's also the bad news."

Wikipedia is a deep frustration for me. I made the mistake of spending several man-days writing an article there and then watching it slowly erode away as different idiots made edits on it. It's like building a beautiful sand castle then watching the tide come in and wash it away.

Wretchard said:

"In Umberto Eco's The Name of the Rose the essential problem in the story was to find a piece of missing information in the labyrinthine library."

Wretchard, you show excellent taste in novels. The movie based upon the novel starring Sean Connery was also brilliant and I strongly recommend it. Buy the DVD. It's a movie that you can watch many times.

11/27/2007 04:04:00 PM  
Blogger RWE said...

When I first began using the Internet for non-work related effort (translation: fun stuff) an old friend of mine who was highly enthused about the Internet (translation: hopeless nutcase) told me I needed to get a "cyber education" and directed me to a number of enthusiast websites (translation: gathering places for hopeless nutcases needing reinforcement of their mental infirmity).

I responded that anything posted on the Internet had the same veracity as that written on the wall of a men's room of a Conoco gas station in Ponca City, OK. Some experts disagree violently with this viewpoint, avowing that the Internet more closely resembles the truthfulness of material displayed on the back wall of an Exxon station in Honea Path, S.C.

Case in point: I recently wished to use a quote from Shakespeare in a piece for an on-line magazine. I thought it came from Hamlet. And so did other people. Searching for that quote on the Internet via the usual search engines, I found numerous cases of people quoting that line as from Hamlet. But further search turned up an actual Shakespeare website that revealed the quote was from Julius Caesar.

And I found the wrong info both first and more easily. You don't really have that problem with an actual library. Or even with a gas station wall (not much fake Shakespeare there). Most of the Internet is Chaff, to use electronic warfare terms.

11/27/2007 05:50:00 PM  
Blogger Chuck said...

RWE,
Trying to find a phrase from Shakespeare in a library, in physical books, paging through physical pages is going to be very time consuming and very problematic. On the Internet you can identify where something is from in seconds, even something extremely obscure like "I care not I, knew she and all the world" by Googling (be sure to include the quotation marks to filter the chaff). I am a Baroque music aficionado and recently I found (out of print) CDs of the complete harpsichord works of Louis Couperin and of Jacob Froberger. It took an hour or two of effort and some clever querying, but I was able to locate them, purchase them and will soon appreciate listening to them. Back in the 70s when I first fell in love with classical, those CDs would have been simply unfindable and unobtainable. There are certainly vast amounts of chaff on the Internet, but if you use the search intelligently, it can very often make possible what was once impossible.

11/27/2007 06:54:00 PM  
Blogger Mad Fiddler said...

You dudes need to scope out the Akashic Records, which comprise the sum total of every thought that has ever been thunk, by every sentient entity, corporeal or spiritual, plus the events of the unfolding of all universes and all time.

It's a little tricky getting hold of a library access account, though. Seems to involve a number of reincarnations.

I make no claims, mind you, I've only heard references.

11/27/2007 08:59:00 PM  
Blogger Mad Fiddler said...

One of the lists that used to circulate among the engineers and systems administrators in Silicon Valley was the stupid questions and problems they had to deal with from their dumb-ass clients.

A favorite seemed to be the story of the lady that brought a floppy disk to her company's system administrator and asked him to download the internet onto it, please...

11/27/2007 09:12:00 PM  
Blogger geoffb said...

ZZMike wrote,
"My wife complains of the loss of card catalogs at libraries. Next to one card will be others, somehow related, that point to books on other shelves."

I agree with her. One of the things I liked in my younger days was that in looking for one thing in the card catalog I'd stumble on many others I never knew existed. However I must say that the Internet with all the hyper-links everywhere is much like a multi-dimensional card catalog. You never know what link someone will post to a place you would never have found otherwise. In my case that is what is so addictive about the Internet.

11/27/2007 10:34:00 PM  
Blogger dla said...

Can we avoid another dark age without libraries? I think the answer is yes.

The ability of society to evolve is based solely on its ability to leverage the learnings of others and propagate that knowledge to its citizens.

The web is distributed and therefore much harder to destroy. It is ubiquitous and therefore much harder to ideologically control. However, it still requires an educational system to train up and make use of the information - arguably the weakest asset in America.

11/28/2007 08:08:00 AM  
Blogger Alex Sloat said...

Only 11 terabytes? That's way smaller than I expected. I'm just waiting for the point about 10 years from now where I have enough hard drive space to justify torrenting it.

11/28/2007 09:55:00 AM  
Blogger chachapoya said...

The internet is very much like the universe, ever expanding, all inclusive, all surrounding, ruthless and benevolent. The more you become accustomed to it the more it changes. The internet you perceive has very little to do with what it actually is at this moment and but for some small corners, a complete mystery to most of us. That's OK by me. If you learn the laws of the internet (not all Shakespeare quotes are correct, protect yourself, etc.) you will find the Baroque music you search most times. Caveat emptor.
The dead tree media have their problems too, and lately it's been denizens of the interweb who have shed light on them.

11/28/2007 10:59:00 AM  
Blogger Peter Grynch said...

I recently edited Dennis Kucinich's Wikipedia entry to add a reference to his encounters with extraterrestrials, then checked back daily to see how long the edit would survive. It lasted about a week.

If anyone had conceived of the ubiquitous Internet back in the 1960's, what are the odds that this prescient individual would have guessed that the two most killer apps would be the ability to violate music copyrights, and the ability to download porn?

Of course, books have their own problems:
When the Pope died, he met St Peter at the pearly gates and said he had many questions. St. Peter ushered the Pope to the heavenly library. The Pope was thrilled and settled down to review the history of humanity's relationship with God.

Two years later, a scream of anguish pierced the quiet of the library. Immediately several of the saints and angels came running.

They found the Pope pointing to a single word on a parchment, repeating over and over: "There's an 'R'. There's an 'R.' There's an 'R'... It's CELIBRATE, not celibate!"

11/29/2007 10:59:00 AM  
Blogger dla said...

Peter Grynch said...
I recently edited Dennis Kucinich's Wikipedia entry to add a reference to his encounters with extraterrestrials, then checked back daily to see how long the edit would survive. It lasted about a week.


1 week survival for a bald-faced lie is a huge improvement over paper print. Biology textbooks still have Hegel's moths as a "proof" of Darwinian evolution - even though it was debunked years ago.

I'm hoping (hope because I fear the opposite is quite possible) the future "internet" will be beyond ideological control.

But I know that just as today there is a 1st church of Darwin, a 1st church of Global Warming, and a 1st church of hate Bush, there will be similar efforts of thought-control in the future.

And that thought control is the central factor of another Dark age.

11/29/2007 12:45:00 PM  
Blogger jj mollo said...

The Name of the Rose is wonderful. Eco makes Dan Brown look like a beginner. There were so many angles to that story with lost books of the ancient Greeks and the Occam references and the Sherlock Holmes analogy.

Eggplant made me think about the possibility of creating a shadow Wikipedia where the entries are parallel to the Wikipedia articles, but contain only references to good articles written by individual authors on the same subject. There is already a Conservapedia which gives the conservative angle on Wikipedia subjects, but I'm thinking about a broader spectrum of writing, linking to anyone who has a coherent and co-extensive view on the subject.

11/29/2007 02:47:00 PM  
Blogger Bookyards said...

An excellent article that accurately summarizes certain developments that are occurring with online libraries. I should know because I am the founder of Bookyards ( http://www.bookyards.com ), a library that has been online for the past eight years and is one of the first to explore the commercial possibilities for libraries on the internet.

The only things that I can add to the article are the following: a historical perspective, and certain trends that are not explained in your article.

When we first started there were only a handful of libraries, with a limited readership. Today, there are over 800 legal libraries (the list is at [www.bookyards.com/categories.htm...] ), and approximately 150 online “pirate” libraries (online libraries that are providing books without respecting copyright law). Today’s overall readership for online books runs in the millions.

The content that was available 8 years ago was approximately 30,000 online books. Today, I would estimate that there are over a million English books online.

But the problems with today's online libraries are the following:
(1) the content that is being accumulated, while overwhelming, is primarily material published before 1927 and is out of date. It is not only out of date, but ... and I am being polite ... quite useless and not worth anyone’s time to look at.
(2) The formats that are being used are not user friendly. PDF is preferred, but other formats such as DjVu now have been given priority.
(3) For most of these sites (Google and Universal Library) you cannot download the books.
(4) Copyright issues and legal problems have not been resolved, and it will be years before they are. This limits everything.

If that were all, bookstores and libraries would be safe. But the threat to libraries and bookstores will not come from the projects outlined in your article; it will come from offshore online libraries that do not respect copyright laws, have modern books and content (some of them have present-day bestsellers), are user friendly (use PDF), and can be downloaded (via RapidShare and other file-transfer services). Gigapedia ( http://gigapedia.org/ ) is a perfect example of this trend: a Russian online library with thousands of English books that are copyrighted in the West but are available for free on their site.

As Napster was for music, and YouTube and p2p networks were for videos and movies, it is these “pirate” websites that will have a greater impact on books, libraries, and the publishers who produce them than online libraries such as Google, Universal Library, or Project Gutenberg.

11/29/2007 10:08:00 PM  
Blogger Peter Grynch said...

Note to dla...
You might want to check your facts before accusing somebody of "a bald faced lie".

http://www.google.com/search?sourceid=ie7&rls=com.microsoft:en-US&ie=utf8&oe=utf8&q=kucinich+ufo+streisand

Dennis Kucinich's extraterrestrial encounters are well documented and he recently confirmed them publicly on national TV.

Hope this doesn't shake your faith in the guy. "Beam me up, Scotty!"
:^)

11/30/2007 01:02:00 PM  
Blogger Don Cox said...

I think the main problem of the Web is the absence of in-depth information. For example, I recently finished reading Hilary Spurling's great 2-volume biography of Matisse. You would find hundreds of one-page entries on Matisse all over the web, but nothing like a full biography. Again, I have a book (published by Taschen) that reproduces hundreds of works by Picasso. There is no comparable collection on a web site, and if there was, the images would probably be too low in resolution to do justice to the paintings and drawings. But the Web is ideal for getting a quick introduction to some topic, or settling some small point.

12/02/2007 11:48:00 AM  
Blogger Peter Grynch said...

Don Cox said, "if there was, the images would probably be too low in resolution to do justice to the paintings".

Here's a link to an ultra-high resolution copy of Da Vinci's Last Supper (the source file is gigapixels). You can view the whole picture, or zoom in till the brush strokes are visible.

Get a 62" hi-def monitor and the experience will be identical to having a real painting.

No reason big art museum websites won't one day feature all their exhibits in hi-def...

12/05/2007 03:29:00 PM  
