BLOG STARTUPS, VENTURE AND THE TECH BUSINESS
July 23 2009
by Stephen Marcus
Data: the new landfill
I’m currently out in Seattle at the 2nd Annual Tableau Customer Conference discussing tips and tricks on their software for the interactive visualization of data. Having been a customer for several years, I can honestly say that Tableau is one of the most useful pieces of software I’ve ever run across. Data is becoming a problem, and in more ways than one. It’s everywhere, and there is lots of it, because we save everything multiple times over. It has become increasingly difficult to dig the meaningful data out from the noise of the duplicate and irrelevant parts.
IDC releases a report every year on the Digital Universe that projects how much data we are stockpiling. They reported that the digital universe was 281 exabytes, or 281 billion gigabytes, in 2007. That’s 47 gigabytes for every person on the planet, and it is due to increase 60% annually, a ten-fold increase over the next five years. Hard drive cost continues to decline while capacity increases, but here’s the bad news: cost is falling at a much slower rate than the amount of information being created is rising. In essence, we are generating data faster than we can add storage capacity to hold it. According to the chart below, we are poised to blow through the available storage capacity on the planet well before we reach 2012.
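For anyone who wants to check the arithmetic behind that ten-fold figure, here is a rough sketch in Python. The 281 exabytes and the 60% growth rate are the numbers quoted above; the function name and the assumption that growth compounds once a year at a constant rate are mine, not IDC's.

```python
# Figures quoted in the post; constant yearly compounding is an assumption.
BASE_EXABYTES = 281      # IDC estimate of the digital universe in 2007
ANNUAL_GROWTH = 0.60     # 60% growth per year


def projected_size(base, growth, years):
    """Compound `base` at rate `growth` per year for `years` years."""
    return base * (1 + growth) ** years


if __name__ == "__main__":
    for year in range(6):
        size = projected_size(BASE_EXABYTES, ANNUAL_GROWTH, year)
        multiple = size / BASE_EXABYTES
        print(f"{2007 + year}: {size:,.0f} EB ({multiple:.1f}x the 2007 total)")
```

Five years of 60% growth multiplies the starting point by 1.6 five times over, which comes to a little more than ten, so the two figures in the report agree with each other.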

We have turned into digital hoarders, accumulating pictures, videos, old e-mails, excess Facebook friends (see: Burger King’s Whopper Sacrifice campaign), etc. My latest search on Amazon for a hard drive showed that a 1,000 gigabyte hard drive is a mere $90, yet EMC calculates that it costs $36.29 a year just to power and cool a hard drive in a datacenter. At that rate, it will cost more to keep your data accessible and maintained on a hard drive than it did to buy the drive in the first place. Next time, maybe you will think twice about adding that friend you don’t know so well on Facebook.
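Here is a back-of-the-envelope version of that comparison in Python. The $90 price and the $36.29 yearly power and cooling figure come from the paragraph above; treating a retail drive and a datacenter drive as the same thing, and ignoring everything else (staff, floor space, replacements), are simplifying assumptions of mine.

```python
# Figures quoted in the post; mixing a retail price with a datacenter
# running cost is a simplification for illustration only.
DRIVE_PRICE = 90.00               # 1,000 GB drive on Amazon, in dollars
POWER_COOLING_PER_YEAR = 36.29    # EMC's yearly power and cooling estimate


def years_to_break_even(price, annual_cost):
    """Years until cumulative running cost equals the purchase price."""
    return price / annual_cost


def total_cost(price, annual_cost, years):
    """Purchase price plus running cost over `years` years."""
    return price + annual_cost * years


if __name__ == "__main__":
    breakeven = years_to_break_even(DRIVE_PRICE, POWER_COOLING_PER_YEAR)
    print(f"Running costs match the purchase price after {breakeven:.1f} years")
    for years in (1, 3, 5):
        cost = total_cost(DRIVE_PRICE, POWER_COOLING_PER_YEAR, years)
        print(f"Total cost after {years} year(s): ${cost:,.2f}")
```

Under these assumptions the running cost overtakes the sticker price in about two and a half years, well within the time most of us leave old files sitting untouched.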
Then there are our digital doppelgangers, or what IDC refers to as your digital shadow: the digital information generated daily about the average person. My digital alter ego is amassing data from my financial records, mailing lists, web surfing histories, and more. Our doppelgangers are now, for the first time, creating more data than we are. Consumers generate more than three quarters of all the data and enterprises the remainder. Several stories have been published about the glut of data piped into the CIA and FBI for inspection without any real ability to process the bulk of it. Needless to say, obtaining the data is clearly the easy part, but at this rate we will need Al Gore to launch a new global campaign in three years to combat the stockpiling of mostly useless and repetitive data.
But all of our data isn’t useless, right? True. Not all, but most. Some of the best solutions to problems are elegantly simple, often containing only a few variables that contribute to the conclusion. How about E = mc²? Three variables leading to one epic conclusion. Large multivariate analyses contain lots of overlap: Name, ID#, etc. are all variables highly correlated with one another, meaning that adding one more of them doesn’t tell us anything we didn’t already know. So then why save everything? In a word: fear. We save it because we don’t know which of the variables is the winning lotto ticket. Even if it isn’t conclusive for us, it’s probably valuable to, say, Facebook or Google, right? That’s akin to saying that one man’s junk is another man’s treasure. Your data is more valuable to you, for understanding your own patterns, than it is to anyone else, and fewer and fewer components of the aggregate data have any meaning.
A long time ago, my father told me a story about a reputable local businessman with a side business as a purveyor of almost any piece of equipment he could find. To protect the names of the innocent, let’s call him Phil. Phil was a natural born hoarder; he would save everything and would add to his pile of stuff anything that was a deal (nearly free), regardless of its age or whether it even worked. Phil’s thesis was that everything had value to someone, so if he obtained it at close to zero cost he had a lower barrier to finding a buyer. And there had to be a buyer somewhere for nearly everything, right? In one of his defining moments, Phil won the contents of a room full of equipment at a local company liquidation. He had his men take out the contents sitting on a pallet inside the room and then proceeded to remove the sheetrock from the walls, the radiators, the floors, and everything else that wasn’t bolted, glued, or load bearing.
The auctioneer walked into the barren room and said, “What did you do?” and Phil responded, “It says here that I bought the entire room so that’s what I took.” To this day Phil still has countless warehouses full of junk, the contents of most of which he has long forgotten. He sold some of the items he purchased over the years, but nearly all the second hand equipment, including radiators, pipes, tanks, and such, went unsold while his storage bills continued to increase. Phil doesn’t use a computer, but we can all see how the digital version of this story is shockingly similar. Don’t believe, as Phil did, that all your data is valuable to someone if you just save it in the hope of finding a willing buyer. We all have a little Phil in us, but we can’t afford to keep pretending that most of our data is anything but digital junk. The sooner you start to ask questions of your data to find out which parts are most valuable, the faster we can save the planet from its next environmental crisis: data pollution.
