Data is Not the New Oil
The analogy turned meme turned dogma is persistent and flawed.
Walk into any conference that has something to do with data and you’ll likely hear things like: “knowledge is the greatest commodity,” “the most valuable resource is data,” and “data is the new oil.”
Data is the new oil. We’ve all heard it, and probably nodded along without thinking. But is it? Is data the new oil? Does the analogy hold water, or is it merely a thing people say? I, for one, do not think it is a good analogy: the two are dramatically different.
In its origin, the analogy was likely meant to point out that oil had a revolutionary influence on its era, as data does on ours. But people have become increasingly liberal in invoking it. Even companies that have never mined natural resources can unapologetically state that their ‘data is the new oil.’
Douglas Hofstadter, the famous philosopher and author of Gödel, Escher, Bach, would call this a naive analogy. These are the “kinds of analogies on which nonspecialists tend to base their notions of scientific concepts. [Notions that] […] are acquired thanks to appealing and helpful but often overly simple analogies.” *
Hofstadter goes on to note that such analogies are easy to remember, but that their weakness is that in certain contexts they are misleading. “Naïve analogies are like skiers who sail with grace down well-groomed slopes but who are utterly lost in powder. In sum, naïve analogies work well in many situations, but in other situations, they can lead to absurd conclusions or complete dead ends.” ** This is what we are witnessing currently: the analogy leads us astray. It has led us to believe that positions such as ‘data analyst’ and ‘data scientist’ are simply refineries for data, because that is how we handle oil too. Which is not true. At all.
Data is no oil
The most obvious flaw in the analogy is that gathering digital behavioural data comes at (almost) zero cost. For an oil-based product such as gasoline, over 57% of the cost is incurred while extracting crude oil. In other words, more than half of what you pay at the pump goes to getting the oil out of the ground. No one has ever spent that share of a project’s entire budget just to get their hands on some data.
Shipping and distribution also add significantly to the final price of oil. Data, on the other hand, can be transported freely and stored almost indefinitely at practically no cost.
The idea that data equates to a crude resource such as oil has led us to invest heavily in data lakes (read: swamps) and to gather information blindly. Because, the thought goes, if we have the crude resource, creating an end-product is easy. The analogy leads us to believe that there is, in a sense, an intrinsic value to data.
Data gathering, acquisition and storage have become commonplace and abundant; value creation has proven much harder. Data, it turns out, has no intrinsic value. Value is created when data is applied to solve a specific problem, and it is here, when turning a crude resource into a useful end-product, that cost is incurred too.
Value extraction
Ask any data specialist and they’ll agree that upwards of 60% of the job consists of wrangling the data. Getting the right data from disparate sources, cleaning, restructuring, engineering features, etcetera. Acts that, in themselves, add no additional value other than to the task they are supporting.
Every algorithm, every model, and every analysis requires carefully prepared and cleaned data. Even when thinking of a minimum viable product, the owner has to assume a lot of ‘cost’ upfront before any value can be extracted. Data that was cleaned but not used typically has little value left (unlike, again, oil, whose by-products can be used to create tarmac or plastic).
Value extraction is, therefore, the wrong terminology; value from data needs to be created.
Marginal cost and resource depletion
Data has zero marginal cost for gathering, storage, and transfer. A one-unit increase imposes no significant cost. All cost for analysis is basically upfront. The same goes for production-ready data products: each additional product or customer scored incurs no additional cost.
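To make the marginal-cost argument concrete, here is a minimal sketch of the arithmetic. The function name `average_cost` and all figures are hypothetical, chosen only for illustration; they are not taken from any real project. It compares a product whose cost is entirely upfront (zero marginal cost) with one that also pays a fixed amount per unit.

```python
def average_cost(fixed_cost: float, marginal_cost: float, units: int) -> float:
    """Average cost per unit when an upfront cost is spread over `units`."""
    if units <= 0:
        raise ValueError("units must be positive")
    return (fixed_cost + marginal_cost * units) / units


# Hypothetical figures, for illustration only.
DATA_PRODUCT = {"fixed_cost": 100_000.0, "marginal_cost": 0.0}   # all cost upfront
OIL_PRODUCT = {"fixed_cost": 100_000.0, "marginal_cost": 60.0}   # pays per unit too

for units in (1_000, 10_000, 100_000):
    data_avg = average_cost(units=units, **DATA_PRODUCT)
    oil_avg = average_cost(units=units, **OIL_PRODUCT)
    print(f"{units:>7} units: data {data_avg:8.2f}, oil {oil_avg:8.2f} per unit")
```

Under these assumed numbers, the average cost of the data product keeps shrinking toward zero as volume grows, while the oil-like product can never drop below its per-unit cost.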
Data, unlike typical resources that become scarcer, has become more plentiful. Scarce resources increase intrinsically in value. Their demand outweighs the supply. This is something we have witnessed over the last decades with the ever-increasing price of gas and oil-related products.
Not so with data: it is excessively abundant and ever increasing. The cost is not in finding or storing it. Instead, the cost is in refining the data and creating value from this ‘crude resource.’
Another disparity is that data requires validation and verification; every pipeline needs to research the same questions over and over again. Not so with oil, where we perfect refineries and distribution pipelines to make systems function optimally.
Conclusion
Continuing the ‘data is the new oil’ meme leads us to unconsciously infer more similarity than there really is.
The problem with holding on to and spreading this naive notion of similarity is that it perpetuates the common misconception that ‘data science’ and ‘analysis’ are simple functions on top of data (akin to making gasoline from crude oil). However, it is in the creation of products that the true cost and time are incurred: not in the extraction of the crude resource, but in creating meaningful end-products.
Oil always works.
Data doesn’t.
* Surfaces and Essences: Analogy as the Fuel and Fire of Thinking, Douglas Hofstadter, page: 31
** ibid. page: 389