The other week someone brought to my attention an article titled “Lies Data Tell Us” by Steven J. Thompson, CEO at Johns Hopkins Medicine International. The title took me aback, but as I read it I realized the article was really about the better practices required for data to be more useful. The provocative and somewhat misleading title drew nearly 12K views, dozens of comments, and hundreds of shares on social media. When I went looking for the article again, my search turned up a number of links associating data, big data, etc. with “lies”. Most of the authors blame data, or unscrupulous mining and analysis technology vendors, for all sorts of business problems that supposedly result from “data lies”. It seems some of these authors use the following definition:
Data Scientist (n): A machine for turning data you don’t have into infographics you don’t care about.
I would like to examine a process people often follow when they deal with data.
Since the term “big data” is thrown around a lot, I would like to define it in the context of this article. Mere volume and velocity of data do not constitute “big data”; multiplicity of data sources and data formats does. From that perspective, the term “big data” describes enterprise data aggregated from multiple departments and multiple databases (i.e. the data warehouse model), linked with data from sources external to the company, in structured and/or unstructured formats. Mining such a set of “right data” may produce very valuable intelligence. However, it can also result in wasted money, effort, and opportunity if
- The mining process does not produce relevant new intelligence, or
- The intelligence is not used for action.
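To make the “multiplicity of sources” definition concrete, here is a minimal sketch of linking an internal, structured source with an external one on a shared key. All field names and records are hypothetical, invented purely for illustration:

```python
# Internal, structured source (e.g., an extract from a sales database)
internal_sales = [
    {"product_id": "A1", "units_sold": 120},
    {"product_id": "B2", "units_sold": 45},
]

# External source (e.g., review snippets collected outside the company)
external_reviews = [
    {"product_id": "A1", "review": "battery drains fast"},
    {"product_id": "A1", "review": "great screen"},
    {"product_id": "B2", "review": "arrived damaged"},
]

def link_sources(sales, reviews):
    """Join internal and external records on a shared key (product_id)."""
    by_product = {row["product_id"]: dict(row, reviews=[]) for row in sales}
    for r in reviews:
        if r["product_id"] in by_product:
            by_product[r["product_id"]]["reviews"].append(r["review"])
    return by_product

linked = link_sources(internal_sales, external_reviews)
print(linked["A1"]["units_sold"], len(linked["A1"]["reviews"]))  # 120 2
```

The point of the sketch is that the value lies in the linkage, not the volume: two tiny tables from different origins already form “big data” in the sense used here.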
We act when we believe the action will result in a desirable outcome. We never know for sure, but we estimate the probability based on our experiences in similar circumstances. These dynamics influence how we select, search, and interpret data into intelligence, or lack thereof. Subconsciously we select data that is likely to confirm our existing beliefs. This usually means that we rely heavily on internally generated (controlled) data and heavily discount externally generated data.
We like to use terms such as unbiased and objective, but the very act of selecting a data set introduces bias and subjectivity. It is unavoidable. It is far better practice to acknowledge and understand your bias, be pragmatic, and define the purpose of the inquiry. You don’t see people mining a mountain to find “whatever” is there. They carefully select and test an area for indications of a high concentration of the desired mineral before exploration and mining start.
If the purpose of your inquiry is improvement of customer experience, assemble a data set from the most relevant internal and external data sources available. If you limit your data set to company-controlled data, you introduce a company bias. In that case the likelihood of discovering any new intelligence for improving your customers’ experience is quite low; forget about data mining and just continue your archaic “guess and validate” surveying exercises. If you include data generated by customers without solicitation or control, you introduce a customer bias. Adding channel-generated return data and customer service data allows you to balance the two biases. Correlating trends in controlled and external data sources helps to discover potential gaps between your beliefs and emerging evidence. However, even the best evidence cannot automatically make people abandon their beliefs and start acting differently, but that is a subject for another article.
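One way to check for a gap between belief and evidence is to correlate a trend in a company-controlled series against the same trend measured externally. The sketch below uses hypothetical monthly numbers (an internal satisfaction survey vs. external review ratings) and a plain Pearson correlation; both series and all names are invented for illustration:

```python
# Hypothetical monthly series covering the same six months.
internal_scores = [4.5, 4.5, 4.6, 4.6, 4.7, 4.7]   # internal survey: improving
external_ratings = [4.1, 3.9, 3.8, 3.6, 3.4, 3.2]  # external reviews: declining

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(internal_scores, external_ratings)
# A strongly negative r flags a gap: the company believes things are
# getting better while outside evidence says the opposite.
print(round(r, 2))  # -0.94
```

A divergence like this doesn’t tell you which source is “right”; it tells you where to dig next, which is exactly the kind of new intelligence a biased single-source data set cannot produce.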
The point is: data cannot lie to us; we do that to ourselves by not mining it honestly and competently.