Making Big Mistakes With Big Data
There’s a lot of excitement about big data. Interest in what we can do with information, especially large volumes of data, is at an all-time high. We have more data at this point in history than we’ve ever had, and most of us are looking for ways to put it to work in order to help our businesses grow. However, just because we have a lot of something doesn’t make it valuable. The economic law of supply and demand would indicate the opposite: that the more of something we have, the less valuable it is. But that couldn’t really be true when it comes to data, could it?
It’s difficult to argue with data. Numbers are the ultimate decision-making tool because they can give us an objective view of a situation and quantify things in meaningful ways. When we need to make a point or choose a course of action, numbers have been our reliable ally. Most of us wouldn’t make a strategic decision without data. We’ve become dependent upon numerical figures to do such things as measure market share, evaluate potential expansion efforts, guide committee decisions, and determine when specific initiatives should be advanced or terminated. The advent of big data gives the assurance that we’ll be making even better decisions because we have more data to feed into the process, but this may be a false assurance.
Having so much data doesn’t necessarily mean that we have all the answers. In fact, data never actually tells us what to do. Data tells us “what is.” It’s the interpretation of the data that makes the difference between success and failure in business initiatives. I often hear people use numbers to make their points and couch their arguments with the expression “the data is saying …” But data doesn’t actually give advice. Data only quantifies the current situation, and it may not be quantifying all aspects of it. Data is like the X on the pirate map that says, “you are here.” All applications of data involve interpretation and assumptions, and it’s here that most individuals make mistakes by assuming that the application of the number is just as valid as the existence of the number. It’s important to remember that the measurement may be accurate, but how we use that number is part of the application process, and that always involves an element of subjectivity.
Stack onto this the fact that there are almost as many ways to misinterpret data as there are experts doing the interpretation, and this could be a recipe for a data-driven disaster. However, the key to making better decisions is in understanding data within the proper context. The meaning of specific numbers may change under different circumstances. For example, in the simplest form we all know that a rating of 5 is great on a 1 to 5 scale, but very poor on a 1 to 100 scale. When viewed like this the context appears obvious, but let’s consider something that requires a little more interpretation. A hospital administrator did a cost analysis and discovered that nurses were costing more per hour than other employees. In an effort to increase profitability, they laid off one-third of the nursing staff. In this example the interpretation of the data involved cost per hour and the impact on total profits. However, shortly after terminating that portion of the nursing staff and replacing them with less skilled workers, the overall turnover rate increased and further reduced the salary expenses because even more nurses left the facility. Theoretically this would increase the profit margin as costs decreased, but profits declined because patients had to be turned away due to lack of qualified staff to care for them. Once the nursing staff was reduced there were higher demands on the remaining qualified workers and most of them left the company to find less stressful jobs. The data-driven decision ended up costing the hospital several times what was initially saved by the reduction in workforce. That particular hospital is now struggling to stay afloat. This example is just one manner in which data was interpreted too narrowly, and narrow interpretation is one of the most common mistakes in utilizing big data. Below, I’ve briefly outlined a few additional ways in which I’ve seen individuals confidently make data-driven mistakes in business.
First Mistake: Just Because You Have Data Doesn’t Necessarily Mean That You Should Use It
I’ve lost count of how many executives have said things like “We have all this data. Let’s use it.” The truth is that the data is only as good as its ability to answer a question. If there are no specific questions, then analysis may be a waste of time and money. Sending someone to mine data without a purpose is like sending someone for a hike in the forest and asking them to report back on anything they find interesting.
But what’s interesting? Trees? Mosses? Wildlife? There’s a lot to see in the forest and in data, so having a focus is important. In addition, once someone has a focus, you may suddenly find out that the mountain of data you have on hand is useless for answering the questions that you have. Don’t try to use the wrong data to answer the right questions just because it’s convenient.
Second Mistake: Trying To Use As Much Data As Possible
As I just stated, having a lot of data doesn’t mean that you have the right data, and more is not necessarily better. Sometimes we overcomplicate analysis just because we can. Too often I’ve seen data scientists add one additional variable to their model because an executive said, “The data is there, so why not use it?” I understand that it may appear wasteful not to use every bit of information available, but it’s important to make sure that we’re using all the right information and not adding so much into the mix that it becomes muddy. Think about a chef who puts everything into a soup simply because it was available in the kitchen. Lemon, pickles, chocolate, ketchup, whipped cream, and tacos may not blend as well as we think. Sure, we have a diversity of ingredients, but adding variables to a model can also increase the noise (not to mention indigestion), thus making the results difficult to interpret. It’s not about the quantity or even the diversity of data types: solid business decisions are based on using the right data and interpreting it accurately.
In a similar manner, I’ve seen a lot of work go into presenting the same data from different angles in the hope of revealing something interesting. It’s a phenomenon that people in my line of work refer to as “analysis paralysis.” Information gets rehashed and reformatted, sliced and diced into new charts and graphs until the analysts become catatonic. Don’t overcomplicate any data-based decision because you have more data and tools and want to use them. Occam’s razor exists because it’s true: the simplest is often the best. When we make things more complex than they need to be, we run the risks of misinterpretation and making bad decisions.
Third Mistake: Making Inferences Based On The Data At Hand, Rather Than Collecting More Precise Information
Just because you have a truckload of data doesn’t mean that you have the right data. If you discover that the data on hand isn’t sufficient to answer your questions, then define what’s needed in order to do so. Using the wrong data is about as successful as using a carnival fortune teller. There was a great example of this in a 2011 efficiency study conducted by the OIG. They did some research on the productivity of mail carriers and determined that it would be more efficient to move mailboxes to the street so that the postal worker didn’t have to walk the route. In a simple form, it makes sense that we can drive a route faster than we can walk the same distance. However, the USPS didn’t have the authority to force the entire United States population to move every mailbox to a curbside location. So the Post Office decided to force all new residents to move existing porch mailboxes onto the street or the USPS would refuse to deliver mail to the new resident. The Post Office theorized that the neighborhood would slowly change with new residents over the course of a few decades and then all mailboxes would be moved to the street. How efficient they would then be!
As you might guess, things didn’t quite turn out that way. What happened (and is still happening as a long term effect of this initiative) is that mail carriers took longer to deliver the mail because the mail carriers now had to walk a zig-zag pattern alternating from the street up to the houses instead of cutting across the grass, as they had been. When compared to the original route, this erratic walking pattern more than doubled the distance required to deliver the mail. In addition, the OIG study also failed to consider that even if everyone in a neighborhood had moved their mailboxes to the street, in urban areas there was street parking that made it difficult to reach those street boxes, making it nearly impossible to zip from mailbox to mailbox as the study had suggested.
I don’t know exactly which researcher conducted the study, but my guess is that they compared delivery times from driven routes (which are more likely to be in the suburbs and rural areas) to walking routes (which tend to be in urban areas). I suspect they used the data at hand, but no one thought to bring information on the metropolitan and geographic areas into that particular analysis. For those who don’t already know, the Post Office has been struggling to stay afloat financially, and decisions like this one will keep it struggling or force it to finally go under. In the meantime, there have been a few additional OIG studies looking at the differences between urban and suburban mail delivery. However, the mandate concerning street boxes has already left the USPS with some lengthy walks for mail carriers and there’s no easy solution to remedy the situation.
Fourth Mistake: Being Error Free Doesn’t Mean Being Right
As big data becomes more accessible to a greater number of individuals within an organization, I’m seeing more employees without a measurement background running queries and making the charts executives think they need. When asked if the analyst is certain that the information is accurate, the common reply is that the individual did not receive any error messages when running their code.
Error messages identify syntax errors in coding (code language based errors), but the lack of an error message doesn’t mean that the query is measuring what it was designed to measure. We can correctly arrange garbage and it will still be garbage. This goes back to finding the right experts to design the analysis and interpret the results. Simply because someone is able to write code or successfully conduct research doesn’t mean that the individual has the expertise to analyze the information. All data exists within the context of the industry in which it is created, so be cautious when relying on mathematical and statistical experts who lack applied knowledge. This doesn’t mean that experts can’t share information or analytical techniques across fields, but the right experts will know what’s appropriate and how to do so in order to be successful. It all goes back to remembering that these aren’t just numbers that can be universally crunched to yield the right answer. Experts are trained differently within their fields and they’ve built up a knowledge base of how to analyze the data in order to create meaningful information. Having the right experts will result in better data-based decisions.
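To make the point about error-free code concrete, here is a minimal sketch in Python. The regions, order counts, and revenue figures are invented for illustration and do not come from any real analysis. Both functions run without a single error message, yet only one of them answers the question “what is the average revenue per order?”

```python
# Hypothetical data: (region, number_of_orders, total_revenue)
regions = [
    ("North", 10, 1000.0),    # few orders, high revenue per order
    ("South", 990, 19800.0),  # many orders, low revenue per order
]

def mean_of_region_averages(rows):
    """Average the per-region averages. Runs cleanly, but gives a region
    with 10 orders the same weight as a region with 990 orders."""
    per_region = [revenue / orders for _, orders, revenue in rows]
    return sum(per_region) / len(per_region)

def overall_average(rows):
    """Total revenue divided by total orders: the actual revenue per order."""
    total_revenue = sum(revenue for _, _, revenue in rows)
    total_orders = sum(orders for _, orders, _ in rows)
    return total_revenue / total_orders

print("Mean of region averages:", mean_of_region_averages(regions))
print("Overall revenue per order:", overall_average(regions))
```

With these made-up figures the first calculation reports about 60 per order and the second about 20.8. Neither raises an error; deciding which one measures what was intended takes someone who knows the question being asked.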
Fifth Mistake: Statistical Significance Makes Something Meaningful
One of the fundamental measurement issues encountered in big data is based in statistics. Within any research field, there is a line drawn to define when the results of a study show “true” results and are less likely to be due to sampling error. This line, which most researchers hope to cross, is referred to as “statistical significance.” If your study results are statistically significant, then your research is publishable. In any academic field, if you don’t cross that line you have nothing.
In an applied setting, however, statistical significance can be meaningless. Statistical significance suggests that an effect is unlikely to be due to chance alone, but the size of the effect may not make enough of a difference to justify changing the way you do business. This is especially concerning for analyzing big data sets because the more data we have, the easier it is to show statistical significance without having much of an effect. For example, the study might show that a product with advanced features sells better than the product with basic features. But how much better? What if “selling better” means only selling one more unit per month? If the sample sizes are large enough, the data could be telling you something just like this. The right experts will know what other questions to ask, such as the actual difference in units sold per month, the cost of adding those additional features, any production changes that would need to be made, and whether the changes would significantly slow down time to market.
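The “one more unit per month” scenario can be sketched with a few lines of Python. All of the numbers here are assumptions chosen for illustration (an average of 100 versus 101 units per month, a standard deviation of 20, and equal sample sizes in each group), and the test is a simple two-sided z-test rather than whatever method a real study would use.

```python
import math

def two_sample_z(mean_a, mean_b, sd, n):
    """Two-sided z-test for a difference in means, assuming both groups
    share the same standard deviation (sd) and sample size (n)."""
    standard_error = sd * math.sqrt(2.0 / n)
    z = (mean_b - mean_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical figures: average units sold per month
basic_mean, advanced_mean, sd = 100.0, 101.0, 20.0

# The effect size depends only on the difference and the spread,
# not on how much data was collected.
effect_size = (advanced_mean - basic_mean) / sd

for n in (100, 1_000_000):
    z, p = two_sample_z(basic_mean, advanced_mean, sd, n)
    print(f"n per group = {n:>9,}  z = {z:6.2f}  p = {p:.3g}  "
          f"effect size = {effect_size:.2f}")
```

Under these assumptions the p-value is roughly 0.72 with 100 observations per group and far below 0.05 with one million, while the difference stays at exactly one unit per month. The sample size changed the significance, not the business value.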
Sixth Mistake: Drawing The Wrong Conclusion
This may be the most common and at the same time most detrimental error that businesses make. The worst part is that it’s often done unknowingly. For example, data may show that consumers are buying your product, but what the data may actually be showing is that consumers are purchasing a product you offer. These two things are not the same. While the data is objective in showing the number of widgets sold or percent of market share, any conclusion that those consumers are brand loyal is a conjecture because purchasing behavior and brand loyalty are not the same thing. As an example, think about most cable and Internet providers: they have a large number of “loyal” customers because options are limited and people rely on these services. If we looked at the data for those companies, the purchasing behavior would mimic brand loyalty. Enter Google Fiber, and now in some areas there’s another option that changes the scene. What had appeared as brand loyalty turns out to be customers choosing the least irksome option. In this situation, once again context is everything. If the existing cable companies had looked to experts in equity theory, they would have asked different questions in their satisfaction surveys and realized that they were only keeping customers because the hassle of changing providers was deemed greater than any benefits derived from a change in services. Marketing research mistakes provide numerous examples of drawing improper conclusions because it’s a complex aspect of human purchasing behavior. Even if your marketing department understands purchasing behavior, your team may interpret the data according to general principles. In this situation, it would take someone who understands other psychological factors such as decision making and equity. Similarly, I’ve seen medical experts interpreting medical claims data without the expertise of a healthcare economist, who understands the impact of the benefits structure on healthcare behavior.
The physicians and medical experts often push for education to change behavior, but knowing that something is good for people doesn’t make them do it if it’s going to cost them money out of pocket.
There will always be patterns in data. Some of these are meaningful and some are noise. Anyone looking at the numbers without the full understanding of which metrics influence one another may infer relationships that don’t exist or miss key signals within the findings. Failure to understand the business or culture in which the data was collected can result in some very poor decisions, and just because someone has knowledge in one area doesn’t mean that the person has expertise in all areas. Even individuals who are highly proficient in working with data may not have the necessary skills to assist in making informed decisions for your business. Context is everything.
Seventh Mistake: Making Inferences Based On A Different Time Period
The last and possibly most prevalent error made with data involves the time period in which the data was collected. Data is not like wine or cheese: it doesn’t get better with age. In fact, it may even become useless. How quickly data spoils depends upon your industry and how rapidly things change there. For example, it’s generally considered appropriate to use one or two years of data when constructing a predictive model. The reason so much data is used is that there are random fluctuations (for example, buying behavior can change for a brief period following a national news story), as well as seasonal fluctuations (such as people catching colds more often in the winter or sales of fireworks in the US suddenly increasing in late June and early July). Using a full year or two of data gives a better perspective if what we’re measuring is relatively stable year over year. However, while we can deal with some volatility this way, it also creates a false assurance that what happened in the past is likely to continue in the future. A good example of this is the current presidential election. I’ve watched several experts predict that the only way that the Democrats could win an election in November is if the current vice-president ran against the Republican challenger. These experts had aggregated data over the past 200 years showing the outcomes of elections and calculated the probability of each party winning. However, what isn’t included in this 200 years’ worth of data is a situation in which a former First Lady ran for office. Also lacking is any trend that shows the impact of the Internet on candidate popularity, such as we’re seeing with Bernie Sanders. Sanders has gained momentum never before seen in previous elections, which relied on newspapers, radio, or television in order to generate name recognition significant enough to be a contender in an election. Therefore we have two novel situations that call into question the predictions made on past elections.
The world has changed over the last 200 years, so while aggregating this data on a national population for the same type of election (presidential) may appear valid, it might be more appropriate to study the outcomes of some more recent local elections to make an accurate prediction. It remains to be seen who will win the election, but I’m not putting my faith in the predictions of experts who’ve used the antiquated data. If you use historical data to predict the outcomes of new products or other market trends, you may be making poor business decisions for the present time. Similarly, using data from one season to make inferences on a different season (for example, using data from a winter month to forecast product demand in a spring or summer month) can result in poor business decisions. The primary issue is that the past will predict the future only as long as the environment and other factors remain constant, so it’s important to understand the interaction of different environmental factors and how much of a change in these would invalidate the inferences being drawn from the data.
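The seasonal version of this mistake is easy to sketch. The monthly sales figures below are invented for a hypothetical summer-peaking product; the code simply compares two naive forecasts for July: one taken from a winter month of the same year, and one taken from July of the previous year.

```python
# Hypothetical monthly unit sales (January through December) for a
# product whose demand peaks in summer. Figures are made up.
year1 = [20, 22, 30, 45, 70, 120, 150, 140, 80, 40, 25, 20]
year2 = [22, 24, 33, 50, 75, 130, 160, 150, 85, 42, 27, 21]

JAN, JUL = 0, 6  # list positions for January and July

actual_july = year2[JUL]

# Forecast July demand from a winter month of the same year.
winter_forecast = year2[JAN]

# Forecast July demand from the same month one year earlier.
same_season_forecast = year1[JUL]

winter_error = abs(actual_july - winter_forecast)
same_season_error = abs(actual_july - same_season_forecast)

print("Actual July sales:", actual_july)
print("Forecast from January:", winter_forecast, "error:", winter_error)
print("Forecast from last July:", same_season_forecast,
      "error:", same_season_error)
```

In this made-up series the winter-based forecast misses by 138 units and the same-season forecast by 10. The same-season forecast only holds up because the two years behave alike; if the environment changes between them, last July is no safer a guide than January.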
There have always been ways to misuse data, and the more data we have, the more likely we are to use it inappropriately. So, with the advent of big data also comes the advent of big mistakes being made with big data. The real problem is the confidence we get when using data incorrectly because we believe that the decision is correct simply because it was data-driven. In fact, using the wrong data or interpreting it incorrectly is worse than not using data at all. The organizations that understand this and learn to use data correctly will be the ones to watch.
OIG Study on postal efficiency.
Amy Neftzger is a researcher who has been analyzing data and creating assessment tools for over 25 years. She has worked with companies such as Gallup, CMS, United Healthcare, and Optum. She has a Master’s degree in Industrial/Organizational Psychology and is also the author of both fiction and nonfiction, including several peer-reviewed publications.