The title of “Data Scientist” is a relatively recent phenomenon. Coined and popularized in the late ‘90s, the title did not see widespread adoption until the last decade, when the extreme proliferation of data and the need to analyze it forced companies to seek out the specialists best suited to the task. When I first entered the field a decade ago, the term “data scientist” didn’t even exist (don’t most scientists use data?). To illustrate, I started as an “engineer” characterizing and predicting wafer processes at a large semiconductor company. I then moved to a “decision support analyst” position at a financial company, prototyping and implementing production predictive fraud models for online ACH (Automated Clearing House) transactions for giant banks. Following that experience, I went to Allstate as a “predictive modeler,” analyzing and prototyping next-generation pricing models for their main book of business, before joining, as a “Scientist,” a company that created production pricing models covering hundreds of billions of dollars in sales for the biggest retailers in the world. After that company was acquired by IBM, I eventually ended up heading the algorithms team responsible for retail science there, before finally landing at Guidewire as the Director of Data Science. Yes, this is my first “Data Science” job, even though I have been doing the same thing with the same basic toolsets across a variety of industries for the past decade. Though the companies I left eventually started using the “data science” title as well, the term is a relatively new name for a very old and established tradition: applying statistical models to data for prediction or classification. Most people I know are surprised that “data science” has been going on for centuries, since the popularization of regression in the early 19th century. So why did this profession become such a sought-after field, with job postings growing 15,000% in 2011? The answer likely lies in the increased stakes created by massive data availability and the computational hardware now capable of crunching and processing it.
There are roughly 4.4 trillion gigabytes of digital data in existence today, and that staggering number is expected to grow tenfold in the next decade, according to a report commissioned by EMC. To put that number into context, in ten years there will be more bits of data than known stars in the universe. This data explosion is the result of the recent emergence of the digital world, in which social networks, online commerce and services, new digital technologies, and the archiving of older formats have filled servers with more accessible data than we have ever had before. Buried in this data is seemingly limitless potential for monetization, to be unlocked either through new products or through better information than competitors possess. Facebook, essentially a service that collects intimate data about its users for direct marketing in exchange for the use of a social network platform, is currently worth over $200 billion. In retail, companies drive billions in sales based on marketing campaigns that try to characterize customer segments or price-sensitivity behavior through spending habits. Closer to home, an insurance carrier’s core business is knowing enough about its customers to price a policy correctly and avoid high-risk underwriting (adverse selection), which can otherwise result in a book of business that bleeds profits due to suboptimal pricing models. Additionally, identifying fraudulent claims in your book requires as much data as possible to fight the nearly $50 billion in non-health insurance fraud the FBI estimates occurs annually. Clearly, the stakes are high.
This drive for better information, together with the availability of data, has changed the traditional landscape of statistical analysis and data science. Machine learning algorithms parallelized to operate on petabytes of data have become staple tools of data scientists. Though they initially look very different from the traditional regression and classification models in use since the early 19th century, expert data scientists who understand the algorithms in depth recognize the similarities. Support vector machines look like estimation-efficient versions of logistic-regression-based discriminant analysis; feed-forward neural networks look like regressions on top of complex transformations, given the right architecture; the list goes on. The reality is that machine learning is predominantly grounded in statistical principles that require optimization techniques to solve, so the current iteration of “data science” is just a natural evolution of more complex statistical tools, enabled by the vast improvement in the computing resources available to crunch this data. The job, however, requires a rigorous dedication to the underlying math, which helps avoid the “black box” mentality that it’s not important to understand how something works as long as you can shove data into a program and have it return results. Sadly, roughly 75% of the data science candidates I have come across do not understand what they are doing. More disturbingly, they don’t see the need. I have actually had candidates tell me that it was pointless for them to understand the algorithm, because the computer already handled all the details. Not understanding the algorithms, and therefore how the data needs to be cleaned or used, is not only suboptimal; it can result in spectacular failures once production systems are implemented. This is especially true when the data is dirtier and contains less relevant information than expected, which tends to be the norm rather than the exception.
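To make the comparison concrete, here is a minimal sketch in LaTeX notation; the symbols are my own and not taken from any particular reference. With labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, weights $w$, and a regularization constant $\lambda$, regularized logistic regression and the linear support vector machine fit exactly the same linear score $w^\top x_i$ and differ only in the loss being minimized:

\[
\min_{w} \; \sum_i \log\!\left(1 + e^{-y_i \, w^\top x_i}\right) + \lambda \lVert w \rVert^2 \qquad \text{(logistic regression: log loss)}
\]
\[
\min_{w} \; \sum_i \max\!\left(0,\; 1 - y_i \, w^\top x_i\right) + \lambda \lVert w \rVert^2 \qquad \text{(linear SVM: hinge loss)}
\]

In the same spirit, a one-hidden-layer feed-forward network with output $\hat{y} = \sigma\!\left(w^\top \phi(W x + b)\right)$, where $\phi$ is a nonlinearity, is a logistic regression applied to learned transformations $\phi(W x + b)$ of the raw inputs. Neither formulation is exotic: each is a statistical estimation problem handed to an optimizer, which is precisely the continuity with classical statistics described above.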
The takeaway here is that “data science” in its current iteration is a natural progression of what people have traditionally been doing for centuries. The availability of digital data and the new hardware infrastructure have made the benefits of successfully analyzing data worth billions of dollars. Often overlooked, however, is the fact that the cost of entry into the field is higher than most realize in terms of education and knowledge. Most mature data science teams in large organizations consist largely of Ph.D.s, who tend to cluster heavily in the statistics and applied math disciplines, because the complexity of the algorithms requires a great deal of knowledge to utilize effectively for analytics. People who have spent 4-6 years after their undergraduate degrees investigating and researching these algorithms and tools in depth have a much higher likelihood of successfully leveraging data. Obviously, there are exceptions (brilliance is brilliance, regardless of education). Organizations that take shortcuts to finding and developing this expertise often find their analytics lacking, without a clear reason why. I plan to address common shortcuts organizations attempt to take, and best practices for leveraging analytics, in a future blog post.