A few months ago, FlowingData highlighted the fact that “data scientist” has surpassed “statistician” on Google search trends for the first time. Since the latter has been at a stable level for over five years, this shows an upswing of interest in the data scientist category. Prior to 2008, “data scientist” did not even register as a search term, whereas in the last 18 months it has experienced 300% growth. Data science has grown tremendously, to the point of being a relatively mainstream concept for businesses; but with this growth comes misunderstanding, and not just in the business community.

“Data Science vs. Analytics—Approaches to Problem Solving” by Nick Kolegraff posits that data science is fundamentally about how you ask questions. According to Kolegraff, it’s the greater emphasis on asking “why” that distinguishes data science from data-related software development, business analysis or straightforward data analysis. So if you work in this realm but your approach doesn’t hit on the “why” question regularly, you might want to revisit calling yourself a data scientist.

Growing pains around how to define a data scientist are not unexpected; when an industry grows, career opportunities expand with it, and the vocabulary around it struggles to keep up. Fortunately, the folks who plan the O’Reilly Strata Conferences have been tackling these questions head-on with surveys and interviews. Last year, O’Reilly produced a report titled “Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work.” As the data scientist job description proliferated, this was a first attempt to create a typology of data scientists. The report identified four main types of data scientists, with subcategories within each:

  • Data Businesspeople: encompassing entrepreneurs, leaders and people in more traditional business roles
  • Data Creatives: dilettantes, hackers and more artistically inclined data designers
  • Data Researchers: statisticians, scientists and researchers, often with a more academic background
  • Data Developers: the integration, engineering and programmer base of the data world

It’s clear that O’Reilly has an inclusive view of what data science comprises, and given that the term is still up for grabs, their attempt to tackle the issues around classification is commendable.

Looking for further clarity, we showed this typology to Dan Theirl, co-founder of Rubikloud, a MaRS client from the Investment Accelerator Fund (IAF) portfolio. Theirl says their small team encompasses three of these categories:

“Personally, I am a Data Businessperson, one of our team members comes from academia and fits the Data Researcher type, and one team member fits the Data Developer type as an applied math graduate and great programmer.”

O’Reilly is not the only organization that has tried to untangle the definition of a data scientist. Data Community DC has come up with a related typology that combines skills and outputs in the form of the Pyramid of Data Science. The pyramid focuses on the processing of a data set, or its “life cycle,” if you will: from a twinkle in a data scientist’s eye all the way to a finished product.

[Figure: Pyramid of Data Science. Source: Data Community DC]

There is a clear flow from the bottom—the nitty-gritty of data collection, warehousing and cleaning—to the ultimate data product at the top. Many data-driven MaRS clients operate in the middle of the data science pyramid, with a focus on feature extraction and knowledge extraction. Since existing software tools for visualizations are so varied and robust, and because data gathering has become much simpler with the proliferation of APIs and open data, the meat of data-based business often occurs in the middle.

While data science may touch on all of these stages in the data life cycle, it’s not always the data scientist performing these functions. For example, a graphic designer might put the finishing touches on visualizations, and a database administrator might handle the storage and integration. So although this is the “Pyramid of Data Science,” it is wrong to think that only data scientists are involved. However, you can see how stakeholders one step away from the process might have limited understanding and respond to any need on the pyramid by throwing a data scientist at it.

Vincent Granville, founder of AnalyticBridge, has yet another way to segment the field. An excerpt from his blog post “Vertical vs. Horizontal Data Scientists” makes his distinction clear:

  • Vertical data scientists have very deep knowledge in some narrow field: A software engineer with years of experience writing Python code applied to API development. Or a database expert with strong data modeling, data warehousing, graph databases, Hadoop and NoSQL expertise.
  • Horizontal data scientists are a blend of business analysts, statisticians, computer scientists and domain experts. They combine vision with technical knowledge…. They can design robust, efficient, simple, replicable and scalable code and algorithms.

In his post, Granville makes a value judgment: he sees the “horizontal” data scientist as the more realistic and practical fit for data analysis in a business environment.

Some data-driven startups we spoke with had mixed opinions about the horizontal vs. vertical view. Jonathan Latsky, founder of Envirolytic Insights, also a MaRS client, described the data scientist role similarly, as “50/50,” combining technical expertise with business skills. He is looking for a data scientist with a big-picture vision of the business’s process and clients’ needs.

On the other hand, Jonathon Polak, founder of the 1Datapoint health firm housed in the MaRS Commons, wants a more specific data scientist. His ideal data scientist has a doctorate in mathematics from a top university and works on the development of new algorithms; other employees or contractors could do any associated programming and design work. Dan Theirl at Rubikloud agreed:

“Currently our data scientists are primarily model builders, which means they need a deep understanding of machine learning and how to apply specific algorithms to the data set in order to build and test a usable model. Yes, they must know how to best represent the data visually, but they’re not responsible for the look and feel and colours of the visualizations; they’re not the final UX designers.”

There seems to be as much a need for “vertical” data scientists as for “horizontal.”

Even so, too much emphasis on software skills can add to the confusion. Talks and workshops on data science can get bogged down with software and programming. At a recent presentation about data visualization theory, audience members repeatedly asked about software choices, to the frustration of the presenter, who wanted to keep the talk platform-agnostic. However, if software skills directly affect salary, as the 2013 Data Science Salary Survey by O’Reilly suggests, then throwing technical requirements into the mix is inevitable. The report concludes that:

“Tools that correlate with higher salary are scalable and generally open source; they are often script-based or built for machine learning.… Perhaps just as interesting is that some of the traditional, popular tools such as Excel and SAS were not used as widely as R and Python. This might be food for thought for those data analysts who have thus far resisted learning how to code or moving beyond query-based data tools.”

This situation may be good news to some, but data scientists may need to bring other “horizontal” skills to the table in addition to software know-how. Of course, the problem is that some of those skills may be harder to quantify.

If you are hiring, one way to identify the real data scientists is to probe for those soft skills and tailor job postings and interviews accordingly. Another way is to make sure this understanding, and the vocabulary that comes with it, reaches those who actually do the hiring, such as HR professionals, departmental managers and directors. And finally, if you are an aspiring data scientist, you have plenty of choices. Just make sure you know what kind of data scientist you want to be.

Dr. Adam Jacobs

Adam is an analyst, statistician and data designer for the innovation economy team at Data Catalyst. He is interested in new data visualization tools and open source tools for research. Adam joined MaRS in 2013 after working in data development at a geographic information systems (GIS) software company in Toronto.