
What is Data Science?

2018/06/14

Many people new to data science might believe that this field is just about R, Python, Hadoop, SQL, and traditional machine learning techniques or statistical modeling. Below you will find fundamental articles that show how modern, broad and deep the field is. Some data scientists actually do none of the above. In my case, I don't even code; instead, I make various applications talk to each other in a machine-to-machine communication framework. It is true, though, that most data scientists use R, Python and Hadoop-related systems.

 

The article on deep data science (see below) shows that data science is also about automating the tasks that many people (calling themselves data scientists) do routinely. And it can be done using very little mathematical / traditional statistical science. I like to put it this way: data science is about automating data science, and much of what I do consists of designing systems that automate what I do.

 

Many of the articles below are a few years old, but their content is more relevant today than ever. They should help beginners get a better idea of what data science is. Some are technical, but most can be understood by the layman.

 

Categories of data scientists

 

Those strong in statistics: they sometimes develop new statistical theories for big data that even traditional statisticians are not aware of. They are experts in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, predictive modeling and other related techniques.
Those strong in mathematics: NSA (National Security Agency) or defense/military people working on big data, astronomers, and operations research people doing analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization) as they collect, analyze and extract value out of data.
Those strong in data engineering: Hadoop, database/memory/file systems optimization and architecture, APIs, Analytics as a Service, optimization of data flows, data plumbing.
Those strong in machine learning / computer science (algorithms, computational complexity).
Those strong in business: ROI optimization, decision sciences, and some of the tasks traditionally performed by business analysts in bigger companies (dashboard design, metric mix selection and metric definitions, ROI optimization, high-level database design).
Those strong in production code development and software engineering (they know a few programming languages).
Those strong in visualization.
Those strong in GIS, spatial data, data modeled by graphs, and graph databases.
Those strong in a few of the above. After 20 years of experience across many industries, in big and small companies (and lots of training), I'm strong in stats, machine learning, business and mathematics, and more than just familiar with visualization and data engineering. This could happen to you as well over time, as you build experience. I mention this because so many people still think that it is not possible to develop a strong knowledge base across multiple domains that are traditionally perceived as separate (the silo mentality).
Indeed, that's the very reason why data science was created.

 

Understand Data:
Data is useless, and can be misleading, without context. Data needs a story to tell a story. Data is like a color that needs a surface to prove its existence: red, for example, cannot show itself without a surface, so we see a red car, a red scarf, a red tie, red shoes or red something. Similarly, data needs to be associated with its surroundings, context, methods and the whole life cycle in which it is born, generated, used, modified, executed and terminated. I have yet to find a “data scientist” who can talk to me about the “data” without mentioning technologies like Hadoop, NoSQL, Tableau or other sophisticated vendors and buzzwords. You need to have an intimate relationship with your data; you need to know it inside out. Asking someone else about anomalies in “your” data is like asking your wife how she got pregnant. One distinct edge we had in our relationship with the UN, and in the software to secure schools from bombings, was our command of the underlying data. While the world talks about it using statistical charts and figures, we are the ones back home who experience it and live it in our daily lives. The importance, the details and the appreciation of this data that we have cannot be found anywhere else. We are doing the same with our other projects and clients.
