Practical statistics for data scientists
Statistician and data scientist
A data scientist is only as good as the questions they ask. They should ask probing questions such as:
Does our money grow as we get older?
How much do I have to pay for this phone?
How can Google "guess" my search query?
Statistics is the art of attaching numbers to these questions so that the answers can emerge. Connecting mostly qualitative questions to quantitative evidence is the essence of statistics. One of my favorite descriptions of a data scientist is: "A data scientist is someone who knows more statistics than a programmer and more programming than a statistician." Data science sits in the sweet spot where computer programming, statistics, and analysis overlap.
Statistics is a set of principles and procedures for obtaining information to support decision making in the face of uncertainty. When someone asks me, "What kind of statistics do I need to know to be a good data scientist?", I say: "Don't worry about learning statistics for the sake of 'data science'. Learn statistics because it is the art of discovering the secrets hidden in data sets."
As data scientists, we are solving a problem or helping someone make a decision based on the available data. So what do we do as data scientists to achieve this?
We define the problem statement (by asking the right questions).
We then collect the right type of data to perform our analysis.
We try to explore the data to see what it tells us.
We use various techniques to draw conclusions from data or predict some answers to a problem.
Finally, we validate that our predictions are reasonably accurate (using scientific methods, of course!).
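The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed method: the question ("Do savings grow with age?"), the synthetic data, and the helper names summarize, fit_line, and mean_abs_error are all made up for this example, and a straight-line fit is just one of many techniques that could fill the fourth step.

```python
import random
import statistics

# Step 1: define the problem statement (hypothetical question for this sketch):
# "Do savings grow with age?"

# Step 2: collect data. Synthetic, made-up records stand in for a real source.
random.seed(0)
ages = [random.randint(20, 65) for _ in range(200)]
savings = [1000.0 * age + random.gauss(0, 5000) for age in ages]


# Step 3: explore the data with simple summary statistics.
def summarize(values):
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Step 4: draw a conclusion by fitting a least-squares line y = slope * x + intercept.
def fit_line(xs, ys):
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Step 5: validate predictions on data the model has not seen.
def mean_abs_error(xs, ys, slope, intercept):
    errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    return statistics.mean(errors)


# Hold out the last 50 records for validation.
train_x, test_x = ages[:150], ages[150:]
train_y, test_y = savings[:150], savings[150:]

slope, intercept = fit_line(train_x, train_y)
model_error = mean_abs_error(test_x, test_y, slope, intercept)

# Baseline: always predict the average savings seen in the training data.
baseline_error = mean_abs_error(test_x, test_y, 0.0, statistics.mean(train_y))

print("Exploration:", summarize(savings))
print("Fitted slope (savings per year of age):", round(slope, 1))
print("Held-out error, model vs. baseline:", round(model_error), round(baseline_error))
```

Comparing the model against a naive baseline on held-out records is one simple way to check that the predictions carry real information rather than just restating the average.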
To look more closely at the "right type" of questions and data, let's take an example. Suppose we are trying to answer the research question "How will the world's educational landscape change in the next 30 years?" To answer this big question, we need to ask smaller questions like:
What is the current scenario of education worldwide?
What is the percentage of people who finish high school or college?
What are the latest trends in the global labor market and how will they affect the education sector?
Next, we need to collect the right data and information to answer these smaller questions. For example, to understand the current landscape, we may collect data from the UNESCO and UNICEF websites, and we may use LinkedIn to collect some data on the latest trends in the labor market. Note that most of these data sets are publicly available. Defining a problem statement in this way gives us clarity on how to approach and solve the "big" question systematically.
To do all of the above, a data scientist first needs an adequate understanding of the domain to which the problem statement belongs. For example, a data scientist who tries to answer the question "Why is this particular summer so hot compared to the last 50 years?" must have a solid understanding of climate change and environmental science. Second, every step except the first involves handling large amounts of data in digital form. A data scientist must be able to acquire data, clean it, read it, analyze it, and apply methods to get answers, all in a very short period of time. To do this, they must know computer programming. The listed steps are not carried out by hand: the computer performs them as directed by the data scientist.
By creating a data science project report, a data scientist can review the overall project and showcase their ideas.