All Things Data @kanjasaha - Tumblr Blog

The joy of data.

My Favorite Quote

In God we trust, all others must bring data.

W. Edwards Deming

My Favorite Movies

These are a few of the movies (in no particular order) that I enjoyed more than once. Hope you like them.

1. The Lives of Others (German)

2. The Secret in Their Eyes (Spanish)

3. The Invisible Guest (Spanish)

4. Kahaani (Hindi)

5. Vinci Da (Bengali)

6. Amour (French)

7. The Wings of the Dove (English)

8. It's a Wonderful Life (English)

9. Thelma & Louise (English)

10. One Flew Over the Cuckoo's Nest (English)

11. Zindagi Na Milegi Dobara (Hindi)

12. Black Panther (English)

13. Casablanca (English)

14. It Happened One Night (English)

15. The Shawshank Redemption (English)

16. The Good, the Bad and the Ugly (English)

17. The Finest Hours (English)

18. Hell or High Water (English)

19. Nothing to Lose (English)

20. The Curious Case of Benjamin Button (English)

21. Finding Nemo (English)

22. The Godfather (English)

23. Rashomon (Japanese)

24. Alpha (English)

continued...

#world movies

And those who were seen dancing were thought to be insane by those who could not hear the music. ― Friedrich Nietzsche

#perception

Whatever you can do or dream you can, begin it; Boldness has genius, power, and magic in it.

Johann Wolfgang von Goethe

#vision

Life Cycle of a Machine Learning Project

Today, the term Machine Learning comes up in every other discussion. In fact, in the Bay Area, it is a staple. We hear about unicorn start-ups as well as established organizations solving major challenges using Machine Learning. Then there are many more companies that are still figuring out what it takes, and how long it takes, to implement Machine Learning models in their organization. This article is an effort to share my insight into the process behind this cutting-edge phenomenon, a major paradigm shift from the traditional rule-based system.

Before I go into the details, let me start with how Machine Learning differs from a rule-based system. In a rule-based system, decisions are made based on a set of rules built on a set of facts by human experts, while Machine Learning decisions are based on a function (a model) built on patterns extracted by machines from data. A rule-based system is considered rigid as it cannot make a decision for a case that its rules do not cover. Machines, on the other hand, can make an estimate based on similar patterns found in the historical dataset.

Let me take a business case to explain this further. Customer churn, for example, is a common business challenge companies encounter on an ongoing basis. We spend our marketing dollars to acquire customers, they come onboard, and after a few months they leave for reasons unknown. In such scenarios, a rule-based system may decide to send a promotional offer after a certain number of days or months of inactivity (based on the company's definition of churn), but the chances of customers returning are pretty low. They may have moved on to a different company or lost interest in the product. In a rule-based system, there is no easy way to predict if and when a customer is going to churn and to intervene in time. Machine Learning, however, looks at the spending patterns, demographics, and psychographics of past customers and tries to find similar patterns in a new customer to predict their activity.

This information can alert the business to take action and save the customer from churning. This intervention makes a significant difference in the customer experience and impacts the business metrics. In order to identify and implement Machine Learning in an organization, we need to make significant changes to the processes that exist in our traditional rule-based systems:

1. Define a clear use case with a measurable outcome

2. Integrate enterprise-wide data seamlessly

3. Create a lab environment for experimentation

4. Operationalize successful pilots and monitor them

Essentially, embrace the paradigm shift: The ML Mindset.

Let’s see how we incorporate “The ML Mindset” in a machine learning workflow. This workflow is implemented by domain experts, data engineers, data scientists and software engineers contributing to various tasks. These days, however, companies are looking for individuals who have knowledge of the whole workflow and are known by the title full-stack data scientist. The following diagram shows all the tasks a Full Stack Data Scientist performs to complete a project. While looking for inspiration to draw an appropriate flowchart of the ML workflow, I came across this AWS presentation. I made a few modifications to the flowchart that I believe reflect the essence of an end-to-end machine learning project.

1. Business Problem: The broad ML technique selection/elimination process starts at the very beginning of the Data Science/Machine Learning project workflow, when we define the business goal. Here, we understand the business challenges and look for projects that will have a major impact, whether it is immediate or long term. Many a time, existing business reports will indicate the challenge, and the goal will be to improve a metric or KPI. Other times, a new business initiative will drive the project. In our specific example of customer churn, the larger goal may be to increase revenue, and one of the strategies may be to improve customer retention. The immediate business goal for this Machine Learning project is then to predict churn with higher accuracy (say, from 10% to 40%). Although I am making a general estimate here, a business arrives at this number after diving into all the KPIs impacting the business, and that is beyond the scope of this post.

2. ML Problem Framing: We then decide on the basic Machine Learning task. When working with structured/tabular data, the task at hand is primarily one of the following: supervised, unsupervised, or reinforcement learning. There are many articles that explain each task and its application. Among them, I found two AI and ML flowcharts by Karen Hao from MIT Technology Review, which are all-inclusive and simple to understand. At the end of this stage, we should know the broad ML technique (Supervised/Unsupervised, Regression/Classification/Forecasting) to be implemented for the project and have a good understanding of data availability, the model evaluation metrics, and the target score a model must reach to be considered reliable. Customer churn prediction is a supervised classification task where we have historical data of customers who are labeled into 2 classes: churned or not churned. For a supervised classification task, evaluation metrics are based on the confusion matrix.

3. Data Collection & Integration: In business, collecting data is like a treasure hunt, with all the joy and agony of it. The process is complicated and painstaking, but eventually rewarding. Often enough, we find crucial data stored in a spreadsheet. Retailers with both an online and a physical presence sometimes have promotional flyers in the store that are not uploaded to the data repository. Model accuracy relies heavily on data size and, as I mentioned earlier, it is essential to integrate enterprise-wide data for Machine Learning. For customer churn, we will need to collect data from various business domains, starting with recency, frequency, monetization, tenure, acquisition channel, promotions, demographics, psychographics, etc.

4. Exploratory Data Analysis: This is where knowledge of data and algorithms helps us decide on the initial set of algorithms (preferably 2-3) that we would like to implement. EDA is the process of understanding our data set through statistical summaries, distributions, and the relationships between features and targets. It helps us build intuition on the data. I would like to emphasize the word intuition. While developing intuition, refrain from drawing a conclusion. It is very easy to get carried away and start making assumptions without running a data set through a model. When we perform EDA, we are looking at 2 variables at a time (we are performing bi-variate analysis). Our world, on the other hand, is multivariate: think of how a seedling's growth rate depends on the sun, water, minerals, etc. Statistical models and ML algorithms implement multivariate techniques under the hood that help us draw conclusions with a certain degree of accuracy. No single factor is responsible for the change. One or two factors may be the driving factors, but there are still many others behind the change. Do keep this thought in mind during EDA. This step is essential and the guidelines are similar for all datasets. In fact, you can create a template and use it for all your projects.
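As a small illustration of the bi-variate view described in step 4, the sketch below computes the Pearson correlation between each feature and the churn label. The column names and the tiny data set are made up for illustration, and a correlation found this way is only a hint for building intuition, not a conclusion.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical customer data: one list per feature, plus the churn label.
features = {
    "days_since_last_order": [3, 40, 7, 55, 12, 60],
    "orders_last_quarter": [9, 1, 6, 0, 5, 1],
}
churned = [0, 1, 0, 1, 0, 1]

# One feature against the target at a time: this is the bi-variate view.
correlations = {name: pearson(values, churned) for name, values in features.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

A loop like this is the kind of thing that can live in a reusable EDA template.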

5. Data Preparation: Our observations made during Exploratory Data Analysis give guidance to various data processing steps. This includes removing duplicates, fixing misspelled words, ensuring data integrity, aggregating categorical values with limited observations, dropping features with sparse data, imputing missing data for important features, handling outliers, processing and integrating semi-structured & unstructured data.
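Three of the steps listed above (removing duplicates, imputing missing values, and aggregating rare categories) can be sketched in plain Python as follows. The field names `spend` and `channel`, the median imputation, and the `min_count` threshold are assumptions chosen for illustration; a real project would pick them based on the EDA findings.

```python
import statistics
from collections import Counter

def prepare(rows, min_count=2):
    """Deduplicate rows, impute missing spend, and group rare channels."""
    # 1. Remove exact duplicates while preserving order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not modified

    # 2. Impute missing numeric values with the median of the known ones.
    known = [r["spend"] for r in unique if r["spend"] is not None]
    median_spend = statistics.median(known)
    for r in unique:
        if r["spend"] is None:
            r["spend"] = median_spend

    # 3. Aggregate categories with few observations into "other".
    counts = Counter(r["channel"] for r in unique)
    for r in unique:
        if counts[r["channel"]] < min_count:
            r["channel"] = "other"
    return unique

raw = [
    {"id": 1, "spend": 120.0, "channel": "email"},
    {"id": 1, "spend": 120.0, "channel": "email"},   # exact duplicate
    {"id": 2, "spend": None, "channel": "email"},    # missing spend
    {"id": 3, "spend": 80.0, "channel": "flyer"},    # rare channel
    {"id": 4, "spend": 100.0, "channel": "search"},
    {"id": 5, "spend": 60.0, "channel": "search"},
]
clean = prepare(raw)
```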

6. Feature Engineering: It is a well-known fact that Data Scientists spend the majority of their time exploring and preparing the data and engineering features before applying a model. Of the three, Feature Engineering is the most challenging and can make a big difference in model performance. A few common techniques include transforming data using the log function or normalization, creating or extracting new features from the existing data, feature selection, and dimensionality reduction. Although the limelight of the workflow is on model training and evaluation, I would like to reiterate that the previous three steps (Exploratory Data Analysis, Data Preparation and Feature Engineering) consume about 80% of the total time and are closely tied to the success of a Machine Learning project.

7. Model Training & Parameter Tuning: Equipped with the list of 2-3 algorithms from exploratory data analysis (step 4) and the transformed data (steps 5 & 6), we are ready to train the model. For each algorithm, we select various ranges of hyperparameters to train with and choose the configuration that yields the best model score. There are various algorithms (grid search, random search, Bayesian optimization) available for parameter tuning. We will use Hyperopt, one of the open-source libraries used to optimize searching the hyperparameter space, using the Bayesian optimization technique. We then compare the model evaluation metrics (precision, recall, F1, etc.) for each of the shortlisted algorithms on the training data and validation data with the best hyperparameters. Besides scoring well on these performance measures, a good model will perform similarly (generate similar scores on evaluation metrics) on both the training and validation datasets. Understanding and interpreting the relevant model evaluation metrics is the key to success in this step.
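To make the tuning loop in step 7 concrete, here is a minimal sketch of grid search, the simplest of the tuning strategies mentioned above. It is not Hyperopt and does not use Bayesian optimization. The toy k-nearest-neighbours classifier, the synthetic churn data, and the candidate values of `k` are all assumptions made for illustration.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (one feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Share of rows in `data` that the k-NN model labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

def grid_search(train, valid, candidates):
    """Score every candidate hyperparameter on validation data, keep the best."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_rows(n):
    """Synthetic rows: (days_inactive, churned), with noisy labels."""
    rows = []
    for _ in range(n):
        days_inactive = rng.uniform(0, 60)
        churned = 1 if days_inactive + rng.gauss(0, 10) > 30 else 0
        rows.append((days_inactive, churned))
    return rows

train, valid = make_rows(200), make_rows(100)
best_k, scores = grid_search(train, valid, [1, 3, 5, 11, 21])
print("best k:", best_k)
print("train accuracy:", accuracy(train, train, best_k))
print("validation accuracy:", scores[best_k])
```

Printing the training and validation scores side by side is the check described above: a large gap between the two suggests the model is overfitting.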

8. Model Evaluation: We then compare the model evaluation metrics (RMSE, R squared, AUC, precision, recall, F1, etc.) for each of the shortlisted algorithms on the validation data and test data with the best hyperparameters. Understanding and interpreting the relevant model evaluation metrics is the key to success in this step. Our expectation is that good models produce comparable results in validation and test. They won’t produce identical results, but the AUC/F1/precision/recall/RMSE scores on the test and validation sets will be close.

9. Model Deployment: Once we are content with the model outcomes, the next step is to run the model with test data and make its output available via API, web applications, reports, or dashboards. If the model is to work with streaming data, it is incorporated into applications through a Web API. If the result is to be delivered to business users for insight, the results are shared in dashboards or automated reports delivered via email. Operationalization involves up-front investment in systems that smooth the deployment, maintenance, and adoption of whichever data processes we choose to employ. It is worth the extra effort to avoid runtime failures.

10. Monitoring Drift & Decay: Monitoring production models is different from monitoring other applications. A product recommendation model won’t adapt to changing tastes. A loan risk model won’t adapt to changing economic conditions. With fraud detection, criminals adapt as models evolve. Data science teams need to be able to detect and react quickly when models drift. As we detect drift and decay, we are back at the beginning of the cycle, where we may adjust the business goal, collect more data, and repeat the cycle.

11. Delivering Model Output: When a model output is not directly consumed by a web application, it is often used to deliver business insights through a dashboard or report. One of the most difficult tasks of machine learning projects is explaining a model’s outcomes to an audience. Data visualization tools like Tableau or Google Data Studio are very helpful in building storylines to share insights from Data Science work.
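One way to put a number on the data drift described in step 10 is the Population Stability Index (PSI), which compares how a feature is distributed in the training data with how it is distributed in recent production data. PSI is my choice of illustration here, not something the workflow above prescribes. The bin count, the floor used for empty bins, and the sample values are assumptions.

```python
import math

def psi(expected, actual, bins=5):
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clip values outside the baseline range
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature: average basket size at training time vs. in production.
baseline = [10, 12, 11, 13, 12, 14, 11, 10, 13, 12]
unchanged = list(baseline)
shifted = [v + 5 for v in baseline]

print("PSI, same distribution:", psi(baseline, unchanged))
print("PSI, shifted distribution:", psi(baseline, shifted))
```

A PSI of zero means the two distributions fall into the bins identically, and larger values mean a larger shift. The level at which to trigger retraining is a judgment call for each team.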

A Machine Learning cycle tends to take between 3 and 6 months, followed by ongoing maintenance. ML is evolving, and the cycle length will perhaps continue to shrink with automation, but the basic tasks in the workflow stay the same. I encourage you to embrace the ML Mindset. Take a look at the current projects in your team/organization and think of ways to integrate Machine Learning that will impact your business metrics significantly.

10 posts and many more!

#10 posts #tumblr milestone

Preparing for a Data Science/Machine Learning Bootcamp

If you are reading this article, there is a good chance you are considering taking a Machine Learning (ML) or Data Science (DS) program soon and do not know where to start. Though it has a steep learning curve, I would highly recommend and encourage you to take this step. Machine Learning is fascinating and offers tremendous predictive power. If ML researchers continue with the innovations that are happening today, ML is going to be an integral part of every business domain in the near future.

Many a time, I hear, "Where do I begin?". Watching videos or reading articles is not enough to acquire hands-on experience and people become quickly overwhelmed with many mathematical/statistical concepts and python libraries. When I started my first Machine Learning program, I was in the same boat. I used to Google for every unknown term and add "for dummies" at the end :-). Over time, I realized that my learning process would have been significantly smoother had I spent 2 to 3 months on the prerequisites (7 to 10 hours a week) for these boot camps. My goal in this post is to share my experience and the resources I have consulted to complete these programs.

One question you may have is whether you will be ready to work in the ML domain after program completion. In my opinion, it depends on the number of years of experience that you have. If you are in school, just graduated or have a couple of years of experience, you will likely find an internship or entry-level position in the ML domain. For others with more experience, the best approach will be to implement the projects from your boot camp at your current workplace on your own and then take on new projects in a couple of years. I also highly recommend participating in Kaggle competitions and related discussions. It goes without saying that one needs to stay updated with recent advancements in ML, as the area is continuously evolving. For example, automated feature engineering is gaining traction and will significantly simplify a Data Scientist's work in this area.

This list of boot camp prerequisite resources is thorough and hence, long :-). My intention is NOT to overwhelm or discourage you but to prepare you for an ML boot camp. You may already be familiar with some of the areas and can skip those sections. On the other hand, if you are in high school, I would recommend completing high school algebra and calculus before moving forward with these resources.

As you may already know, Machine Learning (or Data Science) is a multidisciplinary study. The study involves an introductory college-level understanding of Statistics, Calculus, Linear Algebra, Object-Oriented Programming(OOP) basics, SQL and Python, and viable domain knowledge. Domain knowledge comes with working in a specific industry and can be improved consciously over time. For the rest, here are the books and online resources I have found useful along with the estimated time it took me to cover each of these areas.

Before I begin with the list, a single piece of advice that most find useful for these boot camps is to avoid going down the rabbit hole. First, learn how without fully knowing why. This may be counter-intuitive, but it will help you learn all the bits and pieces that work together in Machine Learning. Try to stay within the estimated hours (maybe 25% more) I have suggested. Once you have a good handle on the how, you will be in a better position to deep dive into each of the areas that make ML possible.

Machine Learning:

Machine Learning Basics - Principles of Data Science: Sinan Ozdemir does a great job of introducing us to the world of machine learning. It is easy to understand without prior programming or mathematical knowledge. (Estimated time: 5 hours)

Applications of Machine Learning - A-Z by Udemy: This course costs less than $20 and gives an overview of what business problems/challenges are solved with machine learning and how. This keeps you excited and motivated if and when you are wondering why on earth you are suddenly learning second-order partial derivatives or eigenvalues and eigenvectors. Just watching the videos and reading through the solutions will suffice at this point. Your priority is code and domain familiarity. (Estimated time: 2-3 hours/week until completion. If you do not understand everything fully, that is OK at this time.)

Reference Book - O'Reilly: Read this book after you are comfortable with Python and the other ML concepts mentioned here; it is not necessary for starting a program.

SQL:

SQL Basics - HackerRank: You will not need to write SQL as most ML programs provide you with CSV files to work with. However, knowing SQL will help you to get up to speed with pandas, Python's data manipulation library. Not to mention it is a necessary skill for Data Scientists. HackerRank expects some basic understanding of joins, aggregation function etc. If you are just starting out with SQL, my previous posts on databases may help before you start with HackerRank. (Estimated time: Couple of hours/week until you are comfortable with advanced analytic queries. SQL is very simple, all you need is practice!)

OOP:

OOP Basics - OOP in Python: Though OOP is widespread in the machine learning engineering and data engineering domains, Data Scientists need not have deep knowledge of OOP. However, we benefit from knowing the basics of OOP. Besides, ML libraries in Python make heavy use of OOP, and being able to understand OOP code and the errors it throws will make you self-sufficient and expedite your learning. (Estimated time: 10 hours)

Python:

Python Basics - learnpython: If you are new to programming, start with the basics: data types, data structures, string operations, functions, and classes. (Estimated time: 10 hours)

Intermediate Python - datacamp: If you are already a beginner Python programmer, devote a couple of weeks to this. Python is one of the simplest languages and you can continue to pick up more Python as you undergo your ML program. (Estimated time: 3-5 hours/week until you are comfortable creating a class for your code and instantiating it whenever you need it. For example, creating a data exploration class and calling it for every data set you analyze.)
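The data exploration class mentioned above could start out as small as the sketch below. The class name, its two methods, and the sample rows are hypothetical; the point is only that the same object can be instantiated for every new data set.

```python
import statistics

class DataExplorer:
    """Reusable first-look summary for a table stored as a list of dicts."""

    def __init__(self, rows):
        self.rows = rows

    def missing_counts(self):
        """Number of None values per column."""
        columns = self.rows[0].keys()
        return {c: sum(r[c] is None for r in self.rows) for c in columns}

    def numeric_means(self):
        """Mean of every column whose non-missing values are all numbers."""
        means = {}
        for c in self.rows[0].keys():
            values = [r[c] for r in self.rows if r[c] is not None]
            if values and all(isinstance(v, (int, float)) for v in values):
                means[c] = statistics.mean(values)
        return means

# Instantiate once per data set and call the same methods every time.
orders = [
    {"customer": "a", "amount": 20.0, "items": 2},
    {"customer": "b", "amount": None, "items": 1},
    {"customer": "c", "amount": 40.0, "items": 3},
]
explorer = DataExplorer(orders)
print(explorer.missing_counts())
print(explorer.numeric_means())
```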

Data Manipulation - 10 minutes to Pandas: 10 minutes perhaps is not enough but 10 hours with Pandas will be super helpful in working with data frames: joining, slicing, aggregating, filtering etc. (Estimated time: 2-3 hours/week for a month)

Data Visualization - matplotlib: All of the hard work that goes into preparing data and building models will be of no use unless we share the model output in a way that is visually appealing and interpretable to our audience. Spend a few hours understanding line plots, bar charts, box plots, scatter plots and time-series plots, which are generally used to present the output. Seaborn is another powerful visualization library, but you can look into that later. (Estimated time: 5 hours)

Community help - stackoverflow: Python's popularity in the engineering and data science communities makes it easy for anyone to get started. If you have a question on how to do something in Python, you will most probably find an answer on StackOverflow.

Probability & Statistics:

Summary Statistics - statisticshowto: A couple of hours will be sufficient to understand the basic theories: mean, median, range, quartile, interquartile range.
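For reference, all of these summary statistics are available in Python's built-in `statistics` module. The sample values below are arbitrary, and `statistics.quantiles` uses its default "exclusive" method here; other tools may compute quartiles slightly differently.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

mean = statistics.mean(values)
median = statistics.median(values)
value_range = max(values) - min(values)

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print("mean:", mean, "median:", median, "range:", value_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```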

Probability Distributions - analyticsvidya: Understanding data distribution is the most important step before choosing a machine learning algorithm. As you get familiar with the algorithms, you will learn that each one of them makes certain assumptions on the data, and feeding data to a model that does not satisfy the model's assumptions will deliver the wrong results. (Estimated time: 10 hours)

Conditional Probability - Khanacademy : Conditional Probability is the basis of Bayes Theorem, and one must understand Bayes theorem because it provides a rule for moving from a prior probability to a posterior probability. It is even used in parameter optimization techniques. A few hands-on exercises will help develop a concrete understanding. (Estimated Time: 5 hours)
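A short worked example may help with the move from prior to posterior. The probabilities below are invented for illustration: suppose 10% of customers churn, 80% of churners were inactive last month, and 20% of non-churners were inactive last month. Bayes' theorem then gives the probability that an inactive customer will churn.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A | B) from P(A), P(B | A) and P(B | not A) using Bayes' theorem."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Made-up numbers for illustration only.
p_churn = 0.10                  # prior: P(churn)
p_inactive_given_churn = 0.80   # P(inactive | churn)
p_inactive_given_stay = 0.20    # P(inactive | no churn)

p_churn_given_inactive = posterior(
    p_churn, p_inactive_given_churn, p_inactive_given_stay
)
print(f"P(churn | inactive) = {p_churn_given_inactive:.3f}")
```

With these numbers, observing inactivity raises the estimated churn probability from the 10% prior to roughly 31%.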

Hypothesis Testing - PennState: Hypothesis Testing is the basis of Confusion Matrix and Confusion Matrix is the basis for most model diagnostics. It is an important concept you will come across very frequently. (Estimated Time: 10 hours)
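To show what the confusion matrix looks like in practice, the sketch below counts its four cells from a list of actual and predicted labels and derives precision, recall and F1 from them. The labels are invented, and 1 stands for the positive class (for example, "churned").

```python
def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false alarm (Type I error)
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # miss (Type II error)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print("tp, fp, fn, tn:", tp, fp, fn, tn)
print("precision:", precision, "recall:", recall, "f1:", f1)
```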

Simple Linear Regression - Yale & Columbia Business School: The first concept most ML programs will teach you is linear regression and prediction on a data set with a linear relationship. Over time, you will be introduced to models that work with non-linear data but the basic concept of prediction stays the same. (Estimated time: 10 hours)
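The idea can be tried out in a few lines. The sketch below fits a line with the ordinary least squares formulas and uses it to predict a new value. The data points are invented and lie exactly on a line, so the fit is perfect; real data will not be this tidy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: months as a customer vs. number of orders.
months = [1, 2, 3, 4]
orders = [3, 5, 7, 9]

slope, intercept = fit_line(months, orders)
prediction = slope * 5 + intercept  # predicted orders at month 5
print("slope:", slope, "intercept:", intercept, "prediction:", prediction)
```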

Reference book - Introductory Statistics: If and when you want a break from the computer screen, this book by Robert Gould and Colleen Ryan explains topics ranging from "What are Data" to "Linear Regression Model".

Calculus:

Basic Derivative Rules - KhanAcademy: In machine learning, we use optimization algorithms to minimize loss functions (the difference between actual and predicted output). These optimization algorithms (such as gradient descent) use derivatives to minimize the loss function. At this point, do not try to understand the loss function or how the algorithm works. When the time comes, knowing the basic derivative rules will make understanding loss functions comparatively easy. Now, if you are 4 years past college, chances are you have a blurred memory of calculus (unless, of course, math is your superpower). Read the basics to refresh your calculus knowledge and attempt the unit test at the end. (Estimated Time: 10 hours)

Partial derivative - Columbia: In the real world, there is rarely a scenario where there is a function of only one variable. (For example, a seedling grows depending on how much sun, water, and minerals it gets.) Most data sets are multidimensional, hence the need to know partial derivatives. These two articles are excellent and provide the math behind Gradient Descent: Rules of calculus - multivariate and Economic Interpretation of multivariate Calculus. (Estimated Time: 10 hours)
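If you are curious how a derivative ends up minimizing anything, here is a minimal sketch of gradient descent on a made-up one-variable loss, loss(w) = (w - 3)^2, whose derivative by the power rule is 2(w - 3). The loss, the learning rate, and the number of steps are all assumptions chosen to keep the example tiny.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3) ** 2

def loss_derivative(w):
    """Derivative of (w - 3)^2 by the power and chain rules."""
    return 2 * (w - 3)

def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative to walk downhill."""
    w = start
    for _ in range(steps):
        w -= learning_rate * loss_derivative(w)
    return w

w_best = gradient_descent(start=0.0)
print("w:", w_best, "loss:", loss(w_best))
```

Each step moves `w` in the direction that lowers the loss, which is why the sign and size of the derivative matter.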
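A partial derivative answers the question "how does the output change if I nudge one input and hold the others fixed?". The sketch below checks that numerically for a made-up two-variable function, f(x, y) = x^2 * y + 3y, whose partial derivatives are 2xy with respect to x and x^2 + 3 with respect to y. The function and the step size `h` are assumptions for illustration.

```python
def f(x, y):
    """Made-up function of two variables."""
    return x ** 2 * y + 3 * y

def partial(func, point, index, h=1e-5):
    """Central-difference estimate of the partial derivative at `point`."""
    up, down = list(point), list(point)
    up[index] += h    # nudge only one variable up...
    down[index] -= h  # ...and down, holding the others fixed
    return (func(*up) - func(*down)) / (2 * h)

point = (2.0, 1.0)
df_dx = partial(f, point, 0)  # analytic value: 2 * x * y = 4
df_dy = partial(f, point, 1)  # analytic value: x**2 + 3 = 7
print("df/dx:", df_dx, "df/dy:", df_dy)
```

Collecting the partial derivatives into one vector gives the gradient, which is what gradient descent follows when there is more than one parameter.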

Linear Algebra:

Brief refresher - Udacity: Datasets used for Machine Learning models are often high-dimensional and represented as a matrix. Many ML concepts are tied to Linear Algebra and it is important to have the basics covered. This may be a refresher course, but at its core, it is equally useful for those who are just getting to know Linear Algebra. (Estimated Time: 5 hours)

Matrices, eigenvalues, and eigenvectors - Stata: This post explains matrices intuitively and will help you to visualize them. Continue to the next post on eigenvalues and eigenvectors as well. Many a time, we are dealing with a data set with a large number of variables, and many of them are strongly correlated. To reduce dimensionality, we use Principal Component Analysis (PCA), at the core of which are Eigenvalues and Eigenvectors. (Estimated Time: 5 hours)
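To connect eigenvalues to PCA, consider two correlated variables with the made-up covariance matrix [[2, 1], [1, 2]]. For a symmetric 2x2 matrix the eigenvalues have a closed form, so the sketch below needs no linear algebra library. The largest eigenvalue's share of the total is the fraction of variance captured by the first principal component.

```python
import math

def eigen_symmetric_2x2(a, b, c):
    """Eigenvalues (largest first) and leading unit eigenvector of [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mean + radius, mean - radius
    if b != 0:
        vx, vy = b, lam1 - a  # solves (A - lam1 * I) v = 0
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Hypothetical covariance matrix of two strongly correlated features.
(lam1, lam2), direction = eigen_symmetric_2x2(2.0, 1.0, 2.0)
explained = lam1 / (lam1 + lam2)

print("eigenvalues:", lam1, lam2)
print("first principal direction:", direction)
print("variance explained by first component:", explained)
```

Here the first component points along the diagonal and keeps 75% of the variance, which is the sense in which PCA lets us drop a dimension while losing little information.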

PCA, eigenvalues, and eigenvectors - StackExchange: This comment/answer does a wonderful job in intuitively explaining PCA and how it relates to eigenvalues and eigenvectors. Read the answer with the highest number of votes (the one with Grandmother, Mother, Spouse, Daughter sequence). Read it multiple times if it does not make sense in the first take. (Estimated Time: 2-3 hours)

Reference Book - Linear Algebra Done Right: For further reading, Sheldon Axler's book is a great reference but completely optional for the ML coursework.

These are the math and programming basics that are needed to get started with Machine Learning. You may not understand everything at this point (and that is OK), but some degree of familiarity and having an additional resource handy will make the learning process enjoyable. This is an exciting path and I hope sharing my experience with you helps in your next step. If you have further questions, feel free to email me or comment here!

#machine learning bootcamp #pre-requisites

Data Analytics Basics

Now that you have the SQL basics under your belt, we will learn to derive insights from historical data and predict performance, which in turn will help us make decisions. The industry term for this process is Data Driven Marketing.

I will discuss this process in 3 steps:

Analyze: Analyze historical data from the organization as well as the market and provide insights.

Predict: Forecast or predict performance based on historical data.

Decide: Make decisions, i.e., take action to invest in relevant marketing programs and campaigns.

Data driven marketing is an iterative process, i.e., once you reach the third step of “Decide” or “Take Actions”, you go back to step 1 and repeat these steps to address changes (in the market as well as the organization) and optimize. One can write a book on each of these steps, but today I will dedicate a couple of paragraphs to each.

Analyze:

The very first step of data analysis is data exploration. In most companies, your data engineer will have the data in a reporting database for you to analyze. In some scenarios though, especially in startups, when you start a new marketing campaign you may have to work with your data engineer to ensure proper tracking of the campaign-level insights (acquisition, revenue, retention).

Once we have the data ready, we summarize or aggregate it to answer questions that are important for our business. We want to know how much revenue has been generated on a monthly or quarterly basis. We will also want to know how much a customer spends on average, and so on.

When a company is launching a new product, it will be interested in knowing the revenue share of the new product with respect to total revenue, or how this product's revenue compares to that of the company's other products. It is also very important to compare the performance of the product with respect to the industry and its major players. For example, if the product market is growing at 30% and the newly launched product is growing at 15%, the company may want to revisit its marketing campaigns, provided the new product itself is competitive. These questions will vary depending on the business objectives. The quantitative measures of these key business objectives are known as Key Performance Indicators (KPIs).

Predict:

Forecasting or predicting is the process of estimating future performance based on historical data. As we already know, it is impossible to predict a future event with 100% certainty, be it weather, an election, company revenue or customer retention. So, what we do is make an estimate: we calculate the likelihood or probability of an event outcome. There are various statistical algorithms already available to address different scenarios, and one needs to understand these and use them accordingly. Many of these statistical algorithms are highly complex and can read data at a level of detail that is impossible for the human eye to catch. Based on these readings, an algorithm discovers a pattern and uses that pattern to predict future events. This process of reading, discovering and estimating from data is popularly known as Machine Learning.

Machine Learning is at the core of intelligent data analysis, also known as Data Science, a term popularized very recently thanks to the data revolution. It is a field of study that combines various traditional disciplines (Mathematics, Statistics, Computer Science, Business) along with industry-specific knowledge to extract insights from data.

Decide:

The sole purpose of the previous steps (Analyze and Predict) is to guide us to make decisions and drive future investments and campaigns in a more analytical way rather than solely depending on our intuition. For digital marketers, the decision is to build a marketing strategy to channel marketing investments towards different marketing vehicles in rewarding markets and products. Popular digital marketing vehicles include advertising, email campaigns, promotions, incentives, website, social media as well as online customer support community.

In the coming weeks, I will discuss each of the above 3 steps for a specific marketing operation and how it can benefit from data driven marketing. I will start with customer segmentation.

Customer segmentation is the act of dividing customers into groups of individuals that are similar in multiple ways relevant to marketing. One group, for example, can be women in their mid-thirties who have spent more than $1000 a couple of times. We use an unsupervised machine learning model to form these groups, or clusters. This technique is known as clustering. Once we find a way to name these clusters and find what works for them, we can market to each of these groups separately. As we acquire new customers, we can use supervised classification models to assign each new customer to one of these groups.
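To give a feel for clustering, here is a minimal sketch of k-means, one common clustering algorithm, applied to a single made-up feature (annual spend). The data, the choice of three clusters, and the deterministic starting centers are assumptions. Assigning a new customer to the nearest cluster center, as done at the end, is a simple stand-in for the supervised classification step described above.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny k-means for one numeric feature (k >= 2). Returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels

def assign(value, centers):
    """Put a new customer into the segment with the nearest center."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))

# Hypothetical annual spend per customer, in dollars.
spend = [20, 25, 30, 480, 500, 520, 1500, 1600]
centers, labels = kmeans_1d(spend, k=3)
new_customer_segment = assign(450, centers)

print("centers:", centers)
print("labels:", labels)
print("segment for a new customer spending $450:", new_customer_segment)
```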

#data analytics

SQL Exercise

One way you can work on your SQL skills without local data access is through the SQL Exercise website. Once you register, you will be provided with the names of the tables and columns and the query you need to write. After you write your SQL, you can check it for accuracy by clicking on the “Run” button.

Here are a couple of snapshots of how SQL Exercise works.

I find the above website to be a good tool to practice SQL without installing any database engine on your local machine.

The next section has exercises on all the elements we learned in the previous articles, using the database schema below. Answers are provided at the end of this article so you can review your work.

Let’s assume the above database schema belongs to a company you are consulting with and you need to answer the following questions. (I would suggest that you create the tables and populate them as a part of the exercise. However, if you want the script for the tables in the above diagram, let me know. I will upload it to GitHub.)

1. The total number of departments, the total number of employees and the total number of managers.

2. List the employees in “Marketing” Department.

3. Total number of managers with title “Vice President” and name of the “Vice President” who has held that title the longest

4. List the Employees who have been in more than one department.

5. Find the newest employee and the employee who has been with the company the longest.

6. Find the % of Employees who are managers

7. List the employees and their titles who joined the company in the first year of operation.

8. List the current lowest and the highest salary.

9. Total number of Male and Female Employees

10. The total number of employees hired each year in the past three years.

With this, I will wrap up the first chapter of “Database for Digital Marketers”. In the next chapter, I will cover the basics of data analysis and various methods of understanding, analyzing and interpreting marketing data.

----------------------------------------------------------------------------------------------------------

Answers:

--1. The total number of departments, the total number of employees and the total number of managers.

Select count(*) from departments

Select count(*) from dept_manager

Select count(*) from employees

--2. List the employees in “Marketing” Department.

Select e.First_name,e.Last_name

from employees e

Join dept_emp b

On e.emp_no=b.emp_no

Join departments d

On b.dept_no=d.dept_no

Where d.dept_name='Marketing'

--3. Total number of managers with title “Vice President” and name of the “Vice President” who has held that title the longest

Select count(*) from

Titles where title = 'Vice President'

Select top 1 e.First_name, e.last_name, datediff(DD,from_date,to_date)/365 as number_years_as_vp

From

employees e

Join Titles t

on e.emp_no=t.emp_no

where title = 'Vice President'

Order by datediff(DD,from_date,to_date) desc

--4. List the employees who have been in more than one department.

Select Emp_no, count(dept_no)

From dept_emp

Group by Emp_no

Having count(dept_no) > 1

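--5. Find the newest employee and the employee who has been with the company the longest.

Select first_name,last_name

From employees where hire_date = (Select max(hire_date) from employees)

Select first_name,last_name

From employees where hire_date = (Select min(hire_date) from employees)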

--6. Find the % of Employees who are managers

Select Count(distinct Manager)*100.0/Count(distinct AllEmployee)

From

(Select

Case when m.emp_no is not null then m.emp_no else null end as Manager,

e.emp_no as AllEmployee

From employees e

Left join dept_manager m

On e.emp_no= m.emp_no) a

--7. List the employees and their titles who joined the company in the first year of operation.

Select first_name,last_name,titles.title

From employees

join titles

on employees.emp_no=titles.emp_no

where hire_date <= (Select dateadd(dd,365,min(hire_date)) from employees)

--8. List the current lowest and the highest salary.

Select min(salary), max(salary)

From salaries

Where to_date> GETDATE()

--9. Total number of Male and Female Employees

Select count(case when gender='M' Then emp_no else null end) as MaleemployeesCount,

count(case when gender='F' Then emp_no else null end) as FemaleemployeesCount

From employees

--10. The total number of employees hired each year in the past three years.

Select Year(hire_date) as hire_year,count(*) as employees_hired from employees

Where Year(hire_date) >= Year(GETDATE()) - 2

Group by Year(hire_date)

#sql exercise

SQL query performance

Regardless of how small your database is, it is a good idea to know how a database engine processes your request. Database performance can be optimized in various ways. For now, we will not go into the details of every optimization option available and will focus only on query tuning. Besides, many of the other options (creating indexes, for example) are handled by database administrators. Database developers, however, are responsible for fine-tuning the SQL queries they write.

Sometimes we focus primarily on the accuracy of the query and tend to ignore its performance. Most queries will not even show performance issues in a sandbox or development environment. So, once you are comfortable with the business logic of the query, review it and look out for the following common issues that cause slow performance.

1. Specify the columns you need after the Select Clause instead of Select *.

Always remember to specify the columns you need in your result set. A selected set of columns is obviously easier to read, but most importantly, the database has less data to read and send back, so you retrieve your data more quickly.

The database I used for my previous articles is not large enough to show the time differences between various queries. So, I added a publicly available database, “NORTHWIND”, that has a reasonably large dataset. If you are eager to try these performance tuning tips on your own, email me and I will send you a copy of the database backup file.

Here, you may not be able to visibly differentiate the time taken between the two queries, but I will show you the time taken by these two queries using the Client Statistics Tool.

SELECT * FROM [Customers]

SELECT [CompanyName],[Country] FROM [Customers]

When you right click on the query window, click the option to “Include Client Statistics”. Here are the reports I get when I run each of the above queries separately.

Select * from Customers took 18 milliseconds,

while SELECT [CompanyName],[Country] FROM [Customers] took just 8 milliseconds.

You may be wondering why we care about such a small performance gain. When you run these queries in a live environment, on tables with millions of records and thousands of users querying at the same time, the difference will be significant.

I once got a timeout while trying to query just one record by using Select top 1 * from a large table with millions of records and around 8000 fields. I will leave it at that. :-)

2. Avoid functions in the where clause

Let’s say we want the list of customers who bought a product in 2016. Following our previous tip, we will select only the columns that we need to display, in this case FirstName and LastName. Next, we will keep only those records where the order date falls between January 1st 2016 and December 31st 2016. Here are a couple of ways you can get the dataset. Look at the execution time of each of these queries.

Query 1 does not use any function in the Where clause

Query 2 uses year function around OrderDate.

Query 1 took 1 milliseconds and Query 2 took 18 milliseconds

As you can see, the second query took noticeably longer than the first. Although the second query is quick to write and easy to read, it took longer because the Year function is called for every row in the table, and its result is then checked against the right-hand side value 2016 to decide whether to include the row. Another, less obvious reason is that the database cannot use an available index on a column that is wrapped in a function.
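The two query shapes can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database instead of SQL Server, a made-up Orders table and index name, and SQLite's strftime in place of T-SQL's Year function. The point is the shape of the Where clause, not the timings.

```python
import sqlite3

# Hypothetical miniature Orders table; names and data are for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT);
    CREATE INDEX idx_orders_orderdate ON Orders (OrderDate);
    INSERT INTO Orders (CustomerID, OrderDate) VALUES
        (1, '2015-12-31'), (2, '2016-01-01'), (3, '2016-06-15'),
        (4, '2016-12-31'), (5, '2017-01-01');
""")

# Query 1: a plain range comparison on the column, so the index can be used.
range_sql = ("SELECT CustomerID FROM Orders "
             "WHERE OrderDate >= '2016-01-01' AND OrderDate < '2017-01-01'")

# Query 2: the column is wrapped in a function (strftime stands in for Year()).
func_sql = ("SELECT CustomerID FROM Orders "
            "WHERE strftime('%Y', OrderDate) = '2016'")

range_rows = sorted(conn.execute(range_sql).fetchall())
func_rows = sorted(conn.execute(func_sql).fetchall())

# EXPLAIN QUERY PLAN describes how SQLite intends to read the table;
# the human-readable description is the last column of each row.
range_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + range_sql)]
func_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + func_sql)]

print(range_rows == func_rows)  # both forms return the same customers
print(range_plan)
print(func_plan)
```

Both queries return the same rows, but only the plan for the range version mentions the index on OrderDate. In SQL Server you can make the same comparison by looking at the execution plan for each query.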

3. Subqueries: Use them only when you expect fewer than 1000 rows in the subquery's result set.

A subquery is a SQL query that is part of a larger SQL query but could also stand alone and run on its own. Let me explain this concept with an example from our own database, [DigitalMarketing]. Get all the products that are sold in the state of California.

Select b.ProductID,p.Product From Products p Join Orders b On b.ProductID = p.ProductID Join Customers a On a.CustomerID=b.CustomerID Where a.state='CA'

You can write the above query using subquery like this. The section within the () is a subquery.

Select b.ProductID,p.Product From Products p Join Orders b On b.ProductID = p.ProductID Where b.CustomerID in (Select CustomerID from Customers where state='CA')

Almost all SELECT statements that join tables and use the join operator can be rewritten as subqueries, and vice versa. Writing the SELECT statement using the join operator is often easier to read and understand, and can also help the SQL Server Database Engine find a more efficient strategy for retrieving the appropriate data. However, a few problems are more easily solved using subqueries, and others are more easily solved using joins.

Subqueries are advantageous over joins when you have to calculate an aggregate value on-the-fly and use it in the outer query for comparison. Example 1 shows this. Get the list of customers that have orders worth more than average customer spend per order.

Select a.FirstName as Customer, OrderID, Quantity*Price as CustomerSpend From Customers a Join Orders b On a.CustomerID=b.CustomerID Join Products p On b.ProductID=p.ProductID Where Quantity*Price > (Select Avg(Quantity*Price) as AverageCustomerSpend From Customers a Join Orders b On a.CustomerID=b.CustomerID Join Products p On b.ProductID=p.ProductID)

This problem cannot be solved with a join in just one query, because we would have to write the aggregate function in the WHERE clause, which is not allowed. (We can solve this using two separate queries).

Joins are advantageous over subqueries if the SELECT list in a query contains columns from more than one table. Example 2 shows the customers in California and the products they have bought.

Select a.CustomerID,a.State,b.ProductID,p.Product From Customers a Join Orders b On a.CustomerID=b.CustomerID Join Products p On b.ProductID = p.ProductID Where a.state='CA'

The SELECT list of the above query contains the CustomerID and State columns from the Customers table and the ProductID and Product columns from the Products table. For this reason, the equivalent solution with a subquery in the Where clause will not work, because such a query can display columns only from the tables in the outer query, not from the table inside the subquery (). If we use a subquery here, we will get data from either the Customers table or the Products table, but not both.

4. Using Left or Full Outer joins

In one of my previous articles I have explained in depth the join clause and when you need to use each of the joins. Let me revive that memory with a simple example.

Suppose you are working on a promotional campaign for a certain product and you want to find out what percentage of your customers have bought the product. You may or may not choose to include a certain region based on this number.

And it is in this scenario you have to use LEFT JOIN. INNER JOIN will not be able to give you the result you want. And it is ok to take the extra time to process the additional data. Just ensure that keys used in the joins are indexed.
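Here is a minimal sketch of that percentage calculation. It is an illustration only: it runs on Python's built-in sqlite3 module rather than SQL Server, and the tiny Customers and Orders tables, their rows, and the choice of ProductID 1 as the promoted product are assumptions made up for the example.

```python
import sqlite3

# Hypothetical data: 8 customers in 4 states, ProductID 1 is the promoted product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, State TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, ProductID INTEGER);
    INSERT INTO Customers VALUES (3,'CA'), (4,'CA'), (5,'CA'), (8,'CA'), (12,'CA'),
                                 (6,'WA'), (7,'NY'), (9,'TX');
    INSERT INTO Orders (CustomerID, ProductID) VALUES
        (3,1), (3,1), (6,1), (7,1), (12,1), (4,2);
""")

pct_sql = """
    SELECT c.State,
           COUNT(DISTINCT c.CustomerID) AS TotalCustomers,
           COUNT(DISTINCT o.CustomerID) AS Buyers,
           COUNT(DISTINCT o.CustomerID) * 100.0
               / COUNT(DISTINCT c.CustomerID) AS PctBuyers
    FROM Customers c
    {join} Orders o ON o.CustomerID = c.CustomerID AND o.ProductID = 1
    GROUP BY c.State
    ORDER BY c.State
"""

# LEFT JOIN keeps customers with no matching order, so they count in the denominator.
left_rows = conn.execute(pct_sql.format(join="LEFT JOIN")).fetchall()

# INNER JOIN drops them, which makes every remaining state look like 100%.
inner_rows = conn.execute(pct_sql.format(join="JOIN")).fetchall()

for row in left_rows:
    print(row)
```

With LEFT JOIN, California shows 2 buyers out of 5 customers and Texas shows 0 out of 1. With INNER JOIN, Texas disappears entirely and California is reported as 2 out of 2, which is why INNER JOIN cannot answer this question.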

Now, many queries where INNER JOIN would suffice will also give the same result if you use LEFT JOIN instead, and that is what I want to caution you against. Those who are new to SQL may use LEFT JOIN in place of INNER JOIN because both give the same set of data. But LEFT JOIN can take significantly longer if you have a large dataset. With INNER JOIN, the database server is free to start with the smaller table and then match the available keys against the larger table. With LEFT JOIN, we are somewhat forcing the database to start with the table on the left and match all the corresponding keys against the table on the right, and then we still need to filter out the rows where the table on the right has a null value.

So, to summarize, use LEFT JOIN only when you need it. Don’t use it in scenarios where it will give you the same result set as INNER JOIN.

5. Using Order By

As we already know, the Order By clause sorts the result of a Select statement by one or more columns. When we add an Order By clause, it adds an extra step to get the data in order. In a smaller data set, this is rarely going to be a problem.

If we are dealing with a large dataset, the database must first sort all the rows before returning the first one. This can sometimes slow down the process. Although it does not cause a major performance impact in many scenarios, I avoid using the Order By clause unless I want the data to appear in a certain order. One example is ordering an aggregate value in ascending or descending order to find the top ranking or low ranking records.

In many cases, the data is only useful in a certain order. For example, when you want to find the states that generate the top sales numbers, you will need to use the Order By clause in descending order.

On the other hand, if you only need the list of Cities/Countries where ShipRegion is NULL, you can write the query without using the Order By clause.

SQL Case Statement

The dictionary definition of the word “case” is “a set of circumstances or conditions”, “an instance of a particular situation”, or “occurrence of a particular kind or category”. I think the case statement in SQL most closely resembles the last definition, “occurrence of a particular kind or category”.

Let’s take the simple case of Customer Spend. You are starting a marketing campaign post Christmas and would like to give a discount to your customers based on the spending this year.

You want to give 20% discount if the customer has spent $40+, 15% to those who spent between $20 and $40 and 10% to those who spent between $10 and $20 and 5% to the rest of the customers who may have spent less than $10 or nothing at all. Now let’s write the query to see which brackets your customers fall into.

In the SQL statement you will notice the “Else 5” for all customers who have either spent less than $10 or spent nothing. Now note the NULL value in the Sale column. These accounts have signed up but have not spent a dollar yet and thus get a 5% discount. Another important thing to note in this query is the LEFT JOIN. We used LEFT JOIN because we want to send a discount to all accounts/customers whether or not they have spent anything. LEFT JOIN includes every account from the Customers table and displays null if the customer has not spent anything yet. If we had used INNER JOIN instead of LEFT JOIN, we would have got only the accounts that have already spent, i.e. those that have an entry in the Orders table as well.

Here is what we would get if we had used INNER JOIN instead.
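A minimal sketch of the discount query with both join types follows. It runs on Python's built-in sqlite3 module rather than SQL Server, and the customers, products, prices, and exact bracket boundaries (40 and above, 20 and above, 10 and above) are assumptions chosen for illustration, not the data from the original screenshots.

```python
import sqlite3

# Hypothetical data: customer 5 has signed up but has no orders yet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT);
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Product TEXT, Price INTEGER);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER,
                         ProductID INTEGER, Quantity INTEGER);
    INSERT INTO Customers VALUES (1,'Ann'), (2,'Bob'), (3,'Cy'), (4,'Di'), (5,'Ed');
    INSERT INTO Products VALUES (1,'A',10), (2,'B',25), (3,'C',5);
    INSERT INTO Orders (CustomerID, ProductID, Quantity) VALUES
        (1,2,2), (2,2,1), (3,1,1), (4,3,1);
""")

discount_sql = """
    SELECT c.CustomerID,
           SUM(o.Quantity * p.Price) AS Sale,
           CASE WHEN SUM(o.Quantity * p.Price) >= 40 THEN 20
                WHEN SUM(o.Quantity * p.Price) >= 20 THEN 15
                WHEN SUM(o.Quantity * p.Price) >= 10 THEN 10
                ELSE 5  -- spent less than 10, or nothing at all (Sale is NULL)
           END AS DiscountPct
    FROM Customers c
    {join} Orders o ON o.CustomerID = c.CustomerID
    {join} Products p ON p.ProductID = o.ProductID
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
"""

# LEFT JOIN: every customer appears; customers without orders have Sale = NULL.
left_rows = conn.execute(discount_sql.format(join="LEFT JOIN")).fetchall()

# INNER JOIN: only customers with at least one order appear.
inner_rows = conn.execute(discount_sql.format(join="JOIN")).fetchall()

print(left_rows)
print(inner_rows)
```

The WHEN branches are evaluated from top to bottom, so each customer lands in the first bracket that matches. A NULL Sale matches none of the comparisons and falls through to the Else branch.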

I will conclude this article with a few “need to know” statements before we move on to tracking and analyzing marketing campaigns.

Comparison Operators

When we are looking for a specific set of records based on a certain condition, we use the Where clause along with an operator like = or >. Some of these operators, like =, IN and <>, work for both numerical values (Age, Sale etc.) and string/text values (name, city, state etc. columns). For string comparison, we wrap the string being compared in single quotes. Here are a few examples.

The IN clause is a replacement for one or more “OR” clauses. Here is an example that shows how they return the same records.
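As a small sketch of that equivalence, the snippet below runs both forms against a made-up Customers table using Python's built-in sqlite3 module. The table contents are assumptions for illustration only.

```python
import sqlite3

# Hypothetical Customers table with a handful of states.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, State TEXT);
    INSERT INTO Customers VALUES (1,'CA'), (2,'WA'), (3,'NY'), (4,'CA'), (5,'TX');
""")

# IN lists all the accepted values in one place.
with_in = conn.execute(
    "SELECT CustomerID FROM Customers WHERE State IN ('CA', 'WA') ORDER BY CustomerID"
).fetchall()

# The same filter written as separate comparisons joined by OR.
with_or = conn.execute(
    "SELECT CustomerID FROM Customers WHERE State = 'CA' OR State = 'WA' ORDER BY CustomerID"
).fetchall()

print(with_in)
print(with_or)
```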

Execute the following statements and see what you get.

SELECT * FROM [Customers] Where age !=30

SELECT * FROM [Customers] Where age >30

SELECT * FROM [Customers] Where State<> 'CA'

To compare a part of a text, we use LIKE as shown below. The percent sign (%) stands for any sequence of zero or more characters. The first statement returns all records where LastName ends with “on” and we get Ruxton, Wixon, Johnson, Pon.

The second statement returns all records where LastName starts with “p” and we get Pon, Perez, Paliska. The last statement returns all records that have the letters “an” at the beginning, at the end, or in the middle and we get Coleman, Wilman, Fang and Chang.
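The three LIKE patterns described above can be sketched like this, using Python's built-in sqlite3 module and a Customers table that contains only the last names mentioned in the text (the full table is an assumption). Note that SQLite's LIKE ignores case for ASCII letters by default; in SQL Server this depends on the collation of the column.

```python
import sqlite3

# Hypothetical table holding just the last names mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, LastName TEXT)")
names = ["Ruxton", "Wixon", "Johnson", "Pon", "Perez",
         "Paliska", "Coleman", "Wilman", "Fang", "Chang"]
conn.executemany("INSERT INTO Customers (LastName) VALUES (?)", [(n,) for n in names])


def last_names_like(pattern):
    """Return the last names matching a LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "SELECT LastName FROM Customers WHERE LastName LIKE ? ORDER BY LastName",
        (pattern,),
    )
    return [row[0] for row in rows]


ends_with_on = last_names_like("%on")   # anything, then "on" at the end
starts_with_p = last_names_like("p%")   # "p" at the start, then anything
contains_an = last_names_like("%an%")   # "an" anywhere in the name

print(ends_with_on)
print(starts_with_p)
print(contains_an)
```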

Most of these numerical operators are also used in Having clause along with an aggregate function like count or sum like this.

We can also use BETWEEN to get records within a certain date range or between two numbers.

You can also filter rows based on NULL values. If you want to find all the customers who have signed up but have not spent any money, you will use the following query.
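One way to write such a query is sketched below, again with Python's built-in sqlite3 module and made-up tables. It assumes that "has not spent any money" means "has no row in Orders"; the original screenshot may have filtered on a different column.

```python
import sqlite3

# Hypothetical data: customers 2 and 4 have signed up but never ordered.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    INSERT INTO Customers VALUES (1,'Ann'), (2,'Bob'), (3,'Cy'), (4,'Di');
    INSERT INTO Orders (CustomerID) VALUES (1), (3), (3);
""")

base_sql = """
    SELECT c.CustomerID
    FROM Customers c
    LEFT JOIN Orders o ON o.CustomerID = c.CustomerID
    WHERE {condition}
    ORDER BY c.CustomerID
"""

# IS NULL is the correct test: it keeps the customers with no matching order.
no_orders = conn.execute(base_sql.format(condition="o.OrderID IS NULL")).fetchall()

# "= NULL" never evaluates to true, so this common mistake returns no rows.
equals_null = conn.execute(base_sql.format(condition="o.OrderID = NULL")).fetchall()

print(no_orders)
print(equals_null)
```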

This concludes the basics of writing SQL Query. You may wonder why I use image instead of plain text to show examples. Plain text would have been easier for you to copy and paste and run the query. However, I believe like mathematics, you learn SQL better if you understand and then write every query. In my next article, we will learn how we measure impact of Marketing Campaigns on Revenue or simply put measure ROI, which BTW is the primary goal of these articles.

#sql case

SQL group by and having clause

The clause “Group By” is very similar to the English word “group” or “classify”. In our daily life we classify anything and everything. It helps us count a set of objects easily. In a database, we use “Group By” to perform a calculation on a set of rows that have some common value or can be grouped together. In the Customers table (as mentioned in previous articles), we can group or classify customers by State. So if you want to find the total number of customers by state, you will use the Group By clause. Along with the Group By clause, you will need to use an aggregate function such as count or sum in the Select clause.

Select state, count(*) as TotalCustomers

from Customers

group by state

You can also write count(CustomerID) instead of Count(*) to get the total customer count by state.

Count(*) counts the rows that exist per state in the Customers table.

However, count(CustomerID) counts only the records that have a CustomerID. In the above case, since each record has a CustomerID, we will get the same result.

You will get different results for Count(CustomerID) if CustomerID was nullable i.e. may or may not hold a value. You can ignore this scenario for the time being. This is not a common scenario but we do come across cases like this, especially when the data is not clean for some reason, i.e. information not collected properly or lost during migration.

You will also get an inaccurate customer count from count(CustomerID) if the CustomerID values are not unique, i.e. the same CustomerID value appears in more than one row. This is a scenario you will come across regularly; in fact, it applies to our Orders table. I have added a few more records to the Orders table to explain this, and this is what we now have in the table.

Suppose you want to find out how many customers have bought product A (ProductID=1). We see that CustomerID {3,6,7,12} have bought ProductID=1. Although CustomerID {3} has bought the product twice, only 4 customers (not 5) have bought ProductID=1.

When we run the query without and with distinct, we get 5 and 4 respectively, and the latter is the accurate count.
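The difference between the two counts can be reproduced with a small sketch. It uses Python's built-in sqlite3 module and an Orders table containing only the ProductID=1 purchases described above plus one unrelated order; the remaining columns and rows of the real table are left out.

```python
import sqlite3

# Orders for ProductID 1 as described above: customer 3 bought it twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, ProductID INTEGER);
    INSERT INTO Orders (CustomerID, ProductID) VALUES
        (3,1), (3,1), (6,1), (7,1), (12,1), (4,2);
""")

# Counts every order row for the product, so customer 3 is counted twice.
order_rows = conn.execute(
    "SELECT COUNT(CustomerID) FROM Orders WHERE ProductID = 1"
).fetchone()[0]

# Counts each customer once, no matter how many times they ordered.
distinct_customers = conn.execute(
    "SELECT COUNT(DISTINCT CustomerID) FROM Orders WHERE ProductID = 1"
).fetchone()[0]

print(order_rows, distinct_customers)
```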

Try this query with any column in any table that has duplicate values and you will see the difference in count. Along with count, there are a few more aggregate functions that are very useful. Let's look at some scenarios where you can use them. I have added a new column “Price” to the Products table and set a price for each of the 4 products. To see how much the customers have spent in their individual orders, we write the following query.

Aggregate Functions

SUM: Sum returns the total of the numeric data in a column, grouped by the data in another column such as FirstName or LastName.

To see how much each customer has spent in all, we will use the aggregate function SUM around Quantity * Price and group it by a.FirstName +' '+a.LastName.

MAX: If you use Max instead of Sum, you get to see the maximum amount each customer has spent in a single order.

MIN: You can also use these aggregate functions without the group by clause. In that case, the function looks at all rows and returns a single min/max value. For example, to find the minimum any customer has spent in any order, you can use Min around Quantity * Price and remove the group by clause like this.
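The three aggregates can be sketched together as follows. The example runs on Python's built-in sqlite3 module with made-up customers, prices, and orders. One difference from T-SQL is worth noting: SQLite concatenates strings with || where T-SQL uses +.

```python
import sqlite3

# Hypothetical data: Ann placed two orders (20 and 25), Bob placed one (10).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Product TEXT, Price INTEGER);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER,
                         ProductID INTEGER, Quantity INTEGER);
    INSERT INTO Customers VALUES (1,'Ann','Lee'), (2,'Bob','Ray');
    INSERT INTO Products VALUES (1,'A',10), (2,'B',25);
    INSERT INTO Orders (CustomerID, ProductID, Quantity) VALUES (1,1,2), (1,2,1), (2,1,1);
""")

# SUM and MAX per customer: one output row for each group.
per_customer = conn.execute("""
    SELECT a.FirstName || ' ' || a.LastName AS Customer,
           SUM(b.Quantity * p.Price) AS TotalSpend,
           MAX(b.Quantity * p.Price) AS LargestOrder
    FROM Customers a
    JOIN Orders b ON a.CustomerID = b.CustomerID
    JOIN Products p ON b.ProductID = p.ProductID
    GROUP BY a.FirstName || ' ' || a.LastName
    ORDER BY Customer
""").fetchall()

# MIN without GROUP BY: a single value computed over all orders.
smallest_order = conn.execute("""
    SELECT MIN(b.Quantity * p.Price)
    FROM Orders b
    JOIN Products p ON b.ProductID = p.ProductID
""").fetchone()[0]

print(per_customer)
print(smallest_order)
```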

#sql group by

SQL Joins

Join in a database carries the same meaning as in the English dictionary. It connects two tables in a database through a common column or key. If we want to find the customers who live in California and have bought Product A, we will need to “Join” or connect the three tables (mentioned in my previous article) through common keys. The common key that connects the Customers and Orders tables is CustomerID, and the common key that connects the Products and Orders tables is ProductID.

In this particular scenario, since we are looking for customers who live in ‘CA’ and bought Product ‘A’, (i.e. intersection of two data sets), we will be using “Inner Join”. Let’s write the query, one step at a time. First, let’s join the Customer table with Orders table.

Select a.CustomerID,a.State,b.ProductID From Customers a Join Orders b On a.CustomerID=b.CustomerID Where a.state='CA'

The query above will return the following rows.

Here, ProductID does not really give any information about the product. However, ProductID is the common key between the Orders table and the Products table, and the Products table holds the information about the product with ProductID 1. Let's join the Products table to the above query.

Select a.CustomerID,a.State,b.ProductID,p.Product From Customers a Join Orders b On a.CustomerID=b.CustomerID Join Products p On b.ProductID = p.ProductID Where a.state='CA' And Product='A'

The above query will return the same number of records, but notice the Product column (highlighted).

Try this exercise of using inner join with various small data sets. Write the SQL and then manually check against Venn diagram representing the data sets.

Now let’s look at the second scenario, where you want to find out everything that customers living in California have ordered.

Select a.CustomerID,a.State,b.ProductID,p.Product,b.Quantity From Customers a left Join Orders b On a.CustomerID=b.CustomerID left Join Products p On b.ProductID = p.ProductID Where state='CA'

Notice that you now see all 5 California customers in the list, as well as the additional information about the products each one has bought so far. Left join is primarily used to get additional optional information or to distinguish customers who have certain data from those who do not. In this scenario we learned that 2 of the 5 California customers have bought product A and the other three have not.

You may initially find it challenging to decide on when to use left join. In that scenario, visualize the tables or data set and Venn diagram. If we go back to Venn diagram, you see all the 5 CustomerID in the California set and CustomerID { 3,12} from the Product A set.

Now, let’s look at the third scenario, where you want to see all the customers who live in California or have bought Product A. “or” is represented as union in Set Theory which is equivalent to full outer join in database. As per the above diagram, Union of California and Product A will return all the CustomerID inside the two circles. {3,4,5,6,7,8,12}. The SQL query that represents the above scenario will look like this.

Select a.CustomerID,a.State,b.ProductID,p.Product,b.Quantity From Customers a full outer Join Orders b On a.CustomerID=b.CustomerID full outer Join Products p On b.ProductID = p.ProductID Where state='CA' or Product='A'

The resulting rows contain all California customers as well as the customers who bought product ‘A’.

With this, I will end this session. As I mentioned before, JOIN is the most important thing you need to learn for querying database. So, spend plenty of time writing queries for all the three join scenarios. Like Mathematics or Piano, the more you practice these joins, the more comfortable you will get. As we go along, I will show many more examples.

#sql join

SQL Basics

To represent the data sets (discussed in the previous article) in a database, we will need to create 3 tables:

Customers - Holds customer information

Products - Holds product information

Orders - Holds information about the orders of customers who bought the company's products

Once the tables are created, create a database diagram and drag all three tables onto it to get an overview of all tables (above).

Customers

Products

Orders

Since these tables contain only a few records, you can find your answers simply by looking at the tables. However, in the real world, you will have many more customers (hopefully) and you will need to query (search) your database to gather information about them. The programming language used for querying a database is known as SQL (Structured Query Language). The syntax of a query varies between database environments. In my articles, I will write the queries in T-SQL for MS SQL database. If you are using any other database, there will be some differences, but the queries are close enough to replicate in your database environment. In case you do not have access to any database, you will need to download and install either MySQL Workbench or MS SQL Express. Both are available to download for free. Once you have installed SQL Express, refer to the above database diagram to create your database and tables.

When you have access to your customer database, the first question that comes to mind is how many customers you have. For that you will need only two SQL clauses, SELECT and FROM, along with the COUNT function.

Use [DigitalMarketing]

Go

Select count(*) from Customers

-----------

14

To find how many customers are based in the state of California, you will need to use the WHERE clause;

Select count(*) from Customers Where state='CA'

-----------

5

From the Venn diagram, we can confirm that we have 5 Customers in California. If you want to see the details of these customers, you can choose the column names you want to see in the SELECT clause.

SELECT CustomerID,FirstName,LastName,State FROM Customers WHERE State ='CA'

You can also write SELECT * FROM Customers WHERE State ='CA' and get the same rows. However, many reporting tables have more than 100 columns, and using * in those scenarios is not efficient. Make it a practice to select only the columns/data you want to see.

You have now written your first SQL query. Congratulations! This is very simple and easy, but you will be using these clauses all the time as we move towards more complicated concepts and queries.

Database Basics

Before I start talking about Database, let me give a brief introduction to Set Theory and how it relates to database joins. You will only need to know the basic set theory from your high school days: Unions and Intersections.

Each table in a database is considered a set of data and join in database is a way to get data from two or more tables (or data sets) with a common key.

In the above diagram, let us make the following assumptions.

All numbers inside the rectangle are CustomerIDs (keys) of the customers of a company X who have bought one or more of the products {A,B,C,D}.

All numbers inside circle “California” are customers within the state of California.

All numbers inside circle “Product A” are customers who bought product A.

To represent a set of customers of Company X, we will write

Customers C = {1,2,3,4,5,6,7,8,9,10,11,12,13}

Products P = {A,B,C,D}

CaliforniaCustomers CC = {3,4,5,8,12}

ProductACustomers PA = {3,6,7,12}

If you need to get the list of customers who live in California and bought Product A, the customers need to exist in both sets, CC and PA. From the Venn diagram above, you can tell that CustomerID 3 and 12 fall in this bucket. In set theory, you can represent this {3,12} as the intersection (set symbol ∩) of CC and PA: CC ∩ PA = {3,12}. In database terms, intersection (∩) is known as inner join. For CC ∩ PA, {3,12} are the two keys that belong to both data set CC and data set PA.

If you need to get the list of customers who live in California or bought Product A, the customers need to exist in either of the two sets, CC or PA. From the Venn diagram above, we know that CustomerID 3,4,5,6,7,8 and 12 fall in this bucket. This set {3,4,5,6,7,8,12} is represented as the union (set symbol ∪) of CC and PA: CC ∪ PA = {3,4,5,6,7,8,12}. In database terms, union (∪) is known as full outer join. For CC ∪ PA, {3,4,5,6,7,8,12} are the keys that belong to either data set CC or data set PA.

Now, if you need to get the list of all customers who live in California and may or may not have bought Product A, the customers must exist in set CC but may or may not exist in PA. From the Venn diagram above, you can tell that CustomerID 3,4,5,8 and 12 fall in this bucket. This set {3,4,5,8,12} contains every key of CC and is written with the symbol ⟕: CC ⟕ PA = {3,4,5,8,12}. In database terms, this symbol (⟕) is known as left outer join. For CC ⟕ PA, {3,4,5,8,12} are the keys that belong to data set CC and may or may not be present in data set PA.
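The three operations can be checked quickly with Python's built-in set type, using the CC and PA sets defined above. This is only a sketch of the set-theory view; the variable names are made up, and the left outer join is modelled as "every key of CC, plus a flag saying whether it also appears in PA".

```python
# The two sets from the Venn diagram above.
CC = {3, 4, 5, 8, 12}   # customers in California
PA = {3, 6, 7, 12}      # customers who bought Product A

# Inner join ~ intersection: keys present in both sets.
inner_join = CC & PA

# Full outer join ~ union: keys present in either set.
full_outer_join = CC | PA

# Left outer join: keep every key of CC and record whether PA has it too.
left_outer_join = {customer_id: (customer_id in PA) for customer_id in sorted(CC)}

print(sorted(inner_join))        # California customers who bought Product A
print(sorted(full_outer_join))   # California customers or Product A buyers
print(left_outer_join)           # every California customer, with a bought-A flag
```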

It is very important that you understand these concepts as this knowledge will make your database journey smooth down the road.

#database #set theory
