Untitled @dev-py - Tumblr Blog

K-means

K-Means clustering is one of the most popular unsupervised machine learning algorithms. In supervised learning, we provide a set of features and labels; the model learns what the output should be for specific types of input, and when we give it new input it predicts the output. In unsupervised learning no labels are provided. The algorithm finds patterns in the inputs and forms clusters. If the clusters are far from each other and the elements inside each cluster are close to each other, we consider it a good clustering. K-Means clustering uses 4 simple steps to get the clusters. First we give the number of clusters (k). Then a random centroid is selected for each cluster. Then each point is assigned to the cluster with the nearest centroid, and a new centroid is computed for each cluster as the mean of its points. This process keeps iterating until the centroids of the clusters no longer change. K-Means clustering can be used in document classification, recommender systems, image classification, customer segmentation, etc.
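To make the four steps concrete, here is a minimal sketch in plain Python. The function names (k_means, squared_distance, mean_point) and the six sample points are made up for illustration; this is not the code used for the analysis in this post, and a library implementation would normally be the better choice.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    # Step 1 is the caller choosing k.
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments (and so the centroids) stop changing.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, labels


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

The first three points should end up in one cluster and the last three in the other. Which cluster gets label 0 depends on the random starting centroids.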

Data description

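We can clearly see that columns like Balance, Bonus_miles, Bonus_trans, Flight_miles_12mo and Days_since_enroll have different scales. We'll have to standardize them. K-Means relies on distances, so without standardization the columns with the largest values would dominate the clustering. Standardizing also helps reduce the training time and gives better output.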
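As a rough sketch of what standardizing a column means, each value has the column mean subtracted and is then divided by the column's standard deviation. The standardize helper and the Balance values below are invented for illustration and are not rows from the actual dataset.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical Balance values, made up for this example.
balance = [28143, 19244, 41354, 14776, 97752]
scaled = standardize(balance)
print(scaled)
```

After this step the column has mean 0 and standard deviation 1, so it sits on the same scale as every other standardized column.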

Let's check the inertia of the clusters by plotting the elbow curve. Inertia is the sum of squared distances of all points in a cluster from its centroid, added up over all clusters.
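Here is a small sketch of how inertia can be computed by hand, assuming we already have the points, the centroids, and the cluster label of each point. The inertia function and the four sample points are made up for illustration.

```python
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Made-up points: two pairs that sit far apart.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]

# k = 1: every point belongs to one cluster centred on the overall mean.
one_cluster = inertia(points, [(5.0, 5.0)], [0, 0, 0, 0])

# k = 2: each pair gets its own centroid.
two_clusters = inertia(points, [(1.0, 2.0), (9.0, 8.0)], [0, 0, 1, 1])

print(one_cluster, two_clusters)  # 104.0 4.0
```

The sharp drop from k = 1 to k = 2 is the kind of bend the elbow curve is meant to show. Adding more clusters after the bend lowers inertia only a little.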

Lasso Regression

LASSO regression is used to reduce model overfitting. It increases the bias and reduces the variance of the model.

The full form of LASSO is Least Absolute Shrinkage and Selection Operator, so the model itself is capable of feature selection. It shrinks the coefficients of the less important features and removes features that are not important by making their coefficients exactly zero.

LASSO regression is also known as L1 regularization. It adds a penalty proportional to the sum of the absolute values of the coefficients, which removes variables that don't contribute much to the model.
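To see why an absolute-value penalty can push a coefficient all the way to zero, it helps to look at the one-coefficient case. The sketch below uses the soft-thresholding rule, which is the exact minimiser of 0.5 * (b - z)^2 + lam * |b|. The feature names, the unpenalised coefficients, and lam = 0.5 are invented for illustration; a real LASSO fit applies this kind of shrinkage inside an iterative solver.

```python
def soft_threshold(z, lam):
    """Minimiser of 0.5 * (b - z) ** 2 + lam * abs(b) over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


# Hypothetical unpenalised coefficients, made up for this example.
unpenalised = {"feature_a": 2.5, "feature_b": -0.3, "feature_c": 0.1, "feature_d": -1.2}
lam = 0.5  # assumed penalty strength

shrunk = {name: soft_threshold(coef, lam) for name, coef in unpenalised.items()}
print(shrunk)
```

The large coefficients are pulled towards zero by lam, while feature_b and feature_c, whose absolute values are below lam, drop out of the model completely.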

#Conclusion We can clearly see the prediction accuracy is stable across both datasets. When we add more data the prediction error decreases. The R-squared values of 0.74 and 0.70 indicate that the model explains 74% of the variance in the training data and 70% in the test data.
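For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a made-up r_squared helper and toy numbers, not the values from this analysis.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets and predictions, made up for this example.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [4.0, 4.0, 8.0, 8.0]
score = r_squared(y_true, y_pred)
print(score)
```

Here the residual sum of squares is 4 and the total sum of squares is 20, so the score comes out at about 0.8, meaning roughly 80% of the variance is explained.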

#LassoRegression #machinelearningdataanalysis

Random Forest

Random forest is a supervised machine learning model. Like a decision tree, a random forest can be used to predict continuous and categorical variables.

Random forest is an ensemble model: it uses more than one model. It uses multiple decision trees to derive the output.

It uses the bagging method: it draws random subsets of the training data (sampled with replacement), trains a decision tree on each, and takes the majority vote of the trees as the output. Bagging is short for bootstrap aggregation, because the final output is decided by aggregating the outputs of models trained on bootstrap samples.
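The two halves of bagging, bootstrap sampling and aggregation, can be sketched in a few lines. The helper names, the row indices, and the three tree votes below are placeholders for illustration; no real trees are trained here.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # stand-in for the indices of 10 training rows

# One bootstrap sample per tree; some rows repeat, some are left out.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical predictions from three trees for one new observation.
tree_votes = ["yes", "no", "yes"]
final = majority_vote(tree_votes)
print(samples, final)
```

For a regression forest the aggregation step would average the tree outputs instead of taking a vote.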

Since it derives its output by aggregating the majority vote from multiple models, it has lower variance than a single decision tree, usually with only a small increase in bias. Ensemble models are often preferred for building classification models.

We can clearly see that parameter tuning slightly improved performance.

#coursera #Machine Learning for Data Analysis #Random Forest

Decision Tree

A decision tree can be easily interpreted and visualized, and the working of the model can be easily explained to stakeholders, unlike black-box algorithms such as neural networks and SVMs.

A decision tree can be used as a regression model and also as a classification model. The regression model's output will be a continuous variable, while the classification model's output will be a categorical variable.

The top node of the decision tree is called the root node or master node. It is split into child nodes, and any node that is split further is the parent node of its sub-nodes. The last nodes do not have any sub-nodes and are called leaf nodes or terminal nodes.

The decision tree splits nodes based on the homogeneity of their elements and makes its decisions from those splits. The root node has the least homogeneous elements, and the leaf nodes have the most homogeneous elements.

If the model has a continuous variable as output, the decision tree uses the reduction in variance method. For a classification model, which has a categorical variable as output, the decision tree uses methods such as Gini impurity, information gain, or Chi-square to split the nodes based on the homogeneity of elements.
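Two of these split criteria are easy to compute by hand. The sketch below scores a candidate split with Gini impurity (classification) and with variance (regression). All function names and the toy labels are made up for illustration. A lower weighted score after the split means more homogeneous child nodes.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Population variance, the impurity measure for regression splits."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def weighted_impurity(left, right, measure):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Classification: a parent node with a 50/50 class mix.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
mixed_split = weighted_impurity([0, 0, 0, 1], [0, 1, 1, 1], gini_impurity)
pure_split = weighted_impurity([0, 0, 0, 0], [1, 1, 1, 1], gini_impurity)
print(gini_impurity(parent), mixed_split, pure_split)  # 0.5 0.375 0.0

# Regression: reduction in variance achieved by a split.
targets = [1.0, 1.0, 3.0, 3.0]
reduction = variance(targets) - weighted_impurity([1.0, 1.0], [3.0, 3.0], variance)
print(reduction)  # 1.0
```

The tree would prefer the second classification split, because it brings the weighted Gini impurity from 0.5 down to 0.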

Note: Decision trees have low bias and high variance. They tend to overfit the training data and fail to generalize, so the model will perform well in the training environment but might not perform well when deployed. So we must carefully decide the right bias-variance trade-off. We can prune the tree to reduce the variance.

Dataset Description

The Avalanche dataset is by Microsoft and has 1095 rows and 7 columns:

The "avalanche" column is the target variable: zero means an avalanche did not occur, and one means an avalanche occurred. The other columns are feature variables. "tracked_out" is the only categorical feature; the other feature columns are continuous variables.

We can clearly see there are no null values in the dataset.

EDA

Feature vs Target variable

The no_visitors, fresh_thickness and tracked_out features show little relationship with the target output and do not contribute much to the target variable.

The surface_hoar, wind and weak_layers features vary with the target output and can be used to build the model.

From the confusion matrix we see that 104 outputs are true positives and 141 are true negatives. False positives and false negatives are 42.
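The usual summary metrics follow directly from these four counts. The sketch below uses a made-up classification_metrics helper. The true positive and true negative counts come from the text above, while reading "42" as 42 false positives and 42 false negatives (rather than 42 combined) is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# tp and tn are from the confusion matrix; fp=42 and fn=42 is an assumed reading.
metrics = classification_metrics(tp=104, tn=141, fp=42, fn=42)
print(metrics)
```

Under that assumption the accuracy works out to 245 correct predictions out of 329, a little over 74%.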

Since it's a large dataset, I have used export_text instead of plotting the decision tree. We can see how the decision tree first splits at <=20.50, then at <=4.61 in the second layer, and so on.

#coursera #Machine Learning for Data Analysis #decision tree
