Top Posts Tagged with #train test split

Train-test split validation is a simple and common technique in machine learning. Put simply, it evaluates how well a model performs on data it was not trained on.

#model validation #train test split #ML #AI #model training

Never Skip the Inspection

If you have ever been fortunate (or dumb) enough to buy a house, you know the rush of the searching process, especially in “hot markets”, where houses come and go as fast as popping popcorn. You waited all this time saving for your down payment and your credit score has finally given you the green light to even enter the bank and say the word “loan”. Surfing Zillow and making Pinterest boards are your new hobbies and you just have the best realtor you could possibly wish for.

A house meeting all requirements pops up. You and your realtor go and check it out…it is perfect! You call some of your buddies to the scene and they all agree that it all looks great: good paint, good bathroom, no cracks on doors or windows…looks ready to roll. It looks so good on the surface that you decide to skip the inspection in spite of your realtor’s best advice…it is a newer construction anyway (you repeat to yourself while closing the deal).

Six months later you are hiring a contractor to come and fix the pillars under the house because “under the surface” things were actually not-so-good (bummer!!). This would not have happened if you had followed the advice of your realtor and done the inspection!

This is exactly what happened to me while working on my first data science project. I read in the curriculum something about train/test split and Mean Squared Error, but who needs those when all of my other metrics are looking so pretty?! My R-Squared was looking great, all p-values were under 0.05 and my QQ plot was looking okay (to me anyway). I was feeling accomplished the night I thought I had my model ready, closed my computer and slept like an angel for a solid 5 hours.

I was getting ready to start working on visualizations of my model when for some reason I thought that maybe, maybe, maybe I should look into that train/test split thing…it was going to be a formality anyway, since I was pretty sure my other metrics were not lying to me. An hour later I was staring at a Train MSE = 0.04 and a Test MSE = 114.83…. After searching the net to figure out what those numbers meant, I came to the realization that my model was overfitting (only good for the particular dataset I used to build it). After a long pause and a coffee break (because caffeine), I started my second model almost from scratch.

After that bittersweet experience, my fellow Data Scientists, I come to remind you of the importance of train/test splitting your data to get the so-called Mean Squared Error.

Mean Squared Error

The Mean Squared Error (MSE) is a simple but powerful way to get a sense of the accuracy of your model and can also help confirm whether the model is reliable on datasets other than the one it was built on. All MSE does is take the average of the squared differences (errors) between the real values and the model’s predicted values. A low MSE means that, on average, the model’s predictions land close to the real values. Keep in mind that MSE is expressed in the squared units of the target, so “low” is always relative to the scale of what you are predicting. Now, this alone does not tell us if the model is overfitting or underfitting…we need one prior step to be able to confirm this: train/test split.
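To make the definition concrete, here is a minimal sketch of MSE computed by hand with NumPy. The function name and the sample numbers are made up for illustration; in practice scikit-learn's mean_squared_error does the same job.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between real and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred          # one error per observation
    return float(np.mean(errors ** 2))  # square, then average

# Made-up values, only to show the arithmetic
real = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
mse = mean_squared_error_manual(real, predicted)
print(mse)
```

Squaring keeps positive and negative errors from cancelling each other out, and it punishes large misses more than small ones.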

Train/Test Split

The name gives away the content: train/test split is simply that, splitting the dataset into two parts: one part to build the model (train) and one part to perform the sanity check (the inspection of the house!). Because the split should be random, there are Python libraries with functions that do this for you. I used the train_test_split function from the scikit-learn library and found it extremely easy and intuitive to work with.
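A minimal sketch of what the split can look like with train_test_split. The data here is randomly generated and the 80/20 ratio and random_state value are assumptions for the example, not the settings from my project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 rows, 3 features, 1 target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Hold out 20% of the rows for testing; random_state makes the
# shuffle reproducible so you get the same split on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing random_state is optional, but it saves you from chasing numbers that change every time you rerun the notebook.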

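Now, in terms of the data itself, make sure that any data cleaning or transformation you perform is applied to both the training and the testing data. Cleaning that treats each row on its own (fixing typos, dropping unused columns) can safely be done before splitting the dataset. Anything that learns values from the data, such as scaling or filling gaps with the mean, should be fit on the training data only and then applied to the testing data; otherwise information from the test set leaks into the model and the inspection stops being honest.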
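As an illustration of applying a transformation to both parts, here is a small sketch using scikit-learn's StandardScaler. The data is synthetic and the choice of scaler is an assumption for the example; the point is only the order of operations: learn the transformation from the training data, then reuse it on the testing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 rows, 2 features on a large scale
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those same training statistics on the test rows
X_test_scaled = scaler.transform(X_test)
```

The test data never gets its own fit call, so it stays a fair stand-in for data the model has not seen.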

The Moment of Truth

Time to confirm if the model will serve its future masters well or if it is too loyal to the original dataset. You take the Train data and get the MSE. In a separate exercise, you take the Test data and get the MSE. Finally, you compare both MSEs and hope for a small difference between them. If you get a small difference then go treat yourself, you have a model that is likely to do well with future datasets. If you have a big difference...well...go grab a cup of coffee as I did and brace yourself.

Below is the code I used to get the MSE of both training and testing data in case you were wondering:
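The original snippet is not included in this copy of the post. As a stand-in, here is a minimal sketch of the comparison described above, using synthetic data and scikit-learn's LinearRegression; the data, model choice and variable names are assumptions, not the code from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a straight line plus noise (an assumption for the demo)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Build the model on the training data only
model = LinearRegression()
model.fit(X_train, y_train)

# MSE on the data the model has seen vs. the data it has not
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

print(f"Train MSE: {train_mse:.3f}")
print(f"Test MSE:  {test_mse:.3f}")
```

Because this toy data really is linear, the two numbers should land close to each other. A gap like my 0.04 versus 114.83 is the sign of a model that memorized its training data.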

#data science #mean squared error #train test split #regression analysis