In this Python project, we will use Pandas and Seaborn to perform exploratory data analysis (EDA) in an effort to understand how our features impact our target. A big part of this is paying close attention to the distributions of the features used in our supervised machine learning project.
After we've explored the data and extracted valuable insights for our business partners, along with ideas on how to build our model, we will use Sklearn to preprocess the data for our ML model.
Follow Data Science Teacher Brandyn
A good practice in Machine Learning is to try one of every major model type on your ML problem. All of the models try to predict the same thing, but with different math. No matter how well you understand the math, humans just aren't capable of thinking through all the interrelationships among features and how they relate to the target. An easier solution is to try them all. Sklearn makes it rather easy to try RandomForest, Bagging, AdaBoost, and GradientBoosting.
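As a rough sketch of what "trying them all" can look like, the snippet below runs the four ensemble types through the same cross-validation loop. It assumes a regression problem and uses synthetic data from make_regression as a stand-in for the project's dataset; the hyperparameters are arbitrary choices, not settings taken from the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); swap in the project's features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One instance of each ensemble type; n_estimators=50 is an arbitrary choice.
models = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Bagging": BaggingRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # For regressors the default score is R^2; cv=3 keeps the run short.
    cv_scores[name] = cross_val_score(model, X, y, cv=3).mean()

for name, score in cv_scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

For a classification target, the matching Classifier classes from sklearn.ensemble would take the place of the Regressor ones.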
As we go through our EDA section, we will collect insights specifically about the distributions of our features, because they will be very important for correctly preprocessing those features for our Machine Learning model.
In our bivariate analysis, we will plot the correlation matrix with Pandas .corr() and Seaborn's heatmap() to give us an easy understanding of the linear correlations among our features.
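A minimal sketch of that correlation plot might look like the following. The DataFrame and its column names (feature_a, feature_b, target) are made up for illustration, and the color map and annotation settings are just one reasonable choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical numeric data standing in for the project's features.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature_a": rng.normal(size=100)})
df["feature_b"] = 2 * df["feature_a"] + rng.normal(scale=0.5, size=100)
df["target"] = df["feature_b"] - rng.normal(size=100)

# .corr() computes pairwise Pearson (linear) correlations by default.
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fixing vmin/vmax at -1 and 1 keeps the color scale comparable between plots.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
```

Keep in mind that Pearson correlation only captures linear relationships, so a value near zero does not rule out a nonlinear one.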
Exponential distributions are difficult for many ML models, and it is often better to take a log transform of an exponentially distributed feature to bring it closer to a normal distribution. This is an imperfect technique, but it will most likely make the average a better representation of the data and make for better predictions.
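To make the idea concrete, here is a small sketch that log-transforms a synthetic, exponentially distributed column and compares skewness before and after. The column name price and the scale are hypothetical, and np.log1p (log of 1 + x) is used instead of a plain log as one common way to stay safe when values can be zero.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature drawn from an exponential distribution.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.exponential(scale=1000.0, size=1000)})

# log1p computes log(1 + x), which is defined at x = 0.
df["log_price"] = np.log1p(df["price"])

skew_before = df["price"].skew()
skew_after = df["log_price"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# The transform is reversible: expm1 undoes log1p.
recovered = np.expm1(df["log_price"])
```

The result is usually less skewed rather than perfectly normal, which fits the "imperfect technique" caveat. If the transformed column is the target, predictions need to be mapped back with np.expm1.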
We use Pandas get_dummies() to one-hot encode our categories and bucketized continuous features.
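The sketch below shows one way this could go: a continuous column is first bucketized with pd.cut, then both it and a categorical column are one-hot encoded. The column names, bin edges, and labels are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical and one continuous feature.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "age": [23, 35, 47, 61],
})

# Bucketize the continuous feature; bins are right-inclusive by default.
df["age_bucket"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 120],
    labels=["young", "middle", "senior"],
)

# One-hot encode both columns as 0/1 integers; the original columns are dropped.
encoded = pd.get_dummies(df, columns=["color", "age_bucket"], dtype=int)
print(encoded)
```

Each encoded column is named after its source column and category, for example color_red or age_bucket_senior.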
With Sklearn, it's handy to create a little function that fits each model and gets its scores.
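One possible shape for such a helper is sketched here. The name fit_and_score is hypothetical, the data is synthetic, and a regression target is assumed, so score() returns R^2; the project's own function may differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit the model and return its (train, test) scores."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic stand-in data (assumption); replace with the preprocessed features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for model in (
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
):
    results[type(model).__name__] = fit_and_score(
        model, X_train, X_test, y_train, y_test
    )

for name, (train_score, test_score) in results.items():
    print(f"{name}: train R^2 = {train_score:.3f}, test R^2 = {test_score:.3f}")
```

Returning both the train and test scores makes it easier to spot overfitting, where the train score is much higher than the test score.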