Thank you to Ula Yousef on LinkedIn, who designed the entire workbook herself and prepared this post, basically everything but the follow-along video. Thank you so much for the beautiful plotting and wonderful clustering for a beginner Python project.
In this guided Python project, you will learn how to build your first k-means clustering project step by step using these libraries:
1_ Pandas: provides functions for loading, manipulating, and analyzing data.
2_ NumPy: used to perform numerical computations.
3_ Matplotlib/Seaborn: used to draw visualizations.
4_ Scikit-learn (sklearn): contains pre-implemented functions for everything from data preprocessing to model development and evaluation.
5_ Yellowbrick: used to draw the clusters, the elbow method plot, and the silhouette plot.
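Here is a minimal sketch of the imports this kind of project typically needs; the package names are the standard ones on PyPI (pandas, numpy, matplotlib, seaborn, scikit-learn, yellowbrick).

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
```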
We will start by performing statistical analysis on our data, such as checking for missing values, duplicates, and value types. This helps us understand the data we are working with and pre-process it. Why do we do that? Because the cleaner your data is, the higher your model's accuracy will be.
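A minimal sketch of those first-look checks. The Iris data is loaded here through seaborn's built-in loader so the snippet runs on its own; the workbook may load it from a CSV instead, so treat the loading line as an assumption.

```python
import seaborn as sns

df = sns.load_dataset("iris")  # assumption: the workbook may use a CSV instead

df.info()                     # column types and non-null counts
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicated rows
print(df.describe())          # basic descriptive statistics
```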
Then we will visualize our data using Matplotlib and Seaborn. This step is very important for exploring, monitoring, and explaining the data. Once we finish it, our machine learning journey starts:
Choose the features we want to cluster our data on. We will drop the species column because k-means clustering is unsupervised machine learning (there are no labels to train on); see the sketch after this list.
Choose the best number of clusters using the elbow method.
Fit the model.
Predict, and here our model will divide the Iris flowers into clusters.
Check the quality of the clustering using the silhouette score.
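As mentioned in the first step, here is a minimal sketch of dropping the label column before clustering; the column is called "species" in seaborn's copy of the dataset, so the exact name is an assumption.

```python
import seaborn as sns

df = sns.load_dataset("iris")
X = df.drop(columns=["species"])  # keep only the four numeric measurements
print(X.columns.tolist())         # sepal_length, sepal_width, petal_length, petal_width
```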
Follow Data Science Teacher Brandyn
The Seaborn countplot shows the counts of observations in each categorical bin using bars. We notice there is an equal number of each species.
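A short sketch of that countplot, assuming the label column is named "species":

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")
sns.countplot(x="species", data=df)  # one bar per species, 50 rows each
plt.show()
```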
The Seaborn pairplot allows us to plot pairwise relationships between variables within a dataset. This creates a nice visualization and helps us understand the data by summarizing a large amount of information in a single figure. This is essential when we are exploring our dataset and trying to become familiar with it.
As we can see, the petal length feature clearly differentiates the species.
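A sketch of the pairplot, colored by species so the separation along petal length is visible:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")  # pairwise scatter plots plus per-feature distributions
plt.show()
```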
Now, to see how the features are correlated with each other, we will use the df.corr() method, which returns numerical values from -1 to +1. A value close to +1 means there is a strong positive relationship between the two features, and a value close to -1 means there is a strong negative relationship. Notice that each feature is correlated with itself with a value of +1.
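A short sketch of the correlation check; numeric_only=True keeps the non-numeric species column out of the calculation on newer pandas versions:

```python
import seaborn as sns

df = sns.load_dataset("iris")
corr = df.corr(numeric_only=True)  # exclude the string "species" column
print(corr)                        # values range from -1 to +1; the diagonal is +1
```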
The Seaborn heatmap represents each value with a shade of the same color: the darker shades of the chart represent higher values and the lighter shades represent lower values.
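A sketch of plotting the correlation matrix as a heatmap; annot=True writes the values inside the cells, and the color map is just an example choice:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="Blues")
plt.show()
```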
The elbow method is a heuristic for determining the number of clusters in a dataset. It consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. Yellowbrick has an elbow method plot that makes it easier to see where a good value for the number of clusters lies.
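A sketch of the elbow plot using Yellowbrick's KElbowVisualizer; the range of k values tried here is an assumption:

```python
import seaborn as sns
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

df = sns.load_dataset("iris")
X = df.drop(columns=["species"])

model = KMeans(n_init=10, random_state=42)
visualizer = KElbowVisualizer(model, k=(2, 10))  # try k = 2 .. 9
visualizer.fit(X)   # fits a KMeans model for each k and records the distortion
visualizer.show()
```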
In the plot above, we see our data divided into clusters along with their centroids.
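A sketch of fitting k-means with three clusters and plotting them with their centroids; plotting petal length against petal width, and k=3, are assumptions that follow the usual Iris example:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = sns.load_dataset("iris")
X = df.drop(columns=["species"])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Scatter the points colored by cluster, then mark the centroids.
plt.scatter(X["petal_length"], X["petal_width"], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 2], kmeans.cluster_centers_[:, 3],
            c="red", marker="X", s=200, label="centroids")
plt.xlabel("petal_length")
plt.ylabel("petal_width")
plt.legend()
plt.show()
```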
We use the Yellowbrick library to access the silhouette plotting function and examine our clusters.
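A sketch of checking the result with scikit-learn's silhouette_score and Yellowbrick's SilhouetteVisualizer:

```python
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer

df = sns.load_dataset("iris")
X = df.drop(columns=["species"])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("silhouette score:", silhouette_score(X, labels))  # closer to 1 is better

# Yellowbrick draws one silhouette profile per cluster.
visualizer = SilhouetteVisualizer(KMeans(n_clusters=3, n_init=10, random_state=42))
visualizer.fit(X)
visualizer.show()
```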