Clustering

 𝐏𝐨𝐰𝐞𝐫 𝐨𝐟 𝐇𝐢𝐞𝐫𝐚𝐫𝐜𝐡𝐢𝐜𝐚𝐥 𝐂𝐥𝐮𝐬𝐭𝐞𝐫𝐢𝐧𝐠: 𝐀 𝐆𝐮𝐢𝐝𝐞 𝐭𝐨 𝐃𝐚𝐭𝐚 𝐆𝐫𝐨𝐮𝐩𝐢𝐧𝐠

 In the world of Data Science, Hierarchical Clustering stands out for its elegance and versatility. This powerful method helps group similar data points, uncover hidden patterns, and explore relationships within datasets. 🌐

 

🔑 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐇𝐢𝐞𝐫𝐚𝐫𝐜𝐡𝐢𝐜𝐚𝐥 𝐂𝐥𝐮𝐬𝐭𝐞𝐫𝐢𝐧𝐠?

Hierarchical Clustering is an unsupervised learning technique that builds a tree of clusters, which can be drawn as a dendrogram. Its most common variant, agglomerative clustering, works bottom-up by progressively merging smaller clusters into larger ones. Here's how it works:

 

 𝐒𝐭𝐚𝐫𝐭 𝐰𝐢𝐭𝐡 𝐈𝐧𝐝𝐢𝐯𝐢𝐝𝐮𝐚𝐥 𝐃𝐚𝐭𝐚 𝐏𝐨𝐢𝐧𝐭𝐬: Initially, each data point is treated as its own cluster.

 

 𝐌𝐞𝐚𝐬𝐮𝐫𝐞 𝐃𝐢𝐬𝐭𝐚𝐧𝐜𝐞𝐬: The distance between every pair of clusters is calculated using a chosen point-to-point distance (for example, Euclidean) and a linkage method.

 

 𝐌𝐞𝐫𝐠𝐞 𝐭𝐡𝐞 𝐂𝐥𝐨𝐬𝐞𝐬𝐭 𝐂𝐥𝐮𝐬𝐭𝐞𝐫𝐬: The two closest clusters are merged, and the measure-and-merge cycle repeats until all points belong to a single cluster.

 

 𝐕𝐢𝐬𝐮𝐚𝐥𝐢𝐳𝐞 𝐰𝐢𝐭𝐡 𝐚 𝐃𝐞𝐧𝐝𝐫𝐨𝐠𝐫𝐚𝐦: The dendrogram visually shows how clusters merge and at what distance.
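
The steps above can be sketched in a few lines of plain Python. This is a minimal, deliberately naive illustration, not an optimized implementation: it assumes Euclidean distance and single linkage, and the function name agglomerate and the sample points are made up for this example.

```python
import math

def agglomerate(points):
    """Naive agglomerative clustering (single linkage, Euclidean distance).

    Returns the merge history as (left_cluster, right_cluster, distance)
    tuples, where each cluster is a tuple of point indices.
    """
    # Step 1: every data point starts as its own cluster.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Step 2: measure the distance between every pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: merge the closest pair and record the merge distance.
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical sample data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
for left, right, d in agglomerate(points):
    print(left, "+", right, "at distance", round(d, 3))
```

The recorded merge distances are exactly what a dendrogram plots on its vertical axis. For real datasets a library implementation is usually the better choice, since this sketch recomputes all pairwise distances on every pass.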

 

🔍 𝐋𝐢𝐧𝐤𝐚𝐠𝐞 𝐌𝐞𝐭𝐡𝐨𝐝𝐬: 𝐂𝐡𝐨𝐨𝐬𝐢𝐧𝐠 𝐘𝐨𝐮𝐫 𝐃𝐢𝐬𝐭𝐚𝐧𝐜𝐞 𝐌𝐞𝐭𝐫𝐢𝐜

The effectiveness of hierarchical clustering depends on how we measure distances between clusters. Here are the most common methods:

 

 𝐀𝐯𝐞𝐫𝐚𝐠𝐞 𝐋𝐢𝐧𝐤𝐚𝐠𝐞

What it does: Calculates the average distance over all pairs of points, taking one point from each cluster.

Formula:

 𝐃_𝐚𝐯𝐠(𝐀, 𝐁) = (1 / (|𝐀| * |𝐁|)) * Σ (𝐢 ∈ 𝐀) Σ (𝐣 ∈ 𝐁) 𝐝(𝐢, 𝐣)
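
As a quick check of the formula, here is a small sketch that computes average linkage directly. It assumes d is Euclidean distance; the function name average_linkage and the one-dimensional sample clusters are invented for illustration.

```python
import math

def average_linkage(A, B):
    # Sum d(i, j) over every cross-cluster pair, then divide by |A| * |B|.
    total = sum(math.dist(i, j) for i in A for j in B)
    return total / (len(A) * len(B))

# Hypothetical 1-D clusters: pairwise distances are 5, 9, 3 and 7.
A = [(0.0,), (2.0,)]
B = [(5.0,), (9.0,)]
print(average_linkage(A, B))  # 6.0
```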

 

 𝐒𝐢𝐧𝐠𝐥𝐞 𝐋𝐢𝐧𝐤𝐚𝐠𝐞

What it does: Measures the shortest distance between any two points, one from each cluster.

Formula:

 𝐃_𝐬𝐢𝐧𝐠𝐥𝐞(𝐀, 𝐁) = 𝐦𝐢𝐧(𝐢 ∈ 𝐀, 𝐣 ∈ 𝐁) 𝐝(𝐢, 𝐣)

 

 𝐂𝐨𝐦𝐩𝐥𝐞𝐭𝐞 𝐋𝐢𝐧𝐤𝐚𝐠𝐞

What it does: Measures the farthest distance between any two points, one from each cluster.

Formula:

 𝐃_𝐜𝐨𝐦𝐩𝐥𝐞𝐭𝐞(𝐀, 𝐁) = 𝐦𝐚𝐱(𝐢 ∈ 𝐀, 𝐣 ∈ 𝐁) 𝐝(𝐢, 𝐣)
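
Single and complete linkage differ only in swapping min for max, which the following sketch makes explicit. It assumes Euclidean distance for d, and the function names and sample clusters are hypothetical.

```python
import math

def single_linkage(A, B):
    # Closest cross-cluster pair.
    return min(math.dist(i, j) for i in A for j in B)

def complete_linkage(A, B):
    # Farthest cross-cluster pair.
    return max(math.dist(i, j) for i in A for j in B)

# Hypothetical 1-D clusters: pairwise distances are 5, 9, 3 and 7.
A = [(0.0,), (2.0,)]
B = [(5.0,), (9.0,)]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # 9.0
```

On the same pair of clusters the two methods can disagree widely, which is why the choice of linkage changes the shape of the resulting tree.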

 

 𝐖𝐚𝐫𝐝’𝐬 𝐋𝐢𝐧𝐤𝐚𝐠𝐞

What it does: Merges the pair of clusters whose union causes the smallest increase in total within-cluster variance.

Formula:

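 𝐃_𝐖𝐚𝐫𝐝(𝐀, 𝐁) = (|𝐀| * |𝐁|) / (|𝐀| + |𝐁|) * ‖μ_𝐀 − μ_𝐁‖²

Here μ_𝐀 and μ_𝐁 are the centroids (mean points) of the two clusters, and ‖·‖² is the squared Euclidean distance.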
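
The sketch below checks Ward's criterion numerically. It assumes the distance term is the squared Euclidean distance between the cluster centroids, and it compares the formula against the actual increase in within-cluster sum of squared errors after merging. All helper names and the sample data are made up for illustration.

```python
def centroid(cluster):
    # Mean of the points, coordinate by coordinate.
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def sse(cluster):
    # Sum of squared distances from each point to the cluster centroid.
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_distance(A, B):
    # |A||B| / (|A| + |B|) times the squared distance between centroids.
    ca, cb = centroid(A), centroid(B)
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(A) * len(B) / (len(A) + len(B)) * sq

# Hypothetical clusters lying on the x-axis.
A = [(0.0, 0.0), (2.0, 0.0)]
B = [(5.0, 0.0), (7.0, 0.0), (9.0, 0.0)]

merge_cost = ward_distance(A, B)
increase = sse(A + B) - sse(A) - sse(B)
print(merge_cost, increase)  # both are approximately 43.2
```

The two printed values agree up to floating-point rounding, which shows why this criterion is described as keeping within-cluster variance small.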

 

🌳 𝐓𝐡𝐞 𝐃𝐞𝐧𝐝𝐫𝐨𝐠𝐫𝐚𝐦: 𝐀 𝐕𝐢𝐬𝐮𝐚𝐥 𝐆𝐮𝐢𝐝𝐞 𝐭𝐨 𝐘𝐨𝐮𝐫 𝐂𝐥𝐮𝐬𝐭𝐞𝐫𝐬

The dendrogram is one of the most useful outputs of hierarchical clustering. This tree-like diagram illustrates how clusters merge and provides a clear view of cluster similarities at different levels. The height of each branch shows the distance at which two clusters were merged. Cutting the tree at a chosen height yields a flat set of clusters, and a large gap between successive merge heights often suggests a natural number of clusters.
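
To make the idea of cutting the tree at a height concrete, here is a small sketch. It assumes the merge history is available as (left_cluster, right_cluster, height) tuples with non-decreasing heights; the function name cut_dendrogram and the merge list are hypothetical sample data, not output from any particular library.

```python
def cut_dendrogram(merges, n_points, height):
    """Return flat clusters obtained by keeping only merges at or below `height`."""
    parent = list(range(n_points))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for left, right, d in merges:
        if d <= height:
            # Join the two clusters via one representative point from each.
            parent[find(left[0])] = find(right[0])

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical merge history for five points (indices 0..4).
merges = [
    ((0,), (1,), 1.0),
    ((2,), (3,), 1.5),
    ((0, 1), (2, 3), 5.0),
    ((4,), (0, 1, 2, 3), 9.86),
]
print(cut_dendrogram(merges, 5, height=3.0))  # [[0, 1], [2, 3], [4]]
print(cut_dendrogram(merges, 5, height=6.0))  # [[0, 1, 2, 3], [4]]
```

Here the jump from 1.5 to 5.0 between successive merges is the kind of gap that hints at three natural groups.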

 

🚀 𝐖𝐡𝐲 𝐔𝐬𝐞 𝐇𝐢𝐞𝐫𝐚𝐫𝐜𝐡𝐢𝐜𝐚𝐥 𝐂𝐥𝐮𝐬𝐭𝐞𝐫𝐢𝐧𝐠?

Hierarchical clustering is well suited to datasets whose structure isn’t immediately obvious, since it does not require fixing the number of clusters in advance. Common applications include:

 

 𝐂𝐮𝐬𝐭𝐨𝐦𝐞𝐫 𝐒𝐞𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧: Grouping customers based on behaviours.

 𝐃𝐨𝐜𝐮𝐦𝐞𝐧𝐭 𝐂𝐥𝐮𝐬𝐭𝐞𝐫𝐢𝐧𝐠: Organizing documents into topics.

 𝐁𝐢𝐨𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐜𝐬: Classifying genes or proteins with similar functions.

 𝐌𝐚𝐫𝐤𝐞𝐭 𝐑𝐞𝐬𝐞𝐚𝐫𝐜𝐡: Identifying patterns in consumer behaviour.
