Hi everyone, I am going to share a case study on market analysis and conjoint analysis. From this example, you can learn to read basic statistical output and to run a simple cluster analysis. The dataset is golf.csv. If you need this dataset, please contact me and I will be happy to share it with you. If you like it, remember to give me a thumbs up!
The dataset describes a set of golf courses: their size, obstacles, hole dimensions, and estimated construction and maintenance costs. We will first look at the summary statistics and then group the courses with a simple cluster analysis.
##Load the dataset.
%cd /Users/shimonyagrawal/Desktop
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
golf = pd.read_csv('golf.csv')
#Why will courseID not be relevant in a clustering model?
golf = golf.drop(columns='courseID')
golf
The course ID is not relevant to clustering because it is only a unique identifier. It is just a number assigned to each response from a golf course vendor, so it says nothing about the course itself and would not contribute anything meaningful to the analysis.
| elevation | square_feet | est_playing_time | land_obstacles | water_obstacles | tunnel_shots | est_construction_cost | est_maintenance_cost | average_hole_length | average_hole_width |
0 | 11.64 | 21037.18 | 43.35 | 10.0 | 3.0 | 3.0 | 103082.72 | 7261.22 | 18.99 | 3.90 |
1 | 6.58 | 23646.44 | 42.30 | 10.0 | 4.0 | 3.0 | 91637.93 | 6553.91 | 21.35 | 2.49 |
2 | 11.08 | 20012.28 | 41.43 | 9.0 | 3.0 | 3.0 | 107049.47 | 5847.06 | 19.09 | 2.63 |
3 | 9.91 | 20761.90 | 46.04 | 10.0 | 4.0 | 3.0 | 101799.55 | 8876.01 | 19.36 | 3.51 |
4 | 11.99 | 19818.75 | 44.82 | 7.0 | 6.0 | 4.0 | 94731.84 | 8445.70 | 16.81 | 2.67 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
245 | 10.45 | 23963.73 | 51.46 | 8.0 | 4.0 | 3.0 | 99027.52 | 9333.81 | 20.39 | 2.63 |
246 | 13.77 | 23337.77 | 51.39 | 9.0 | 3.0 | 4.0 | 61096.93 | 7864.50 | 17.06 | 2.93 |
247 | 7.01 | 23951.83 | 41.96 | 7.0 | 3.0 | 2.0 | 106438.63 | 2745.81 | 18.13 | 2.53 |
248 | 8.70 | 23850.69 | 45.10 | 6.0 | 3.0 | 3.0 | 98163.76 | 7955.81 | 20.32 | 3.79 |
249 | 10.26 | 26820.41 | 47.58 | 8.0 | 5.0 | 3.0 | 92221.42 | 10027.75 | 20.28 | 2.66 |
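To make the point about the ID column concrete, here is a minimal sketch on made-up data (the two rows below are hypothetical, not taken from golf.csv). Clustering methods such as k-means work on distances between rows, so two courses with identical features would still look far apart if their IDs were left in.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two courses with identical features but different IDs.
toy = pd.DataFrame({
    "courseID": [1, 250],
    "elevation": [10.0, 10.0],
    "average_hole_width": [3.0, 3.0],
})

def euclidean(df):
    """Euclidean distance between the first two rows of df."""
    a = df.iloc[0].to_numpy(dtype=float)
    b = df.iloc[1].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

# With the ID column, the distance comes entirely from the ID gap.
dist_with_id = euclidean(toy)
# Without it, the two identical courses are at distance zero.
dist_without_id = euclidean(toy.drop(columns="courseID"))

print(dist_with_id, dist_without_id)
```

The ID difference is the only thing separating the two rows, which is why the column is dropped before clustering.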
##Call the describe() function on your dataset.
golf.describe()
The describe function returns descriptive statistics that summarize the distribution of each variable in the dataset. Summary statistics condense a large amount of data into a few numbers. Using them, the analyst can spot possible outliers (for example, a min or max far from the quartiles) and see where the data is skewed (for example, a mean far from the median).
| elevation | square_feet | est_playing_time | land_obstacles | water_obstacles | tunnel_shots | est_construction_cost | est_maintenance_cost | average_hole_length | average_hole_width |
count | 250.00000 | 250.000000 | 250.000000 | 250.000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 |
mean | 10.90348 | 22052.677600 | 44.916120 | 7.840 | 3.968000 | 2.944000 | 94956.160200 | 7779.288160 | 19.757280 | 2.964680 |
std | 2.52390 | 2708.177478 | 5.001146 | 1.544 | 0.775488 | 0.598579 | 11656.524492 | 1990.536582 | 1.750693 | 0.459777 |
min | 2.92000 | 14357.480000 | 31.630000 | 3.000 | 2.000000 | 2.000000 | 61096.930000 | 2682.460000 | 14.690000 | 1.640000 |
25% | 9.45250 | 20162.532500 | 41.430000 | 7.000 | 3.000000 | 3.000000 | 86997.525000 | 6527.242500 | 18.560000 | 2.632500 |
50% | 11.08500 | 22030.715000 | 44.780000 | 8.000 | 4.000000 | 3.000000 | 94727.880000 | 7760.000000 | 19.780000 | 2.990000 |
75% | 12.70500 | 23974.867500 | 48.157500 | 9.000 | 4.000000 | 3.000000 | 102375.640000 | 8935.887500 | 20.925000 | 3.300000 |
max | 17.77000 | 29712.520000 | 58.020000 | 14.000 | 6.000000 | 4.000000 | 126247.720000 | 12589.800000 | 25.490000 | 4.210000 |
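As a rough illustration of how summary statistics can flag outliers and skew, here is a small sketch. The sample values are invented for the example (they are not rows from golf.csv), and the 1.5 × IQR rule is just one common convention, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical sample, not the golf data: one value sits far from the rest.
col = pd.Series(
    [6500.0, 7200.0, 7700.0, 7800.0, 8100.0, 8900.0, 25000.0],
    name="est_maintenance_cost",
)

stats = col.describe()

# Interquartile range from the 25% and 75% rows of describe().
iqr = stats["75%"] - stats["25%"]
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr

# Values outside the fences are flagged as possible outliers.
outliers = col[(col < lower) | (col > upper)]

# A mean above the median (the 50% row) hints at right skew.
right_skewed = bool(stats["mean"] > stats["50%"])

print(outliers.tolist(), right_skewed)
```

On the real data you would run the same check column by column, for example on est_maintenance_cost, whose minimum in the table above is well below the first quartile.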
##Build a k-means model.
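from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# golf_normalize is not defined anywhere in this post. Z-score scaling is assumed here,
# so the clusters you get may differ from the output shown.
golf_normalize = StandardScaler().fit_transform(golf)
kmeans_model = KMeans(n_clusters = 3, random_state = 101)
kmeans_model.fit(golf_normalize)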
cluster_labels = kmeans_model.labels_
golf_cluster = golf.assign(Cluster = cluster_labels)
grouped = golf_cluster.groupby(['Cluster'])
grouped.agg({
'square_feet': 'mean',
'est_construction_cost' : 'mean',
'est_maintenance_cost' : 'mean'}).round(2)
Cluster | square_feet | est_construction_cost | est_maintenance_cost |
0 | 24754.41 | 84712.52 | 7551.36 |
1 | 19985.26 | 105199.56 | 7382.96 |
2 | 22149.62 | 92833.78 | 8187.60 |
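The model is fit on normalized data rather than on the raw columns. Here is a minimal sketch of why that matters, assuming z-score scaling (the post does not say which normalization was used). The data is synthetic, drawn with the mean and standard deviation that describe() reports for two columns, and `demo` and `demo_scaled` are names made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for two golf columns, drawn with the mean and std
# reported by describe(). These are NOT the real rows.
demo = pd.DataFrame({
    "est_construction_cost": rng.normal(94956.16, 11656.52, size=250),
    "average_hole_width": rng.normal(2.96, 0.46, size=250),
})

# Z-score each column: subtract the mean, divide by the population std.
demo_scaled = (demo - demo.mean()) / demo.std(ddof=0)

# Share of the total variance carried by construction cost, before and after.
raw_var = demo.var(ddof=0)
scaled_var = demo_scaled.var(ddof=0)
raw_share = raw_var["est_construction_cost"] / raw_var.sum()
scaled_share = scaled_var["est_construction_cost"] / scaled_var.sum()

print(round(raw_share, 6), round(scaled_share, 6))
```

On the raw scale, almost all of the variance (and so almost all of the distance k-means sees) comes from the cost column. After scaling, both columns contribute equally.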
golf_cluster.head()
| elevation | square_feet | est_playing_time | land_obstacles | water_obstacles | tunnel_shots | est_construction_cost | est_maintenance_cost | average_hole_length | average_hole_width | Cluster |
0 | 11.64 | 21037.18 | 43.35 | 10.0 | 3.0 | 3.0 | 103082.72 | 7261.22 | 18.99 | 3.90 | 1 |
1 | 6.58 | 23646.44 | 42.30 | 10.0 | 4.0 | 3.0 | 91637.93 | 6553.91 | 21.35 | 2.49 | 2 |
2 | 11.08 | 20012.28 | 41.43 | 9.0 | 3.0 | 3.0 | 107049.47 | 5847.06 | 19.09 | 2.63 | 1 |
3 | 9.91 | 20761.90 | 46.04 | 10.0 | 4.0 | 3.0 | 101799.55 | 8876.01 | 19.36 | 3.51 | 1 |
4 | 11.99 | 19818.75 | 44.82 | 7.0 | 6.0 | 4.0 | 94731.84 | 8445.70 | 16.81 | 2.67 | 1 |
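This completes the clustering. Finally, the clusters should be summarized. Judging by the cluster means: cluster 0 has the largest courses (about 24,754 square feet) and the lowest estimated construction cost (about 84,713); cluster 1 has the smallest courses (about 19,985 square feet) and the highest estimated construction cost (about 105,200); cluster 2 sits in between on both and has the highest estimated maintenance cost (about 8,188).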
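One way to read the cluster means is to compare each one with the overall mean of the same column. The sketch below does this with the numbers reported in this post (cluster means from the aggregated output, overall means from describe(), rounded to two decimals). The names `cluster_means`, `overall_means` and `relative_pct` are just for this example.

```python
import pandas as pd

# Cluster means copied from the aggregated output in this post.
cluster_means = pd.DataFrame(
    {
        "square_feet": [24754.41, 19985.26, 22149.62],
        "est_construction_cost": [84712.52, 105199.56, 92833.78],
        "est_maintenance_cost": [7551.36, 7382.96, 8187.60],
    },
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# Overall means copied from the describe() output (rounded to 2 decimals).
overall_means = pd.Series({
    "square_feet": 22052.68,
    "est_construction_cost": 94956.16,
    "est_maintenance_cost": 7779.29,
})

# Percent difference of each cluster mean from the overall mean.
# The Series index is aligned with the DataFrame columns.
relative_pct = ((cluster_means / overall_means - 1) * 100).round(1)

print(relative_pct)
```

Positive values mean the cluster is above the overall average for that column, negative values mean it is below.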