Segmentation and Clustering of neighborhoods
1. Introduction to business problem
Problem Background: -
Tourism in France has directly contributed 78.9 billion euros to the country’s Gross Domestic Product (GDP) in recent years, 30% of which comes from international visitors and 70% from domestic tourism spending. France was visited by 89 million foreign tourists in 2018, making it the most popular tourist destination in the world. Even though France is the most popular destination by arrivals, it ranks only sixth by number of nights spent in the country, after the United States, the United Kingdom, China, Spain and Italy. In this project, we will explore two famous French cities, Paris and Strasbourg.
Paris, the capital city of France, is the third most visited city in the world. It has some of the world’s largest museums, including the Louvre, which is the most visited art museum in the world. It hosts some of the world’s most recognizable landmarks, such as the Eiffel Tower, the Arc de Triomphe and many more.
Strasbourg is one of the four main capitals of the European Union, alongside Brussels, Luxembourg and Frankfurt. It is among the few cities in the world that host first-rank international organizations without being a state capital. Economically, it is an important center of manufacturing and engineering. It is the second largest river port in France after Paris. The city is chiefly known for its sandstone Gothic cathedral with its famous astronomical clock.
Evidently, both cities are rich in cultural heritage and thus attract millions of international tourists every year. Since France ranks only sixth in nights spent by these tourists despite being the most popular tourist destination in the world, it would help tourists to have a rough idea of the available luxury apartments, hotels, restaurants, pubs, cafés etc., to make their stay more comfortable. This might improve France’s rank from sixth to second or third, and possibly even to first, in total number of nights spent by tourists.
Problem description: -
People who visit these places often need a physical or virtual guide. Through this project, we will explore these two major tourist destinations to dig out useful information about the amenities tourists would be looking for. These amenities could be: -
· Hotels
· Restaurants
· Multiplexes
· Opera House
· Mountains
· Museums
· Night Club
· Super Market etc.
In addition, this can be quite helpful for international migrants who are looking for the right place to rent an apartment, so our project could serve them as a virtual guide. Our main aim is to provide an overview of all the available venues within these cities so that people are less reliant on local guides, who often charge these immigrants large amounts in exchange for their services.
2. Data wrangling
Data source — 1
In this project, we will be exploring Paris and Strasbourg.
The dataset has been collected from Kaggle. It can be downloaded from this link. The dataset was prepared by INSEE, the official French institute that gathers many types of data across France.
There are four files in the dataset, but for our purposes we will use only name_geographic_information.csv. This file contains the following features: -
· EU_circo: name of the European Union Circonscription
· Code_region: code of the region attached to the town
· nom_région: name of the region attached to the town
· chef.lieu_région: name of the administrative center around the town
· numéro_département : code of the department attached to the town
· nom_département : name of the department attached to the town
· préfecture : name of the local administrative division around the town
· numéro_circonscription : number of the constituency (circonscription)
· nom_commune : name of the town
· codes_postaux : post-codes relative to the town
· code_insee : unique code for the town
· latitude : GPS latitude
· longitude : GPS longitude
· éloignement : meaning unclear (I could not figure out what this number represents)
Out of the above features, only a few are helpful. So, we will have to perform data wrangling to extract the useful features, so that an appropriate machine learning algorithm can extract useful information more accurately. Let’s check how many rows and columns there are in the downloaded dataset. We can check this using the following code:
french_df.shape
There are 36840 rows and 14 columns in the dataset. We will use only the primary features in our final dataset.
The features used as primary features for our models are listed below:
o préfecture: renamed as Borough
o nom_commune: renamed as Neighborhood
o codes_postaux: renamed as Postal-codes
o latitude
o longitude
The dataset contains many prefectures, of which only Paris and Strasbourg are taken into consideration, since we are interested in exploring only these two cities.
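The selection, renaming and filtering described above can be sketched in pandas as follows. This is a minimal sketch, not the notebook's exact code: the tiny in-memory frame stands in for name_geographic_information.csv, its rows are made up, and the helper name select_city_features is hypothetical. Only the column names and rename targets come from the lists above.

```python
import pandas as pd

# Column names come from the feature list; the rows are made-up stand-ins
# for name_geographic_information.csv.
french_df = pd.DataFrame({
    "préfecture": ["Paris", "Strasbourg", "Lyon"],
    "nom_commune": ["Paris", "Achenheim", "Lyon"],
    "codes_postaux": ["75001", "67204", "69001"],
    "latitude": [48.86, 48.58, 45.76],
    "longitude": [2.35, 7.63, 4.83],
    "éloignement": [1.0, 1.2, 1.1],
})

def select_city_features(df, cities=("Paris", "Strasbourg")):
    """Keep the five primary features, rename them, keep only the given prefectures."""
    primary = ["préfecture", "nom_commune", "codes_postaux", "latitude", "longitude"]
    renamed = df[primary].rename(columns={
        "préfecture": "Borough",
        "nom_commune": "Neighborhood",
        "codes_postaux": "Postal-codes",
    })
    return renamed[renamed["Borough"].isin(cities)].reset_index(drop=True)

city_df = select_city_features(french_df)
print(city_df)
```

With the real file, french_df would instead come from pd.read_csv on the downloaded CSV.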
Data source — 2
We will use the Foursquare API to retrieve neighborhood venues by providing geographical coordinates along with user credentials. To explore venues using the Foursquare API, please refer to my GitHub link. Our new data frame would look like -
3. Methodology
Business Understanding: -
Our main aim is to identify suitable places for tourists to stay, where they can afford all the facilities, mainly hotels, restaurants and amusement parks, at a reasonable cost. We also aim to provide a virtual guide for international migrants regarding suitable places to buy apartments.
Analytic approach: -
The original data frame consists of 36840 rows and 14 columns. After cleaning, the data frame consists of 519 rows and 5 columns. The number of rows decreases significantly because we consider only the rows that contain information about Paris and Strasbourg. We will use the K-Means clustering machine learning model to cluster the neighborhoods of these cities based on the mean frequency of the venue categories found in them.
Exploratory Data Analysis (EDA): -
Geographical data exploitation: -
The original dataset contains many rows, of which only some are useful. As our main aim is to cluster neighborhoods, we will first extract the useful information from the dataset.
The original dataset contains missing values, which can be seen below.
As we can see, the latitude and longitude columns of french_df contain missing values. We will drop all rows that contain missing values using the code below:
french_df.dropna(axis=0, inplace=True)
french_df.shape
As of now, only 519 rows remain in the french_df dataset.
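The missing-value check and drop described above can be sketched on a toy frame. This is only an illustration under stated assumptions: the rows are made up, and the sketch restricts the drop to the two coordinate columns and returns a new frame, whereas the line above drops any row with a missing value in place.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for french_df; NaN marks a missing coordinate.
french_df = pd.DataFrame({
    "nom_commune": ["Paris", "Achenheim", "Exempleville"],
    "latitude": [48.86, 48.58, np.nan],
    "longitude": [2.35, np.nan, 5.00],
})

# Count missing values per column before dropping anything.
missing_per_column = french_df.isnull().sum()
print(missing_per_column)

# Drop rows where either coordinate is missing; this returns a new frame
# and leaves french_df untouched.
clean_df = french_df.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)
print(clean_df.shape)
```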
We used Geopy along with Folium library to plot the neighborhoods on the map.
From the above plot, we can see that the neighborhoods of these cities are densely packed.
Let’s group these neighborhoods into two clusters, namely Paris and Strasbourg. We will visualize these clusters using the Folium library and Geopy.
From the above plot, we can see that there are 21 localities in the Paris group and 498 localities in the Strasbourg group.
Quantitative analysis of Neighborhoods’ venues –
The Foursquare API will be used to collect all the neighborhood venues of these two cities. Once the venue data are collected, the new data frame looks like the one below.
We get a total of 2793 rows and 7 columns.
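The venue collection step can be sketched as follows. This is an assumption-heavy sketch: the text does not show which endpoint or parameters the notebook uses, so the code assumes the legacy Foursquare v2 venues/explore endpoint and its response layout. The helper names, the sample payload and the credentials are placeholders, and no network call is made.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "venues/explore" endpoint.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the request URL for venues around one coordinate pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",
        "radius": radius,      # metres
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"

def parse_venues(neighborhood, payload):
    """Flatten an explore response into (neighborhood, venue, lat, lng, category) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append((
                neighborhood,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                categories[0].get("name"),
            ))
    return rows

# Hand-written payload in the assumed response shape.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Chez Exemple",
               "location": {"lat": 48.58, "lng": 7.75},
               "categories": [{"name": "French Restaurant"}]}},
]}]}}

url = build_explore_url(48.58, 7.75, "YOUR_ID", "YOUR_SECRET")
rows = parse_venues("Strasbourg", sample)
print(url)
print(rows)
```

A live run would fetch each URL with an HTTP client, using real credentials, and pass the decoded JSON to parse_venues for every neighborhood.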
We will explore these cities one by one. Strasbourg has a total of 69 restaurants, of which 27 are French restaurants, i.e. 39% of the total. We can see this in the graph below.
Exploring Paris, we find a total of 861 restaurants in its neighborhoods, of which 273 are Japanese restaurants alone, more than Strasbourg’s entire restaurant count. We can verify this in the graph below.
Comparing the restaurants of the two cities, we find far fewer restaurants in Strasbourg than in Paris. There could be many reasons for this large gap. One possible reason is that Strasbourg is a European capital, where administrative blocks may outnumber restaurants. We can compare the two cities using a joint plot of the above bar graphs.
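The restaurant counts and percentages above can be reproduced with a short pandas sketch. The venue rows and the column names Borough and Venue Category are assumptions about the venue data frame, and the helper name restaurant_share is hypothetical. The last line only rechecks the 27 out of 69 figure quoted in the text.

```python
import pandas as pd

# Made-up venue rows; the column names are assumptions about the venue data frame.
venues = pd.DataFrame({
    "Borough": ["Strasbourg"] * 4 + ["Paris"] * 3,
    "Venue Category": [
        "French Restaurant", "French Restaurant", "Italian Restaurant", "Hotel",
        "Japanese Restaurant", "French Restaurant", "Museum",
    ],
})

def restaurant_share(df, city):
    """Percentage share of each restaurant category among one city's restaurants."""
    in_city = df[df["Borough"] == city]
    restaurants = in_city[in_city["Venue Category"].str.contains("Restaurant")]
    return restaurants["Venue Category"].value_counts(normalize=True) * 100

strasbourg_share = restaurant_share(venues, "Strasbourg")
print(strasbourg_share.round(1))

# The figure quoted in the text: 27 French restaurants out of 69.
french_pct = round(27 / 69 * 100)
print(french_pct)
```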
4. Result
On the above dataset, we performed K-Means cluster analysis to cluster these neighborhoods based on the mean frequency of each venue category. We will study the neighborhoods of these two cities jointly.
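The "mean of frequency" feature table can be built by one-hot encoding the venue category and averaging per neighborhood. A minimal sketch, assuming a venue frame with Neighborhood and Venue Category columns (both names and all rows are illustrative):

```python
import pandas as pd

# Made-up venues for two neighborhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Hotel", "Hotel", "Museum", "Bar", "Bar", "Bar"],
})

# One-hot encode the category, then average per neighborhood: each cell becomes
# the fraction of that neighborhood's venues that fall in that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of grouped sums to 1 across the category columns, so every neighborhood is described by its venue mix rather than by its raw venue count.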
K-Means clustering –
We will cluster these neighborhoods into k clusters, deriving the optimal value of k using the gap statistic. The optimal value of k comes out to be 15, so we will cluster these neighborhoods into 15 clusters using the K-Means clustering algorithm.
We will use the code below to obtain the optimal value of k.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def optimalK(data, max_cluster):
    n_ref = 500  # number of random reference datasets per value of k
    gap_arr = np.zeros(max_cluster - 1)
    original_inertia = np.zeros(max_cluster - 1)
    reference_inertia = np.zeros(max_cluster - 1)
    rows = []
    for gap_index, k in enumerate(range(1, max_cluster)):
        # inertia of k-means on uniform random reference data of the same shape
        ref_set = np.zeros(n_ref)
        for i in range(n_ref):
            ref_dist = np.random.random_sample(data.shape)
            km = KMeans(k)
            km.fit(ref_dist)
            ref_set[i] = km.inertia_
        # inertia of k-means on the actual data (read from km_orig, not km)
        km_orig = KMeans(k)
        km_orig.fit(data)
        orig_inertia = km_orig.inertia_
        original_inertia[gap_index] = np.log(orig_inertia)
        reference_inertia[gap_index] = np.mean(np.log(ref_set))
        # calculate gap statistic: mean log reference inertia minus log observed inertia
        gap_statistics = reference_inertia[gap_index] - original_inertia[gap_index]
        gap_arr[gap_index] = gap_statistics
        rows.append({'gap': gap_statistics, 'cluster_count': k})
    # build the result frame from the collected rows
    result_df = pd.DataFrame(rows, columns=['gap', 'cluster_count'])
    return original_inertia, reference_inertia, gap_arr.argmax() + 1, result_df
Let’s plot k vs. gap values.
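import matplotlib.pyplot as plt

# gap_df and optimal_k are assumed to be the result_df and optimal k returned by optimalK
fig = plt.figure(figsize=(8, 6))
plt.plot(gap_df.cluster_count, gap_df.gap, linewidth=3)
# highlight the optimal k with a large red dot
best = gap_df[gap_df.cluster_count == optimal_k]
plt.scatter(best.cluster_count, best.gap, s=250, c='r')
plt.grid()
plt.show()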
We get the curve below.
Looking at the above curve, we can say that the optimal value of k is 15.
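With k chosen, the clustering step itself can be sketched as follows. The frequency table is a made-up stand-in for the real one, and k = 2 is used only because the toy table has four rows. The n_init and random_state settings are assumptions added to make the sketch reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up frequency table: rows are neighborhoods, columns are venue categories.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Bar": [0.9, 0.8, 0.1, 0.0],
    "Museum": [0.1, 0.2, 0.9, 1.0],
})

k = 2  # the text uses k = 15; 2 is enough for this toy table
features = grouped.drop(columns="Neighborhood")
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# Attach the cluster label to each neighborhood for later mapping.
clustered = grouped.assign(Cluster=kmeans.labels_)
print(clustered[["Neighborhood", "Cluster"]])
```

The Cluster column is what a map would use to pick one marker colour per cluster.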
Once we are done with clustering, we will visualize these clusters using a Folium map.
Looking at the above Folium map, we can see that the blue dots are more compact and more numerous. Looking at the labels, we can say that cluster-1 has a greater number of venues, which can help migrants decide where to rent an apartment.
Let’s examine cluster-1 in detail.
In the above dataset, we can see that the most common venue in cluster-1 is Train Station, followed by Mountain, Museum, Music Store etc. Based on these venues, we can suggest that tourists visit these places in comfort.
If someone is fond of French food, then the neighborhoods in cluster-2 as well as cluster-5 would be the best places to visit. We can verify this below.
Venues in cluster-2 are not limited to these. If someone is a foodie and wants to explore different kinds of cuisine, cluster-2 could be the best choice.
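The "most common venue" tables examined above can be derived from the frequency table roughly as follows. This is a sketch under assumptions: the frequencies are made up, and the helper name top_venues and its output column names are hypothetical.

```python
import pandas as pd

# Made-up frequency table: one row per neighborhood, one column per venue category.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Train Station": [0.5, 0.1],
    "Museum": [0.3, 0.2],
    "French Restaurant": [0.2, 0.7],
})

def top_venues(freq_df, n=2):
    """Return the n most frequent venue categories for each neighborhood."""
    rows = []
    for _, row in freq_df.iterrows():
        ranked = row.drop("Neighborhood").astype(float).sort_values(ascending=False)
        rows.append([row["Neighborhood"], *ranked.index[:n]])
    columns = ["Neighborhood"] + [f"Most common venue {i + 1}" for i in range(n)]
    return pd.DataFrame(rows, columns=columns)

top = top_venues(grouped)
print(top)
```

Joining this table with the cluster labels and filtering on one label gives a per-cluster view like the ones discussed above.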
5. Conclusion
This analysis has been performed on a legally obtained, public dataset. These two cities are not the only places in France to visit or to rent apartments in, and there are many other such destinations. Our main idea, however, was to highlight the venues available within these two cities, as they are centers of tourism. There are many kinds of venues within these cities, such as bookstores, bars, music stores, supermarkets and clothing stores, which can be an added benefit for tourists. One might look at these two cities from a different perspective, since the choice depends entirely on one’s interests. Some tourists might be interested in outdoor recreation, while others might be looking for suitable places to open a restaurant, shopping mall, bookstore, clothing store, pastry shop etc. Our model can help these people decide according to their interests.
For the complete project, check out my notebook.