Targeted Marketing Campaign Prediction & User Behavior Analysis using K-Means Clustering & Folium For Data Visualization

1. Introduction — Business Understanding

1.1 Background

Marketing has evolved considerably. Starting in the 1900s on both radio and television, the marketing focus was on “selling”. The Golden Age of Advertising introduced such ads as “Uncle Sam Wants You for the Army” and “Eat Your Wheaties”. Marketing then became more personalized, with a focus on brand awareness and problem solving. The digital ad revolution followed, beginning with online advertising in the 1990s and mobile ads in 2000. Today a plethora of data is collected daily about users, and harnessing it makes it possible to produce more targeted, personalized ad campaigns that improve customer experience and generate revenue.

1.2 Business Problem

An employee at Insights, LLC, a fictitious big data marketing company, has been tasked with helping a customer determine an ideal marketing campaign in San Francisco to increase revenue and customer satisfaction.

1.3 Interest

Insights, LLC has a customer who would like to create more personalized ad campaigns for its target customer segments. Given the plethora of data collected on its customers and the speed at which it is collected, the customer wants to harness that data for either a) an increase in revenue via new products or services, or b) identification of user behavior behind both positive and negative trends in customer satisfaction. I am using the data science methodology to solve this business problem.

2. Data Science Methodology

2.1 Data Requirements — Data Tooling, Sources of Collection & Cleansing/Pre-processing

The tooling I will be using is the Python language (for data cleansing, data manipulation, data modeling, data analytics and visualization) and a Jupyter notebook within Watson Studio for sharing code and data analysis, pushed to GitHub for source control.

Data sources:

  • Web scrape: neighborhood data for the various cities & population
  • Nominatim: retrieval of latitude and longitude of the neighborhoods for neighborhood segmentation via clustering
  • Foursquare Places API: venue and rating data for these neighborhoods
  • Foursquare check-ins/Cities/POI CSV file(s): to show frequently checked-in venues & their cities
  • Kaggle datasets CSV: SF crime data

Features:

  • venue id: unique id for the venue
  • venue category: type of venue
  • rating of venue: indicator of how successful or good the venue is
  • crime description: type of crime (violent or non-violent)
  • venues nearby a specific neighborhood
  • venues most frequented per neighborhood
  • population of neighborhood: count of people that reside within a given neighborhood
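
As a rough illustration of the Nominatim step, the sketch below fills latitude and longitude columns for a neighborhood dataframe. It assumes the geopy package as the Nominatim client; the column name `neighborhood`, the `user_agent` string, the default city suffix, and both function names are hypothetical choices for this example, not details taken from the project notebook.

```python
from typing import Callable, Optional, Tuple

import pandas as pd

# A geocoder here is any callable mapping a query string to (lat, lon),
# or None when nothing is found. This signature is an assumption of the sketch.
Geocoder = Callable[[str], Optional[Tuple[float, float]]]


def make_nominatim_geocoder(user_agent: str = "sf-neighborhoods-demo") -> Geocoder:
    """Build a geocoder backed by Nominatim via geopy (needs network access)."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    locator = Nominatim(user_agent=user_agent)
    # Space out requests so the public Nominatim service is not flooded.
    lookup = RateLimiter(locator.geocode, min_delay_seconds=1)

    def geocode(query: str) -> Optional[Tuple[float, float]]:
        location = lookup(query)
        if location is None:
            return None
        return (location.latitude, location.longitude)

    return geocode


def add_coordinates(df: pd.DataFrame, geocode: Geocoder,
                    city: str = "San Francisco, CA") -> pd.DataFrame:
    """Return a copy of df with 'lat' and 'long' filled in per neighborhood."""
    out = df.copy()
    lats, longs = [], []
    for name in out["neighborhood"]:
        coords = geocode(f"{name}, {city}")
        # Neighborhoods that cannot be geocoded are left as missing values.
        lats.append(coords[0] if coords else None)
        longs.append(coords[1] if coords else None)
    out["lat"] = lats
    out["long"] = longs
    return out
```

With network access, `add_coordinates(neighborhoods, make_nominatim_geocoder())` would populate the two columns. Passing the geocoder in as a parameter also makes it easy to swap in a lookup table when working offline.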

2.2 Exploratory Data Analysis & Machine Learning Methods

Within the dataframes, I used Python to remove missing data, duplicates, anomalies, and corrupted records. I created two columns in the neighborhood dataframe, named lat and long, and used a loop to populate them from the Foursquare API. I also renamed columns and merged the neighborhood and crime dataframes to create a map for exploration.
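
A minimal pandas sketch of this kind of cleansing and merging might look like the following. The column names (`Neighborhood`, `Descript`, `neighborhood`, `description`) and the function names are placeholders assumed for illustration; the real datasets may use different ones.

```python
from typing import Dict, Optional, Sequence

import pandas as pd


def clean(df: pd.DataFrame, required: Sequence[str]) -> pd.DataFrame:
    """Drop exact duplicate rows and rows missing any required column."""
    return (
        df.drop_duplicates()
          .dropna(subset=list(required))
          .reset_index(drop=True)
    )


def merge_neighborhood_crime(
    neighborhoods: pd.DataFrame,
    crime: pd.DataFrame,
    crime_columns: Optional[Dict[str, str]] = None,
) -> pd.DataFrame:
    """Attach neighborhood coordinates to each crime record."""
    # Hypothetical mapping from raw crime column names to tidy ones.
    if crime_columns is None:
        crime_columns = {"Neighborhood": "neighborhood",
                         "Descript": "description"}
    renamed = crime.rename(columns=crime_columns)
    # A left join keeps every crime row, even if its neighborhood is unknown.
    return renamed.merge(neighborhoods, on="neighborhood", how="left")
```

The left join is a deliberate choice here: crime rows whose neighborhood has no coordinates stay in the result with missing `lat` and `long`, so they can be inspected rather than silently dropped.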

2.3 Results

[Figure: Neighborhood Clusters]
[Figure: San Francisco Crime Near Castro Neighborhood]
[Figure: Similar Venue Data Clustering]

2.4 Discussion

The recommendation would be to avoid a bar or nightclub, as there is potential for violent crime. Instead, the recommendation is to target a marketing campaign for a new restaurant located in the Market & Castro and Van Ness areas. Observing the check-in data specifically for San Francisco, the most frequently checked-in restaurant is a Pizza Bar, which gives us a potential venue category and marketing campaign to explore. Lastly, looking at the rating data, a burger place has the lowest rating and an oyster bar is the highest-rated restaurant venue, which suggests potential for improving an existing venue.
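
Observations like the most checked-in category and the lowest and highest rated venues can be derived with simple pandas aggregations. The sketch below assumes hypothetical column names (`city`, `venue_category`, `rating`) and function names; it illustrates the idea and is not the notebook's actual code.

```python
from typing import Tuple

import pandas as pd


def top_checkin_category(checkins: pd.DataFrame, city: str) -> str:
    """Most frequently checked-in venue category for one city."""
    in_city = checkins[checkins["city"] == city]
    # value_counts tallies check-ins per category; idxmax picks the largest.
    return in_city["venue_category"].value_counts().idxmax()


def rating_extremes(venues: pd.DataFrame) -> Tuple[str, str]:
    """Return (lowest-rated, highest-rated) categories by mean rating."""
    mean_rating = venues.groupby("venue_category")["rating"].mean()
    return mean_rating.idxmin(), mean_rating.idxmax()
```

Both helpers raise an error on empty input (for example, a city with no check-ins), so a real pipeline would want to check for that case first.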

2.5 Limitations & Expansion of the Project

The crime data was older (stale). Optimally, you would stream real-time data, but the older data was sufficient to illustrate concepts in the data science methodology. Also, to avoid rate limiting, I manually entered rating data from the Foursquare API into the pandas dataframe.
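
One possible alternative to manual entry, sketched here as an expansion idea rather than what this project did, is to throttle requests and cache each rating on disk so that no venue is requested twice. The `fetch_rating` callable stands in for whatever function actually calls the ratings API; it, the cache file format, and the delay value are all assumptions of this example.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable


def fetch_ratings(
    venue_ids: Iterable[str],
    fetch_rating: Callable[[str], float],
    cache_path: str,
    min_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, float]:
    """Fetch a rating per venue id, reusing a JSON cache between runs."""
    ids = list(venue_ids)  # venue ids are assumed to be strings
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    for venue_id in ids:
        if venue_id in cache:
            continue  # fetched in an earlier run, no request needed
        cache[venue_id] = fetch_rating(venue_id)
        # Persist after every call so an interrupted run loses nothing.
        path.write_text(json.dumps(cache))
        sleep(min_delay)  # pause between real requests
    return {venue_id: cache[venue_id] for venue_id in ids}
```

Because the cache is written after each successful call, rerunning the notebook after hitting a quota only requests the venues that are still missing.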

2.6 Conclusion

We have identified a business problem, gathered and cleansed the data, and used k-means clustering and visualization to explore the data, make assumptions, and provide recommendations.

2.7 References

List of Neighborhoods in San Francisco:



Technical Customer Success Architect of Data Science, Data, AI,Cloud Technologies. Love learning all things new and exciting.

Brandy Guillory
