People Map of India: Part 1 (Scraper)

5 minute read

I was going through The Pudding yesterday and came across their “People Map of the UK” and “People Map of the US” posts. They seemed pretty cool, and since I’ve been trying to pick up making visualizations with JavaScript, this looked like the perfect thing to try to replicate. So I decided to do the same for India.

Before getting to the visualization there is the data collection, so I decided to follow their methodology of using the “People by City” category listings on Wikipedia. However, these lists aren’t exhaustive: there are some additional list pages, but they were a bit too hard to scrape, so for the sake of brevity I decided not to try. I’ve left a link to the collection of those pages below; let me know if you see a good way to scrape the names out of them!

Importing Dependencies

import requests
import pandas as pd
import bs4
import wikipedia
from time import sleep
import pageviewapi.period
from tqdm import tqdm

Retry Function

Let’s define a function that retries a request, sleeping for exponentially increasing intervals so that we don’t run into a “Too Many Requests” error.

def retrieve(url):
    page = ''
    l = 0
    while page == '':
        try:
            page = requests.get(url)
            break
        except requests.exceptions.RequestException:
            # Back off exponentially before retrying
            sleep(pow(2, l))
            l += 1
            continue
    return page
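
A quick sanity check, just to confirm the helper hands back a normal requests response (the URL here is only an example):

test_page = retrieve("https://en.wikipedia.org/wiki/India")
print(test_page.status_code)  # 200 if the request went through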

Number of Views Function

I used the pageviewapi package in Python to collect each article’s total views over the past five years. Again, I put in an exponentially increasing sleep in case we get an error for making too many calls.

def viewj(person):
    r = ''
    l = 0

    while r == '':
        try:
            r = pageviewapi.period.sum_last('en.wikipedia', person, last=365*5,
                                            access='all-access', agent='all-agents')
            break
        except:
            print("Let me sleep for", pow(2, l), "seconds")
            print("ZZzzzz...")
            sleep(pow(2, l))
            l += 1
            if l > 10:
                r = 0
                print(person, "views error")
                break
            continue

    return r
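
To check that it behaves, you can call it on a single article title (the exact count depends on when you run it):

print(viewj("Mahatma_Gandhi"))  # total en.wikipedia views for the article over the last ~5 years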
            

Scraping From Wikipedia

Using BeautifulSoup and the requests module, I scraped the people listed under each city’s category and kept only the person with the highest number of views for each city.

central_url = "https://en.wikipedia.org/wiki/Category:People_by_city_in_India"
s = requests.get(central_url)
data = pd.DataFrame(columns = ['City', "Person", 'Number of Views', 'Description', 'link'])
soup = bs4.BeautifulSoup(s.content, 'lxml')
raw = soup.find_all('div', {'class':"mw-category-group"})
c = [] #Array of cities
p = []  #Array of most viewed people corresponding to those cities
v = []  #Array of the views those people got
d = [] #Descriptions of the people
miss = []
err = []
li = [] #Links to their profiles
for i in tqdm(range(2,len(raw))): #First two groups are redirects to other categorizations
                    #We basically just want to deal with the letters
        l = raw[i].find_all('a') #Get links within the letter categories
        for j in l:
            categ_link = "https://en.wikipedia.org"+j['href']       #Get link
            city = j['title'][21:] #Get rid of the "Category: People from " bit
            t = retrieve(categ_link)
            tsoup = bs4.BeautifulSoup(t.content, 'lxml')
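            # Category pages that group people alphabetically have 'mw-category-group'
            # divs; if none are found, r[0] below raises an IndexError and the except
            # branch falls back to scraping links from the 'mw-content-ltr' body instead.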
            try:
                r = tsoup.find_all('div', {'class':"mw-category-group"})
                select = r[0].find('a')['title']
                max_v = viewj(select)

                for k in r:
                    person = k.find('a')['title']
                    if viewj(person) > max_v:
                        select = person
                        max_v = viewj(person)
                        people_link = "https://en.wikipedia.org"+k.find('a')['href']
                        try:
                            description = wikipedia.summary(select)
                        except:
                            description = "Error"
                
                miss.append(select)
                views = viewj(select)
                c.append(city)
                v.append(max_v)
                p.append(select)
                d.append(description)
                li.append(people_link)
                #print(city, person, views)
            except:
                r = tsoup.find_all('div', {'class':"mw-content-ltr"})
                r = r[len(r)-1].find_all('a')
                select = r[0]['title']
                max_v = 0

                for k in r:
                    person = k['title']
                    if viewj(person) > max_v:
                        select = person
                        max_v = viewj(person)
                        people_link = "https://en.wikipedia.org"+k['href']
                        views = viewj(select)
                        try:
                            description = wikipedia.summary(select)
                        except:
                            description = "Error"
                miss.append(select)
                views = viewj(select)
                c.append(city)
                v.append(max_v)
                p.append(select)
                d.append(description)
                li.append(people_link)

               # print(city, person, views)
sleep(120)              
                
                
data['City'] = c
data['Person'] = p
data['Number of Views'] = v
data['Description'] = d
data['link'] = li
data.head()
|   | City | Person | Number of Views | Description | link |
|---|------|--------|-----------------|-------------|------|
| 0 | Agartala | Dipa Karmakar | 1345973 | Dipa Karmakar (born 9 August 1993) is an India... | https://en.wikipedia.org/wiki/Dipa_Karmakar |
| 1 | Agra | Mumtaz Mahal | 2758109 | Mumtaz Mahal (Persian: ممتاز محل [mumˈt̪aːz mɛ... | https://en.wikipedia.org/wiki/Mumtaz_Mahal |
| 2 | Ahmedabad | Abdul Latif (criminal) | 2087052 | Abdul Latif was an underworld figure in Gujara... | https://en.wikipedia.org/wiki/Abdul_Latif_(cri... |
| 3 | Ahmednagar | Sadashiv Amrapurkar | 925429 | Abdul Latif was an underworld figure in Gujara... | https://en.wikipedia.org/wiki/Abdul_Latif_(cri... |
| 4 | Aizawl | Zohmingliana Ralte | 24268 | Zohmingliana Ralte (born 2 October 1990) is an... | https://en.wikipedia.org/wiki/Zohmingliana_Ralte |

There seems to be something weird going on: certain cells have picked up the Description and link values from the row above them. Since rerunning the scraper would take too damn long, let’s just patch those rows in place. Fixing them this way (rather than dropping duplicates) means that if a person is genuinely associated with two cities, their entries won’t be deleted.

dup = data.duplicated(subset='Description')
for i, row in data.iterrows():
    if dup[i]:
        data.at[i, "Description"] = wikipedia.summary(row["Person"])
        data.at[i, "link"] = wikipedia.page(row["Person"]).url

Scraping Left

The following link points to a page that lists 30 more pages, each of which is a list of people from a specific city. Some overlap with those extracted above and some don’t. I couldn’t figure out a nice, automated way of scraping them, so I’m not doing it for now; perhaps some other time when I feel like it. A rough starting point is sketched after the link below.

link = "https://en.wikipedia.org/wiki/Category:Lists_of_people_by_city_in_India"

City Coordinates

Now, let’s get coordinates for each of the cities using the geopy module.

import geopy as gp
from geopy.extra.rate_limiter import RateLimiter

locator = gp.Nominatim(user_agent="apoorv")
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)  # throttle Nominatim requests

data['location'] = data['City'].apply(geocode)
data['point'] = data['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# Split the point column into latitude, longitude and altitude columns
data[['latitude', 'longitude', 'altitude']] = pd.DataFrame(data['point'].tolist(), index=data.index)
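
Nominatim occasionally fails to resolve a place name, in which case location (and therefore point) ends up as None, so it’s worth checking how many rows came back empty before trusting the coordinates:

print(data['location'].isnull().sum())        # cities Nominatim couldn't resolve
print(data[data['location'].isnull()]['City'])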

Let’s see what we got.

data.head()
|   | City | Person | Number of Views | Description | link | location | point | latitude | longitude | altitude |
|---|------|--------|-----------------|-------------|------|----------|-------|----------|-----------|----------|
| 0 | Agartala | Dipa Karmakar | 1345973 | Dipa Karmakar (born 9 August 1993) is an India... | https://en.wikipedia.org/wiki/Dipa_Karmakar | (Agartala, Mohanpur, West Tripura, Tripura, 79... | (23.8312377, 91.2823821, 0.0) | 23.831238 | 91.282382 | 0.0 |
| 1 | Agra | Mumtaz Mahal | 2758109 | Mumtaz Mahal (Persian: ممتاز محل [mumˈt̪aːz mɛ... | https://en.wikipedia.org/wiki/Mumtaz_Mahal | (Agra, Uttar Pradesh, 280001, India, (27.17525... | (27.1752554, 78.0098161, 0.0) | 27.175255 | 78.009816 | 0.0 |
| 2 | Ahmedabad | Abdul Latif (criminal) | 2087052 | Abdul Latif was an underworld figure in Gujara... | https://en.wikipedia.org/wiki/Abdul_Latif_(cri... | (Ahmedabad, Ahmadabad City Taluka, Ahmedabad D... | (23.0216238, 72.5797068, 0.0) | 23.021624 | 72.579707 | 0.0 |
| 3 | Ahmednagar | Sadashiv Amrapurkar | 925429 | Sadashiv Dattaray Amrapurkar (11 May 1950 – 3 ... | https://en.wikipedia.org/wiki/Sadashiv_Amrapurkar | (Ahmednagar, Maharashtra, India, (19.162772500... | (19.162772500000003, 74.85802430085195, 0.0) | 19.162773 | 74.858024 | 0.0 |
| 4 | Aizawl | Zohmingliana Ralte | 24268 | Zohmingliana Ralte (born 2 October 1990) is an... | https://en.wikipedia.org/wiki/Zohmingliana_Ralte | (Aizawl, Tlangnuam, Aizwal, Mizoram, 796190, I... | (23.7414092, 92.7209297, 0.0) | 23.741409 | 92.720930 | 0.0 |

Finally, remove the unnecessary point, location and altitude columns and export.

data = data.drop(columns = ['point', 'location', 'altitude'])
data.to_csv("F:/Playin/PeopleMap.csv")
