People Map of India: Part 1 (Scraper)
I was going through The Pudding yesterday and came across their “People Map of the UK” and “People Map of the US” posts. They seemed pretty cool, and since I’ve been trying to pick up making visualizations with JavaScript, this felt like the perfect thing to try to replicate. So I decided to do the same for India.
Before getting to the visualization comes data collection, so I decided to follow their methodology of using the “People by City” listings on Wikipedia. However, these lists aren’t exhaustive: there are a few more list pages, but those were a bit too hard to scrape, so for the sake of brevity I decided not to attempt them. I’ve left a link to the collection of those pages below; let me know if you see a good way to scrape the names from them!
Importing Dependencies
import requests
import pandas as pd
import bs4
import wikipedia
from time import sleep
import pageviewapi.period
from tqdm import tqdm
Retry Function
Let’s define a function that retries a request, putting the script to sleep for exponentially increasing intervals between attempts, so that we don’t hit a “Too Many Requests” error.
def retrieve(url):
    """Fetch a URL, retrying with exponential backoff if the request fails."""
    page = ''
    l = 0
    while page == '':
        try:
            page = requests.get(url)
            break
        except:
            # print("Connection refused by the server..Let me sleep for", pow(2,l), "seconds")
            sleep(pow(2, l))
            l += 1  # increase the backoff exponent on every failure
            # print("Was a nice sleep, now let me continue...")
            continue
    return page
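To illustrate (this call isn’t part of the original run), fetching the central category page through the helper and checking the status code looks something like this:

# Quick illustration (not in the original run): fetch a page via the retry helper
resp = retrieve("https://en.wikipedia.org/wiki/Category:People_by_city_in_India")
print(resp.status_code)  # expect 200 once the request eventually succeeds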
Number of Views Function
I used the pageviewapi module in Python to collect each person’s total page views over the past five years. Again, I put in an exponentially increasing sleep in case we get an error for making too many calls.
def viewj(person):
    """Return the total page views for a Wikipedia article over the last five years."""
    r = ''
    l = 0
    while r == '':
        try:
            r = pageviewapi.period.sum_last('en.wikipedia', person, last=365*5,
                                            access='all-access', agent='all-agents')
            break
        except:
            print("Let me sleep for", pow(2, l), "seconds")
            print("ZZzzzz...")
            sleep(pow(2, l))
            l += 1
            if l > 10:
                r = 0  # give up after ten retries
                print(person, "views error")
                break
            # print("Was a nice sleep, now let me continue...")
            continue
    return r
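For example, calling it on someone who shows up later in the results returns a single integer, the total views across the window (the exact number will vary with when you run it):

# Example call (not in the original run): total views over the last five years for one article.
# The figure changes over time; in my run Dipa Karmakar came to roughly 1.35 million views.
print(viewj("Dipa Karmakar"))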
Scraping From Wikipedia
Using BeautifulSoup and the requests module, I scraped the people listed under each city category and kept only the person with the highest number of views for each city.
central_url = "https://en.wikipedia.org/wiki/Category:People_by_city_in_India"
s = requests.get(central_url)
data = pd.DataFrame(columns = ['City', "Person", 'Number of Views', 'Description', 'link'])
soup = bs4.BeautifulSoup(s.content, 'lxml')
raw = soup.find_all('div', {'class':"mw-category-group"})
c = [] # Array of cities
p = [] # Array of the most viewed person for each city
v = [] # Array of the views those people got
d = [] # Descriptions of the people
miss = [] # People whose Wikipedia summary could not be fetched
err = []
li = [] # Links to their profiles
for i in tqdm(range(2, len(raw))): # First two links are redirects to other categorizations;
    # we basically just want to deal with the letter groups
    l = raw[i].find_all('a') # Get links within the letter categories
    for j in l:
        categ_link = "https://en.wikipedia.org" + j['href'] # Get link
        city = j['title'][21:] # Get rid of the "Category: People from " bit
        t = retrieve(categ_link)
        tsoup = bs4.BeautifulSoup(t.content, 'lxml')
        try:
            r = tsoup.find_all('div', {'class': "mw-category-group"})
            select = r[0].find('a')['title']
            max_v = viewj(select)
            for k in r:
                person = k.find('a')['title']
                if viewj(person) > max_v:
                    select = person
                    max_v = viewj(person)
                    people_link = "https://en.wikipedia.org" + k.find('a')['href']
                    try:
                        description = wikipedia.summary(select)
                    except:
                        description = "Error"
                        miss.append(select)
            views = viewj(select)
            c.append(city)
            v.append(max_v)
            p.append(select)
            d.append(description)
            li.append(people_link)
            # print(city, person, views)
        except:
            r = tsoup.find_all('div', {'class': "mw-content-ltr"})
            r = r[len(r) - 1].find_all('a')
            select = r[0]['title']
            max_v = 0
            for k in r:
                person = k['title']
                if viewj(person) > max_v:
                    select = person
                    max_v = viewj(person)
                    people_link = "https://en.wikipedia.org" + k['href']
            views = viewj(select)
            try:
                description = wikipedia.summary(select)
            except:
                description = "Error"
                miss.append(select)
            views = viewj(select)
            c.append(city)
            v.append(max_v)
            p.append(select)
            d.append(description)
            li.append(people_link)
            # print(city, person, views)
    sleep(120)
data['City'] = c
data['Person'] = p
data['Number of Views'] = v
data['Description'] = d
data['link'] = li
data.head()
| | City | Person | Number of Views | Description | link |
|---|---|---|---|---|---|
| 0 | Agartala | Dipa Karmakar | 1345973 | Dipa Karmakar (born 9 August 1993) is an India... | https://en.wikipedia.org/wiki/Dipa_Karmakar |
| 1 | Agra | Mumtaz Mahal | 2758109 | Mumtaz Mahal (Persian: ممتاز محل [mumˈt̪aːz mɛ... | https://en.wikipedia.org/wiki/Mumtaz_Mahal |
| 2 | Ahmedabad | Abdul Latif (criminal) | 2087052 | Abdul Latif was an underworld figure in Gujara... | https://en.wikipedia.org/wiki/Abdul_Latif_(cri... |
| 3 | Ahmednagar | Sadashiv Amrapurkar | 925429 | Abdul Latif was an underworld figure in Gujara... | https://en.wikipedia.org/wiki/Abdul_Latif_(cri... |
| 4 | Aizawl | Zohmingliana Ralte | 24268 | Zohmingliana Ralte (born 2 October 1990) is an... | https://en.wikipedia.org/wiki/Zohmingliana_Ralte |
There seems to be something weird going on: certain cells picked up the Description and link values from the row above them. Let’s just patch those cells in place, since rerunning the scraper would take too damn long. Fixing rows instead of dropping duplicates also means that if a person is genuinely associated with two cities, their entries won’t be deleted.
y = [] # Indices of rows whose Description duplicates an earlier row
for i in range(len(data.duplicated(subset='Description'))):
    if data.duplicated(subset='Description')[i] == True:
        y.append(i)
for i, j in data.iterrows():
    if i in y:
        data.at[i, "Description"] = wikipedia.summary(j[1]) # j[1] is the Person column
        data.at[i, "link"] = wikipedia.page(j[1]).url
Scraping Left
The following link points to a category containing 30 more pages, each of which is a list of people from a specific city. Some overlap with the categories scraped above and some don’t. I couldn’t figure out a nice, automated way of scraping them, so I’m skipping them for now. Perhaps some other time when I feel like it.
link = "https://en.wikipedia.org/wiki/Category:Lists_of_people_by_city_in_India"
City Coordinates
Now, let’s get coordinates for each of the cities using the geopy module.
import geopy as gp
from geopy.extra.rate_limiter import RateLimiter
lats = []
longs = []
locator = gp.Nominatim(user_agent="apoorv")
geocode = RateLimiter(locator.geocode, min_delay_seconds=1) # respect Nominatim's rate limit
data['location'] = data['City'].apply(geocode)
data['point'] = data['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# Split the point column into latitude, longitude and altitude columns
data[['latitude', 'longitude', 'altitude']] = pd.DataFrame(data['point'].tolist(), index=data.index)
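To see what a single geocoded result looks like before trusting the whole column, here’s a minimal check (the exact address string Nominatim returns may change over time):

# Minimal check (not in the original run): geocode one city and look at its coordinate tuple
loc = geocode("Agra")
print(tuple(loc.point)) # (latitude, longitude, altitude), roughly (27.1752554, 78.0098161, 0.0) in my run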
Let’s see what we got.
data.head()
| | City | Person | Number of Views | Description | link | location | point | latitude | longitude | altitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agartala | Dipa Karmakar | 1345973 | Dipa Karmakar (born 9 August 1993) is an India... | https://en.wikipedia.org/wiki/Dipa_Karmakar | (Agartala, Mohanpur, West Tripura, Tripura, 79... | (23.8312377, 91.2823821, 0.0) | 23.831238 | 91.282382 | 0.0 |
| 1 | Agra | Mumtaz Mahal | 2758109 | Mumtaz Mahal (Persian: ممتاز محل [mumˈt̪aːz mɛ... | https://en.wikipedia.org/wiki/Mumtaz_Mahal | (Agra, Uttar Pradesh, 280001, India, (27.17525... | (27.1752554, 78.0098161, 0.0) | 27.175255 | 78.009816 | 0.0 |
| 2 | Ahmedabad | Abdul Latif (criminal) | 2087052 | Abdul Latif was an underworld figure in Gujara... | https://en.wikipedia.org/wiki/Abdul_Latif_(cri... | (Ahmedabad, Ahmadabad City Taluka, Ahmedabad D... | (23.0216238, 72.5797068, 0.0) | 23.021624 | 72.579707 | 0.0 |
| 3 | Ahmednagar | Sadashiv Amrapurkar | 925429 | Sadashiv Dattaray Amrapurkar (11 May 1950 – 3 ... | https://en.wikipedia.org/wiki/Sadashiv_Amrapurkar | (Ahmednagar, Maharashtra, India, (19.162772500... | (19.162772500000003, 74.85802430085195, 0.0) | 19.162773 | 74.858024 | 0.0 |
| 4 | Aizawl | Zohmingliana Ralte | 24268 | Zohmingliana Ralte (born 2 October 1990) is an... | https://en.wikipedia.org/wiki/Zohmingliana_Ralte | (Aizawl, Tlangnuam, Aizwal, Mizoram, 796190, I... | (23.7414092, 92.7209297, 0.0) | 23.741409 | 92.720930 | 0.0 |
Finally, drop the now-unnecessary location, point and altitude columns and export the data.
data = data.drop(columns = ['point', 'location', 'altitude'])
data.to_csv("F:/Playin/PeopleMap.csv")