August 4, 2019

How to Scrape flippa.com with Scrapy

In our previous tutorials, we discussed how important scraping is in today's information-centred world. It is one of the easiest ways to gather bulk data from a website. We also discussed how Scrapy compares with other scraping frameworks and how much it simplifies scraping work. In this tutorial, you will learn to face new challenges in scraping.

Today we are discussing how to scrape flippa.com. flippa.com is a website where you can buy and sell websites and Android apps. Listings are put up for auction so that interested parties can bid on them. In this tutorial, we will scrape the Android apps that are for sale.

There could be many reasons why you want to scrape a site like flippa.com. Analysis is easier when all the data is in one place. Also, with scraping, you can store data in formats like JSON and CSV, so it's easier to do computations on it. For example, if you want to find the app with the lowest price and the highest number of installs, you can easily find it in the scraped data.

As usual, let's start with installing Scrapy. There are two popular ways to install it: pip and Anaconda. Installing Scrapy through pip is easier, but it can sometimes install an older version of Scrapy, so we recommend using Anaconda.

We will cover both ways.

Installing Through pip

First, make sure you have a recent Python version installed on your machine.

python -V

If you don't have pip installed on your computer, install it first. If you already have pip, update it with the following command.

python -m pip install --upgrade pip

Install Scrapy with one command.

pip install Scrapy

Installing Through Anaconda 

Sometimes people assume Anaconda and pip do the same job. This isn't true; they serve different purposes. pip is a tool for installing packages from the Python Package Index (PyPI). Anaconda is much more than that: it is a cross-platform Python distribution that ships with its own package and environment manager, conda. conda is similar to pip, but it installs packages from the Anaconda repository and from channels such as conda-forge. Anaconda now includes pip as well.

First, download and install Anaconda. You can then install Scrapy with conda using the following command.

conda install -c conda-forge scrapy
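Whichever installer you used, it can help to confirm that Scrapy is visible to the Python interpreter you plan to run. The following is a minimal sketch that uses only the standard library (Python 3.8 or later); the helper name scrapy_version is made up for this example.

```python
from importlib import metadata


def scrapy_version():
    """Return the installed Scrapy version string, or None if it is missing."""
    try:
        return metadata.version("Scrapy")
    except metadata.PackageNotFoundError:
        return None


version = scrapy_version()
if version:
    print(f"Scrapy {version} is installed")
else:
    print("Scrapy is not installed in this environment")
```

If the second message appears after a conda install, the usual cause is that a different environment is active than the one Scrapy was installed into.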

Next, we need to create a project for our application. Open your cmd console and go to the location where you want to create the project.

cd <Path-to-your-project-dir>

Scrapy has its own command to create a Scrapy project. It will create the initial files required for the Scrapy project. It takes the following form.

scrapy startproject project_name

I will name my project scraping_flippa, so I type the following command to create it.

scrapy startproject scraping_flippa

It will create a folder with the project name. Let's see what each file in that folder is for.

├── scrapy.cfg              # deploy configuration file
└── scraping_flippa         # the project's Python module; you import your code from here
    ├── __init__.py         # marks the directory as a Python package
    ├── items.py            # definitions of the scraped items
    ├── middlewares.py      # spider and downloader middlewares
    ├── pipelines.py        # item pipelines of the project
    ├── settings.py         # project settings
    └── spiders             # directory that holds the spiders
        └── __init__.py
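The spider in this tutorial yields plain dictionaries, so items.py stays unused. If you prefer a declared schema, one option is a dataclass, which Scrapy 2.2 and later accept as an item type. The sketch below is only an illustration: the class name FlippaAppItem, its field names and the sample URL are assumptions, not part of the generated project.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FlippaAppItem:
    """Hypothetical item describing one app listing."""
    app_name: str
    url: str
    installs: Optional[str] = None
    current_price: Optional[str] = None
    buy_now_price: Optional[str] = None


# Placeholder values, not a real listing
item = FlippaAppItem(app_name="Example App", url="https://www.flippa.com/0000000")
print(asdict(item))
```

Declaring the fields up front makes a misspelled field name fail immediately instead of silently creating a new column in the output.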

Now it's time to create our spider. The code that does the actual scraping lives in the spider file. We will discuss the code in more detail later. For now, we will just create the file.

Go to your project folder

cd scraping_flippa

The genspider command is used to create spiders. It takes two arguments: the spider name and the start URL.

scrapy genspider spider_name start_url

I'll name my spider flippaSpider and use https://www.flippa.com/search?filter[property_type]=android_app as the start URL, so the command I should type is

scrapy genspider flippaSpider https://www.flippa.com/search?filter[property_type]=android_app

Let’s examine the spider we just created. If you open the file, you can see the following code in it. 

# -*- coding: utf-8 -*-
import scrapy


class FlippaspiderSpider(scrapy.Spider):
    name = 'flippaSpider'
    allowed_domains = ['https://www.flippa.com/search?filter[property_type]=android_app']
    start_urls = ['http://https://www.flippa.com/search?filter[property_type]=android_app/']

    def parse(self, response):
        pass

Scrapy has already created stub code for us from the information given in the command. It has filled in the class name, spider name, allowed_domains and start_urls according to its naming convention.

The generated values need two fixes. allowed_domains expects domain names, not full URLs, so change it to www.flippa.com. In start_urls, remove the duplicated http:// prefix and the trailing slash.

allowed_domains = ['www.flippa.com']
start_urls = ['https://www.flippa.com/search?filter[property_type]=android_app']
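If you are unsure which part of a URL counts as the domain, the standard library can split it for you. This is a small sketch, not something Scrapy requires; it only shows where the value for allowed_domains comes from.

```python
from urllib.parse import urlsplit

start_url = "https://www.flippa.com/search?filter[property_type]=android_app"
parts = urlsplit(start_url)

# netloc is the host part, which is what allowed_domains expects
allowed_domain = parts.netloc
print(allowed_domain)  # www.flippa.com
print(parts.path)      # /search
```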

Writing Scraping Code

The parse() method is where we write the logic to scrape the content. The response object is available there. Before writing the scraping logic, we need to examine the flippa.com website closely, because scraping code depends on the structure and contents of the web page.

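We want to extract the following details about each app from Flippa: app name, number of installs, app age, rating, app store price, net profit and reskin status.

Apart from that, I'll scrape the Buy Now price, the current price and the link to the app's sale page.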

flippa.com is a listing website. That means if you open the start URL in your browser, you see a list of Android apps for sale.

We have to follow each link to get the information we want. In addition, the start page only shows 50 Android apps. To get all the apps, we need to go through the pagination at the bottom of the page. So let's see how we are going to do that.

We can extract the set of links using the class name Basic___linkWrapper.

urls = response.xpath('//a[@class="Basic___linkWrapper"]/@href').extract()

Also, we can extract buy now price from the listing page.

buynw_prices = response.xpath('//div[@class="Basic___buyItNowCol grid__col-3 grid__col-md-2"]/text()').extract()

Next, we create a Request for each link. With Scrapy, you can do it as follows.

for (url,buynw) in zip(urls,buynw_prices):
    url = response.urljoin(url)
    yield scrapy.Request(url = url, callback =self.parse_items, dont_filter=True, meta={'url':url,'buynw':buynw})

When the response to each request arrives, the callback function is called with it. If we want to pass extra data to the callback function, we do it with the meta parameter and read it back from response.meta.
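The loop above relies on two standard pieces: zip() to pair each link with its price, and URL joining to turn a relative link into an absolute one. The sketch below reproduces both steps with the standard library's urljoin as a stand-in for response.urljoin. The hrefs and prices are made-up sample values.

```python
from urllib.parse import urljoin

page_url = "https://www.flippa.com/search?filter[property_type]=android_app"

# Made-up sample values standing in for the two extract() results
hrefs = ["/1111111-example-app", "/2222222-another-app"]
buy_now_prices = ["$500", "$1,200"]

metas = []
for href, price in zip(hrefs, buy_now_prices):
    absolute = urljoin(page_url, href)  # resolve the relative link against the page URL
    metas.append({"url": absolute, "buynw": price})

for meta in metas:
    print(meta)
```

One caveat worth keeping in mind: zip() stops at the shorter list, so if some listings have no Buy Now price, the remaining links and prices can end up paired with the wrong neighbour.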

Our next task is to write the parse_items function.

def parse_items(self,response):
    app_names = response.xpath('//a[@class="ListingHero-propertyIdentifierLink"]/text()').extract()
    current_prices = response.xpath("//h2[contains(@class, 'ListingStatus-price')]/text()").extract()
    buynw_price = response.meta['buynw']
    url = response.meta['url']
    installs = response.xpath('//div[@id="number_of_installs"]/text()').extract()
    app_ages = response.xpath('//div[@id="app_age"]/text()').extract()
    reviews = response.xpath('//div[@class="Snapshot-subvalue"]/text()').extract()
    prices = response.xpath('//div[@id="app_store_price"]/text()').extract()
    profits = response.xpath('//div[@id="net_profit"]/text()').extract()
    reskins = response.xpath('//div[@id="reskin"]/text()').extract()
    for (app_name, install, app_age,review,price,profit,reskin,current_price) \
            in zip(app_names, installs, app_ages, reviews,prices,profits,reskins,current_prices):
        yield {'App Name':app_name.encode('utf-8').strip(),'Number of Installs': install.encode('utf-8').strip(),
               'App Age': app_age.encode('utf-8').strip(),
               'Rating': review.encode('utf-8').strip(), 'App Store Price': price.encode('utf-8').strip(),
               'Net Profit': profit.encode('utf-8').strip(),'Reskin': reskin.encode('utf-8').strip(),
               'Current Price':current_price.encode('utf-8').strip(),
               'URL':url,'Buy Now Price':buynw_price.encode('utf-8').strip()}

 

Although the code looks long, it does two things: it extracts the details of each app, and then it yields them as one record. With extract() alone, the values came back padded with whitespace and other unwanted characters. encode('utf-8') converts each value to UTF-8 bytes, and strip() removes the leading and trailing whitespace. On Python 3, calling strip() on the extracted string is usually enough.
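To see what the cleanup does, here is a small standalone sketch on a made-up value shaped like the padded text that extract() tends to return.

```python
# Made-up value with the kind of padding an HTML text node often carries
raw = "\n        Puzzle Game \n    "

text = raw.strip()                       # str with surrounding whitespace removed
as_bytes = raw.encode("utf-8").strip()   # same cleanup, but the result is bytes

print(repr(text))
print(repr(as_bytes))
print(as_bytes.decode("utf-8") == text)
```

The two results carry the same characters; the difference is only the type, str versus bytes.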

We have one more thing to do. We need to go through each pagination page. Let’s inspect the pagination source.

You can select the active page with the class name pagination__item pagination__item--active. We want Scrapy to follow the next sibling of that anchor tag. I did it with the following code.

next_page = response.xpath('//a[@class="pagination__item pagination__item--active"]/following-sibling::a[1]/@href').extract_first()
# extract_first() returns None when there is no next link (last page)
if next_page is not None:
    # response.follow resolves relative links against the current page
    yield response.follow(next_page, self.parse)
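To make the following-sibling::a[1] step concrete, the sketch below does the same walk by hand with the standard library's ElementTree instead of Scrapy's selectors. The pagination fragment is made up for the example and only mimics the class names used above. The function name next_page_href is also an assumption.

```python
import xml.etree.ElementTree as ET

ACTIVE = "pagination__item pagination__item--active"

# Made-up, well-formed fragment that mimics the pagination markup
SAMPLE = """
<div class="pagination">
  <a class="pagination__item" href="/search?page=1">1</a>
  <a class="pagination__item pagination__item--active" href="/search?page=2">2</a>
  <a class="pagination__item" href="/search?page=3">3</a>
</div>
"""


def next_page_href(fragment):
    """Return the href of the first <a> after the active one, or None."""
    root = ET.fromstring(fragment)
    links = [el for el in root if el.tag == "a"]
    for index, link in enumerate(links):
        if link.get("class") == ACTIVE:
            if index + 1 < len(links):
                return links[index + 1].get("href")
            return None  # the active page is the last one
    return None  # no active page marker found


print(next_page_href(SAMPLE))
```

Returning None on the last page is what lets the spider stop cleanly instead of failing on an empty result.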

 

Now our code is completed. Let’s see how the complete code looks. 

# -*- coding: utf-8 -*-
import scrapy


class FlippaspiderSpider(scrapy.Spider):
    name = 'flippaSpider'
    allowed_domains = ['www.flippa.com']
    start_urls = ['https://www.flippa.com/search?filter[property_type]=android_app']

    def parse(self, response):
        urls = response.xpath('//a[@class="Basic___linkWrapper"]/@href').extract()
        buynw_prices = response.xpath('//div[@class="Basic___buyItNowCol grid__col-3 grid__col-md-2"]/text()').extract()
        for (url,buynw) in zip(urls,buynw_prices):
            url = response.urljoin(url)
            yield scrapy.Request(url = url, callback =self.parse_items, dont_filter=True, meta={'url':url,'buynw':buynw})

        next_page = response.xpath('//a[@class="pagination__item pagination__item--active"]/following-sibling::a[1]/@href').extract_first()
        # extract_first() returns None when there is no next link (last page)
        if next_page is not None:
            # response.follow resolves relative links against the current page
            yield response.follow(next_page, self.parse)


    def parse_items(self,response):
        app_names = response.xpath('//a[@class="ListingHero-propertyIdentifierLink"]/text()').extract()
        current_prices = response.xpath("//h2[contains(@class, 'ListingStatus-price')]/text()").extract()
        buynw_price = response.meta['buynw']
        url = response.meta['url']
        installs = response.xpath('//div[@id="number_of_installs"]/text()').extract()
        app_ages = response.xpath('//div[@id="app_age"]/text()').extract()
        reviews = response.xpath('//div[@class="Snapshot-subvalue"]/text()').extract()
        prices = response.xpath('//div[@id="app_store_price"]/text()').extract()
        profits = response.xpath('//div[@id="net_profit"]/text()').extract()
        reskins = response.xpath('//div[@id="reskin"]/text()').extract()
        for (app_name, install, app_age,review,price,profit,reskin,current_price) \
                in zip(app_names, installs, app_ages, reviews,prices,profits,reskins,current_prices):
            yield {'App Name':app_name.encode('utf-8').strip(),'Number of Installs': install.encode('utf-8').strip(),
                   'App Age': app_age.encode('utf-8').strip(),
                   'Rating': review.encode('utf-8').strip(), 'App Store Price': price.encode('utf-8').strip(),
                   'Net Profit': profit.encode('utf-8').strip(),'Reskin': reskin.encode('utf-8').strip(),
                   'Current Price':current_price.encode('utf-8').strip(),
                   'URL':url,'Buy Now Price':buynw_price.encode('utf-8').strip()}

Feed Export:

Scrapy can export scraped data in several formats, including JSON, CSV and XML. I'll store my data in CSV format. To do that, add the following lines to settings.py.

#FEED FORMAT
FEED_FORMAT = "csv"
FEED_URI = "flippa.csv"
FEED_EXPORT_ENCODING = 'utf-8'
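Scrapy 2.1 and later deprecate FEED_FORMAT and FEED_URI in favour of a single FEEDS dictionary. If you are on a newer release, a roughly equivalent configuration, assuming the same output file name, would look like this:

```python
# settings.py (Scrapy 2.1 or later)
FEEDS = {
    "flippa.csv": {          # output path, relative to where the crawl is started
        "format": "csv",
        "encoding": "utf-8",
    },
}
```

Each key is an output location, so the same crawl can write to more than one file by adding another entry.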

Finally, run your spider with the runspider command from the inner scraping_flippa folder, the one that contains the spiders directory.

scrapy runspider spiders/flippaSpider.py

 

Output File:

The output file, flippa.csv, is created in the folder you run the command from, which here is the scraping_flippa folder.
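Once the data is in a CSV file, questions like "which app has many installs for a low price" become a few lines of code. The sketch below uses inline sample rows instead of the real flippa.csv. The column names follow the spider's output, but the values and their formats (such as "$1,200") are assumptions, so the to_number helper may need adjusting for real data.

```python
import csv
import io
import re

# Made-up rows; replace io.StringIO(SAMPLE_CSV) with open("flippa.csv") for real data
SAMPLE_CSV = """App Name,Number of Installs,Current Price
Puzzle Game,"10,000","$1,200"
Flashlight,"50,000",$300
Notes App,500,$150
"""


def to_number(text):
    """Strip currency signs and thousands separators, then convert to float."""
    digits = re.sub(r"[^\d.]", "", text)
    return float(digits) if digits else 0.0


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    row["installs"] = to_number(row["Number of Installs"])
    row["price"] = to_number(row["Current Price"])

# Rank by installs per dollar, skipping rows without a usable price
priced = [row for row in rows if row["price"] > 0]
best = max(priced, key=lambda row: row["installs"] / row["price"])
print(best["App Name"])
```

Installs per dollar is just one possible ranking; any other scoring function can be passed as the key.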

 

Summary:

Let's summarize the important points in this tutorial. Scraping data into one place makes analysing it easier. Scrapy is one of the most capable tools for scraping. Scraping code depends heavily on the structure and content of the page. You need scrapy.Request if you want to follow the links of a listing website. A clear knowledge of XPath is essential for scraping. Finally, you can store your data in formats such as CSV, JSON and XML.
