shanika - June 29, 2019
Web scraping is the process of gathering data in bulk from the internet or from web pages. Many sites expose their data through an API, but when a site does not provide one, we can use web scraping to connect directly to the web page and collect the data we need. Python is one of the best programming languages for this kind of work: it is widely used for web development and also makes it easy to analyze the collected data with its numeric and scientific libraries. So for this tutorial we will use Python for scraping.
For web scraping there are multiple tools available, and one of the best is the Scrapy framework. It is specially designed to crawl websites and extract structured data, it scales easily from scraping a single page to scraping many pages, and it can export the extracted data in various formats such as CSV, JSON, XML, and JSON Lines.
In this tutorial, we will extract the title, date, upvotes, and image links of posts from the Reddit website.
We need an initial system setup before we start with our scraping project. We can install the Scrapy framework either with pip (Python) or with the Anaconda distribution. We will look at both ways of installing it; you can pick whichever you prefer.
pip install Scrapy
If you are installing with pip, it is good to know the dependencies the Scrapy framework relies on, because you may need to install them yourself if the installation runs into problems (one way to do that is shown after the list below). Also note that some of these packages depend on non-Python libraries that must be present for Scrapy to run successfully. The main dependencies are:
lxml: an efficient XML and HTML parser.
parsel: an HTML/XML data extraction library written on top of lxml.
w3lib: a multi-purpose helper for dealing with URLs and web page encodings.
Twisted: an asynchronous, event-driven networking engine.
cryptography and pyOpenSSL: used to deal with various network-level security needs.
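Normally pip pulls these in automatically when it installs Scrapy, but if you ever need to install them explicitly, a command along these lines should work:

pip install lxml parsel w3lib twisted cryptography pyOpenSSL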
If you are familiar with Anaconda, you can use it to install Scrapy instead. Just make sure you are running a Python version supported by your Anaconda installation; check the Anaconda website for the required Python version. With Anaconda you do not need to worry about the other dependencies Scrapy needs, as they are taken care of during installation.
To install Scrapy, open the Anaconda Prompt and type the command below.
conda install -c conda-forge scrapy
I would recommend using Anaconda, as it is quicker and saves you from having to deal with the dependencies yourself.
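Whichever route you choose, you can confirm that the installation worked by checking the Scrapy version from your command prompt; it should print the installed version without errors.

scrapy version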
Make a directory for your project and change into it from your command prompt.
cd <Path-to-your-project-dir>
To create a project using the Scrapy framework, type the command below.
scrapy startproject scraping_reddit
Once your project is created, you will see the folder structure below. We will take a brief look at these files, but for this project we will mainly create a spider in the spiders folder and adjust a few options in the settings.py file.
├── scrapy.cfg              # project configuration file
└── scraping_reddit         # the project's Python module; you import your code from here
    ├── __init__.py         # marks this directory as a Python package
    ├── items.py            # define models for the scraped items
    ├── middlewares.py      # define spider and downloader middlewares
    ├── pipelines.py        # the project's item pipelines
    ├── settings.py         # project settings go here
    └── spiders             # directory where the spiders live
        ├── __init__.py
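One of these files, items.py, is where you could declare a Scrapy Item describing the fields we are about to scrape. In this tutorial we will keep things simple and yield plain Python dictionaries from the spider instead, but a sketch of an equivalent (optional, purely illustrative) Item class would look like this:

import scrapy

class RedditPostItem(scrapy.Item):
    # One field per value we plan to scrape from each post
    title = scrapy.Field()
    image = scrapy.Field()
    up_votes = scrapy.Field()
    date_time = scrapy.Field()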
Next, we will create a spider for the web page we want to crawl. A spider is a Python class, generated from a template, that crawls the given page and extracts data using XPath and CSS expressions. We will take a closer look at these selectors while writing the spider code. To create the spider, navigate into the project directory (the one containing scrapy.cfg) and type the command below.
cd scraping_reddit
scrapy genspider redditSpider <your-reddit-link>.com
This command creates a spider template named "redditSpider", with the link you pass set as the value of "allowed_domains". Our Reddit link will be "https://www.reddit.com/r/cats/", so below is the command we will use for this URL.
scrapy genspider redditSpider https://www.reddit.com/r/cats/
Let's now look at the web page our redditSpider points to and write the code to extract the marked items from it. We will be extracting the post time, the title text, the image, and the number of upvotes.
We will inspect the path of each element that needs to be extracted. Let's start by inspecting the image element.
While scraping web pages, the most common task is extracting data from the HTML source. There are several libraries you can use for this, such as BeautifulSoup (a Python library) or lxml, but Scrapy has its own mechanism, called Selectors. Selectors pick out parts of the HTML document specified by either XPath or CSS expressions. For example, the following selector extracts reviewer names from a review page:

names = response.xpath('//span[@class="a-profile-name"]/text()').extract()
"//": start the search anywhere in the document for the tag that follows.
"span": the tag that contains the text we want.
"@class="a-profile-name"": the class attribute used to narrow the match. There are a lot of span elements on the page, and the class attribute helps Scrapy locate only the spans that hold the reviewer names.
"text()": refers to the text inside the <span> tag.
"extract()": extracts every instance on the page that matches the XPath.
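Before wiring selectors into a spider, you can try them out interactively with Scrapy's shell. For example, you can load the Reddit page and test the image selector we will use later (the exact results depend on the markup Reddit serves to non-browser clients, so it may differ from what you see in your browser):

scrapy shell "https://www.reddit.com/r/cats/"
>>> response.xpath('//img[@alt="Post image"]/@src').extract()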
Below is the basic structure of the generated spider. The 'allowed_domains' value is taken from the link passed to the command above; you should set it to just the domain name of the site, and the generated start_urls value also needs correcting (we do both in the complete code further down). Next, we will write XPath expressions for the items and make the spider crawl recursively.
# -*- coding: utf-8 -*-
import scrapy


class RedditspiderSpider(scrapy.Spider):
    name = 'redditSpider'
    allowed_domains = ['https://www.reddit.com/r/cats/']
    start_urls = ['http://https://www.reddit.com/r/cats//']

    def parse(self, response):
        pass
Below are the XPath expressions we will use to extract the Title, Images, UpVotes, and Date/Time. (Note that class names like these are generated by Reddit's front end and may change over time, so verify them by inspecting the page.)
titles = response.xpath('//*[@class="_eYtD2XCVieq6emjKBH3m"]/text()').extract()
imgs = response.xpath('//img[@alt="Post image"]/@src').extract()
upVotesList = response.xpath('//*[@class="_1rZYMD_4xY3gRcSS3p8ODO"]/text()').extract()
datetimes = response.xpath('//*[@data-click-id="timestamp"]/text()').extract()
We use the wildcard selector '//*' together with an attribute predicate to match the title, upvote, and date/time elements. Once the data is extracted, we yield it with the help of zip(). zip() maps together the items at the same index from different containers, so the corresponding values can be handled as a single entity.
for (title, img, upVotes, datetime) in zip(titles, imgs, upVotesList, datetimes):
    yield {'Title': title.encode('utf-8'), 'Image': img, 'Up Votes': upVotes, 'Date Time': datetime}
Reddit only sends a handful of posts when you request a subreddit, so to scrape more data you need to set Scrapy up to crawl recursively. The first step is to find the XPath of the link to the next page; then use the response.follow function with the parse function as its callback. This is how it is done in the code:
next_page = response.xpath('//link[@rel="next"]/@href').extract_first()
if next_page is not None:
    yield response.follow(next_page, self.parse)
You don't need to crawl pages endlessly, though; you can control how many pages the spider follows with the DEPTH_LIMIT setting.
custom_settings = { 'DEPTH_LIMIT': 10 }
Below is the complete code for scraping the Reddit page.
# -*- coding: utf-8 -*-
import scrapy


class RedditspiderSpider(scrapy.Spider):
    name = 'redditSpider'
    allowed_domains = ['www.reddit.com']
    start_urls = ['https://www.reddit.com/r/cats/']
    custom_settings = {
        'DEPTH_LIMIT': 10  # stop following "next" links after 10 levels
    }

    def parse(self, response):
        # Extract the fields of every post on the page
        titles = response.xpath('//*[@class="_eYtD2XCVieq6emjKBH3m"]/text()').extract()
        imgs = response.xpath('//img[@alt="Post image"]/@src').extract()
        upVotesList = response.xpath('//*[@class="_1rZYMD_4xY3gRcSS3p8ODO"]/text()').extract()
        datetimes = response.xpath('//*[@data-click-id="timestamp"]/text()').extract()

        # Pair up the values that belong to the same post and yield them as items
        for (title, img, upVotes, datetime) in zip(titles, imgs, upVotesList, datetimes):
            yield {'Title': title.encode('utf-8'), 'Image': img, 'Up Votes': upVotes, 'Date Time': datetime}

        # Follow the "next page" link, if there is one, and parse it the same way
        next_page = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Scrapy gives us the flexibility to store the scraped data in formats like JSON, CSV, XML, or JSON Lines. Here we will set the output format to a CSV file, so add the following lines to the settings.py file.
#FEED FORMAT
FEED_FORMAT = "csv"
FEED_URI = "reddit.csv"
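If you would rather have JSON output, the same pair of settings can point to a JSON feed instead, for example:

FEED_FORMAT = "json"
FEED_URI = "reddit.json"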
To run our spider, we use the runspider command with the path to the spider file (run it from the folder that contains the spiders directory).
scrapy runspider spiders/redditSpider.py
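Since the spider lives inside a Scrapy project, you can alternatively run it by name with the crawl command; the -o option lets you set the output file directly from the command line instead of through settings.py:

scrapy crawl redditSpider -o reddit.csv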
Once the spider finishes, you will find the output file at the path given in FEED_URI, e.g. "..\scraping_reddit\scraping_reddit\output". The output file contains the title, image, upvotes, and date/time values extracted with our XPath expressions.
Once the command is run, Scrapy locates the spider and runs it through its crawler engine. The crawl starts by making requests to the URLs specified in the spider, and the default callback method, parse, is invoked with each response. parse works through the response object, looping over the elements selected by the CSS selectors or XPath expressions. These requests are scheduled and processed asynchronously: Scrapy does not wait for one request to finish before sending the next, and if one request fails, it does not affect the other requests.
Concurrent requests help you crawl fast, but Scrapy also lets you slow things down by delaying downloads and limiting the number of concurrent requests per domain or per IP. It even has an auto-throttling extension that figures out sensible values automatically.
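For example, you could add settings along these lines to settings.py (the values here are only illustrative):

DOWNLOAD_DELAY = 2                  # wait 2 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # at most 4 parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adjust the delay automatically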
Scraping is a great way to collect the data you need in bulk. In this tutorial we used the Scrapy framework and XPath expressions to crawl Reddit. Scrapy can be installed easily through Anaconda, lets you choose the output format in its settings file, and allows you to limit concurrent requests per domain or per IP or enable the auto-throttling extension. You can take the extracted data further by feeding it into Python analysis tools such as Pandas to turn the scraped data into meaningful charts.
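As a quick sketch of that last step, assuming the reddit.csv file produced above, you could load the results into Pandas like this:

import pandas as pd

# Load the CSV feed written by the spider
df = pd.read_csv("reddit.csv")
print(df.head())     # first few scraped posts
print(df.columns)    # Title, Image, Up Votes, Date Time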