shanika - June 11, 2019

How to Scrape Amazon Reviews using Python

Amazon is an e-commerce platform that sells items across many categories. Let's look at a few reasons why we might want to scrape Amazon.
1) Amazon keeps track of product reviews and ratings, which is an excellent way for any seller to get direct feedback from customers. Scraping these details helps the seller monitor customer opinion of the product.

2) Amazon delivers items to almost every corner of the world, and market research is done at every point before a price is set for an item. Getting to know this repository of products, organized by category and brand, is an excellent starting point for anyone in e-commerce.

3) Scraping also helps with drop shipping. By knowing the top-selling products on Amazon, we can be ready to ship our own products once customers order them.

Before starting with scraping:

To start with scraping, we need to be aware of a few things.

  • Python is one of the best programming languages for web scraping thanks to its simple structure. There are also many well-known Python libraries, such as BeautifulSoup and Scrapy, that make the job easier. I will use Scrapy for this tutorial as I find it easier to use.

  • We need to make sure we have a recent version of Python installed on our system. If your code was written for an older version, you may have to tweak it to work with the latest versions of the libraries. When starting afresh, it is always better to work with the latest version of the framework. If you are familiar with working in an IDE, you can install the latest version of Anaconda for Python here.

  • Take a look at the Amazon product and reviews pages and decide which page you need to scrape. Check out its HTML structure and note the XPaths and classes used. List out the paths for each element you need to extract.


Creating a Scrapy Project:

Now we will start by creating our project. First we need to install Scrapy, which can be done in two ways: with pip or with Anaconda, which you heard about in one of the previous paragraphs. pip is a package manager for Python; when you install a Python package such as Scrapy, pip also installs all of its dependencies, freeing you from a great burden. You can get pip from here.

Install Scrapy via pip using the following command.

pip install Scrapy

If you use Anaconda to install Scrapy, use the following command.

conda install -c conda-forge scrapy
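
To confirm that the installation worked, you can ask Scrapy for its version from the command line; adding -v should also print the versions of Python and the main dependencies.

scrapy version
scrapy version -v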

We will start by creating a Scrapy project; the different components used in crawling are placed in this folder. To create a project, use the commands below.

cd path-to-your-project
scrapy startproject scraping_amazon_reviews

Once we create a project, we will have a folder and configuration file. Let’s have a look at folder structure and supporting files.

├── scrapy.cfg                  # configuration file
└── scraping_amazon_reviews     # the project's Python module, you import your code from here
    ├── __init__.py
    ├── items.py                # define items here
    ├── middlewares.py          # middlewares file of the project
    ├── pipelines.py            # pipeline file of the project
    ├── settings.py             # add settings here
    └── spiders                 # directory to locate spiders
        └── __init__.py
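
The settings.py file is where project-wide options live. As a side note, here is a minimal sketch of the kind of settings that are often tuned when scraping a site like Amazon; the values shown are illustrative assumptions, not requirements of this tutorial.

# scraping_amazon_reviews/settings.py (illustrative values only)
BOT_NAME = 'scraping_amazon_reviews'
SPIDER_MODULES = ['scraping_amazon_reviews.spiders']
NEWSPIDER_MODULE = 'scraping_amazon_reviews.spiders'

# Send a browser-like User-Agent header (example string, pick your own)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Wait between requests to reduce the chance of being blocked
DOWNLOAD_DELAY = 2

# Whether to respect robots.txt; check the site's terms before changing this
ROBOTSTXT_OBEY = True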

Next, we need to create a spider. A spider is Python code describing how to scrape a web page: it crawls through the pages and extracts content. The following commands create a very simple spider that actually does nothing yet, but it provides a template for us to build our own spider.

First, go into the outer project folder.

cd scraping_amazon_reviews

Then go into the inner module folder.

cd scraping_amazon_reviews

Run the following command.

scrapy genspider spidername your-amazon-link-here

spidername is the name of the spider; you can give it any name you want. In place of `your-amazon-link-here`, give the URL of the web page or the domain you are going to scrape. I will name my spider `reviewspider` and scrape the following link.

scrapy genspider reviewspider https://www.amazon.com/product-reviews/B01DFKC2SO/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=

The above command will create a new file in your spiders folder. In my case, the file name will be reviewspider.py.

Identify Patterns in a Web page:

Since we will implement the spider in Python to scrape our Amazon reviews, we need to analyze the target web page and identify the HTML patterns we want to extract. Below is the page whose reviews we will be scraping.

In Chrome, we can inspect this page using the developer tools; press F12 to open them. We can see the HTML structure below, which renders the reviews of the product. Look at the image taken from the source code of the web page.

There is a division with the id “cm_cr-review_list”. To extract the rating and review comments, we need to dig into the additional divisions nested inside “cm_cr-review_list”.

Below is the structure of one of the reviews on the Amazon page. The highlighted boxes show the rating and the comment given to the product: the rating is placed under the review-rating division and the comment text in the review-text division. We will now use these structures as the pattern to define our spider.
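
Before writing the spider, you can experiment with these selectors interactively in the Scrapy shell. The snippet below is only a quick sketch and assumes cm_cr-review_list is the id of that division; Amazon may return a CAPTCHA or block requests that look automated, so the selectors can come back empty depending on the response you get.

scrapy shell "https://www.amazon.com/product-reviews/B01DFKC2SO/ref=cm_cr_arp_d_viewpnt_lft?pageNumber="
>>> # check that the review list division is present in the downloaded page
>>> response.xpath('//div[@id="cm_cr-review_list"]').extract_first() is not None
>>> # try one of the review selectors
>>> response.xpath('//span[@data-hook="review-body"]/span/text()').extract()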

Skeleton of Spider file:

Remember that we created a spider earlier in this tutorial. It contains the following code.

This file lives inside scraping_amazon_reviews/scraping_amazon_reviews/spiders/reviewspider.py.

# -*- coding: utf-8 -*-
import scrapy

class ReviewspiderSpider(scrapy.Spider):
    name = "reviewspider"
    allowed_domains = ["https://www.amazon.com/product-reviews/B01DFKC2SO/ref=cm_cr_arp_d_viewpnt_lft?pageNumber="]
    start_urls = (
        'http://www.https://www.amazon.com/product-reviews/B01DFKC2SO/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=/',
    )

    def parse(self, response):
        pass

This is the basic template of a spider. Note the allowed_domains and start_urls: they are taken directly from the command you ran to create this spider. To make things cleaner, set allowed_domains to amazon.com.

Don’t change the name of the parse function: this is where you tell Scrapy which elements you are going to scrape from the given web page. We can add more functions if needed. Now we will write the customized spider file to scrape Amazon reviews.

Defining Spider for Crawling:

Defining the spider is mostly about writing the parse() function. Note that parse receives a response object, which contains the source code of the given web page; it is up to you to extract the information you need from it.

This is an example of an Amazon review taken from the web page I am going to scrape. The important information about a review includes the name of the reviewer, the review title, the rating, the review text, and the number of comments. The procedure is the same for any other information you want to scrape.

XPath is probably the best way to reach any element in a web page, and the good news is that Scrapy uses XPath to reach a particular node in the response. A typical line to scrape from the response looks like this.

reviews = response.xpath('//span[@class="review-text "]/text()').extract()

Let’s have a look at how XPath is formed and what syntax is used.

  • “//” : start matching anywhere in the document, at the tag named next in the XPath.

  • “span” : the tag which contains the reviews.

  • “@class="review-text"” : the class used to display reviews. There are a lot of span elements in the web page; the class attribute helps Scrapy locate only the spans that contain reviews.

  • “text()” : selects the text content of the matched <span> tags.

  • “extract()” : returns every match in the web page that follows the XPath, as a list of strings.

We will need the following XPaths to get the information we want.

names = response.xpath('//span[@class="a-profile-name"]/text()').extract()
reviewTitles = response.xpath('//a[@data-hook="review-title"]/span/text()').extract()
starRatings = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
reviews = response.xpath('//span[@data-hook="review-body"]/span/text()').extract()
noOfComments = response.xpath('//span[@class="a-size-base"]/text()').extract()

There is one more step needed to complete our code: we have to yield the information we saved in those variables. There are five lists, and we need to yield the matching entries from all of them together, one review at a time.

for (name, title, rating, review, comments) in zip(names, reviewTitles, starRatings, reviews, noOfComments):
    yield {'Name': name, 'Title': title, 'Rating': rating, 'Review': review, 'No of Comments': comments}

Scrapy also allows us to crawl multiple URLs in the same run: you keep a base URL, build the list of page URLs, and join them onto the base with response.urljoin(). We will not use it in this example, but a possible sketch is shown below.
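
As a hedged illustration only (it is not part of the final spider below), pagination could be followed by rebuilding the pageNumber query parameter and yielding a new request; the five-page limit and the way the current page is read back from the URL are assumptions made for this sketch.

# Inside the spider class, the parse() method could schedule the next page like this:
def parse(self, response):
    # ... extract and yield the review fields as shown above ...

    # The start URL ends with 'pageNumber=', so an empty value means page 1
    current_page = int(response.url.split('pageNumber=')[-1] or 1)
    if current_page < 5:  # arbitrary stopping point for the sketch
        next_page = response.urljoin('?pageNumber={}'.format(current_page + 1))
        yield scrapy.Request(next_page, callback=self.parse)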

So, the complete code for our application is

import scrapy

# Implementing the spider
class ReviewspiderSpider(scrapy.Spider):
    # Name of the spider
    name = 'reviewspider'
    allowed_domains = ["amazon.com"]
    start_urls = ['https://www.amazon.com/product-reviews/B01DFKC2SO/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=']

    def parse(self, response):
        # XPaths for each piece of review information
        names = response.xpath('//span[@class="a-profile-name"]/text()').extract()
        reviewTitles = response.xpath('//a[@data-hook="review-title"]/span/text()').extract()
        starRatings = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
        reviews = response.xpath('//span[@data-hook="review-body"]/span/text()').extract()
        noOfComments = response.xpath('//span[@class="a-size-base"]/text()').extract()
        # Yield one item per review
        for (name, title, rating, review, comments) in zip(names, reviewTitles, starRatings, reviews, noOfComments):
            yield {'Name': name, 'Title': title, 'Rating': rating, 'Review': review, 'No of Comments': comments}

Once we have built our spider, we run it using the runspider command, giving it the name of the output file in which to store the extracted data.

scrapy runspider spiders/reviewspider.py -o amazonreviews.csv

When you run the above command, the amazonreviews.csv file will be created in the scraping_amazon_reviews folder. I could successfully produce the results.

The above command appends data to the CSV file; if you run it again, the same data will be appended at the end of the file. This is fine if you are going to scrape the same type of information from several web pages. If you want to overwrite the file instead, use the following command.

scrapy runspider spiders/reviewspider.py -t csv -o - > amazonreviews.csv

There are two things to note. First, -o - writes the output to standard output, and the shell redirection > overwrites amazonreviews.csv. Second, you have to specify the file format with the -t option, because Scrapy cannot infer it from the file name. Both are needed for this command to run successfully.

Once you have scraped the data, you can do exploratory data analysis: for example, count the ratings or look at the words used in complaints. We can also use pandas, an open-source data analysis library written in Python. With pandas we can chart the data in various ways, and we can also build word clouds of the terms used in reviews.
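
As a hedged sketch of what that analysis could look like, assuming the amazonreviews.csv produced above and the column names used in the yield dictionary, you could load the file with pandas and count the ratings.

import pandas as pd

# Load the CSV produced by the spider (file name as used in this tutorial)
df = pd.read_csv('amazonreviews.csv')

# How many reviews were given each star rating
print(df['Rating'].value_counts())

# Average review length, a rough proxy for how detailed the feedback is
print(df['Review'].str.len().mean())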

Summary:

Scraping is the best way to get the bulk data you need from web pages and analyze it using web crawling methods. We can create our own customized scraper or use an existing framework. There are free scraping tools, but to build our own we used Scrapy, an open-source web scraping framework and a solid choice for large web scraping projects. One advantage of this framework is that it is built on the Twisted asynchronous networking library. We have seen how to set up Python and the Scrapy framework, crawl through Amazon pages, and extract the required review data such as ratings and comments. This data can then be analyzed using pandas, a Python analysis tool.
