How to Scrape Data from Twitter
Do you intend to scrape data from Twitter? In this post, we will show you how to do so.
Twitter is one of the most popular social networking websites. It is best described as a microblogging service where users post short status updates of up to 280 characters (the limit was 140 characters until 2017).
Twitter users (tweeters) can post just about anything and share ideas and feelings with other users through the mobile app or the web app.
This tutorial focuses on using data scraping tools to mine data from Twitter. The data you can mine includes usernames, follower counts, hashtags, photos and profile pictures, links, geolocations, sign-up dates, etc.
Why scrape data from Twitter? Is it legal?
Twitter is a massive source of information that is useful to marketers. With Twitter scraping tools, marketers can:
- Connect with market influencers
- Monitor their competitors effectively
- Perform sentiment analysis
- Study customer behavior
- Target their market audience with relevant tweets
- Monitor brands
Data scraping from Twitter is also valuable to researchers who want to study and understand what is happening online.
Researchers can use data scraping tools to:
- Monitor the popularity of tweets and people on Twitter
- Gather information about tweeters, such as friends, followers, profile pictures, sign-up dates, etc.
- Find out who gets mentioned through ‘@’ usernames
- Survey how trends develop and change with time
- Examine other Twitter networks and communities
- Follow up on the influence of your tweets on people
Using the Twitter API (Application Programming Interface) to collect data is legal and authorized by Twitter for third-party use, so it will not get you into trouble with Twitter.
Twitter does not permit you to collect more data than the API allows. For this reason, many people who scrape Twitter use third-party web scrapers or build scrapers of their own. Doing so may or may not get you into trouble, depending on why you are collecting the data.
How to scrape data from Twitter
A variety of tools for scraping Twitter do not require any programming knowledge. Such tools make gathering data from Twitter easy.
Some of the popular tools and how to use them are discussed below.
1. Octoparse
Octoparse is an excellent tool for scraping data from social media sites.
Follow the guide below to use Octoparse:
- Download and install the latest version of Octoparse on your system.
- Make sure your system meets these requirements: Windows 7, 8, or 10, and Microsoft .NET Framework 3.5 (.NET 3.5 SP1)
- Register an account with Octoparse
Getting your Twitter URL
- Copy the URL of your Twitter search result
- Paste the copied URL in the ‘Extraction URL’ box and save
Get more data
- From the ‘Advanced Options,’ select ‘Scroll Down.’
- Set suitable values for ‘Scroll times’ and ‘Interval.’
- Select ‘Scroll down for one screen’ as the ‘Scroll way’ and click ‘OK.’
Extract data from each tweet in a loop
To extract the same data from every tweet, create a ‘Loop Item.’
- Select the data you want to extract from the webpage. The selected data area is highlighted.
- Click on ‘Select all’ >> ‘Extract text from the selected elements’ in the ‘Action Tips’ panel.
- Rename the ‘Field Name’ column if you need to.
Use a regular expression to reformat the data
- You can skip this step if you’re OK with the result.
- You can use a regular expression to delete words like ‘Retweet,’ ‘Like,’ ‘Reply,’ etc. To do so:
- Click on the ‘Reply’ row and select ‘Customize data field.’
- Click on ‘Refine extracted data’ and select ‘Add step.’
- Click the ‘Replace’ button and, in the ‘Replace’ box, paste the label exactly as it appears in the extracted data, including its spaces (for example, the ‘Reply ’ part of ‘Reply 856’).
- Click ‘OK’
- Click on ‘Start Extraction’ >> ‘Local Extraction’
- Click ‘Export’ to export the scraped data
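If you would rather clean the exported values outside Octoparse, the same replacement can be sketched in Python. This is a minimal sketch, assuming the extracted values look like ‘Reply 856’; the label list and the `strip_label` helper are illustrative names and not part of Octoparse.

```python
import re

# Labels assumed to be prepended to the counts in the extracted text.
LABELS = ("Reply", "Retweet", "Like")

# Match one whole label at the start of the value, plus surrounding whitespace.
_LABEL_RE = re.compile(r"^\s*(?:" + "|".join(LABELS) + r")\b\s*")


def strip_label(value):
    """Remove a leading label such as 'Reply ' and return what is left."""
    return _LABEL_RE.sub("", value).strip()


if __name__ == "__main__":
    for raw in ("Reply 856", "Retweet 1,204", "Like"):
        print(repr(raw), "->", repr(strip_label(raw)))
```

Values that carry no label pass through unchanged, so the helper can be applied to a whole column.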
2. ScrapeStorm
ScrapeStorm is a web scraping tool built on AI technology. It supports Windows and Mac, as well as Linux.
To use ScrapeStorm to scrape data, follow the guide below:
- Download and install ScrapeStorm on your system
- Register an account with ScrapeStorm and log in.
Create a task
- To create a task, copy the URL of your Twitter search result.
- Create a ‘New smart mode task.’ You can also create a task by importing the task rules.
- Open the ‘URL edit’ window.
- Paste the URL in the opened window.
Set the scraping rules
- Smart mode automatically recognizes the fields on the search results page and creates them for you.
- You can edit any of the fields, rename, add or delete fields, modify data in the fields, etc. by right-clicking on the field.
Set up your scraping task
- You can configure the schedule, IP rotation and delay, auto export, speed boost, image download, etc.
- Data scraping starts automatically after a short while
Export your data
- Click on the ‘Export’ button to export scraped data.
- Choose the file format for the exported data. Available formats include Excel, CSV, HTML, text, and database.
- Professional plan subscribers can export data files directly to WordPress.
3. WebScraper
WebScraper is a useful tool for scraping historical data from Twitter. By using the right filters, you can scrape advanced search data from Twitter. Such data can be quite valuable for market analysis.
To use WebScraper to scrape data from Twitter, follow the guidelines below:
- Download and install the WebScraper Chrome extension from the Chrome Web Store
- Right-click anywhere on the page and select ‘Inspect.’
- The developer tools panel opens
- Click on the ‘Web Scraper’ tab and click on ‘Create a new sitemap.’
- Click on ‘Import sitemap’ and paste your parameters into the sitemap JSON box. The sitemap tells the extension how to navigate the site and which data to extract.
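To give a rough idea of what goes into the sitemap JSON box, the sketch below builds a small sitemap in Python and prints it as JSON. The key names (`_id`, `startUrl`, `selectors`, and the selector fields) are assumptions based on the layout of sitemaps exported by the extension, and the CSS selectors are hypothetical placeholders, so compare the result with a sitemap exported from your own installation before relying on it.

```python
import json


def build_sitemap(sitemap_id, start_url):
    """Return a WebScraper-style sitemap as a dict (assumed layout)."""
    return {
        "_id": sitemap_id,
        "startUrl": [start_url],
        "selectors": [
            {
                # Scrolls the results page and yields one element per tweet.
                "id": "tweet",
                "type": "SelectorElementScroll",
                "parentSelectors": ["_root"],
                "selector": "article",  # hypothetical CSS selector
                "multiple": True,
                "delay": 2000,
            },
            {
                # Extracts the text found inside each tweet element.
                "id": "text",
                "type": "SelectorText",
                "parentSelectors": ["tweet"],
                "selector": "div[lang]",  # hypothetical CSS selector
                "multiple": False,
                "regex": "",
                "delay": 0,
            },
        ],
    }


if __name__ == "__main__":
    sitemap = build_sitemap("twitter-search", "https://twitter.com/search?q=%23python")
    print(json.dumps(sitemap, indent=2))
```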
Finding historical tweets with Twitter Advanced Search
Twitter Advanced Search is a tool for finding historical tweets, which you can filter using parameters like Words, People, and Dates.
- Visit https://twitter.com/search-advanced?lang=en. Filter based on your needs.
- Run the search
- Copy the search result URL from the address bar
- On the WebScraper toolbar, click on the Sitemap button and click on ‘Edit metadata’
- Paste the search URL you copied from Twitter’s advanced search page.
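The search URL can also be assembled by hand. The sketch below is a minimal example, assuming the standard `from:`, `since:`, and `until:` search operators; the `build_search_url` helper and its parameter names are illustrative, and the URL Twitter itself generates may carry extra parameters.

```python
from urllib.parse import urlencode


def build_search_url(words="", from_user=None, since=None, until=None):
    """Build a Twitter search URL from a few advanced-search filters."""
    terms = [words] if words else []
    if from_user:
        terms.append("from:" + from_user.lstrip("@"))
    if since:
        terms.append("since:" + since)  # date as YYYY-MM-DD
    if until:
        terms.append("until:" + until)  # date as YYYY-MM-DD
    # urlencode percent-encodes characters such as '#' and ':'.
    return "https://twitter.com/search?" + urlencode({"q": " ".join(terms)})


if __name__ == "__main__":
    print(build_search_url("#python", from_user="@nasa",
                           since="2019-01-01", until="2019-02-01"))
```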
To start scraping:
- Open the sitemap drop-down menu and click ‘Scrape’
- A new Chrome tab opens, in which the extension crawls the page and scrapes the data.
- Once scraping is complete, the browser window closes and you receive a notification.
Downloading the scraped data
- Go to the sitemap drop-down
- Click on ‘Export as CSV’
- Select ‘Download Now’
- A CSV file with all the scraped data starts downloading.
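Once the CSV file is on disk, a few lines of Python are enough for a first look at it. This is a minimal sketch using only the standard library; the column names in the sample (`username`, `text`) are hypothetical, since the real headers depend on the selectors in your sitemap.

```python
import csv
import io
from collections import Counter


def load_rows(csv_file):
    """Read an exported CSV into a list of dicts keyed by column header."""
    return list(csv.DictReader(csv_file))


def count_by(rows, column):
    """Count how often each value appears in one column."""
    return dict(Counter(row.get(column, "") for row in rows))


if __name__ == "__main__":
    # Stand-in for the downloaded file. For a real export, use something like
    # open("your-export.csv", newline="", encoding="utf-8") instead.
    sample = io.StringIO("username,text\nalice,hello\nbob,hi\nalice,again\n")
    rows = load_rows(sample)
    print(len(rows), "rows;", count_by(rows, "username"))
```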
4. PhantomBuster Twitter API
The PhantomBuster Twitter API is a great data scraping tool for extracting the profiles of key followers. Such a list is valuable for building audiences for Twitter ads or as part of a strategy to gain more followers.
Follow the steps below to set up and use the PhantomBuster Twitter API:
- Create a PhantomBuster account
- Connect PhantomBuster to your Twitter account.
- Click on the configure menu icon in the ‘Console.’
- Using Google Sheets, create a spreadsheet of the Twitter URLs you want to extract from.
- Paste the URLs into the spreadsheet, one per row.
- Copy the spreadsheet’s URL into PhantomBuster.
The PhantomBuster browser extension makes it easy for PhantomBuster to authenticate itself using your session cookie.
- Click on ‘Launch’ to start your data scraping automation
- You can schedule repeated launches of PhantomBuster to avoid hitting rate limits, mine more data, and spread workflows over days, weeks, or months. You can change the settings using the settings buttons on your dashboard.
- Select the frequency of the repeated launches.
- The output file from PhantomBuster is in CSV or JSON format and includes the following fields:
- Profile URL of each follower of the specified Twitter account
- Name, bio, location, user ID, etc.
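A JSON result file like this can be loaded and filtered with the standard library. The sketch below is only an illustration: the key names (`profileUrl`, `name`, `bio`, `location`, `userId`) are hypothetical stand-ins for the fields listed above, so check them against a real result file.

```python
import json

# Hypothetical key names; check them against a real PhantomBuster result file.
FIELDS = ("profileUrl", "name", "bio", "location", "userId")


def load_followers(json_text):
    """Parse a JSON array of follower records into a list of flat dicts."""
    records = json.loads(json_text)
    return [{field: record.get(field, "") for field in FIELDS} for record in records]


def with_location(followers):
    """Keep only the followers that list a location."""
    return [follower for follower in followers if follower["location"]]


if __name__ == "__main__":
    sample = json.dumps([
        {"profileUrl": "https://twitter.com/alice", "name": "Alice", "location": "Lagos"},
        {"profileUrl": "https://twitter.com/bob", "name": "Bob"},
    ])
    followers = load_followers(sample)
    print(len(followers), "followers,", len(with_location(followers)), "with a location")
```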
5. Tweepy
Tweepy is a commonly used Python library for gathering hashtags, usernames, tweets, etc. from Twitter. It acts as an interface between Python and the Twitter API.
To use Tweepy, you will need:
- A valid Twitter account
- Python installed on your system. Tweepy 3.6.0, the version installed in this guide, works with Python 2.7 or Python 3; newer Tweepy releases require Python 3.
- The Anaconda distribution, whose terminal is used below to install Tweepy
Visit the Twitter application page and log in with your Twitter account to generate the access codes that permit you to collect data from Twitter.
- Give your application a name.
- Enter a description in the ‘Description’ field. E.g., I want to scrape tweets via hashtags.
- Enter a placeholder URL in the website field. E.g., http://placeholder.com
- Tick the developer agreement checkbox
- Click on ‘Create your Twitter application’ to create your application.
- Select the ‘Keys and access tokens’ tab. Take note of the four different codes.
- Copy the codes into a text file. The first code is the ‘Consumer Key’ (API Key). The second code is the ‘Consumer Secret’ (API Secret).
- Scroll down and click on ‘Create my access token.’ Scroll down again and copy the ‘Access Token’ and ‘Access Token Secret’ codes. Keep the codes safe.
- Install Tweepy, which is a Python library
- Launch the Anaconda terminal. You will see a prompt such as ‘C:\users\Ritvik>’
- Type pip install tweepy to install the library. It is an interface between Python and Twitter that has a lot of built-in functions
- Press the Enter key.
- pip connects to the internet and downloads everything needed to install Tweepy (version 3.6.0 in this guide)
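Since the four codes should be kept safe, one option is to read them from environment variables instead of pasting them into a script. This is a minimal sketch; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are assumptions chosen for illustration, not names required by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names for the four codes.
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four codes as a dict, or raise KeyError if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


if __name__ == "__main__":
    try:
        credentials = load_credentials()
        print("Loaded", len(credentials), "codes")
    except KeyError as error:
        print(error)
```

Failing early with the list of missing names makes a typo in a variable name easier to spot than an authentication error later on.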
In your Tweepy script, you need to specify the following:
- The hashtags you want to scrape data from
- The consumer key
- The consumer secret, access token, and access token secret obtained from the Twitter application website
- The first thing the hashtag-scraping function does is create an authentication object called ‘auth’ from the four access codes. The ‘auth’ object identifies you as an authorized user.
- The next thing the function does is create an API object, which is the interface you will use to request data from Twitter.
- Type in a name for your spreadsheet in the ‘name’ field. It helps to name your spreadsheets after the hashtags.
- Name the header row with the fields you want to fill in on the spreadsheet. The headers can be the timestamp of the tweet, the text of the tweet, the tweeter (person), the hashtags, and the number of followers.
- Specify the filters you want to apply and the number of tweets you want to analyze in the item section.
- Fill in the codes obtained from the Twitter application website
- Fill in the hashtag phrase
- Open your working directory
- Open the spreadsheet. Your spreadsheet will contain columns for the timestamp, the tweet text, the username, all hashtags, and the number of followers.
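The steps above can be sketched as a short script. This is a hedged sketch, not the exact script this guide was written around: the helper names, the `keys` dictionary layout, and the `-filter:retweets` filter are assumptions. The search method is called `search` in Tweepy 3.x and `search_tweets` in 4.x, so the sketch looks up whichever exists, and it assumes your developer account has access to the standard search endpoint.

```python
import csv
import io
from types import SimpleNamespace

HEADER = ["timestamp", "tweet_text", "username", "all_hashtags", "followers_count"]


def tweet_to_row(tweet):
    """Flatten one tweet object into a spreadsheet row."""
    hashtags = [tag["text"] for tag in tweet.entities.get("hashtags", [])]
    return [tweet.created_at, tweet.text, tweet.user.screen_name,
            " ".join(hashtags), tweet.user.followers_count]


def write_rows(tweets, csv_file):
    """Write the header and one row per tweet; return the number of tweets."""
    writer = csv.writer(csv_file)
    writer.writerow(HEADER)
    count = 0
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))
        count += 1
    return count


def search_hashtag(hashtag, keys, limit=100):
    """Return an iterator over up to `limit` tweets (needs network and valid keys)."""
    import tweepy  # imported here so the helpers above work without Tweepy

    # The 'auth' object is built from the four access codes.
    auth = tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])
    auth.set_access_token(keys["access_token"], keys["access_token_secret"])
    api = tweepy.API(auth)  # the API object used to request data
    # Tweepy 3.x exposes `search`; Tweepy 4.x renamed it `search_tweets`.
    search = getattr(api, "search_tweets", None) or api.search
    query = hashtag + " -filter:retweets"  # assumed filter: skip retweets
    return tweepy.Cursor(search, q=query, lang="en").items(limit)


if __name__ == "__main__":
    # Offline demo with a stand-in tweet; swap in search_hashtag(...) for real data.
    fake = SimpleNamespace(
        created_at="2020-01-01 00:00:00",
        text="hello #python",
        entities={"hashtags": [{"text": "python"}]},
        user=SimpleNamespace(screen_name="alice", followers_count=42),
    )
    buffer = io.StringIO()
    write_rows([fake], buffer)
    print(buffer.getvalue())
```

For a real run, open a file in your working directory (for example with `open("python.csv", "w", newline="", encoding="utf-8")`, a hypothetical file name) and pass the result of `search_hashtag("#python", keys)` to `write_rows`.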
With these web scraping tools, you can collect huge volumes of data from Twitter. The collected data can be used for research, market analysis, and other applications. You can also set your own parameters and filters to narrow down the data you scrape.