Sam - November 28, 2018
Call it web scraping, data scraping, data extraction, screen scraping, web harvesting, or more generally DaaS (Data as a Service). By any name, Big Data has become a fundamental business tool in the 21st Century global business environment, and collecting and analyzing that data is crucial for any business that finds itself in a highly competitive market. In this post, the techno-wizards at ProxyRack explore the important role of web scraping in specific business applications across a broad range of industries.
We’ll define exactly what web scraping is and how businesses like yours can use it to boost profits, foster robust growth, and streamline Big Data processes. We’ll explain why web scraping is an essential business tool, and how it is accessible for enterprises of all sizes in a variety of markets and industries.
Web scraping is the general term for the various automated methods applied for collecting information from the internet. To be effective, this is accomplished by software which simulates human web browsing, or web surfing, to collect information from websites. In the old school business environment, knowing your competitors and keeping up to speed with supply and demand variables and other changing market conditions are well-established business fundamentals for any enterprise.
The value of business data is nothing new, but the demands on any business relying on internet marketing and data go far beyond the limited human capacity to collect that information from thousands of relevant websites, and that’s where the automated web scraping technique comes in. Web “bots” comprise an automated workforce able to carry out their assigned data collection tasks on a relentless 24/7/365 routine when required.
Businesses extract information from a website for a number of reasons, two of the most common being to grow the business by establishing a sales pipeline and to discover where competitors are setting their prices. Entrepreneur magazine’s Andrew Medal describes how web scraping is used as a growth hack by configuring a web crawler to search pages for specific terms. In Medal’s example, a sneaker reseller assigns a bot to browse for the terms “Jordan” and “Air Jordan” at popular competing retail sites such as eBay and StockX. In this way, the reseller is able to access the aggregate prices charged by the competition and use that information as a competitive edge.
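The reseller bot described above boils down to two steps: pull listings out of scraped page markup, then aggregate the prices for the search terms of interest. Here is a minimal sketch in Python using only the standard library; the HTML snippet, CSS class names, and prices are invented for illustration, and a real crawler would of course fetch live pages from the retailer rather than a hard-coded string.

```python
# Sketch of a price-monitoring bot: parse a scraped listings page and
# aggregate prices for "Jordan" listings. Markup and prices are invented.
from html.parser import HTMLParser

SCRAPED_PAGE = """
<div class="listing"><span class="title">Air Jordan 1 Retro</span>
<span class="price">$180.00</span></div>
<div class="listing"><span class="title">Running Shoe</span>
<span class="price">$60.00</span></div>
<div class="listing"><span class="title">Air Jordan 11</span>
<span class="price">$220.00</span></div>
"""

class ListingParser(HTMLParser):
    """Collects (title, price) pairs from listing markup."""
    def __init__(self):
        super().__init__()
        self.listings = []   # finished (title, price) pairs
        self._field = None   # which span we are currently inside
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "span" and cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "title" in self._current and "price" in self._current:
                self.listings.append(
                    (self._current["title"],
                     float(self._current["price"].lstrip("$"))))
                self._current = {}

parser = ListingParser()
parser.feed(SCRAPED_PAGE)

# Keep only listings matching the reseller's search term, then aggregate.
jordans = [price for title, price in parser.listings if "Jordan" in title]
print(f"matched {len(jordans)} listings, avg ${sum(jordans) / len(jordans):.2f}")
```

Run on the sample page, this matches the two “Air Jordan” listings and reports their average price; swapping in a real HTTP fetch and the target site’s actual markup is where the genuine engineering work lies.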
Of course, the reseller’s competitors are using the same essential web scraping techniques themselves, which leads us to the web scraping measures/countermeasures scenario reminiscent of the old “Spy vs. Spy” comics.
As data security consultant and user acquisition expert Eran Halevy notes in his own 2018 contribution to Entrepreneur, web scraping has quickly developed into an unavoidable online arms race for the internet marketing sector. His article describes the web scraping slugfest between retail giants Amazon and Walmart. Amazon made industry news in 2017 by successfully blocking Walmart’s digital army of bots from web scraping Amazon’s listings “several million times a day”. This online cold war spawned a whole new sector of third-party service providers who specialize in identifying and blocking web scraping by competitors. That’s how valuable the data is.
The Amazon/Walmart skirmish also highlights an important reality for enterprises at all levels concerning web scraping. If you’re not doing it yet, you can safely bet that your competitors are, which may explain your enterprise’s otherwise mysteriously diminishing returns. The CEO of NY wholesaler Boxed explained his reasons for scraping his competitors’ sites every 20 minutes saying, “If we’re not decently priced, we’ll see it almost immediately in sales declines.”
Retail price competition is just one rather obvious aspect of the business value of Big Data. Now let’s look at some other rather surprising and creative ways businesses can profit from the Big Data collected by web scraping.
The beauty industry is a $445 billion industry according to this article by HuffPost’s style and beauty reporter Julia Brucculieri, with the average American woman spending up to $300,000 just on face products in her lifetime. To date, most “big beauty” brands aren’t selling products tailored to fit the needs of individual consumers, but companies such as Proven are changing that “one-size-fits-all” mentality with advanced web scraping combined with Artificial Intelligence (AI) technology.
The database at the center of Proven and their tailored product development strategy was 2 years in the making. The data was compiled by web scraping more than 8 million consumer reviews covering 100,000 skin care products. Bots also scanned for data on 20,000 beauty ingredients while combing through 4,000 scientific articles about skin and ingredient details. Specific keywords such as “acne” or “wrinkles” are connected to product reviews and ratings using machine learning. In this way, products can be tailored to use the ingredients proven most effective for various skin conditions.
Consumers contribute to the success of their personal skin care products by taking a short dermatology survey to determine age, skin type, skin goals, ethnicity, and geographic location. Calculations are made using the web scraped data to develop a unique skin profile and a customized skin care regimen tailored to each customer’s specific needs. The massive web scraped database also lets Proven avoid ingredients which are not a good fit with certain skin types and could actually cause harm.
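The core idea of connecting condition keywords in reviews to ingredient performance can be illustrated with a toy example. Proven’s actual system applies machine learning over millions of records; the sketch below just does naive keyword matching over a handful of invented reviews and ingredients, averaging the ratings each ingredient earns per condition.

```python
# Toy illustration: link condition keywords found in scraped reviews to
# product ratings, then average ratings per ingredient for each condition.
# Sample reviews and ingredients are invented for illustration only.
from collections import defaultdict

KEYWORDS = {"acne", "wrinkles"}

reviews = [
    {"text": "cleared my acne fast",     "rating": 5, "ingredients": ["salicylic acid"]},
    {"text": "did nothing for my acne",  "rating": 2, "ingredients": ["witch hazel"]},
    {"text": "wrinkles look softer",     "rating": 4, "ingredients": ["retinol"]},
    {"text": "acne got worse",           "rating": 1, "ingredients": ["witch hazel"]},
]

# scores[condition][ingredient] -> ratings from reviews mentioning that condition
scores = defaultdict(lambda: defaultdict(list))
for review in reviews:
    words = set(review["text"].lower().split())
    for condition in KEYWORDS & words:
        for ingredient in review["ingredients"]:
            scores[condition][ingredient].append(review["rating"])

# Average rating per ingredient, per condition.
avg = {cond: {ing: sum(r) / len(r) for ing, r in ings.items()}
       for cond, ings in scores.items()}
print(avg)
```

On this sample, salicylic acid averages a 5.0 rating for acne while witch hazel averages 1.5, exactly the kind of signal that would steer an ingredient into or out of a customized formulation.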
Tristan Dresbach of the NYC Data Science Academy came up with a creative use for web scraping when he asked the question “What characteristics maximize the probability of a successful Kickstarter campaign?” The popular crowdfunding platform Kickstarter has drawn nearly $4 billion in pledges for business start-up campaigns.
The crowdfunding platform provides an exciting alternative to traditional start-up funding sources such as small business loans, finding an angel investor, or risking your own hard-earned cash. As of October 2018, the share of Kickstarter campaigns that reach full funding is a daunting 36.4%, for a 63.6% failure rate, according to Statista. (Pledges are returned to the donors in any failed Kickstarter campaign, meaning one which does not achieve full funding.)
Dresbach decided to use web scraping to analyze the winning 36% to identify the key characteristics of successful campaigns. He created a script to extract 20+ variables including city, state, number of updates, reward levels, campaign duration, category, and creator to name just a few. Dresbach was able to determine important parameters for success at Kickstarter including:
Type of project – Dance, music, and theater (with a warning that hip-hop and electronic dance should be avoided, since these risky categories fund at rates below 40%).
Ideal funding goal – Campaigns with goals of $300 to $400 are the most successful of all, within the broader success range of $300 to $1,700.
Best campaign duration – 1-, 9-, and 15-day campaigns have the highest probability of success.
Best campaign launch locations – Vermont is the best state; Wyoming is the worst.
Top campaign impact factors – Surprisingly, comments and updates have more impact on campaign success than reward levels.
Dresbach has just “scraped the surface” of this project and plans to expand to 200 sub-categories to more precisely predict the best ways to create a start-up project, set the minimum funding goal, set reward levels, and choose a deadline for successful funding campaigns at Kickstarter.
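The kind of analysis Dresbach describes can be sketched very simply: group the scraped campaign records by one of the extracted variables and compute the funding-success rate within each group. The campaign records below are invented stand-ins for his 20+ scraped variables, but the grouping logic is the same at any scale.

```python
# Group scraped Kickstarter campaign records by category and compute the
# success rate per group. The sample records are invented for illustration.
from collections import Counter

campaigns = [
    {"category": "dance",            "goal": 350,  "funded": True},
    {"category": "dance",            "goal": 900,  "funded": True},
    {"category": "electronic dance", "goal": 5000, "funded": False},
    {"category": "theater",          "goal": 1200, "funded": True},
    {"category": "theater",          "goal": 8000, "funded": False},
]

totals, successes = Counter(), Counter()
for campaign in campaigns:
    totals[campaign["category"]] += 1
    successes[campaign["category"]] += campaign["funded"]  # True counts as 1

success_rate = {cat: successes[cat] / totals[cat] for cat in totals}
print(success_rate)
```

The same loop works unchanged for any other scraped variable (state, duration, goal bracket), which is how a single extraction script can yield all of the success parameters listed above.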
Big Data and analytics are enhancing recruiting and talent management in the human resources sector of industries across the board. Companies can engage in proactive hiring, using web scraping to locate and attract the best-qualified candidates for the positions they have available. They no longer need to rely on the intuition and limited resources of individual human recruiters when it’s time to build the dream teams that will attract investors and inspire customer or client loyalty.
Web scraping allows recruiters to expand the search for precisely qualified talent beyond the usual resume sources at LinkedIn or Indeed. Though these massive employment sites contribute a significant amount of hiring data to the recruiting process, web scraping can expand the search to social media and industry websites to aggregate data which enhances hiring with decisions based on facts and eliminates much of the risk and guesswork that is inherent in traditional hiring. In the IT field, for example, web scraping can be used to grade programmer candidates based on their coding abilities and the track record of actual programming contributions they have made online.
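A fact-based screening pass like the one described for IT hiring can be as simple as a weighted score over publicly scraped activity counts. The candidate fields and weights below are invented for illustration; a real recruiting pipeline would scrape these figures from code-hosting and Q&A sites and tune the weights against actual hiring outcomes.

```python
# Sketch of fact-based candidate screening: rank programmer candidates by a
# weighted sum of scraped contribution counts. Fields and weights are invented.
candidates = [
    {"name": "A. Rivera", "repos": 12, "reviews": 40, "answers": 15},
    {"name": "B. Chen",   "repos": 3,  "reviews": 5,  "answers": 80},
    {"name": "C. Okafor", "repos": 25, "reviews": 10, "answers": 2},
]

# Hypothetical importance weights per activity type.
WEIGHTS = {"repos": 2.0, "reviews": 1.5, "answers": 1.0}

def score(candidate):
    """Weighted sum of the candidate's scraped contribution counts."""
    return sum(WEIGHTS[field] * candidate[field] for field in WEIGHTS)

ranked = sorted(candidates, key=score, reverse=True)
print([c["name"] for c in ranked])
```

However crude, a transparent score like this replaces gut feel with a reproducible, data-backed shortlist that recruiters can audit and refine.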
Of course, from the job seeker’s side, web scraping can also be useful, as self-described “aspiring data scientist” Michael Salmon explains in his article “Web Scraping Job Postings from Indeed”. Salmon describes his method as working smarter, not harder when parsing massive amounts of job listings at Indeed, which, by the way, also uses web scraping to compile its huge aggregated job lists.
A quick Google search for “Generating leads with web scraping” reveals what is probably the most well-known and widely applied application of web scraping. What enterprise could resist the potential to generate 10,000 leads in 10 minutes? Andrew Fogg, Chief Data Officer and co-founder of Import.io, explains how to use web scraping to generate sales leads en masse in his article at Sales Hacker.
Web scraping provides a much higher quality of leads than the old technique of buying databases full of phone numbers and email addresses. The quantity is there, but without important “inside information” about the names in the data it’s impossible to sift out the hot prospects from the cold. Web scraping can be used to change all that.
As Fogg explains, quantity and quality are both enhanced when web-based data is the source tapped using a simple 3-step procedure:
1. Develop your ideal user (prospect) and locate the websites where they can be found
2. Use an API (application programming interface) which extracts important data about each prospect
3. Collect the data in a spreadsheet containing names and contact information
The ideal user defined in step 1 is the key to quality leads. The web scraping tools can filter through the massive amounts of bulk information on the internet, extracting only the specific and relevant data using a set of your company’s pre-defined parameters.
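The three-step flow above can be sketched end to end in a few lines: express the ideal-user profile as filter parameters, extract only the matching prospects and the fields you need, and write the result to a spreadsheet-ready CSV. The prospect records and filter thresholds below are invented for illustration.

```python
# Minimal sketch of the 3-step lead-generation flow: filter scraped prospect
# records against an "ideal user" profile, extract key fields, write a CSV.
import csv
import io

scraped_prospects = [
    {"name": "Acme Retail", "industry": "retail", "employees": 120, "email": "ops@acme.example"},
    {"name": "Tiny Shop",   "industry": "retail", "employees": 4,   "email": "hi@tiny.example"},
    {"name": "MedCo",       "industry": "health", "employees": 300, "email": "info@medco.example"},
]

# Step 1: the ideal-user profile, expressed as filter parameters.
IDEAL = {"industry": "retail", "min_employees": 50}

# Step 2: extract only matching prospects and the fields the sales team needs.
leads = [
    {"name": p["name"], "email": p["email"]}
    for p in scraped_prospects
    if p["industry"] == IDEAL["industry"]
    and p["employees"] >= IDEAL["min_employees"]
]

# Step 3: write the leads to CSV for the sales pipeline.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
writer.writeheader()
writer.writerows(leads)
print(buffer.getvalue().strip())
```

Only the prospect that matches both the industry and the size threshold survives the filter, which is precisely why the ideal-user definition in step 1 drives lead quality.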
Ranking on the almighty SERP, or Search Engine Results Page, is fundamental to success in today’s competitive business environment, and SEO plays a key role in the online marketing world. Online reviews carry more authority with consumers today than a word-of-mouth recommendation from someone they know personally. SERP ranking is significantly affected by a website’s authority as assessed by Google’s search engine algorithms, which take into account the number of backlinks to a site, the relevance of the keywords users are searching, and the queries, or “long-tail keywords”, which are answered by informative content at the site.
One of the most popular web scraping SEO software suites in the digital marketing field is ScrapeBox. Web scraping functions allow users to:
Harvest thousands of URLs from Google, Bing, Yahoo, and 30 other search engines. Use these to research competitors and to locate new blogs where you can post comments about your product or service.
Post comments with backlinks to your website on dozens of relevant platforms. Backlinks are one of the most effective ways to boost SEO and ScrapeBox’s trainable poster can post thousands of comments in minutes.
Harvest the top keywords, scraped from sources such as Google Suggest, to create thousands of long-tailed keywords tailored to boost your ranking for maximum SEO impact.
As we mentioned above, web scraping is a highly effective technique to gain a competitive edge over rival enterprises. That means that you’ll want to keep your web scraping programs confidential, and that means you need a reliable proxy service to mask your machine’s IP address. Our proxies work with any kind of software which supports HTTP or SOCKS, and we’ve tested them in support of a wide range of popular web scraping tools.
ProxyRack serves more than 50,000,000 page requests and powers some of the largest data mining companies on the web, in data mining operations spanning 3 continents. When you’re ready to give your enterprise the Big Data competitive edge that web scraping provides, don’t hesitate to contact us for the proxy services and technology which support Big Data extraction performance.