Web scraping is the act of pulling data directly from a website by parsing the HTML of the page itself. Instead of going through the tedious process of extracting data by hand, web scraping uses automation to retrieve large numbers of data points from any number of websites.
- Using a web scraping tool is the easiest and the cheapest way to collect information from Google. However, if anyone attempts to scrape the search results, Google can block their IP addresses.
- If you are looking for bulk search or building some service around it, you can look into Zenserp. Zenserp is a Google search API that solves the problems involved with scraping search engine result pages; when scraping result pages yourself, you will run into proxy management issues quite quickly.
- Web scraping with Python is easy due to the many useful libraries available; a barebones installation isn't enough on its own, but Python's large selection of scraping libraries fills the gap. For this Python web scraping tutorial, we'll be using three important libraries: BeautifulSoup v4, Pandas, and Selenium.
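As a minimal illustration of how two of those libraries fit together, the sketch below parses a results page that has already been saved to disk and collects titles and URLs into a DataFrame. The file name and the CSS selectors are assumptions for illustration only; Google's markup changes frequently, so a real page may need different selectors.

```python
# A rough sketch: parse a Google results page that was saved to disk beforehand
# (e.g. via "Save page as..." in a browser, or fetched with Selenium).
# The file name and CSS selectors are assumptions -- Google's markup changes often.
from bs4 import BeautifulSoup
import pandas as pd

with open("serp.html", encoding="utf-8") as f:       # hypothetical saved results page
    soup = BeautifulSoup(f.read(), "html.parser")

rows = []
for result in soup.select("div.g"):                  # "div.g" is a commonly used result container
    title = result.select_one("h3")
    link = result.select_one("a")
    if title and link and link.get("href"):
        rows.append({"title": title.get_text(strip=True), "url": link["href"]})

df = pd.DataFrame(rows)
print(df.head())
```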
Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing or Yahoo. This is a specific form of screen scraping or web scraping dedicated to search engines only.
Most commonly, larger search engine optimization (SEO) providers depend on regularly scraping keywords from search engines, especially Google, to monitor the competitive position of their customers' websites for relevant keywords or their indexing status.
Search engines like Google do not allow any sort of automated access to their service,[1] but from a legal point of view there is no known court case or specific law that such scraping has broken.
The process of entering a website and extracting data in an automated fashion is also often called 'crawling'. Search engines like Google, Bing or Yahoo get almost all their data from automated crawling bots.
Difficulties
Google is by far the largest search engine, with the most users and the most advertising revenue, which makes Google the most important search engine to scrape for SEO-related companies.[2]
Google does not take legal action against scraping, likely for self-protective reasons. However, Google uses a range of defensive methods that make scraping its results a challenging task.
- Google tests the User-Agent (browser type) of HTTP requests and serves a different page depending on it, automatically rejecting User-Agents that seem to originate from a possible automated bot. [Part of the Google error page: Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html ] A typical example is the command-line browser cURL: Google will simply refuse to serve pages to it, while Bing is a bit more forgiving and does not seem to care about User-Agents (a small comparison sketch follows this list).[3]
- Google uses a complex system of request rate limitation which differs for each language, country, and User-Agent, as well as depending on the keyword and keyword search parameters. This rate limitation makes automated access unpredictable, because the behaviour patterns are not known to the outside developer or user.
- Network and IP limitations are also part of the scraping defense systems. Search engines cannot easily be tricked simply by switching to another IP, and using proxies is a very important part of successful scraping. The diversity and abuse history of an IP matter as well.
- Offending IPs and offending IP networks can easily be stored in a blacklist database to detect offenders much faster. The fact that most ISPs give dynamic IP addresses to customers requires that such automated bans be only temporary, to not block innocent users.
- Behaviour based detection is the most difficult defense system. Search engines serve their pages to millions of users every day, which provides a large amount of behaviour information. A scraping script or bot does not behave like a real user: aside from non-typical access times, delays and session times, the keywords being harvested might be related to each other or include unusual parameters. Google, for example, has a very sophisticated behaviour analysis system, possibly using deep learning software to detect unusual patterns of access. It can detect unusual activity much faster than other search engines.[4]
- HTML markup changes: depending on the methods used to harvest the content of a website, even a small change in the HTML can render a scraping tool broken until it is updated.
- General changes in detection systems. In recent years search engines have tightened their detection systems nearly month by month, making it more and more difficult to scrape reliably, as developers need to experiment with and adapt their code regularly.[5]
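The User-Agent check mentioned in the first point can be illustrated with a small comparison: the same query sent once with a tool-like User-Agent and once with a browser-like one. This is only a sketch, not a guarantee of any particular outcome; the actual response (a 403, a captcha page, or a plain results page) varies by IP, region, and time.

```python
# Sketch: compare responses for a "bot-looking" and a "browser-looking" User-Agent.
# Outcomes are not guaranteed -- Google's behaviour differs by IP, region and time.
import requests

url = "https://www.google.com/search"
params = {"q": "web scraping", "hl": "en"}

bot_headers = {"User-Agent": "curl/8.5.0"}          # looks like a command-line tool
browser_headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

for name, headers in [("bot-like", bot_headers), ("browser-like", browser_headers)]:
    resp = requests.get(url, params=params, headers=headers, timeout=10)
    print(f"{name:12s} -> HTTP {resp.status_code}, {len(resp.text)} bytes")
```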
Detection
When a search engine's defense system suspects that an access might be automated, it can react in several ways.
The first layer of defense is a captcha page,[6] where the user is prompted to verify that they are a real person and not a bot or tool. Solving the captcha creates a cookie that permits access to the search engine again for a while. After about one day the captcha page is removed again.
The second layer of defense is a similar error page, but without a captcha; in that case the user is completely blocked from using the search engine until the temporary block is lifted or the user changes their IP.
The third layer of defense is a long-term block of the entire network segment. Google has blocked large network blocks for months. This sort of block is likely triggered by an administrator and only happens if a scraping tool is sending a very high number of requests.
All these forms of detection may also happen to a normal user, especially users sharing the same IP address or network class (IPv4 ranges as well as IPv6 ranges).
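In practice, a scraper has to recognize these block and captcha responses before it can react to them. The markers used in the sketch below (the /sorry/ redirect path, HTTP 429, and the "unusual traffic" wording) are commonly reported behaviours of Google's block page, but they should be treated as assumptions that can change at any time.

```python
# Sketch: classify a response as "blocked/challenged" or "normal".
# The markers checked here are assumptions based on commonly reported behaviour.
import requests

def looks_blocked(resp: requests.Response) -> bool:
    if resp.status_code == 429:                  # "Too Many Requests"
        return True
    if "/sorry/" in resp.url:                    # interstitial captcha page path
        return True
    if "unusual traffic" in resp.text.lower():   # wording used on the block page
        return True
    return False

resp = requests.get(
    "https://www.google.com/search",
    params={"q": "site:example.com"},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
if looks_blocked(resp):
    print("Blocked or challenged -- back off, rotate IP, or solve the captcha.")
else:
    print("Looks like a normal results page.")
```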
Methods of scraping Google, Bing or Yahoo
To scrape a search engine successfully, the two major factors are time and volume.
The more keywords a user needs to scrape and the less time available for the job, the more difficult scraping becomes and the more sophisticated a scraping script or tool needs to be.
Scraping scripts need to overcome a few technical challenges:[7]
- IP rotation using Proxies (proxies should be unshared and not listed in blacklists)
- Proper time management: time between keyword changes, pagination, and correctly placed delays. Effective long-term scraping rates can vary from only 3–5 requests (keywords or pages) per hour up to 100 and more per hour for each IP address / proxy in use. The quality of IPs, methods of scraping, keywords requested and language/country requested can greatly affect the possible maximum rate (see the sketch after this list).
- Correct handling of URL parameters, cookies as well as HTTP headers to emulate a user with a typical browser[8]
- HTML DOM parsing (extracting URLs, descriptions, ranking position, sitelinks and other relevant data from the HTML code)
- Error handling, automated reaction on captcha or block pages and other unusual responses[9]
- Captcha detection and solving, as mentioned above[10]
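The first two points of the list, proxy rotation and randomized delays, can be combined in a fairly small loop. The sketch below is only an outline of the idea; the proxy URLs, keyword list, and timing values are placeholders, not recommendations.

```python
# Sketch: rotate through a pool of proxies and pause a random interval between queries.
# The proxy URLs, keywords and timing values are placeholders for illustration only.
import random
import time
import requests

PROXIES = [                                     # hypothetical, unshared proxies
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]
KEYWORDS = ["blue widgets", "red widgets", "green widgets"]
HEADERS = {"User-Agent": "Mozilla/5.0"}

for i, keyword in enumerate(KEYWORDS):
    proxy = PROXIES[i % len(PROXIES)]           # simple round-robin rotation
    try:
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": keyword},
            headers=HEADERS,
            proxies={"http": proxy, "https": proxy},
            timeout=15,
        )
        print(keyword, "->", resp.status_code)
    except requests.RequestException as exc:
        print(keyword, "-> request failed:", exc)
    time.sleep(random.uniform(20, 60))          # randomized delay between keywords
```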
An example of open source scraping software which makes use of the above mentioned techniques is GoogleScraper.[8] This framework controls browsers over the DevTools Protocol and makes it hard for Google to detect that the browser is automated.
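As a rough stand-in for that approach, the same idea of letting a real browser fetch and render the results page can be sketched with Selenium, one of the libraries mentioned earlier. This is not how GoogleScraper itself works internally, the "h3" selector is an assumption about Google's markup, and stock Selenium automation is still detectable.

```python
# Sketch: let a real browser fetch the results page, then read result titles from it.
# Requires Chrome/ChromeDriver; the "h3" selector is an assumption about Google's markup.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")            # run without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.google.com/search?q=web+scraping&hl=en")
    titles = [h.text for h in driver.find_elements(By.CSS_SELECTOR, "h3") if h.text]
    for title in titles:
        print(title)
finally:
    driver.quit()
```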
Programming languages
When developing a scraper for a search engine, almost any programming language can be used, although, depending on performance requirements, some languages will be more favorable than others.
PHP is a commonly used language for writing scraping scripts for websites or backend services, since it has powerful capabilities built in (DOM parsers, libcURL); however, its memory usage is typically around 10 times that of similar C/C++ code. Ruby on Rails as well as Python are also frequently used to automate scraping jobs. For the highest performance, C++ DOM parsers should be considered.
Additionally, bash scripting can be used together with cURL as a command line tool to scrape a search engine.
Tools and scripts
When developing a search engine scraper there are several existing tools and libraries available that can either be used, extended or just analyzed to learn from.
- iMacros – a free browser automation toolkit that can be used for very small volume scraping from within a user's browser[11]
- cURL – a command-line tool for automation and testing, as well as a powerful open source HTTP interaction library available for a large range of programming languages.[12]
- google-search - A Go package to scrape Google. [13]
- GoogleScraper – A Python module to scrape different search engines (like Google, Yandex, Bing, Duckduckgo, Baidu and others) by using proxies (socks4/5, http proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.[14]
- se-scraper - Successor of GoogleScraper. Scrape search engines concurrently with different proxies. [15]
Legal
When scraping websites and services, the legal side is often a big concern for companies; for web scraping it depends greatly on the country the scraping user/company is from, as well as which data or website is being scraped, and there have been many different court rulings all over the world.[16][17][18] However, when it comes to scraping search engines the situation is different: search engines usually do not list intellectual property, as they just repeat or summarize information they scraped from other websites.
The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was caught scraping unknown keywords from Google for its own, rather new Bing service.([19]) But even this incident did not result in a court case.
One possible reason might be that search engines like Google get almost all their data by scraping millions of publicly reachable websites, also without reading and accepting those sites' terms. A legal case won by Google against Microsoft would possibly put its whole business at risk.
References
- ^ 'Automated queries – Search Console Help'. support.google.com. Retrieved 2017-04-02.
- ^ 'Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly'. searchengineland.com. 11 February 2013.
- ^ 'why would curl and wget result in a 403 forbidden?'. unix.stackexchange.com.
- ^ 'Does Google know that I am using Tor Browser?'. tor.stackexchange.com.
- ^ 'Google Groups'. google.com.
- ^ 'My computer is sending automated queries – reCAPTCHA Help'. support.google.com. Retrieved 2017-04-02.
- ^ 'Scraping Google Ranks for Fun and Profit'. google-rank-checker.squabbel.com.
- ^ 'Python3 framework GoogleScraper'. scrapeulous.
- ^ Deniel Iblika (3 January 2018). 'De Online Marketing Diensten van DoubleSmart'. DoubleSmart (in Dutch). Diensten. Retrieved 16 January 2019.
- ^ Jan Janssen (26 September 2019). 'Online Marketing Services van SEO SNEL'. SEO SNEL (in Dutch). Services. Retrieved 26 September 2019.
- ^ 'iMacros to extract google results'. stackoverflow.com. Retrieved 2017-04-04.
- ^ 'libcurl - the multiprotocol file transfer library'. curl.haxx.se.
- ^ 'A Go package to scrape Google' – via GitHub.
- ^ 'A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.: NikolaiT/GoogleScraper'. 15 January 2019 – via GitHub.
- ^ Tschacher, Nikolai (2020-11-17), NikolaiT/se-scraper, retrieved 2020-11-19.
- ^ 'Is Web Scraping Legal?'. Icreon (blog).
- ^ 'Appeals court reverses hacker/troll 'weev' conviction and sentence [Updated]'. arstechnica.com.
- ^ 'Can Scraping Non-Infringing Content Become Copyright Infringement... Because Of How Scrapers Work?'. www.techdirt.com.
- ^ Singel, Ryan. 'Google Catches Bing Copying; Microsoft Says 'So What?''. Wired.
External links
- Scrapy – an open source Python framework, not dedicated to search engine scraping but regularly used as a base, with a large number of users.
- Compunect scraping sourcecode – a range of well known open source PHP scraping scripts, including a regularly maintained Google Search scraper for scraping advertisements and organic result pages.
- Justone free scraping scripts – information about Google scraping as well as open source PHP scripts (last updated mid-2016)
- Scraping.Services source code – Python and PHP open source classes for a third-party scraping API (updated January 2017, free for private use)
- PHP Simpledom – a widespread open source PHP DOM parser for turning HTML code into variables.
- SerpApi – a third-party service based in the United States allowing you to scrape search engines legally.
You can learn a lot about a search engine by scraping its results. It’s the only easy way you can get an hourly or daily record of exactly what Google, Bing or Yahoo! (you know, back when Yahoo! was a search engine company) show their users. It’s also the easiest way to track your keyword rankings.
Like it or not, whether you use a third-party tool or your own, if you practice SEO then you’re scraping search results.
If you follow a few simple rules, it’s a lot easier than you think.
The problem with scraping
Automated scraping — grabbing search results using your own 'bot' — violates every search engine's terms of service. Search engines sniff out and block major scrapers.
If you ever perform a series of searches that match the behavior of a SERP crawler, Google and Bing will interrupt your search with a captcha page. You have to enter the captcha or perform whatever test the page requires before performing another query.
That (supposedly) blocks bots and other scripts from automatically scraping lots of pages at once.
The reason? Resources. A single automated SERP scraper can perform tens, hundreds or even thousands of queries per second. The only limitations are bandwidth and processing power. Google doesn’t want to waste server cycles on a bunch of sweaty-palmed search geeks’ Python scripts. So, they block almost anything that looks like an automatic query.
Your job, if you ever did anything like this, which you wouldn’t, is to buy or create software that does not look like an automatic query. Here are a few tricks my friend told me:
Stay on the right side of the equation
Note that I said “almost” anything. The search engines aren’t naive. Google knows every SEO scrapes their results. So does Bing. Both engines have to decide when to block a scraper. Testing suggests that the equation behind the block/don’t-block decision balances:
- Potential server load created by the scraper.
- The potential load created by blocking the scraper.
- The query ‘space’ for the search phrase.
At a minimum, any SERP bot must have the potential of tying up server resources. If it doesn’t, the search engine won’t waste the CPU cycles. It’s not worth the effort required to block the bot.
So, if you’re scraping the SERPs, you need to stay on the right side of that equation: Be so unobtrusive that, even if you’re detected, you’re not worth squashing.
Disclaimer
Understand, now, that everything I talked about in this article is totally hypothetical. I certainly don’t scrape Google. That would violate their terms of service.
And I’m sure companies like AuthorityLabs and SERPBuddy have worked out special agreements for hourly scraping of the major search engines. But I have this… friend… who’s been experimenting a bit, testing what’s allowed and what’s not, see…
Cough.
How I did my test
I tested all these theories with three Python scripts. All of them:
- Perform a Google search.
- Download the first page of results.
- Then download the next 4 pages.
- Save the pages for parsing.
Script #1 had no shame. It hit Google as fast as possible and didn’t attempt to behave like a ‘normal’ web browser.
Script #2 was a little embarrassed. It pretended to be Mozilla Firefox and only queried Google once every 30 seconds.
Script #3 was downright bashful. It selected a random user agent from a list of 10, and paused between queries for anywhere between 15 and 60 seconds.
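The original scripts aren’t published here, but a hypothetical reconstruction of Script #3’s behavior — a random User-Agent chosen per query, a random 15–60 second pause, and the first five result pages per keyword — might look roughly like this. The User-Agent strings and parameters are illustrative only, not a recipe.

```python
# Hypothetical reconstruction of "Script #3": random User-Agent per query,
# random 15-60 second pauses, and the first 5 result pages per keyword.
# The User-Agent strings and values below are illustrative placeholders.
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/119.0 Safari/537.36",
]

def fetch_serp_pages(keyword, pages=5):
    """Download the first `pages` result pages for one keyword and return the raw HTML."""
    html_pages = []
    for page in range(pages):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": keyword, "start": page * 10},   # "start" paginates 10 results at a time
            headers=headers,
            timeout=15,
        )
        html_pages.append(resp.text)
        time.sleep(random.uniform(15, 60))               # bashful pause between queries
    return html_pages

pages = fetch_serp_pages("rank tracking")
print(f"saved {len(pages)} pages for parsing")
```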
The results
Script #3 did the best. That’s hardly a surprise. But the difference is:
- Script #1 was blocked within 3 searches.
- Script #2 was blocked within 10 searches.
- Script #3 was never blocked, and performed 150 searches. That means it pulled 5 pages of ranking data for 150 different keywords.
There’s no way any of these scripts fooled Google. The search engine had to know that scripts 1, 2 and 3 were all scrapers. But it only blocked 1 and 2.
My theory: Script 3 created so small a burden that it wasn’t worth it for Google to block it. Just as important, though, was the fact that script 3 didn’t make itself obvious. Detectable? Absolutely. I didn’t rotate IP addresses or do any other serious concealment. But script 3 behaved like it was, well, embarrassed. And a little contrition goes a long way. If you acknowledge you’re scraping on the square and behave yourself, Google may cut you some slack.
The rules
Based on all of this, here are my guidelines for scraping results:
- Scrape slowly. Don’t pound the crap out of Google or Bing. Make your script pause for at least 20 seconds between queries.
- Scrape randomly. Randomize the amount of time between queries.
- Be a browser. Have a list of typical user agents (browsers). Choose one of those randomly for each query.
Follow all three of these and you’re a well-behaved scraper. Even if Google and Bing figure out what you’re up to, they leave you alone: You’re not a burglar. You’re scouring the gutter for loose change. And they’re OK with that.