What is Web Scraping?
Web scraping is a technique to extract a large amount of data from a website and display it or store it in a file for further use.
It is used to crawl and extract the required data from a static website or a JS rendered website.
A few terms to get familiar with:
- Nokogiri:
- Uses CSS selectors or XPath for web scraping.
- Capybara:
- Allows JS-based interaction with the websites.
- Kimurai:
- It is a web scraping framework in Ruby.
- Combination of Nokogiri + Capybara.
- Allows scraping JS-rendered websites as well as static pages fetched with plain HTTP requests.
There are a few tools available for web scraping in Ruby, such as Nokogiri, Capybara and Kimurai. Of these, Kimurai is the most full-featured, because it builds on the other two.
Kimurai
Kimurai is a web scraping framework in Ruby that works out of the box with headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and lets us scrape and interact with JavaScript-rendered websites.
Features:
- Scrapes JS-based websites.
- Supports Headless Chrome, Headless Firefox, PhantomJS or simple HTTP requests (mechanize) as engines.
- Uses Capybara methods to fetch data.
- Rich library of built-in helpers to make scraping easy.
- Parallel scraping: processes web pages concurrently.
- Pipelines: organize and store data in one place for processing by all spiders.
You can scrape data from JS-rendered websites, such as infinite-scroll pages, as well as from static websites. Amazing, right?
Static Websites:
You can use this framework in 2 ways:
- Making a Rails app and extracting information with the help of models and controllers.
- Create a new Rails app.
rails _5.2.3_ new web_scrapping_demo --database=postgresql
- Change the database configuration in config/database.yml as required to run in the development environment.
- Open the terminal and create a database for the web app:
rails db:create
- Add gem 'kimurai' to the Gemfile.
- Install the dependencies using:
bundle install
- Generate a model using the below command with the parent as Kimurai::Base instead of ApplicationRecord:
rails g model WebScrapper --parent Kimurai::Base
- Perform database migrations for this generated model.
rails db:migrate
- Generate a controller using:
rails g controller WebScrappersController index
- Make a root path for the index action:
root 'web_scrappers#index'
- Add routes for the WebScrapper model:
resources :web_scrappers
- Add a link to the index.html.erb file as shown below:
<%= link_to 'Start Scrap', new_web_scrapper_path %>
- Now add an action in the WebScrappersController to perform scraping:
def new
  WebScrapper.crawl!
end
Note: Here, crawl! performs a full run of the spider. The parse method is very important and must be present in every spider, because it is the entry point of any spider.
- Now add the configuration for the website you want to scrape to the model.
Here,
- @name = name of the spider/web scraper
- @engine = specifies which of the supported engines to use
- @start_urls = array of start URLs, processed one by one inside the parse method
- @config = optional; provides custom configuration such as user_agent, delay, etc.
Note: Several engines are supported. The mechanize engine needs no extra configuration or installation and works for simple HTTP requests, but it does not execute JavaScript. The other engines (selenium_chrome, poltergeist_phantomjs and selenium_firefox) render JavaScript and run in headless mode.
- Add the parse method to the model to initiate the scraping process.
Here, in the parse method,
- response = Nokogiri::HTML::Document object for the requested website.
- url = String URL of the processed web page.
- data = used to pass data between two requests.
The data to be fetched from the website is selected using XPath and then structured as required.
- Open the terminal and run the application using:
rails s
- Click on the link 'Start Scrap'.
- The results will be saved in the results.json file using the save_to helper of the gem.
- Now check the stored JSON file; it contains the scraped data.
Hooray !! You have extracted information from the static website.
- Making a simple ruby file for extracting the information.
- Open the terminal and install kimurai using the below-mentioned command:
gem install kimurai
- You can refer to the code written for the generated model and make a ruby file using it.
- Run that ruby file using:
ruby filename.rb
Dynamic Websites / JS rendered websites:
Pre-requisites:
Install browsers with web drivers:
For Ubuntu 18.04:
- For automatic installation, use the setup command:
$ kimurai setup localhost --local --ask-sudo
Note: It works using Ansible. If not installed, install using:
$ sudo apt install ansible
- Firstly, install basic tools:
sudo apt install -q -y unzip wget tar openssl
sudo apt install -q -y xvfb
- For manual installation, follow the commands for the specific browsers.
You can use this framework in 2 ways:
- Making a Rails app and extracting information with the help of models and controllers.
- Follow all the steps listed above for static websites.
- Change the @engine from :mechanize to :selenium_chrome to use the Chrome driver for scraping.
- Also, change the parse method in the model to get the desired output.
- Making a simple ruby file for extracting the information.
- Open the terminal and install kimurai using the below-mentioned command:
gem install kimurai
- You can refer to the code written for the generated model in the section of the dynamic website and make a ruby file using it.
- Run that ruby file using:
ruby filename.rb
You can find the whole source code here.
Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2 (C) and xerces (Java).
Guiding Principles
Some guiding principles Nokogiri tries to follow:
- be secure-by-default by treating all documents as untrusted by default
- be a thin-as-reasonable layer on top of the underlying parsers, and don't attempt to fix behavioral differences between the parsers
Features Overview
- DOM Parser for XML and HTML4
- SAX Parser for XML and HTML4
- Push Parser for XML and HTML4
- Document search via XPath 1.0
- Document search via CSS3 selectors, with some jquery-like extensions
- XSD Schema validation
- XSLT transformation
- 'Builder' DSL for XML and HTML documents
Support, Getting Help, and Reporting Issues
All official documentation is posted at https://nokogiri.org (the source for which is at https://github.com/sparklemotion/nokogiri.org/, and we welcome contributions).
Consider subscribing to Tidelift which provides license assurances and timely security notifications for your open source dependencies, including Nokogiri. Tidelift subscriptions also help the Nokogiri maintainers fund our automated testing which in turn allows us to ship releases, bugfixes, and security updates more often.
Reading
Your first stops for learning more about Nokogiri should be:
- An excellent community-maintained Cheat Sheet
Ask For Help
There are a few ways to ask exploratory questions:
- The Ruby Discord chat server is active at https://discord.gg/UyQnKrT
- The Nokogiri mailing list is active at https://groups.google.com/group/nokogiri-talk
- Open an issue using the 'Help Request' template at https://github.com/sparklemotion/nokogiri/issues
Please do not mail the maintainers at their personal addresses.
Report A Bug
The Nokogiri bug tracker is at https://github.com/sparklemotion/nokogiri/issues
Please use the 'Bug Report' or 'Installation Difficulties' templates.
Security and Vulnerability Reporting
Please report vulnerabilities at https://hackerone.com/nokogiri
Full information and description of our security policy is in SECURITY.md
Semantic Versioning Policy
Nokogiri follows Semantic Versioning (since 2017 or so).
We bump Major.Minor.Patch versions following this guidance:
Major: (we've never done this)
- Significant backwards-incompatible changes to the public API that would require rewriting existing application code.
- Some examples of backwards-incompatible changes we might someday consider for a Major release are at ROADMAP.md.
Minor:
- Features and bugfixes.
- Updating packaged libraries for non-security-related reasons.
- Dropping support for EOLed Ruby versions. Some folks find this objectionable, but SemVer says this is OK if the public API hasn't changed.
- Backwards-incompatible changes to internal or private methods and constants. These are detailed in the 'Changes' section of each changelog entry.
Patch:
- Bugfixes.
- Security updates.
- Updating packaged libraries for security-related reasons.
Installation
Requirements:
- Ruby >= 2.5
- JRuby >= 9.2.0.0
Native Gems: Faster, more reliable installation
'Native gems' contain pre-compiled libraries for a specific machine architecture. On supported platforms, this removes the need for compiling the C extension and the packaged libraries, or for system dependencies to exist. This results in much faster installation and more reliable installation, which as you probably know are the biggest headaches for Nokogiri users.
Supported Platforms
As of v1.11.0, Nokogiri ships pre-compiled, 'native' gems for the following platforms:
- Linux: x86-linux and x86_64-linux (req: glibc >= 2.17), including musl platforms like Alpine
- Darwin/MacOS: x86_64-darwin and arm64-darwin
- Windows: x86-mingw32 and x64-mingw32
- Java: any platform running JRuby 9.2 or higher
To determine whether your system supports one of these gems, look at the output of bundle platform or ruby -e 'puts Gem::Platform.local.to_s'.
If you're on a supported platform, either gem install or bundle install should install a native gem without any additional action on your part. This installation should only take a few seconds.
Other Installation Options
Because Nokogiri is a C extension, it requires that you have a C compiler toolchain, Ruby development header files, and some system dependencies installed.
The following may work for you if you have an appropriately-configured system:
gem install nokogiri
If you have any issues, please visit Installing Nokogiri for more complete instructions and troubleshooting.
How To Use Nokogiri
Nokogiri is a large library, and so it's challenging to briefly summarize it. We've tried to provide long, real-world examples at Tutorials.
Parsing and Querying
Here is example usage for parsing and querying a document:
Encoding
Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return a string containing markup (like to_xml, to_html and inner_html) will return a string encoded like the source document.
WARNING
Some documents declare one encoding, but actually use a different one. In these cases, which encoding should the parser choose?
Data is just a stream of bytes. Humans add meaning to that stream. Any particular set of bytes could be valid characters in multiple encodings, so detecting encoding with 100% accuracy is not possible. libxml2 does its best, but it can't be right all the time.
If you want Nokogiri to handle the document encoding properly, your best bet is to explicitly set the encoding. Here is an example of explicitly setting the encoding to EUC-JP on the parser:
Technical Overview
Guiding Principles
As noted above, two guiding principles of the software are:
- be secure-by-default by treating all documents as untrusted by default
- be a thin-as-reasonable layer on top of the underlying parsers, and don't attempt to fix behavioral differences between the parsers
Notably, despite all parsers being standards-compliant, there are behavioral inconsistencies between the parsers used in the CRuby and JRuby implementations, and Nokogiri does not and should not attempt to remove these inconsistencies. Instead, we surface these differences in the test suite when they are important/semantic; or we intentionally write tests to depend only on the important/semantic bits (omitting whitespace from regex matchers on results, for example).
CRuby
The Ruby (a.k.a., CRuby, MRI, YARV) implementation is a C extension that depends on libxml2 and libxslt (which in turn depend on zlib and possibly libiconv).
These dependencies are met by default by Nokogiri's packaged versions of the libxml2 and libxslt source code, but a configuration option --use-system-libraries is provided to allow specification of alternative library locations. See Installing Nokogiri for full documentation.
We provide native gems by pre-compiling libxml2 and libxslt (and potentially zlib and libiconv) and packaging them into the gem file. In this case, no compilation is necessary at installation time, which leads to faster and more reliable installation.
See LICENSE-DEPENDENCIES.md for more information on which dependencies are provided in which native and source gems.
JRuby
The Java (a.k.a. JRuby) implementation is a Java extension that depends primarily on Xerces and NekoHTML for parsing, though additional dependencies are on isorelax, nekodtd, jing, serializer, xalan-j, and xml-apis.
These dependencies are provided by pre-compiled jar files packaged in the java platform gem.
See LICENSE-DEPENDENCIES.md for more information on which dependencies are provided in which native and source gems.
Contributing
See CONTRIBUTING.md for an intro guide to developing Nokogiri.
Code of Conduct
We've adopted the Contributor Covenant code of conduct, which you can read in full in CODE_OF_CONDUCT.md.
License
This project is licensed under the terms of the MIT license.
See this license at LICENSE.md.
Dependencies
Some additional libraries may be distributed with your version of Nokogiri. Please see LICENSE-DEPENDENCIES.md for a discussion of the variations as well as the licenses thereof.
Authors
- Mike Dalessio
- Aaron Patterson
- Yoko Harada
- Akinori MUSHA
- John Shahid
- Karol Bucek
- Sam Ruby
- Craig Barnes
- Stephen Checkoway
- Lars Kanis
- Sergio Arbeo
- Timothy Elliott
- Nobuyoshi Nakada