Scraping Google with AngleSharp
Web scraping is the act of extracting data from websites. From a programming standpoint, it means performing an HTTP request and parsing the HTML response. This may involve taking care of various low-level details, like handling a stateful session or dealing with badly formed HTML.
Since the .NET community is a very vibrant one, we can take advantage of existing software components without reinventing the wheel (excuse the cliché). In my opinion AngleSharp is the best library to accomplish the aforesaid task. It’s so complete (including HTML DOM, CSS selector and JavaScript support) that I would call it a web scraping framework.
In this post I’ll show you how easy it is to submit a query to Google and parse the results.
Create a sample
Jump into your terminal and create a stub project:
$ cd your/projects/path
$ mkdir GoogleScraper
$ cd GoogleScraper
$ dotnet new console
Then add a reference to the main AngleSharp package:
$ dotnet add package AngleSharp --version 0.14.0-alpha-788
I said main because AngleSharp is a very large project and has other components, like AngleSharp.Io, which provides additional requesters and IO helpers.
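For instance, if you ever need to tweak the underlying HTTP stack (say, to set a proxy or cookies), AngleSharp.Io lets you plug in your own HttpClientHandler. Here is just a sketch of how that might look, not something we need for this post:
$ dotnet add package AngleSharp.Io
// sketch only: use AngleSharp.Io's HttpClient-based requesters with a custom handler
// (requires using System.Net.Http; exact extension names may vary between versions)
var handler = new HttpClientHandler { UseCookies = true };
var config = Configuration.Default.WithRequesters(handler).WithDefaultLoader();
var context = BrowsingContext.New(config);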
Basic usage
AngleSharp operates through a browsing context, defined by the IBrowsingContext interface. So the first thing to do in your Program.cs (after adding the required using statements, listed right after the snippet below) is to create such an instance. Let’s see how:
var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
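Speaking of using statements, these are roughly the directives the finished sample relies on (the exact set may vary slightly between AngleSharp versions); remember also to declare Main as static async Task Main(string[] args) so you can use await:
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;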
Now you have everything you need to access the Google website, and this is done using the OpenAsync method:
var document = await context.OpenAsync("https://www.google.com/");
The document variable is an instance of a type that implements IDocument. Via this interface you can extract data using both the HTML DOM and CSS selectors.
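Just to give an idea (not needed for the scraper itself), both styles operate on the same document object:
// HTML DOM: the page title exposed as a plain property
Console.WriteLine(document.Title);
// CSS selectors: count every anchor on the page
Console.WriteLine(document.QuerySelectorAll("a").Length);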
First, we need a reference to the search form:
var form = document.QuerySelector<IHtmlFormElement>("form[action='/search']");
The query passed to QuerySelector uses CSS syntax to select a form tag with a specific action attribute. Now we can submit the form by creating an anonymous object with a property that matches the name of the search form’s input element:
// we want to ask Google about Bill Gates
var result = await form.SubmitAsync(new { q = "bill gates" });
What remains is just extracting all the a tags (hyperlinks) and consuming the data we need:
var links = result.QuerySelectorAll<IHtmlAnchorElement>("a"); // CSS
foreach (var link in links) {
    var url = link.Attributes["href"]?.Value; // HTML DOM
    if (url != null) { // some anchors carry no href at all
        Console.WriteLine(url);
    }
}
The Attributes property is indexed like a dictionary and exposes the element’s attributes; indexing a missing attribute returns null, hence the check above.
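If you prefer the plain DOM methods, GetAttribute does the same job and simply returns null when the attribute is missing:
var url = link.GetAttribute("href"); // null if the anchor has no href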
The sample is available in this gist. It’s very basic and will also grab URLs of low interest, like the ones that are part of the web page UI.
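If you want to cut down that noise, a naive filter (just a sketch, and fragile against changes in Google’s markup) is to keep only absolute URLs that don’t point back to Google itself:
foreach (var link in links) {
    var url = link.GetAttribute("href");
    if (url != null && url.StartsWith("http") && !url.Contains("google")) {
        Console.WriteLine(url);
    }
}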
For a more complete Google scraper you can look at the source of this PickAll searcher. Based on AngleSharp, PickAll is a project of mine that makes it even easier to scrape results from multiple search engines.
Using PickAll, the whole job essentially boils down to one line of code:
var results = await SearchContext.Default.SearchAsync("bill gates");
foreach (var result in results) {
Console.WriteLine(result.Url);
}
Conclusion
As you can see, with some knowledge of the DOM and CSS it’s pretty easy, and I would add enjoyable, to scrape data using AngleSharp. Data gathered from the web can be archived, processed or stored in a database for further mining.
Now it’s up to you! Experiment and have fun.