A guide to automated web scraping and data extraction using HTTP requests and web browsers

Web scraping vs. web crawling  

The Internet contains a vast amount of information, and web browsers display it in a structured way on web pages that let users easily navigate different sites and read their content. Pulling that information out of the underlying page code with a program is known as web crawling and web scraping.  

Processing a web page and extracting information out of it is web scraping. Web crawling is an iterative process of finding web links and downloading their content. A crawler typically performs both of these tasks, since finding new links entails scraping a web page.  
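To make the distinction concrete, here is a minimal sketch of how the two steps fit together. It assumes a hypothetical in-memory "site" (the Pages dictionary) instead of real downloads, and it pulls links out with a simple regular expression, which is good enough for illustration but not a robust way to parse HTML. The names CrawlSketch, ExtractLinks, and Crawl are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlSketch
{
    // Hypothetical site: URL -> HTML. A real crawler would download these pages.
    public static readonly Dictionary<string, string> Pages = new Dictionary<string, string>
    {
        ["/"] = "<a href=\"/a\">A</a> <a href=\"/b\">B</a>",
        ["/a"] = "<a href=\"/b\">B</a> <p>Page A</p>",
        ["/b"] = "<a href=\"/\">Home</a> <p>Page B</p>",
    };

    // Scraping step: extract the link targets from a single page.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in Regex.Matches(html, "href=\"([^\"]+)\""))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }

    // Crawling step: fetch a page, scrape its links, and repeat for unseen links.
    public static List<string> Crawl(string start)
    {
        var visited = new List<string>();
        var seen = new HashSet<string> { start };
        var queue = new Queue<string>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            var url = queue.Dequeue();
            if (!Pages.TryGetValue(url, out var html)) continue; // unknown page
            visited.Add(url);

            foreach (var link in ExtractLinks(html))
            {
                if (seen.Add(link)) queue.Enqueue(link); // only queue new links
            }
        }
        return visited;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Crawl("/"))); // prints "/, /a, /b"
    }
}
```

The seen set is what keeps the loop from revisiting pages that link back to each other.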

The terms are sometimes used interchangeably, and both deal with the process of extracting information. However, they perform different functions. How and where can that information be used? There are as many answers as there are web sites online, and more. This information can be a great resource to build applications around, and knowledge of writing such code can also be used for automated web testing.  

In this blog, I’ll cover two ways of scraping and crawling the web for data, along with the pros and cons of each:

  1. basic HTTP requests
  2. web browsers

Downloading web content with HTTP requests and web browsers 

As most everything is connected to the Internet these days, you will probably find a library for making HTTP requests in any programming language. Basic HTTP requests are fast. Web browsers, such as Firefox and Chrome, are slower, and for a good reason: they render styles and execute scripts on behalf of web pages, changing how pages act and look so that they are easy to read and use. That work is wasted when you don't need it. For example, if you're trying to extract text from a web page and download it as plain text, a simple HTTP request might suffice. However, many websites rely heavily on JavaScript and might not display some content unless it is executed. In this instance, using a browser eliminates some of the work when getting web content. 

Parsing with XPath and CSS 

Two commonly used ways of selecting content are XPath and CSS selectors. XPath is a query language used for selecting elements in documents such as XML and HTML. Such documents have a tree structure, and a query is written to follow that structure. Since CSS styles are applied on top of the HTML structure, CSS selectors are somewhat similar to XPath and are a way to select elements using a string pattern.  

Setting up the demo 

The environment 

C# and .NET Core 3.1 are used for these examples. The libraries used haven't changed much in a while and should also work on .NET Framework 4.x. 

The following examples can be cloned as a Git repository from https://github.com/devbridge/web-scraping-arcticle. 

The repository also contains a sample website (ASP.NET Core MVC application) which includes three pages:  

  1. Page with a simple table 
  2. Page with a 'hidden link' 
  3. Page with a button, which appears after time out 

I'll be using these to test different ways of extracting data. 

Using HTTP requests 

Create a request using the following code: 


var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";

You could just use the above code and be done. However, after executing the request, you might not receive what you've expected. The website might check for request headers and refuse to serve content if the request doesn’t meet its requirements.  

Usually it is enough to provide some identity for the request and describe what the response should look like: 


request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-us,en;q=0.5");
request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip,deflate");

After adding some headers, request a response and get a response stream. 


var response = (HttpWebResponse)request.GetResponse();
var responseStream = response.GetResponseStream();

Because the Accept-Encoding header advertises support for gzip and deflate, check whether the response content is compressed, and decompress it if so. 


if (response.ContentEncoding?.IndexOf("gzip", StringComparison.InvariantCultureIgnoreCase) >= 0)
{
    responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
}
else if (response.ContentEncoding?.IndexOf("deflate", StringComparison.InvariantCultureIgnoreCase) >= 0)
{
    responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);
}

Lastly, copy responseStream into a MemoryStream to get its byte array, which, decoded as UTF-8 text, is the HTML content. 


using var ms = new MemoryStream();
responseStream?.CopyTo(ms);

var htmlContent = Encoding.UTF8.GetString(ms.ToArray());

Wrap all of this into a single method, GetHtmlContent(string url), which the following examples call. 
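Put together, the snippets above might look like the following sketch. The HttpScraper class, the ReadBody helper, and the split into two methods are my own choices for illustration and are not necessarily how the sample repository does it. The sketch also assumes the page is UTF-8 encoded, as the snippet above does. Note that newer .NET versions flag WebRequest.Create as obsolete in favor of HttpClient, although it still works.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

public static class HttpScraper
{
    // Decompresses the body if the Content-Encoding says so, then decodes it.
    // Assumption: the text is UTF-8.
    public static string ReadBody(Stream body, string contentEncoding)
    {
        var encoding = contentEncoding ?? string.Empty;
        if (encoding.IndexOf("gzip", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            body = new GZipStream(body, CompressionMode.Decompress);
        }
        else if (encoding.IndexOf("deflate", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            body = new DeflateStream(body, CompressionMode.Decompress);
        }

        using (body)
        using (var ms = new MemoryStream())
        {
            body.CopyTo(ms);
            return Encoding.UTF8.GetString(ms.ToArray());
        }
    }

    // Sends a GET request with browser-like headers and returns the HTML.
    public static string GetHtmlContent(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-us,en;q=0.5");
        request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip,deflate");

        // Disposing the response releases the underlying connection.
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return ReadBody(response.GetResponseStream(), response.ContentEncoding);
        }
    }
}
```

Keeping the decompression in its own method means it can be checked against an in-memory stream without touching the network.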

XPath  

There's a library HtmlAgilityPack to select elements using XPath. If you're not running the sample projects, this library can be added as a NuGet package.  

To use it, take the following steps: 

  • Download HTML content 
  • Create HtmlDocument object from HtmlAgilityPack namespace 
  • Load HTML content into HtmlDocument object  

var content = GetHtmlContent("http://www.url.com");
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(content);

Now a single element or an array of elements can be selected. 


var nodeArray = htmlDocument.DocumentNode.SelectNodes("your_xpath_query");
var singleNode = htmlDocument.DocumentNode.SelectSingleNode("your_xpath_query");
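One detail worth guarding against when using these calls: to my knowledge, SelectNodes returns null rather than an empty collection when nothing matches, so looping over the result without a check can throw. A minimal sketch, assuming the HtmlAgilityPack NuGet package is installed; the inline HTML and the NullCheckDemo class are made up for the example.

```csharp
using System;
using HtmlAgilityPack;

public static class NullCheckDemo
{
    public static void Main()
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml("<table><tr><td>Entry #1</td><td>Odd</td></tr></table>");

        // There is no <th> in this fragment, so the query matches nothing.
        var nodes = htmlDocument.DocumentNode.SelectNodes("//th");
        if (nodes == null)
        {
            Console.WriteLine("No matches");
            return;
        }

        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```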

Notice the word node: HtmlAgilityPack represents HTML as a tree-like structure with many nodes. A node can be a single element. It can also be part of the layout and not be visible. For the purposes of this article, consider a node to be an element. 

The following queries are based on a simple HTML table. 


<html>
    ...
    <table>
        <tr>
            <th>Entry Number</th>
            <th>Row</th>
        </tr>
        <tr class="odd">
            <td>Entry #1</td>
            <td>Odd</td>
        </tr>
        <tr class="even1">
            <td>Entry #2</td>
            <td>Even</td>
        </tr>
        <tr class="odd">
            <td>Entry #3</td>
            <td>Odd</td>
        </tr>
        <tr class="even2">
            <td>Entry #4</td>
            <td>Even</td>
        </tr>
    </table>
    ...
</html>

Note the ... means omitted markup for brevity. 

If you're running the sample web site project, this table can be reached at <http://localhost:5000/scraper/table>.

I won't provide a deep dive into all of the XPath capabilities. However, here are a few examples of how to select elements, as most of the time, these techniques get the job done. 

XPath query resembles a file system path. <html> element is the root, as would be c:\ drive on Windows and just / on Linux. If you want to get a file in a documents folder, you could do so by writing c:\Users\user\Documents\File.txt. 

Now replace root drive with <html> and the rest of the file path with element names to get XPath. Given the HTML in the table above, a path to a table cell could be written as /html/table/tr/td. Since most websites are more complicated, with more elements than the example above, writing queries in such a way might be inefficient because it can result in a long, hard to read XPath. However, just like searching for files, you can use a wildcard (*.txt); there are a few tricks in XPath as well. 

XPath for selecting only table cells (<td> tags) can be written as //td. The double forward slash matches elements at any depth below the starting point, so it works as a path wildcard. Coming back to the file system analogy, you could think of it as *File.txt where all folders were replaced with the * wildcard. 


var nodes = htmlDocument.DocumentNode.SelectNodes("//td");

A single node object has a property InnerText, which I'll be using to print results to the console. I’ll also replace newline with an empty string to format each node's inner text on a single line.  


foreach (var node in nodes)
{
    Console.WriteLine(node.InnerText.Replace("\r\n", string.Empty));
}

The output looks like: 


Entry #1
Odd
Entry #2
Even
Entry #3
Odd
Entry #4
Even

Each line is the text selected from inside a <td> tag. I'll then select whole rows by using the <tr> tag.  


nodes = htmlDocument.DocumentNode.SelectNodes("//tr");

As you can see, I get the expected result. 


Entry Number        Row
Entry #1        Odd
Entry #2        Even
Entry #3        Odd
Entry #4        Even

However, if you’re after just the data in the table, this might not be ideal because I’ve selected table header as well. 

Looking at the table HTML, you can see the difference with the header not having a class attribute. 


...
<tr>
    <th>Entry Number</th>
    <th>Row</th>
</tr>
<tr class="odd">
    <td>Entry #1</td>
    <td>Odd</td>
</tr>
...

This can be used to write a more precise XPath query by specifying that you need only elements with a class attribute. 


nodes = htmlDocument.DocumentNode.SelectNodes("//tr[@class]");

Now the output is only rows without a header. 


Entry #1        Odd
Entry #2        Even
Entry #3        Odd
Entry #4        Even

The class attribute can be replaced with any other (id, src, etc.), and more logic can be added to this query. To select only odd rows, choose elements that have a class attribute with odd value. 


nodes = htmlDocument.DocumentNode.SelectNodes("//tr[@class='odd']");

The results look like: 


Entry #1        Odd
Entry #3        Odd

Skip element name altogether if you’re interested only in elements with a class value odd. 


nodes = htmlDocument.DocumentNode.SelectNodes("//*[@class='odd']");

This should have the same output as with the previous result. 


Entry #1        Odd
Entry #3        Odd

Even rows in the above HTML table example have class attributes; however, their values differ slightly by number at the end.  


<tr class="even1">
    <td>Entry #2</td>
    <td>Even</td>
</tr>
...
<tr class="even2">
    <td>Entry #4</td>
    <td>Even</td>
</tr>

 

A query can be written in several ways to select these in one go. 

One option is the pipe operator (|), which combines the results of two paths, much like OR: //tr[@class='even1'] | //tr[@class='even2']. This gets unwieldy quickly if you have even a few more paths, not to mention tens or hundreds. 

Another option would be to select only elements with class attributes that start with letter e since that would fit our requirement. 


nodes = htmlDocument.DocumentNode.SelectNodes("//tr[starts-with(@class, 'e')]");

You get the desired output of even rows. 


Entry #2        Even
Entry #4        Even
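If you'd like to try these queries without installing anything, the sketch below runs them with the System.Xml.XPath classes that ship with .NET instead of HtmlAgilityPack. This only works because the sample table happens to be well-formed XML; real-world HTML usually is not, which is why the article uses HtmlAgilityPack. The XPathQueries class and RowTexts helper are hypothetical names for this example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathQueries
{
    // The sample table, trimmed to the rows that matter.
    const string Table = @"<html><table>
<tr><th>Entry Number</th><th>Row</th></tr>
<tr class=""odd""><td>Entry #1</td><td>Odd</td></tr>
<tr class=""even1""><td>Entry #2</td><td>Even</td></tr>
<tr class=""odd""><td>Entry #3</td><td>Odd</td></tr>
<tr class=""even2""><td>Entry #4</td><td>Even</td></tr>
</table></html>";

    // Runs a row-selecting query and returns each row's cell text joined by a space.
    public static string[] RowTexts(string xpath)
    {
        return XDocument.Parse(Table)
            .XPathSelectElements(xpath)
            .Select(row => string.Join(" ", row.Elements().Select(cell => cell.Value)))
            .ToArray();
    }

    public static void Main()
    {
        var queries = new[]
        {
            "//tr",
            "//tr[@class]",
            "//tr[@class='odd']",
            "//tr[@class='even1'] | //tr[@class='even2']",
            "//tr[starts-with(@class, 'e')]",
        };

        foreach (var query in queries)
        {
            Console.WriteLine(query + " -> " + string.Join("; ", RowTexts(query)));
        }
    }
}
```

The last two queries should return the same two even rows, which shows that starts-with is a shorter way to express the union.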

JavaScript 

Selecting elements with XPath might not return the desired results. Some websites use JavaScript to update their HTML pages. Since requesting an HTML page with an HTTP request does not execute JavaScript, you're left with an unchanged document and missing elements. 

However, that's not the end of the story, as you can execute JavaScript in C# code. 

If you are running the sample web site, you can reach this example at: <http://localhost:5000/scraper/link> 

For this example, I am using an HTML document with a <script> tag in it, which contains a function that returns a secret link.  


<script>
    function secretLinkFunction(){
        return 'https://secret.link';
    };

    var secretLink = secretLinkFunction();
</script>

The secret link variable could be used to change the HTML document, and we're interested in the value of it.  

While there is more than one library to run JavaScript in C# code, I'll be using Jurassic, which can be downloaded as a NuGet package. 

To start off, download an HTML document to have it as a string in the code. 
(We're using the same method from XPath example to do so.) 

var content = GetHtmlContent("http://localhost:5000/scraper/link");

<script> element is just like any other, meaning you can select it using XPath. 


var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(content);

var scriptNode = htmlDocument.DocumentNode.SelectSingleNode("//script");

Along with property InnerText, used previously, scriptNode contains InnerHtml property, which in this case is plain JavaScript inside <script> tag. 


function secretLinkFunction(){
    return 'https://secret.link';
};

var secretLink = secretLinkFunction();

To run this code, create an object ScriptEngine from Jurassic namespace. 


var scriptEngine = new ScriptEngine();

And call Evaluate on InnerHtml of scriptNode. 


scriptEngine.Evaluate(scriptNode.InnerHtml);

The code has been executed, and in theory, there should now be a variable secretLink with a string value. To check whether that's true, ScriptEngine allows the code to get a global variable by name.  


var javascriptLink = scriptEngine.GetGlobalValue<string>("secretLink");
Console.WriteLine(javascriptLink);

The output should look like: 


https://secret.link

Using web browsers 

Downloading HTML content using a browser entails less code than using HTTP requests. However, it takes more time: the browser has to start up, and each page loads more slowly because the browser renders it with the provided styles and scripts.  

A WebDriver is required to download content using a web browser. This is a piece of software that provides a platform and language-neutral interface to control a web browser. Each browser has its own WebDriver.  

If you are using sample projects, build SeleniumScraper project. You'll find geckodriver for Firefox and chromedriver for Chrome in the build directory.  

Controlling WebDriver is done via Selenium, which is a framework for testing web applications. It provides domain-specific language to use it. I'll be using a C# wrapper written on top of it. 

To continue, you need an actual browser. The sample project is based on Firefox and Chrome, so install either one to run the examples. You can also use a portable version of a browser, which doesn't need to be installed on your machine; you just need to provide the path to its binary executable when instantiating the web driver object.  

SeleniumScraper projects have several dependencies. 


Selenium.WebDriver
Selenium.Firefox.WebDriver
Selenium.Chrome.WebDriver

These are needed to use Selenium in general and for specific browsers. This is actually enough to compile an application, but calling any method would result in an error, as the WebDriver software for Firefox/Chrome is missing. WebDrivers can be downloaded manually or by adding the following NuGet packages, which contain the respective driver executables.  


Selenium.Mozilla.Firefox.Webdriver
Selenium.WebDriver.ChromeDriver

Creating Firefox and Chrome driver instances in C# code follows the same pattern; use whichever pair matches your browser. 


// Firefox
var options = new FirefoxOptions();
var webDriver = new FirefoxDriver(options);

// Chrome (use instead of the Firefox lines above)
var options = new ChromeOptions();
var webDriver = new ChromeDriver(options);

These are different subclasses of the DriverOptions and RemoteWebDriver classes. 

Downloading web content uses one line of code. I’ll use the same table example I used for HTTP requests and XPath previously. 


webDriver.Navigate().GoToUrl("http://localhost:5000/scraper/table");

HTML content can now be accessed using PageSource property. 


webDriver.PageSource

Running WebDriver as-is will result in a browser window showing up. If you want to avoid that, you can add an argument to DriverOptions. The --headless argument stops the browser window from popping up, which makes no difference from the application's perspective. Headless browsers can be used in the same way as ones with a window; you can even take screenshots.   


options.AddArgument("--headless");

CSS Selectors 

CSS selectors are used to select an element or an array of elements. Browsers use them to decide which elements a style applies to. For this demonstration, I will use them to extract data. 

The following examples are based on the same HTML table featured in XPath section. 

With this structure: 


<html>
    ...
    <table>
        <tr>
            <th>Entry Number</th>
            <th>Row</th>
        </tr>
        <tr class="odd">
            <td>Entry #1</td>
            <td>Odd</td>
        </tr>
        <tr class="even1">
            <td>Entry #2</td>
            <td>Even</td>
        </tr>
        <tr class="odd">
            <td>Entry #3</td>
            <td>Odd</td>
        </tr>
        <tr class="even2">
            <td>Entry #4</td>
            <td>Even</td>
        </tr>
    </table>
    ...
</html>

Logic, when writing CSS selectors, is somewhat similar to XPath. You are defining a path to an element. It can be explicit and can also contain wildcards. 

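To select all rows in the example table by spelling out the path, a CSS selector might look like the one below. A space selects descendants at any depth. The stricter child combinator (>) would need every intermediate element listed, including the markup omitted above and the tbody element that browsers insert when they parse a table. 


html table tr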

It can also be shortened to this: 

tr

Either way, the output would be the same. 


Entry Number Row
Entry #1 Odd
Entry #2 Even
Entry #3 Odd
Entry #4 Even

 

The short form needs no extra syntax, whereas XPath requires two forward slashes to indicate a wildcard in the element path. Not only is it possible to select an element with a certain attribute, you can also specify the attribute's expected value. 

To avoid selecting the header from the example table, select all tr elements with a class attribute, as header lacks it. We can do so by calling the FindElementsByCssSelector method on a WebDriver.   


var elements = webDriver.FindElementsByCssSelector("tr[class]");

The WebDriver returns a collection of IWebElement objects, each of which has a Text property (the equivalent of InnerText used in HtmlAgilityPack). Reading it on each of the elements produces the following result: 


Entry #1 Odd
Entry #2 Even
Entry #3 Odd
Entry #4 Even

Selecting odd rows requires specifying a class value to be odd, as per the HTML table above. 


var elements = webDriver.FindElementsByCssSelector("tr[class='odd']");

Just like XPath, the symbol * can be used as a wildcard instead of the element's name.  


var elements = webDriver.FindElementsByCssSelector("*[class='odd']");

Both selectors present the same result. 


Entry #1 Odd
Entry #3 Odd

CSS selectors can also perform string checks, like starts with and contains. However, they use operator symbols instead of whole words. To select rows whose class attribute contains e, the following selector fits the job. 


var elements = webDriver.FindElementsByCssSelector("tr[class*='e']");

Notice * symbol after class. 

Printing the results shows only the even rows. 


Entry #2 Even
Entry #4 Even
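For completeness, here is a sketch of the whole browser flow in one place: start a headless Chrome, load the sample table page, and print the rows matched by each selector from this section. It assumes the Selenium.WebDriver package plus a chromedriver executable are available and that the sample site is running locally. It uses FindElements(By.CssSelector(...)) rather than FindElementsByCssSelector because, as far as I'm aware, newer Selenium releases (4.x) no longer ship the FindElementsBy... helpers, while the By form works in both. The CssSelectorDemo class is a made-up name.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class CssSelectorDemo
{
    public static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        // Disposing the driver closes the browser even if a query throws.
        using (var webDriver = new ChromeDriver(options))
        {
            webDriver.Navigate().GoToUrl("http://localhost:5000/scraper/table");

            var selectors = new[] { "tr[class]", "tr[class='odd']", "tr[class*='e']" };
            foreach (var selector in selectors)
            {
                Console.WriteLine(selector);
                foreach (IWebElement row in webDriver.FindElements(By.CssSelector(selector)))
                {
                    Console.WriteLine("  " + row.Text);
                }
            }
        }
    }
}
```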

JavaScript 

Although the browser executes JavaScript on its own and you don't need a script engine to run it yourself, it can still pose a problem. The reason is that when you download a web page and try to select an element, it might not be there yet. 

If you run the sample web page and navigate to <http://localhost:5000/scraper/button> you'll find a simple layout and some text. After waiting for 2 seconds, a button will appear. It has two attributes: href and class… 


<a href="https://secret.link" class="btn btn-primary">Secret Button</a>

…and is added using the following script: 


setTimeout(function (){
    var element = document.getElementById('container');

    var button = document.createElement('a');
    button.setAttribute('href', 'https://secret.link');
    button.setAttribute('class', 'btn btn-primary');
    button.innerHTML = 'Secret Button';

    element.appendChild(button);
}, 2000);

Instead of specifying the entire class for CSS selector, specify a string check for a class to start with btn. 


a[class^='btn']

Notice the ^ symbol after class. 

To make it cleaner, this selector can be replaced with just .btn, where . indicates the class attribute of an element. Strictly speaking, .btn matches any element whose class list includes the name btn rather than performing a starts-with check, which is sufficient here. The example project's method below uses the shorter selector.  


static IWebElement FindButton()
{
    return _webDriver.FindElementByCssSelector(".btn");
}
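The two selectors are not quite interchangeable, and the difference is easy to see with plain string checks. The sketch below does not run a CSS engine; it only mimics, under my reading of the selector rules, what a[class^='btn'] and .btn test on the raw class attribute value. ClassMatch, AttributeStartsWith, and HasClass are hypothetical names.

```csharp
using System;
using System.Linq;

public static class ClassMatch
{
    // Mimics [class^='btn']: the raw attribute string must begin with the prefix.
    public static bool AttributeStartsWith(string classAttribute, string prefix)
        => classAttribute.StartsWith(prefix, StringComparison.Ordinal);

    // Mimics .btn: one of the whitespace-separated class names must equal the name.
    public static bool HasClass(string classAttribute, string name)
        => classAttribute
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Contains(name);

    public static void Main()
    {
        foreach (var value in new[] { "btn btn-primary", "primary btn", "btn-primary" })
        {
            Console.WriteLine(
                $"'{value}': starts with btn = {AttributeStartsWith(value, "btn")}, " +
                $"has class btn = {HasClass(value, "btn")}");
        }
    }
}
```

For the sample button (class "btn btn-primary") both checks succeed, which is why either selector finds it.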

If you call FindButton right after downloading the HTML content of the sample page, it throws an exception noting there's no such element, because the button has not been added yet. The simplest fix is to wait 2 seconds by adding Thread.Sleep(2000), after which the thread continues and finds the button. Instead of hardcoding the wait time, this can be achieved in a more dynamic way. 

OpenQA.Selenium.Support.UI namespace contains a class WebDriverWait, which, as parameters, takes a WebDriver object and TimeSpan object that indicates time out when waiting for an element.  

Using a FindButton method, wait for a button to appear by writing the following:  


new WebDriverWait(_webDriver, TimeSpan.FromSeconds(10))
    .Until(_ =>
    {
        try
        {
            var button = FindButton();
            return button.Displayed;
        }
        catch
        {
            return false;
        }
    });

The time out is set to 10 seconds, but the button appears after 2 seconds at which point Until receives the value true and code continues execution. 
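Conceptually, a wait like this is a polling loop: try the condition, and if it isn't met yet, pause briefly and try again until the time out. The sketch below shows that idea with nothing but the standard library. It is not Selenium's implementation; for one thing, as far as I know WebDriverWait throws a time-out exception when the deadline passes, whereas this hypothetical WaitUntil helper simply returns false.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Polling
{
    // Calls condition repeatedly until it returns true or the timeout elapses.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                if (condition()) return true;
            }
            catch (Exception)
            {
                // Treat "not there yet" errors as a failed attempt and retry.
            }

            if (clock.Elapsed >= timeout) return false;
            Thread.Sleep(pollInterval);
        }
    }

    public static void Main()
    {
        // Stand-in for the button: becomes "visible" after 300 ms.
        var appearsAt = DateTime.UtcNow.AddMilliseconds(300);

        bool found = WaitUntil(
            () => DateTime.UtcNow >= appearsAt,
            TimeSpan.FromSeconds(5),
            TimeSpan.FromMilliseconds(50));

        Console.WriteLine(found);
    }
}
```

Compared with a fixed Thread.Sleep, the loop returns as soon as the condition holds and only pays the full time out when something is actually wrong.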

Now you can safely call FindButton outside of wait context. Knowing that it is there, try to get the value of attribute href. 


var button = FindButton();
var secretLink = button.GetAttribute("href");
Console.WriteLine(secretLink);

The button's secret is revealed, and you get the following result: 


https://secret.link/

Applying tactics to extract data from the web  

The techniques noted in this blog can be mixed and matched. Sometimes JavaScript on a web site is obfuscated so much that it is easier to let the browser execute it instead of using a script engine. At times XPath can prove the only way to extract data, as it can be used on XML documents as well as HTML, while CSS selectors cannot. Whatever the problem, you now have solutions at hand to effectively extract data from the web.