New Music Friday with Html Agility Pack

I am passionate about music and am always looking to see what's next from artists I know and artists I have not yet heard of. Most new albums are released on New Music Friday. One of my favourite sites to check for new music is Album of the Year.

The site lists upcoming albums with their release dates and their user and critic scores.

Html Agility Pack

Html Agility Pack is an HTML parser that lets you read and write the DOM using XPath. We are going to use this NuGet package in our script to scrape the new releases on Album of the Year (AOTY).

Album of the Year New Releases

The new releases page is at https://www.albumoftheyear.org/releases/. The first step is to navigate to the page and open the developer tools in the browser.

Inspecting the page structure shows how each album is laid out in the HTML:

  • Each album is contained in a div with the classes albumBlock and five
  • The date is contained in a div with the class date
  • The artist is contained in a div with the class artistTitle
  • The album title is contained in a div with the class albumTitle
  • Links to the artist and album pages are a elements nested inside the album div

Example HTML from AOTY

<div class="albumBlock five">
	<div class="date">Sep 20</div>
	<div class="image">
		<a href="/album/178044-the-number-twelve-looks-like-you-wild-gods.php">
			<picture>
				<source media="(min-width: 1024px)" data-srcset="https://cdn2.albumoftheyear.org/215x/album/178044-wild-gods.jpg 1x, https://cdn2.albumoftheyear.org/430x/album/178044-wild-gods.jpg 2x" srcset="https://cdn2.albumoftheyear.org/215x/album/178044-wild-gods.jpg 1x, https://cdn2.albumoftheyear.org/430x/album/178044-wild-gods.jpg 2x">
				<source media="(min-width: 481px) and (max-width: 1023px)" data-srcset="https://cdn2.albumoftheyear.org/275x/album/178044-wild-gods.jpg 1x, https://cdn2.albumoftheyear.org/550x/album/178044-wild-gods.jpg 2x" srcset="https://cdn2.albumoftheyear.org/275x/album/178044-wild-gods.jpg 1x, https://cdn2.albumoftheyear.org/550x/album/178044-wild-gods.jpg 2x">
				<source media="(min-width: 0px) and (max-width: 480px)" data-srcset="https://cdn2.albumoftheyear.org/215x/album/178044-wild-gods.jpg 1x, https://cdn2.albumoftheyear.org/430x/album/178044-wild-gods.jpg 2x" srcset="https://cdn2.albumoftheyear.org/215x/album/178044-wild-gods.jpg 1x, https://cdn2.albumoftheyear.org/430x/album/178044-wild-gods.jpg 2x">
				<img class=" lazyloaded" src="https://cdn.albumoftheyear.org/album/thumbs/178044-wild-gods.jpg" data-src="https://cdn.albumoftheyear.org/album/thumbs/178044-wild-gods.jpg" alt="The Number Twelve Looks Like You - Wild Gods">
			</picture>
		</a>
	</div>
	<a href="/artist/34755-the-number-twelve-looks-like-you/">
		<div class="artistTitle">The Number Twelve Looks Like You</div>
	</a>
		<a href="/album/178044-the-number-twelve-looks-like-you-wild-gods.php">
			<div class="albumTitle">Wild Gods</div>
		</a>
		<div class="ratingRowContainer">
		<div class="ratingRow">
			<div class="ratingBlock">
				<div class="rating">80</div>
			<div class="ratingBar green">
				<div class="green" style="width:80%;"></div></div></div><div class="ratingText">critic score</div> <div class="ratingText">(1)</div>
</div><div class="ratingRow"><div class="ratingBlock"><div class="rating">76</div><div class="ratingBar green"><div class="green" style="width:76%;"></div></div></div><div class="ratingText">user score</div> <div class="ratingText">(3)</div>
</div></div></div>

XPath

XPath lets you query XML and HTML documents using path expressions. You build a query from path steps and predicates, and you get back the matching node or nodes. Some resources for XPath are listed below.

  • XPather - Allows you to test XPath queries online with a sample set of data
  • DevHints - XPath - An XPath cheat sheet covering the syntax
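
To make the query syntax concrete, here is a minimal sketch that runs a few XPath queries through Html Agility Pack. The HTML fragment is made up for illustration (it only mirrors the class names on the releases page), and Console.WriteLine stands in for LINQPad's Dump so the snippet also runs as a plain C# program with the HtmlAgilityPack package referenced.

```csharp
using System;
using HtmlAgilityPack;

// A made-up fragment that mirrors the class names used on the releases page.
const string html = @"
<div class='albumBlock five'>
  <div class='date'>Sep 20</div>
  <a href='/artist/1-example-artist/'><div class='artistTitle'>Example Artist</div></a>
  <a href='/album/1-example-album.php'><div class='albumTitle'>Example Album</div></a>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// @class='...' compares the whole attribute value, so this finds nothing.
// Html Agility Pack returns null (not an empty list) when no node matches.
var exact = doc.DocumentNode.SelectNodes("//div[@class='albumBlock']");

// contains() does a substring match on the attribute value.
var partial = doc.DocumentNode.SelectNodes("//div[contains(@class, 'albumBlock')]");

// A predicate on a nested element: the div whose class is exactly 'albumTitle'.
var title = doc.DocumentNode.SelectSingleNode("//div[@class='albumTitle']").InnerText;

Console.WriteLine(exact == null);   // True
Console.WriteLine(partial.Count);   // 1
Console.WriteLine(title);           // Example Album
```

Because contains() is a plain substring test, it would also match a class such as albumBlockWide; for this page that looseness is probably acceptable, but it is worth keeping in mind.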

Scrape AOTY with XPath and Html Agility Pack

The script below is designed to run in LINQPad. It uses LINQPad's Dump() extension method to display the results and LINQPad's Hyperlinq type to render the links as clickable.

Script Structure

The method Main() is the entry point of the program and calls GetNewRelease(). Called without arguments, GetNewRelease() loads the first two pages from AOTY.

In GetNewRelease() the client is an instance of HtmlWeb. HtmlWeb comes from Html Agility Pack and is used to download a web page from a URL. To load a page you call HtmlWeb.Load(url), which returns an HtmlDocument that is ready to parse.

The HtmlDocument is passed to ProcessHtmlDocument(), which uses XPath to get the values for each album:

  • The // prefix selects matching nodes anywhere in the document, at any depth. Html Agility Pack returns each matching node together with all of its children. For instance, //div[contains(@class, 'albumBlock five')] selects every div whose class attribute contains the text albumBlock five.
  • ./a[1]/@href selects the first hyperlink that is a direct child of the current node. The leading period makes the query relative to the current node. If you omit the period from a // query, the search starts from the document root and returns the first match in the whole document rather than one inside the current node.
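
The difference the leading period makes is easy to miss, so here is a small sketch of it. The two album blocks are made-up data, and Console.WriteLine is used instead of Dump so the snippet runs outside LINQPad as well.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Two made-up album blocks, trimmed down from the structure shown earlier.
const string html = @"
<div class='albumBlock five'><div class='date'>Sep 20</div><a href='/artist/1-first/'><div class='artistTitle'>First Artist</div></a></div>
<div class='albumBlock five'><div class='date'>Sep 27</div><a href='/artist/2-second/'><div class='artistTitle'>Second Artist</div></a></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var relativeDates = new List<string>();
var documentWideDates = new List<string>();

foreach (var album in doc.DocumentNode.SelectNodes("//div[contains(@class, 'albumBlock five')]"))
{
    // Leading period: search only inside the current album block.
    relativeDates.Add(album.SelectSingleNode(".//*[@class='date']").InnerText);

    // No leading period: the search restarts from the document root,
    // so every album gets the first date on the page.
    documentWideDates.Add(album.SelectSingleNode("//*[@class='date']").InnerText);
}

Console.WriteLine(string.Join(", ", relativeDates));     // Sep 20, Sep 27
Console.WriteLine(string.Join(", ", documentWideDates)); // Sep 20, Sep 20
```

When every scraped row ends up with the same value for a field, a missing leading period is usually the first thing to check.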

Scrape AOTY

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

void Main()
{
	GetNewRelease();
}

public void GetNewRelease(int noOfPages = 2)
{
	var client = new HtmlWeb();
	var albums = new List<Album>();

	// Use <= so that noOfPages pages are loaded (pages 1..noOfPages)
	for (int i = 1; i <= noOfPages; i++)
	{
		var currentUrl = GetUrl(i);
		var doc = client.Load(currentUrl);
		albums.AddRange(ProcessHtmlDocument(doc));
	}

	var orderedAndGrouped = albums.OrderBy(c => c.ArtistTitle)
		.GroupBy(c => c.Date)
		.OrderByDescending(c => c.Key);

	// Output script results
	orderedAndGrouped.Dump();
}

public List<Album> ProcessHtmlDocument (HtmlDocument doc)
{
	Hyperlinq BuildLink(string v) => new Hyperlinq($"https://www.albumoftheyear.org{v}");

	var lst = new List<Album>();
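	var albumNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'albumBlock five')]");
	// SelectNodes returns null when nothing matches, so bail out before the loop
	if (albumNodes == null) return lst;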

	foreach (var albumNode in albumNodes)
	{
		// Leading period keeps the query relative to this album block
		var date = albumNode.SelectSingleNode(".//*[@class='date']").InnerText;
		var artistTitle = albumNode.SelectSingleNode(".//*[@class='artistTitle']").InnerText;
		var albumTitle = albumNode.SelectSingleNode(".//*[@class='albumTitle']").InnerText;

		var artistUrl = BuildLink(albumNode.SelectSingleNode("./a[1]/@href").Attributes.FirstOrDefault().Value ?? string.Empty);
		var albumUrl = BuildLink(albumNode.SelectSingleNode("./a[2]/@href").Attributes.FirstOrDefault().Value ?? string.Empty);
		//var imageUrl = albumNode.SelectSingleNode("//img[1]/@src").Attributes.FirstOrDefault().Value ?? string.Empty;

		var newRelease = new Album {
			Date = date,
			ArtistTitle = artistTitle,
			AlbumTitle = albumTitle,
			ArtistPage = artistUrl,
			AlbumPage = albumUrl,
			//Image = new Hyperlinq(imageUrl)
			};

		lst.Add(newRelease);

	}
	return lst;
}

public class Album
{
	public string Date { get; set; }
	public string ArtistTitle {get;set;}
	public string AlbumTitle {get;set;}
	public Hyperlinq ArtistPage {get;set;}
	public Hyperlinq AlbumPage {get;set;}
	//public Hyperlinq Image {get;set;}

}


string GetUrl (int page = 0) => page == 0
	? "https://www.albumoftheyear.org/releases/"
	: $"https://www.albumoftheyear.org/releases/{page}/";
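
The script assumes every album block contains each element it looks for; if one is missing, SelectSingleNode returns null and the script throws. As a hedged sketch of a more defensive variant, the helpers below (TextOrEmpty and HrefOrEmpty are hypothetical names, not part of Html Agility Pack) fall back to an empty string instead. The HTML is made up, with the second block deliberately missing its links.

```csharp
using System;
using HtmlAgilityPack;

// One block with an artist link, one deliberately missing it.
const string html = @"
<div class='albumBlock five'><a href='/artist/1-example-artist/'><div class='artistTitle'>Example Artist</div></a></div>
<div class='albumBlock five'><div class='date'>Sep 27</div></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var blocks = doc.DocumentNode.SelectNodes("//div[contains(@class, 'albumBlock five')]");

Console.WriteLine(TextOrEmpty(blocks[0], ".//*[@class='artistTitle']")); // Example Artist
Console.WriteLine(HrefOrEmpty(blocks[0], "./a[1]"));                     // /artist/1-example-artist/
Console.WriteLine(HrefOrEmpty(blocks[1], "./a[1]") == string.Empty);     // True

// Hypothetical helper: inner text of the first match, or "" when nothing matches.
static string TextOrEmpty(HtmlNode node, string xpath) =>
    node.SelectSingleNode(xpath)?.InnerText.Trim() ?? string.Empty;

// Hypothetical helper: href of the first matching element, or "" when it is missing.
static string HrefOrEmpty(HtmlNode node, string xpath) =>
    node.SelectSingleNode(xpath)?.GetAttributeValue("href", string.Empty) ?? string.Empty;
```

GetAttributeValue reads the attribute by name, which avoids relying on href being the first attribute on the link.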

Output

Below is sample output from the script above.

Date | Artist | Title | ArtistPage | AlbumPage
May 23 | Badly Drawn Boy | Banana Skin Shoes | https://www.albumoftheyear.org/artist/554-badly-drawn-boy/ | https://www.albumoftheyear.org/album/228828-badly-drawn-boy-banana-skin-shoes.php
May 23 | Tove Lo | Sunshine Kitty (Paw Prints Edition) | https://www.albumoftheyear.org/artist/6437-tove-lo/ | https://www.albumoftheyear.org/album/241776-tove-lo-sunshine-kitty-paw-prints-edition.php
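
One thing to be aware of in the grouping step: Date is a string such as "May 23", so OrderByDescending compares the keys alphabetically rather than chronologically. The sketch below shows one possible way around that by parsing the key into a DateTime. The sample rows are made up, and it assumes every date uses the "MMM d" format and falls in the same year, since the page text carries no year.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical sample rows standing in for scraped albums.
var albums = new[]
{
    (Date: "Sep 20", Artist: "Second Artist"),
    (Date: "Oct 4",  Artist: "Third Artist"),
    (Date: "Sep 20", Artist: "First Artist"),
};

// Same shape as the script: sort by artist, group by date, newest group first.
// The key is parsed so that "Oct 4" sorts after "Sep 20"; as plain strings it would not.
var grouped = albums
    .OrderBy(a => a.Artist)
    .GroupBy(a => DateTime.ParseExact(a.Date, "MMM d", CultureInfo.InvariantCulture))
    .OrderByDescending(g => g.Key)
    .ToList();

foreach (var g in grouped)
{
    var day = g.Key.ToString("MMM d", CultureInfo.InvariantCulture);
    Console.WriteLine($"{day}: {string.Join(", ", g.Select(a => a.Artist))}");
}
// Oct 4: Third Artist
// Sep 20: First Artist, Second Artist
```

In LINQPad you could call grouped.Dump() instead of the loop to get the same table-style output as above.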