Parse HTML in Java with XPath and Jsoup

In this tutorial, we explain how to parse and extract content from HTML source code. First we download a real document with the Apache HTTP client, then we parse it with a Java library called Xsoup. Xsoup adds XPath selectors on top of Jsoup, which is convenient when the content you want is easiest to describe as a path through the document's tags.

1) Download some HTML source code from the internet

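Let’s download the sitemap of our website, eazytutorial.com. Strictly speaking the sitemap is an XML file, but it is built from nested tags just like an HTML page, so it works well as an example. To download it we need an HTTP client like the one from Apache, so add these dependencies to your pom.xml: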

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.14</version>
</dependency>

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>

Then to download the HTML source code into a String variable, do:

String url = "https://eazytutorial.com/index.php/post-sitemap.xml";

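// try-with-resources closes the client and the response once the body has been read
String sitemap;
try (CloseableHttpClient client = HttpClients.createDefault();
     CloseableHttpResponse response = client.execute(new HttpGet(url))) {
    sitemap = EntityUtils.toString(response.getEntity());
}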

The sitemap looks like this:

[...]
	<url>
		<loc>https://eazytutorial.com/index.php/2021/08/25/extract-text-between-two-string-with-python-regex/</loc>
		<lastmod>2021-08-25T16:21:29+00:00</lastmod>
	</url>
	<url>
		<loc>https://eazytutorial.com/index.php/2021/08/17/extract-text-between-two-strings-in-scala/</loc>
		<lastmod>2021-08-25T16:22:52+00:00</lastmod>
	</url>
[...]
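
The download step above relies on Apache HttpClient. As a side note, here is a minimal sketch of the same request using the HTTP client built into the JDK (Java 11 or later), assuming you would rather not add a dependency. The class name SitemapDownloader and its two methods are hypothetical names chosen for this illustration. Unlike the snippet above, it also checks the status code before returning the body.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class, not part of the original tutorial.
class SitemapDownloader {

    // Builds a plain GET request for the given URL.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Sends the request and returns the response body as a String.
    // Throws an IOException when the server answers with anything other than 200.
    static String download(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String sitemap = download("https://eazytutorial.com/index.php/post-sitemap.xml");
        System.out.println(sitemap.length() + " characters downloaded");
    }
}
```

Note that the default JDK client does not follow redirects, so a redirected URL would end up in the exception branch of this sketch.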

2) Extract some content from the HTML source code using XPath

Let’s say we want to extract the list of our blog post URLs from the sitemap that we just downloaded. Looking at the excerpt above, every URL is the text of a <loc> tag that is itself nested inside a <url> tag. No other data follows this pattern in the source code, so it is a reliable way of getting all the blog post URLs and nothing else.

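Using XPath, this pattern can be written as:

//url/loc/text()

Here // matches <url> elements at any depth in the document, /loc selects their direct <loc> children, and text() returns the text inside each of them.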

Now, let’s apply this pattern to the HTML code to extract the URLs. For this, we need a library called Xsoup. So add this dependency to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>xsoup</artifactId>
    <version>0.3.2</version>
</dependency>

Finally, the extraction code is:

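String xpath = "//url/loc/text()";
// Jsoup's lenient HTML parser also accepts the sitemap's XML tags
Document document = Jsoup.parse(sitemap);
// Compile the XPath, evaluate it on the document and collect every match as a String
List<String> urls = Xsoup.compile(xpath).evaluate(document).list();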

If you print the URLs to the console, you will get:

https://eazytutorial.com/index.php/2021/08/25/extract-text-between-two-string-with-python-regex/
https://eazytutorial.com/index.php/2021/08/17/extract-text-between-two-strings-in-scala/

which is exactly what we were looking for.
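
As a cross-check of what the XPath expression selects, here is a minimal sketch that evaluates the same expression with the XPath engine that ships with the JDK (javax.xml.xpath) instead of Xsoup. It runs on the two entries from the excerpt above, wrapped in a hypothetical <root> element because the real top-level tag is elided in the excerpt. The class name SitemapXPathDemo and the method extractUrls are names made up for this illustration. Keep in mind that the JDK parser is strict and only accepts well-formed XML, whereas Jsoup also tolerates messy HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original tutorial.
class SitemapXPathDemo {

    // The two entries from the excerpt; <root> is an assumed wrapper element.
    static final String SAMPLE =
        "<root>"
        + "<url><loc>https://eazytutorial.com/index.php/2021/08/25/extract-text-between-two-string-with-python-regex/</loc>"
        + "<lastmod>2021-08-25T16:21:29+00:00</lastmod></url>"
        + "<url><loc>https://eazytutorial.com/index.php/2021/08/17/extract-text-between-two-strings-in-scala/</loc>"
        + "<lastmod>2021-08-25T16:22:52+00:00</lastmod></url>"
        + "</root>";

    // Returns the text of every <loc> that is a direct child of a <url>.
    static List<String> extractUrls(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//url/loc/text()", doc, XPathConstants.NODESET);
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                urls.add(nodes.item(i).getNodeValue().trim());
            }
            return urls;
        } catch (Exception e) {
            // Wrap parser and XPath errors so callers do not have to handle checked exceptions.
            throw new IllegalArgumentException("Could not evaluate the XPath on the given XML", e);
        }
    }

    public static void main(String[] args) {
        extractUrls(SAMPLE).forEach(System.out::println);
    }
}
```

The <lastmod> values are not returned, because the path only descends into <loc>.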

That’s it for this tutorial!