Web Scraping: Jaunt vs Jsoup

I recently found out that there is a new player in the game of web scraping with Java. It is called Jaunt and is developed by Tom Cervenka.

I have worked a lot with Jsoup, so the question arose how it differs from Jaunt. No article on the web satisfied me, so I decided to write my own.

I also exchanged emails with Tom so that the information here is correct and nothing important is missing.

Licensing / Pricing

Let's start with a very important point for most developers: licensing and pricing.

Jaunt

The website states that Jaunt is a free Java library. That is only part of the truth. Jaunt is a commercial library that comes in several versions: paid ones and a free one that has to be downloaded again every month.

Tom wrote me the whole story about licensing in an email response:

The free version can be used for personal or commercial projects, including redistributing the jar file, sublicensing, etc. In terms of the restrictions it places on the code itself, it is actually more liberal than the MIT license used by Jsoup, which dictates that a developer's code must include the MIT copyright/permission notice in all copies or substantial portions of the software.

The non-free, two-year version of Jaunt, on the other hand, is far more restrictive than the MIT license, since it does NOT allow redistribution of the jar file. That makes it well suited for server-side projects and/or any personal project that is not distributed.

The enterprise version is a business license (generally covering an unlimited number of employees) who's terms are completely negotiable, so it may or may not allow redistribution, depending on the needs of the client.

— From Tom Cervenka via Email

If you use the free version, it expires at the end of each month. That means you have to download a new free version every month, replace the old jar file in your project, recompile, and redeploy. If that is too much work or simply not possible, you have to pay for a license.

This is what happens if you forget to replace your jar file, so make sure you don't:

I am using Jaunt api and my software stop working and it says "JAUNT HAS EXPIRED! [http://jaunt-api.com]".

— From Hector Herranz in the Jaunt Google Groups
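One way to avoid that surprise is a small reminder that runs at startup or in the build. The following is only a sketch using the JDK's java.time API; it assumes, as described above, that the free jar stops working once the current calendar month is over. The class name, method names, and the five-day warning window are made up for illustration and are not part of Jaunt.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

class JauntExpiryReminder {

    // Days left until the last day of the month that contains `today`.
    // Assumption: the free Jaunt jar expires when that month ends.
    public static long daysUntilMonthEnd(LocalDate today) {
        LocalDate lastDay = YearMonth.from(today).atEndOfMonth();
        return ChronoUnit.DAYS.between(today, lastDay);
    }

    // True when the remaining days fall within the warning window.
    public static boolean shouldWarn(LocalDate today, long warnDays) {
        return daysUntilMonthEnd(today) <= warnDays;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        if (shouldWarn(today, 5)) {
            System.out.println("Free Jaunt jar expires in "
                + daysUntilMonthEnd(today) + " day(s); download the new one.");
        }
    }
}
```

This only reminds you; it does not download or replace the jar file.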

At the time of writing, a two-year license costs $24 and a non-expiring license costs $950. All available pricing options are listed here: Jaunt pricing

Jsoup

Jsoup is an Open Source project developed by Jonathan Hedley available under the MIT license. This allows you to use it in any project (personal and commercial) free of charge.

You can also look at the source code on Github.

Similarities

Both libraries share a common set of features. Before we go into detail about the differences, let's quickly list the similarities:

  • Downloading and parsing dirty HTML
  • Parsing HTML from String or File
  • Built in proxy support
  • Setting headers and Cookies
  • DOM traversal or selectors
  • Manipulate HTML elements (attributes, text, html)
  • File uploading and downloading
  • Form submissions (GET, POST)
  • Pretty print HTML
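
To make a few of these shared features concrete on the Jsoup side (parsing dirty HTML from a String, selecting an element, and changing its attribute and text), here is a minimal sketch. The class name, method name, and the example.com URL are hypothetical; the Jaunt equivalent is not shown here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulateSketch {

    // Parses possibly malformed HTML, rewrites the first link, and
    // returns the HTML of the resulting <body>.
    public static String rewriteFirstLink(String html) {
        Document doc = Jsoup.parse(html);        // tolerates unclosed tags
        Element link = doc.select("a").first();  // null if there is no <a>
        if (link != null) {
            link.attr("href", "https://example.com/new"); // set an attribute
            link.text("New label");                        // replace the text
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The <a> and <p> tags are deliberately left unclosed.
        System.out.println(rewriteFirstLink("<p>Hello <a href='/old'>old label"));
    }
}
```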

Differences: Jaunt vs Jsoup

Now for the main part of this article: the differences between the two libraries, along with the features that are unique to each.

Selector syntax

To select elements, Jsoup uses standard CSS selectors, whereas Jaunt has its own syntax. I asked Tom why Jaunt does not implement CSS selectors. Here is his answer:

Jaunt does not to support CSS selectors or XPath because I consider the Jaunt querying syntax more readable. The user doesn't have to think about CSS selectors, then maybe XPath, and then something else for querying JSON. With Jaunt, the query looks like the thing you're trying to find, so there's very little cognitive load when building or reading a query.

— From Tom Cervenka via Email
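
For comparison, here is a small sketch of what a CSS selector looks like in Jsoup. It runs against a hard-coded, made-up HTML string instead of a live page, and the class and method names are only for illustration. The selector "h3.r a" is the same one used in the Google example later in this article, where the Jaunt query for the same elements is written as findEvery("<h3 class=r>").findEvery("<a>").

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {

    // Hypothetical markup that mimics a list of search results.
    static final String HTML =
        "<div>"
        + "<h3 class=\"r\"><a href=\"http://example.com/one\">One</a></h3>"
        + "<h3 class=\"r\"><a href=\"http://example.com/two\">Two</a></h3>"
        + "<h3 class=\"other\"><a href=\"http://example.com/skip\">Skip</a></h3>"
        + "</div>";

    // Collects the href of every <a> nested inside an <h3 class="r">.
    public static List<String> resultLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("h3.r a")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        resultLinks(HTML).forEach(System.out::println);
    }
}
```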

Availability through Maven Central

Jsoup is available through Maven Central. Jaunt, on the other hand, is not, because of the expiring license. To use Jaunt with Maven you have to download the jar file and install it into your local repository.

Installing the Jaunt jar-file to your local repository can be done like this:

mvn install:install-file -Dfile=<path-to-file> -DgroupId=com.jaunt-api \
    -DartifactId=jaunt -Dversion=<version> -Dpackaging=jar

Popularity on Stack Overflow

At the time of writing, I looked at the number of questions tagged for each library on Stack Overflow. This is not a complete analysis of popularity, but it gives you a hint. The results were:

  • Jaunt has 26 questions tagged with jaunt-api
  • Jsoup has 4,191 questions tagged with jsoup

Jaunt: Unique features

  • Save complete web page to disk out of the box (including img, css, js)
  • Supports working with REST APIs (GET, POST, PUT, DELETE)
  • Parsing and traversing JSON (including selectors)
  • Web pagination discovery out of the box
  • Customizable caching & content handlers
  • Provides higher level abstractions for working with Forms and Tables

Jsoup: Unique features

  • Fully supports CSS selectors
  • Sanitize HTML in an easy way out of the box (predefined whitelists, etc.)
    Note: You can do that with Jaunt too, but you have to define your own filters

Code snippet comparison: Get the top 10 Google search results

Let's look at a simple code example that prints the URLs of the top 10 Google search results to the console. The search term we use is apple.

This simple example is not enough to cover all the differences, but it gives you a hint, and that is what it's meant to do.

This is the Jaunt version:

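// Assumes the Jaunt jar is on the classpath (import com.jaunt.*;)
final UserAgent userAgent = new UserAgent();
userAgent.visit("http://google.com");    // load the search page
userAgent.doc.apply("apple");            // fill in the search field
userAgent.doc.submit("Google Search");   // press the button with this label

// Every <a> inside every <h3 class="r"> is a result link
for (Element link : userAgent.doc.findEvery("<h3 class=r>").findEvery("<a>")) {
    System.out.println(link.getAt("href"));
}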

This is the Jsoup version:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Google blocks the default user agent of Jsoup
final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";

final Document doc = Jsoup.connect("https://google.com/search?q=apple")
                          .userAgent(USER_AGENT)
                          .get();

// Each result link sits inside an <h3 class="r">
for (Element result : doc.select("h3.r a")) {
    System.out.println(result.attr("href"));
}

Conclusion

In general, Jaunt seems to work at a higher level of abstraction than Jsoup. It also provides some neat features out of the box (such as caching and web pagination discovery). In addition to its HTML features, Jaunt has built-in support for REST APIs and JSON. With the facts listed above you can now form your own opinion about which route to take. If I forgot something important, don't hesitate to write me.

Personally, I stick with Jsoup for HTML processing and use Unirest for working with RESTful APIs and JSON. Together they provide nearly all of Jaunt's features. If I need a cache, I either build one with Google Guava or use a full-blown solution like memcached or Redis. I also like CSS selector syntax much more than Jaunt's custom syntax.

If you liked this article, please share it with your friends and leave a comment below.

Patrick Meier

I am an entrepreneur and software developer, building scalable, distributed web systems with Java, NodeJs and AngularJs.

Weiden, Germany