Build your own page recommendation engine with Scala

What's this all about?

I wanted a simple recommender that suggests new, possibly interesting pages on the web. Its input should be my existing bookmark library. I currently save all my articles and bookmarks in Evernote, so it would be nice if the recommender could take the data directly from there, without converting, exporting, or importing anything.

I didn't find a simple solution to that problem, so I decided to build my own. The first result is a wrapper for the Delicious feeds API called deliciousfeeds4j. The second result is this post.

How does it work?

The idea is simple:

  1. find some users who bookmarked the same pages as I did
  2. find the ones who are most similar to me, based on the number of bookmarks we share
  3. suggest all bookmarks from those top-n most similar users which I don't know about yet

To make this all work I use the data from Delicious through their API.
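
The three steps above can be sketched without any external service. This is a minimal illustration, not the code used later in this post: the object name RecommendationSketch and the in-memory map of users to bookmarks are assumptions standing in for the Delicious data.

```scala
object RecommendationSketch {

  // Hypothetical stand-in for the Delicious data: user -> bookmarked URLs
  type Bookmarks = Map[String, Set[String]]

  def recommend(myUrls: Set[String], others: Bookmarks, topN: Int): Set[String] = {
    // Step 1: keep only the users who share at least one bookmark with me
    val sharedCounts: Seq[(String, Int)] = others.toSeq
      .map { case (user, urls) => (user, (urls & myUrls).size) }
      .filter { case (_, shared) => shared > 0 }

    // Step 2: most shared bookmarks first (ties broken by name), take the top n
    val topUsers: Seq[String] = sharedCounts
      .sortBy { case (user, shared) => (-shared, user) }
      .take(topN)
      .map(_._1)

    // Step 3: collect their bookmarks and drop the ones I already know
    val candidates: Set[String] = topUsers.flatMap(user => others(user)).toSet
    candidates -- myUrls
  }
}
```

With my bookmarks Set("a", "b") and the users alice -> Set("a", "b", "x"), bob -> Set("a", "y"), a topN of 1 picks only alice and suggests "x".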

Show me the code!

Note: I have tried to make this as simple as possible, so there are many things that could be done much better. This code is only meant to show the basic idea, nothing else!

The Starter object ties everything together. It loads the URLs from your Evernote account or, if you prefer, from a simple text file with one URL per line. Then the PageRecommender kicks in and finds some new URLs to look at. Finally, everything is printed to the console.

object Starter extends App {

  //find the URLs stored in your Evernote account (you have to set up some things for this to work!)
  val urls = EvernoteURLRetriever.findAllUrls()

  //Uncomment this if you want to load your urls from a text-file instead...
  //val urls = FileURLRetriever.readUrlsFromFile("res/test-urls.txt")

  //use the 20 best-matching users for the recommendations
  val pageRecommender: PageRecommender = new DeliciousUserBasedPageRecommender(20)

  //get recommendations...
  val recommendations = pageRecommender.recommend(urls)

  println("\n\nFound some new urls: ")

  recommendations.foreach(println)

  //Here some further processing can be done -> save to file, group by url-authority, etc.
}
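
The last comment in Starter mentions grouping by URL authority as a possible follow-up step. A minimal sketch of that could look like the following. The object UrlGrouping and its method names are made up for this example, and it groups by host only (the port is ignored).

```scala
import java.net.URI
import scala.util.Try

object UrlGrouping {

  // Extracts the host of a URL; None if the URL cannot be parsed or has no host
  def hostOf(url: String): Option[String] =
    Try(new URI(url).getHost).toOption.flatMap(host => Option(host))

  // Groups the URLs by host; hosts with the most URLs come first, ties sorted by name
  def groupByHost(urls: Iterable[String]): Seq[(String, Seq[String])] =
    urls.toSeq
      .flatMap(url => hostOf(url).map(host => (host, url)))
      .groupBy(_._1)
      .toSeq
      .map { case (host, pairs) => (host, pairs.map(_._2)) }
      .sortBy { case (host, group) => (-group.size, host) }
}
```

Inside Starter this could be used as UrlGrouping.groupByHost(recommendations).foreach(println). URLs that cannot be parsed are silently dropped.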

Here's the code from the EvernoteURLRetriever.

import com.evernote.auth.{EvernoteService, EvernoteAuth}  
import com.evernote.clients.ClientFactory  
import com.evernote.edam.`type`.Note  
import com.evernote.edam.notestore.NoteFilter  
import scala.collection.JavaConversions._  
import scala.collection.mutable.ListBuffer

object EvernoteURLRetriever {

  //This is fine for testing - no harm can be done since it's only the sandbox...
  private val developerToken = "YOUR_SANDBOX_DEVELOPER_TOKEN"
  private val evernoteServiceType = EvernoteService.SANDBOX

  //Uncomment this if you want to use your real account...
  //private val developerToken = "YOUR_PRODUCTION_DEVELOPER_TOKEN"
  //private val evernoteServiceType = EvernoteService.PRODUCTION

  /**
   * Finds all notes from all notebooks which have a source-url starting with 'http'.
   * @return found urls
   */
  def findAllUrls(): Traversable[String] = {

    //build the authentication
    val evernoteAuth = new EvernoteAuth(evernoteServiceType, developerToken)

    //get the note-store...
    val factory = new ClientFactory(evernoteAuth)
    val noteStore = factory.createNoteStoreClient()

    //build a filter for all notes with a source-url set...
    val filter = new NoteFilter()
    filter.setWords("sourceURL:http*")

    println("Evernote - searching for notes with source-url...")

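    //ask only for the total number of matching notes first...
    val totalFound = noteStore.findNotes(filter, 0, 0).getTotalNotes

    val notes = ListBuffer[Note]()

    //...then fetch the notes page by page (50 per request)
    var morePages = true
    while (morePages && notes.size < totalFound) {
      val page = noteStore.findNotes(filter, notes.size, 50).getNotes
      notes ++= page
      //stop on an empty page, otherwise the loop would never end
      morePages = !page.isEmpty
    }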

    //now get the urls
    val urls = notes.map(_.getAttributes.getSourceURL)

    println("Evernote - found %s urls..." format urls.size)

    urls
  }
}
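
The FileURLRetriever that Starter mentions as an alternative is not listed in this post. The following is only a guess at how such a reader could look, assuming one URL per line. The object name FileURLRetrieverSketch and the helper parseUrls are made up, and the real implementation may differ.

```scala
import scala.io.Source

object FileURLRetrieverSketch {

  // Trims every line and keeps only the ones that look like a URL
  def parseUrls(lines: Iterator[String]): List[String] =
    lines.map(_.trim).filter(_.startsWith("http")).toList

  // Reads the file eagerly, so the source can be closed right afterwards
  def readUrlsFromFile(path: String): List[String] = {
    val source = Source.fromFile(path, "UTF-8")
    try parseUrls(source.getLines()) finally source.close()
  }
}
```

Blank lines and anything that does not start with "http" are skipped, which mirrors the 'http' filter used by the EvernoteURLRetriever.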

Here's the code from the PageRecommender trait.

trait PageRecommender {

  /**
   * Gets some urls and returns the recommended ones - based on the given data.
   *
   * @param urls - base data for recommendation
   * @return recommended urls
   */
  def recommend(urls: Traversable[String]): Traversable[String]
}

Here's the code from the DeliciousUserBasedPageRecommender class. It does all the heavy work of finding some new and possibly interesting URLs.

import com.delicious.deliciousfeeds4J.beans.Bookmark  
import com.delicious.deliciousfeeds4J.DeliciousFeeds  
import com.google.common.collect.{Multisets, HashMultiset, Multiset}  
import org.apache.commons.lang.StringUtils.isEmpty  
import scala.collection.JavaConversions._  
import scala.collection.mutable

class DeliciousUserBasedPageRecommender(val topNUsers: Int) extends PageRecommender {

  private val deliciousFeeds = new DeliciousFeeds
  deliciousFeeds.setExpandUrls(true)

  /**
   * Gets some urls and returns the recommended ones - based on the given data.
   *
   * @param urls - base data for recommendation
   * @return recommended urls
   */
  def recommend(urls: Traversable[String]): Traversable[String] = {

    //find all users who bookmarked the same urls, store them in multiset to find most similar ones
    val userMultiset: Multiset[String] = HashMultiset.create()

    for (url <- urls) {
      getBookmarksByUrl(url) match {
        case Some(bookmarks) => bookmarks.foreach(b => if (!isEmpty(b.getUser)) userMultiset.add(b.getUser))
        case None =>
      }
    }

    //size() would count every occurrence, so count the distinct users instead
    println("Recommender - found %s similar users, taking the top %s...".format(userMultiset.elementSet().size, topNUsers))

    val recommendedUrls = new mutable.HashSet[String]

    //take the topN most similar users
    val similarUsers = take(topNUsers, userMultiset)

    println("Recommender - searching for other urls from those similar users...")

    //find all urls from the most similar users
    for (user <- similarUsers) {
      getBookmarksByUser(user) match {
        case Some(bookmarks) => bookmarks.foreach(recommendedUrls add _.getUrl)
        case None =>
      }
    }

    //remove the ones you already know
    urls.foreach(recommendedUrls.remove)

    println("Recommender - found %s recommended urls!" format recommendedUrls.size)

    recommendedUrls
  }

  private def getBookmarksByUrl(url: String): Option[Traversable[Bookmark]] = try {
    val bookmarks = deliciousFeeds.findBookmarksByUrl(10, url)

    if (bookmarks != null) Some(bookmarks)
    else None
  } catch {
    case e: Exception =>
      e.printStackTrace()
      None
  }

  private def getBookmarksByUser(user: String): Option[Traversable[Bookmark]] = try {
    val bookmarks = deliciousFeeds.findBookmarksByUser(100, user)

    if (bookmarks != null) Some(bookmarks)
    else None
  } catch {
    case e: Exception =>
      e.printStackTrace()
      None
  }

  private def take[T](count: Int, multiset: Multiset[T]) = {
    val sortedMultiset = Multisets.copyHighestCountFirst(multiset).elementSet().toList
    sortedMultiset.take(count)
  }
}
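
The take helper relies on Guava's Multiset to count how often each user shows up and to order the users by that count. For readers without Guava on the classpath, roughly the same idea can be written with plain Scala collections. This is a sketch under that assumption, not a drop-in replacement, and the object name TopN is made up.

```scala
object TopN {

  // Returns the n most frequent items, most frequent first.
  // The order of items with the same count is not defined.
  def topN[T](items: Seq[T], n: Int): Seq[T] =
    items
      .groupBy(identity)
      .toSeq
      .map { case (item, occurrences) => (item, occurrences.size) }
      .sortBy { case (_, count) => -count }
      .take(n)
      .map(_._1)
}
```

For the user names bob, alice, bob, carol, bob, alice and n = 2 this yields bob (3 occurrences) followed by alice (2 occurrences).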

If you want to use the EvernoteURLRetriever

Before you can start, get your developer token...

Then edit the EvernoteURLRetriever object. You're done :-)

Get the Code

The complete source code with setup instructions can be found on Github.

How to improve this?

As I said earlier, this is only intended to show the basic idea. Many improvements are possible.

One that comes to mind is to use the recommendation engine from the Apache Mahout project, which should improve the quality of the recommendations.
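
A much smaller step in the same direction, without Mahout, would be to rank the suggested URLs instead of returning them as an unordered set. The sketch below is one possible way to do that and not Mahout's algorithm: every similar user votes for their bookmarks, weighted by the number of bookmarks we share. The object WeightedRanking and its parameters are made up for this example.

```scala
object WeightedRanking {

  // similarity: user -> number of bookmarks shared with me
  // bookmarks:  user -> all URLs bookmarked by that user
  // known:      the URLs I already have
  def rank(similarity: Map[String, Int],
           bookmarks: Map[String, Set[String]],
           known: Set[String]): Seq[(String, Int)] = {

    // every unknown URL of a similar user gets a vote worth that user's weight
    val votes: Seq[(String, Int)] = for {
      (user, weight) <- similarity.toSeq
      url <- bookmarks.getOrElse(user, Set.empty[String]).toSeq
      if !known(url)
    } yield (url, weight)

    // sum the votes per URL, highest score first, ties sorted by URL
    votes
      .groupBy(_._1)
      .toSeq
      .map { case (url, urlVotes) => (url, urlVotes.map(_._2).sum) }
      .sortBy { case (url, score) => (-score, url) }
  }
}
```

A URL bookmarked by several similar users ends up higher in the list than one that only a single, barely similar user saved.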

What I did with that - some numbers

I used this with my Evernote account, and the EvernoteURLRetriever found about 1100 URLs. With this data as input I got about 850 new pages as suggestions, which pointed to about 550 different domains.

In the end there were many interesting sites I did not know about. Of course, there was also a lot that is not interesting to me.

Patrick Meier

I am an entrepreneur and software developer, building scalable, distributed web systems with Java, NodeJs and AngularJs.

Weiden, Germany