cjwebb.github.io

I write software, teach people, and investigate new technology

Learning Akka Streams


This blog post differs from my usual ones; I’m writing it as I learn something. As such, it is more of a story, containing errors and misunderstandings, than a factual blog post.

Hello World

This blog post follows me trying to learn how to use Akka Streams. I haven’t needed to use them before, and whenever I glance at the documentation, I usually get confused about just how many new terms are being introduced.

Runnable code is available here. I’ve included more type annotations than normal, as they will assist us in discussing what is going on.

We start by compiling and running the ‘Hello World’ example in the Quick Start Guide.

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._

object HelloWorld extends App {

  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()

  val source: Source[Int, NotUsed] = Source(1 to 100)

  source.runForeach(println)

  system.terminate()
}

Source will be widely used. It represents the inputs to the stream.

A couple of things stand out in the code. Firstly, the ActorMaterializer is new, compared to standard Akka. I have no idea what it does, but I’m guessing it could have been named “ActorFactory”.

Secondly, the Source takes two type parameters. The first is the type that the Source emits. The documentation says that “the second one may signal that running the source produces some auxiliary value (e.g. a network source may provide information about the bound port or the peer’s address)”. The first one makes sense. The second one doesn’t yet. Things will hopefully become clearer once I write something else.

Either way, this stream runs and prints out 1 to 100.

Using a Sink

Source --> Sink

The first example only uses a Source; it is not really a stream, as we just send everything to stdout. Let’s use a Sink, which represents the output of a stream.

There seem to be lots of kinds of Sink, including a foreach one, which we can use to println again.

import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._

import scala.concurrent.Future

object UsingASink extends App {

  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()

  val source: Source[Int, NotUsed] = Source(1 to 100)
  val sink: Sink[Any, Future[Done]] = Sink.foreach(println)

  source.runWith(sink)

  system.terminate()
}

Interestingly, the type signature of this is Sink[Any, Future[Done]]. From reading the ScalaDoc, Done is essentially Unit, but they included it so that the code can also be used from Java. We’ve also used that second mysterious type parameter.
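
As a quick experiment with that materialized value: runWith returns the Sink’s Future[Done], which completes when the stream finishes. A small sketch, reusing the source and sink from above, that terminates the ActorSystem only once the stream is done:

  val done: Future[Done] = source.runWith(sink)
  // terminate the ActorSystem only after the stream has completed
  done.onComplete(_ => system.terminate())(system.dispatcher)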

ScalaDoc says “The sink is materialized into a [[scala.concurrent.Future]]”. Perhaps the ActorMaterializer has been used, and deals with side-effects? Let’s keep going, and see if a more complicated example makes it easier to understand.

Simple Transform

Source --> Flow --> Sink

So far, we generate a stream from an Iterable, and send it to a Sink that prints everything out. Let’s include an intermediate step, where we do some “stream processing”. For this, we need the Flow class. This is beginning to look more like the Stream I imagined.

import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._

import scala.concurrent.Future

object SimpleTransform extends App {

  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()

  val source: Source[Int, NotUsed] = Source(1 to 100)
  val sink: Sink[Any, Future[Done]] = Sink.foreach(println)
  val helloTimesTen: Flow[Int, String, NotUsed] = Flow[Int].map(i => s"Hello ${i * 10}")

  val graph: RunnableGraph[NotUsed] = source via helloTimesTen to sink
  graph.run()

  system.terminate()
}

We’ve discovered another type now: RunnableGraph. This is a “Flow with attached input and output, can be executed.”

That makes sense. We’ve attached a Source, and a Sink, to our Flow. Therefore it has input and output, and should work.

RunnableGraph also specifies that it has a ClosedShape, which also hints at the role that a Materializer takes. I’m still yet to figure them out.

/*
 * This [[Shape]] is used for graphs that have neither open inputs nor open
 * outputs. Only such a [[Graph]] can be materialized by a [[Materializer]].
 */
sealed abstract class ClosedShape extends Shape

Graphs

All of our examples so far have been very linear. There has been no way of splitting the stream to do different pieces of work, or combine multiple sources.

One use-case I can think of when using Akka Streams is to persist events on the stream to a database, in addition to continuing the stream processing.

Source --> Broadcast --> Flow --> Sink
               |
               --> Save to DB

Akka Streams has a thing named Broadcast especially for this. However, constructing and using one is more complex than I imagined. You need to start using a mutable GraphDSL.Builder. The GraphDSL implies that we now need to learn what lots of funny symbols mean, such as ~>.

We need to build up a Graph[ClosedShape, Mat] where Mat is one of those Materializers that I still don’t understand.

Interestingly, it seems as though Graph doesn’t type-check very well. It is possible to construct and run a Graph that doesn’t do anything except fail at runtime. The following condition is checked via a require assertion when the code runs:

def isRunnable: Boolean = inPorts.isEmpty && outPorts.isEmpty

I’m not really sure how one would go about asserting this at compile-time. It definitely seems possible though, as other Scala libraries have similar builder patterns that type-check. Hopefully this will be addressed in a future release.

Anyway, let’s try and build a Stream that saves events to a DB.

import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._
import akka.{Done, NotUsed}

import scala.concurrent.{ExecutionContext, Future}

object SendToDB extends App {

  implicit val system = ActorSystem()
  implicit val ec = system.dispatcher
  implicit val materializer = ActorMaterializer()

  val intSource: Source[Int, NotUsed] = Source(1 to 100)

  val helloTimesTen: Flow[Int, String, NotUsed] = Flow[Int].map(i => s"Hello ${i * 10}")
  val intToEvent: Flow[Int, DB.Event, NotUsed] = Flow[Int].map(i => DB.Event(s"Event $i"))

  val printlnSink: Sink[Any, Future[Done]] = Sink.foreach(println)
  val dbSink = Flow[DB.Event].map(DB.persistEvent).toMat(Sink.ignore)(Keep.right).named("dbSink")

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[Int](2))

    intSource ~> broadcast ~> helloTimesTen ~> printlnSink
                 broadcast ~> intToEvent ~> dbSink

    ClosedShape
  })

  graph.run()

  system.terminate()
}

object DB {
  case class Event(msg: String)
  def persistEvent(e: Event)(implicit ec: ExecutionContext): Future[Unit] = {
    // pretend that some DB IO happens here
    println(s"persisting $e")
    Future {}
  }
}

There is a lot of new stuff here. Starting with the easiest thing: I’ve renamed some vals to reflect what they do now. We also have a new Flow named intToEvent, which maps an Int to a DB.Event case class.

We also have the GraphDSL syntax, and implicit conversions. ~> means “Add Edge to Graph” in my head. As DSLs go, it isn’t too bad. Arrows and stream processing go hand-in-hand. I’ll try and use this syntax from now on. The Broadcast class also checks that we’ve linked all the specified ‘ports’. In our example, we create the Broadcast and say it will have two things listening to it. If we only connect one, we get a runtime error.

Lastly, we have a new Sink named dbSink. I basically copied the code from inside the printlnSink and changed the map method. It seems as though in order to perform actions on a Stream, we need to materialize the values contained inside. I now assume that Streams are inherently lazy, and Materializing is the act of evaluating the Stream.
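
As an aside, if we wanted the stream to wait for each DB write before demanding more elements, mapAsync looks like the right tool for the job. A minimal sketch, assuming the same DB object and implicits as above:

  // mapAsync completes each Future (up to 4 in flight) before emitting
  // downstream, so slow DB writes back-pressure the stream
  val dbSinkAsync: Sink[DB.Event, Future[Done]] =
    Flow[DB.Event].mapAsync(4)(DB.persistEvent).toMat(Sink.ignore)(Keep.right)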

We need to dig into Materializers, and finally figure them out.

Basic Materializer

Having read the docs some more, it seems as though “materialization” is the thing that actually runs our Stream. When using Akka, actors are created (or materialized) in order to do the work. Makes sense. I can’t help but feel that this should have been more obvious at the start…

Source ~> Sink

We return to the simplest graph, with just a Source and a Sink. The Source generates Ints, and the Sink just takes the first one.

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object BasicMaterializer extends App {

  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()

  val intSource: Source[Int, NotUsed] = Source(1 to 100)
  val headSink: Sink[Int, Future[Int]] = Sink.head[Int]

  val graph1: RunnableGraph[Future[Int]] = intSource.toMat(headSink)(Keep.right)
  val graph2: RunnableGraph[NotUsed]     = intSource.toMat(headSink)(Keep.left)

  // we can only get values from graph1
  val result = Await.result(graph1.run(), Duration(3,"seconds"))

  println(result)

  system.terminate()
}

I’ve included code for two RunnableGraphs. They only vary in their second argument: Keep.right versus Keep.left. This seems to be the key to Materializers, and the mysterious second type parameter included everywhere in our Sources, Flows, and Sinks.

In order to get any values out of a Stream flowing left to right, we need to keep the right values. Presumably, in a stream going the other way, we need to keep the left values. In our case, the right side of our graph has Future[Int] as the second type parameter. This is the one we need.

The types on these methods are very obvious; given two types, select the one on the left, or the right.

def left [L, R]: (L, R) => L
def right[L, R]: (L, R) => R
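
There is also a Keep.both, which materializes a tuple of the two values. A quick sketch, reusing the intSource and headSink from above:

// Keep.both keeps both materialized values as a tuple
val graph3: RunnableGraph[(NotUsed, Future[Int])] =
  intSource.toMat(headSink)(Keep.both)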

This now also answers my questions above about the previously used Sink named dbSink. Here’s the implementation again:

Flow[DB.Event].map(DB.persistEvent).toMat(Sink.ignore)(Keep.right)

DB.persistEvent returns a Future[Unit]. In order to actually evaluate these Futures, we need to materialize the stream. As we’re keeping right, we pass our Futures into Sink.ignore; if we kept left, we’d keep the NotUsed instead. Whilst ignoring them doesn’t sound very useful, this actually runs them before ignoring whatever they return.

I feel like I understand roughly how Akka Streams work now.

Here’s a recap.

  • Sources generate values.
  • Sinks consume values.
  • Materialization is the process of running the Stream, and getting your Sink to do something.
  • Flows are linear transformations.
  • Graphs can be modelled with Broadcast (and Merge, but we didn’t try them out).

This is enough blog post for now. I’d like to continue learning Akka Streams; they seem a lot easier to use than Actors, and patterns are built in. Streams are more type-safe than Actors too, which hopefully will permit writing more maintainable code.

The Virtues of Side Projects


I write software for a living, and most of my side projects are software based too. I view side projects as a tool for learning. Learning by doing.

Side projects can feed positively into your day job, as you learn new skills and techniques

Building things from scratch, and by yourself, means you have to broaden your knowledge base. If you’re writing a website, you’ll need to know how to deploy it, how to enable database backups, how to configure a firewall, and basic server maintenance tools.

When I first started a side project, I wrote Java all day, and developed on Windows. I knew hardly anything about Linux, and certainly had no experience running a server. Whilst it wasn’t necessarily all fun, I learnt a lot that has been widely applicable later on in my career.

Side projects can also assist in your day job directly. An example of this happened to me very recently. A couple of colleagues were pair programming at an adjacent desk. They were stuck on something, and casually asked if I knew how to fix a problem they were having. After a minute or so looking at the code, my answer was “Yeah, I did something like this recently in my side project”. I then sent them a link to my side project on GitHub, and after adapting my code, they had solved their issue and were away coding again.

Usually it’s skills and techniques that will transfer over from side-projects. Occasionally, you can just copy some of your side project code.

Side projects let you experiment without adversely affecting your day-job

Want to learn Haskell? Great! But don’t then deploy it at work, and force other people to maintain your code.

Want to try out a theory about structuring an object-orientated codebase? You can, and if it doesn’t work out, you won’t have annoyed all of your colleagues.

Having a great job, or working on a great team, means that you’re given time to learn or embrace different technologies or techniques. However, if you’re not that lucky, side projects can give you the freedom to learn and experiment in ways you wouldn’t normally be allowed to.

Side projects can die without negative consequences

If you get bored with working on a side project, just stop working on it. A project with paying customers is not a side project. Nobody can honestly expect you to keep working for free on something you find boring.

Work on something different for a while. If the side project doesn’t interest you again, forget about it, and thank it for the experience and skills you gained whilst working on it.

Side projects can force you to think differently

Compared to your day job, side projects are subject to a different set of constraints. Different constraints mean different problems to solve. Your decisions will be different as a result.

The main constraint on my side projects is money. At most places I’ve worked, the budget is usually huge in comparison to what I can set aside. EC2 instances cost nothing to a corporation, but ~$50/month, multiplied by three for redundancy and replication, soon makes a dent in my own finances.

This constraint has also led me to use a lot of free-tiers of hosted-database providers. If a company is willing to provide me with free services, then I’m all ears. This was how I was introduced to NoSQL databases, and I gained a completely different perspective on how to store, and query, data.

Side projects count as experience

Always be learning. There is a famous quote by Malcolm Gladwell about how doing something for ten thousand hours can help achieve mastery. Whilst this quote is often taken out of context, everyone understands that you need to practice something in order to become better at it. The good news is that you can practice your skills whilst working on a fun side-project.

The second piece of good news is that you can often use side projects as direct evidence to show off to potential employers. I’ve often chatted about my side projects whilst being interviewed, and I’ve often chatted about interviewees’ side projects whilst I interviewed them. They’re a great way to showcase your talents!

Using Elm in Octopress


Elm is a functional programming language aimed at the browser. It aims to replace all HTML, CSS, and JavaScript code. It borrows a lot from Haskell, and promises that if your Elm code compiles, it will run without exceptions.

Animated, or interactive, examples can greatly enhance blog posts. This is ultimately achieved via embedding HTML, CSS, and JavaScript. My blog is powered by Octopress, and instead of writing HTML, CSS, and JavaScript, I was interested in using Elm instead.

Including JavaScript in an Octopress Post

Running arbitrary JavaScript in Octopress is easy. The code below will insert an HTML paragraph into a div.

<div id="elm-goes-here"></div>
<script type="text/javascript">
  var element = document.getElementById("elm-goes-here");
  element.innerHTML = "<p>This is set via JavaScript!!</p>";
</script>

Instead of inlining the JavaScript, you can include it in the source/javascripts/ directory. After publishing, that directory is made available as /javascripts/.

Including Elm in an Octopress Post

As we can use arbitrary JavaScript in an Octopress blog post, we can follow a few simple steps, and have the browser running our Elm code instead!

Elm has interop with JavaScript through HTML embedding (and a couple of other ways). In order to embed in a div, we first need to write and then compile our Elm code.

elm-make Stamps.elm --output=app.js

Once compiled, and made available by placing it in the source/javascripts/ directory, we can then include it in the blog post.

<div id="elm-goes-here"></div>
<script type="text/javascript" src="/javascripts/app.js"></script>
<script>
  var element = document.getElementById("elm-goes-here");
  Elm.embed(Elm.Stamps, element);
</script>

As a demonstration, have a quick play with the interactive section below. It is an embedded version of Elm’s Stamps Example. Elm.embed requires a module to import, so if you’d like to try this yourself, add the following to the top of your Elm code.

module Stamps where

If you don’t like pentagons, there are lots of other Elm examples to play around with.

Using Play-Framework's PathBindable


Using custom types in Play Framework’s routes file is a major win, and is not something obviously supported. Consider the routes file below:

GET /stuff/:id     @controllers.StuffController.get(id: String)
GET /things/:id    @controllers.ThingsController.get(id: java.util.UUID)

In the first route, we take the id parameter as a String. In the second, we take it as a java.util.UUID.

Advantages

In our example above, paths that do not contain UUIDs are not matched by the second route. We don’t have to deal with IDs that are not UUIDs.

At the start of a project, you may see lots of lines that say:

id match {
   case i if isUUID(i) => doStuff()
   case _ => BadRequest("id must be a UUID")
}

By not matching on the route, we can remove this code. A request either matches a route, and is passed to the controller, or it doesn’t, and the controller never knows about the request.

By allowing types, and not just strings, you can avoid stringly-typed controllers. Admittedly, UUIDly-typed is only a small step in the right direction, but still a significant improvement.
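
To make this concrete, here is a hypothetical controller for the second route above; by the time the action runs, id is already a real UUID:

package controllers

import java.util.UUID
import play.api.mvc._

class ThingsController extends Controller {
   // no parsing or validation needed; the router has already bound a UUID
   def get(id: UUID) = Action {
      Ok(s"Found thing $id")
   }
}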

Disadvantages

You need to fully-qualify the types in the routes file, for example by using java.util.UUID everywhere. You cannot use imports in the routes file. Hopefully someone will find a solution to that at some point.

Implementation

There are two things that need doing before you can use custom types in the routes file. Firstly, you must implement a PathBindable and its bind and unbind methods. For a UUID, this is quite simple. The bind method returns an Either, so that you can return a message explaining why the route did not match.

package util

import java.util.UUID
import play.api.mvc.PathBindable

object Binders {
   implicit def uuidPathBinder = new PathBindable[UUID] {
      override def bind(key: String, value: String): Either[String, UUID] = {
         try {
            Right(UUID.fromString(value))
         } catch {
            case e: IllegalArgumentException => Left("Id must be a UUID")
         }
      }
      override def unbind(key: String, value: UUID): String = value.toString
   }
}

Secondly, you must make Play aware of this class, by changing your build file.

import play.PlayImport.PlayKeys._

routesImport += "util.Binders._"

After those two steps, you can then use types in the routes file.

Cassandra TTL Is Per Column


Cassandra Time-To-Live (TTL) is described in the Datastax documentation. This blog post briefly explores it to demonstrate that TTL is set per column, and not per row.

We start by recreating the example given in the documentation. We create a keyspace and a table, and insert some data. The TTL value is much lower than in the official documentation, as I don’t want to wait 24 hours before the TTL runs out.

cqlsh> CREATE KEYSPACE excelsior WITH REPLICATION =
         { 'class' : 'SimpleStrategy', 'replication_factor': 1 };

cqlsh> CREATE TABLE excelsior.clicks (
         userid uuid,
         url text,
         date timestamp,
         name text,
         PRIMARY KEY (userid, url)
       );

cqlsh> INSERT INTO excelsior.clicks (
         userid, url, date, name)
       VALUES (
         3715e600-2eb0-11e2-81c1-0800200c9a66,
         'http://apache.org',
         '2013-10-09', 'Mary')
       USING TTL 60;

Now that we have created our keyspace and table, let’s query the TTL:

cqlsh> SELECT TTL (date), TTL (name) from excelsior.clicks;

 ttl(date) | ttl(name)
-----------+-----------
        52 |        52

Insert or Update to change TTL per column

As demonstrated by the CQL syntax, TTL is set per column. To demonstrate this, we now insert the data again, but exclude the date.

cqlsh> INSERT INTO excelsior.clicks (
         userid, url, name)
         VALUES (
           3715e600-2eb0-11e2-81c1-0800200c9a66,
           'http://apache.org',
           'Mary')
         USING TTL 60;
cqlsh> SELECT TTL (date), TTL (name) from excelsior.clicks;

 ttl(date) | ttl(name)
-----------+-----------
        11 |        49

If we then wait 11 seconds, we can see that different columns can expire at different times.

cqlsh> select * from excelsior.clicks;

 userid                               | url               | date | name
--------------------------------------+-------------------+------+------
 3715e600-2eb0-11e2-81c1-0800200c9a66 | http://apache.org | null | Mary

This can come as a surprise if you’re used to rows behaving as one single entity. If you want to update the TTL for an entire row in Cassandra, you need to either insert or update the entire row again with a new TTL.

ScalaTest and Twitter Futures


Scala has nice abstractions for asynchronous code. However, writing tests for that code sometimes results in an ugly, unreadable mess. Fortunately, ScalaTest has built-in support for testing Futures, in addition to utilities for other types of asynchronous testing, such as polling and test-probes.

org.scalatest.concurrent.Futures

ScalaTest has a trait named Futures, which defines functions such as whenReady, and other goodies like a futureValue method, to help your async tests become terser. However, ScalaTest only comes with support for the standard-library Futures. To use them, mix in org.scalatest.concurrent.ScalaFutures.
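
As a minimal sketch of what that support buys you with standard-library Futures (the spec class and assertion here are made up for illustration):

import org.scalatest.concurrent.ScalaFutures
import org.scalatest.{FlatSpec, Matchers}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

class ExampleSpec extends FlatSpec with Matchers with ScalaFutures {
  "futureValue" should "block until the Future completes" in {
    Future(21 * 2).futureValue shouldBe 42
  }
}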

If, like me, you’re currently using Twitter Futures, then you need to define your own support for them. Luckily, it is quite easy to define support for any Futures library.

Behold a TwitterFutures trait:

import com.twitter.util.{Throw, Return}
import org.scalatest.concurrent.Futures

trait TwitterFutures extends Futures {

  import scala.language.implicitConversions

  implicit def convertTwitterFuture[T](twitterFuture: com.twitter.util.Future[T]): FutureConcept[T] =
    new FutureConcept[T] {
      override def eitherValue: Option[Either[Throwable, T]] = {
        twitterFuture.poll.map {
          case Return(o) => Right(o)
          case Throw(e)  => Left(e)
        }
      }
      override def isCanceled: Boolean = false
      override def isExpired: Boolean = false
    }
}

You may also have to define an implicit PatienceConfig for your tests, as the default settings will time out after 150 milliseconds.

import org.scalatest.time.{Seconds, Span}

implicit val asyncConfig = PatienceConfig(timeout = scaled(Span(2, Seconds)))
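
With the trait mixed in, testing a Twitter Future looks just like testing a standard one. A sketch, assuming a spec class that mixes in TwitterFutures and ScalaTest’s Matchers:

// the implicit conversion turns the Twitter Future into a FutureConcept
whenReady(com.twitter.util.Future.value(42)) { result =>
  result shouldBe 42
}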

Polling?

Strangely, ScalaTest chooses to poll futures, despite both Scala and Twitter Futures coming with Await functions that handle timeouts. Using those as a starting point would have seemed more sensible to me. However, I’m not the author of a successful Scala testing library, and I’m sure that author Bill Venners had a reason. Either way, it is worth noting.

Mock Responses and Iterate


‘Agile’ may now be an overloaded term abused by all kinds of people, but the original manifesto is actually still quite relevant. The second point is:

Working software over comprehensive documentation

This is useful to embrace in many circumstances, and can be expanded to:

You should initially favour building software over documenting it. Comprehensive documentation can come later.

I have used the following formula for the last few internal APIs I have worked on, and it makes for fun, fast, and collaborative development:

  1. Write API layer that returns mock responses
  2. Deploy it
  3. Solicit feedback from those consuming the API
  4. Make changes to API layer to address feedback
  5. Repeat

A mock-response is one that does not change, even when API parameters do. It is hardcoded. If you request GET /users/123 you get the same data as when requesting GET /users/456.

The point is to get something deployed as quickly as possible, and get feedback on it. When the API is more stable, the mock-responses can become dynamic and then persisted. Soliciting feedback is the highest priority until then. Nobody can write the perfect API the first time, and receiving and responding to feedback as early as possible saves development time later.
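
As an illustration, a mock endpoint can be a few lines in whatever framework you already use. A hypothetical Play controller, where the id parameter is deliberately ignored:

package controllers

import play.api.mvc._

class MockUsersController extends Controller {
  // every id returns the same hardcoded response, by design
  def get(id: String) = Action {
    Ok("""{"id": "123", "name": "Mary"}""").as("application/json")
  }
}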

Tools

There are several tools available to help you develop an API and iterate upon it. WireMock can be configured with a routes file and mock-responses, and can easily run in standalone mode. Just edit a couple of files, and deploy!

If deployment is a problem, check out getsandbox.com. They provide a JavaScript sandbox to easily return some mock-responses.

#Code2014


As the end of the year approaches, Twitter has a hash-tag named #Code2014 that encourages people to tweet which programming languages they’ve been using over the last 12 months.

I joined in, but afterwards felt compelled to explain my thoughts on each in more than 140 characters. Hence this blog post.

Scala

My main language, and the one I am gainfully employed to write in. I found myself at the Scala Exchange conference again this year, and was pleasantly surprised by the attention that Scalaz was receiving. A lot of people were still unconvinced, but the Scalaz talks were very well attended.

I have also found myself immersed in Twitter’s Scala-Ecosystem in the last quarter of the year. I’m thoroughly enjoying it. I have become more convinced that Akka is middleware, and Futures (rather than Actors) are more than sufficient for the vast majority of use-cases.

Java

First-class functions are so incredibly versatile that I now get frustrated when using Java.

On a related note, this tweet amused me this year:

For a greenfield project, I’m not sure why I would ever choose Java over Scala again.

Haskell

It has been said that Scalaz is a gateway drug from Scala to Haskell, but it is the blurred lines between OOP and FP in Scala that occasionally irritate me. As a pure language, Haskell will hopefully not have the same kind of problems. I have now finally finished reading LYAHFGG and I will start writing more things in Haskell next year.

Clojure

I was incredibly excited by Clojure at the beginning of the year, but my excitement is beginning to wane. I would love to love Clojure; the collections libraries are magnificent, it is incredibly concise, and the premise of s-expressions is compelling. However, the lack of static typing constantly annoys me. Typed Clojure and/or Prismatic Schema may alleviate my concerns though. I need to spend more time investigating them.

Racket

I have been working my way through The Little Schemer, and apparently they renamed Scheme to Racket. I’m not convinced it is as nice as Clojure, but it might be worth investigating Typed Racket if the libraries providing typing in Clojure don’t turn out very well.

JavaScript

JavaScript is still the only realistic choice when writing for the browser, which means that it will be necessary for years to come. I spent a lot more time using it this year, having written both client-side and server-side JavaScript over the past twelve months.

I also re-read JavaScript: The Good Parts, shortly followed by the excellent Functional JavaScript. The latter shows that JavaScript could actually be a very nice language, if everyone followed the same path.

The “Good Parts” are constantly shrinking though, as Douglas Crockford revised his advice.

It might be nice to see a new CoffeeScript-esque JavaScript compiler that errors if you don’t use the Good Parts. Either that, or I need to spend time having a proper look at ClojureScript and PureScript.

Python

Python is my default scripting language. I enjoyed PayPal debunking some myths of Enterprise Python, but Python is slow and lacks good concurrency support in comparison to Scala.

Golang

I have written a couple of utilities and one website in Golang this year. It could potentially replace Python as my default “operations” language once I learn more of the standard library.

It is annoying that it doesn’t have a map function though, and without immutable data structures, I probably won’t ever trust it to build anything complicated.

Lua

Lua is a language I will spend more quality time with in 2015. Having previously worked with Lua when embedding it in Redis, I was excited to discover OpenResty and being able to embed it into NGINX. I also want to investigate using Lua for CUDA GPU Programming.

More In 2015?

Aside from those already listed, I need to spend some time with Erlang next year. I have now read a couple of books about it (including the excellent Learn You Some Erlang), and I really need to write some Erlang to experience it firsthand.

Monoids


In abstract algebra, a branch of mathematics, a monoid is an algebraic structure with a single associative binary operation and an identity element. Monoids are studied in semigroup theory as they are semigroups with identity.

Wikipedia has the above definition of a Monoid. This blog post will deconstruct that definition to simply describe what a monoid is, and why you would want to use one.

We will, however, ignore a proper definition of ‘abstract algebra’, as that would mean writing several textbooks. Just imagine an ‘algebraic structure’ as a thing that contains functions. A ‘thing that contains functions’ is sometimes named a class in programming languages.

So, with that out of the way, this is the list of things we will cover:

  • Associative
  • Semigroup
  • Identity

Associative

You should have learnt some associative things during Mathematics lessons at primary school. Addition and Multiplication are associative. Subtraction and Division are not. Associativity means that you can regroup operations on a list of things, and always receive the same result.

((((1 + 2) + 3) + 4) + 5) = 15
(1 + (2 + (3 + (4 + 5)))) = 15 // regrouped, same result
((((1 - 2) - 3) - 4) - 5) = -13
(1 - (2 - (3 - (4 - 5)))) = 3 // regrouped, different result

Semigroup

Semigroup is the name for a thing that provides an associative function. For Integers, + and * are associative functions. A Semigroup is one of those “algebraic structures”, and it has exactly one function: an associative one.

It would look like this, for adding up Integers:

trait Semigroup[A] {
  def associative(a: A, b: A): A
}
val additionSemigroup = new Semigroup[Int] {
  def associative(a: Int, b: Int): Int = a + b
}

It is worth noting that a Semigroup applies to a whole type. This one applies to all Int, as denoted by the type parameter in the square brackets. The example is written in Scala, where Int is the standard integer type.

Identity

Also known as the zero value, or the ‘value for which nothing happens’. This will depend on what your monoid does. For instance, for an adding monoid, the identity value will be zero. Adding zero to x gives you x. For multiplication, the identity value will be 1. Multiplying 1 by x gives you x. If we had kept the identity value as zero, multiplying x by zero would return zero, thereby being useless. Likewise, any other value would mess up our calculation.

1 + 1     = 2
1 + 1 + 0 = 2 // we can add an 'identity' of zero at any point
4 * 4     = 16
4 * 4 * 1 = 16 // we can add an 'identity' of 1 at any point.
4 * 4 * 0 = 0  // an 'identity' of zero doesn't work for multiplication
4 * 4 * 2 = 32 // neither does any other value. It has to be 1!

So what is a monoid?

A monoid has an associative function, and it has an identity element. As you may be able to tell from the Wikipedia description, it really is a Semigroup, but with an identity element added.

As shown above, the + operator on Integers is associative, and we know that the zero value for this is 0.

trait Monoid[A] extends Semigroup[A] {
  def identity: A
}
val additionMonoid = new Monoid[Int] {
  def identity: Int = 0
  def associative(a: Int, b: Int): Int = a + b
}

Why is this useful?

Consider the following Scala example. |+| is an alias for our previously-seen associative method. This particular associative method (taken from scalaz) merges two Maps together by summing the values if the keys are equal.

scala> import scalaz.Scalaz._ // this defines the |+| operator.
scala> Map("a" -> 1.0, "b" -> 3.0) |+| Map("a" -> -1.0)
res0: scala.collection.immutable.Map[String,Double] = Map(a -> 0.0, b -> 3.0)

We can also take advantage of the reduce method to do this:

scala> List(Map("a" -> 1.0, "b" -> 3.0), Map("a" -> -1.0)).reduce(_ |+| _)
res1: scala.collection.immutable.Map[String,Double] = Map(a -> 0.0, b -> 3.0)

If you can define your problem as a Monoid, it becomes trivial to compute, and trivial to parallelise too. Remember, associative functions can be split into batches, computed separately, and then combined.
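
As a small sketch of that, in plain Scala with the standard library’s parallel collections: because + is associative and 0 is its identity, we can split the numbers into chunks, fold each chunk independently, and then combine the partial sums.

val numbers = (1 to 100).toList

val total = numbers
  .grouped(25).toList.par             // four chunks, folded in parallel
  .map(chunk => chunk.fold(0)(_ + _)) // fold each chunk from the identity
  .fold(0)(_ + _)                     // combine the partial sums

// total == 5050, the same as numbers.sum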

Hopefully it is becoming clear why this is useful. If you want to add Integers together, most languages already include basic addition. However, Monoids can potentially be written for any type of data. As long as you define identity and associative, they can do anything you want!

Aggregation Services in Node.js

| Comments

My previous blog post talked about building Aggregation Services using Play-JSON. In it, I mentioned that Aggregation Services using JavaScript might be quite nice. As JSON is native to JavaScript, you might expect manipulating JSON in JavaScript to be incredibly simple. And you would be correct!

Below is the same functionality as last time, but with Node.js. To recap, we fetch an article a1, which contains a list of product ids [s1, s2, s3]. We load the article, and then have to fetch all the products it contains.

var express = require('express'),
    request = require('request'),
    async = require('async'),
    app = express();

// where our json data lives
var data = {
  "s3": "https://gist.githubusercontent.com/cjwebb/aef1f4fb2ca6d01f8b63/raw/0b6eb2c9b55a6720ccf41ee4ff8cca053cfda063/product-s3.json",
  "s2": "https://gist.githubusercontent.com/cjwebb/2d7fce88ce6594325bec/raw/fe025c2eafb8aeca953999f10663b83863a14d25/product-s2.json",
  "s1": "https://gist.githubusercontent.com/cjwebb/814c6337b0f04f1cfeba/raw/dc9b297a96c0bd8870436413e51efa2a36168308/product-s1.json",
  "a1": "https://gist.githubusercontent.com/cjwebb/c26c42e03ea8573efd4c/raw/75479f6f2d218ac6212e4f4b53fc7e30746228bd/article-a1.json"
}

var fetchProduct = function(item, cb) {
  request.get(data[item], {json:true}, function(error, response, body){
      cb(null, body);
  });
};

app.get('/', function(req, res){
  request.get(data['a1'], {json:true}, function(error, response, body){
      async.map(body['product_list'], fetchProduct, function(err, results){
          // mutate all the state!!  
          body['product_list'] = results.filter(function(n){ return n });  
          res.send(body);    
      });
  });
});

app.listen(3000);

Having worked a lot with Scala and Clojure recently, I keep forgetting that one can actually mutate variables! If the service is kept relatively small, the mutation should be forgivable.

This isn’t production-ready code. Error handling is missing for the first HTTP request, and if fetchProduct returns an error, erroring products are filtered out. Hopefully though, the code gives a flavour of what an Aggregation service written in Node.js would look like.