Github Data Mining: on Amazon Web Services


This blog post describes how we implemented our Github Reports ecosystem on Amazon Web Services (AWS) and the technical challenges we faced. We will dive into some of the main components of AWS, so that you will hopefully get set up quicker than we did and avoid the pitfalls we fell into.

In our previous blog post we discussed the general idea behind mining our Github data, with the purpose of extracting useful organisation information. We showed an ideal algorithm to retrieve historical data, handle network errors, retry operations and decompose the problem into multiple pieces.

An algorithm and all of its friends

The process we thought of consisted of the following steps:

  1. Starting the process
  2. Requesting a piece of Github information
  3. Converting the information
  4. Persisting the information
  5. Handling API rate limits
  6. Handling network errors
  7. Understanding “what to request next”, then doing it

Keeping everything abstract helped us think more clearly about the components we needed:

  • some kind of worker instance to actually run the code
  • a database to store and read data from
  • some scheduler to restart requests that have failed for any reason
  • a queue to store the messages (all the information we want to extract)

In fact, we wrote an abstract class that wraps all of these meta-components together to provide global functionality without detailing the internal implementation (the following code is a slight simplification of the actual one):

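if (configuration.hasAlarm()) {
    String alarmName = configuration.alarmName();
    // remove the alarm called alarmName, so that it won't trigger the worker again
}
// pick the next message from the queue
Q queue = getQueue(configuration);
M queueMessage = getItem(queue);
// handle it, collecting any follow-up messages it generates
List<M> newMessages = handleQueueMessage(configuration, queueMessage);
// drop the handled message and enqueue the new ones
updateQueue(queue, queueMessage, newMessages);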
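To make the shape of that class a bit more concrete, here is a minimal sketch of the same idea running entirely in memory. It is not our actual code: the names QueueWorker and InMemoryWorker, the string messages and the missing configuration parameter are all assumptions made for the sake of a short example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Template-method skeleton: one call to run() handles exactly one message.
abstract class QueueWorker<M, Q> {

    final void run() {
        Q queue = getQueue();
        M message = getItem(queue);
        if (message == null) {
            return; // the queue is empty, nothing left to do
        }
        List<M> newMessages = handleQueueMessage(message);
        updateQueue(queue, message, newMessages);
    }

    abstract Q getQueue();

    abstract M getItem(Q queue);

    abstract List<M> handleQueueMessage(M message);

    abstract void updateQueue(Q queue, M handled, List<M> newMessages);
}

// Hypothetical local implementation: the queue is an in-memory deque of strings.
final class InMemoryWorker extends QueueWorker<String, Deque<String>> {
    private final Deque<String> queue = new ArrayDeque<>();
    final List<String> persisted = new ArrayList<>();

    InMemoryWorker(String firstMessage) {
        queue.add(firstMessage);
    }

    @Override
    Deque<String> getQueue() {
        return queue;
    }

    @Override
    String getItem(Deque<String> q) {
        return q.peekFirst(); // read without removing, the message is deleted only after handling
    }

    @Override
    List<String> handleQueueMessage(String message) {
        persisted.add(message); // stands in for "convert and persist"
        if (message.equals("repositories")) {
            return Arrays.asList("issues:repo-a", "issues:repo-b");
        }
        return Collections.emptyList();
    }

    @Override
    void updateQueue(Deque<String> q, String handled, List<String> newMessages) {
        q.removeFirst();
        q.addAll(newMessages);
    }

    boolean isDone() {
        return queue.isEmpty();
    }
}
```

Calling run() in a loop until isDone() returns true walks the whole (tiny) tree, one message at a time, which is the behaviour we wanted from the real worker.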

A “good enough” approach would be to implement this system on a local computer, which provides all the pieces we need:

  • the worker can be implemented as a simple command line program
  • the database can be easily stored locally
  • the scheduler is simply cron

Theoretically, there’s nothing wrong with this approach, except that it would keep a machine busy all the time, would not scale and, most of all, would not be kewl.

Cartman is not happy


Introducing Amazon Web Services

Amazon Web Services (AWS) is a suite of cloud services that range from the infrastructure to the platform level. It enables you to define APIs, manage databases, create your own Virtual Private Network, handle traffic scaling manually or automatically, and so on.

We chose AWS over similar service providers (Google Cloud, Microsoft Azure) because it seemed to be the most complete suite. In fact, after a quick documentation dive, we found out that it satisfies our requirements, providing all of the components we need.

Simple Queue Service

SQS is a queue system (no surprise so far) that is “fast, reliable, scalable, fully managed”. Through the AWS SDK, developers can perform all basic operations one would expect to be possible with queues:

  • create, purge, delete the queue
  • add an object
  • read a message
  • delete a message

Apart from the basic stuff, SQS provides a visibility model: as soon as a message is read by someone, it won’t be visible to other queue clients until a certain timeout expires. This means that multiple clients reading the queue at very close instants in time will not read the same message. After the timeout expires, if the message was not deleted, it becomes visible again to any other client.
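The visibility model is easier to reason about with a toy version in front of you. The following is a sketch of an in-memory queue that mimics the behaviour described above. It does not use the AWS SDK, and the VisibilityQueue class and its methods are made up for this example; time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory queue that mimics a visibility timeout.
final class VisibilityQueue {

    private static final class Entry {
        final String body;
        long invisibleUntilMillis; // the message is hidden while now < this value

        Entry(String body) {
            this.body = body;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final long visibilityTimeoutMillis;

    VisibilityQueue(long visibilityTimeoutMillis) {
        this.visibilityTimeoutMillis = visibilityTimeoutMillis;
    }

    void send(String body) {
        entries.add(new Entry(body));
    }

    // Returns the first visible message and hides it for the timeout,
    // or null when no message is currently visible.
    String receive(long nowMillis) {
        for (Entry entry : entries) {
            if (nowMillis >= entry.invisibleUntilMillis) {
                entry.invisibleUntilMillis = nowMillis + visibilityTimeoutMillis;
                return entry.body;
            }
        }
        return null;
    }

    // Only an explicit delete removes a message; reading alone never does.
    boolean delete(String body) {
        return entries.removeIf(entry -> entry.body.equals(body));
    }
}
```

With a 30 second timeout, a second receive issued 10 seconds after the first one returns nothing, while one issued after 30 seconds gets the same message again. With a timeout of 0 the message is readable again straight away.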

Therefore, SQS was the perfect candidate for deploying and managing our message queue, where every message consists of one atomic operation we need to perform (e.g., get repositories list at page 2, get comments for issue 1337 at page 10, etc.).

In our specific implementation, we set the visibility timeout to 0 for a couple of reasons.
First, we don’t allow for parallel executions of worker instances, since we rely on the partial order of messages to avoid inconsistencies in the database (e.g., we must hold a reference to a repository before being able to store that repository’s issues).

Don’t be like Cartman, keep your queue ordered and tidy


Second, if handling a message fails, we reschedule the execution of the same message after a variable delay, which can be zero (an immediate retry) or longer; it’s therefore important for us that the message is visible again right after the first read, so that it can be picked up by any “retry” operation.


Lambda

You’ve probably heard of microservices or even serverless architecture by now, which means you’re very likely to have heard about Amazon’s Lambda too. In a few words, Lambda is a realisation of serverless configuration, allowing you to run units - functions - in response to events, in what is known as event-driven computing: something happens → your unit runs.

This implies a little paradigm shift from the usual, or more traditional, way of doing things. Instead of having your application running on a server (or servers) somewhere, you break it down to smaller pieces and you’re billed on the resources it takes to run each piece when the corresponding event occurs. So, you code each unit, upload it and somehow configure it to respond (i.e. run) to events of your choosing.

It may also be simpler for you to envision, and develop, your application broken down this way, and to have it up and running quicker too (there’s no need to buy/configure servers).

This is essentially what AWS offers when it comes to serverless computing, through Lambda.

Now, if you read the previous post on Github data mining endeavours, you might remember the tree structure and the need to go through each of its nodes, or “messages” to actually get the information we want, parse it appropriately and persist it according to our needs. This would be our worker! Backtracking a bit to the beginning of this blog post we start to see what our unit might be:

  1. fetch a message;
  2. parse it, retrieving its content;
  3. and then do whatever needs to be done, according to the content.

We coded our application in such a way that we could pack these three steps in one unit, our worker. So, as an example, and looking at the tree, imagine the worker starts and gets a message (step 1); it then parses it and figures out that it needs to fetch the repositories (step 2), and finally does so, adding to the queue a new message for each repository (step 3).
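Steps 2 and 3 are where the “what to request next” logic lives, so here is a rough sketch of how parsing a message and deriving the follow-up messages could look. Everything in it is an assumption made for illustration: the "kind:target:page" message format, the Message and NextMessages classes, and the idea that the fetched page tells us whether another page follows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical message format: "<kind>:<target>:<page>", e.g. "issues:repo-a:2".
final class Message {
    final String kind;
    final String target;
    final int page;

    Message(String kind, String target, int page) {
        this.kind = kind;
        this.target = target;
        this.page = page;
    }

    // Step 2: parse the raw queue message and retrieve its content.
    static Message parse(String raw) {
        String[] parts = raw.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed message: " + raw);
        }
        return new Message(parts[0], parts[1], Integer.parseInt(parts[2]));
    }

    String serialise() {
        return kind + ":" + target + ":" + page;
    }
}

final class NextMessages {

    // Step 3: once a page has been fetched, decide which messages to enqueue next.
    static List<String> after(Message handled, List<String> itemsOnPage, boolean hasNextPage) {
        List<String> next = new ArrayList<>();
        if (hasNextPage) {
            // same request, following page
            next.add(new Message(handled.kind, handled.target, handled.page + 1).serialise());
        }
        if (handled.kind.equals("repositories")) {
            // one level down the tree: the first page of issues for every repository found
            for (String repository : itemsOnPage) {
                next.add(new Message("issues", repository, 1).serialise());
            }
        }
        return next;
    }
}
```

Handling "repositories:novoda:2" with two repositories on the page and more pages to come would therefore enqueue the third page of repositories, followed by the first page of issues for each of the two repositories.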

We have our unit - our worker - we now need to make it work, that is, respond to events.

CloudWatch Events

CloudWatch is an AWS module that contains lots of resources useful for performance/cost analysis of your cloud platform. It enables you to monitor resource consumption, errors and success rates and effectively plan your spending with billing estimates.

Among all of these features, a gem hides: Events. Events is a sub-component of CloudWatch that lets you associate a source event with a target; for example, you might want to react to AWS API calls for your account to perform some custom metric logging logic, by calling a Lambda function, doing some queue operation, consuming a stream, etc.

Our simple algorithm requires us to react to some “alarm” event, in a way similar to how cron jobs work. As it turns out, CloudWatch Events has a “Schedule” source event type that allows us to specify a cron expression, so that when that one is triggered the target gets called. In our case, the target is the Lambda function that implements our worker.

You can create events manually or via the provided SDK

When a lambda is triggered from a schedule, it looks for the associated rule on CloudWatch and removes it, so that the target won’t be triggered again; then, it proceeds with its regular behaviour (gets the first message from the queue, processes it, etc.).
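Since each rule is removed once it has fired, a retry needs a fresh schedule expression pointing at a single moment in the future. As a sketch of how such an expression could be built: CloudWatch Events cron expressions have six fields (minutes, hours, day of month, month, day of week, year) and are evaluated in UTC. The ScheduleExpressions class and the oneOffCron method are hypothetical names, and creating the rule itself through the SDK is deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.Locale;

final class ScheduleExpressions {

    // Builds a cron expression matching one specific minute in UTC:
    // cron(minutes hours day-of-month month day-of-week year)
    static String oneOffCron(Instant now, Duration delay) {
        ZonedDateTime target = now.plus(delay).atZone(ZoneOffset.UTC);
        // Schedules have one-minute granularity, so round up to the next
        // full minute to avoid producing a moment that is already in the past.
        ZonedDateTime minute = target.truncatedTo(ChronoUnit.MINUTES);
        if (minute.isBefore(target)) {
            minute = minute.plusMinutes(1);
        }
        // "?" in the day-of-week field: the day is already fixed by day-of-month.
        return String.format(
                Locale.ROOT,
                "cron(%d %d %d %d ? %d)",
                minute.getMinute(),
                minute.getHour(),
                minute.getDayOfMonth(),
                minute.getMonthValue(),
                minute.getYear());
    }
}
```

Including the year in the expression means it can only ever match once, which is a useful safety net if removing the rule fails for some reason.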

REST APIs - API Gateway

As you already know by now, we had to consider two different scenarios where we needed to get data from Github, parse it and persist it. One of these scenarios was pretty much defined from the start: live data, where Github POSTs data via a webhook as events on repositories occur. This case requires us to be able to respond to these events appropriately. You’ve probably started seeing the unit ↔ event mapping by now…

We can have a worker that’s responsible for receiving the payload from these webhook events: every time the webhook is triggered a corresponding worker starts, parses the data and then persists it.

Given that the webhook’s payload already contains data about the event itself (e.g. a user commented on a pull request), we don’t need to go through the first step described above (“fetch a message”).

So we’re left with the need to bridge the webhook and our worker unit somehow and, given that we have to provide an endpoint for the webhook to hit, having the event that triggers our worker be a request to a certain endpoint only seemed reasonable.

Fortunately, AWS offers a service that makes it easy to create, manage and expose an API, with direct tie-in to their Lambda service: API Gateway.

You can easily configure API Gateway, creating a resource - webhook - and a method - POST - and then point it to one of your Lambdas. Deploy it when you’re done and that’s it, you’ve got the event that triggers your unit set up.

Feelings, wohoo feelings

We’ve talked about how we used the tools at hand to achieve what we wanted, but what was the overall experience of actually using these tools? Is it easy to get up and running with the AWS solutions?

Development Experience

Starting with AWS Lambda we can’t say developing for it was a breeze. If you go and have a quick look at their Hello World tutorial [go on, do it!] you’ll see that it’s fairly limited.

Lambda supports (only) three environments:

  • Java (Java 8);
  • Node.js (v0.10.36, v4.3.2);
  • Python 2.7.

If you took the time to browse through some of their documentation you were probably quick to see that the focus is on the Node.js and Python environments. This was a bit of a setback when we started working with Lambda. Even though we were aware of this from the start, when we actually started coding our Lambdas we didn’t expect the support for the Java 8 environment to be so poor.

If you want to target AWS Lambda with Java 8 as your main development environment prepare yourself for quite a bit of digging around the AWS Forums, blog posts, etc. How you parse your input data and how you should prepare your output (erroneous or not) is not clearly defined, or even practical at times.

Easiness of Use

We felt really confident working with queues on Amazon SQS, as everything worked out just fine right from the first hacks.

CloudWatch Events were pretty easy to use as well, even if we had issues with authorising events to execute custom lambdas.

In fact, the authorisation/security model is a bit hard to understand, maybe because it is not well documented. You have roles, users and user groups, policies, ARNs: grasping the basics of how all of these rules fit together is not easy at all.

API Gateway is an easy way of connecting lambdas to usable endpoints, even though we found some difficulties working with Velocity templates, a Java-based template engine. The official page claims that:

It permits anyone to use a simple yet powerful template language to reference objects defined in Java code.

The reality is that Velocity doesn’t look easy to use and is mostly annoying when dealing with query string parameters, as it doesn’t offer helper methods to unwrap common objects. For instance, the following template is used to simply convert the query string parameters into a simple JSON object that can be read from the lambda:

#set($params = $input.params().querystring)

#if($params.from != "")
#set($from = """${params.from}""")
#else
#set($from = "null")
#end

#if($params.to != "")
#set($to = """${params.to}""")
#else
#set($to = "null")
#end

#if($params.users != "")
#set($users = $params.users)
#else
#set($users = "[]")
#end

#if($params.timezone != "")
#set($timezone = """${params.timezone}""")
#else
#set($timezone = "null")
#end

{
    "from": $from,
    "to": $to,
    "users": $users,
    "timezone": $timezone
}

Now imagine doing this for every endpoint, and/or with regularly changing specs.
We really feel like Amazon could provide an (optional) automatic wrapping/unwrapping mechanism for connecting API Gateway to lambdas: converting to custom model objects is much easier from your lambda than in a non-programmable Velocity template.
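To give an idea of how little code the same conversion takes inside the lambda, here is a sketch that assumes the raw query string parameters reach the function as a plain map of strings. The ReportRequest class is hypothetical, and the sketch also assumes a comma-separated users parameter rather than the JSON array the template above passes through.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical request model, built inside the lambda from raw query string parameters.
final class ReportRequest {
    final String from;        // null when the parameter is missing or empty
    final String to;          // null when the parameter is missing or empty
    final List<String> users; // empty when the parameter is missing or empty
    final String timezone;    // null when the parameter is missing or empty

    private ReportRequest(String from, String to, List<String> users, String timezone) {
        this.from = from;
        this.to = to;
        this.users = users;
        this.timezone = timezone;
    }

    static ReportRequest fromQueryParams(Map<String, String> params) {
        return new ReportRequest(
                emptyToNull(params.get("from")),
                emptyToNull(params.get("to")),
                splitUsers(params.get("users")),
                emptyToNull(params.get("timezone")));
    }

    private static String emptyToNull(String value) {
        return (value == null || value.isEmpty()) ? null : value;
    }

    // Assumption: users arrive as a comma-separated list, e.g. "alice, bob".
    private static List<String> splitUsers(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Collections.emptyList();
        }
        List<String> users = new ArrayList<>();
        for (String user : value.split(",")) {
            String trimmed = user.trim();
            if (!trimmed.isEmpty()) {
                users.add(trimmed);
            }
        }
        return users;
    }
}
```

The defaults live in one place, they can be unit tested, and adding a parameter is a one-line change instead of another block of template directives.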

Another pain point was the fact that, if the lambda throws an exception, the API Gateway does not treat the non-0 return code as an error, but simply returns the serialised exception in a 200 OK HTTP response. To return custom HTTP error codes you have to set up custom Integration Responses for every endpoint (again), using regular expressions to match the content of the returned error.

In this case too, Amazon should give us easier ways to achieve this, possibly avoiding the need to configure regular expression matchers on the Web UI.
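To illustrate the kind of matching involved, the sketch below reproduces the idea locally: a list of regular expressions is tried against the error message, and the first full match decides the HTTP status. It is only a model of the mechanism as we understand it, not API Gateway’s implementation; the ErrorStatusMapper class and the convention of prefixing exception messages with a status in square brackets are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Local model of "regular expression → HTTP status" matching on error messages.
final class ErrorStatusMapper {

    private static final class Rule {
        final Pattern pattern;
        final int status;

        Rule(Pattern pattern, int status) {
            this.pattern = pattern;
            this.status = status;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final int defaultStatus;

    ErrorStatusMapper(int defaultStatus) {
        this.defaultStatus = defaultStatus;
    }

    // Registers a rule; rules are tried in the order they were added.
    ErrorStatusMapper when(String regex, int status) {
        rules.add(new Rule(Pattern.compile(regex), status));
        return this;
    }

    // The whole message has to match the expression, not just a part of it.
    int statusFor(String errorMessage) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(errorMessage).matches()) {
                return rule.status;
            }
        }
        return defaultStatus;
    }
}
```

A mapper created with a default of 200 and rules for messages starting with "[400]" and "[404]" behaves like the setup described above: anything that doesn’t match one of the expressions falls through to the 200 response.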


Using the tools provided by AWS, such as the CLI or the Web UI, didn’t prove as easy as we wished. We had a very specific issue that we were able to understand only after talking to AWS support, since nobody replied to us in the AWS forums.

The Web UI is OK for initially playing with configurations and trying to understand how the numerous options work and belong together. After a while, though, we felt like we needed a tool for automating most of our work: deploying lambdas, changing API Gateway configurations, etc.

The AWS CLI, which in theory allows you to tweak every single piece of configuration on AWS, isn’t really useful, since commands are very atomic and most likely meant as the base for some other tool. Just look at our instructions on how to set up one single endpoint and make up your own mind on the issue: to us, the AWS CLI was unusable.

In order to deploy lambdas without manually zipping and uploading via the Web interface, we used a very nice Gradle plugin that allowed us to work with most Amazon services.


After our four-month-long experience we feel like AWS offers a complete service set for our use case, so rich that it cannot be matched by any other cloud provider at the moment. The learning curve is also pretty gentle at the beginning, while the Web UI is decent, given the amount of options you need to work with.

The lack of good documentation, the bugs we encountered while trying to configure all services and the poor quality of tooling, though, made us roll our eyes multiple times, and will likely cause us to look for and compare alternative providers next time.
