Learnings from building High Availability(HA) Services

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

At DeltaX, we have been dabbling with Internet Scale and High Availability for our core tracking and ad-serving services. We have had our fair share of battles, wounds, victories and a host of untold stories. Today, I shall dabble into some learnings keeping the stories for another day.

When designing architecture for mission critical systems the two most commonly discussed aspects are scalability and availability. Most often than not both aspects are used interchangeably. Scalability is about being able to handle increasing load while availability is keeping the system operational by decreasing downtime. Designing Highly Available systems is focusing on the qualitative measures to reduce downtime and eliminating the single point of failures (SPOFs). Here are some learning and thoughts on things to consider while architecting an HA system.

1. Accept Failure

This is contrarian to what we set out to achieve but with all things that start in the head, you have to first get the monkey out of your head. So, if someone comes up to you and informs you that have to build a system which has zero downtime and should be running 99.999% uptime (also called five 9s which is a gold standard). Our first reaction would be to ensure we code in such a way that the system will never fail, handle all exceptions, scale to ensure that it can handle increasing load and hence will never have a downtime. Instead for a second, pause and first accept failure. Accepting failure doesn’t mean that you are building for failure but you accept that irrespective of what you do - it can still fail and so you have to consider, reconsider and plan your system around being able to fail and still keep running.

Next two learnings will talk more about how to fail - like a gentleman.

2. Redundancy, Failover and Recovery (avoid SPOF)

Building redundancy is about ensuring that there are alternate paths in the system to keep functioning (albeit at lower capacity) while failover is switching to the alternate path. The switch over ideally has to be automatic to ensure that there is no manual intervention needed. Once we have a system which fails over it’s very important to have a recovery plan to be able to resurrect the failed path otherwise there is a high chance the will result in additional load and may cause congestion or subsequent failures (snowball-effect). The recovery may be automatic or even manual.

Let’s take a classic example of a web server to understand redundancy and failover.

Schematic user -> web server setup

Now let’s add a load balancer in between and have two servers responding to requests; while the load balancer will ensure that whichever server is ‘healthy’ will be the one receiving requests from the load balancer. As soon as it detects that one of them is ‘unhealthy’ it shall redirect the requests to another one.

Schematic user -> web server setup

Although, this ensures that we have redundancy and also automatic failover - the load balancer in itself is now a SPOF. So, let’s try an alternate setup where we have two load balancers and two servers.

Schematic user -> web server setup

This is a simplistic schematic setup; production systems are more complex and have more moving parts. While we ensure automatic failovers it’s really important to be able to recover from failure. A simple example here could be that once the load balancer detects a web server to be ‘unhealthy’ it’s important to ensure that either we are able to automatically recover by swapping out the web server with a healthy one.

3. Performance Monitoring & Alerts

You can’t improve what you can’t measure. Also, for any HA system monitoring and alerts can’t be an afterthought. Monitoring is ensuring you are measuring health and performance indicators while alerts are ensuring you get timely and actionable information about the system.

Bonus Tip: To see if your system can handle failure, failover, recover and you are able to receive alerts while chaos hits the roof - you can simply log into any of your servers and simply power-off! Think this is a joke? Netflix actually built a tool called Chaos Money to do exactly this. Chaos Monkey is a service which identifies groups of systems and randomly terminates one of the systems in the group.

Architecture for DeltaX HA services
DeltaX Architecture for HA services

We leverage the AWS Cloud to the fullest - right from Route53, ELB, EC2 auto-scaling and S3 for the persistent store. I must note here that adopting the cloud doesn’t really mean that you are set for HA but it definitely makes your job easier with a suite of services and health monitoring system.

For redundancy, we use multiple EC2 instances under an Elastic Load Balancer(ELB) for redundancy. In each of the instances, we have multiple workers running using the Node.js cluster module. Failover and recovery are handled at multiple levels. At the worker level, we have the cluster module which instantiates a worker if one dies; monit monitoring the server process within each instance with a trigger for restart if needed. ELB health checks to route traffic between multiple instances; also to ensure auto-scaling requirements are met. Monitoring & alerts are handled through Amazon Cloudwatch and Amazon SNS.

Overall, we still have some areas in the architecture to improvise upon and further eliminate SPOFs. Like any serious HA architecture - you can’t take anything for granted; if you do the Chaos Monkey may strike.

Is the Future of Application Architecture – Serverless?

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

Advancements by cloud-based IAAS providers (Amazon Web Services, Google Cloud and Azure have made on-demand scale and flexibility a reality. Today, as a startup you don’t need to worry about over-provisioning infrastructure, forecasting growth and go over long-term infrastructure contracts to meet your demands. Interestingly, a new suite of cloud services are questioning the very existence of a core aspect of common application architectures - the ‘server’ and are coined as serverless.

What is the ‘server’ in serverless?

Let’s say you wanted to run a service on the cloud - for this, you would need to do the following:

  • Decide the type of computing resources you need. Instance type, cores, memory and storage space.
  • Choose an OS / Machine image to install on the instance
  • Setup / deploy your service

Steps 1 & 2 above constitute the ‘server’ in the serverless paradigm and in effect, these are the steps you wouldn’t have to worry about. All you need to do is to choose your execution environment and submit your code.

Available Options

When it comes to the serverless paradigm - each of the major cloud IAAS providers have launched their own options. Here is a quick summary of options available:

IAASServerless ParadigmSupported EnvironmentsProduction-ready
Amazon Web ServicesAWS LambdaNode.js, Java, Python, C# (.NET Core) 
Microsoft AzureAzure FunctionsNode.js, C#, F#, Python, PHP, and shell 
Google CloudCloud FunctionsNode.js 

RefClick here for a detailed comparison on Stackoverflow

There are slight differences in the extent of support and capabilities but the process to initiate works as follows:

  • Select a development environment
  • Choose the amount of memory, execution timeout etc.
  • Setup a trigger for launch
Proof of Concept

In part, to test drive the paradigm and at the same time build something useful, I worked on two POCs.

Azure Function: Cachewarmer Function

When it comes to our web application, we use Entity Framework as the ORM. Considering the multi-tenant nature of the application and the volume of tables - context initialization takes an unexpectedly long time. It’s for this exact reason we had to build a mechanism to warm the context cache to initialize it and keep it ready for external requests.

Trigger: CRON

Dev Environment: shell

Description: I cooked together a sequence of cURL requests to make pings to a special endpoint on the web application which initiates a context load. Considering we have over 500 tenants we had to batch a series of requests and to avoid hitting the max execution time I had to split this into two separate functions.

Honestly, this was really a trivial function, but it is exactly why having a serverless architecture was justified. Not to forget, we were up and running within 20 mins.

AWS Lambda: Slackbot dxdb

This was in retrospective a solid use case. Let me take a deep dive onto this one:

Purpose: As noted earlier, we have over 500 tenant databases. When it comes to querying the databases - it’s pretty cumbersome to connect to them individually using SMSS and then run individual queries. When it comes to executing small queries to check data; it would be pretty useful to simply fire the query in the Slack channel and see the results. An unexpected consequence of using Slack is also that one can fire the query from the Slack mobile application as well and see the results on the go.

Features Supported:

  • Detect the DB to connect with intelligently from the schema
  • Support delayed response. Some queries can take longer to execute while Slack for an immediate response has a window of 3 seconds.
  • Formatting output to the extent possible
  • Minimal error notifications

How it works? Slack command dxdb

  • Every invocation of the command makes a POST request to the AWS API Gateway with the command and the request text; in our case the query.
  • The AWS API Gateway invokes the AWS lambda function dxdbExecuteSQL and passes the request params. Tip: The AWS API Gateway is probably the most underrated yet one of the most powerful and flexible services AWS has launched. Will explore this in the future.
  • dxdbExecuteSQL function authenticates the request, does minimal checks on the kind of queries (in our case only read-only) and does two things.First formats the intermediate response in the form of MSSQL prompt to be sent back to Slack through the API gateway. Next invoke the dxdbDelayedSlackResponse lambda function.
  • dxdbDelayedSlackResponse lambda function parses the query, identifies the tenant, fires the query, reads the results, formats the response and makes POST request back to Slack.

Although the setup is complex and layered, I only had to focus on the workflow and the business logic; the effort of picking an instance, setting it up and keeping it running was not something I had to worry about. Another interesting thing about this setup is that - the function is not running all the time, it is only executed on invocation and the icing on the cake is that you are only billed for the time it executes in increments of 100ms.

Code: Project is available on Github.

Follow-up Thoughts

Going serverless is an extension of adopting the cloud but demands a change in the thought process of layering your architecture. The recent trend around microservices-based architecture also fits well with the serverless paradigm.

Interestingly, each of the cloud services offers a minimal code editor. I can see how in the future you could probably have a full-fledged IDE available at your disposal. Looking at the pace of innovation, we are another step closer to not just programming for the cloud but literally in the cloud.

Video Transcoding on the AWS Cloud

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

Video ad-serving is a complex beast given the sheer expressiveness of the medium and unpredictable client-side bandwidth that’s required. At DeltaX, our ad-server is now also Youtube certified for VAST in-stream ads. VAST (Video Ad Serving Template) is an XML-based standard defined by IAB standard for video ad-serving. In the case of video ad-serving, the ad-server responds with multiple video assets in different formats, resolution, and bitrate while the VAST compatible video player picks the most appropriate video asset based on the host platform, bandwidth, and other client considerations. For this to work as expected, it’s important to transcode the media file provided by an advertiser to different formats, resolution, quality and specs beforehand.

Setting up a Elastic Transcoder Pipeline on AWS

Transcoding is the process of converting a media file from one format, resolution, quality and specs to another. In the past, a transcoding pipeline would require a lot of heavy lifting on the software and hardware front. Today, using the cloud you can setup a transcoding pipeline in a matter of minutes. Considering we use Amazon Web Services to host and scale our ad-server - the Amazon Elastic Transcoder was a great fit. Expectedly, it also plays well with Amazon S3 and Amazon Cloudfront.

Here is how we setup the video transcoding pipeline for a VAST ad-server:

1. Create Custom Presets
Custom Presets

Here you can start with a pre-existing preset. Amazon Elastic Transcoder provides comprehensive options to specify the codec, bit rate, number of key frames, sizing policy and aspect ratio.

VAST Presets

At DeltaX we have fine tuned our presets to optimally be able to serve for all platforms.

2. Setup a Pipeline
Transcoding Pipeline

Pipeline acts as a queue for various transcoding jobs. It also helps you configure the Amazon S3 source and output buckets.

3. Setup a Job
Transcoding Job

Here is where you specify the input source file and choose one or many output presets (configured in step 1) to generate transcoded output files.

4 Job status and Completion
Transcoding Job Status

You can track the status of your job on their dashboard manually. Once the status is complete you can visit the bucket/prefix and see the transcoded files.

Transcoding Job Input
Transcoding Job Output

You can see how a 720p HD file was transcoded along with thumbnails of output files of varying resolution and bitrates. If you notice the original file size and the ones which were transcoded, you would have already figured out the amount of bandwidth saving along with ensuring that the user wouldn’t have to wait very long for the video ad to load.

Closing Thoughts

This is a classic example of how with the emergence of the cloud ecosystem infinite scale and on-demand can go hand in hand. For startups, the cloud is an amazing leveler to help innovate and get to market faster.

Look forward to sharing more tidbits, optimizations and architecture considerations while building the ad-server in follow-up posts. Ending with a quote (modified to suit the blog post) from one of my favorite movies TROY - “If they ever tell our story let them say that we walked with giants. Startups rise and fall like the winter wheat, but these names will never die. Let them say we lived in the time of Azure, tamer of the Microsoft stack. Let them say we lived in the time of AWS.”

Math.AI using Personality Insights service from IBM Watson

How it all started - Xhacknight Hackathon

Me and Amrith spontaneously decided to make it to the Xhacknight organized by XHackers and sponsored by IBM Bluemix. It turned out to be an amazing learning experience.

What we built - Match.AI

1. Reviews are everywhere
With the boom in user generated content (UCG) - reviews are everywhere. Be it the kind of product you are buying, restaurant you want to eat out or places you want to stay at. Each of us do scout through reviews for respective opinions before making a decision.

2. Sadly - Online reviews are broken!
Each person's likings and preferences are different. None of the UCG sites allow users to take into consideration the person's personality match to that of the reviewers.

3. PersonalityMatch Index (PMI)
Using Personality Insights service from IBM Watson we were able to capture the personality profile of a user. On top of this, we built a PMI algorithm to match two different personality profiles. Using IBM Bluemix we were able to deploy this service using Node.js and expose it through REST endpoints (/map and /match)

4. Xamarin - Demo Reviews Mobile App
We also built a barebones Xamarin mobile app for iOS and Android while allow users to login through Facebook Connect, build a user personality profile using his Facebook data and then finally use the REST end-points to show a PMI rating along with the reviews.

5. Use-cases through API
The Match.AI PersonalityMatch Index REST API we built was generic and could be used for quite a few use cases:
- User Review Personalization
- Dating App Recommendations
- Resume Classification

Reference Links

API service: https://github.com/whiteboardmonk/ibm-bluemix-pi-xhacknight
Xamarin app: https://github.com/amrithyerramilli/xhacknight-xamarin-pmi-app

Presentation after Hackathon

I can still feel you

Udaipur - Jan 2012

Shimmering sunrise, heartwarming sunsets
Calm waters, chilly breeze
I can still feel you
With my closed eyes

City walks, strangers and talk
Some serendipity, some delight
I can still feel you
With my closed eyes

Narrow lanes, princely feel
Pitch dark skies, illuminated sights
I can still feel you
With my closed eyes

Wandering souls, togetherness and love
I can still feel you
With my closed eyes

- Akky

Creative Commons License

I want you by my side

Golden Gate Bridge, San Francisco - August 2009
Golden Gate Bridge, San Francisco - August 2009

I want you by my side
On a quiet escape
To find solace
In an unknown place
With you right by my side

I want you by my side
To weather through
A a thunderstorm
On a voyage
With you right by my side

I want you by my side
To cuddle through
A cold winter night
With nothing in sight
With you right by my side

I want you by my side
To chase
A distant dream
Outside our realms
With you right by my side

- Akky

Creative Commons License

Try but don’t Cry

Rome - 2011
Rome - 2011

I tried but failed,
I cried and nothing changed.

I shall again try,
But this time I won't cry.
For every time I cry,
I loose hope to try.

For I still remember,
A line from my school days
Try try and you will succeed

For every failed try I make,
I learn how not to partake.
For every chance I take,
Is my another shot at the cake.

- Akky

Creative Commons License

P.S: Reads very much like a nursery rhyme 🙂

I am Anna Hazare

Portrait of Anna Hazare by Siddharth Iyer (link below)

I have bribed a few, Took the easy way through
But all along the way, I knew I was not true

Then Along came a man, Who voiced my opinion
His words echoed in the hearts of a billion population
Showed us how we could stand for what we believe in
And now I say no more to corruption, Bcz I am Anna Hazare.

I read in the books, about the father of our nation
But none after him taught me, how to put those words into action.

Then Along came a man, Who voiced my opinion
His words echoed in the hearts of a billion population
Showed us the way to put Gandhi's words into action
And now I say no more to corruption, Bcz I am Anna Hazare

I watch in dismay, as bigger scams unfold everyday
But I see no punitive action. I feet robbed; I feet raped.

Then Along came a man, Who voiced my opinion.
His words echoed in the hearts of a billion population,
Talked less, but prompted us to take the first step.
And now I say no more to corruption, Bcz I am Anna Hazare.

I was not born when freedom fighters gave their life for the nation.
But if we don't stand up today, it would mean disrespect to their contribution.

Then Along came a man, Who voiced my opinion
His words echoed in the hearts of a billion population,
He shouted a battle cry, to fight for India, Against Corruption
And now I say no more to corruption
Bcz I am Anna Hazare

- Written by Akky

Creative Commons License

P.S.: Image uploaded above is a Portrait of Anna Hazare by Siddharth Iyer (source).