In the second half of 2016, we decided to migrate our multi-tenant app from bare-metal servers to Azure. While you can find numerous benchmarks for various cloud platforms, there are very few relatable drill-downs on the thought process behind such migrations to the cloud. More importantly, this was not just a migration; it was literally a war with all hands on deck. Keeping the existing usage, client data, and growth intact, we were able to migrate over 1.4TB of data and existing clients to the cloud successfully.
This is how we declared WAR
Finally, we emerged as winners after the last tenant migration
The overall response and feedback after the talk was humbling; everyone was amazed at what we were able to achieve.
So, here is my humble request to the team
If I look back at our journey, we have recovered from massive failures, seen through classic disasters, and built innovative and meaningful solutions. Whether we are moving mountains, working on disaster recovery or building that fancy little new feature, let’s share our story on this blog.
At DeltaX, we have been dabbling with Internet Scale and High Availability for our core tracking and ad-serving services. We have had our fair share of battles, wounds, victories and a host of untold stories. Today, I shall share some learnings, keeping the stories for another day.
When designing architecture for mission-critical systems, the two most commonly discussed aspects are scalability and availability. More often than not, the two terms are used interchangeably. Scalability is about being able to handle increasing load, while availability is about keeping the system operational by reducing downtime. Designing Highly Available (HA) systems means focusing on the qualitative measures that reduce downtime and eliminating single points of failure (SPOFs). Here are some learnings and thoughts on things to consider while architecting an HA system.
1. Accept Failure
This is contrarian to what we set out to achieve, but as with all things that start in the head, you first have to get the monkey out of your head. Suppose someone comes up to you and informs you that you have to build a system which has zero downtime and should run at 99.999% uptime (also called five 9s, which is a gold standard). Our first reaction would be to ensure we code in such a way that the system will never fail: handle all exceptions, scale to handle increasing load and hence never have any downtime. Instead, pause for a second and first accept failure. Accepting failure doesn’t mean that you are building for failure; it means you accept that irrespective of what you do, it can still fail, and so you have to plan your system around being able to fail and still keep running.
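To put the five 9s target in perspective, the downtime budget can be worked out directly from the uptime percentage. The sketch below is only arithmetic; the function name is illustrative and not part of any tool we use.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period for a given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Compare a few common availability targets over one (non-leap) year.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

At five 9s the budget is a little over five minutes a year, which is why planning for failure matters more than hoping to avoid it.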
The next two learnings talk more about how to fail, like a gentleman.
2. Redundancy, Failover and Recovery (avoid SPOF)
Building redundancy is about ensuring that there are alternate paths in the system to keep functioning (albeit at lower capacity), while failover is switching to the alternate path. The switchover ideally has to be automatic so that no manual intervention is needed. Once we have a system which fails over, it’s very important to have a recovery plan to resurrect the failed path; otherwise, there is a high chance that the failure will put additional load on the remaining paths and cause congestion or subsequent failures (the snowball effect). The recovery may be automatic or even manual.
Let’s take a classic example of a web server to understand redundancy and failover.
Now let’s add a load balancer in between and have two servers responding to requests; the load balancer ensures that only a server it considers ‘healthy’ receives requests. As soon as it detects that one of them is ‘unhealthy’, it redirects requests to the other one.
Although this ensures that we have redundancy and also automatic failover, the load balancer itself is now a SPOF. So, let’s try an alternate setup where we have two load balancers and two servers.
This is a simplistic schematic setup; production systems are more complex and have more moving parts. While we ensure automatic failovers, it’s really important to be able to recover from failure. A simple example: once the load balancer detects a web server to be ‘unhealthy’, we should be able to recover automatically by swapping out that web server with a healthy one.
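The redundancy, failover and recovery ideas above can be sketched in a few lines. This is a toy in-memory model, not how ELB or any real load balancer is implemented; the `Backend` and `LoadBalancer` classes and the server names are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    healthy: bool = True


class LoadBalancer:
    """Toy round-robin balancer that skips unhealthy backends."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends
        self._next = 0

    def route(self) -> Backend:
        # Failover: try each backend at most once, starting at the round-robin position.
        count = len(self.backends)
        for offset in range(count):
            candidate = self.backends[(self._next + offset) % count]
            if candidate.healthy:
                self._next = (self._next + offset + 1) % count
                return candidate
        raise RuntimeError("no healthy backends available")

    def recover(self, failed_name: str, replacement: Backend) -> None:
        # Recovery: swap the failed backend for a fresh one so capacity is restored.
        self.backends = [replacement if b.name == failed_name else b for b in self.backends]


if __name__ == "__main__":
    lb = LoadBalancer([Backend("web-1"), Backend("web-2")])
    lb.backends[0].healthy = False          # simulate a failure
    print(lb.route().name)                  # traffic fails over to the healthy server
    lb.recover("web-1", Backend("web-3"))   # swap in a replacement
    print([b.name for b in lb.backends])
```

Note that without the `recover` step the single remaining server carries all the load, which is the snowball risk described earlier.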
3. Performance Monitoring & Alerts
You can’t improve what you can’t measure. Also, for any HA system monitoring and alerts can’t be an afterthought. Monitoring is ensuring you are measuring health and performance indicators while alerts are ensuring you get timely and actionable information about the system.
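As a rough illustration of the split between monitoring (collecting indicators) and alerting (turning them into actionable messages), here is a minimal sketch. The metric names and threshold values are hypothetical, and a real setup would rely on a monitoring service rather than hand-rolled checks.

```python
from typing import Dict, List

# Hypothetical health and performance indicators with illustrative limits.
THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 80.0,
    "error_rate_percent": 1.0,
    "p95_latency_ms": 500.0,
}


def evaluate_alerts(metrics: Dict[str, float],
                    thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return one alert message per indicator that is missing or over its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A silent metric is itself worth an alert: you can't improve what you can't measure.
            alerts.append(f"{name}: no data reported")
        elif value > limit:
            alerts.append(f"{name}: {value} exceeds threshold {limit}")
    return alerts


if __name__ == "__main__":
    for alert in evaluate_alerts({"cpu_percent": 91.5, "error_rate_percent": 0.2}):
        print(alert)
```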
Bonus Tip: To see if your system can handle failure, fail over, recover and still send you alerts while chaos hits the roof, you can simply log into any of your servers and power it off! Think this is a joke? Netflix actually built a tool called Chaos Monkey to do exactly this. Chaos Monkey is a service which identifies groups of systems and randomly terminates one of the systems in the group.
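The idea of picking a random victim per group is simple enough to sketch. This is not Netflix's implementation; the `unleash_chaos` function, the group names and the rule of skipping groups with a single instance are assumptions added for illustration, and `terminate` stands in for whatever actually powers off a machine.

```python
import random
from typing import Callable, Dict, List, Optional


def unleash_chaos(groups: Dict[str, List[str]],
                  terminate: Callable[[str], None],
                  rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Terminate one randomly chosen instance in every group that has redundancy."""
    rng = rng or random.Random()
    victims: Dict[str, str] = {}
    for group, instances in groups.items():
        if len(instances) < 2:
            # Assumption of this sketch: never kill the only instance of a group.
            continue
        victim = rng.choice(instances)
        terminate(victim)
        victims[group] = victim
    return victims


if __name__ == "__main__":
    killed: List[str] = []
    groups = {"web": ["web-1", "web-2", "web-3"], "db": ["db-1"]}
    print(unleash_chaos(groups, killed.append))
```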
Architecture for DeltaX HA services
We leverage the AWS Cloud to the fullest - right from Route53, ELB, EC2 auto-scaling and S3 for the persistent store. I must note here that adopting the cloud doesn’t really mean that you are set for HA but it definitely makes your job easier with a suite of services and health monitoring system.
For redundancy, we use multiple EC2 instances under an Elastic Load Balancer (ELB). In each of the instances, we have multiple workers running using the Node.js cluster module. Failover and recovery are handled at multiple levels. At the worker level, we use the cluster module to instantiate a new worker if one dies. Within each instance, monit monitors the server process and triggers a restart if needed. Across instances, ELB health checks route traffic to healthy instances and also ensure auto-scaling requirements are met. Monitoring & alerts are handled through Amazon CloudWatch and Amazon SNS.
Overall, we still have some areas in the architecture to improve upon and SPOFs to eliminate. Like any serious HA architecture, you can’t take anything for granted; if you do, the Chaos Monkey may strike.
Advancements by cloud-based IAAS providers (Amazon Web Services, Google Cloud and Azure) have made on-demand scale and flexibility a reality. Today, as a startup you don’t need to worry about over-provisioning infrastructure, forecasting growth or going over long-term infrastructure contracts to meet your demands. Interestingly, a new suite of cloud services is questioning the very existence of a core aspect of common application architectures, the ‘server’, and has been coined serverless.
What is the ‘server’ in serverless?
Let’s say you wanted to run a service on the cloud - for this, you would need to do the following:
1. Decide the type of computing resources you need: instance type, cores, memory and storage space.
2. Choose an OS / machine image to install on the instance.
3. Set up / deploy your service.
Steps 1 & 2 above constitute the ‘server’ in the serverless paradigm and in effect, these are the steps you wouldn’t have to worry about. All you need to do is to choose your execution environment and submit your code.
Available Options
When it comes to the serverless paradigm, each of the major cloud IAAS providers has launched its own option. Here is a quick summary of options available:
There are slight differences in the extent of support and capabilities but the process to initiate works as follows:
Select a development environment
Choose the amount of memory, execution timeout etc.
Setup a trigger for launch
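To give a feel for what "submitting your code" looks like, here is a minimal function sketch. The `handler(event, context)` shape follows the convention AWS Lambda uses for Python; other providers use different signatures, and the event fields and response body here are purely illustrative.

```python
import json


def handler(event, context):
    """Entry point the platform calls on each trigger; there is no server code to write."""
    # 'name' is a hypothetical field in the trigger payload.
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


if __name__ == "__main__":
    # Locally the function can be exercised like any other; the context is unused here.
    print(handler({"name": "DeltaX"}, None))
```

Memory, timeout and the trigger are configured on the platform, not in the code, which is what keeps the function itself this small.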
Proof of Concept
In part, to test drive the paradigm and at the same time build something useful, I worked on two POCs.
Azure Function: Cachewarmer Function
When it comes to our web application, we use Entity Framework as the ORM. Considering the multi-tenant nature of the application and the volume of tables - context initialization takes an unexpectedly long time. It’s for this exact reason we had to build a mechanism to warm the context cache to initialize it and keep it ready for external requests.
Trigger: CRON
Dev Environment: shell
Description: I cooked together a sequence of cURL requests that ping a special endpoint on the web application, which initiates a context load. Considering we have over 500 tenants, we had to batch the requests, and to avoid hitting the max execution time I had to split them across two separate functions.
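The original function was a shell script built around cURL; the sketch below only illustrates the batching and splitting idea in Python. The tenant names, the batch size, the `app.example.com` host and the `/warmup` endpoint are all hypothetical placeholders, not our actual setup.

```python
from typing import List, Sequence
from urllib.request import urlopen


def split_between_functions(tenants: Sequence[str], function_count: int = 2) -> List[List[str]]:
    """Divide tenants into at most function_count contiguous shares of similar size."""
    if not tenants:
        return []
    share = -(-len(tenants) // function_count)  # ceiling division
    return [list(tenants[i:i + share]) for i in range(0, len(tenants), share)]


def batches(items: Sequence[str], size: int) -> List[List[str]]:
    """Chunk one function's share into smaller batches of requests."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def warm(tenant: str, base_url: str = "https://app.example.com") -> int:
    """Ping the (hypothetical) warm-up endpoint for one tenant and return the HTTP status."""
    with urlopen(f"{base_url}/warmup?tenant={tenant}", timeout=30) as response:
        return response.status


if __name__ == "__main__":
    tenants = [f"tenant{i}" for i in range(500)]
    first_share, second_share = split_between_functions(tenants)
    print(len(first_share), len(second_share), len(batches(first_share, 50)))
```

Each share would be handled by its own scheduled function so that neither run approaches the platform's execution time limit.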
Honestly, this was really a trivial function, but it is exactly why having a serverless architecture was justified. Not to forget, we were up and running within 20 mins.
AWS Lambda: Slackbot dxdb
This was, in retrospect, a solid use case. Let me take a deep dive into this one:
Purpose: As noted earlier, we have over 500 tenant databases. When it comes to querying the databases, it’s pretty cumbersome to connect to them individually using SSMS and then run individual queries. For small queries to check data, it is much more useful to simply fire the query in the Slack channel and see the results. An unexpected benefit of using Slack is that one can also fire the query from the Slack mobile application and see the results on the go.
Features Supported:
Detect the DB to connect with intelligently from the schema
Support delayed response. Some queries can take longer to execute, while Slack expects an immediate response within a window of 3 seconds.
Formatting output to the extent possible
Minimal error notifications
How does it work?
Every invocation of the command makes a POST request to the AWS API Gateway with the command and the request text; in our case the query.
The AWS API Gateway invokes the AWS Lambda function dxdbExecuteSQL and passes the request params. Tip: The AWS API Gateway is probably the most underrated yet one of the most powerful and flexible services AWS has launched. I will explore this in the future.
The dxdbExecuteSQL function authenticates the request, does minimal checks on the kind of query (in our case only read-only queries are allowed) and then does two things. First, it formats the intermediate response in the form of an MSSQL prompt to be sent back to Slack through the API Gateway. Next, it invokes the dxdbDelayedSlackResponse Lambda function.
The dxdbDelayedSlackResponse Lambda function parses the query, identifies the tenant, fires the query, reads the results, formats the response and makes a POST request back to Slack.
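The two-function flow above can be sketched as plain Python, with the AWS and Slack plumbing passed in as callables. This is not the code of dxdbExecuteSQL or dxdbDelayedSlackResponse: the token value, the naive SELECT-only check, the prompt format and the helper names are assumptions for illustration, while `token`, `text` and `response_url` are fields Slack sends with slash commands.

```python
from typing import Callable, Dict, Iterable, Tuple

EXPECTED_TOKEN = "hypothetical-slack-token"  # placeholder; never hard-code real secrets


def is_read_only(query: str) -> bool:
    # Deliberately naive check for the sketch; a real guard would need to be much stricter.
    return query.strip().lower().startswith("select")


def execute_sql(request: Dict[str, str], invoke_delayed: Callable[[Dict[str, str]], None]) -> Dict[str, str]:
    """First function: authenticate, validate, hand off, and answer within Slack's 3-second window."""
    if request.get("token") != EXPECTED_TOKEN:
        return {"text": "Unauthorized request."}
    query = request.get("text", "")
    if not is_read_only(query):
        return {"text": "Only read-only (SELECT) queries are allowed."}
    # Hand the slow work to the second function (e.g. an asynchronous invocation).
    invoke_delayed({"query": query, "response_url": request["response_url"]})
    # Immediate response styled like a SQL prompt.
    return {"response_type": "in_channel", "text": f"1> {query}\n2> GO"}


def delayed_slack_response(payload: Dict[str, str],
                           run_query: Callable[[str], Iterable[Tuple]],
                           post_to_slack: Callable[[str, Dict[str, str]], None]) -> None:
    """Second function: run the query and POST the formatted rows back to Slack."""
    rows = run_query(payload["query"])
    body = "\n".join(" | ".join(str(value) for value in row) for row in rows)
    post_to_slack(payload["response_url"], {"response_type": "in_channel", "text": body})
```

Splitting the work this way keeps the first function fast enough for the immediate reply while the second one is free to take as long as the query needs.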
Although the setup is complex and layered, I only had to focus on the workflow and the business logic; the effort of picking an instance, setting it up and keeping it running was not something I had to worry about. Another interesting thing about this setup is that the function is not running all the time; it is only executed on invocation, and the icing on the cake is that you are only billed for the time it executes, in increments of 100ms.
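The effect of billing in 100ms increments is easy to work out: execution time is rounded up to the next increment. The sketch below covers only that rounding, using the increment mentioned above; it does not model actual prices, and the function name is illustrative.

```python
import math


def billed_milliseconds(duration_ms: float, increment_ms: int = 100) -> int:
    """Round an execution time up to the next billing increment."""
    return math.ceil(duration_ms / increment_ms) * increment_ms


if __name__ == "__main__":
    for duration in (1, 100, 230):
        print(f"{duration}ms of execution is billed as {billed_milliseconds(duration)}ms")
```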
Going serverless is an extension of adopting the cloud but demands a change in the thought process of layering your architecture. The recent trend around microservices-based architecture also fits well with the serverless paradigm.
Interestingly, each of the cloud services offers a minimal code editor. I can see how in the future you could probably have a full-fledged IDE available at your disposal. Looking at the pace of innovation, we are another step closer to not just programming for the cloud but literally in the cloud.
Video ad-serving is a complex beast given the sheer expressiveness of the medium and the unpredictable client-side bandwidth that’s required. At DeltaX, our ad-server is now also YouTube certified for VAST in-stream ads. VAST (Video Ad Serving Template) is an XML-based standard defined by the IAB for video ad-serving. In the case of video ad-serving, the ad-server responds with multiple video assets in different formats, resolutions, and bitrates, while the VAST-compatible video player picks the most appropriate video asset based on the host platform, bandwidth, and other client considerations. For this to work as expected, it’s important to transcode the media file provided by an advertiser to different formats, resolutions, quality levels and specs beforehand.
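To illustrate why multiple renditions matter, here is a sketch of how a player might choose among the assets in a VAST response. Real players apply their own heuristics; the `MediaFile` fields, the selection rule (highest bitrate that fits the bandwidth, else the lightest playable file) and the example URLs are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MediaFile:
    url: str
    mime_type: str
    bitrate_kbps: int
    width: int
    height: int


def pick_media_file(files: Sequence[MediaFile],
                    supported_types: Sequence[str],
                    bandwidth_kbps: int) -> MediaFile:
    """Pick the best rendition the client can both decode and afford to download."""
    playable = [f for f in files if f.mime_type in supported_types]
    if not playable:
        raise ValueError("no media file in a supported format")
    affordable = [f for f in playable if f.bitrate_kbps <= bandwidth_kbps]
    if affordable:
        return max(affordable, key=lambda f: f.bitrate_kbps)
    # Nothing fits the bandwidth: fall back to the lightest playable rendition.
    return min(playable, key=lambda f: f.bitrate_kbps)


if __name__ == "__main__":
    renditions = [
        MediaFile("https://cdn.example.com/ad-360p.mp4", "video/mp4", 600, 640, 360),
        MediaFile("https://cdn.example.com/ad-720p.mp4", "video/mp4", 2400, 1280, 720),
        MediaFile("https://cdn.example.com/ad-720p.webm", "video/webm", 2000, 1280, 720),
    ]
    print(pick_media_file(renditions, ["video/mp4"], bandwidth_kbps=1500).url)
```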
Setting up an Elastic Transcoder Pipeline on AWS
Transcoding is the process of converting a media file from one format, resolution, quality and spec to another. In the past, a transcoding pipeline would require a lot of heavy lifting on the software and hardware front. Today, using the cloud, you can set up a transcoding pipeline in a matter of minutes. Considering we use Amazon Web Services to host and scale our ad-server, Amazon Elastic Transcoder was a great fit. As expected, it also plays well with Amazon S3 and Amazon CloudFront.
Here is how we set up the video transcoding pipeline for a VAST ad-server:
1. Create Custom Presets
Here you can start with a pre-existing preset. Amazon Elastic Transcoder provides comprehensive options to specify the codec, bit rate, number of key frames, sizing policy and aspect ratio.
At DeltaX, we have fine-tuned our presets to serve all platforms optimally.
2. Setup a Pipeline
A pipeline acts as a queue for various transcoding jobs. It also helps you configure the Amazon S3 source and output buckets.
3. Setup a Job
Here is where you specify the input source file and choose one or many output presets (configured in step 1) to generate transcoded output files.
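The same job can also be created programmatically instead of through the console. The sketch below only builds the request; the pipeline ID, preset IDs, key names and the `build_job_request` helper are hypothetical placeholders. The dictionary keys follow the Elastic Transcoder CreateJob request, so the result could be passed to boto3 as `boto3.client("elastictranscoder").create_job(**request)`, assuming boto3 and credentials are available.

```python
from typing import Dict


def build_job_request(pipeline_id: str,
                      input_key: str,
                      presets: Dict[str, str],
                      output_prefix: str = "transcoded/") -> Dict[str, object]:
    """Build one transcoding job with an output per preset.

    presets maps an output file suffix (e.g. '720p.mp4') to a preset ID from step 1.
    """
    # Derive a base name from the input key, e.g. 'uploads/ad.mov' -> 'ad'.
    base_name = input_key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return {
        "PipelineId": pipeline_id,
        "Input": {"Key": input_key},
        "OutputKeyPrefix": output_prefix,
        "Outputs": [
            {"Key": f"{base_name}-{suffix}", "PresetId": preset_id}
            for suffix, preset_id in presets.items()
        ],
    }


if __name__ == "__main__":
    request = build_job_request(
        pipeline_id="hypothetical-pipeline-id",
        input_key="uploads/ad.mov",
        presets={"720p.mp4": "hypothetical-preset-720p", "360p.mp4": "hypothetical-preset-360p"},
    )
    print(request)
```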
4. Job Status and Completion
You can track the status of your job manually on the dashboard. Once the status is complete, you can visit the bucket/prefix and see the transcoded files.
You can see how a 720p HD file was transcoded along with thumbnails of output files of varying resolution and bitrates. If you notice the original file size and the ones which were transcoded, you would have already figured out the amount of bandwidth saving along with ensuring that the user wouldn’t have to wait very long for the video ad to load.
Closing Thoughts
This is a classic example of how with the emergence of the cloud ecosystem infinite scale and on-demand can go hand in hand. For startups, the cloud is an amazing leveler to help innovate and get to market faster.
Look forward to sharing more tidbits, optimizations and architecture considerations while building the ad-server in follow-up posts. Ending with a quote (modified to suit the blog post) from one of my favorite movies TROY - “If they ever tell our story let them say that we walked with giants. Startups rise and fall like the winter wheat, but these names will never die. Let them say we lived in the time of Azure, tamer of the Microsoft stack. Let them say we lived in the time of AWS.”
Amrith and I spontaneously decided to make it to the Xhacknight organized by XHackers and sponsored by IBM Bluemix. It turned out to be an amazing learning experience.
What we built - Match.AI
1. Reviews are everywhere
With the boom in user-generated content (UGC), reviews are everywhere, be it the kind of product you are buying, the restaurant you want to eat out at or the places you want to stay at. Each of us scouts through reviews for opinions before making a decision.
2. Sadly - Online reviews are broken!
Each person's likings and preferences are different. None of the UGC sites allow users to take into consideration how closely their personality matches that of the reviewers.
3. PersonalityMatch Index (PMI)
Using the Personality Insights service from IBM Watson, we were able to capture the personality profile of a user. On top of this, we built a PMI algorithm to match two different personality profiles. Using IBM Bluemix, we were able to deploy this service using Node.js and expose it through REST endpoints (/map and /match).
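To make the idea of matching two personality profiles concrete, here is one possible scoring approach. This is not the PMI algorithm we built at the hacknight (and our service was written in Node.js); the trait names, the 0 to 1 score range and the "100 minus mean gap" formula are assumptions chosen purely for illustration.

```python
from typing import Dict


def personality_match_index(profile_a: Dict[str, float], profile_b: Dict[str, float]) -> float:
    """Score two trait profiles from 0 (opposite) to 100 (identical).

    Each profile maps a trait name to a score between 0 and 1.
    """
    shared = sorted(set(profile_a) & set(profile_b))
    if not shared:
        raise ValueError("profiles share no traits")
    # Hypothetical rule: the smaller the average gap across traits, the better the match.
    mean_gap = sum(abs(profile_a[t] - profile_b[t]) for t in shared) / len(shared)
    return round(100 * (1 - mean_gap), 1)


if __name__ == "__main__":
    reader = {"openness": 0.8, "extraversion": 0.4}
    reviewer = {"openness": 0.6, "extraversion": 0.9}
    print(personality_match_index(reader, reviewer))
```

A review app could then show this score next to each review so that readers can weigh opinions from like-minded reviewers more heavily.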
4. Xamarin - Demo Reviews Mobile App
We also built a barebones Xamarin mobile app for iOS and Android which allows users to log in through Facebook Connect, builds a user personality profile from their Facebook data and finally uses the REST endpoints to show a PMI rating along with the reviews.
5. Use-cases through API
The Match.AI PersonalityMatch Index REST API we built was generic and could be used for quite a few use cases:
- User Review Personalization
- Dating App Recommendations
- Resume Classification
Recently I understood, there is a difference between 'understanding something' and 'being able to cope with it'. I am still trying to cope with this. - Akky