What is likable about Amazon’s Elasticsearch on AWS?

We have been using Amazon Elasticsearch for the last 6 months and it’s been an overall pleasant and predictable experience. Before opting for it we had our own doubts reading the caveats of Amazon’s cloud offering here and here. I must say, that some of the points noted in the posts are pertinent; but like all things ‘it depends’ and so we wanted to put across some things that we like about AWS Elasticsearch cloud offering and why it worked for us and probably why you should also consider.

1. Integration with AWS ecosystem

This is a big plus for us as we use Amazon Kinesis Data Firehose and it supports Amazon Elasticsearch as a destination. We were able to setup a test domain in minutes and were able to benchmark our production workload very easily. Even when it comes to authentication AWS IAM works great for us and we were able to use it as an authentication layer for our workers. You can read more about our real-time stream processing pipeline here which primarily uses Amazon Kinesis Data Firehose.

Kinesis Firehose - Elasticsearch

2. Cluster Management (scale out/up)

There were at least two separate instances in a span of the first 8 weeks where we had to revisit the cluster setup and space provisioning. In both the instances, with a few clicks, we were able to scale up and scale out with a few clicks without any downtime with close to few GBs of data which was getting shipped to Elasticsearch close to peak load.

Cluster Deploy - Elasticsearch

3. Monitoring & Alerts

Amazon Elasticsearch by default logs a plethora of important metrics to AWS Cloudwatch. This was interesting because when we started working with Elasticsearch we didn’t completely know what we were getting into. Overall, it’s a very powerful system which could fit many different use cases (timeseries aggregation in our case) and like any powerful system, it takes time to master. Thankfully, Amazon’s Elasticsearch offering makes available quite a few critical metrics and as you get deeper into the ecosystem - you can tweak your cluster to your workloads much better.

Looking at Cloudwatch and Elasticsearch metrics we were able to fine-tune: # Cluster size # Impact of number of shards & replication factor # Impact of queries

Read Latency - Elasticsearch
Disk Queue - Elasticsearch

This proved to be really useful and we could track the performance metrics closely pre and post changes. Cloudwatch and it’s tight integration with Amazon SNS makes the setup of critical alerts on SMS and email really easy and something you should not miss.

4. Snapshot Recovery / Backup to S3

Although we didn’t have to use the automatic snapshots from S3 it would give anyone additional peace of mind knowing that Amazon takes daily automated snapshots in a pre-configured S3 bucket. Considering our indexes are daily, and at times we need to be able to do benchmarking on our own data - we also run daily manual snapshots through AWS Lambda for backups to S3. This allows us the ability to recover from our own snapshots if needed.

5. Elasticsearch Upgrades

Lastly, Amazon has been extremely proactive with regards to keeping pace with Elasticseach releases. At the time of this post, v6.2 is the most recent version for Elasticsearch and that’s the same version available with Amazon. Please note that version upgrades are not available from the interface and need to be handled by manually taking a snapshot and restoring it to a new domain. I hope Amazon is working on making it a seamless experience as well. Unlike quite a few cloud offerings it’s commendable to see that Amazon is making a conscious effort to offer the latest and the greatest.

Here is a glimpse of stats from our production cluster:

DeviceStats
AWS Elasticsearch Version5.2
Number of nodes6
Number of data nodes4
Active primary shards104
Active shards208
Provisioned Size2 TB
Searchable Documents300-500 MN

Today, we are fairly advanced with regards to our understanding of the internals of Elasticsearch - data nodes, master nodes, indexing, shards, replication strategy and query performance. In hindsight, it would be fair to say that we have been extremely happy with our choice given our limited understanding of running a production scale Elasticsearch cluster 6 months back. With Amazon Elasticsearch we were able to start small, move fast, learn on the go, and fine-tune our cluster for production workloads without having to understand every single aspect of managing and scaling a production scale Elasticsearch cluster.

While I wholeheartedly recommend Amazon’s Elasticsearch offering - it goes without saying that you should be aware of its limitations, anomalies, and caveats. Liz Bennet’s post (although not up to date) and Amazon’s Elasticsearch documentation is a good start with regards to evaluating your options. Like any cloud offering it has its limitations and trade-offs.

Building a Real-time Stream Processing Pipeline

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

The Big Data ecosystem has grown leaps and bounds in the last 5 years. It would be fair to say that in the last two years the noise and hype around it have matured as well. At DeltaX, we have been keenly following and experimenting with some of these technologies. Here is a blog post on how we built our real-time stream processing pipeline and all it’s moving parts.

BIG DATA PROCESSING MODELS

Before I take a deep dive into how we went about building our data pipeline - here are some models I would like to describe:

Batch processing

We have been using batch-processing as a paradigm on the tracking side from the start. Overall, when Hadoop as an ecosystem came to the fore - ‘map-reduce’ as a powerful paradigm for batch processing on bounded datasets got wide adoption. Batch processing works with large data sets and is not expected to give results in real-time. Apache Spark works on top of Hadoop and primarily falls under the batch processing model.

Stream Processing

Stream processing as a paradigm is when you work with a small window of data, complete the computation in near-real-time, independently. asynchronously and continuously. Apache Spark Streaming (micro-batch), Apache Storm, Kafka Streams, Apache Flink are popular frameworks for stream processing.

HISTORY OF EXPERIMENTATION AT DELTAX

Genesis

When we started architecting our tracker back in 2012, it was also the time when the Hadoop ecosystem was catching a lot of eyeballs. Being the curious mind and dabbling with it a little - it was thrilling to see the power of scalable distributed file system (Hadoop) and map-reduce as a paradigm. We were small and the data that we were expecting to see at that time in the near future wasn’t anywhere close to Big Data and so we never ventured towards it. But as a side-effect of the exploration, the files that we generated from tracker were JSON and were processed in batches.

Exploring Apache Storm

We built a POC in 2014 for our tracker and dabbled in stream (event) processing as a paradigm. This was an interesting exploration and conceptually our use case fit very well with the ‘spouts’ and ‘bolts’ semantics from Apache Storm. This was also our first time working with ZooKeeper and Kafka and I must admit it wasn’t a breeze to get them to work.

Exploring Amazon Kinesis

Around 2015 Joy worked on a POC for ingesting click data into Amazon Kinesis. Compared to Apache Kafka, working with a cloud-managed service felt refreshing. We explored this immediately on launch and it lacked a lot of bells and whistles which are now baked into the service. Read further to see how we shall close the loop on this.

Exploring Datastores

Data stores have always been of interest to us on the tracking and ad-serving side. Having dabbled with SQL, SQL Column stores, Redis, AWS DynamoDB, AWS S3 and MongoDB at varying times - we would always be interested in the next exciting store. It was then when we came across Druid. Druid is a distributed column-store database and it caught our fancy for it real-time ingestion and sub-second query times. Amrith also happened to a fairly detailed deep-dive on it and dabbled with it as part of #1ppm. I scanned the docs which explain their data model and various trade-offs in fair detail. Reading through Druid docs and understanding it’s internal working set a benchmark with regards to what we should expect from a sub-second query store.

Exploring Stream Processing and Apache Spark

It was Dec 2016 when we decided to go neck deep this time with Amrith leading from the front. The ecosystem had matured, we had learned from our previous explorations and the volume of data had substantially grown. We explored Apache Kafka and it’s newly introduced streaming model. Post POCs, follow-up discussions and deep-dive we were convinced that the computing framework, tooling, paradigm and unified stack that Spark provides was suited, mature and superior to other options available. This was also the time when Joy hopped on the bandwagon. There were some fundamental challenges we needed to overcome to confidently take this to production.

Here are some challenges we faced with Apache Spark:

  • We were creating rolling hourly log files by advertiser; which was close to 15K per hour and this was only growing
  • We were using AWS S3 supported EMRFS which is is an implementation of HDFS for S3, but it wasn’t really meant for working with thousands of small files, instead it was more suited for processing a small number of huge files.
  • We deviated towards the batch processing paradigm by running the AWS EMR cluster every half an hour, yet we were not able to figure out a clean way to ingest the summarized data into individual advertiser BPs. This was more of a bottle next with regards to our multi-tenant isolation across advertisers
  • AWS EMR cluster wasn’t very stable and something we were not very confident about. Also, the overall provisioning and dynamics of resources allocation were not something that was easy to factor in for production workload.
  • We were able to process a day’s odd data in fractionally incremental time vs. half-hour data which was a complete bummer for us. On exploring further - the stack we were working towards was ideal to process large volumes of data over a week to two week period in one shot instead of trying to process half-hour worth of data.

Lastly, I must confess none of these should be looked at as shortcomings for Apache Spark but more as architectural trade-offs given what was possible at that point in time given what was in place, bandwidth, and resources. Given the right use case, I would hands-down go back to booting up an AWS EMR cluster to process a few months worth of data using Apache Spark.

P.S: Amrith has a fairly detailed set of notes about how we went about this exercise as a draft post with title ‘Igniting Spark’ and can be read by anyone internally.

BUILDING THE REAL-TIME STREAM PROCESSING PIPELINE

By this time we had a series of learnings and some clear goals in mind:

  • Stream processing as a paradigm suits our use case the best
  • Easy to maintain or managed service in the cloud would be ideal
  • Developer friendly and peace of mind was of utmost importance
  • Being able to ingest streaming data and query summaries was important
  • Good to have a way to run batch processing framework for machine learning, data crunching, and analysis
Architecture Components
DeltaX Architecture

Click here to view full architecture flow

Event Producers

Our core tracking and ad serving stack are built from scratch on Node.js. It’s on AWS and auto-scaled. The async event-driven approach of Node.js works perfectly right for producing async events. We integrated the Kinesis Firehose SDK and push events to Kinesis Firehose

Streaming Queue

Kinesis Firehose is a fully managed streaming queue with configurable destinations. It also supports running custom lambda functions on every event. Event processing and the scalable serverless model of processing together is extremely powerful. We have configured two destinations for our Kinesis Firehose application - Amazon S3 for batch processing logs and Amazon Elasticsearch for near-realtime summarization queries.

Amazon Elasticsearch

Using Elasticsearch as part of our stack is a story in itself. We had looked into Elasticsearch primarily for log monitoring the first time. Elasticsearch as an ecosystem has evolved from its primary search driven use-case to a wide array of time-series and aggregation use-case. Like any NoSQL databases, you want to follow the access-oriented pattern and model it right. With Elasticsearch in our arsenal, we were also able to build a pull-based architecture - where workers across advertisers pull the required data from Elasticsearch. With Kinesis Firehose + Elasticsearch we have been able to keep the data freshness to around 15 minutes from a click to its summary being available. Jaydeepp has planned to write a multi-part series on Elasticseach - Part 1 is already published.

Streaming Analytics

Kinesis Analytics allows running streaming SQL window functions on events in Kinesis Firehose. This could be useful to run any kind of real-time anomaly detection, fraudulent click protection or rate limiting.

Batch Processing and Analytic Workloads

The AWS S3 logs deposited by Kinesis Firehose can be used for batch processing and analytic workloads. We use AWS Athena a managed PrestoDB service to do all the heavy lifting when it comes to analytic workloads across advertisers and big date ranges. You can do this while still writing vanilla SQL. Anything more complicated and you can start an AWS EMR cluster and run an Apache Spark job to do the data crunching for you.

LOOKING FORWARD

Just last week, Vamsi blew me away with his take on modelling the tracking data to a Graph Database.

Here is what I have learned from this experience and something you would have already felt after reading about this journey. This is not where it ends. You are never able to connect the dots looking forward. Considering we are working with unbounded data sets - all we can do is to keep streaming and keep processing!

CDN for serving Dynamic Content

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

Using CDNs (Content Delivery Network) for static content has been a long known best practice and something we have been using across our platform and ad-server. I wanted to share a special usecase where we use CDN (AWS Cloudfront) for serving dynamic requests on our ad-server to achieve subsecond response times.

CDN for Static Content

CDNs employ a network of nodes across the globe called edge nodes to get closer to the user (client browser) and hence are able to reduce the latency and roundtrip delay. Add to this a cache policy at the edge nodes and you are able to serve content gloablly with with acceptable latencies.

Here is how it would look like: How CDNs Work

CDNs also come in handy as browsers limit the number of HTTP connections with the same domain - this is anywhere between 2-4 for older browsers and 6-10 for modern. Using multiple CDN sub-domains dynamically helps avoid queing the requests on the browser side.

CDN for Dynamic Content

Using CDN for dynamic content in cases where the response from the server is supposed to be different for every user request is counter intuitive. When it comes to ad-server the response is not only unique by user but also time sensitive. So, caching the dynamic requests of the ad-serving engine is not recommeneded. CDNs that allow supporting dynamic content allow this to be specified in distribution settings or read it from the headers of the response of the origin servers.

Before we get deeper, there is another important consideration - all ad-serving requests are now mandated to be through HTTPS. HTTPS (SSL/TLS) is recommened to protect the security, privacy and integrity of the data but it’s not known to be the fastest off the block. I’m referring to the 3-way handshake which delivers the expected promise of SSL/TLS but adds significant latency while establishing the initial connection. This initial latency can be substantial considering ad-serving performance is measured in subseconds.

By terminating SSL at the edge node of a CDN also called as SSL offloading can speed up initial requests (see realworld results below).

Here is how it would look like:

Dynamic CDN
CDN for Dynamic Content (real-world results)

Theoretically, using CDNs for dynamic content for SSL offloading may look like a minor boost - but when it comes to real world results here is how the results stack up.

Dynamic CDN - realworld reults

This is close to a 900% boost in real world performance for the first request. The results shall vary based the latencies between your user, you origin server and the nearest edge location.

Additional Pointers

We use AWS Cloudfront as our CDN and he are some features which we are able to leverage for subsecond ad deliveries:

  • Vast coverage - 98 Edge locations (87 Points of Presence and 11 Regional Edge Caches) in 50 cities across 23 countries
  • HTTP/2 support - which takes advantage of miltiplexing (multiple request & response messages between client and server on the same connection instead of multiple connections). Esp. for usecases where multiple assets are required like richmedia ads; the realworld benchmarks were unbelieveable to me and Amrith (possibly a future blog post).
  • SSL Session Ticket - to reduce the back and forth for the SSL handshake for subsequent requests.
  • Support for gzip compression.

CDNs have become a commodity with the ease and flexibility offered by the public cloud providers like AWS & Microsoft. I feel the recent launch of AWS Lambda@Edge the pormise of the on-demand nature of the cloud and serverless architecture will finally culminate into something bigger.

Learnings from building High Availability(HA) Services

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

At DeltaX, we have been dabbling with Internet Scale and High Availability for our core tracking and ad-serving services. We have had our fair share of battles, wounds, victories and a host of untold stories. Today, I shall dabble into some learnings keeping the stories for another day.

When designing architecture for mission critical systems the two most commonly discussed aspects are scalability and availability. Most often than not both aspects are used interchangeably. Scalability is about being able to handle increasing load while availability is keeping the system operational by decreasing downtime. Designing Highly Available systems is focusing on the qualitative measures to reduce downtime and eliminating the single point of failures (SPOFs). Here are some learning and thoughts on things to consider while architecting an HA system.

1. Accept Failure

This is contrarian to what we set out to achieve but with all things that start in the head, you have to first get the monkey out of your head. So, if someone comes up to you and informs you that have to build a system which has zero downtime and should be running 99.999% uptime (also called five 9s which is a gold standard). Our first reaction would be to ensure we code in such a way that the system will never fail, handle all exceptions, scale to ensure that it can handle increasing load and hence will never have a downtime. Instead for a second, pause and first accept failure. Accepting failure doesn’t mean that you are building for failure but you accept that irrespective of what you do - it can still fail and so you have to consider, reconsider and plan your system around being able to fail and still keep running.

Next two learnings will talk more about how to fail - like a gentleman.

2. Redundancy, Failover and Recovery (avoid SPOF)

Building redundancy is about ensuring that there are alternate paths in the system to keep functioning (albeit at lower capacity) while failover is switching to the alternate path. The switch over ideally has to be automatic to ensure that there is no manual intervention needed. Once we have a system which fails over it’s very important to have a recovery plan to be able to resurrect the failed path otherwise there is a high chance the will result in additional load and may cause congestion or subsequent failures (snowball-effect). The recovery may be automatic or even manual.

Let’s take a classic example of a web server to understand redundancy and failover.

Schematic user -> web server setup

Now let’s add a load balancer in between and have two servers responding to requests; while the load balancer will ensure that whichever server is ‘healthy’ will be the one receiving requests from the load balancer. As soon as it detects that one of them is ‘unhealthy’ it shall redirect the requests to another one.

Schematic user -> web server setup

Although, this ensures that we have redundancy and also automatic failover - the load balancer in itself is now a SPOF. So, let’s try an alternate setup where we have two load balancers and two servers.

Schematic user -> web server setup

This is a simplistic schematic setup; production systems are more complex and have more moving parts. While we ensure automatic failovers it’s really important to be able to recover from failure. A simple example here could be that once the load balancer detects a web server to be ‘unhealthy’ it’s important to ensure that either we are able to automatically recover by swapping out the web server with a healthy one.

3. Performance Monitoring & Alerts

You can’t improve what you can’t measure. Also, for any HA system monitoring and alerts can’t be an afterthought. Monitoring is ensuring you are measuring health and performance indicators while alerts are ensuring you get timely and actionable information about the system.

Bonus Tip: To see if your system can handle failure, failover, recover and you are able to receive alerts while chaos hits the roof - you can simply log into any of your servers and simply power-off! Think this is a joke? Netflix actually built a tool called Chaos Money to do exactly this. Chaos Monkey is a service which identifies groups of systems and randomly terminates one of the systems in the group.

Architecture for DeltaX HA services
DeltaX Architecture for HA services

We leverage the AWS Cloud to the fullest - right from Route53, ELB, EC2 auto-scaling and S3 for the persistent store. I must note here that adopting the cloud doesn’t really mean that you are set for HA but it definitely makes your job easier with a suite of services and health monitoring system.

For redundancy, we use multiple EC2 instances under an Elastic Load Balancer(ELB) for redundancy. In each of the instances, we have multiple workers running using the Node.js cluster module. Failover and recovery are handled at multiple levels. At the worker level, we have the cluster module which instantiates a worker if one dies; monit monitoring the server process within each instance with a trigger for restart if needed. ELB health checks to route traffic between multiple instances; also to ensure auto-scaling requirements are met. Monitoring & alerts are handled through Amazon Cloudwatch and Amazon SNS.

Overall, we still have some areas in the architecture to improvise upon and further eliminate SPOFs. Like any serious HA architecture - you can’t take anything for granted; if you do the Chaos Monkey may strike.

Is the Future of Application Architecture – Serverless?

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

Advancements by cloud-based IAAS providers (Amazon Web Services, Google Cloud and Azure have made on-demand scale and flexibility a reality. Today, as a startup you don’t need to worry about over-provisioning infrastructure, forecasting growth and go over long-term infrastructure contracts to meet your demands. Interestingly, a new suite of cloud services are questioning the very existence of a core aspect of common application architectures - the ‘server’ and are coined as serverless.

What is the ‘server’ in serverless?

Let’s say you wanted to run a service on the cloud - for this, you would need to do the following:

  • Decide the type of computing resources you need. Instance type, cores, memory and storage space.
  • Choose an OS / Machine image to install on the instance
  • Setup / deploy your service

Steps 1 & 2 above constitute the ‘server’ in the serverless paradigm and in effect, these are the steps you wouldn’t have to worry about. All you need to do is to choose your execution environment and submit your code.

Available Options

When it comes to the serverless paradigm - each of the major cloud IAAS providers have launched their own options. Here is a quick summary of options available:

IAASServerless ParadigmSupported EnvironmentsProduction-ready
Amazon Web ServicesAWS LambdaNode.js, Java, Python, C# (.NET Core) 
Microsoft AzureAzure FunctionsNode.js, C#, F#, Python, PHP, and shell 
Google CloudCloud FunctionsNode.js 

RefClick here for a detailed comparison on Stackoverflow

There are slight differences in the extent of support and capabilities but the process to initiate works as follows:

  • Select a development environment
  • Choose the amount of memory, execution timeout etc.
  • Setup a trigger for launch
Proof of Concept

In part, to test drive the paradigm and at the same time build something useful, I worked on two POCs.

Azure Function: Cachewarmer Function

When it comes to our web application, we use Entity Framework as the ORM. Considering the multi-tenant nature of the application and the volume of tables - context initialization takes an unexpectedly long time. It’s for this exact reason we had to build a mechanism to warm the context cache to initialize it and keep it ready for external requests.

Trigger: CRON

Dev Environment: shell

Description: I cooked together a sequence of cURL requests to make pings to a special endpoint on the web application which initiates a context load. Considering we have over 500 tenants we had to batch a series of requests and to avoid hitting the max execution time I had to split this into two separate functions.

Honestly, this was really a trivial function, but it is exactly why having a serverless architecture was justified. Not to forget, we were up and running within 20 mins.

AWS Lambda: Slackbot dxdb

This was in retrospective a solid use case. Let me take a deep dive onto this one:

Purpose: As noted earlier, we have over 500 tenant databases. When it comes to querying the databases - it’s pretty cumbersome to connect to them individually using SMSS and then run individual queries. When it comes to executing small queries to check data; it would be pretty useful to simply fire the query in the Slack channel and see the results. An unexpected consequence of using Slack is also that one can fire the query from the Slack mobile application as well and see the results on the go.

Features Supported:

  • Detect the DB to connect with intelligently from the schema
  • Support delayed response. Some queries can take longer to execute while Slack for an immediate response has a window of 3 seconds.
  • Formatting output to the extent possible
  • Minimal error notifications

How it works? Slack command dxdb

  • Every invocation of the command makes a POST request to the AWS API Gateway with the command and the request text; in our case the query.
  • The AWS API Gateway invokes the AWS lambda function dxdbExecuteSQL and passes the request params. Tip: The AWS API Gateway is probably the most underrated yet one of the most powerful and flexible services AWS has launched. Will explore this in the future.
  • dxdbExecuteSQL function authenticates the request, does minimal checks on the kind of queries (in our case only read-only) and does two things.First formats the intermediate response in the form of MSSQL prompt to be sent back to Slack through the API gateway. Next invoke the dxdbDelayedSlackResponse lambda function.
  • dxdbDelayedSlackResponse lambda function parses the query, identifies the tenant, fires the query, reads the results, formats the response and makes POST request back to Slack.

Although the setup is complex and layered, I only had to focus on the workflow and the business logic; the effort of picking an instance, setting it up and keeping it running was not something I had to worry about. Another interesting thing about this setup is that - the function is not running all the time, it is only executed on invocation and the icing on the cake is that you are only billed for the time it executes in increments of 100ms.

Code: Project is available on Github.

Follow-up Thoughts

Going serverless is an extension of adopting the cloud but demands a change in the thought process of layering your architecture. The recent trend around microservices-based architecture also fits well with the serverless paradigm.

Interestingly, each of the cloud services offers a minimal code editor. I can see how in the future you could probably have a full-fledged IDE available at your disposal. Looking at the pace of innovation, we are another step closer to not just programming for the cloud but literally in the cloud.

Video Transcoding on the AWS Cloud

This blog post was cross-posted from DeltaX Engineering Blog - {recursion} where it was published first.

Video ad-serving is a complex beast given the sheer expressiveness of the medium and unpredictable client-side bandwidth that’s required. At DeltaX, our ad-server is now also Youtube certified for VAST in-stream ads. VAST (Video Ad Serving Template) is an XML-based standard defined by IAB standard for video ad-serving. In the case of video ad-serving, the ad-server responds with multiple video assets in different formats, resolution, and bitrate while the VAST compatible video player picks the most appropriate video asset based on the host platform, bandwidth, and other client considerations. For this to work as expected, it’s important to transcode the media file provided by an advertiser to different formats, resolution, quality and specs beforehand.

Setting up a Elastic Transcoder Pipeline on AWS

Transcoding is the process of converting a media file from one format, resolution, quality and specs to another. In the past, a transcoding pipeline would require a lot of heavy lifting on the software and hardware front. Today, using the cloud you can setup a transcoding pipeline in a matter of minutes. Considering we use Amazon Web Services to host and scale our ad-server - the Amazon Elastic Transcoder was a great fit. Expectedly, it also plays well with Amazon S3 and Amazon Cloudfront.

Here is how we setup the video transcoding pipeline for a VAST ad-server:

1. Create Custom Presets
Custom Presets

Here you can start with a pre-existing preset. Amazon Elastic Transcoder provides comprehensive options to specify the codec, bit rate, number of key frames, sizing policy and aspect ratio.

VAST Presets

At DeltaX we have fine tuned our presets to optimally be able to serve for all platforms.

2. Setup a Pipeline
Transcoding Pipeline

Pipeline acts as a queue for various transcoding jobs. It also helps you configure the Amazon S3 source and output buckets.

3. Setup a Job
Transcoding Job

Here is where you specify the input source file and choose one or many output presets (configured in step 1) to generate transcoded output files.

4 Job status and Completion
Transcoding Job Status

You can track the status of your job on their dashboard manually. Once the status is complete you can visit the bucket/prefix and see the transcoded files.

Transcoding Job Input
Transcoding Job Output

You can see how a 720p HD file was transcoded along with thumbnails of output files of varying resolution and bitrates. If you notice the original file size and the ones which were transcoded, you would have already figured out the amount of bandwidth saving along with ensuring that the user wouldn’t have to wait very long for the video ad to load.

Closing Thoughts

This is a classic example of how with the emergence of the cloud ecosystem infinite scale and on-demand can go hand in hand. For startups, the cloud is an amazing leveler to help innovate and get to market faster.

Look forward to sharing more tidbits, optimizations and architecture considerations while building the ad-server in follow-up posts. Ending with a quote (modified to suit the blog post) from one of my favorite movies TROY - “If they ever tell our story let them say that we walked with giants. Startups rise and fall like the winter wheat, but these names will never die. Let them say we lived in the time of Azure, tamer of the Microsoft stack. Let them say we lived in the time of AWS.”