I've come across blue-eyed founders who use the term MVP frequently but end up falling prey to some common myths.
Release a minimum product - has-assed product.
Reality: Release a product that helps you validate your core hypothesis
Let’s release what’s possible in 2/3 months
Reality: Constrain time to think creatively and shorten the Build-Measure-Learn loop
Tackle complexity head-on - Do the most complex bits of the product as part of the MVP
Reality: Find simpler ways to validate your core hypotheses
I will release the MVP and we will know whether it works. If it doesn't then we will pivot
Reality: It will mostly not work, and it's ok. First, start with what you plan to measure and go through a few follow up cycles of Build-Measure-Learn before thinking of a Pivot
I highly recommend you read the book “The Lean Startup” where Eric Ries popularized the term. Here are a few videos [1] & [2] (with interesting Q&A at the end) where he explains his philosophy and concepts.
I came across AutoML while exploring some data science use cases we were tackling and hence I explored it for some time. Here are some notes related to Google Cloud - AutoML from a developer perspective. At the tail end shall also share some thoughts around how this ecosystem is shaping up.
As a developer and before diving deeper into data science I would always wonder how Machine Learning is different than traditional programming. I’m going to simplify it a bit to make the difference a bit apparent.
PROGRAMMING MODEL
We write some rules (code), receive input data and generate output. It’s a fairly established model for us to think about.
MACHINE LEARNING
We start with sample data (training set) and expected results to generate a model. Once we have a model we can give it the input data and generate output. The primary difference being that we don’t write rules to generate output - our model learns the rules on its own looking at the sample data and expected results. Based on the type of available sample data there are different types of learning - supervised, unsupervised, semi-supervised and reinforcement learning.
Lifecycle of a Machine Learning Project
To fully understand the value that an AutoML system brings to the table one has to understand the lifecycle of a typical Machine Learning project.
Step 1 & 2: This is largely about making the data ready in a consumable format. Most of this has to be done manually whether you are doing - traditional programming, ML project or even AutoML.
Step 3: This is where most of the actual time is spent and it involves a variety of skills from your arsenal around the art and science of machine learning. Most of the time it involves you to go back and forth with your data to ensure you are working with the right set of features and model. This along with the step 4 is time consuming. Also, you can read more about Feature engineering and Feature selection here.
Step 4: Based on the trained model created using the features and from Step 3 and sample (training) data - we need to run the model on test data and interpret the predicted results. Google cloud has a very good primer on evaluating the results here
Where does AutoML fit?
AutoML is aimed at automating functional aspects around - Data encoding, Feature engineering / selection, Model selection, Evaluating Predictions and Model Interpretation. There are frameworks which also try to automate some other aspects too.
The data loading and getting started for Cloud ML was a breeze.
You upload the data in cloud storage and then Cloud ML imports it. Post import we see some key aspects about our data along with some manual controls to tweak the “Data type” which gets auto detected. Once we choose the Target Column - it’ runs a corelation analysis to figure out which columns matter. Also, shows some additional stats around the data values.
After choosing a Target Column you can initiate training model. While initiating the training - you can choose the number of hour (budget) and the feature columns. Depending on whether the Target column is a numeric value or a dimension - CloudML auto-picks a model type.
Once the training finishes - Cloud ML shows a summary and detailed analysis of the summary and evaluation metrics depending on the training model. Considering our Target Metric is Clicks - it’s used a Regression Model. Also, it’s determined the respective feature importance based on the Training Data.
Considering it’s an integrated cloud offering you can also deploy the model and see live online predictions. Above I have experimented trying to predict the number of ‘Clicks’ I shall receive for a budget of 500,000 on 29th June, 2020 for the respective channels - Adwords, Facebook and DBM.
Closing Thoughts
Cloud AutoML is extremely powerful and more importantly has smart defaults around being able to build a robust ML model. It’s integration with Google Cloud ensures that you are able to quickly deploy the same model and consume it as an API. One can also export the model and run it on your own infrastructure by hosting it on Docker.
Data science is equal parts an art and a science. AutoML is making ML accessible and tackling the scientific bits around building a ML model. Data scientists will still be needed to fill in the artistic side of things until AI takes over in the near or distant future.
I’ve been yearning to explore GraphQL for some time now and a daytime sleeper bus journey turns out to be the best excuse. I shall take a quick dive into it’s feature highlights also share some samples on how it can solve some real challenges for us. For brevity I’ve taken an example of our Task/Pending Task queing system (TaskTypes, Tasks, PendingTasks).
Let’s get started with GraphQL feature highlights:
1. Query Language for you API
This is the best way to understand GraphQL. Like we use SQL to explore our database, clients (end user applications) can use GraphQL to query data from you API. The similarity doesn’t end there - like in SQL it supports DDL (GraphQL schema), DML (GraphQL Mutation) and DQL (GraphQL Query).
Here is an example - GraphQL Query for a particular Task ID:
2. Query what you need
In case of REST - you don’t know what fields will get returned untill you make a request and see the response. One can always read the API docs - or use a client library but there is no discoverability or safety when it comes to REST as the response returns everything.
Example: Let’s say you have a web app and mobile app which is displaying Task details (Id, TaskTypeId, Arguments, Priority). What if the mobile app doesn’t need to display the Task ‘Argument’ field? In case of REST considering both are using the same API endpoints - both will receive the ‘Arguments’ field for an API call for GET /tasks/75103779.
In case of GraphQL, the clients (web and mobile) can query for what they need.
Example Web Client - requesting for Id, TaskTypeId, Arguments and Priority:
Example Mobile Client - requesting for Id, TaskTypeId and Priority:
4. Nested resources
When using REST - you will always stumble upon edge-cases around whether REST resources should return responses containing multiple resorces.
Here is an example set of REST endpoints: /tasktypes/tasktypes/:id/tasks/tasks/:id/tasks/:id/pendingstasks/pendingtasks/pendingtasks/:id
Use-case: If you want to display a Pending Task (id, taskstatusid) with it’s Task details (id, arguments).
Example with REST:
GET /pendingtasks/4480674
This will give us the id; and the taskid
GET /tasks/:taskid
This will give us the task details for the pending task
Example with GraphQL:
While we are at this - there is no restriction on what is returned as long as the client is expecting it.
Example GraphQL where you want Status of Pending Task 4480674 and details for TaskType 10
4. Version-less
As a side-effect of (2) and (3) - if there are not breaking migrations you don’t need a versioning for your API. You can keep evolving your schema and ageing fields can be deprecated slowly and hidden from new client integrations you can continuously evolve your API.
5. Strong Type System
GraphQL schemas are expressed as fields and types. This ensures the schema is self-documenting and better type checks during runtime.
6. Single endpoint
GraphQL acts as a single endpoint for all your queries. This is very different than the REST conventions we are used to where every resourse is a separate endpoint.
Let’s say your GraphQL server is running at http://localhost:4000 - you can directly send it a query using cURL or fetch to query data across all the mapped schemas.
Examples:
curl -X POST -H "Content-Type: application/json" -d '{"query": "{ task(id: 81061739) { id priority } }"}' http://localhost:4000/
GraphQL specification defines a query language and the execution engine. In code it’s defined as a schema and a set of resolvers. The actual code for the resolvers isn’t part of the specification and hence there are endless possibilities as long as you can find a resolver for your backend. In more concrete terms, the above examples are possible using the sequelize which is an ORM for mssql and graphql-sequelize which is a resolver for graphql queries targeted at Sequelize models or associations. In the same way, you can have resolver which is fetching data from your existing REST API (or even multiple REST APIs).
GraphQL is not a silver bullet - it has it’s own gotchas. Here are some you would want to keep in mind:
1. Fat Clients
A direct consuqeuence of ‘query what you need’ - is that the clients are more context / scehma aware. Depending on the client libraries you use this could be a boon or a bane.
2. Rouge clients / Complex Query Optimization
In a case of REST, we consciously expose the data through the API. What this allows us to ensure is being able to optimize the queries the API is making on the database. In case of GraphQL we allow clients to make arbitary/adhoc queries - optimizing for such usescases can become complicated. In addition, N+1 problem is a classic problem which hits ORMs (yes - EF, I’m looking down upon you) and REST APIs alike. GraphQL is no exception and to avoid such cases one needs to use a dataloader. DataLoader is a generic utility to be used as part of your application’s data fetching layer to provide a consistent API over various backends and reduce requests to those backends via batching and caching.
Closing Thoughts
With us moving towards API-first, we are already facing some of the above challenges across teams with conventional REST API design. GraphQL looks like a strong contender for us to evaluate and experiment with. If you are excited about the possibilities of GraphQL, please feel free to reach out to me with ideas.
We are big fans of Amazon ES Service and have been using it for over a year in production with great success. We also have blogged about what we love about Amazon’s ES offering. While all of that stays true even today - we were wishful in thinking that Amazon must be working on making the upgrades a bit easier.
Lastly, Amazon has been extremely proactive with regards to keeping pace with Elasticseach releases. At the time of this post, v6.2 is the most recent version for Elasticsearch and that’s the same version available with Amazon. Please note that version upgrades are not available from the interface and need to be handled by manually taking a snapshot and restoring it to a new domain. I hope Amazon is working on making it a seamless experience as well. Unlike quite a few cloud offerings it’s commendable to see that Amazon is making a conscious effort to offer the latest and the greatest.
Last month, Amazon launched a one-click in-place upgrade for their Amazon ES offering. This makes it upgrading to the latest version of Amazon ES a breeze. We went from 5.3 to 6.3 (from 5.3 to 5.6; and then 5.6 to 6.3) in a matter of hours.
Upgrade from 5.3 to 5.6
Upgrade from 5.6 to 6.3
One thing to keep in mind while upgrading to v6.x is that multiple mapping types per index are no longer supported. This was an interesting trade-off we stumbled upon while we were modeling our indexes. We ensured that we don’t use multiple types per index and it clearly paid off during the upgrade.
With this upgrade, we are looking forward to the performance improvements and also to try out the new goodness which was launched with Elasticsearch v6.x - composite aggregations.
If you look through their profiles there are lot of things that work for them; in no particular order: genuine, really smart, having prior experience of working together and complementary skills.
2. Idea
In Paul's own words [1], "you don't need a brilliant idea to start a startup around" but if we look back Y Combinator at it's core was a pioneering effort. I would highly recommend you to read "How Y Combinator started" [2] which takes you through the thought process. For brevity, I would like to highlight some unique bits of the core idea in Paul's own words:
- investors should be making more, smaller investments - they should be funding hackers instead of suits - they should be willing to fund younger founders - funding startups synchronously, instead of asynchronously as it had always been done before
If you read through Paul's essay closely you will also realize that they didn't necessarily feel that a lot of these USPs were pioneering at the core.
3. Execution
Not much can be said about the execution except that it was perfect, at least to me from the time I was following them. You have to excuse me on this count because I was only able to follow them through what I have been reading about them online and through blog posts from other incubates.
Here is a small excerpt from how they started [2]:
It's hard for people to realize now how inconsequential YC seemed at the time. I can't blame people who didn't take us seriously, because we ourselves didn't take that first summer program seriously in the very beginning.
For me what really stood out when I read how they started [2] was how a large part of the execution or implementation of the plan was not necessarily thinking that we are going to change the face of 'venture funding' or 'incubation' forever. It felt as if they were just doing what they believe in.
Apart from the above very obvious things which by now feel common sensical and over repeated but still people get them wrong (including me). But when it comes to being able to cerate a strong brand there are two more things that matter:
4. Timing
Y Combinator has created a brand of it's own not by simply becoming a better incubator but by re-thinking form bottom-up of how an incubator should be. The timing plays an important role from the everyone's perspective because if today there was supposed to be a right team coming together and working on the same fundamentals (idea) with brillant execution they would still remain sub-ordinate to Y Combinator as people would rightly credit Y Combinator for the seed idea.
5. Sense of Purpose
Sense of purpose to me largely is a differentiator between good and great. For every article that I have read about Y Combinator from the founders I have always felt a sense of purpose in 'why they are doing, what they are doing'. Being able to identity a gap is simply identifying an opportunity but taking upon yourself and working towards a goal as if you were destined to work on it and no one else values it enough is purpose for me.
At DeltaX we have been using Amazon Athena as part of our data pipeline for running ad-hoc queries and analytic workloads on logs collected through our tracking and ad-serving system. Amazon Athena responds anywhere from few seconds to minutes for data than runs into hundreds of GBs and has pleasantly surprised us by its ease of use. As part of this blog post, I shall discuss how we went about setting up Athena to query our JSON data.
Amazon Athena is an interactive query engine that makes it easy to analyze data in Amazon S3. It’s serverless and built on top of Presto with ANSI SQL support and hence you can run queries using SQL.
It has 3 basic building blocks:
AWS S3 - Persistent Store
Apache Hive - Schema Definition (DDL)
Presto SQL - Query Language
1. AWS S3 - Persistent Store
Athena uses AWS S3 as it’s persistent store and supports files in the following formats - CSV, TSV, JSON, or Textfiles and also supports open source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.
In our use case, we log data in plain JSON and also use GZIP compression to utilize the storage space optimally. Athen recognizes whether the files are GZIP compressed based on their extension.
Files are dropped into the bucket by Kinesis Data Firehose in the following format: s3://<bucket>/events/[YYYY]/[MM]/[DD]/[HH]/*
Here is an explanation for various parts of the query:
1. CREATE EXTERNAL TABLE - clause
Apache Hive supports ‘external’ and ‘internal’ tables. In case of Internal hive manages the complete lifecycle of the data and hence when you issue a delete table it deletes the data as well. For an external table, the data is stored outside of the hive system and it only recognizes the schema to be able to interpret the data. On issuing a delete table query on an external table doesn’t delete the underlying data.
Only external tables are supported in case of Athena.
2. PARTITIONED BY - clause
In case of Athena, you are charged per GB of data scanned. Hence, to optimize the query execution time and the pricing it’s ideal that you partition your data such that only data that’s relevant is accessed.
We use a pretty elaborate partition scheme with year, month, day and dt which helps us access the right partition of data.
Athena is able to auto=magically load the default Hive partition scheme - YYYY-MM-DD-HH-MM. This can be done by issuing the command MSCK REPAIR TABLE es_eventlogs For any custom partition scheme; you would need to load the partitions manually.
This specifies the SerDe (Serializer and Deserializer) to use. In this case, it’s JSON represented as org.openx.data.jsonserde.JsonSerDe. You can find other supported options here.
4. LOCATION
This specifies the base path for the location of external data. Athena reads all files from the location you have specified.
3. Presto SQL - Query Language
Presto, a distributed query engine was built at Facebook as an alternative to the Apache Hive execution model which used the Hadoop MapReduce mechanism on each query, Presto does not write intermediate results to disk resulting in a significant speed improvement. Athena supports a sub-set of the Presto SQL largely limited to SELECT and a selection of operators and functions.
A simple query to find count of events by xb and xevent would look like:
SELECT xb,
xevent,
count(1)
FROM es_eventlogs
WHERE dt >= '2018-03-01'
AND dt <= '2018-03-02'
GROUP BY xb, xevent
Here is typical query which we run to calculate distinct reach for an advertising campaign:
SELECT xevent,
dt,
COUNT(distinct userid)as reach
FROM es_eventlogs es
WHERE es.xevent IN('xclicks', 'xviews')
AND es.xccid in('7')
AND dt >= '2018-03-07'
AND dt<= '2018-04-06'AND Createdatts
BETWEEN timestamp '2018-03-07 05:30:00'
AND timestamp '2018-04-06 05:30:00'
AND xb = 'XXXXXX'
GROUP BY ROLLUP(xevent, dt)
ORDER BY xevent, dt
Considering some queries could take considerably longer time, Athena console also shows a list of historical of queries and their results are also available for download as CSV.
In the last 3 months, we are increasingly using Athena to run quick ad-hoc queries and some analytical workloads. The ease of SQL on top of Big Data store is a lethal combination. The cloud is evolving fast towards the serverless paradigm and services like Amazon Athena exemplify the serverless and pay per use model.
We have been using Amazon Elasticsearch for the last 6 months and it’s been an overall pleasant and predictable experience. Before opting for it we had our own doubts reading the caveats of Amazon’s cloud offering here and here. I must say, that some of the points noted in the posts are pertinent; but like all things ‘it depends’ and so we wanted to put across some things that we like about AWS Elasticsearch cloud offering and why it worked for us and probably why you should also consider.
1. Integration with AWS ecosystem
This is a big plus for us as we use Amazon Kinesis Data Firehose and it supports Amazon Elasticsearch as a destination. We were able to setup a test domain in minutes and were able to benchmark our production workload very easily. Even when it comes to authentication AWS IAM works great for us and we were able to use it as an authentication layer for our workers. You can read more about our real-time stream processing pipeline here which primarily uses Amazon Kinesis Data Firehose.
2. Cluster Management (scale out/up)
There were at least two separate instances in a span of the first 8 weeks where we had to revisit the cluster setup and space provisioning. In both the instances, with a few clicks, we were able to scale up and scale out with a few clicks without any downtime with close to few GBs of data which was getting shipped to Elasticsearch close to peak load.
3. Monitoring & Alerts
Amazon Elasticsearch by default logs a plethora of important metrics to AWS Cloudwatch. This was interesting because when we started working with Elasticsearch we didn’t completely know what we were getting into. Overall, it’s a very powerful system which could fit many different use cases (timeseries aggregation in our case) and like any powerful system, it takes time to master. Thankfully, Amazon’s Elasticsearch offering makes available quite a few critical metrics and as you get deeper into the ecosystem - you can tweak your cluster to your workloads much better.
Looking at Cloudwatch and Elasticsearch metrics we were able to fine-tune: # Cluster size # Impact of number of shards & replication factor # Impact of queries
This proved to be really useful and we could track the performance metrics closely pre and post changes. Cloudwatch and it’s tight integration with Amazon SNS makes the setup of critical alerts on SMS and email really easy and something you should not miss.
4. Snapshot Recovery / Backup to S3
Although we didn’t have to use the automatic snapshots from S3 it would give anyone additional peace of mind knowing that Amazon takes daily automated snapshots in a pre-configured S3 bucket. Considering our indexes are daily, and at times we need to be able to do benchmarking on our own data - we also run daily manual snapshots through AWS Lambda for backups to S3. This allows us the ability to recover from our own snapshots if needed.
5. Elasticsearch Upgrades
Lastly, Amazon has been extremely proactive with regards to keeping pace with Elasticseach releases. At the time of this post, v6.2 is the most recent version for Elasticsearch and that’s the same version available with Amazon. Please note that version upgrades are not available from the interface and need to be handled by manually taking a snapshot and restoring it to a new domain. I hope Amazon is working on making it a seamless experience as well. Unlike quite a few cloud offerings it’s commendable to see that Amazon is making a conscious effort to offer the latest and the greatest.
Here is a glimpse of stats from our production cluster:
Device
Stats
AWS Elasticsearch Version
5.2
Number of nodes
6
Number of data nodes
4
Active primary shards
104
Active shards
208
Provisioned Size
2 TB
Searchable Documents
300-500 MN
Today, we are fairly advanced with regards to our understanding of the internals of Elasticsearch - data nodes, master nodes, indexing, shards, replication strategy and query performance. In hindsight, it would be fair to say that we have been extremely happy with our choice given our limited understanding of running a production scale Elasticsearch cluster 6 months back. With Amazon Elasticsearch we were able to start small, move fast, learn on the go, and fine-tune our cluster for production workloads without having to understand every single aspect of managing and scaling a production scale Elasticsearch cluster.
While I wholeheartedly recommend Amazon’s Elasticsearch offering - it goes without saying that you should be aware of its limitations, anomalies, and caveats. Liz Bennet’s post (although not up to date) and Amazon’s Elasticsearch documentation is a good start with regards to evaluating your options. Like any cloud offering it has its limitations and trade-offs.
Tracking pixels also referred to as 1x1 pixels is a common way to track user activity in the analytics and ad-serving world. There are multiple ways to achieve this; the most common being through a transparent 1x1 GIF image call and passing the data that is needed through URL params. This is also the reason why tracking pixels are referred to as 1x1 pixels. Overall, the tracking pixels are flaky and constrained by the limitations imposed by various browser environments and network connectivity. The Beacon API proposes to address these concerns and to provide a streamlined API and predictable support across browsers.
BEACON API
The work on the spec of the Beacon API started in 2013 and by the end of 2014, the then latest versions of Google Chrome and Mozilla Firefox started shipping support for the Beacon API.
You can either read the W3C spec of the Beacon API or a shorter version of the key aspects in the MDN docs to understand the motivation and inner workings.
Here is a quick blurb from MDN:
The Beacon interface schedules an asynchronous and non-blocking request to a web server. Beacon requests use HTTP POST and requests do not require a response. Requests are guaranteed to be initiated before a page is unloaded and they are run to completion without requiring a blocking request (for example XMLHttpRequest).
There are two key aspects which make this really interesting for tracking pixels:
It’s non-blocking and are prioritized
Requests are guaranteed to be initiated before page is unloaded and are allowed to run to completion
Together this would ensure that the end-user experience remains unaffected while still ensuring that there is no data loss if a request is initiated but the user decides to navigate to another page.
MIGRATING TO THE BEACON API
We did a small test across our ad-server tracking pixels with regards to support for the Beacon API by comparing it with our traffic across browsers, devices and the browser compatibility chart. It looked promising and so we moved ahead to add support for the Beacon API. We can’t be loosing any of the tracking data for non-supported browsers and so having a fallback was important.
Here is how that code block looks with support for Beacon API:
// URL to call
var eventUrl = 'http://www.tracking.com/1x1.gif';
if (navigator&&navigator.sendBeacon) {
// Check if sendBeacon is supported
navigator.sendBeacon(eventUrl);
} else {
// Fallback to using JS Image call
(new Image).src=eventUrl;
}
We deployed this change a few weeks back and looking at the data that we have gathered - here is the percentage of traffic that supports the Beacon API:
Device
Usage (%)
Desktop
97.72
Mobile
93.40
Across (Desktop + Mobile)
95.74
In hindsight, the transition went off smooth and this change will definitely make tracking more robust. It’s interesting how the W3C spec is continuously evolving and adapting based on use cases, and making the web more usable for users and developers alike.
The Big Data ecosystem has grown leaps and bounds in the last 5 years. It would be fair to say that in the last two years the noise and hype around it have matured as well. At DeltaX, we have been keenly following and experimenting with some of these technologies. Here is a blog post on how we built our real-time stream processing pipeline and all it’s moving parts.
Before I take a deep dive into how we went about building our data pipeline - here are some models I would like to describe:
Batch processing
We have been using batch-processing as a paradigm on the tracking side from the start. Overall, when Hadoop as an ecosystem came to the fore - ‘map-reduce’ as a powerful paradigm for batch processing on bounded datasets got wide adoption. Batch processing works with large data sets and is not expected to give results in real-time. Apache Spark works on top of Hadoop and primarily falls under the batch processing model.
Stream Processing
Stream processing as a paradigm is when you work with a small window of data, complete the computation in near-real-time, independently. asynchronously and continuously. Apache Spark Streaming (micro-batch), Apache Storm, Kafka Streams, Apache Flink are popular frameworks for stream processing.
HISTORY OF EXPERIMENTATION AT DELTAX
Genesis
When we started architecting our tracker back in 2012, it was also the time when the Hadoop ecosystem was catching a lot of eyeballs. Being the curious mind and dabbling with it a little - it was thrilling to see the power of scalable distributed file system (Hadoop) and map-reduce as a paradigm. We were small and the data that we were expecting to see at that time in the near future wasn’t anywhere close to Big Data and so we never ventured towards it. But as a side-effect of the exploration, the files that we generated from tracker were JSON and were processed in batches.
Exploring Apache Storm
We built a POC in 2014 for our tracker and dabbled in stream (event) processing as a paradigm. This was an interesting exploration and conceptually our use case fit very well with the ‘spouts’ and ‘bolts’ semantics from Apache Storm. This was also our first time working with ZooKeeper and Kafka and I must admit it wasn’t a breeze to get them to work.
Exploring Amazon Kinesis
Around 2015 Joy worked on a POC for ingesting click data into Amazon Kinesis. Compared to Apache Kafka, working with a cloud-managed service felt refreshing. We explored this immediately on launch and it lacked a lot of bells and whistles which are now baked into the service. Read further to see how we shall close the loop on this.
Exploring Datastores
Data stores have always been of interest to us on the tracking and ad-serving side. Having dabbled with SQL, SQL Column stores, Redis, AWS DynamoDB, AWS S3 and MongoDB at varying times - we would always be interested in the next exciting store. It was then when we came across Druid. Druid is a distributed column-store database and it caught our fancy for it real-time ingestion and sub-second query times. Amrith also happened to a fairly detailed deep-dive on it and dabbled with it as part of #1ppm. I scanned the docs which explain their data model and various trade-offs in fair detail. Reading through Druid docs and understanding it’s internal working set a benchmark with regards to what we should expect from a sub-second query store.
Exploring Stream Processing and Apache Spark
It was Dec 2016 when we decided to go neck deep this time with Amrith leading from the front. The ecosystem had matured, we had learned from our previous explorations and the volume of data had substantially grown. We explored Apache Kafka and it’s newly introduced streaming model. Post POCs, follow-up discussions and deep-dive we were convinced that the computing framework, tooling, paradigm and unified stack that Spark provides was suited, mature and superior to other options available. This was also the time when Joy hopped on the bandwagon. There were some fundamental challenges we needed to overcome to confidently take this to production.
Here are some challenges we faced with Apache Spark:
We were creating rolling hourly log files by advertiser; which was close to 15K per hour and this was only growing
We were using AWS S3 supported EMRFS which is is an implementation of HDFS for S3, but it wasn’t really meant for working with thousands of small files, instead it was more suited for processing a small number of huge files.
We deviated towards the batch processing paradigm by running the AWS EMR cluster every half an hour, yet we were not able to figure out a clean way to ingest the summarized data into individual advertiser BPs. This was more of a bottle next with regards to our multi-tenant isolation across advertisers
AWS EMR cluster wasn’t very stable and something we were not very confident about. Also, the overall provisioning and dynamics of resources allocation were not something that was easy to factor in for production workload.
We were able to process a day’s odd data in fractionally incremental time vs. half-hour data which was a complete bummer for us. On exploring further - the stack we were working towards was ideal to process large volumes of data over a week to two week period in one shot instead of trying to process half-hour worth of data.
Lastly, I must confess none of these should be looked at as shortcomings for Apache Spark but more as architectural trade-offs given what was possible at that point in time given what was in place, bandwidth, and resources. Given the right use case, I would hands-down go back to booting up an AWS EMR cluster to process a few months worth of data using Apache Spark.
P.S: Amrith has a fairly detailed set of notes about how we went about this exercise as a draft post with title ‘Igniting Spark’ and can be read by anyone internally.
BUILDING THE REAL-TIME STREAM PROCESSING PIPELINE
By this time we had a series of learnings and some clear goals in mind:
Stream processing as a paradigm suits our use case the best
Easy to maintain or managed service in the cloud would be ideal
Developer friendly and peace of mind was of utmost importance
Being able to ingest streaming data and query summaries was important
Good to have a way to run batch processing framework for machine learning, data crunching, and analysis
Our core tracking and ad serving stack are built from scratch on Node.js. It’s on AWS and auto-scaled. The async event-driven approach of Node.js works perfectly right for producing async events. We integrated the Kinesis Firehose SDK and push events to Kinesis Firehose
Streaming Queue
Kinesis Firehose is a fully managed streaming queue with configurable destinations. It also supports running custom lambda functions on every event. Event processing and the scalable serverless model of processing together is extremely powerful. We have configured two destinations for our Kinesis Firehose application - Amazon S3 for batch processing logs and Amazon Elasticsearch for near-realtime summarization queries.
Amazon Elasticsearch
Using Elasticsearch as part of our stack is a story in itself. We had looked into Elasticsearch primarily for log monitoring the first time. Elasticsearch as an ecosystem has evolved from its primary search driven use-case to a wide array of time-series and aggregation use-case. Like any NoSQL databases, you want to follow the access-oriented pattern and model it right. With Elasticsearch in our arsenal, we were also able to build a pull-based architecture - where workers across advertisers pull the required data from Elasticsearch. With Kinesis Firehose + Elasticsearch we have been able to keep the data freshness to around 15 minutes from a click to its summary being available. Jaydeepp has planned to write a multi-part series on Elasticseach - Part 1 is already published.
Streaming Analytics
Kinesis Analytics allows running streaming SQL window functions on events in Kinesis Firehose. This could be useful to run any kind of real-time anomaly detection, fraudulent click protection or rate limiting.
Batch Processing and Analytic Workloads
The AWS S3 logs deposited by Kinesis Firehose can be used for batch processing and analytic workloads. We use AWS Athena a managed PrestoDB service to do all the heavy lifting when it comes to analytic workloads across advertisers and big date ranges. You can do this while still writing vanilla SQL. Anything more complicated and you can start an AWS EMR cluster and run an Apache Spark job to do the data crunching for you.
LOOKING FORWARD
Just last week, Vamsi blew me away with his take on modelling the tracking data to a Graph Database.
Here is what I have learned from this experience and something you would have already felt after reading about this journey. This is not where it ends. You are never able to connect the dots looking forward. Considering we are working with unbounded data sets - all we can do is to keep streaming and keep processing!
Using CDNs (Content Delivery Network) for static content has been a long known best practice and something we have been using across our platform and ad-server. I wanted to share a special usecase where we use CDN (AWS Cloudfront) for serving dynamic requests on our ad-server to achieve subsecond response times.
CDN for Static Content
CDNs employ a network of nodes across the globe called edge nodes to get closer to the user (client browser) and hence are able to reduce the latency and roundtrip delay. Add to this a cache policy at the edge nodes and you are able to serve content gloablly with with acceptable latencies.
Here is how it would look like:
CDNs also come in handy as browsers limit the number of HTTP connections with the same domain - this is anywhere between 2-4 for older browsers and 6-10 for modern. Using multiple CDN sub-domains dynamically helps avoid queing the requests on the browser side.
CDN for Dynamic Content
Using CDN for dynamic content in cases where the response from the server is supposed to be different for every user request is counter intuitive. When it comes to ad-server the response is not only unique by user but also time sensitive. So, caching the dynamic requests of the ad-serving engine is not recommeneded. CDNs that allow supporting dynamic content allow this to be specified in distribution settings or read it from the headers of the response of the origin servers.
Before we get deeper, there is another important consideration - all ad-serving requests are now mandated to be through HTTPS. HTTPS (SSL/TLS) is recommened to protect the security, privacy and integrity of the data but it’s not known to be the fastest off the block. I’m referring to the 3-way handshake which delivers the expected promise of SSL/TLS but adds significant latency while establishing the initial connection. This initial latency can be substantial considering ad-serving performance is measured in subseconds.
By terminating SSL at the edge node of a CDN also called as SSL offloading can speed up initial requests (see realworld results below).
Here is how it would look like:
CDN for Dynamic Content (real-world results)
Theoretically, using CDNs for dynamic content for SSL offloading may look like a minor boost - but when it comes to real world results here is how the results stack up.
This is close to a 900% boost in real world performance for the first request. The results shall vary based the latencies between your user, you origin server and the nearest edge location.
Additional Pointers
We use AWS Cloudfront as our CDN and he are some features which we are able to leverage for subsecond ad deliveries:
Vast coverage - 98 Edge locations (87 Points of Presence and 11 Regional Edge Caches) in 50 cities across 23 countries
HTTP/2 support - which takes advantage of miltiplexing (multiple request & response messages between client and server on the same connection instead of multiple connections). Esp. for usecases where multiple assets are required like richmedia ads; the realworld benchmarks were unbelieveable to me and Amrith (possibly a future blog post).
CDNs have become a commodity with the ease and flexibility offered by the public cloud providers like AWS & Microsoft. I feel the recent launch of AWS Lambda@Edge the pormise of the on-demand nature of the cloud and serverless architecture will finally culminate into something bigger.