Monday, June 06, 2016

Azure SQL Data Warehouse

Azure SQL Data Warehouse is the new SaaS offering from Microsoft. It is still in preview state and I had a chance to set up and play with it. It looks very promising.
So, what is Azure SQL Data Warehouse? Microsoft defines Azure SQL Data warehouse as follows: “Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data - both relational and non-relational. Built on massively parallel processing (MPP) architecture, SQL Data Warehouse can handle your enterprise workload”.  FlyData [1] that compares Azure data warehouse vs Amazon Redshift defined it as “a distributed and enterprise level database capable of handling large amounts of relational and non-relational data”

Historically, those type of MPP are only available for enterprises that can afford to fork out a lot of money to buy appliance upfront. I believe what Microsoft done is great in trying to bring this powerful massively parallel processing architecture to all of us and now we are in position to investigate, experiment and perform cost benefit analysis and see if it can benefit the business under our domain. I believe the SaaS offering of the product will not discriminate against any size of companies whether small, medium or large. It is also a great opportunity for data scientists to experiment their modelling and avoid hassle of initial investment cost.   Whether this will be a game changer, we won’t say it for sure now but time will tell.
The abstraction and separation of compute from azure data warehouse is the one I personally like very much.  I believe the flexibility to dynamically deploy, grow, shrink and pause is indeed another feature.  In this article, I will look at two aspects of the product; architecture and pricing

There are four layer that defines azure data warehouse.  There are disagreement if it is four or three or even two layers that defines Azure data warehouse architecture.  I wouldn't be drawn into the layering as they all talk about the same thing.   Those layers are DMS, storage, control and compute nodes.  The image below (source: Microsoft) gives you a visual architecture of Azure data warehouse
  1. Data Movement services is one of the layer that spans across both control node and compute nodes. It is a windows service that coordinates and/or manages data movement between different nodes.
  2. Control node is the entry point for all applications. It is like connecting to an instance of SQL server. The application that connects to ADW (Azure data warehouse) doesn’t know the underlying compute nodes nor will be able to connect to any of the compute nodes. The compute nodes are abstracted from our application. The control node is responsible in distributing our requests to underlying compute nodes
  3. Compute nodes are the engine of our computing power hence called compute nodes. When you issue a big query the first node you hit will be control node which will then distribute the query to compute node. Once the query run on compute nodes it is then the work of control nodes that will aggregate the results from all the compute node and return’s to our application.
  4. One of the great achievements of Azure data warehouse is the separation of dependence of storage layer from the compute node when scaling up. You should be able to increase your storage independent of your compute node.  The storage layer and compute nodes scales separately. The data will be stored in Azure Blob storage. The compute nodes are directly interacting with the storage layer when reading or writing data.

Charging is based on number of DWU. DWU (data warehouse unit) represents the amount of compute resources used at any given time.  The more DWU you use,  the faster your query runs as it will be executed on more distributed CPU and memory.  It will take some time and experience to understand the trade-off and/or optimum DWU to pricing to run your query to be completed within the time-frame you planned or anticipated.
 As per information I have got at current time, the charge is per minute. For instance, if you used 100 DWU for 30 minute you will be charged for 30 minutes instead of per hour pricing that is shown for compute capabilities. Currently, the minimum DWU you can configure from the portal is 100 DWU (with a cost of £0.43/hour) with increment of 100 up to a maximum of 2000 DWU (with a cost of £8.55/hour). If you pause you will not be charged.

  2. (

No comments: