Introduction
This architecture demonstrates a framework for providing on-premises users and computing clusters with secure access to datasets stored in the Azure cloud. It is intended to meet the massive storage capacity and high bandwidth requirements of an enterprise-scale AI/ML development group that chooses to use on-premises GPU clusters for the bulk of its development.
Links and references in this post lead to official Microsoft Azure overviews, documentation and configuration pages where necessary. For more information on the author, please feel free to reach me on LinkedIn or view my $ whoami post.
Workflow
This section walks through the workflow of the architecture, from data ingestion and storage to the construction of a virtual network within the Azure cloud. Finally, we’ll connect users and applications to cloud-based resources, secured with Microsoft Entra ID and Azure RBAC.
- Azure Data Factory pulls data from client, partner and other provider sources, starting the extract-transform-load (ETL) process. This data is both structured and unstructured, and includes batch data, streaming data and data from numerous third-party databases.
- Data is then moved to Azure Data Lake Storage Gen2 on the Azure cloud. From here, we can authenticate and authorize access to the data using Microsoft Entra ID and role-based access control (RBAC).
- Azure Virtual Network provides a virtual network in the cloud. For this solution, we utilize two subnets: one for Azure DNS with Azure Firewall, and one for a private endpoint that enables on-premises users and applications to access Azure Data Lake Storage Gen2 and other cloud resources.
- A private endpoint within the virtual network provides access to the Azure Data Lake Storage Gen2, and allows clients to securely access data within the storage account.
- By default, Azure Firewall uses Azure DNS for name resolution, with the added security benefits of a best-in-class firewall. Here, we route and filter the on-premises traffic to the Azure virtual network through custom Azure Firewall DNS settings.
- Because of the high bandwidth requirement in this scenario, we use an ExpressRoute connection and an Azure ExpressRoute virtual network gateway to connect the on-premises network directly to the private endpoint associated with the Azure Data Lake Storage Gen2 account.
- For ease of use and business continuity, we provide a common identity for all users and services accessing both cloud and on-premises resources. This is accomplished by using Microsoft Entra Connect to synchronize the on-premises AD DS with the cloud-based Microsoft Entra ID.
- Azure RBAC requires the user or application to have an identity in Microsoft Entra ID before it can be granted “coarse-grained” access to Azure Data Lake Storage Gen2.
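To make the name-resolution step of this workflow more concrete, here is a minimal Python sketch. It assumes the privatelink.dfs.core.windows.net zone that Azure uses for Data Lake Storage Gen2 private endpoints. The account name, subnet range and IP addresses mentioned afterwards are hypothetical placeholders, and the functions only model the check; they do not perform a real DNS lookup.

```python
import ipaddress

# Public and private-link DNS suffixes for the Data Lake Storage Gen2 (dfs) endpoint.
DFS_SUFFIX = "dfs.core.windows.net"
PRIVATELINK_ZONE = "privatelink.dfs.core.windows.net"


def dfs_hostnames(account: str) -> dict:
    """Return the public name and the private-link alias for a storage account."""
    return {
        "public": f"{account}.{DFS_SUFFIX}",
        "privatelink": f"{account}.{PRIVATELINK_ZONE}",
    }


def resolves_privately(resolved_ip: str, endpoint_subnet: str) -> bool:
    """True if the resolved address falls inside the private endpoint subnet."""
    return ipaddress.ip_address(resolved_ip) in ipaddress.ip_network(endpoint_subnet)
```

For example, with a hypothetical account named contosodatalake and a private endpoint subnet of 10.0.2.0/24, an on-premises client that resolves the storage hostname to 10.0.2.4 is going through the private endpoint, whereas a public address would indicate that the DNS settings are not taking effect.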
Components
- Azure Data Factory is a managed cloud service that is built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration workloads. Among many other capabilities, it includes multi-cloud support, over 100 native connectors and the ability to auto-scale to traffic volumes.
- Azure Storage offers highly available, massively scalable, durable and secure storage for a variety of data objects in the cloud. This solution utilizes Azure Data Lake Storage Gen2.
- Azure Data Lake Storage Gen2 is a centralized repository whose hierarchical namespace is a key feature that enables it to provide high-performance data access at object storage scale and price. Built on Azure Blob Storage, its ability to store petabyte-sized files and trillions of objects, sustain hundreds of gigabits of throughput, and hold both structured and unstructured data makes Azure Data Lake Storage Gen2 an ideal solution for AI/ML workflows.
- Azure Private Link enables users and applications to access Azure PaaS Services (for example, Azure Storage and SQL Database) and Azure hosted customer-owned/partner services over a private endpoint within your virtual network.
- Private Endpoint is a network interface that uses a private IP address from your virtual network. This network interface connects you privately and securely to a service that’s powered by Azure Private Link, which in our case is Azure Data Lake Storage Gen2. By utilizing a private endpoint, we’re bringing the service into the virtual network.
- Azure Virtual Network is the fundamental building block for private networks in Azure. It provides the environment for Azure resources to securely communicate with each other, with the internet, and with on-premises networks.
- Azure ExpressRoute extends on-premises networks into the Microsoft cloud over a private connection at speeds of up to 100 Gbps.
- Azure ExpressRoute Virtual Network Gateway connects the Azure virtual network and the on-premises network using ExpressRoute and serves two purposes: exchanging IP routes between the networks and routing network traffic.
- Azure Firewall is a cloud-native, intelligent network firewall security service that provides best-of-breed threat protection for your cloud workloads running in Azure.
- Microsoft Entra ID is a cloud-based identity and access management service that provides a single identity control plane to manage permissions and roles for users and applications accessing your resources.
- Microsoft Entra Connect handles all the operations that are related to synchronizing identity data between your on-premises environment and Microsoft Entra ID.
- Azure RBAC uses role assignments to apply sets of permissions to security principals. A security principal is an object that represents a user, group, service principal, or managed identity that is defined in Microsoft Entra ID.
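The role assignment model described for Azure RBAC can be sketched in a few lines of Python. This is an illustrative model only, not the Azure SDK: the RoleAssignment class, the helper names and the scope strings used afterwards are assumptions made for the example. It relies on one documented behavior, namely that a role assigned at a parent scope is inherited by the resources beneath it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    """Hypothetical stand-in for an Azure RBAC role assignment."""
    principal_id: str  # user, group, service principal or managed identity
    role: str          # role definition name
    scope: str         # resource path where the role applies


def scope_covers(assigned: str, requested: str) -> bool:
    """A role assigned at a parent scope is inherited by child scopes."""
    a = assigned.rstrip("/").lower()
    r = requested.rstrip("/").lower()
    return r == a or r.startswith(a + "/")


def roles_for(principal_id: str, resource_scope: str, assignments) -> set:
    """Collect the roles a principal holds on a resource, including inherited ones."""
    return {
        a.role
        for a in assignments
        if a.principal_id == principal_id and scope_covers(a.scope, resource_scope)
    }
```

In this sketch, a Storage Blob Data Reader assignment made at the resource group scope would also apply to a storage account inside that resource group, but not to a sibling resource group whose name merely starts with the same characters.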
Scenario Details
Consider the following scenario: A company or business unit dedicated to training next-generation AI and ML models ingests large volumes of data, including batch data, streaming data, and datasets from third-party providers. Because of the volume and varying structure of the data, the company would benefit greatly from both the flexibility and scalability of Azure cloud services, especially in the categories of security, storage, high availability, and ingestion or ETL.
However, because they consistently need many dozens of GPUs each month to train their models, they may opt to use on-premises computing clusters (rather than cloud-based GPU services) for the bulk of their model training in order to save costs.
In this architecture, Azure Data Factory facilitates the ETL process by ingesting and sorting data of varying types, both structured and unstructured. The data is then stored in Azure Data Lake Storage Gen2, which on-premises users and applications at the company’s offices access over ExpressRoute and private endpoints, using a protocol that Data Lake Storage Gen2 supports, such as its REST (ABFS) endpoint or NFS 3.0. Access to the data lake is secured by Microsoft Entra ID and Azure RBAC.
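As an illustration of how an on-premises application might address data in the lake, the sketch below builds an ABFS-style URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>. The function name and the example account, container and path are hypothetical, and nothing here contacts Azure.

```python
def abfss_uri(account: str, container: str, path: str) -> str:
    """Build an ABFS (secure) URI for a path in a Data Lake Storage Gen2 container."""
    # Normalize the path so that leading, trailing and repeated slashes do not matter.
    parts = [p for p in path.split("/") if p]
    return f"abfss://{container}@{account}.dfs.core.windows.net/" + "/".join(parts)
```

A training job could then refer to a dataset as abfss_uri("contosodatalake", "training", "images/batch01"), keeping the account and container names in configuration rather than scattered through the code.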
Considerations
The considerations for this architecture implement and expand upon the pillars of the Azure Well-Architected Framework.
Reliability
This category of considerations aims to significantly enhance the architecture’s resilience to malfunction, while facilitating rapid restoration to a normal functioning state if a failure occurs.
- Azure Storage features 99.997% availability and stores multiple copies of your data, with configurable redundancy options that include locally redundant storage (LRS) and zone-redundant storage (ZRS). Because Azure Data Lake Storage Gen2 is built on Azure Blob Storage, it inherits extremely high availability and disaster recovery capabilities.
- Azure Firewall was built with high availability as a feature, so no extra configurations or settings are required in this category.
- Azure ExpressRoute features built-in redundancy in every peering location, as well as 99.95% uptime per connection, according to its Service Level Agreement (SLA).
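To put availability percentages like these into perspective, the following sketch converts a figure into allowed downtime and combines components that all have to be up at once. It is plain arithmetic under simplifying assumptions: a 30-day month, independent failures, and hypothetical function names.

```python
def allowed_downtime_minutes(availability_pct: float, period_hours: float = 30 * 24) -> float:
    """Minutes of downtime permitted by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_hours * 60


def serial_availability(*availability_pcts: float) -> float:
    """Combined availability when every component must be up (independent failures)."""
    combined = 1.0
    for pct in availability_pcts:
        combined *= pct / 100
    return combined * 100
```

With these assumptions, 99.95% works out to roughly 21.6 minutes of allowed downtime in a 30-day month. Chaining several services in series always yields a combined figure lower than that of the weakest individual link, which is worth remembering when reading per-service SLAs.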
Scalability & Performance
Scalability is your architecture’s ability to seamlessly adapt (or scale) from its current state to a desired future state, ideally in response to workload. In this scenario, the workload is measured in three categories: (a) data ingestion and ETL, (b) storage and (c) bandwidth and throughput.
- Ingestion & ETL: Without Azure Data Factory, companies must design and build (or acquire) custom data movement components and services to integrate data sources, then perform the ETL process. This can be very costly and time-consuming. Azure Data Factory includes over 100 connectors and provides code-free ETL as a feature of the service, allowing data engineers and cloud administrators to connect, ingest, transform and store data effortlessly.
- Storage: Because of the nature of training AI and machine learning models, data engineers require massive volumes of data, which may vary greatly in structure, type and form. Azure Data Lake Storage Gen2 offers limitless scale and the ability to use a wide range of data ingestion and processing tools. Another key feature is its hierarchical namespace, which enables it to replace siloed storage solutions.
- Bandwidth: ExpressRoute Direct provides dual 100 Gbps connectivity for massive data ingestion. The standard ExpressRoute bandwidth options range from 50 Mbps to 10 Gbps, which is still more than most VPN connections provide. ExpressRoute also offers dynamic bandwidth scaling without the need to tear down and rebuild your connections.
- Throughput: Azure Data Lake Storage Gen2 is an enterprise data lake solution, designed to store multiple petabytes of information while sustaining the hundreds of gigabits of throughput often seen in big data analytics workloads.
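A quick back-of-the-envelope calculation shows why the bandwidth tier matters for this scenario. The sketch below estimates how long it takes to move a dataset over a link of a given speed. The function name and the efficiency parameter (the fraction of the nominal bandwidth actually achieved) are assumptions for illustration, and decimal units are used (1 TB = 10^12 bytes).

```python
def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to move dataset_tb terabytes over a link_gbps link."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = dataset_tb * 1e12 * 8                      # terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)   # bits divided by bits per second
    return seconds / 3600
```

Under these assumptions, a hypothetical 100 TB dataset needs a little over 22 hours at 10 Gbps and a little over 2 hours at 100 Gbps, while a 50 Mbps circuit would take months.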
Cost Optimization
Cost optimization is a continuous discipline focused on driving efficiency while reducing operational expenditures, thus maximizing profitability. The basis of this architecture is to give an organization the advantages of Azure cloud storage, security, ETL and other enterprise-scale services, whose costs scale with workload, while retaining the flexibility to use on-premises computing and GPU clusters for the bulk of development and training.
View the Azure Pricing Calculator for more information on your specific workload.
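The trade-off between on-premises GPU clusters and cloud-based GPU services can be framed as a simple break-even calculation. The sketch below is only a framing device: the function name and all inputs are hypothetical, and real figures would come from your hardware quotes and the pricing calculator.

```python
import math


def break_even_months(onprem_capex: float, onprem_monthly_opex: float, cloud_monthly_cost: float):
    """Months until on-premises hardware pays for itself versus renting in the cloud.

    Returns None when on-premises operation is not cheaper per month,
    because in that case the up-front cost is never recovered.
    """
    monthly_saving = cloud_monthly_cost - onprem_monthly_opex
    if monthly_saving <= 0:
        return None
    return math.ceil(onprem_capex / monthly_saving)
```

For instance, with purely made-up numbers, an up-front cost of 1,200,000, monthly on-premises running costs of 20,000 and an equivalent cloud bill of 70,000 per month give a break-even point of 24 months.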
Security & Authentication
Features in this category seek to protect the architecture, its workloads and its data from attacks of all types by maintaining confidentiality, integrity and availability (the CIA triad). The services and components of this architecture were selected for the security controls they provide:
- Microsoft Entra ID is the cloud-based identity and access management (IAM) service that allows administrators to grant fine-grained user and application access to resources on the Azure cloud.
- Azure RBAC controls and enforces permissions to cloud-based resources through three elements: security principal, role definition and scope.
- Azure Firewall is offered in three SKUs (Basic, Standard and Premium), all of which allow administrators to create a wide range of rule collections to reduce an organization’s attack surface.
- ExpressRoute connections do not traverse the public internet when connecting users and applications with the virtual network. Additionally, public IP addresses associated with the ExpressRoute gateway are used for internal management only. For more information, view the ExpressRoute FAQ.
- Azure Data Lake Storage Gen2 features a “finer grain security model”, which supports both Azure RBAC and access control lists (ACLs) that allow administrators to set permissions on directories and files. Further, all data is encrypted at rest with your choice of encryption keys.
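The way the coarse-grained and finer-grained layers fit together can be sketched as follows. In Data Lake Storage Gen2, Azure role assignments are evaluated first, and ACLs are consulted only if no role grants the operation. The code below is a deliberately simplified model of that order: the role-to-permission mapping is an assumption made for illustration, and it ignores parent-directory traversal, groups and ACL masks.

```python
# Simplified, assumed mapping from built-in role names to the operations they allow.
ROLE_PERMISSIONS = {
    "Storage Blob Data Reader": {"r"},
    "Storage Blob Data Contributor": {"r", "w"},
}


def is_allowed(op: str, roles, acl_entry: str) -> bool:
    """Check an operation ('r', 'w' or 'x') against RBAC roles, then an ACL entry.

    acl_entry is a POSIX-style string such as 'r-x' for the calling principal.
    """
    if op not in {"r", "w", "x"}:
        raise ValueError("op must be one of 'r', 'w' or 'x'")
    # Coarse-grained check: any role that grants the operation is sufficient.
    if any(op in ROLE_PERMISSIONS.get(role, set()) for role in roles):
        return True
    # Finer-grained check: fall back to the ACL on the directory or file.
    return op in acl_entry
```

In this model, a principal with only the reader role cannot write unless the ACL on the target grants it, while a principal with no role at all can still be given access to a single directory or file through its ACL.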
Closing Notes
In a future post, I will demonstrate an improved version of this architecture that uses both on-premises and cloud-based hardware for training ML models. I will also create a separate architecture that moves further upstream to the data collection points, and demonstrate how a company or division with many hundreds of thousands of IoT and ICS sensors may collect, transform and store their data in Azure for future use.
Please reach me on LinkedIn or through the contact form on this site with questions, comments or dialogue of any type.