Topics
- 1: Cluster Provisioning Tools Contract
- 2: Azure LoadBalancer
- 3: Azure Permissions
- 4: Use Availability Zones
- 5: Support Multiple Node Types
- 6: Deploy Cross Resource Group Nodes
- 7: Multiple Services Sharing One IP Address
- 8: Tagging resources managed by Cloud Provider Azure
- 9: Kubelet Credential Provider
- 10: Node IPAM controller
- 11: Azure Private Link Service Integration
- 12: Multiple Standard LoadBalancers
1 - Cluster Provisioning Tools Contract
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Here is a list of Azure resource assumptions that are required for cloud provider Azure:
- All Azure resources MUST be under the same tenant.
- All virtual machine names MUST be the same as their hostname.
- The cluster name set via kube-controller-manager --cluster-name=<cluster-name> MUST NOT end with -internal.
After the cluster is provisioned, cloud provider Azure MAY update the following Azure resources based on workloads:
- New routes would be added for each node if --configure-cloud-routes is enabled.
- New LoadBalancers (external and internal) would be created if they do not exist yet.
- Virtual machines and virtual machine scale sets would be added to LoadBalancer backend address pools if they are not added yet.
- New public IPs and NSG rules would be added when LoadBalancer typed services are created.
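For reference, a kube-controller-manager invocation consistent with this contract might include flags like the following sketch (the cluster name is a placeholder; other required flags are omitted):
# Sketch only: the cluster name must not end with -internal, and
# --configure-cloud-routes controls whether per-node routes are managed.
kube-controller-manager \
  --cluster-name=my-cluster \
  --configure-cloud-routes=true \
  ...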
2 - Azure LoadBalancer
Azure defines a LoadBalancer differently from GCE or AWS. An Azure LB can have multiple frontend IP configurations, while GCE and AWS allow only one per LB; if you want more, you need multiple LBs. Public IPs and NSGs are separate resources rather than part of the LB in Azure. However, they cannot be deleted in parallel: a Public IP can only be deleted after the LB's frontend IP reference to it is removed.
Azure resources such as the LB, Public IP, and NSG sit at the same tier, so circular dependencies between them need to be avoided. In other words, they should only depend on service state.
By default the basic SKU is selected for a load balancer. Services can be annotated to allow auto selection of available load balancers. Service annotations can also be used to select a specific availability set that hosts the load balancer. Note that in the case of auto selection or specific availability set selection, services are currently not automatically reassigned to an available load balancer when the availability set is lost due to downtime or cluster scale down.
LoadBalancer annotations
Below is a list of annotations supported for Kubernetes services with type LoadBalancer:
Annotation | Value | Description | Kubernetes Version |
---|---|---|---|
service.beta.kubernetes.io/azure-load-balancer-internal | true or false | Specify whether the load balancer should be internal. It’s defaulting to public if not set. | v1.10.0 and later |
service.beta.kubernetes.io/azure-load-balancer-internal-subnet | Name of the subnet | Specify which subnet the internal load balancer should be bound to. It’s defaulting to the subnet configured in cloud config file if not set. | v1.10.0 and later |
service.beta.kubernetes.io/azure-load-balancer-mode | auto, {vmset-name} | Specify the Azure load balancer selection algorithm based on VM sets (VMSS or VMAS). There are currently three possible load balancer selection modes: default, auto or "{vmset-name}". This only works for basic LB (see below for how it works). | v1.10.0 and later |
service.beta.kubernetes.io/azure-dns-label-name | Name of the PIP DNS label | Specify the DNS label name for the service’s public IP address (PIP). If it is set to empty string, DNS in PIP would be deleted. Because of a bug, before v1.15.10/v1.16.7/v1.17.3, the DNS label on PIP would also be deleted if the annotation is not specified. | v1.15.0 and later |
service.beta.kubernetes.io/azure-shared-securityrule | true or false | Specify that the service should be exposed using an Azure security rule that may be shared with another service, trading specificity of rules for an increase in the number of services that can be exposed. This relies on the Azure “augmented security rules” feature. | v1.10.0 and later |
service.beta.kubernetes.io/azure-load-balancer-resource-group | Name of the PIP resource group | Specify the resource group of the service’s PIP that are not in the same resource group as the cluster. | v1.10.0 and later |
service.beta.kubernetes.io/azure-allowed-service-tags | List of allowed service tags | Specify a list of allowed service tags separated by comma. | v1.11.0 and later |
service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout | TCP idle timeouts in minutes | Specify the time, in minutes, for TCP connection idle timeouts to occur on the load balancer. Default and minimum value is 4. Maximum value is 30. Must be an integer. | v1.11.4, v1.12.0 and later |
service.beta.kubernetes.io/azure-pip-name | Name of PIP | Specify the PIP that will be applied to load balancer. It is used for IPv4 or IPv6 in a single stack cluster. | v1.16 and later |
service.beta.kubernetes.io/azure-pip-name-ipv6 | Name of IPv6 PIP | After v1.27, specify the IPv6 PIP that will be applied to load balancer in a dual stack cluster. For single stack clusters, this annotation will be ignored. | v1.27 and later |
service.beta.kubernetes.io/azure-pip-prefix-id | ID of Public IP Prefix | Specify the Public IP Prefix that will be applied to load balancer. It is for IPv4 or IPv6 in a single stack cluster. | v1.21 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/azure-pip-prefix-id-ipv6 | ID of IPv6 Public IP Prefix | After v1.27, specify the IPv6 Public IP Prefix that will be applied to load balancer in a dual stack cluster. For single stack clusters, this annotation will be ignored. | v1.27 and later |
service.beta.kubernetes.io/azure-pip-tags | Tags of the PIP | Specify the tags of the PIP that will be associated to the load balancer typed service. Doc | v1.20 and later |
service.beta.kubernetes.io/azure-load-balancer-health-probe-interval | Health probe interval | Refer to the detailed docs here | v1.21 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe | The minimum number of unhealthy responses of health probe | Refer to the detailed docs here | v1.21 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path | Request path of the health probe | Refer to the detailed docs here | v1.20 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/azure-load-balancer-ipv4 | Load balancer IPv4 address | Specify the load balancer IP of IPv4, deprecating Service.spec.loadBalancerIP | v1.21 and later |
service.beta.kubernetes.io/azure-load-balancer-ipv6 | Load balancer IPv6 address | Specify the load balancer IP of IPv6, deprecating Service.spec.loadBalancerIP | v1.21 and later |
service.beta.kubernetes.io/port_{port}_no_lb_rule | true/false | {port} is the port number in the service. When it is set to true, no lb rule and no health probe rule for this port will be generated. Useful when a health check service should not be exposed to the public internet (e.g. istio/envoy health check service). | v1.24 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/port_{port}_no_probe_rule | true/false | {port} is the port number in the service. When it is set to true, no health probe rule for this port will be generated. | v1.24 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/port_{port}_health-probe_protocol | Health probe protocol | {port} is the port number in the service. Explicit protocol for the health probe for the service port {port}, overriding port.appProtocol if set. Refer to the detailed docs here | v1.24 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/port_{port}_health-probe_port | port number or port name in service manifest | {port} is the port number in the service. Explicit port for the health probe for the service port {port}, overriding the default value. Refer to the detailed docs here | v1.24 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/port_{port}_health-probe_interval | Health probe interval | {port} is port number of service. Refer to the detailed docs here | v1.21 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/port_{port}_health-probe_num-of-probe | The minimum number of unhealthy responses of health probe | {port} is port number of service. Refer to the detailed docs here | v1.21 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/port_{port}_health-probe_request-path | Request path of the health probe | {port} is port number of service. Refer to the detailed docs here | v1.20 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/azure-load-balancer-enable-high-availability-ports | Enable high availability ports on internal SLB | HA ports are required when applications require IP fragments | v1.20 and later |
service.beta.kubernetes.io/azure-deny-all-except-load-balancer-source-ranges | true or false | Deny all traffic to the service except that from the load balancer source ranges. This is helpful when service.Spec.LoadBalancerSourceRanges is set on an internal load balancer typed service: although the generated NSG includes rules for loadBalancerSourceRanges, the default rule (65000) still allows any VNet traffic, so the allow-list has no effect. This annotation solves that issue. | v1.21 and later |
service.beta.kubernetes.io/azure-additional-public-ips | External public IPs besides the service’s own public IP | It is mainly used for global VIP on Azure cross-region LoadBalancer | v1.20 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip | true or false | Disable Floating IP configuration for load balancer | v1.21 and later with out-of-tree cloud provider |
service.beta.kubernetes.io/azure-pip-ip-tags | comma separated key-value pairs a=b,c=d , for example RoutingPreference=Internet | Refer to the doc | v1.21 and later with out-of-tree cloud provider |
Please note that:
- When loadBalancerSourceRanges has been set on the service spec, service.beta.kubernetes.io/azure-allowed-service-tags won't work because of DROP iptables rules from kube-proxy. The CIDRs from the service tags should be merged into loadBalancerSourceRanges to make it work.
- When allocateLoadBalancerNodePorts is set to false, ensure the following conditions are met:
  - Set externalTrafficPolicy to Local.
  - Enable the FloatingIP feature by either not setting the annotation service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip, or setting its value to false.
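For example, several of the annotations above can be combined on a single Service. The manifest below is a minimal sketch of an internal load balancer bound to a specific subnet; the service name, selector and subnet name are illustrative placeholders:
apiVersion: v1
kind: Service
metadata:
  name: internal-app            # placeholder name
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "apps-subnet"  # placeholder subnet name
spec:
  type: LoadBalancer
  selector:
    app: internal-app
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080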
Setting LoadBalancer IP
If you want to specify an IP address for the load balancer, there are two ways:
- Recommended: Set the Service annotations service.beta.kubernetes.io/azure-load-balancer-ipv4 for an IPv4 address and service.beta.kubernetes.io/azure-load-balancer-ipv6 for an IPv6 address. Dual-stack support will be implemented soon. This is highly recommended for new Services.
- Deprecated: Set the Service field Service.Spec.LoadBalancerIP. This field is being deprecated following upstream Kubernetes and it cannot support dual-stack. However, current usage remains the same and existing Services are expected to work without modification.
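A minimal sketch of the recommended annotation-based approach is shown below; the service name, selector and IP address are placeholders, and the address must correspond to a pre-allocated public IP for an external service:
apiVersion: v1
kind: Service
metadata:
  name: pinned-ip-app           # placeholder name
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "1.2.3.4"  # placeholder, pre-allocated public IP
spec:
  type: LoadBalancer
  selector:
    app: pinned-ip-app
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080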
Load balancer selection modes
This is only useful for clusters with basic SKU load balancers. There are currently three possible load balancer selection modes:
- Default mode - the service has no annotation ("service.beta.kubernetes.io/azure-load-balancer-mode"). In this case the load balancer of the primary availability set is selected.
- "auto" mode - the service is annotated with the __auto__ value. In this case, services would be associated with the load balancer with the minimum number of rules.
- "{vmset-name}" mode - the service is annotated with the name of a VMSS/VMAS. In this case, only load balancers of the specified VMSS/VMAS would be selected, and services would be associated with the one with the minimum number of rules.
Note that the "auto" mode is valid only if the service is newly created. It is not allowed to change the annotation value to __auto__ for an existing service.
The selection mode for a load balancer only works for basic load balancers. The following details the allowed number of VMSS/VMAS per load balancer:
- Standard SKU supports any virtual machine in a single virtual network, including a mix of virtual machines, availability sets, and virtual machine scale sets. So all the nodes would be added to the same standard LB backend pool with a max size of 1000.
- Basic SKU only supports virtual machines in a single availability set, or a virtual machine scale set. Only nodes with the same availability set or virtual machine scale set would be added to the basic LB backend pool.
LoadBalancer SKUs
Azure cloud provider supports both basic and standard SKU load balancers, which can be set via the loadBalancerSku option in the cloud config file. A list of differences between these two SKUs can be found here.
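For example, to select the standard SKU, the cloud config file (e.g. /etc/kubernetes/cloud-config/azure.json) would contain a fragment like the following (other required fields omitted):
{
  "loadBalancerSku": "standard",
  ...
}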
Note that the public IPs used in load balancer frontend configurations should be the same SKU. That is a standard SKU public IP for standard load balancer and a basic SKU public IP for a basic load balancer.
Azure doesn't support a network interface joining load balancers with different SKUs, hence dynamic migration between them is not supported. If you do require migration, please delete all services with type LoadBalancer (or change them to another type) before switching SKUs.
Outbound connectivity
Outbound connectivity is also different between the two load balancer SKUs:
For the basic SKU, the outbound connectivity is opened by default. If multiple frontends are set, then the outbound IP is selected randomly (and configurable) from them.
For the standard SKU, the outbound connectivity is disabled by default. There are two ways to open the outbound connectivity: use a standard public IP with the standard load balancer or define outbound rules.
Standard LoadBalancer
Because the load balancer in a Kubernetes cluster is managed by the Azure cloud provider and may change dynamically (e.g. the public load balancer would be deleted if no services are defined with type LoadBalancer), outbound rules are the recommended path if you want to ensure the outbound connectivity for all nodes.
Especially note:
In the context of outbound connectivity, a single standalone VM, all the VMs in an availability set, and all the instances in a VMSS behave as a group. This means that if a single VM in an availability set is associated with a Standard SKU, all VM instances within this availability set behave by the same rules as if they were associated with the Standard SKU, even if an individual instance is not directly associated with it.
Public IPs used as instance-level public IPs are mutually exclusive with outbound rules.
Here is the recommended way to define the outbound rules when using separate provisioning tools:
- Create a separate IP (or multiple IPs for scale) in a standard SKU for outbound rules. Make use of the allocatedOutboundPorts parameter to allocate sufficient ports for your desired scenario scale.
- Create a separate pool definition for outbound, and ensure all virtual machines or VMSS virtual machines are in this pool. Azure cloud provider will manage the load balancer rules with another pool, so that provisioning tools and the Azure cloud provider won’t affect each other.
- Define inbound load balancing rules and inbound NAT rules as needed, and set disableOutboundSNAT to true on the load balancing rule(s). Don't rely on the side effect of these rules for outbound connectivity: it makes things messier than they need to be and limits your options. Use inbound NAT rules to create port forwarding mappings for SSH access to the VMs rather than burning public IPs per instance.
Exclude nodes from the load balancer
Excluding nodes from Azure LoadBalancer is supported since v1.20.0.
The Kubernetes controller manager supports excluding nodes from the load balancer backend pools by enabling the feature gate ServiceNodeExclusion. To exclude nodes from the Azure LoadBalancer, the label node.kubernetes.io/exclude-from-external-load-balancers=true should be added to the nodes.
To use the feature, the feature gate ServiceNodeExclusion should be on (enabled by default since it became beta in v1.19). The labeled nodes would be excluded from the LB in the next LB reconcile loop, which needs one or more LB typed services to trigger. Basically, users could trigger the update by creating a service. If one or more LB typed services already exist, no extra operations are needed.
To re-include the nodes, just remove the label; the update would take effect in the next LB reconcile loop.
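For example, a node could be excluded and later re-included like this (the node name is a placeholder):
# Exclude the node from LoadBalancer backend pools
kubectl label node <node-name> node.kubernetes.io/exclude-from-external-load-balancers=true

# Re-include the node by removing the label
kubectl label node <node-name> node.kubernetes.io/exclude-from-external-load-balancers-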
Limitations
- Excluding nodes from LoadBalancer is not supported on AKS managed nodes.
Using SCTP
SCTP protocol services are only supported on internal standard LoadBalancers, hence the annotation service.beta.kubernetes.io/azure-load-balancer-internal: "true" should be added to SCTP protocol services. See below for an example:
apiVersion: v1
kind: Service
metadata:
  name: sctpservice
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: sctpserver
  ports:
    - name: sctpserver
      protocol: SCTP
      port: 30102
      targetPort: 30102
Custom Load Balancer health probe
As documented here, TCP, HTTP and HTTPS are the three protocols supported by the load balancer service.
Currently, the default protocol of the health probe varies among services with different transport protocols, app protocols, annotations and external traffic policies.
- for local services, HTTP and /healthz would be used. The health probe will query NodeHealthPort rather than actual backend service
- for cluster TCP services, TCP would be used.
- for cluster UDP services, no health probes.
Note: For local services with PLS integration and PLS proxy protocol enabled, the default HTTP+/healthz health probe does not work. Thus health probe can be customized the same way as cluster services to support this scenario. For more details, please check PLS Integration Note.
Since v1.20, service annotation service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path is introduced to determine the health probe behavior.
- For clusters <=1.23, spec.ports.appProtocol would only be used as the probe protocol when service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path is also set.
- For clusters >=1.24, spec.ports.appProtocol would be used as the probe protocol and / would be used as the default probe request path (service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path could be used to change to a different request path).
Note that the request path would be ignored when using TCP or when spec.ports.appProtocol is empty. More specifically:
loadbalancer sku | externalTrafficPolicy | spec.ports.Protocol | spec.ports.AppProtocol | service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path | LB Probe Protocol | LB Probe Request Path |
---|---|---|---|---|---|---|
standard | local | any | any | any | http | /healthz |
standard | cluster | udp | any | any | null | null |
standard | cluster | tcp | | (ignored) | tcp | null |
standard | cluster | tcp | tcp | (ignored) | tcp | null |
standard | cluster | tcp | http/https | | TCP(<=1.23) or http/https(>=1.24) | null(<=1.23) or / (>=1.24) |
standard | cluster | tcp | http/https | /custom-path | http/https | /custom-path |
standard | cluster | tcp | unsupported protocol | /custom-path | tcp | null |
basic | local | any | any | any | http | /healthz |
basic | cluster | tcp | | (ignored) | tcp | null |
basic | cluster | tcp | tcp | (ignored) | tcp | null |
basic | cluster | tcp | http | | TCP(<=1.23) or http/https(>=1.24) | null(<=1.23) or / (>=1.24) |
basic | cluster | tcp | http | /custom-path | http | /custom-path |
basic | cluster | tcp | unsupported protocol | /custom-path | tcp | null |
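For instance, per the table above, a cluster-traffic service port with appProtocol: HTTP and a custom request path annotation gets an HTTP probe on that path. A minimal sketch (names, ports and path are placeholders):
apiVersion: v1
kind: Service
metadata:
  name: web                     # placeholder name
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: "/probe"  # placeholder path
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - name: http
      protocol: TCP
      appProtocol: HTTP
      port: 80
      targetPort: 8080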
Since v1.21, two service annotations service.beta.kubernetes.io/azure-load-balancer-health-probe-interval and service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe are introduced to customize the health probe configuration. If service.beta.kubernetes.io/azure-load-balancer-health-probe-interval is not set, a default value of 5 is applied. If service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe is not set, a default value of 2 is applied. The total probe time (interval × number of probes) should be less than 120 seconds.
Custom Load Balancer health probe for port
Different ports in a service may require different health probe configurations. This could be because of service design (such as a single health endpoint controlling multiple ports), or Kubernetes features like the MixedProtocolLBService.
The following annotations can be used to customize probe configuration per service port.
port specific annotation | global probe annotation | Usage |
---|---|---|
service.beta.kubernetes.io/port_{port}_no_lb_rule | N/A (no equivalent globally) | if set true, no lb rules and probe rules will be generated |
service.beta.kubernetes.io/port_{port}_no_probe_rule | N/A (no equivalent globally) | if set true, no probe rules will be generated |
service.beta.kubernetes.io/port_{port}_health-probe_protocol | N/A (no equivalent globally) | Set the health probe protocol for this service port (e.g. Http, Https, Tcp) |
service.beta.kubernetes.io/port_{port}_health-probe_port | N/A (no equivalent globally) | Sets the health probe port for this service port (e.g. 15021) |
service.beta.kubernetes.io/port_{port}_health-probe_request-path | service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path | For Http or Https, sets the health probe request path. Defaults to / |
service.beta.kubernetes.io/port_{port}_health-probe_num-of-probe | service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe | Number of consecutive probe failures before the port is considered unhealthy |
service.beta.kubernetes.io/port_{port}_health-probe_interval | service.beta.kubernetes.io/azure-load-balancer-health-probe-interval | The amount of time between probe attempts |
In the following manifest, the probe rule for port httpsserver differs from the one for httpserver because port-specific annotations are specified for port httpsserver.
apiVersion: v1
kind: Service
metadata:
  name: appservice
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "5"
    service.beta.kubernetes.io/port_443_health-probe_num-of-probe: "4"
spec:
  type: LoadBalancer
  selector:
    app: server
  ports:
    - name: httpserver
      protocol: TCP
      port: 80
      targetPort: 30102
    - name: httpsserver
      protocol: TCP
      appProtocol: HTTPS
      port: 443
      targetPort: 30104
In this manifest, the https port uses a custom health probe: an HTTP readiness check against port 10256 on /healthz (the healthz endpoint of kube-proxy).
apiVersion: v1
kind: Service
metadata:
  name: istio
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/port_443_health-probe_protocol: "http"
    service.beta.kubernetes.io/port_443_health-probe_port: "10256"
    service.beta.kubernetes.io/port_443_health-probe_request-path: "/healthz"
spec:
  ports:
    - name: https
      protocol: TCP
      port: 443
      targetPort: 8443
      nodePort: 30104
      appProtocol: https
  selector:
    app: istio-ingressgateway
    gateway: istio-ingressgateway
    istio: ingressgateway
  type: LoadBalancer
  sessionAffinity: None
  externalTrafficPolicy: Local
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  allocateLoadBalancerNodePorts: true
  internalTrafficPolicy: Cluster
In this manifest, the https port uses a different health probe endpoint: an HTTP readiness check against port 30000 on /healthz/ready.
apiVersion: v1
kind: Service
metadata:
  name: istio
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/port_443_health-probe_protocol: "http"
    service.beta.kubernetes.io/port_443_health-probe_port: "30000"
    service.beta.kubernetes.io/port_443_health-probe_request-path: "/healthz/ready"
spec:
  ports:
    - name: https
      protocol: TCP
      port: 443
      targetPort: 8443
      appProtocol: https
  selector:
    app: istio-ingressgateway
    gateway: istio-ingressgateway
    istio: ingressgateway
  type: LoadBalancer
  sessionAffinity: None
  externalTrafficPolicy: Local
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  allocateLoadBalancerNodePorts: true
  internalTrafficPolicy: Cluster
Probing kube-proxy with a shared health probe
This feature is supported since v1.28.5
With externalTrafficPolicy: Local, SLB directly probes kube-proxy – the component it is directing traffic to. If kube-proxy is experiencing an issue on a given node, this will be visible on the healthCheckNodePort and SLB will stop sending traffic to this node.
With externalTrafficPolicy: Cluster, the probes are directed to the backend application, so SLB can only know kube-proxy's health indirectly – by whether the probes are forwarded to a backend application and answered successfully. This indirection causes confusion and problems in multiple different ways.
Since v1.28.5, a shared health probe can be used to probe kube-proxy. This feature is enabled by setting clusterServiceLoadBalancerHealthProbeMode: "shared" in the cloud provider configuration. When this feature is enabled, the health probe is configured to probe kube-proxy on the healthCheckNodePort, which allows SLB to directly probe kube-proxy and thus detect kube-proxy issues more quickly and accurately. The customization options are listed below:
Configuration | Default | Description |
---|---|---|
clusterServiceLoadBalancerHealthProbeMode | servicenodeport | Supported values are shared and servicenodeport. All ETP cluster services will share one health probe if shared is set. Otherwise, each ETP cluster service will have its own health probe. |
clusterServiceSharedLoadBalancerHealthProbePort | 10256 | Default to kube-proxy healthCheckNodePort. |
clusterServiceSharedLoadBalancerHealthProbePath | /healthz | Default to kube-proxy health check path. |
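Putting these options together, a cloud provider configuration fragment enabling the shared probe mode could look like the following sketch (the port and path shown are the documented defaults; exact value types and any other required fields are assumptions and are omitted):
{
  "clusterServiceLoadBalancerHealthProbeMode": "shared",
  "clusterServiceSharedLoadBalancerHealthProbePort": 10256,
  "clusterServiceSharedLoadBalancerHealthProbePath": "/healthz",
  ...
}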
When a service is integrated with a private link service and uses the proxy protocol, the health check requests to kube-proxy will fail. A new cloud-node-manager sidecar, health-probe-proxy, is introduced to solve this issue. The sidecar reads the health check requests from the load balancer, parses the proxy protocol header, forwards the request to kube-proxy, and returns the response to the load balancer. If the proxy protocol is not used, the sidecar forwards the request to kube-proxy without any modification. To enable the health-probe-proxy sidecar, turn on cloudNodeManager.enableHealthProbeProxy in the helm chart, or deploy it as a daemonset manually. To read more, check this documentation.
Configure Load Balancer backend
This feature is supported since v1.23.0
The backend pool type can be configured by specifying loadBalancerBackendPoolConfigurationType in the cloud configuration file. There are three possible values:
- nodeIPConfiguration (default). In this case we attach nodes to the LB by calling the VMSS/NIC API to associate the corresponding node IP configuration with the LB backend pool.
- nodeIP. In this case we attach nodes to the LB by calling the LB API to add the node private IP addresses to the LB backend pool.
- podIP (not supported yet). In this case we do not attach nodes to the LB. Instead, pod IPs are added directly to the LB backend pool.
To migrate from one backend pool type to another, just change the value of loadBalancerBackendPoolConfigurationType and re-apply the cloud configuration file. There will be downtime during the migration process.
Migration API from nodeIPConfiguration to nodeIP
This feature is supported since v1.24.0
The migration from nodeIPConfiguration to nodeIP can be done without downtime by configuring "enableMigrateToIPBasedBackendPoolAPI": true in the cloud configuration file.
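For example, a cloud configuration fragment selecting the IP-based backend pool and enabling the downtime-free migration API could look like this sketch (other required fields omitted):
{
  "loadBalancerBackendPoolConfigurationType": "nodeIP",
  "enableMigrateToIPBasedBackendPoolAPI": true,
  ...
}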
Load balancer limits
The limits of the load balancer related resources are listed below:
**Standard Load Balancer**
| Resource | Limit |
| --------------------------------------- | ----------------------------------------------- |
| Load balancers | 1,000 |
| Rules per resource | 1,500 |
| Rules per NIC (across all IPs on a NIC) | 300 |
| Frontend IP configurations | 600 |
| Backend pool size | 1,000 IP configurations, single virtual network |
| Backend resources per Load Balancer | 150 |
| High-availability ports | 1 per internal frontend |
| Outbound rules per Load Balancer | 600 |
| Load Balancers per VM | 2 (1 Public and 1 internal) |
The limit is up to 150 resources, in any combination of standalone virtual machine resources, availability set resources, and virtual machine scale-set placement groups.
**Basic Load Balancer**
| Resource | Limit |
| --------------------------------------- | ---------------------------------------------- |
| Load balancers | 1,000 |
| Rules per resource | 250 |
| Rules per NIC (across all IPs on a NIC) | 300 |
| Frontend IP configurations | 200 |
| Backend pool size | 300 IP configurations, single availability set |
| Availability sets per Load Balancer | 1 |
| Load Balancers per VM | 2 (1 Public and 1 internal) |
3 - Azure Permissions
Azure cloud provider requires a set of permissions to manage Azure resources. Here is a list of all permissions and the reasons why they're required.
// Required to create, delete or update LoadBalancer for LoadBalancer service
Microsoft.Network/loadBalancers/delete
Microsoft.Network/loadBalancers/read
Microsoft.Network/loadBalancers/write
Microsoft.Network/loadBalancers/backendAddressPools/read
Microsoft.Network/loadBalancers/backendAddressPools/write
Microsoft.Network/loadBalancers/backendAddressPools/delete
// Required to allow query, create or delete public IPs for LoadBalancer service
Microsoft.Network/publicIPAddresses/delete
Microsoft.Network/publicIPAddresses/read
Microsoft.Network/publicIPAddresses/write
// Required if public IPs from another resource group are used for LoadBalancer service
// This is because of the linked access check when adding the public IP to LB frontendIPConfiguration
Microsoft.Network/publicIPAddresses/join/action
// Required to create or delete security rules for LoadBalancer service
Microsoft.Network/networkSecurityGroups/read
Microsoft.Network/networkSecurityGroups/write
// Required to create, delete or update AzureDisks
Microsoft.Compute/disks/delete
Microsoft.Compute/disks/read
Microsoft.Compute/disks/write
Microsoft.Compute/locations/DiskOperations/read
// Required to create, update or delete storage accounts for AzureFile or AzureDisk
Microsoft.Storage/storageAccounts/delete
Microsoft.Storage/storageAccounts/listKeys/action
Microsoft.Storage/storageAccounts/read
Microsoft.Storage/storageAccounts/write
Microsoft.Storage/operations/read
// Required to create, delete or update routeTables and routes for nodes
Microsoft.Network/routeTables/read
Microsoft.Network/routeTables/routes/delete
Microsoft.Network/routeTables/routes/read
Microsoft.Network/routeTables/routes/write
Microsoft.Network/routeTables/write
// Required to query information for VM (e.g. zones, faultdomain, size and data disks)
Microsoft.Compute/virtualMachines/read
// Required to attach AzureDisks to VM
Microsoft.Compute/virtualMachines/write
// Required to query information for vmssVM (e.g. zones, faultdomain, size and data disks)
Microsoft.Compute/virtualMachineScaleSets/read
Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read
Microsoft.Compute/virtualMachineScaleSets/virtualmachines/instanceView/read
// Required to add VM to LoadBalancer backendAddressPools
Microsoft.Network/networkInterfaces/write
// Required to add vmss to LoadBalancer backendAddressPools
Microsoft.Compute/virtualMachineScaleSets/write
// Required to attach AzureDisks and add vmssVM to LB
Microsoft.Compute/virtualMachineScaleSets/virtualmachines/write
// Required to upgrade VMSS models to latest for all instances
// only needed for Kubernetes 1.11.0-1.11.9, 1.12.0-1.12.8, 1.13.0-1.13.5, 1.14.0-1.14.1
Microsoft.Compute/virtualMachineScaleSets/manualupgrade/action
// Required to query internal IPs and loadBalancerBackendAddressPools for VM
Microsoft.Network/networkInterfaces/read
// Required to query internal IPs and loadBalancerBackendAddressPools for vmssVM
microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces/read
// Required to get public IPs for vmssVM
Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces/ipconfigurations/publicipaddresses/read
// Required to check whether subnet existing for ILB in another resource group
Microsoft.Network/virtualNetworks/read
Microsoft.Network/virtualNetworks/subnets/read
// Required to create, update or delete snapshots for AzureDisk
Microsoft.Compute/snapshots/delete
Microsoft.Compute/snapshots/read
Microsoft.Compute/snapshots/write
// Required to get vm sizes for getting AzureDisk volume limit
Microsoft.Compute/locations/vmSizes/read
Microsoft.Compute/locations/operations/read
// Required to create, update or delete PrivateLinkService for Service
Microsoft.Network/privatelinkservices/delete
Microsoft.Network/privatelinkservices/privateEndpointConnections/delete
Microsoft.Network/privatelinkservices/read
Microsoft.Network/privatelinkservices/write
Microsoft.Network/virtualNetworks/subnets/write
4 - Use Availability Zones
Feature Status: Alpha since v1.12.
Kubernetes v1.12 adds support for Azure availability zones (AZ). Nodes in an availability zone will be added with the label failure-domain.beta.kubernetes.io/zone=<region>-<AZ>, and topology-aware provisioning is added for the Azure managed disks storage class.
Pre-requirements
Because only the standard load balancer is supported with AZ, using the standard SKU is a prerequisite when enabling AZ for the cluster. It should be configured in the Azure cloud provider config file (e.g. /etc/kubernetes/cloud-config/azure.json):
{
"loadBalancerSku": "standard",
...
}
If the topology-aware provisioning feature is used, the feature gate VolumeScheduling should be enabled on master components (e.g. kube-apiserver, kube-controller-manager and kube-scheduler).
Node labels
Both zoned and unzoned nodes are supported, but the value of the node label failure-domain.beta.kubernetes.io/zone is different:
- For zoned nodes, the value is <region>-<AZ>, e.g. centralus-1.
- For unzoned nodes, the value is the fault domain, e.g. 0.
For example:
$ kubectl get nodes --show-labels
NAME STATUS AGE VERSION LABELS
kubernetes-node12 Ready 6m v1.11 failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1,...
Load Balancer
Since loadBalancerSku has been set to standard in the cloud provider config file, a standard load balancer and standard public IPs will be provisioned automatically for services with type LoadBalancer. Both the load balancer and public IPs are zone redundant.
Managed Disks
Zone-aware and topology-aware provisioning are supported for Azure managed disks. To support these features, a few options are added in AzureDisk storage class:
- zoned: indicates whether new disks are provisioned with AZ. Default is true.
- allowedTopologies: indicates which topologies are allowed for topology-aware provisioning. Can only be set if zoned is not false.
StorageClass examples
An example of zone-aware provisioning storage class is:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "true"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer
Another example of topology-aware provisioning storage class is:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - centralus-1
          - centralus-2
PV examples
When the feature gate VolumeScheduling is disabled, no NodeAffinity is set for a zoned PV:
$ kubectl describe pv
Name: pvc-d30dad05-9ad8-11e8-94f2-000d3a07de8c
Labels: failure-domain.beta.kubernetes.io/region=southeastasia
failure-domain.beta.kubernetes.io/zone=southeastasia-2
Annotations: pv.kubernetes.io/bound-by-controller=yes
pv.kubernetes.io/provisioned-by=kubernetes.io/azure-disk
volumehelper.VolumeDynamicallyCreatedByKey=azure-disk-dynamic-provisioner
Finalizers: [kubernetes.io/pv-protection]
StorageClass: default
Status: Bound
Claim: default/pvc-azuredisk
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 5Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/region in [southeastasia]
failure-domain.beta.kubernetes.io/zone in [southeastasia-2]
Message:
Source:
Type: AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
DiskName: k8s-5b3d7b8f-dynamic-pvc-d30dad05-9ad8-11e8-94f2-000d3a07de8c
DiskURI: /subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Compute/disks/k8s-5b3d7b8f-dynamic-pvc-d30dad05-9ad8-11e8-94f2-000d3a07de8c
Kind: Managed
FSType:
CachingMode: None
ReadOnly: false
Events: <none>
When the feature gate VolumeScheduling is enabled, NodeAffinity will be populated for a zoned PV:
$ kubectl describe pv
Name: pvc-0284337b-9ada-11e8-a7f6-000d3a07de8c
Labels: failure-domain.beta.kubernetes.io/region=southeastasia
failure-domain.beta.kubernetes.io/zone=southeastasia-2
Annotations: pv.kubernetes.io/bound-by-controller=yes
pv.kubernetes.io/provisioned-by=kubernetes.io/azure-disk
volumehelper.VolumeDynamicallyCreatedByKey=azure-disk-dynamic-provisioner
Finalizers: [kubernetes.io/pv-protection]
StorageClass: default
Status: Bound
Claim: default/pvc-azuredisk
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 5Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/region in [southeastasia]
failure-domain.beta.kubernetes.io/zone in [southeastasia-2]
Message:
Source:
Type: AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
DiskName: k8s-5b3d7b8f-dynamic-pvc-0284337b-9ada-11e8-a7f6-000d3a07de8c
DiskURI: /subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Compute/disks/k8s-5b3d7b8f-dynamic-pvc-0284337b-9ada-11e8-a7f6-000d3a07de8c
Kind: Managed
FSType:
CachingMode: None
ReadOnly: false
Events: <none>
Since unzoned disks cannot be attached to zoned nodes, NodeAffinity will also be set for them so that they are only scheduled to unzoned nodes:
$ kubectl describe pv pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
Name: pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
Labels: <none>
Annotations: pv.kubernetes.io/bound-by-controller=yes
pv.kubernetes.io/provisioned-by=kubernetes.io/azure-disk
volumehelper.VolumeDynamicallyCreatedByKey=azure-disk-dynamic-provisioner
Finalizers: [kubernetes.io/pv-protection]
StorageClass: azuredisk-unzoned
Status: Bound
Claim: default/unzoned-pvc
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 5Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/region in [southeastasia]
failure-domain.beta.kubernetes.io/zone in [0]
Term 1: failure-domain.beta.kubernetes.io/region in [southeastasia]
failure-domain.beta.kubernetes.io/zone in [1]
Term 2: failure-domain.beta.kubernetes.io/region in [southeastasia]
failure-domain.beta.kubernetes.io/zone in [2]
Message:
Source:
Type: AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
DiskName: k8s-5b3d7b8f-dynamic-pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
DiskURI: /subscriptions/<subscription>/resourceGroups/<rg-name>/providers/Microsoft.Compute/disks/k8s-5b3d7b8f-dynamic-pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
Kind: Managed
FSType:
CachingMode: None
ReadOnly: false
Events: <none>
Appendix
Note that unlike most cases, fault domains and availability zones mean different things on Azure:
- A Fault Domain (FD) is essentially a rack of servers. It consumes subsystems like network, power, cooling etc.
- Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking.
An Availability Zone in an Azure region is a combination of a fault domain and an update domain (similar to an FD, but for updates: when upgrading a deployment, it is carried out one update domain at a time). For example, if you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across three fault domains and three update domains.
Reference
See design docs for AZ in KEP for Azure availability zones.
5 - Support Multiple Node Types
Kubernetes v1.26 adds support for using Azure VMSS Flex VMs as cluster nodes. Mixing different VM types in the same cluster is also supported. There is no API change expected from the end user's perspective when operating the Kubernetes cluster; however, users can choose to specify the VM type when configuring the cloud provider to further optimize the API calls made by the Cloud Controller Manager. Below are the suggested configurations based on the cluster modes.
Node Type | Configurations | Comments |
---|---|---|
Standalone VMs or AvailabilitySet VMs | vmType == standard | This will bypass the node type check and assume all the nodes in the cluster are standalone VMs / AvailabilitySet VMs. This should only be used for pure standalone VM / AvailabilitySet VM clusters. |
VMSS Uniform VMs | vmType==vmss && DisableAvailabilitySetNodes==true && EnableVmssFlexNodes==false | This will bypass the node type check and assume all the nodes in the cluster are VMSS Uniform VMs. This should only be used for pure VMSS Uniform VM clusters. |
VMSS Flex VMs | vmType==vmssflex | This will bypass the node type check and assume all the nodes in the cluster are VMSS Flex VMs. This should only be used for pure VMSS Flex VM clusters (since v1.26.0). |
Standalone VMs, AvailabilitySet VMs, VMSS Uniform VMs and VMSS Flex VMs | vmType==vmss && (DisableAvailabilitySetNodes==false || EnableVmssFlexNodes==true) | This should be used for clusters whose nodes are a mix of standalone VMs, AvailabilitySet VMs, VMSS Flex VMs (since v1.26.0) and VMSS Uniform VMs. The node type will be checked and the corresponding cloud provider API will be called based on the node type. |
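As a sketch, for a pure VMSS Flex VM cluster the cloud configuration file would set the VM type roughly as follows (other required fields omitted; the related Disable/Enable switches in the table are likewise cloud config options):
{
  "vmType": "vmssflex",
  ...
}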
6 - Deploy Cross Resource Group Nodes
Feature status: GA since v1.21.
Kubernetes v1.21 adds support for cross resource group (RG) nodes and unmanaged (such as on-prem) nodes in Azure cloud provider. A few assumptions are made for such nodes:
- Cross-RG nodes are in the same region and are set with the required labels (as clarified in the following part)
- Nodes will not be part of the load balancer managed by cloud provider
- Both node and container networking should be configured properly by provisioning tools
- AzureDisk is supported for Azure cross-RG nodes, but not for on-prem nodes
Pre-requirements
Because cross-RG nodes and unmanaged nodes won't be added to Azure load balancer backends, feature gate ServiceNodeExclusion should be enabled for master components (ServiceNodeExclusion has been GA and enabled by default since v1.21).
Cross-RG nodes
Cross-RG nodes should register themselves with the required labels together with the cloud provider config:
- node.kubernetes.io/exclude-from-external-load-balancers, which is used to exclude the node from the load balancer. alpha.service-controller.kubernetes.io/exclude-balancer=true should be used instead if the cluster version is below v1.16.0.
- kubernetes.azure.com/resource-group=<rg-name>, which provides the external RG and is used to get node information.
- cloud provider config:
  - --cloud-provider=azure when using kube-controller-manager
  - --cloud-provider=external when using cloud-controller-manager
For example,
kubelet ... \
--cloud-provider=azure \
--cloud-config=/etc/kubernetes/cloud-config/azure.json \
--node-labels=node.kubernetes.io/exclude-from-external-load-balancers=true,kubernetes.azure.com/resource-group=<rg-name>
Unmanaged nodes
On-prem nodes are different from Azure nodes: all Azure-coupled features (such as load balancers and Azure managed disks) are not supported for them. To prevent the node from being deleted, the Azure cloud provider will always assume the node exists.
On-prem nodes should register themselves with the labels node.kubernetes.io/exclude-from-external-load-balancers=true and kubernetes.azure.com/managed=false:
- node.kubernetes.io/exclude-from-external-load-balancers=true, which is used to exclude the node from the load balancer.
- kubernetes.azure.com/managed=false, which indicates the node is on-prem or on other clouds.
For example,
kubelet ...\
--cloud-provider= \
--node-labels=node.kubernetes.io/exclude-from-external-load-balancers=true,kubernetes.azure.com/managed=false
Limitations
Cross resource group nodes and unmanaged nodes are unsupported when joined to an AKS cluster. Using these labels on AKS-managed nodes is not supported.
Reference
See design docs for cross resource group nodes in KEP 20180809-cross-resource-group-nodes.
7 - Multiple Services Sharing One IP Address
This feature is supported since v1.20.0.
Provider Azure supports sharing one IP address among multiple load balancer typed external or internal services. To share an IP address among multiple public services, a public IP resource is needed. This public IP could be created in advance, or you can let the cloud provider provision it when the first external service is created. Specifically, Azure would create a public IP resource automatically when an external service is discovered.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
Note that the annotations service.beta.kubernetes.io/azure-load-balancer-ipv4 and service.beta.kubernetes.io/azure-load-balancer-ipv6 and the field Service.Spec.LoadBalancerIP are not set; otherwise Azure would look for a pre-allocated public IP with that address. After obtaining the IP address of the service, you could create other services using this address.
apiVersion: v1
kind: Service
metadata:
  name: https
  namespace: default
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 1.2.3.4 # the IP address could be the same as it is of `nginx` service
spec:
  ports:
    - port: 443
      protocol: TCP
      targetPort: 443
  selector:
    app: https
  type: LoadBalancer
Note that if you specify the annotations service.beta.kubernetes.io/azure-load-balancer-ipv4, service.beta.kubernetes.io/azure-load-balancer-ipv6 or the field Service.Spec.LoadBalancerIP but there is no corresponding public IP pre-allocated, an error would be reported.
DNS
Even if multiple services can refer to one public IP, the DNS label cannot be re-used. The public IP would have the label kubernetes-dns-label-service: <svcName> to indicate which service is binding to the DNS label. In this case, if another service sharing this specific IP address tries to refer to the DNS label, an error would be reported. For managed public IPs, this label will be added automatically by the cloud provider. For static public IPs, this label should be added manually.
Using public IP name instead of IP address to share the public IP
This feature is supported since v1.24.0.
In addition to using the IP address annotation, you could also use the public IP name to share the public IP. The public IP name could be specified by the annotation service.beta.kubernetes.io/azure-pip-name. You can point to a system-created public IP or a static public IP.
apiVersion: v1
kind: Service
metadata:
  name: https
  namespace: default
  annotations:
    service.beta.kubernetes.io/azure-pip-name: pip-1
spec:
  ports:
    - port: 443
      protocol: TCP
      targetPort: 443
  selector:
    app: https
  type: LoadBalancer
Restrictions
The cloud provider azure manages the lifecycle of the system-created public IPs. By default, there are two kinds of system-managed tags: kubernetes-cluster-name and service. The controller manager would add the service name to the service tag if a service is trying to refer to the public IP, and remove the name from the service tag if the service is deleted. The public IP would be deleted if no service is left in the service tag. However, according to the docs of Azure tags, there are several restrictions:
Each resource, resource group, and subscription can have a maximum of 50 tag name/value pairs. If you need to apply more tags than the maximum allowed number, use a JSON string for the tag value. The JSON string can contain many values that are applied to a single tag name. A resource group or subscription can contain many resources that each have 50 tag name/value pairs.
The tag name is limited to 512 characters, and the tag value is limited to 256 characters. For storage accounts, the tag name is limited to 128 characters, and the tag value is limited to 256 characters.
Based on that, we suggest using static public IPs when there are more than 10 services sharing the IP address.
8 - Tagging resources managed by Cloud Provider Azure
This feature is supported since v1.20.0.
You can use tags to organize your Azure resources and management hierarchy. Cloud Provider Azure supports tagging managed resources through the configuration file or service annotations.
Specifically, the shared resources (load balancer, route table and security group) could be tagged by setting tags in azure.json:
{
"tags": "a=b,c=d"
}
The controller manager would parse this configuration and tag the shared resources once restarted.
The non-shared resource (public IP) could be tagged by setting tags in azure.json or the service annotation service.beta.kubernetes.io/azure-pip-tags. The format of the two is similar and the tags in the annotation would be considered first when there are conflicts between the configuration file and the annotation.
The annotation service.beta.kubernetes.io/azure-pip-tags only works for managed public IPs. For BYO public IPs, the cloud provider would not apply any tags to them.
When the configuration (file or annotation) is updated, the old tags would be updated if there are conflicts. For example, after updating {"tags": "a=b,c=d"} to {"tags": "a=c,e=f"}, the new tags would be a=c,c=d,e=f.
Integrating with system tags
This feature is supported since v1.21.0.
Normally the controller manager doesn't delete existing tags even if they are not included in the new version of the azure configuration file, because the controller manager doesn't know which tags should be deleted and which should not (e.g., tags managed by the cloud provider itself). The config systemTags in the cloud configuration file can be used to control which tags can be deleted. Here are some examples:
Tags | SystemTags | existing tags on resources | new tags on resources |
---|---|---|---|
“a=b,c=d” | "" | {} | {“a”: “b”, “c”: “d”} |
“a=b,c=d” | "" | {“a”: “x”, “c”: “y”} | {“a”: “b”, “c”: “d”} |
“a=b,c=d” | "" | {“e”: “f”} | {“a”: “b”, “c”: “d”, “e”: “f”} /* won’t delete e because the SystemTags is empty */ |
“c=d” | “a” | {“a”: “b”} | {“a”: “b”, “c”: “d”} /* won’t delete a because it’s in the SystemTags */ |
“c=d” | “x” | {“a”: “b”} | {“c”: “d”} /* will delete a because it’s not in Tags or SystemTags */ |
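As a sketch, the second-to-last row of the table corresponds to a configuration fragment like the following (assuming the camelCase key names used elsewhere in azure.json; other required fields omitted):
{
  "tags": "c=d",
  "systemTags": "a",
  ...
}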
Please consider migrating existing "tags" to "tagsMap"; support for the "tags" configuration would be removed in a future release.
Including special characters in tags
This feature is supported since v1.23.0.
Normally we don't support special characters such as = or , in key-value pairs. These characters would be treated as separators and would not be included in the key/value literals. To solve this problem, tagsMap is introduced since v1.23.0, in which JSON-style tags are accepted.
{
"tags": "a=b,c=d",
"tagsMap": {"e": "f", "g=h": "i,j"}
}
tags and tagsMap will be merged, and similarly, they are case-insensitive.
9 - Kubelet Credential Provider
Note: The Kubelet credential provider feature is still in alpha and shouldn't be used in production environments. Please use --azure-container-registry-config=/etc/kubernetes/cloud-config/azure.json if you need to pull images from ACR in production.
As part of Out-of-Tree Credential Providers, the kubelet's built-in image pulling from ACR (which could be enabled by setting kubelet --azure-container-registry-config=<config-file>) would be moved to the out-of-tree credential plugin acr-credential-provider. Please refer to the original KEP for details.
In order to switch the kubelet credential provider to out-of-tree, you'll have to:
- Remove --azure-container-registry-config from the kubelet configuration options.
- Add --feature-gates=KubeletCredentialProviders=true to the kubelet configuration options.
- Create the directory /var/lib/kubelet/credential-provider, download the 'acr-credential-provider' binary to this directory and add --image-credential-provider-bin-dir=/var/lib/kubelet/credential-provider to the kubelet configuration options.
- Create the following credential-provider-config.yaml file and add --image-credential-provider-config=/var/lib/kubelet/credential-provider-config.yaml to the kubelet configuration options.
# cat /var/lib/kubelet/credential-provider-config.yaml
kind: CredentialProviderConfig
apiVersion: kubelet.config.k8s.io/v1
providers:
  - name: acr-credential-provider
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    defaultCacheDuration: 10m
    matchImages:
      - "*.azurecr.io"
      - "*.azurecr.cn"
      - "*.azurecr.de"
      - "*.azurecr.us"
    args:
      - /etc/kubernetes/azure.json
10 - Node IPAM controller
This feature is supported since v1.21.0.
Background
The in-tree Node IPAM controller only supports a fixed node CIDR mask size for all nodes, while in multiple node pool (VMSS) scenarios, different mask sizes are required for different node pools. There is a GCE-specific cloud CIDR allocator for a similar scenario, but that is not exposed in cloud provider API and it is planned to be moved out-of-tree.
Hence this doc proposes an out-of-tree node IPAM controller. Specifically, it allocates different pod CIDRs based on different CIDR mask sizes for different node pools (VMSS or VMAS).
Usage
There are two kinds of CIDR allocators in the node IPAM controller: RangeAllocator and CloudAllocator.
The RangeAllocator is the default one, which allocates the pod CIDR for every node in the range of the cluster CIDR.
The CloudAllocator allocates the pod CIDR for every node in the range of the CIDR configured on the corresponding VMSS or VMAS.
The pod CIDR mask size of each node that belongs to a specific VMSS or VMAS is set by a specific tag, {"kubernetesNodeCIDRMaskIPV4": "24"} or {"kubernetesNodeCIDRMaskIPV6": "64"}. Note that the mask size tagged on the VMSS or VMAS must be within the cluster CIDR, or an error would be thrown.
When the above tag doesn't exist on the VMSS/VMAS, the default mask size (24 for IPv4 and 64 for IPv6) would be used.
To turn on the out-of-tree node IPAM controller:
- Disable the in-tree node IPAM controller by setting --allocate-node-cidrs=false in kube-controller-manager.
- Enable the out-of-tree counterpart by setting --allocate-node-cidrs=true in cloud-controller-manager.
- To use the RangeAllocator:
  - configure --cluster-cidr, --service-cluster-ip-range and --node-cidr-mask-size;
  - if you enable IPv6 dual-stack, set --node-cidr-mask-size-ipv4 and --node-cidr-mask-size-ipv6 instead of --node-cidr-mask-size. An error would be reported if --node-cidr-mask-size and --node-cidr-mask-size-ipv4 (or --node-cidr-mask-size-ipv6) are set to non-zero values at the same time. If only --node-cidr-mask-size is set, which is not recommended, --node-cidr-mask-size-ipv4 and --node-cidr-mask-size-ipv6 would be set to this value by default.
- To use the CloudAllocator (see the sketch after this list):
  - set --cidr-allocator-type=CloudAllocator;
  - configure the mask size of each VMSS/VMAS by tagging {"kubernetesNodeCIDRMaskIPV4": "custom-mask-size"} and {"kubernetesNodeCIDRMaskIPV6": "custom-mask-size"} if necessary.
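A minimal sketch of the corresponding cloud-controller-manager flags for the CloudAllocator case (the CIDR values are placeholders; all other flags are omitted):
cloud-controller-manager \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.244.0.0/16 \
  --service-cluster-ip-range=10.0.0.0/16 \
  --cidr-allocator-type=CloudAllocator \
  ...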
Configurations
kube-controller-manager
kube-controller-manager would be configured with option --allocate-node-cidrs=false to disable the in-tree node IPAM controller.
cloud-controller-manager
The following configurations from cloud-controller-manager would be used as default options:
name | type | default | description
---|---|---|---
allocate-node-cidrs | bool | true | Should CIDRs for Pods be allocated and set on the cloud provider.
cluster-cidr | string | "10.244.0.0/16" | CIDR range for Pods in the cluster. Requires --allocate-node-cidrs to be true. It is ignored when dual-stack is enabled.
service-cluster-ip-range | string | "" | CIDR range for Services in the cluster; this range is excluded from the allocatable range. Requires --allocate-node-cidrs to be true.
node-cidr-mask-size | int | 24 | Mask size for node CIDRs in the cluster. Default is 24 for IPv4 and 64 for IPv6.
node-cidr-mask-size-ipv4 | int | 24 | Mask size for IPv4 node CIDRs in a dual-stack cluster. Default is 24.
node-cidr-mask-size-ipv6 | int | 64 | Mask size for IPv6 node CIDRs in a dual-stack cluster. Default is 64.
cidr-allocator-type | string | "RangeAllocator" | The CIDR allocator type: "RangeAllocator" or "CloudAllocator".
Limitations
- We plan to integrate the out-of-tree node IPAM controller with cluster-api-provider-azure to provide a better experience. Until then, manual configuration is required.
- Changing the custom mask size value on the tag after it has been set is not supported.
- For now, there is no e2e test covering this feature, so there can be potential bugs. Enabling it in production environments is not recommended.
11 - Azure Private Link Service Integration
Azure Private Link Service (PLS) is an infrastructure component that allows users to privately connect, via a Private Endpoint (PE) in a VNET in Azure, to a Frontend IP Configuration associated with an Azure Load Balancer (ALB). With Private Link, users acting as service providers can securely provide their services to consumers, who can connect from within Azure or from on-premises without data exfiltration risks.
Before Private Link Service integration, users who wanted private connectivity from on-premises or other VNETs to their services in the Azure Kubernetes cluster were required to create a Private Link Service (PLS) to reference the Azure LoadBalancer. The user would then create a Private Endpoint (PE) to connect to the PLS to enable private connectivity. With this feature, a managed PLS to the LB would be created automatically, and the user would only be required to create PE connections to it for private connectivity.
Note: When the PLS has TCP proxy protocol V2 enabled (service.beta.kubernetes.io/azure-pls-proxy-protocol: true) and the service's externalTrafficPolicy is set to Local, the LB health probe fails. This is because, when the PLS has proxy protocol enabled, the corresponding LB HTTP health probe uses proxy protocol as well; with externalTrafficPolicy set to Local, the health probe depends on kube-proxy's health check service, which does not accept proxy protocol, so all health probes fail. PR #3931 allows users to customize the health probe when externalTrafficPolicy is set to Local and thus provides a workaround. It will be released soon.
PrivateLinkService annotations
Below is a list of annotations supported for Kubernetes services with Azure PLS created:
Annotation | Value | Description | Required | Default |
---|---|---|---|---|
service.beta.kubernetes.io/azure-pls-create | "true" | Boolean indicating whether a PLS needs to be created. | Required | |
service.beta.kubernetes.io/azure-pls-name | <PLS name> | String specifying the name of the PLS resource to be created. | Optional | "pls-<LB frontend config name>" |
service.beta.kubernetes.io/azure-pls-resource-group | Resource Group name | String specifying the name of the Resource Group where the PLS resource will be created | Optional | MC_ resource |
service.beta.kubernetes.io/azure-pls-ip-configuration-subnet | <Subnet name> | String indicating the subnet to which the PLS will be deployed. This subnet must exist in the same VNET as the backend pool. PLS NAT IPs are allocated within this subnet. | Optional | If service.beta.kubernetes.io/azure-load-balancer-internal-subnet is set, this ILB subnet is used. Otherwise, the default subnet from the config file is used. |
service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address-count | [1-8] | Total number of private NAT IPs to allocate. | Optional | 1 |
service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address | "10.0.0.7 ... 10.0.0.10" | A space separated list of static IPv4 IPs to be allocated. (IPv6 is not supported right now.) Total number of IPs should not be greater than the ip count specified in service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address-count . If there are fewer IPs specified, the rest are dynamically allocated. The first IP in the list is set as Primary . | Optional | All IPs are dynamically allocated. |
service.beta.kubernetes.io/azure-pls-fqdns | "fqdn1 fqdn2" | A space separated list of fqdns associated with the PLS. | Optional | [] |
service.beta.kubernetes.io/azure-pls-proxy-protocol | "true" or "false" | Boolean indicating whether the TCP PROXY protocol should be enabled on the PLS to pass through connection information, including the link ID and source IP address. Note that the backend service MUST support the PROXY protocol or the connections will fail. | Optional | false |
service.beta.kubernetes.io/azure-pls-visibility | "sub1 sub2 sub3 … subN" or "*" | A space separated list of Azure subscription ids for which the private link service is visible. Use "*" to expose the PLS to all subs (Least restrictive). | Optional | Empty list [] indicating role-based access control only: This private link service will only be available to individuals with role-based access control permissions within your directory. (Most restrictive) |
service.beta.kubernetes.io/azure-pls-auto-approval | "sub1 sub2 sub3 … subN" | A space separated list of Azure subscription ids. This allows PE connection requests from the subscriptions listed to the PLS to be automatically approved. | Optional | [] |
For more details about each configuration, please refer to Azure Private Link Service Documentation.
Design Details
Creating managed PrivateLinkService
When a LoadBalancer typed service is created without the annotations service.beta.kubernetes.io/azure-load-balancer-ipv4, service.beta.kubernetes.io/azure-load-balancer-ipv6 or the field Service.Spec.LoadBalancerIP set, an LB frontend IP configuration is created with a dynamically generated IP. If the service has the annotation service.beta.kubernetes.io/azure-load-balancer-ipv4 or service.beta.kubernetes.io/azure-load-balancer-ipv6 set, an existing LB frontend IP configuration may be reused if one exists; otherwise a static configuration is created with the specified IP. When a service is created with the annotation service.beta.kubernetes.io/azure-pls-create set to true, or is updated later with the annotation added, a PLS resource attached to the LB frontend is created in the default resource group or in the resource group the user set in the config file with the key PrivateLinkServiceResourceGroup.
The Kubernetes service creating the PLS is assigned as the owner of the resource. The Azure cloud provider tags the PLS with the cluster name and the owner service name, kubernetes-owner-service: <namespace>/<service name>. Only the owner service can later update the properties of the PLS resource.
If there’s a managed PLS already created for the LB frontend, the same PLS is reused automatically since each LB frontend can be referenced by only one PLS. If the LB frontend is attached to a user defined PLS, service creation should fail with proper error logged.
For now, Azure cloud provider does not manage any Private Link Endpoint resources. Once a PLS is created, users can create their own PEs to connect to the PLS.
Deleting managed PrivateLinkService
Once a PLS is created, it shares the lifetime of the LB frontend IP configuration and is deleted only when its corresponding LB frontend is deleted. As a result, a PLS may still exist even after its owner service is deleted. This is because multiple Kubernetes services can share the same LB frontend IP configuration and thus share the PLS automatically; more details are discussed in the next section.
If there are active PE connections to the PLS, all connections are removed and the PEs become obsolete. Users are responsible for cleaning up the PE resources.
Sharing managed PrivateLinkService
Multiple Kubernetes services can share the same LB frontend by specifying the same annotation service.beta.kubernetes.io/azure-load-balancer-ipv4 or service.beta.kubernetes.io/azure-load-balancer-ipv6, or the same field Service.Spec.LoadBalancerIP (for more details, please refer to Multiple Services Sharing One IP Address). Once a PLS is attached to the LB frontend, these services automatically share the PLS. Users can access these services via the same PE but different ports.
The Azure cloud provider tags the service creating the PLS as the owner (kubernetes-owner-service: <namespace>/<service name>) and only allows that service to update the configuration of the PLS. If the owner service is deleted, or if the user wants some other service to take control, the user can change the tag value to a new service in the <namespace>/<service name> pattern.
PLS is only automatically deleted when the LB frontend IP configuration is deleted. One can delete a service while preserving the PLS by creating a temporary service referring to the same LB frontend.
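As a hypothetical sketch of sharing, the second service below reuses the frontend IP of an existing internal owner service (the IP 10.240.0.9, the service name, and the labels are placeholders) and therefore shares its managed PLS while exposing a different port:
apiVersion: v1
kind: Service
metadata:
  name: mySecondService
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 10.240.0.9 # same frontend IP as the owner service
spec:
  type: LoadBalancer
  selector:
    app: myOtherApp
  ports:
    - port: 8080
      targetPort: 8080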
Managed PrivateLinkService Creation example
Below we provide an example for creating a Kubernetes service object with Azure ILB and PLS created:
apiVersion: v1
kind: Service
metadata:
  name: myService
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true" # Use an internal LB with PLS
    service.beta.kubernetes.io/azure-pls-create: "true"
    service.beta.kubernetes.io/azure-pls-name: myServicePLS
    service.beta.kubernetes.io/azure-pls-ip-configuration-subnet: pls-subnet
    service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address-count: "1"
    service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address: 10.240.0.9 # Must be available in pls-subnet
    service.beta.kubernetes.io/azure-pls-fqdns: "fqdn1 fqdn2"
    service.beta.kubernetes.io/azure-pls-proxy-protocol: "false"
    service.beta.kubernetes.io/azure-pls-visibility: "*"
    service.beta.kubernetes.io/azure-pls-auto-approval: "subId1"
spec:
  type: LoadBalancer
  selector:
    app: myApp
  ports:
    - name: myAppPort
      protocol: TCP
      port: 80
      targetPort: 80
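After applying a manifest like the one above, the managed PLS can be inspected with the Azure CLI. The file name and resource group below are placeholders; the PLS is created in the cluster's node resource group unless PrivateLinkServiceResourceGroup is set in the cloud config:
kubectl apply -f myservice.yaml
az network private-link-service list --resource-group <node-resource-group> --output table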
Restrictions
- PLS does not support basic Load Balancer or IP-based Load Balancer.
- PLS connectivity is broken with an Azure external Standard Load Balancer when floating IP is enabled (the default). To use a managed private link service, users can either create an internal service by setting the annotation service.beta.kubernetes.io/azure-load-balancer-internal to true, or disable floating IP by setting the annotation service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip to true (more details here).
- Due to the limitation of kubernetes#95555, when the service's externalTrafficPolicy is set to Local, the PLS needs to use a different subnet from the Pod subnet. If the same subnet is required, the service should use externalTrafficPolicy Cluster.
- PLS only works with IPv4 and cannot be deployed to an SLB with IPv6 frontend ipConfigurations. In dual-stack clusters, users cannot create a service with PLS if there is an existing IPv6 service deployed on the same load balancer.
- For other limitations, please check the Azure Private Link Service Doc.
12 - Multiple Standard LoadBalancers
Backgrounds
There will be only a single Standard Load Balancer and a single Internal Load Balancer (if required) per cluster by default. This imposes a number of limits on clusters based on Azure Load Balancer limits, the largest being based on the 300 rules per NIC limitation. Any IP:port combination in a frontEndIPConfiguration that maps to a member of a backend pool counts as one of the 300 rules for that node. This limits any AKS cluster to a maximum of 300 LoadBalancer service IP:port combinations (so a maximum of 300 services with one port, or fewer if services have multiple ports). Load balancers are also limited to no more than 8 private link services targeting a given load balancer.
Configuration
Introduce a new cloud configuration option multipleStandardLoadBalancerConfigurations. Example:
{
  ...
  "loadBalancerBackendPoolConfigurationType": "nodeIP",
  "multipleStandardLoadBalancerConfigurations": [
    {
      "name": "<clusterName>",
      "autoPlaceServices": true
    },
    {
      "name": "lb-2",
      "autoPlaceServices": false,
      "serviceNamespaceSelector": {
        "matchExpressions": [
          {
            "key": "key1",
            "operator": "In",
            "values": [
              "val1"
            ]
          }
        ]
      },
      "nodeSelector": {
        "matchLabels": {
          "key1": "val1"
        }
      },
      "primaryVMSet": "vmss-1"
    }
  ]
}
To enable multiple standard load balancers, set loadBalancerSKU to Standard, set loadBalancerBackendPoolConfigurationType to nodeIP, and define at least one multipleStandardLoadBalancerConfiguration. If one or more of these conditions are not met, the cloud provider will either throw an error or fall back to a single standard load balancer.
default lbs
The default lb <clustername> is required in loadBalancerProfiles. The cloud provider will check whether there is an lb config named <clustername>; if not, an error will be reported in the service event.
internal lbs
The behavior of internal lbs remains the same as before. An internal lb shares the same config as its public counterpart and will be created automatically if needed, with the name <external-lb-name>-internal. Internal lbs are not required in loadBalancerProfiles; all lb names in it are considered public ones.
Service selection
In the cases of basic lb and the previous revision of the multiple slb design, we use the service annotation service.beta.kubernetes.io/azure-load-balancer-mode to decide which lb the service should be attached to. It can be set to an agent pool name, and the service will be attached to the lb belonging to that agent pool. If set to __auto__, we pick the lb with the fewest lb rules for the service. This selection logic will be replaced by the following:
- New service annotation service.beta.kubernetes.io/azure-load-balancer-configurations: <lb-config-name1>,<lb-config-name2> will replace the old annotation service.beta.kubernetes.io/azure-load-balancer-mode, which remains useful only for basic SKU load balancers. If none of the selected lbs is eligible, an error will be reported in the service events. If multiple eligible lbs are provided, the one with the lowest number of rules is chosen (a sketch of the annotation follows this list).
- AllowServicePlacement: this load balancer can have services placed on it. Defaults to true; can be set to false to drain and eventually remove a load balancer. This will not impact existing services on the load balancer.
- ServiceNamespaceSelector: only services created in namespaces that match the selector are allowed to select that load balancer, either manually or automatically. If not supplied, services created in any namespace can be placed on that load balancer. If the value is changed, all services on this slb will be moved onto another one with the public/internal IP addresses unchanged. If the services have no place to go, an error is thrown in the service event.
- ServiceLabelSelector: similar to ServiceNamespaceSelector. Services must match this selector to be placed on this load balancer.
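A minimal sketch of a service pinned to specific load balancer configurations; the names lb-2 and lb-3 and the selector/ports are placeholders, and the configuration names must match entries in multipleStandardLoadBalancerConfigurations:
apiVersion: v1
kind: Service
metadata:
  name: myApp
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-configurations: "lb-2,lb-3"
spec:
  type: LoadBalancer
  selector:
    app: myApp
  ports:
    - port: 80
      targetPort: 8080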
Node selection
When the cluster is initially migrated to or created with multiple standard load balancers, each node will be evaluated to see what load balancer it should be placed into.
Valid placement targets will be determined as follows (rules match from top to bottom, first match wins):
- If this node is in an agent pool that is selected as a primary agent pool for a load balancer, that load balancer will be the only potential placement target.
- If the nodeSelectors on any load balancer configurations match this node, then all load balancer configurations that match it will be potential placement targets.
- If no nodeSelectors on any load balancer configurations match this node, then all load balancers that do not have any nodeSelectors will be potential placement targets.
After the list of potential placement targets has been calculated, the node will be placed into the kubernetes backend pool of the load balancer with the fewest number of nodes already assigned.
Service with ExternalTrafficPolicy=Local
Each local service owns a backend pool named after the service. The backend pool is created in the service reconciliation loop when the service is created or is updated from externalTrafficPolicy Cluster. It is deleted in the service reconciliation loop when: (1) the service is deleted; (2) the service is changed to externalTrafficPolicy Cluster; (3) the cluster is migrated from multi-slb to single-slb; or (4) the service is moved to another load balancer.
Besides the service reconciliation loop, an endpointslice informer is also responsible for updating the dedicated backend pools. It watches all endpointslices of local services, monitors update events, and updates the corresponding backend pool when service endpoints change. Since local services may churn quickly, the informer sends backend pool update operations to a buffer queue. The queue merges operations targeting the same backend pool and flushes them every 30s by default. The update interval can be adjusted by changing loadBalancerBackendPoolUpdateIntervalInSeconds in the cloud configuration.
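For instance, a cloud configuration fragment like the hypothetical one below would flush the queue every 15 seconds instead of the default 30 (the value is illustrative):
{
  ...
  "loadBalancerBackendPoolUpdateIntervalInSeconds": 15
}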
The local service dedicated backend pool and the <clusterName> backend pool cannot be reconciled in one loop. Hence, an operation triggered by the update of a local service or its endpoints will not affect the <clusterName> backend pool.