Kube Controller Manager Source Code Analysis
In a Kubernetes cluster, the Controller Manager plays the role of central manager: it is responsible for creating and managing resources such as Deployments, StatefulSets, and ReplicaSets, which makes it one of the core Kubernetes components. Below we walk through the Kube Controller Manager source code at a high level.
func NewControllerManagerCommand() *cobra.Command {
    s, err := options.NewKubeControllerManagerOptions()
    if err != nil {
        klog.Fatalf("unable to initialize command options: %v", err)
    }

    cmd := &cobra.Command{
        Use: "kube-controller-manager",
        Long: `The Kubernetes controller manager is a daemon that embeds
the core control loops shipped with Kubernetes. In applications of robotics and
automation, a control loop is a non-terminating loop that regulates the state of
the system. In Kubernetes, a controller is a control loop that watches the shared
state of the cluster through the apiserver and makes changes attempting to move the
current state towards the desired state. Examples of controllers that ship with
Kubernetes today are the replication controller, endpoints controller, namespace
controller, and serviceaccounts controller.`,
        Run: func(cmd *cobra.Command, args []string) {
            verflag.PrintAndExitIfRequested()
            utilflag.PrintFlags(cmd.Flags())

            c, err := s.Config(KnownControllers(), ControllersDisabledByDefault.List())
            if err != nil {
                fmt.Fprintf(os.Stderr, "%v\n", err)
                os.Exit(1)
            }

            if err := Run(c.Complete(), wait.NeverStop); err != nil {
                fmt.Fprintf(os.Stderr, "%v\n", err)
                os.Exit(1)
            }
        },
    }
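The entry point above builds a cobra command. For orientation, here is a minimal, self-contained sketch of how such a command is typically executed from a main function. This is only the generic cobra pattern, not the actual kube-controller-manager main(); the newControllerManagerCommand stub below stands in for the real NewControllerManagerCommand.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

// newControllerManagerCommand is a stub standing in for the real
// NewControllerManagerCommand shown above.
func newControllerManagerCommand() *cobra.Command {
	return &cobra.Command{
		Use: "kube-controller-manager",
		Run: func(cmd *cobra.Command, args []string) {
			// Parse options, build the completed config, then call Run(...).
		},
	}
}

func main() {
	command := newControllerManagerCommand()
	// Execute parses the flags and then dispatches to the Run closure above.
	if err := command.Execute(); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
}
```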
The Controller Manager is itself a command-line program started with a series of flags. We won't go through each flag here; if you are interested, check the documentation or grep through the flags_opinion.go file. Let's go straight to the Run function.
Run Function Startup Flow
The Kube Controller Manager can run as a single instance or as multiple instances. When multiple Controller Managers are started for HA, leader election is required to guarantee that only one master instance is active at any given time. Let's look at the startup flow of the Run function; unimportant helper functions are skipped so we can focus on the key steps.
func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
    run := func(ctx context.Context) {
        rootClientBuilder := controller.SimpleControllerClientBuilder{
            ClientConfig: c.Kubeconfig,
        }
        controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())
        if err != nil {
            klog.Fatalf("error building controller context: %v", err)
        }

        if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil {
            klog.Fatalf("error starting controllers: %v", err)
        }

        controllerContext.InformerFactory.Start(controllerContext.Stop)
        close(controllerContext.InformersStarted)

        select {}
    }

    id, err := os.Hostname()
    if err != nil {
        return err
    }

    // add a uniquifier so that two processes on the same host don't accidentally both become active
    id = id + "_" + string(uuid.NewUUID())

    rl, err := resourcelock.New(c.ComponentConfig.Generic.LeaderElection.ResourceLock,
        "kube-system",
        "kube-controller-manager",
        c.LeaderElectionClient.CoreV1(),
        resourcelock.ResourceLockConfig{
            Identity:      id,
            EventRecorder: c.EventRecorder,
        })
    if err != nil {
        klog.Fatalf("error creating lock: %v", err)
    }

    leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{
        Lock:          rl,
        LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
        RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
        RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: run,
            OnStoppedLeading: func() {
                klog.Fatalf("leaderelection lost")
            },
        },
        WatchDog: electionChecker,
        Name:     "kube-controller-manager",
    })
    panic("unreachable")
}
The basic flow here is as follows:
- First, a run function is defined; it is responsible for building the individual controllers and ultimately running their control loops
- Leader election is then performed using the election helpers provided by client-go
- The instance that wins the election invokes the OnStartedLeading callback, i.e. the run function above; instances that do not win simply block and keep waiting
Leader Election Flow
The client-go leader election utility works through the kubeClient: each candidate tries to create a lock resource as a ConfigMap or an Endpoints object, and whichever goroutine successfully creates the resource acquires the lock; all of the lock information is stored in that ConfigMap or Endpoints object. These two resource types were chosen mainly because they were expected to attract few watchers. The Kube Controller Manager currently still uses Endpoints, but it will gradually migrate to ConfigMaps, because Endpoints are in fact frequently watched by kube-proxy, ingress controllers, and others. Let's take a look at the Endpoints object inside a cluster:
[root@iZ8vb5qgxqbxakfo1cuvpaZ ~]# kubectl get ep -n kube-system kube-controller-manager -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"iZ8vbccmhgkyfdi8aii1hnZ_d880fea6-1322-11e9-913f-00163e033b49","leaseDurationSeconds":15,"acquireTime":"2019-01-08T08:53:49Z","renewTime":"2019-01-22T11:16:59Z","leaderTransitions":1}'
  creationTimestamp: 2019-01-08T08:52:56Z
  name: kube-controller-manager
  namespace: kube-system
  resourceVersion: "2978183"
  selfLink: /api/v1/namespaces/kube-system/endpoints/kube-controller-manager
  uid: cade1b65-1322-11e9-9931-00163e033b49
As you can see, the annotation records the identity of the current leader, the time leadership was acquired, the lease duration, and the most recent renewal time. Ultimately, the election is still backed by etcd. The main leader election code is as follows:
func New(lockType string, ns string, name string, client corev1.CoreV1Interface, rlc ResourceLockConfig) (Interface, error) {
    switch lockType {
    case EndpointsResourceLock:
        return &EndpointsLock{
            EndpointsMeta: metav1.ObjectMeta{
                Namespace: ns,
                Name:      name,
            },
            Client:     client,
            LockConfig: rlc,
        }, nil
    case ConfigMapsResourceLock:
        return &ConfigMapLock{
            ConfigMapMeta: metav1.ObjectMeta{
                Namespace: ns,
                Name:      name,
            },
            Client:     client,
            LockConfig: rlc,
        }, nil
    default:
        return nil, fmt.Errorf("Invalid lock-type %s", lockType)
    }
}
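To make the lock record more concrete, here is a minimal sketch that reads the leader-election annotation off the kube-controller-manager Endpoints object and decodes it. This is not part of the kube-controller-manager source: the kubeconfig path is an assumption, the leaderRecord struct simply mirrors the annotation JSON shown earlier, and the client-go calls use the pre-context signatures contemporary with this walkthrough.

```go
package main

import (
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// leaderRecord mirrors the JSON stored in the
// control-plane.alpha.kubernetes.io/leader annotation shown above.
type leaderRecord struct {
	HolderIdentity       string `json:"holderIdentity"`
	LeaseDurationSeconds int    `json:"leaseDurationSeconds"`
	AcquireTime          string `json:"acquireTime"`
	RenewTime            string `json:"renewTime"`
	LeaderTransitions    int    `json:"leaderTransitions"`
}

func main() {
	// Assumed kubeconfig location; adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Fetch the Endpoints object that serves as the leader-election lock.
	ep, err := client.CoreV1().Endpoints("kube-system").Get("kube-controller-manager", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	var rec leaderRecord
	raw := ep.Annotations["control-plane.alpha.kubernetes.io/leader"]
	if err := json.Unmarshal([]byte(raw), &rec); err != nil {
		panic(err)
	}
	fmt.Printf("current leader %s, last renewed at %s\n", rec.HolderIdentity, rec.RenewTime)
}
```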
StartControllers
Once leader election is done, the controllers need to actually be started. Let's look at the code that starts them:
func StartControllers(ctx ControllerContext, startSATokenController InitFunc, controllers map[string]InitFunc, unsecuredMux *mux.PathRecorderMux) error {
    // Always start the SA token controller first using a full-power client, since it needs to mint tokens for the rest
    // If this fails, just return here and fail since other controllers won't be able to get credentials.
    if _, _, err := startSATokenController(ctx); err != nil {
        return err
    }

    // Initialize the cloud provider with a reference to the clientBuilder only after token controller
    // has started in case the cloud provider uses the client builder.
    if ctx.Cloud != nil {
        ctx.Cloud.Initialize(ctx.ClientBuilder, ctx.Stop)
    }

    for controllerName, initFn := range controllers {
        if !ctx.IsControllerEnabled(controllerName) {
            klog.Warningf("%q is disabled", controllerName)
            continue
        }

        time.Sleep(wait.Jitter(ctx.ComponentConfig.Generic.ControllerStartInterval.Duration, ControllerStartJitter))

        klog.V(1).Infof("Starting %q", controllerName)
        debugHandler, started, err := initFn(ctx)
        if err != nil {
            klog.Errorf("Error starting %q", controllerName)
            return err
        }
        if !started {
            klog.Warningf("Skipping %q", controllerName)
            continue
        }
        if debugHandler != nil && unsecuredMux != nil {
            basePath := "/debug/controllers/" + controllerName
            unsecuredMux.UnlistedHandle(basePath, http.StripPrefix(basePath, debugHandler))
            unsecuredMux.UnlistedHandlePrefix(basePath+"/", http.StripPrefix(basePath, debugHandler))
        }
        klog.Infof("Started %q", controllerName)
    }

    return nil
}
- Iterate over the full list of controllers
- Call each controller's init function (see the sketch after this list)
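The init-function pattern is easy to mimic. Below is a self-contained sketch of registering and starting a hypothetical controller through the same pattern; ControllerContext here is a simplified stand-in for the real struct in cmd/kube-controller-manager/app, and the mycontroller name is made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// ControllerContext is a simplified stand-in for the real ControllerContext.
type ControllerContext struct {
	Stop <-chan struct{}
}

// InitFunc mirrors the shape used by StartControllers above: it returns an
// optional debug handler, whether the controller started, and an error.
type InitFunc func(ctx ControllerContext) (http.Handler, bool, error)

// startMyController is a hypothetical controller init function.
func startMyController(ctx ControllerContext) (http.Handler, bool, error) {
	go func() {
		<-ctx.Stop // a real controller would run its informers and work loop here
	}()
	return nil, true, nil
}

func main() {
	controllers := map[string]InitFunc{
		"mycontroller": startMyController,
	}
	ctx := ControllerContext{Stop: make(chan struct{})}
	for name, initFn := range controllers {
		_, started, err := initFn(ctx)
		if err != nil || !started {
			fmt.Printf("skipping %q\n", name)
			continue
		}
		fmt.Printf("started %q\n", name)
	}
}
```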
So how many controllers are there in total?
func NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc {
    controllers := map[string]InitFunc{}
    controllers["endpoint"] = startEndpointController
    controllers["replicationcontroller"] = startReplicationController
    controllers["podgc"] = startPodGCController
    controllers["resourcequota"] = startResourceQuotaController
    controllers["namespace"] = startNamespaceController
    controllers["serviceaccount"] = startServiceAccountController
    controllers["garbagecollector"] = startGarbageCollectorController
    controllers["daemonset"] = startDaemonSetController
    controllers["job"] = startJobController
    controllers["deployment"] = startDeploymentController
    controllers["replicaset"] = startReplicaSetController
    controllers["horizontalpodautoscaling"] = startHPAController
    controllers["disruption"] = startDisruptionController
    controllers["statefulset"] = startStatefulSetController
    controllers["cronjob"] = startCronJobController
    controllers["csrsigning"] = startCSRSigningController
    controllers["csrapproving"] = startCSRApprovingController
    controllers["csrcleaner"] = startCSRCleanerController
    controllers["ttl"] = startTTLController
    controllers["bootstrapsigner"] = startBootstrapSignerController
    controllers["tokencleaner"] = startTokenCleanerController
    controllers["nodeipam"] = startNodeIpamController
    controllers["nodelifecycle"] = startNodeLifecycleController
    if loopMode == IncludeCloudLoops {
        controllers["service"] = startServiceController
        controllers["route"] = startRouteController
        controllers["cloud-node-lifecycle"] = startCloudNodeLifecycleController
        // TODO: volume controller into the IncludeCloudLoops only set.
    }
    controllers["persistentvolume-binder"] = startPersistentVolumeBinderController
    controllers["attachdetach"] = startAttachDetachController
    controllers["persistentvolume-expander"] = startVolumeExpandController
    controllers["clusterrole-aggregation"] = startClusterRoleAggregrationController
    controllers["pvc-protection"] = startPVCProtectionController
    controllers["pv-protection"] = startPVProtectionController
    controllers["ttl-after-finished"] = startTTLAfterFinishedController
    controllers["root-ca-cert-publisher"] = startRootCACertPublisher

    return controllers
}
The answer is right here: the code above lists all of the controllers in the current kube-controller-manager. There are familiar faces such as Deployment and StatefulSet, along with some less familiar ones. Below we take Deployment as an example and see what its controller actually does.
Deployment Controller
Let's first take a look at the Deployment Controller's start function:
func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
    if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
        return nil, false, nil
    }
    dc, err := deployment.NewDeploymentController(
        ctx.InformerFactory.Apps().V1().Deployments(),
        ctx.InformerFactory.Apps().V1().ReplicaSets(),
        ctx.InformerFactory.Core().V1().Pods(),
        ctx.ClientBuilder.ClientOrDie("deployment-controller"),
    )
    if err != nil {
        return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
    }
    go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
    return nil, true, nil
}
If you have read the previous article on client-go Informers, this will look familiar: the InformerFactory is used again, and several informers are pulled from it. In fact, the Kube Controller Manager uses Informers heavily; a controller relies on Informers to watch all of its resources and be notified of changes. As you can see, the Deployment Controller mainly cares about three resources: Deployment, ReplicaSet, and Pod.
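As a refresher on the informer side, here is a minimal sketch of building the same three informers through a SharedInformerFactory with client-go. It is not controller-manager code: the kubeconfig path and resync period are assumptions, and in the real controller the client and factory come from the ControllerContext.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig location; the real controller uses its own ClientBuilder.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	stopCh := make(chan struct{})
	defer close(stopCh)

	// One factory; the informers share the same watch and cache machinery.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	dInformer := factory.Apps().V1().Deployments()
	rsInformer := factory.Apps().V1().ReplicaSets()
	podInformer := factory.Core().V1().Pods()

	// Touching Informer() registers each informer with the factory.
	dInformer.Informer()
	rsInformer.Informer()
	podInformer.Informer()

	// Start every registered informer and wait for the local caches to fill.
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// Reads then go through the Lister, i.e. the local cache, not the API server.
	deployments, err := dInformer.Lister().Deployments("default").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	for _, d := range deployments {
		fmt.Println("cached deployment:", d.Name)
	}
}
```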
Deployment Controller Resource Initialization
Next, let's look at what the Deployment Controller needs when it is initialized:
// NewDeploymentController creates a new DeploymentController.
func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
    eventBroadcaster := record.NewBroadcaster()
    eventBroadcaster.StartLogging(klog.Infof)
    eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")})

    if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
        if err := metrics.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil {
            return nil, err
        }
    }
    dc := &DeploymentController{
        client:        client,
        eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
        queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
    }
    dc.rsControl = controller.RealRSControl{
        KubeClient: client,
        Recorder:   dc.eventRecorder,
    }

    dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    dc.addDeployment,
        UpdateFunc: dc.updateDeployment,
        // This will enter the sync loop and no-op, because the deployment has been deleted from the store.
        DeleteFunc: dc.deleteDeployment,
    })
    rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    dc.addReplicaSet,
        UpdateFunc: dc.updateReplicaSet,
        DeleteFunc: dc.deleteReplicaSet,
    })
    podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        DeleteFunc: dc.deletePod,
    })

    dc.syncHandler = dc.syncDeployment
    dc.enqueueDeployment = dc.enqueue

    dc.dLister = dInformer.Lister()
    dc.rsLister = rsInformer.Lister()
    dc.podLister = podInformer.Lister()
    dc.dListerSynced = dInformer.Informer().HasSynced
    dc.rsListerSynced = rsInformer.Informer().HasSynced
    dc.podListerSynced = podInformer.Informer().HasSynced
    return dc, nil
}
This code should look familiar as well. If you have worked with client-go Informers, you can see it follows exactly the same pattern: event handlers are registered so that the corresponding Add, Update, and Delete functions are triggered for each watched resource, and all reads go through Listers, so there is no need to actually query the API server.
Let's start with the handlers for Deployment objects:
func (dc *DeploymentController) addDeployment(obj interface{}) {
    d := obj.(*apps.Deployment)
    klog.V(4).Infof("Adding deployment %s", d.Name)
    dc.enqueueDeployment(d)
}

func (dc *DeploymentController) updateDeployment(old, cur interface{}) {
    oldD := old.(*apps.Deployment)
    curD := cur.(*apps.Deployment)
    klog.V(4).Infof("Updating deployment %s", oldD.Name)
    dc.enqueueDeployment(curD)
}

func (dc *DeploymentController) deleteDeployment(obj interface{}) {
    d, ok := obj.(*apps.Deployment)
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
            return
        }
        d, ok = tombstone.Obj.(*apps.Deployment)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a Deployment %#v", obj))
            return
        }
    }
    klog.V(4).Infof("Deleting deployment %s", d.Name)
    dc.enqueueDeployment(d)
}
Whether the event is an Add, Update, or Delete, the handling is identical: the Deployment is simply pushed onto the worker queue provided by client-go. Now let's look at the ReplicaSet handlers:
func (dc *DeploymentController) addReplicaSet(obj interface{}) {
    rs := obj.(*apps.ReplicaSet)

    if rs.DeletionTimestamp != nil {
        // On a restart of the controller manager, it's possible for an object to
        // show up in a state that is already pending deletion.
        dc.deleteReplicaSet(rs)
        return
    }

    // If it has a ControllerRef, that's all that matters.
    if controllerRef := metav1.GetControllerOf(rs); controllerRef != nil {
        d := dc.resolveControllerRef(rs.Namespace, controllerRef)
        if d == nil {
            return
        }
        klog.V(4).Infof("ReplicaSet %s added.", rs.Name)
        dc.enqueueDeployment(d)
        return
    }

    // Otherwise, it's an orphan. Get a list of all matching Deployments and sync
    // them to see if anyone wants to adopt it.
    ds := dc.getDeploymentsForReplicaSet(rs)
    if len(ds) == 0 {
        return
    }
    klog.V(4).Infof("Orphan ReplicaSet %s added.", rs.Name)
    for _, d := range ds {
        dc.enqueueDeployment(d)
    }
}

func (dc *DeploymentController) updateReplicaSet(old, cur interface{}) {
    curRS := cur.(*apps.ReplicaSet)
    oldRS := old.(*apps.ReplicaSet)
    if curRS.ResourceVersion == oldRS.ResourceVersion {
        // Periodic resync will send update events for all known replica sets.
        // Two different versions of the same replica set will always have different RVs.
        return
    }

    curControllerRef := metav1.GetControllerOf(curRS)
    oldControllerRef := metav1.GetControllerOf(oldRS)
    controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
    if controllerRefChanged && oldControllerRef != nil {
        // The ControllerRef was changed. Sync the old controller, if any.
        if d := dc.resolveControllerRef(oldRS.Namespace, oldControllerRef); d != nil {
            dc.enqueueDeployment(d)
        }
    }

    // If it has a ControllerRef, that's all that matters.
    if curControllerRef != nil {
        d := dc.resolveControllerRef(curRS.Namespace, curControllerRef)
        if d == nil {
            return
        }
        klog.V(4).Infof("ReplicaSet %s updated.", curRS.Name)
        dc.enqueueDeployment(d)
        return
    }

    // Otherwise, it's an orphan. If anything changed, sync matching controllers
    // to see if anyone wants to adopt it now.
    labelChanged := !reflect.DeepEqual(curRS.Labels, oldRS.Labels)
    if labelChanged || controllerRefChanged {
        ds := dc.getDeploymentsForReplicaSet(curRS)
        if len(ds) == 0 {
            return
        }
        klog.V(4).Infof("Orphan ReplicaSet %s updated.", curRS.Name)
        for _, d := range ds {
            dc.enqueueDeployment(d)
        }
    }
}
To summarize Add and Update:
- Find the owning Deployment via the ReplicaSet's ownerReferences (see the sketch after this list)
- Check whether the ReplicaSet has actually changed
- If it has, push the Deployment onto the worker queue
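Here is a small, self-contained sketch of the first step: resolving a ReplicaSet's owning Deployment from its ownerReferences with metav1.GetControllerOf, the same call used in the handlers above. The ReplicaSet literal and names are purely illustrative, and the simple Kind check is a simplification of what resolveControllerRef does.

```go
package main

import (
	"fmt"

	apps "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// owningDeploymentName returns the name of the Deployment that controls rs, if any.
func owningDeploymentName(rs *apps.ReplicaSet) (string, bool) {
	ref := metav1.GetControllerOf(rs)
	if ref == nil || ref.Kind != "Deployment" {
		return "", false
	}
	return ref.Name, true
}

func main() {
	isController := true
	rs := &apps.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{
			Name: "nginx-6f858d4d45", // illustrative ReplicaSet
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "nginx",
				Controller: &isController,
			}},
		},
	}
	if name, ok := owningDeploymentName(rs); ok {
		fmt.Println("owned by Deployment:", name)
	}
}
```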
Finally, let's look at how Pods are handled:
func (dc *DeploymentController) deletePod(obj interface{}) {
    pod, ok := obj.(*v1.Pod)

    // When a delete is dropped, the relist will notice a pod in the store not
    // in the list, leading to the insertion of a tombstone object which contains
    // the deleted key/value. Note that this value might be stale. If the Pod
    // changed labels the new deployment will not be woken up till the periodic resync.
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
            return
        }
        pod, ok = tombstone.Obj.(*v1.Pod)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a pod %#v", obj))
            return
        }
    }
    klog.V(4).Infof("Pod %s deleted.", pod.Name)
    if d := dc.getDeploymentForPod(pod); d != nil && d.Spec.Strategy.Type == apps.RecreateDeploymentStrategyType {
        // Sync if this Deployment now has no more Pods.
        rsList, err := util.ListReplicaSets(d, util.RsListFromClient(dc.client.AppsV1()))
        if err != nil {
            return
        }
        podMap, err := dc.getPodMapForDeployment(d, rsList)
        if err != nil {
            return
        }
        numPods := 0
        for _, podList := range podMap {
            numPods += len(podList.Items)
        }
        if numPods == 0 {
            dc.enqueueDeployment(d)
        }
    }
}
As you can see, the basic idea is much the same: once the controller detects that all Pods belonging to a Deployment (with the Recreate strategy) have been deleted, the Deployment's name is pushed onto the worker queue.
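All of these handlers only produce work: they push a Deployment key onto the rate-limited work queue. The consumer side, the worker loop that pops keys and calls syncDeployment, is not shown in this walkthrough; as a quick preview, here is a minimal, self-contained sketch of that standard client-go workqueue pattern, with a placeholder sync function standing in for the real syncDeployment.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// sync is a placeholder for the real syncDeployment reconcile function.
func sync(key string) error {
	fmt.Println("syncing", key)
	return nil
}

// worker pops keys off the queue until it is shut down.
func worker(queue workqueue.RateLimitingInterface) {
	for {
		key, quit := queue.Get()
		if quit {
			return
		}
		if err := sync(key.(string)); err != nil {
			queue.AddRateLimited(key) // retry later with backoff
		} else {
			queue.Forget(key) // clear the rate-limiter history for this key
		}
		queue.Done(key)
	}
}

func main() {
	// Same constructor the Deployment Controller uses for its queue.
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment")

	// The event handlers above would call queue.Add("namespace/name") here.
	queue.Add("default/nginx")

	queue.ShutDown() // let the worker drain the queue and exit (demo only)
	worker(queue)
}
```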
Kube Controller Manager Source Code Analysis (Part 2)
http://toutiao.com/item/6649524312334139908/
Author: xianlubird