Deploying a FATE Cluster on Kubernetes with KubeFATE

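FATE, the open-source federated learning framework, can be deployed in several ways: a Docker-Compose deployment, a standalone deployment, a native cluster deployment, and a KubeFATE-based deployment.

If you want to try FATE quickly, or the models and data you run fit on a single machine, the Docker-Compose deployment is a good choice and is fairly simple to set up.

If you only want to develop algorithms, your development machine is not very powerful, and you also want to test the underlying egg modules, standalone is a convenient option.

If larger datasets and models mean you need to scale out, and you have data that requires maintaining a FATE cluster, consider deploying on a Kubernetes cluster with KubeFATE.

The last option, the native cluster deployment, is generally used only for special reasons, such as when Kubernetes cannot be deployed internally or when you need to customize the FATE deployment yourself.

This post covers my attempt at deploying on a Kubernetes cluster with KubeFATE. I hit many pitfalls along the way: the official documentation describes a MiniKube-based test environment, so many of the problems I ran into never show up there. The process was long and painful, but the deployment eventually succeeded.

The deployment steps and the problems I ran into are as follows:

Everything in this deployment was run on the master machine; nothing needs to be run on the worker nodes.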

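Prerequisites

Two K8S clusters already exist, both have an ingress-controller deployed, and they can reach each other over the network. (For K8S cluster deployment and Ingress installation, see: Deploying a K8S Cluster on CentOS)

Cluster check

Before deploying, check the state of the machines in the K8S cluster: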

[root@k8s-master1 ~]# kubectl get node -o wide
NAME          STATUS   ROLES                  AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION              CONTAINER-RUNTIME
k8s-master1   Ready    control-plane,master   20h   v1.22.3   172.29.247.176   <none>        CentOS Linux 7 (Core)   3.10.0-693.2.2.el7.x86_64   docker://18.9.1
k8s-node1     Ready    <none>                 20h   v1.22.3   172.29.247.175   <none>        CentOS Linux 7 (Core)   3.10.0-693.2.2.el7.x86_64   docker://18.9.1
k8s-node2     Ready    <none>                 20h   v1.22.3   172.29.247.177   <none>        CentOS Linux 7 (Core)   3.10.0-693.2.2.el7.x86_64   docker://18.9.1
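
This check can be turned into a pass/fail test before continuing. The following is only a minimal sketch, assuming the default column layout shown above (STATUS in the second column); all_nodes_ready is a hypothetical helper name, not part of kubectl.

```shell
# Hypothetical helper: reads `kubectl get node` output on stdin and succeeds
# only if at least one node is listed and every STATUS (2nd field) is "Ready".
all_nodes_ready() {
  awk 'NR > 1 { seen = 1; if ($2 != "Ready") bad = 1 }
       END { exit (bad || !seen) ? 1 : 0 }'
}

# Usage (on the master):
#   kubectl get node | all_nodes_ready && echo "all nodes Ready"
```

A node that is cordoned or NotReady makes the function fail, which is a convenient point to stop a deployment script.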

Deploying KubeFATE

Download the KubeFATE archive

wget https://github.com/FederatedAI/KubeFATE/releases/download/v1.5.1/kubefate-k8s-v1.5.1.tar.gz

I am using v1.5.1 here; you can replace the version in the link above with the one you want:

https://github.com/FederatedAI/KubeFATE/releases/download/{version}/kubefate-k8s-{version}.tar.gz
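
If you switch versions often, the substitution can be scripted. A small sketch, assuming the release tag includes the leading "v" as in v1.5.1; kubefate_release_url is a hypothetical helper name.

```shell
# Hypothetical helper: print the release tarball URL for a given KubeFATE tag.
kubefate_release_url() {
  version="$1"   # e.g. v1.5.1 (the tag includes the leading "v")
  printf 'https://github.com/FederatedAI/KubeFATE/releases/download/%s/kubefate-k8s-%s.tar.gz\n' \
    "$version" "$version"
}

# Usage:
#   wget "$(kubefate_release_url v1.5.1)"
```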

Extract the archive

tar -zxvf kubefate-k8s-v1.5.1.tar.gz

After extraction, it mainly contains the following files:

cluster-serving.yaml cluster-spark.yaml   cluster.yaml         config.yaml          examples             kubefate             kubefate.yaml        rbac-config.yaml

Apply rbac-config.yaml

kubectl apply -f ./rbac-config.yaml
namespace/kube-fate created
serviceaccount/kubefate-admin created
clusterrolebinding.rbac.authorization.k8s.io/kubefate created

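Then deploy the KubeFATE service itself with the kubefate.yaml file from the same archive (kubectl apply -f ./kubefate.yaml), and check the result: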

[root@k8s-master1 ~]# kubectl get pod,ingress -n kube-fate
NAME                            READY   STATUS    RESTARTS   AGE
pod/kubefate-857dd6fcb5-srprb   1/1     Running   0          44s
pod/mariadb-7c8848bd55-w85dw    1/1     Running   0          44s

Configure the ingress host

Find the name of the node that the ingress pod is running on:

[root@k8s-master1 ~]# kubectl get pod -A -o wide | grep ingress
ingress-nginx   nginx-ingress-controller-7d4544b644-j65jb   1/1     Running   0          17m    172.29.247.177   k8s-node2     <none>           <none>

As you can see, node2 is the node running ingress.

Then bind node2's IP to kubefate.net in the hosts file:

echo "192.168.1.1 kubefate.net" >> /etc/hosts

Replace 192.168.1.1 here with your own IP.
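
Running the echo command twice leaves duplicate lines in /etc/hosts. The sketch below appends the binding only if the hostname is not already bound; bind_host is a hypothetical helper, and the third argument exists only so it can be tried against a scratch file first.

```shell
# Hypothetical helper: bind a hostname to an IP in a hosts file, at most once.
# Arguments: IP, hostname, hosts file (defaults to /etc/hosts).
bind_host() {
  bh_ip="$1"; bh_name="$2"; bh_file="${3:-/etc/hosts}"
  # awk succeeds if a non-comment line already lists the hostname.
  if awk -v h="$bh_name" '
       $1 ~ /^#/ { next }
       { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
       END { exit found ? 0 : 1 }' "$bh_file"; then
    echo "$bh_name is already bound in $bh_file" >&2
  else
    printf '%s %s\n' "$bh_ip" "$bh_name" >> "$bh_file"
  fi
}

# Usage (as root):
#   bind_host 192.168.1.1 kubefate.net
```

Note that it does not update an existing binding that points at the wrong IP; it only reports that one is present.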

Install the kubefate command-line tool

chmod +x ./kubefate && sudo mv ./kubefate /usr/local/bin/kubefate

Check connectivity

[root@k8s-master1 ~]# kubefate version
* kubefate commandLine version=v1.3.0
* kubefate service version=v1.3.0

If the line kubefate service version=v1.3.0 appears, the connection works.

In my case, however, the command failed:

[root@k8s-master1 ~]# kubefate version
kubefate: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by kubefate)

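This happens because the kubefate binary depends on a dynamic library (glibc; it asks for GLIBC_2.28) that the system does not provide. It is a bug that was fixed in version 1.6.1; see: https://github.com/FederatedAI/KubeFATE/issues/372

However, the external organization I work with uses 1.5.1, so I could not upgrade to 1.6.1 and had to try to solve the problem myself.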
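
Before rebuilding anything, it is worth confirming that the host's glibc really is older than the 2.28 named in the error. A minimal sketch, assuming GNU sort (for -V) and getconf are available; version_ge and host_glibc_version are hypothetical helper names.

```shell
# Hypothetical helper: succeed if version $1 >= version $2 (uses GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Print the glibc version of the running system.
# `getconf GNU_LIBC_VERSION` prints a line such as "glibc 2.17".
host_glibc_version() {
  getconf GNU_LIBC_VERSION | awk '{ print $2 }'
}

# Usage:
#   if version_ge "$(host_glibc_version)" 2.28; then
#     echo "glibc is new enough for this kubefate binary"
#   else
#     echo "glibc is older than 2.28; expect the GLIBC_2.28 error"
#   fi
```

sort -V compares version strings component by component, so 2.9 correctly sorts below 2.28, which a plain numeric or string comparison would get wrong.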

Fixing the missing GLIBC_2.28

Install glibc 2.28

curl -O http://ftp.gnu.org/gnu/glibc/glibc-2.28.tar.gz
tar zxf glibc-2.28.tar.gz 
cd glibc-2.28/
mkdir build
cd build/
../configure --prefix=/usr/local/glibc-2.28

This fails with:

configure: error:
*** These critical programs are missing or too old: make bison compiler
*** Check the INSTALL file for required versions.

The message says that bison, make, and the compiler are missing or too old, so they need to be installed one by one.
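
To see at a glance which of these tools are absent, a check along the following lines can help. It is only a sketch: missing_tools is a hypothetical helper, and it detects tools that are not installed at all, not ones that are installed but too old for configure.

```shell
# Hypothetical helper: print each command name from the arguments that is
# not found on PATH, one per line. Prints nothing if all are present.
missing_tools() {
  for mt_tool in "$@"; do
    command -v "$mt_tool" >/dev/null 2>&1 || echo "$mt_tool"
  done
}

# Usage:
#   missing_tools bison make gcc
```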

A quick check shows that bison is not installed on this machine:

[root@k8s-master1 ~]# bison --version
-bash: bison: command not found

Install it:

yum install bison

Install make:

wget http://ftp.gnu.org/gnu/make/make-4.2.tar.gz
tar -xzvf make-4.2.tar.gz
cd make-4.2
sudo ./configure
sudo make
sudo make install
sudo rm -rf /usr/bin/make
sudo cp ./make /usr/bin/
make -v

Install the compiler:

yum -y install centos-release-scl
yum -y install devtoolset-8-gcc devtoolset-8-gcc-c++ devtoolset-8-binutils
scl enable devtoolset-8 bash
echo "source /opt/rh/devtoolset-8/enable" >>/etc/profile

Then rerun the glibc build:

cd glibc-2.28/
mkdir -p build   # build/ already exists from the first attempt
cd build/
sudo ../configure --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make
make install

Check the result:

strings /lib64/libc.so.6 |grep GLIBC_2.28
GLIBC_2.28
GLIBC_2.28

Running kubefate version now works.

Fixing connection refused

If you run into this problem at this point:

[root@k8s-master1 build]#  kubefate version
* kubefate commandLine version=v1.3.0
* kubefate service connection error, Post "http://localhost:8080/v1/user/login": dial tcp 127.0.0.1:8080: connect: connection refused

it is because dnsPolicy: ClusterFirstWithHostNet was not added to the ingress-controller configuration; see: Ingress installation

If you run into this problem instead:

[root@k8s-master1 ~]#  kubefate version
* kubefate commandLine version=v1.3.0
* kubefate service connection error, Post "http://kubefate.net/v1/user/login": dial tcp 192.168.1.1:80: i/o timeout

then port 80 may not be open, and you need to open it.

Alternatively, the Ingress IP may be misconfigured. Check that the kubefate.net binding in /etc/hosts is correct.
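
The two failure modes above can be told apart mechanically from the error text. The sketch below simply maps each message to the fix described in this section; diagnose_kubefate_error is a hypothetical helper, and the hints are the ones from this post, not an exhaustive diagnosis.

```shell
# Hypothetical helper: map a kubefate connection error to the fix from this post.
# Argument: the error line printed by `kubefate version`.
diagnose_kubefate_error() {
  case "$1" in
    *"connection refused"*)
      echo "add dnsPolicy: ClusterFirstWithHostNet to the ingress-controller" ;;
    *"i/o timeout"*)
      echo "open port 80 and check the kubefate.net entry in /etc/hosts" ;;
    *)
      echo "unrecognized error" ;;
  esac
}

# Usage:
#   diagnose_kubefate_error "$(kubefate version 2>&1 | grep 'connection error')"
```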

Deploying FATE

Create the namespace

kubectl create namespace fate-10000

The name fate-10000 is yours to choose. If you have two clusters, create two namespaces in this step, with different names:

kubectl create namespace fate-10000
kubectl create namespace fate-9999

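Edit cluster.yaml

The extracted KubeFATE directory contains a cluster.yaml file. Open it and edit it so that it reads as follows:

name: fate-10000  # cluster name; here it carries your partyId
namespace: fate-10000  # your namespace; must match the one just created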
chartName: fate
chartVersion: v1.5.1
partyId: 10000
registry: ""
imageTag: ""
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client
rollsite:
   type: NodePort
   nodePort: 30100  # your port
   partyList:
   - partyId: 9999  # partyId of the peer you will do federated learning with
     partyIp: 192.168.1.1  # IP of the peer
     partyPort: 30101  # port of the peer

nodemanager:
   count: 3
   sessionProcessorsPerNode: 4
   list:
   - name: nodemanager
     nodeSelector:
     sessionProcessorsPerNode: 4
     subPath: "nodemanager"
     existingClaim: ""
     storageClass: "nodemanager"
     accessMode: ReadWriteOnce
     size: 1Gi

python:
   type: NodePort
   httpNodePort: 30097
   grpcNodePort: 30092

mysql:
   nodeSelector:
   ip: mysql
   port: 3306
   database: eggroll_meta
   user: fate
   password: fate_dev
   subPath: ""
   existingClaim: ""
   storageClass: "mysql"
   accessMode: ReadWriteOnce
   size: 1Gi

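If two clusters are to connect to each other directly, you also need to create a cluster2.yaml file:

name: fate-9999  # cluster name; here it carries your partyId
namespace: fate-9999  # your namespace; must match the one just created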
chartName: fate
chartVersion: v1.5.1
partyId: 9999
registry: ""
imageTag: ""
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client
rollsite:
   type: NodePort
   nodePort: 30101  # your port
   partyList:
   - partyId: 10000  # partyId of the peer you will do federated learning with
     partyIp: 192.168.1.2  # IP of the peer
     partyPort: 30100  # port of the peer

nodemanager:
   count: 3
   sessionProcessorsPerNode: 4
   list:
   - name: nodemanager
     nodeSelector:
     sessionProcessorsPerNode: 4
     subPath: "nodemanager"
     existingClaim: ""
     storageClass: "nodemanager"
     accessMode: ReadWriteOnce
     size: 1Gi

python:
   type: NodePort
   httpNodePort: 30097
   grpcNodePort: 30092

mysql:
   nodeSelector:
   ip: mysql
   port: 3306
   database: eggroll_meta
   user: fate
   password: fate_dev
   subPath: ""
   existingClaim: ""
   storageClass: "mysql"
   accessMode: ReadWriteOnce
   size: 1Gi
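
The two files must mirror each other: each side's rollsite nodePort has to equal the partyPort the other side lists for it, and the partyIds have to cross-reference. A mismatch here is easy to make and hard to spot, so a check like the following can help. It is a rough sketch, not a YAML parser: it assumes the flat "key: value" layout used above and exactly one entry in each partyList, and all function names are hypothetical.

```shell
# Hypothetical helpers for the layout above (one peer per partyList).

# First value of a plain "key: value" line (own partyId, nodePort, partyPort).
own_value() { awk -v k="$1:" '$1 == k { print $2; exit }' "$2"; }

# partyId of the first partyList entry, i.e. the "- partyId: N" line.
peer_party_id() { awk '$1 == "-" && $2 == "partyId:" { print $3; exit }' "$1"; }

# Succeed only if the two cluster files reference each other consistently.
check_party_pair() {
  cp_a="$1"; cp_b="$2"; cp_rc=0
  [ "$(own_value partyId "$cp_a")" = "$(peer_party_id "$cp_b")" ] ||
    { echo "partyId of $cp_a is not the peer listed in $cp_b" >&2; cp_rc=1; }
  [ "$(own_value partyId "$cp_b")" = "$(peer_party_id "$cp_a")" ] ||
    { echo "partyId of $cp_b is not the peer listed in $cp_a" >&2; cp_rc=1; }
  [ "$(own_value nodePort "$cp_a")" = "$(own_value partyPort "$cp_b")" ] ||
    { echo "nodePort of $cp_a differs from partyPort in $cp_b" >&2; cp_rc=1; }
  [ "$(own_value nodePort "$cp_b")" = "$(own_value partyPort "$cp_a")" ] ||
    { echo "nodePort of $cp_b differs from partyPort in $cp_a" >&2; cp_rc=1; }
  return "$cp_rc"
}

# Usage:
#   check_party_pair cluster.yaml cluster2.yaml && echo "party configs match"
```

The check does not validate partyIp, since the right address depends on how the peer cluster is reachable from your network.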

Run the FATE deployment

[root@k8s-master1 ~]# kubefate cluster install -f cluster.yaml
create job success, job id=d8caf016-2269-4d0d-bb3f-5c8ad9824350


[root@k8s-master1 ~]# kubefate cluster install -f cluster2.yaml
create job success, job id=d8caf016-2269-4d0d-bb3f-5c8ad9824350

Check the job status:

[root@k8s-master1 ~]# kubefate job list
UUID                                    CREATOR METHOD          STATUS  STARTTIME           CLUSTERID                               AGE
d8caf016-2269-4d0d-bb3f-5c8ad9824350    admin   ClusterInstall  Failed  2021-11-09 03:58:11 558ce45f-27c5-474e-951a-093637d0e484    3s

If a job has failed, look at the error message:

[root@k8s-master1 ~]# kubefate job describe d8caf016-2269-4d0d-bb3f-5c8ad9824350
UUID        d8caf016-2269-4d0d-bb3f-5c8ad9824350
StartTime   2021-11-09 03:58:11
EndTime     2021-11-09 03:58:15
Duration    3s
Status      Failed
Creator     admin
ClusterId   558ce45f-27c5-474e-951a-093637d0e484
Result      ConfigMap "nodemanager-0-config" is invalid: metadata.labels: Invalid value: "1.6881e+06": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is
            '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')

After fixing the configuration file according to the error and running the install again, the job should show Running:

[root@k8s-master1 ~]# kubefate job list
UUID                                    CREATOR METHOD          STATUS  STARTTIME           CLUSTERID                               AGE
4e587c68-8ad4-4546-bd1e-3d7a36b7d694    admin   ClusterInstall  Running 2021-11-09 04:54:26 ce23b18b-cea8-462c-95e7-76192ef08ea7    9s

Now look at the job's details:

kubefate job describe 4e587c68-8ad4-4546-bd1e-3d7a36b7d694
UUID        4e587c68-8ad4-4546-bd1e-3d7a36b7d694
StartTime   2021-11-09 04:54:26
EndTime     0001-01-01 00:00:00
Duration    41s
Status      Running
Creator     admin
ClusterId   ce23b18b-cea8-462c-95e7-76192ef08ea7
Result      Cluster install success
SubJobs     nodemanager-2-6dd74c79cb-7p5gj PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
            nodemanager-7cfb965848-vnr2p PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
            python-56bc7865-xxxqb PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
            rollsite-56878dd8b9-hdgf9 PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
            clustermanager-694bbccdc9-tjxt8 PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
            mysql-867fb9c446-qnppk PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
            nodemanager-0-5c8d46c664-vk2ff PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
            nodemanager-1-56d4f9676-llx4k PodStatus: Pending, SubJobStatus: Pending, Duration:    41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00

Once every sub-job's status has changed to SUCCESS, the deployment has succeeded.
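
Instead of rerunning the describe command by hand, the wait can be scripted. The following is a sketch under stated assumptions: it relies only on the "Status" line of the kubefate job describe output shown above, treats Running as "keep waiting" and Failed as an error, and reports any other status as finished. job_status and wait_for_job are hypothetical helper names.

```shell
# Hypothetical helper: read `kubefate job describe` output on stdin and print
# the value of its "Status" line.
job_status() {
  awk '$1 == "Status" { print $2; exit }'
}

# Poll a job until it leaves the Running state.
# Arguments: job UUID, polling interval in seconds (default 10).
wait_for_job() {
  wj_id="$1"; wj_interval="${2:-10}"
  while :; do
    wj_status=$(kubefate job describe "$wj_id" | job_status)
    case "$wj_status" in
      Running) sleep "$wj_interval" ;;
      Failed)  echo "job $wj_id failed" >&2; return 1 ;;
      "")      echo "could not read status of job $wj_id" >&2; return 2 ;;
      *)       echo "job $wj_id finished with status: $wj_status"; return 0 ;;
    esac
  done
}

# Usage:
#   wait_for_job 4e587c68-8ad4-4546-bd1e-3d7a36b7d694 15
```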

Testing FATE

Run the following on any machine in the cluster:

kubectl exec -it svc/fateflow -c python -n fate-10000 -- bash

cd ../examples/toy_example/

python run_toy_example.py 10000 9999 1

On success, the output looks like this:

stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202111090955384060332&role=guest&party_id=10000",
        "job_dsl_path": "/data/projects/fate/jobs/202111090955384060332/job_dsl.json",
        "job_id": "202111090955384060332",
        "job_runtime_conf_on_party_path": "/data/projects/fate/jobs/202111090955384060332/guest/job_runtime_on_party_conf.json",
        "job_runtime_conf_path": "/data/projects/fate/jobs/202111090955384060332/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/logs/202111090955384060332",
        "model_info": {
            "model_id": "guest-10000#host-9999#model",
            "model_version": "202111090955384060332"
        },
        "pipeline_dsl_path": "/data/projects/fate/jobs/202111090955384060332/pipeline_dsl.json",
        "train_runtime_conf_path": "/data/projects/fate/jobs/202111090955384060332/train_runtime_conf.json"
    },
    "jobId": "202111090955384060332",
    "retcode": 0,
    "retmsg": "success"
}


job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
[INFO] [2021-11-09 09:55:41,741] [845:139855671260992] - secure_add_guest.py[line:99]: begin to init parameters of secure add example guest
[INFO] [2021-11-09 09:55:41,742] [845:139855671260992] - secure_add_guest.py[line:102]: begin to make guest data
[INFO] [2021-11-09 09:55:42,494] [845:139855671260992] - secure_add_guest.py[line:105]: split data into two random parts
[INFO] [2021-11-09 09:55:44,404] [845:139855671260992] - secure_add_guest.py[line:108]: share one random part data to host
[INFO] [2021-11-09 09:55:44,411] [845:139855671260992] - secure_add_guest.py[line:111]: get share of one random part data from host
[INFO] [2021-11-09 09:55:45,184] [845:139855671260992] - secure_add_guest.py[line:114]: begin to get sum of guest and host
[INFO] [2021-11-09 09:55:46,094] [845:139855671260992] - secure_add_guest.py[line:117]: receive host sum from guest
[INFO] [2021-11-09 09:55:46,122] [845:139855671260992] - secure_add_guest.py[line:124]: success to calculate secure_sum, it is 1999.9999999999993
...
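
If you run the toy example from a script, the two signals visible in the output above ("retcode": 0 on submission and the final secure_sum line) can be checked automatically. A minimal sketch; toy_example_ok is a hypothetical helper, and it assumes the output format shown here.

```shell
# Hypothetical helper: read the toy example's output on stdin; succeed only if
# the job submission returned retcode 0 and the secure sum was reported.
toy_example_ok() {
  awk '
    /"retcode": 0/                    { submitted = 1 }
    /success to calculate secure_sum/ { summed = 1 }
    END { exit (submitted && summed) ? 0 : 1 }'
}

# Usage (inside the python container):
#   python run_toy_example.py 10000 9999 1 2>&1 | toy_example_ok && echo "toy example passed"
```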

You can also check the FATE deployment from each machine:

[root@k8s-node1 ~]# kubectl get pods --namespace=fate-10000
NAME                              READY   STATUS    RESTARTS   AGE
clustermanager-694bbccdc9-tjxt8   1/1     Running   0          5h4m
mysql-867fb9c446-qnppk            1/1     Running   0          5h4m
nodemanager-0-5c8d46c664-vk2ff    2/2     Running   0          5h4m
nodemanager-1-56d4f9676-llx4k     2/2     Running   0          5h4m
nodemanager-2-6dd74c79cb-7p5gj    2/2     Running   0          5h4m
nodemanager-7cfb965848-vnr2p      2/2     Running   0          5h4m
python-56bc7865-xxxqb             3/3     Running   1          5h4m
rollsite-56878dd8b9-hdgf9         1/1     Running   0          5h4m

Adding a new party

To add a new party to an existing federated learning cluster, first edit cluster.yaml and add the new party's information:

rollsite:
   type: NodePort
   nodePort: 10000
  # exchange:
    # ip: 192.168.0.1
    # port: 30000
   partyList:
   - partyId: 9999
     partyIp: 192.168.1.2 
     partyPort: 9370
   - partyId: 30100
     partyIp: 192.168.1.3
     partyPort: 30100
  # nodeSelector:

Then run the cluster update command:

kubefate cluster update -f cluster.yaml

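This also creates several jobs; once they have run successfully, the party has been added.

Of course, the other side must also configure this party's information before the two can communicate.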
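
Since both sides need a matching partyList entry, generating the entry from the same three values reduces copy-paste mistakes. A small sketch; party_entry is a hypothetical helper that prints an entry with the indentation used in the rollsite section above, to be pasted under partyList.

```shell
# Hypothetical helper: print one partyList entry (3 spaces before "-",
# 5 before the other keys, matching the rollsite section above).
# Arguments: partyId, partyIp, partyPort.
party_entry() {
  printf '   - partyId: %s\n     partyIp: %s\n     partyPort: %s\n' "$1" "$2" "$3"
}

# Usage:
#   party_entry 30100 192.168.1.3 30100
```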

(End)

Unless otherwise stated, all articles on this site are original; reposts must credit the source. HollisChuang's Blog » Deploying a FATE Cluster on Kubernetes with KubeFATE