Troubleshooting Notes for the Federated Learning Framework FATE

MySQL keeps restarting

Checking container status with docker-compose ps shows that the mysql and python containers are stuck in the Restarting state:

# docker-compose ps
            Name                           Command                 State                   Ports
--------------------------------------------------------------------------------------------------------------
confs-168801_client_1           /bin/sh -c flow init -c /d ...   Up           0.0.0.0:20000->20000/tcp
confs-168801_clustermanager_1   /tini -- bash -c java -Dlo ...   Up           4670/tcp, 8080/tcp
confs-168801_fateboard_1        /bin/sh -c java -Dspring.c ...   Up           0.0.0.0:8080->8080/tcp
confs-168801_mysql_1            docker-entrypoint.sh mysqld      Restarting
confs-168801_nodemanager_1      /tini -- bash -c java -Dlo ...   Up           4671/tcp, 8080/tcp
confs-168801_python_1           container-entrypoint /bin/ ...   Restarting
confs-168801_rollsite_1         /tini -- bash -c java -Dlo ...   Up           8080/tcp, 0.0.0.0:9370->9370/tcp
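
In a longer listing the restarting containers are easy to miss. As a minimal sketch, the helper below filters docker-compose ps output read from standard input and prints the names of the containers whose line contains Restarting. The function name restarting_services is hypothetical (it is not part of Docker), and the sketch assumes the column layout shown above, where the name is the first field.

```shell
# Hypothetical helper: print the names of containers reported as Restarting.
# Reads `docker-compose ps` output on stdin; assumes the name is the first column.
restarting_services() {
    awk '/Restarting/ { print $1 }'
}

# Example (requires a running compose project):
#   docker-compose ps | restarting_services
```

For the listing above this would print confs-168801_mysql_1 and confs-168801_python_1.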

First, look up the MySQL container:

docker ps
CONTAINER ID        IMAGE                                                    COMMAND                  CREATED             STATUS                          PORTS                                                      NAMES
8ddc6e8aeee9        hub.c.163.com/federatedai/mysql:8                        "docker-entrypoint.s…"   8 hours ago         Restarting (1) 53 seconds ago                                                              confs-168801_mysql_1

Then check the MySQL logs:

docker logs 8ddc6e8aeee9

The logs contain the following errors:

2021-11-08T10:27:26.422440Z 1 [ERROR] [MY-012639] [InnoDB] Write to file ./ibtmp1 failed at offset 0, 1048576 bytes should have been written, only 0 were written. Operating system error number 28. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
2021-11-08T10:27:26.422522Z 1 [ERROR] [MY-012640] [InnoDB] Error number 28 means 'No space left on device'
2021-11-08T10:27:26.422685Z 1 [ERROR] [MY-012267] [InnoDB] Could not set the file size of './ibtmp1'. Probably out of disk space
2021-11-08T10:27:26.422766Z 1 [ERROR] [MY-012926] [InnoDB] Unable to create the shared innodb_temporary.
2021-11-08T10:27:26.422857Z 1 [ERROR] [MY-012930] [InnoDB] Plugin initialization aborted with error Generic error.
2021-11-08T10:27:26.818621Z 1 [ERROR] [MY-010334] [Server] Failed to initialize DD Storage Engine
2021-11-08T10:27:26.818852Z 0 [ERROR] [MY-010020] [Server] Data Dictionary initialization failed.
2021-11-08T10:27:26.819105Z 0 [ERROR] [MY-010119] [Server] Aborting
2021-11-08T10:27:26.819512Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.21)  MySQL Community Server - GPL.

The error log reports that there is no space left on the device. The df command confirms that the root filesystem is full:

[root@kubefate001 confs-168801]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        40G   39G     0  100% /
devtmpfs        7.5G     0  7.5G    0% /dev
tmpfs           7.6G     0  7.6G    0% /dev/shm
tmpfs           7.6G  860K  7.6G    1% /run
tmpfs           7.6G     0  7.6G    0% /sys/fs/cgroup
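
To catch this earlier, the same df check can be scripted. The sketch below is only an illustration: disk_usage_pct and disk_nearly_full are hypothetical helper names, and the 90% default threshold is an assumption, not a value from the original setup.

```shell
# Hypothetical helper: print the usage percentage (digits only) of the
# filesystem that holds the given path. `df -P` forces the POSIX one-line format.
disk_usage_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed (exit 0) when usage is at or above the threshold.
# $1 = path (default /), $2 = threshold in percent (assumed default: 90)
disk_nearly_full() {
    [ "$(disk_usage_pct "$1")" -ge "${2:-90}" ]
}

# Example:
#   disk_nearly_full / 90 && echo "root filesystem is almost full"
```

On a Docker host, docker system df is one way to see how much of that space is taken by images, containers, and volumes.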

KubeFATE: kubectl client and server versions do not match

kubectl version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:51:19Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.22) and server (1.20) exceeds the supported minor version skew of +/-1

Downgrading the client to 1.20.0, the same version as the server, resolves the warning:

Linux:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.20.0/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl

macOS:
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.20.0/bin/darwin/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl

Windows:
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.20.0/bin/windows/amd64/kubectl.exe

File upload fails

Uploading a file through KubeBoard fails with the error: 413 Request Entity Too Large

This is caused by nginx's limit on the size of an uploaded request body, so the limit needs to be raised.

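In our setup the service is exposed through an nginx ingress, so the ingress configuration has to be changed and then applied for the new limit to take effect.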

First, edit the mandatory.yaml file:

kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
data:
  proxy-body-size: "1024m"

That is, add proxy-body-size: "1024m" under data in the ConfigMap.

Then run kubectl apply -f mandatory.yaml to apply the change.
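
When choosing a value for proxy-body-size, it can help to compare the limit against the size of the files being uploaded. The sketch below converts an nginx-style size (k and m suffixes, which nginx treats as multiples of 1024) to bytes. The helper names nginx_size_to_bytes and fits_limit are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: convert an nginx-style size ("512k", "1024m", or plain
# bytes) to a number of bytes.
nginx_size_to_bytes() {
    num=${1%[kKmM]}
    case "$1" in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# Hypothetical helper: succeed when a body of $1 bytes fits within limit $2.
fits_limit() {
    [ "$1" -le "$(nginx_size_to_bytes "$2")" ]
}

# Example: would a 2,000,000-byte file pass a 1m limit?
#   fits_limit 2000000 1m || echo "expect 413 Request Entity Too Large"
```

To confirm that the new value reached the cluster, kubectl get configmap nginx-configuration -n ingress-nginx -o yaml prints the ConfigMap as currently stored.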

Clusters cannot reach each other

[Roll Site Error TransInfo]
 location msg=operation POST not supported
 stack info=scala.NotImplementedError: operation POST not supported
        at com.webank.eggroll.rollsite.EggSiteServicer.processCommand(EggSiteServicer.scala:171)
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /36.112.80.34:9370

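The root cause turned out to be that the port was not reachable, which matches the Connection refused error on port 9370 in the log above. Once the port was configured, the problem went away.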
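
A quick reachability test helps separate a blocked port from an application error. The sketch below assumes that bash and the coreutils timeout command are available. port_open is a hypothetical helper name, and the 3-second default timeout is an arbitrary choice.

```shell
# Hypothetical helper: succeed (exit 0) if a TCP connection to host:port can be
# opened within the timeout. Uses bash's /dev/tcp pseudo-device.
# $1 = host, $2 = port, $3 = timeout in seconds (assumed default: 3)
port_open() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example with the address from the log above:
#   port_open 36.112.80.34 9370 || echo "port 9370 is not reachable"
```

If the check fails from the peer party's host but succeeds locally, a firewall or port mapping between the two sites is a likely place to look.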

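Unless otherwise noted, all articles on this site are original work; reprints must credit the source. HollisChuang's Blog » Troubleshooting Notes for the Federated Learning Framework FATE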