Notes

you are better than you think

eBPF in Practice: Troubleshooting a Timeout Issue

19 Sep, 2020

Problem description

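In production, module A calls module B. Module A runs on physical machines and module B runs in containers. During the morning and evening peaks, A's requests to B trigger timeout alerts.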

Troubleshooting

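The business team asked us to investigate. I first requested access to the two machines that monitoring flagged for timeouts and looked at the timeout pattern. Then I logged in to the host machine running container 10.167.8.25 for routine checks.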

dmesg showed the following, so I contacted the on-call engineer and the systems team to file a hardware repair request:

EDAC sbridge MC1: HANDLING MCE MEMORY ERROR

After migrating container 10.167.8.252, 10.167.41.18 still appears close to 10 times per hour in the logs on both physical machines.

A Quick Read Through the Docker Source Code

19 Sep, 2020

Before you start reading

git: https://github.com/moby/moby.git
branch: v18.06.3-ce
ide: GoLand 2020.2

Configure build tags in GoLand

A pattern you will see over and over in the Docker code: exec.Command => cmd.Start() => cmd.Wait()
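To see the pattern in isolation, here is a minimal Go sketch using only the standard os/exec package. `runAndWait` is a hypothetical helper, not a Docker function, and `sh -c "exit 3"` is just a stand-in child process. `exec.Command` only builds the command, `Start` launches the child without blocking, and `Wait` blocks until the child exits and reaps it.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runAndWait is a hypothetical helper illustrating the
// exec.Command => cmd.Start() => cmd.Wait() pattern.
// It returns the child's PID and exit code.
func runAndWait(name string, args ...string) (int, int, error) {
	// exec.Command only builds the *exec.Cmd; nothing runs yet.
	cmd := exec.Command(name, args...)
	// Start launches the child and returns immediately.
	if err := cmd.Start(); err != nil {
		return 0, -1, fmt.Errorf("start %s: %w", name, err)
	}
	// The PID is known as soon as Start returns, before the child exits,
	// so a parent can record it here before blocking.
	pid := cmd.Process.Pid
	// Wait blocks until the child exits and reaps it (no zombie is left).
	if err := cmd.Wait(); err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// Non-zero exit status: report the code instead of failing.
			return pid, exitErr.ExitCode(), nil
		}
		return pid, -1, err
	}
	return pid, cmd.ProcessState.ExitCode(), nil
}

func main() {
	pid, code, err := runAndWait("sh", "-c", "exit 3")
	if err != nil {
		panic(err)
	}
	fmt.Printf("child %d exited with code %d\n", pid, code)
}
```

Splitting `Start` from `Wait` (rather than calling `cmd.Run()`) gives the parent a window to record the child's PID before blocking, which plausibly explains why the pattern recurs wherever one component launches the next.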

Process model

docker   
  |      
  V       
dockerd -> containerd ---> shim -> runc -> runc init -> process1
                      |--> shim -> runc -> runc init -> process2
                      +--> shim -> runc -> runc init -> process3

Notes on a PR to a Third-Party Library

13 Dec, 2019

Background

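Many production scenarios need pstree, for example listing all child processes of a container, or killing running child processes when a rollback task runs.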

A lightweight library

We chose a lightweight pstree library, github.com/sbinet/pstree. It is fairly simple: first, it walks /proc/ to collect every PID:

files, err := filepath.Glob("/proc/[0-9]*")
// ...
procs := make(map[int]Process, len(files))
for _, dir := range files {
	// scan reads /proc/[pid]/stat for this PID directory.
	proc, err := scan(dir)
	if err != nil {
		return nil, fmt.Errorf("could not scan %s: %w", dir, err)
	}
	if proc.Stat.Pid == 0 {
		// process vanished since Glob.
		continue
	}
	procs[proc.Stat.Pid] = proc
}

Next, scan() reads /proc/[pid]/stat to get the pid along with its ppid, name, and other fields.
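Each /proc/[pid]/stat line starts with `pid (comm) state ppid ...`. The comm field is wrapped in parentheses and can itself contain spaces or parentheses, so splitting on whitespace is fragile. Below is a minimal Go sketch of a parser that splits on the last `)` instead. `statInfo` and `parseStat` are hypothetical names, not the pstree library's own code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statInfo holds the first few fields of /proc/[pid]/stat.
// Hypothetical type for illustration only.
type statInfo struct {
	Pid   int
	Comm  string
	State string
	PPid  int
}

// parseStat parses the contents of /proc/[pid]/stat. comm may contain
// spaces or ')', so it is taken as everything between the first '('
// and the LAST ')'.
func parseStat(data string) (statInfo, error) {
	var s statInfo
	lp := strings.IndexByte(data, '(')
	rp := strings.LastIndexByte(data, ')')
	if lp < 0 || rp < lp {
		return s, fmt.Errorf("malformed stat: %q", data)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(data[:lp]))
	if err != nil {
		return s, fmt.Errorf("bad pid: %w", err)
	}
	// Fields after comm: state, ppid, pgrp, session, ...
	rest := strings.Fields(data[rp+1:])
	if len(rest) < 2 {
		return s, fmt.Errorf("too few fields after comm: %q", data)
	}
	ppid, err := strconv.Atoi(rest[1])
	if err != nil {
		return s, fmt.Errorf("bad ppid: %w", err)
	}
	s.Pid, s.Comm, s.State, s.PPid = pid, data[lp+1:rp], rest[0], ppid
	return s, nil
}

func main() {
	// Sample line, truncated after the first few fields for brevity.
	info, err := parseStat("4321 (my (odd) proc) S 1 4321 4321 0 -1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", info)
}
```

In real use the input would come from reading the file, for example with `os.ReadFile`. The full field layout is documented in the proc(5) man page.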
