记一次日志引起的故障:服务器 mongod 日志出现暴增,一天增加了 7+G 之多… Why?

问题现象

内网测试服务器出现了日志暴增的现象,日志描述如下:

1
2
- I WRITE [conn1117] update db.robots query: { date: "2018-05-09", serial:"93542d35-5030-41b9-xxxx-0e36c546c336" } planSummary: COLLSCAN update: { $set: {
- I COMMAND  [conn1117] command db.$cmd command: update { update: "robots", updates: [ { q:{ date: "2018-05-09", serial: "93542d35-5030-41b9-xxxx-0e36c546c336" }, u: { $set: {}}

可以从日志描述中了解到:COLLSCAN,即 全表扫描 导致了 慢查询 问题,进而导致日志所有操作日志都会被记录在日志文件中去,也就产生了日志暴增问题。

那么,问题是:什么原因造成了 COOLSCAN慢操作 是如何定义的? 或超过多少时间的操作被认为是 慢操作

  • 日志描述的主要操作是:客户端的一个定时任务,即:每台设备每 6s 中都要向服务器上报一次设备数据包

问题原因

先来看看文档中的一些介绍:

  • 官网 Doc-Log Messages 里面提到了:

    Client operations appear in the log if their duration exceeds the slow operation threshold(慢操作阀值) or when the log verbosity level(日志详细程度) is 1 or higher. These log entries include the full command object associated with the operation.

  • Log Verbosity Level(日志详细程度)

    default: 0

  • slow operation threshold 慢操作阈值

针对日志记录做测试

  • 测试:每 6s 向服务器发起一次接口请求

  • 结果:

    • 持续时间内,按一定频率(i.e: 6s)请求都会被 mongod 服务记录到日志中去,导致了日志文件暴增
    • 记录到日志中的操作,其耗时都超过了 slowns.default(100ms)
    • 磁盘空间被消耗
  • 分析:查询效率低下导致慢操作被记录日志,而索引设计的不合理又导致了查询效率低下问题

  • 解决:

    • 查询优化 => 代码逻辑优化 + 索引优化 + 数据模型设计优化
    • 日志管理 => 使用 logrotate 回滚拆分日志文件

查询优化主要在代码中实现,这里就不做赘述了。 来看看使用 logrotate 回滚拆分日志文件吧

问题解决:使用 logrotate 回滚拆分日志文件

选型

使用Logrotate管理MongoDB日志-选型

使用Logrotate管理MongoDB日志

logrotate使用指南

找到日志文件及PID记录文件

查看 mongod.conf 就可知晓:

1
2
3
4
systemLog:
  path: /data/log/mongod.log
processManagement:
  pidFilePath: /var/run/mongodb/mongod.pid

编写配置文件

通过 man logrotate 查看详细参数,结合业务需求,编写的配置文件如下:文件存放在 /etc/logrotate.d/ 目录

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
$ vim /etc/logrotate.d/mongod

# 添加下面内容
/data/log/*.log {
    daily
    rotate 3
    size 512M
    dateext
    compress
    missingok
    notifempty
    copytruncate
    create 644 root root
    sharedscripts
    postrotate
        /bin/kill -SIGUSR1 'cat /var/run/mongodb/mongod.pid 2> /dev/null' 2>/dev/null || true
    endscript

验证

执行命令 logrotate -f -v /etc/logrotate.d/mongod 来验证日志是否被轮换了,示例执行结果如下:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
$ logrotate -f -v /etc/logrotate.d/mongod
reading config file /etc/logrotate.d/mongod
Allocating hash table for state file, size 15360 B

Handling 1 logs

rotating pattern: /data/log/*.log  forced from command line (3 rotations)
empty log files are not rotated, old logs are removed
considering log /data/log/mongod.log
  log needs rotating
rotating log /data/log/mongod.log, log->rotateCount is 3
dateext suffix '-20180511'
glob pattern '-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
glob finding old rotated logs failed
copying /data/log/mongod.log to /data/log/mongod.log-20180511
truncating /data/log/mongod.log
running postrotate script
compressing log with: /bin/gzip

$ ls /data/log/
mongod.log  mongod.log-20180511.gz

Note: 当日志文件的实际大小小于配置文件(i.e:/etc/logrotate.d/mongod) 中的 size 阈值时,是不会进行转储操作的 (即:cron.daily 任务也不会进行转储)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
$ du -sh /data/log/mongod.log
248K    mongod.log

# debug 测试 /etc/logrotate.d/mongod 配置, 此时 size=100M
$ logrotate -dv /etc/logrotate.d/mongod
reading config file /etc/logrotate.d/mongod
Allocating hash table for state file, size 15360 B

Handling 1 logs

rotating pattern: /data/log/mongod.log  104857600 bytes (3 rotations)
empty log files are not rotated, old logs are removed
considering log /data/log/mongod.log
  log does not need rotating (log size is below the 'size' threshold)
not running postrotate script, since no logs were rotated

# debug 测试 /etc/logrotate.d/mongod 配置, 此时 size=100K
$ logrotate -dv /etc/logrotate.d/mongod
reading config file /etc/logrotate.d/mongod
Allocating hash table for state file, size 15360 B

Handling 1 logs

rotating pattern: /data/log/mongod.log  102400 bytes (3 rotations)
empty log files are not rotated, old logs are removed
considering log /data/log/mongod.log
  log needs rotating
rotating log /data/log/mongod.log, log->rotateCount is 3
dateext suffix '-20180511'
glob pattern '-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
glob finding old rotated logs failed
copying /data/log/mongod.log to /data/log/mongod.log-20180511
truncating /data/log/mongod.log
running postrotate script
running script with arg /data/log/mongod.log : "
            /bin/kill -SIGUSR1 'cat /var/run/mongodb/mongod.pid 2> /dev/null' 2> /dev/null || true
"
compressing log with: /bin/gzip

确定执行 cron.daily 情况

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
$ tail --lines 1000 /var/log/cron | grep cron.daily
May 11 03:01:01 localhost anacron[15856]: Will run job `cron.daily' in 25 min.
May 11 03:26:01 localhost anacron[15856]: Job `cron.daily' started
May 11 03:26:01 localhost run-parts(/etc/cron.daily)[15929]: starting logrotate
May 11 03:26:01 localhost run-parts(/etc/cron.daily)[15937]: finished logrotate
May 11 03:26:01 localhost run-parts(/etc/cron.daily)[15929]: starting man-db.cron
May 11 03:26:05 localhost run-parts(/etc/cron.daily)[16285]: finished man-db.cron
May 11 03:26:05 localhost run-parts(/etc/cron.daily)[15929]: starting mlocate
May 11 03:26:12 localhost run-parts(/etc/cron.daily)[16296]: finished mlocate
May 11 03:26:12 localhost anacron[15856]: Job `cron.daily' terminated
[root@VM_110_229_centos ~]$

See Also

Thanks to the authors 🙂