Linux Disk Space Suddenly Filling Up: A Complete Record of the Troubleshooting Process

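1. In the afternoon, a user reported that files could not be printed to the network printer from one particular server. Other machines were fine.
Error message: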
lpr: error - unable to print file: client-error-request-value-too-long

2. Possible causes of the error:
1. Are you trying to print a file >2GB? If so, that doesn't
work in CUPS 1.1.x and earlier.

2. Does the RequestRoot directory (/var/spool/cups by default)
exist? If not, "mkdir /var/spool/cups"

3. Does the TempDir directory (/var/spool/cups/tmp by default)
exist? If not, "mkdir /var/spool/cups/tmp"

4. Is the disk full? "df -k /var/spool/cups" will show if
this is the case. If the disk is full, delete files to
free up disk space.
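
Items 2 to 4 of the checklist above can be scripted. The following is a minimal sketch, assuming the default paths quoted in the checklist; `check_spool` is a hypothetical helper name, not something shipped with CUPS, and it does not cover item 1 (file size).

```shell
# Sketch of checks 2-4 above. The defaults are the paths quoted in the
# checklist; pass other paths if RequestRoot/TempDir were changed.
check_spool() {
    spool=${1:-/var/spool/cups}
    tmp=${2:-$spool/tmp}

    # Checks 2 and 3: do the spool and temp directories exist?
    for dir in "$spool" "$tmp"; do
        if [ -d "$dir" ]; then
            echo "ok: $dir exists"
        else
            echo "missing: $dir"
        fi
    done

    # Check 4: how full is the filesystem holding the spool directory?
    # df -P prints one line per filesystem; column 5 is the capacity used.
    if [ -d "$spool" ]; then
        df -kP "$spool" | awk 'NR == 2 { print "usage: " $5 " on " $6 }'
    fi
}
```

Calling `check_spool` with no arguments uses the defaults; a usage line at or near 100% points to cause 4.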

3. Diagnosis: cause 4 (disk full)
df -h showed:
/var 100% used.

4. Locating what is using the space
cd /var
find . -maxdepth 1 -type d -print | xargs du -sk | sort -rn
Drilling down this way located the file:
/var/log/bandwidth was taking up 19 GB
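
The find/du pipeline above breaks on directory names containing spaces and also lists the starting directory itself. A slightly more defensive variant might look like this; it is a sketch, assuming GNU find and xargs (for `-mindepth`, `-print0`, `-0` and `-r`), and `biggest_dirs` is a hypothetical helper name.

```shell
# List the immediate subdirectories of a directory, largest first (sizes in KB).
# $1 = directory to inspect (default: current directory)
# $2 = how many entries to show (default: 10)
biggest_dirs() {
    # -mindepth 1 skips the starting directory itself;
    # -print0 / -0 keep names containing spaces or newlines intact;
    # -r stops xargs from running du when there are no subdirectories.
    find "${1:-.}" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -r du -sk 2>/dev/null |
        sort -rn |
        head -n "${2:-10}"
}
```

Run it on /var first, then again on whichever subdirectory comes out on top, until the offending file or directory is found.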

5. Delete it
rm -rf /var/log/bandwidth

6. The space was still not released.

7. Checked CPU usage: the rotate process was consuming a lot of CPU, so it was killed:
kill -9

8. Back to normal; printing works again.

9. Continuing to look for the root cause
find /home -type f -ls | perl -e 'while(<>){$s+=(split)[6];};print "$s\n";'

This will count all hidden files, including the ones from lost+found.

The number should be close to the amount of space actually used on disk.
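
The same comparison can be made with du and df directly. This is a rough sketch, assuming the argument is the mount point of the filesystem being checked (otherwise df counts the whole filesystem while du counts only the subtree); `usage_gap_kb` is a hypothetical helper name.

```shell
# Compare the space du can account for with the space df reports as used.
# $1 = mount point to check (default: /var)
usage_gap_kb() {
    mount_point=${1:-/var}
    # -x keeps du on this filesystem; errors (e.g. unreadable dirs) are hidden.
    du_kb=$(du -skx "$mount_point" 2>/dev/null | awk '{ print $1 }')
    # With df -kP, column 3 of the second line is the used space in KB.
    df_kb=$(df -kP "$mount_point" | awk 'NR == 2 { print $3 }')
    du_kb=${du_kb:-0}
    df_kb=${df_kb:-0}
    echo "du=${du_kb}KB df=${df_kb}KB gap=$((df_kb - du_kb))KB"
}
```

A small gap is normal (filesystem metadata, unreadable directories). A gap of many gigabytes suggests space that no visible file accounts for.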

If df shows something very different, you may want to run something like:
lsof | grep home

Look for some suspicious applications.

The idea is that if an application opens a file and the file is removed
while the application keeps it open, the actual data is not removed from
disk until the program exits or closes the file.

My advice:
1. boot in single mode, recommended from a rescue disk so you have a
'clean' kernel.
2. test du/find versus df output.
3. if different run fsck with -f and, if you can afford to wait, -c flags.
4. I think du/find should show similar numbers now (if it doesn't and
you booted with a 'clean' kernel then I'll be interested for details)
5. reboot with the normal kernel
6. if df shows different than du then most likely you've been hacked -
replace OS with a clean copy - try to find what programs are different
than the original, etc.

In short: if a file is deleted while a process still has it open, the space is not released.
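
Files in this state can be found by scanning /proc. A minimal sketch, Linux-only, assuming the kernel marks such descriptors by ending the symlink target with " (deleted)"; `deleted_open_files` is a hypothetical helper name. Run it as root to see processes owned by other users.

```shell
# List open file descriptors whose underlying file has already been deleted.
# Relies on /proc/<pid>/fd/<n> symlinks ending in " (deleted)".
deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        # Processes may exit while we scan; skip entries that vanish.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)") echo "$fd -> $target" ;;
        esac
    done
}
```

Where lsof is installed, `lsof +L1` gives a similar list (open files with a link count below 1). The space comes back once the owning process closes the file or exits, which is why killing the process in step 7 freed it.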

10. Inspection turned up this line in /etc/syslog.conf:
kern.=debug -/var/log/bandwidth
It had been added automatically when webmin was installed.

Commented it out, and the problem was solved.
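
Commenting the line out can be done by hand in an editor. For illustration only, here is a sketch that does it with awk and prints the result instead of editing the file in place; `comment_out_target` is a hypothetical helper name, and it assumes the log target is the last field on the line, as in the entry above.

```shell
# Comment out every active line of a syslog.conf-style file that logs to a
# given target. Prints the result to stdout; the original file is untouched.
# $1 = config file, $2 = log target path (without the leading "-")
comment_out_target() {
    conf=$1
    target=$2
    awk -v t="$target" '
        # Skip lines that are already comments; compare the last field (the
        # action) with the target, with or without the "-" no-sync prefix.
        !/^[[:space:]]*#/ && ($NF == t || $NF == ("-" t)) { print "#" $0; next }
        { print }
    ' "$conf"
}
```

For example, `comment_out_target /etc/syslog.conf /var/log/bandwidth` prints the edited configuration for review. After installing the change, syslogd has to reread its configuration; with the classic syslogd this is usually done by sending it SIGHUP, though that depends on the syslog daemon in use.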