Ubuntu Servers :: IOwait High, No Disk Writes, Server Hangs?
Jul 24, 2011
I run a website that has a very steady flow of traffic and I'm seeing recent issues that I just don't like. Server is 10.04.2 on a Supermicro i7-950/6GB RAM with two 500GB Samsung F3 drives in a software RAID1 (1x5400, 1x7200) and for several weeks, it's been running very well. Recently, I'm seeing the server hang for 5-20 seconds. IOwait goes through the roof, nothing can write to the disk. Apache logs stop, redis fails to rebuild caches, mysql errors and then it continues and moves back to normal operation.
/ is ext4, the kernel was 2.6.32-server-x64 but since updating to 2.6.38-server-x64, the issue has dropped from maybe once per 10 minutes to once per 15 minutes. 3 IOstat copy/pastes show this when it hangs.
Code:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
[code]...
No SMART errors or SMART diagnostics show any issues with any of the disks, and kernel.log shows next to nothing other than a process hang (120 seconds) about 5 days ago.
I've been having a problem in Ubuntu 9.10 recently where starting about 2 minutes after startup my computer slows down and becomes unresponsive. I believe the problem is associated with a high IOWait because I have the system monitor applet on my Gnome Panel and it displays 100% IOWait every time my system starts to slow down.
I have tried booting into other kernel versions and the problem persists. I don't really know what IOWait is or how to diagnose this problem further. I've looked around online and it seems like you have to find a specific process that is causing the IOWait, but I don't understand how to go about doing that.
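One rough way to see which processes are stuck waiting on I/O is to look for tasks in uninterruptible sleep (state D in /proc/<pid>/stat), since those are usually the ones blocked on the disk. The sketch below is only an illustration of that idea, not a replacement for a proper I/O monitor; the function names parse_stat and blocked_processes are made up for this example.

```python
import os


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' rather than on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    command = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for tasks currently in uninterruptible sleep (D)."""
    if not os.path.isdir(proc_root):
        return []
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state == "D":
            blocked.append((pid, command))
    return blocked


if __name__ == "__main__":
    for pid, command in blocked_processes():
        print(pid, command)
```

Running it a few times while the machine is slow should show whether the same command keeps turning up in state D. A single run is only a snapshot, so an empty result does not prove that nothing is blocked.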
I've got a new Ubuntu 10.04 server install with a new 3 disk RAID 5. The boot disk is separate, not part of the RAID. I was trying to practice what I would do if a disk died to recover the RAID, so I unplugged one of the three disks. The machine now just hangs on startup. It shows fsck at the top of the screen but doesn't go anywhere from there. If you press a key it shows the Ubuntu splash screen. If I plug the disk back in, everything boots up normally. So, my question is, how do I get the machine to boot with one of the RAID members missing? I know I can recover it using the Live CD, but it would be nice to be able to get back into the machine without the CD.
I am running the latest apache2 available in the lucid repos on my desktop. All packages are updated as of this moment. In the root of my web server I have placed several soft links that point to folders on other ext3/NTFS partitions on the same disk. When I try to download any large file (say above 500M) from this server using Firefox, my desktop freezes as soon as the 'save' window appears, and I notice very high CPU, RAM and disk usage, even though I have not yet clicked 'ok' to save the file. This issue is not present when the file size is small. Note that Firefox and the webserver are running on the same computer.
Also I have tried nginx and lighttpd and the issue is present there as well. When I tried downloading the same files using Internet Explorer 6.0 in an XP VM, the issue was not present. However, on Windows as well, using Firefox, the issue recurs.
I'm using 9.10 Desktop 64bit on a Dell Latitude D830. I have 4 gigs of RAM and a 7200 rpm sata hard drive. Everything works pretty well, video, sound, and network. Flash isn't as smooth as in Windows but I assume that's a Flash/64bit thing and not necessarily an Ubuntu thing.
However, one area of performance still lags far behind my Windows XP experience, and that's disk writes. For instance, when I'm copying a large amount of information from a USB drive or from my Windows partition to my native partition, I can barely switch windows until the task is complete, nevermind trying to surf the Web. I thought that that might be related to the slow performance of the ntfs driver, but recently I have been doing a lot of work with VMWare, and I get the same result when trying to pause virtual machines - it writes a largish amount of information to disk in a short amount of time and I can't do much until it finishes.
Here are some things I looked at based on other threads to try to debug my issue:
That disk read speed is a little faster than average - several more tests showed it hovering around the 66 MB/sec range. Of the other info, I see that UDMA is on which I understand is good, but if there is something else there that I should fix I don't see it.
I know that my computer can handle these tasks (at least the vmware stuff) without such a significant interface slowdown because in XP I did it with less RAM than I have now. Is this just the way that the linux kernel scheduler fails to account for UI needs?
We are graphing various system parameters using Cacti. One of our graphs shows hard drive reads and writes. A question came up: why do we need this graph?
Can anyone explain the difference between IOWAIT%/IDLE%/LOAD AVG? We have 3 servers (Oracle RAC) attached to a SAN. When we added another CPU to our 3rd server, the IOWAIT across the Oracle RAC dropped a lot. We used to get pretty high IOWAIT % before. So I don't really understand how the extra CPU could do this. I thought the IOWAIT % is high because the server is waiting for the SAN, so I thought the SAN was the bottleneck. So I don't really understand the difference between the 3 things (iowait, idle and load avg).
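For reference, on Linux the iowait and idle figures both come from the per-category CPU time counters on the "cpu" line of /proc/stat: iowait is idle time during which at least one I/O request was outstanding, and each percentage is taken over the total CPU time of all CPUs. Load average is a separate number (tasks that are runnable or in uninterruptible sleep). Because of the shared denominator, adding a CPU may lower the iowait percentage even if the storage is no faster. The sketch below shows how the percentages fall out of two samples; cpu_breakdown is a made-up helper name and the sample numbers are invented.

```python
def cpu_breakdown(before, after):
    """Percent of CPU time per category between two 'cpu' lines of /proc/stat."""
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    # Field 0 is the label "cpu"; the counters follow in the order above.
    first = [int(v) for v in before.split()[1:len(names) + 1]]
    second = [int(v) for v in after.split()[1:len(names) + 1]]
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: 100.0 * delta / total for name, delta in zip(names, deltas)}


if __name__ == "__main__":
    # Invented sample values, standing in for two reads of /proc/stat
    # taken a few seconds apart.
    sample_before = "cpu 100 0 50 800 50 0 0 0"
    sample_after = "cpu 200 0 100 1400 300 0 0 0"
    for name, percent in cpu_breakdown(sample_before, sample_after).items():
        print(f"{name:8s} {percent:5.1f}%")
```

In real use the two strings would be the first line of /proc/stat read at the start and end of the interval of interest.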
I downloaded the x86 server version for Lucid twice today. Both times, before burning, the MD5 sums matched those published on the Ubuntu site. I am attempting to install the system on a Dell PowerEdge 2500: 2x 1.4 gb P4 procs, 1 gb ram, 750 GB HDD (raid). I boot to the installer. It goes through the standard language checks, keyboard, checks for cdrom, then attempts to start copying files.
It gets to 17% copy on fs-core-modules-2.6.32.21-generic-di and hangs there for 5 minutes or so. Then it comes back with the error that it could not get the file and asks if I want to retry. All retries lead back to this error. As I said, the MD5 is exact, the burner does not complain of a bad burn, and the disk is closed at the end of the burn. So, has anyone else been able to install Lucid server x86? I burned four disks today, one pair on each of two different burners. All fail at 17% retrieving that file.
I have Fedora installed on a netbook. I customarily mount several NFS shares on this machine, from both a desktop system running F11 and a small server running FreeBSD. On the server side the shares are write-enabled. In the past this has worked fine.
However, since upgrading to F12, my configuration no longer works as before. Reading from the NFS shares is no problem, but as soon as I try to write to one, either in Nautilus or from any other program, including on the cmd line, all hell breaks loose. Nautilus crashes, and I am unable to remount the shares. Usually rebooting the client is my only recourse.
There are no clues in dmesg on either the client or the server. In terminal trying to remount a "trashed" share I see this:
Code:
$ sudo mount -v venus:/media/disk8 /media./disk8
mount: no type was given - I'll assume nfs because of the colon
mount.nfs: timeout set for Sat Feb 20 17:34:59 2010
mount.nfs: prog 100003, trying vers=3, prot=6
[Code].....
The NFS versions under F11, F12, and FreeBSD are the current ones (all updates applied).
I just installed 10.04 i386 Server (release version) last night on an older IBM workstation (sorry, for the life of me can't remember model and I'm at work now); P4 2.4, 512Mb RAM, generic Intel graphics. This was a base install (aside from OpenSSH) with no window mgr. The HDD was wiped during the install and GRUB2 wrote to the MBR.
The install went smoothly, and on the soft reboot after the install I got a shell and was able to login, etc. and even SSH in. I unplugged the machine, moved it, booted, and got a cursor in the upper-left of the screen for about 3 seconds, and then the monitor would enter power save mode. (I tried 2 different models of monitor). While the cursor was on the screen the HDD access LED was blinking with activity, but that would stop right before the screen would enter standby. Tried a few reboots with the same result.
I'm thinking this is not a video driver issue (as similar threads have suggested), because the splash screen for the install disc renders properly and on the first successful boot the shell rendered appropriately at a high resolution. I can hold down shift to enter the GRUB2 menu, which also renders at a higher-than-text resolution and is defaulted / displays the correct options.
Most other threads with a similar yet non-video issue centered around GRUB2 issues with a dual-boot machine, which mine is not. I did go in via rescue mode (off the install CD -- the rescue option of the GRUB menu has the same result as a standard boot), and checked that my grub.cfg was correct, and I ran grub-install to no avail, and also ran grub-mkconfig on a subsequent boot to no avail, yet the grub.cfg generated appeared identical to the original generated during install.
The partition/file system is default layout.
20 Gb IDE (/dev/sda)
/dev/sda1 /boot ext2
/dev/sda2 Ext
[code]....
I'd be happy to give more info if someone has a troubleshooting direction in which to go.
I'm trying to track down a problem with my server. Basically, I can get it to boot with a LiveCD and do rescue stuff with it - I just can no longer get it to boot. I've been slowly working my way through all of the error messages that come up during the boot process, and knocking them off one by one. Right now, they're fairly minor.
I'm trying to find out what the big issue is by looking at the logs. The last thing that it does is bring up apache2, says [OK] and then hangs. I've tried to do ls -lustR on /var/log to see what's the last log touched, and that's syslog or boot.log, but the last entry says something along the lines of Apache2 is up and running.
The attached log file includes two crashes/reboots within the past day or so. I have recently started trying to set up / manage a Linux (Ubuntu 10.04.2 LTS) server in our data center (all other servers are Windows boxes). The server periodically hangs and becomes unresponsive and I'm at a loss to find anything in any log that indicates a specific cause. Sometimes it's up for hours, sometimes days (14 days at longest). Plugging a monitor in to the machine after a hang shows nothing at all. In an effort to troubleshoot the problem we've tried disabling APIC, more out of "educated desperation" than anything else. Unfortunately we are limited in some of the troubleshooting we can do, as we have a single client website hosted on the box (the reason we set it up) so anything that involves significant downtime is a problem.
As this is our first attempt at setting up a linux box, we are using a "well equipped" desktop grade machine but not what I would call "server grade" hardware. This is a standalone box, not a VPS. We are using a hardware, not software, RAID array and have plenty of memory in the box.
Caveats / Background:
I am relatively new to Linux in general. I spend much more time writing code than managing servers. I'm comfortable with working on the box, but I'm not really a sysadmin guy. I'm comfortable with the command line but have more experience with OS X (BSD). I am unsure of all of the tools / information / Logs that may be available, though I try to be thorough in checking what I do know. I did not physically configure the hardware so I'm not sure of all of the specs but I can get any info I need to troubleshoot. I may be skipping very basic steps or missing obvious places to look for information without knowing it.
A little more detail:
Real memory: 8GB
Ubuntu 10.04.2 LTS
Hardware RAID 10
Managing sites with Webmin version 1.550
Server is in a remote data center. Hands-on troubleshooting is difficult. We have attempted two Linux setups at this point. The first was on a hardware config identical to this one, but with no actual pieces of hardware reused. That attempt was using CentOS and we were attempting to set up CPanel. We scrapped that install because of this same problem (periodic crashing / hanging). The second attempt (this one) is showing the same behavior. The only thing I can really see in common is the hardware configuration (though CentOS & Ubuntu may have more in common than I think).
The box will run fine for hours, days, or even weeks, and then just stop responding entirely. I check all of the logs I know to check (primarily messages, syslog and kern.log) but I don't see anything that seems like an error to me. I do see lines that I don't understand that may or may not be problems, such as:
Code:
rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="814" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Most of our syslog entries seem to be logs of webmin related cron jobs running. My gut tells me that there is possibly some component in our configuration that Linux does not like or that needs a driver update (maybe the raid card for example), but I'm unsure of how to do more to track down or determine what that might be. Guess and check is expensive. Another thought I've had is that one or more of the cron jobs that are running are tripping something up, but it doesn't appear to be reproducible on demand and, again, I'm at a loss on how to test that theory any further. The same cron job does not appear to be running each time the server goes down. This is a portion of the log just prior to our last hang:
Code:
Aug 8 11:00:01 linhost01 CRON[10771]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
I have a fileserver running 10.04 server 64bit and samba. I connect it to my desktop, which is 10.04 desktop 64bit. I have the server mounted on my desktop in fstab as: //10.0.0.2/share /media/share cifs guest, uid= 1000. Up until 30 June 2010 it was all fine. Now when I write to the server it is very slow, e.g. 2Mbps, though when I read I get >100Mbps, so I think my network is still ok. If I use nautilus smb://10.0.0.2/share I can write at >100Mbps and also read at >100Mbps... So any ideas why the write speed via the fstab samba mount has started to go really slowly in the last couple of days?
I have been trying to set up an LDAP server for a development environment as part of an internship for a week now, and I cannot get past this point. I have been following the 10.04 server guide to set up LDAP here: URL... Once I get to the following point in the guide, it just hangs: "As an example of modifying the cn=config tree, add another attribute to the index list using ldapmodify:" I've been working on this for a week and can't understand why this won't work. I am fairly certain that I've followed the guide to a 'T.' Any idea why I am receiving a permission denied error? Is this a permissions issue with one of the config files?
I have two PCs that I've recently converted to Lenny boxes, and I've noticed that whenever there's a lot of disk IO, one of my cores (both PCs have dual core CPUs) jumps to 100%. Is this normal? They're both amd64, and I think both of them should be using the AHCI SATA driver.
If you need any more information about my setup, let me know.
Most of the time when using USB flash drives under Linux, the OS will lock up at 100% IOWAIT, with a load average of 12-18. Even my music stutters and stops. Using USB drives makes my Core I7 3.0GHz HyperThreaded act like a Pentium 2 running Vista. It's embarrassing. A client asks for a file, and it takes me 2 hours to copy it. This is a deal breaker. I've been trying to find a solution. If I can't, I'm sorry to say I'll be leaving Linux for something that can use USB drives. I've been using Ubuntu for 4 years, I'd hate to leave. Don't take that as threat, just a stark fact based on business decisions.
I have Bind on the machine at present, but I was wondering how much disk would be needed to make it a full DNS server that could act in place of a dead upstream service.
I am having a problem that I would call a bit "important" with my server. Over the last 3 weeks the used space on my hard disk (RAID 1) started growing. I have 2 x 1 TB HDDs working in RAID 1 and I did not install anything during those weeks. The used space just went from 90 GB to 580 GB. The situation is stable there now, but I think it's not normal.
The bandwidth usage is low (about 120 GB in 2 months) and I am running 6 Counter-Strike game servers, a forum, a very small website and some local stuff... A friend of mine told me that my server could have been hacked, and I am afraid it has been. Some useful information: when I reboot the server the used space goes down again to ~100 GB and then it starts going up again. I can't really find where all those files are located:
My problem is extremely slow writes to the hard disk with 100% CPU usage. It happens when I write something to the internal hard drive, not to any external drive.
Tried a fresh ubuntu install. No change. I am not even sure if it is a software or hardware problem.
Core 2 Duo E4600
2GB DDR2 RAM (1 stick)
Intel ICH10R based motherboard (tried an ICH9R as well)
4-port SATA controller (PCI Sil 3114)
O/S: Ubuntu Desktop x64 10.04 LTS (using 'desktop' because I like having a remote desktop)
The Storage Setup
Disks: Assorted selection of 9 disks: 750GB, 1000GB and 1500GB Seagate and Western Digital disks. The disks are joined through a standard LVM2 configuration. I don't know the LVM term, but normally you'd call it a JBOD setup. On that LVM device, I've put a cryptsetup device, made with the LUKS tools (aes-xts-plain 256). On the cryptsetup device, I've created and mounted an EXT4 partition.
All in all, a completely standard LVM2 and LUKS setup, running EXT4. After a reboot, I proceed to unlock my cryptsetup encryption device, and then mount the EXT4 partition. All is well, the mount is accessible and everything looks fine. I then try to send a file to the mount, via Samba. After a few hundred MB written, the I/O wait goes berserk. It stays at 50% (dual core setup remember). The system becomes unresponsive to network commands (can't browse samba) for about 5-10 minutes. When it finally responds, the I/O wait is gone and everything is now fine. I can write and read hundreds of GB's of data without any issues at all. I can benchmark and stress all disks perfectly fine and no logs are showing disk errors.
I tried monitoring my disks with 'iostat -d 2' while the I/O wait was happening, and there is some slight Blk_read/s activity on 1 disk at a time. First, for example, /dev/sda is showing a little Blk_read/s activity, then it jumps to the next disk, and when every disk has shown that slight Blk_read/s activity (500-800 or so) the problem is gone and the I/O wait is no more. I've tried changing motherboards, switching disks around on the controllers, checking individual disks, replacing disks and I've tried different versions of Ubuntu. The problem however persists. I could see it being a network issue, possibly a driver issue. But since the NIC is a standard on-board RTL8111, it seems unlikely that the problem wouldn't be more widespread, since this NIC is literally being used everywhere. I did change my motherboard, so a faulty NIC twice in a row seems unlikely.
I am learning to use Ubuntu as my server and learning to use a VPS too.
Now I am getting confused about my server's memory usage. I just have 3 sites, 1 blog site and 2 company profiles, but Apache memory usage is more than 300MB and total memory use on my server is more than 500 MB (maximum 512MB burst memory).
I am using Drupal for my websites. Is this normal? Last week, memory consumption on my server was no more than 380 MB.
I upgraded my webserver to the new Ubuntu Server 10.04 (x86-64). After the upgrade, the load increased from 0.3 to 1.4. The webserver runs phpBB, which is now generating slow queries that did not occur before the upgrade to Lucid. HW conf: Intel Core i7, 8GB RAM, WD Raptor 10k rpm. The upgrade to the new version took place in week 17.
I am having a problem with the server that I use to host my personal site. The load average quite often spikes to exceed 1.00 for the 1 and 5 minute intervals, and the 15 minute interval gets above .5. This occurs while the server is idle, serving very few or no requests and with the CPU 99% idle with <1% IOWAIT usage. I have checked top and vmstat, but neither one provides any useful info. Top continues to say the CPU is 99% idle, and vmstat says that there are 0 runnable and 0 blocking tasks. Occasionally, vmstat will say that there is 1 runnable task, but this doesn't even coincide with the load average spikes. I have already searched for other solutions to this problem, but everything I have seen says to use top and/or vmstat, but those aren't showing anything out of the ordinary. Can anyone recommend anything else I might do?
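As far as I know, the Linux load average counts tasks in uninterruptible sleep as well as runnable ones, and short bursts can fall between the samples that top or vmstat take. One thing that might help narrow it down is to log /proc/loadavg at a short interval and compare the timestamps with cron jobs or other periodic activity. The parser below is a small sketch for that; parse_loadavg is a hypothetical helper name and the sample line is invented.

```python
def parse_loadavg(text):
    """Split the single line of /proc/loadavg into named values."""
    fields = text.split()
    one, five, fifteen = (float(value) for value in fields[:3])
    # The fourth field is "currently runnable / total" scheduling entities.
    running, total = (int(value) for value in fields[3].split("/"))
    return {
        "1min": one,
        "5min": five,
        "15min": fifteen,
        "running": running,
        "total": total,
        "last_pid": int(fields[4]),
    }


if __name__ == "__main__":
    # Invented sample line; on a real system read /proc/loadavg instead.
    print(parse_loadavg("1.05 0.70 0.52 2/311 4127\n"))
```

Calling this once a second from a loop and writing the result with a timestamp gives a record that can be lined up against the spikes.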
My server has a Pentium 4 HT 3 GHz processor, 2GB RAM, and runs Kubuntu 10.10. (The reason it runs Kubuntu instead of Ubuntu Server is that it needs an X environment so that the Nvidia driver can initialize and put its graphics card into a power saving mode.)
I have a home server based on Ubuntu Linux 10.04.2.
Hardware:
Motherboard - Asus AT4NM10-I (Intel NM10, PCI)
CPU - Integrated Intel Atom D410
RAM - 2 Gb
LAN - D-Link DGE-528T Gigabit Adapter
The provider gives an 8/2 Mbit ADSL connection.
I tried Deluge and Transmission, and both the integrated and the external network card, with no luck.
When a torrent file is being seeded at top speed, the network starts freezing, the server is almost unreachable, video freezes when watching it over the LAN from the server... etc...
When I pause the upload, everything starts working ok!
The network is based on a gigabit switch and copper UTP cables...
top says there's only 12MB free (out of 1GB), but I can't figure out what's using all the RAM. rtorrent is using 13MB, and the rest are in bytes. (ran top as root)
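A common explanation for this is that the "free" figure excludes memory the kernel is using for buffers and the page cache, most of which it can hand back when programs need it. A rough check is to add the Buffers and Cached values from /proc/meminfo to MemFree. The sketch below does that on sample text; the helper names and the sample numbers are invented for illustration, and the sum is only an estimate because part of Cached (tmpfs and shared memory, for instance) cannot be reclaimed.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def reclaimable_estimate(values):
    """Rough upper bound, in kB, on memory that is free or easily freed."""
    return values.get("MemFree", 0) + values.get("Buffers", 0) + values.get("Cached", 0)


if __name__ == "__main__":
    # Invented sample in the /proc/meminfo layout; read the real file in practice.
    sample = (
        "MemTotal:        1024000 kB\n"
        "MemFree:           12288 kB\n"
        "Buffers:           50000 kB\n"
        "Cached:           700000 kB\n"
    )
    info = parse_meminfo(sample)
    print("free:", info["MemFree"], "kB")
    print("free + buffers + cached:", reclaimable_estimate(info), "kB")
```

If the second number is large, the RAM is most likely being used as disk cache rather than being held by a process.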