This blog turned one year old last month, and I’d like to think my readers have been glad to read it. I’ve never asked for help here – I’ve only offered it to people who needed it. But today I’m forced to ask my readers for help, because for the first time in my long practice I really don’t know how to solve a problem.
So, here is the problem. The company I work for has a project (a really mission-critical one), major parts of which were written in PHP and ran on a not-so-fast server for years. Recently we decided to move it to a new, pretty powerful server – 2x Xeon 3.2GHz with 4GB RAM – to improve our users’ experience.
I installed Debian Etch with the 2.6.18-4 kernel, lighttpd, and PHP as FastCGI on this server and moved all the software there. After one week of testing and running in “safe mode” we finished the migration and were really glad, because everything “worked” smoothly. But yesterday we found a problem – a really weird one.
We have some “big” pages in our client area – 700KB–1.5MB HTML pages with stats. Yesterday one of our clients noticed that some of these pages didn’t fully load, and that sometimes the browser asked him to save a “Binary PHP” file to disk. After some initial investigation I was able to reproduce the bug, but I didn’t know what was happening – Firefox asked me to save the result to disk and sometimes pages didn’t fully load, but there were no entries in the PHP or lighttpd error logs.
I spent about 12 hours trying lighttpd; PHP 4.4, 5.0, 5.1, 5.2.0, 5.2.1, and 5.2.2RC2; Apache 1.3, 2.0, and 2.2; and nginx 0.4.x and 0.5.x, but nothing solved the problem – I was able to reduce how often it appeared, but it still happened sometimes.
After a day and a half without sleep I decided to trace the entire page creation process from the beginning: I created a 1MB file consisting of spaces, named it test.php, made a copy of it named test.php.bin, and wrote a test script that downloaded both of these files 100 times while I watched what happened. Every download of test.php.bin gave us a 1MB file, but almost 70% of the downloaded ‘php’ files were corrupted by 1–2 binary blocks inside the downloaded data (those blocks were runs of 0x00 bytes).
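For anyone who wants to reproduce this, here is a minimal sketch of that test case (the quick script I actually used was uglier; the localhost URL and file locations are placeholders for our setup):

    <?php
    // Create two identical 1MB files of spaces in the document root:
    // one will be served through PHP FastCGI, the other as a static file.
    file_put_contents('test.php', str_repeat(' ', 1024 * 1024));
    copy('test.php', 'test.php.bin');

    // Download each file 100 times and count responses that are truncated
    // or contain NUL bytes.
    foreach (array('test.php', 'test.php.bin') as $file) {
        $broken = 0;
        for ($i = 0; $i < 100; $i++) {
            $body = file_get_contents("http://localhost/$file");
            if (strlen($body) != 1024 * 1024 || strpos($body, "\x00") !== false) {
                $broken++;
            }
        }
        echo "$file: $broken broken downloads out of 100\n";
    }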
Strace’ing the PHP FastCGI process showed that PHP sends all the requested data to the web server, so PHP had nothing to do with the problem. The next step was to trace the web server’s process (nginx at that moment). I started nginx in single-worker mode and ran the test script against it – the results were the same: 100 correct binary files and about 70 broken text files. The next step was to attach strace. As soon as I started it I noticed a really high CPU load because of the intensive logging, and when the test script finished I was shocked – all 200 requests to the server had succeeded!
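For reference, attaching to the single worker looked roughly like this (the worker PID is whatever ps reports for it):

    # follow forks, timestamp every syscall, log everything to a file
    strace -f -tt -o /tmp/nginx.trace -p <worker_pid>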
During the next few hours I tried playing with epoll/select/poll, some /proc/sys/net tunings, and so on, but the results were always the same – when the server was under high load everything was fine, and when there was no load, files were broken.
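To give an idea of what I mean by ‘playing’ with these, the changes were along these lines (example values only – none of them helped):

    # nginx.conf – forcing a particular event notification method
    events {
        worker_connections  1024;
        use epoll;    # also tried select and poll here
    }

    # /proc/sys/net tunings of this sort (example values, not recommendations)
    echo 262144 > /proc/sys/net/core/wmem_max
    echo 262144 > /proc/sys/net/core/rmem_max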
At that point I came up with a really weird workaround – I added a limit_rate of 128KB/s to the nginx config file, and for now everything looks better – all tests pass and there are no more problems in our browsers.
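In nginx terms the workaround is a single directive (128k here means 128 kilobytes per second for each response):

    # nginx.conf – the weird workaround: throttle every response
    http {
        limit_rate  128k;
    }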
So, that’s it. What I want to ask my readers is to help me find the source of this problem, because I really don’t want to hit it on another server with no appropriate solution. Thanks to all in advance.