SOS!

Posted by Oleksiy Kovyrin under Blog, Networks

This blog turned 1 year old last month and I think all of my readers were glad to read it. I’ve never asked for help here; I’ve offered help to people who needed it. But today I’m forced to ask my readers for help, because for the first time in my long practice I really don’t know how to solve a problem.

So, here is the problem. The company I work for has a project (a really mission-critical one), major parts of which were written in PHP and ran for years on a not-so-fast server. Recently we decided to move it to a new, pretty powerful server (2x Xeon 3.2 GHz with 4 GB RAM) to improve our users’ experience.

I installed Debian Etch with the 2.6.18-4 kernel, lighttpd and PHP as FastCGI on this server and moved all the software there. After one week of testing and working in “safe mode” we finished the migration and were really glad because everything “worked” smoothly. But yesterday we found a problem, a really weird one.

We have some “big” pages in our client area: HTML pages with stats, some 700 KB to 1.5 MB in size. Yesterday one of our clients noticed that some of these pages didn’t fully load and that sometimes the browser asked him to save a “Binary PHP” file to disk. After some initial investigation I was able to reproduce the bug, but I didn’t know what was happening: Firefox asked me about saving the results to disk and sometimes pages didn’t fully load, but there were no entries in the PHP and lighttpd error logs.

I spent about 12 hours trying lighttpd; PHP 4.4, 5.0, 5.1, 5.2.0, 5.2.1 and 5.2.2RC2; Apache 1.3, 2.0 and 2.2; and nginx 0.4.x and 0.5.x, but none of it solved the problem. I was able to reduce the likelihood of the problem appearing, but it still happened sometimes.

After 1.5 days without sleep I decided to trace the entire page creation process from the beginning: I created a 1 MB file consisting of spaces, named it test.php, made a copy of it and named it test.php.bin. Then I wrote a test script that downloaded both of these files 100 times and watched what happened. Every download of test.php.bin gave us a 1 MB file, but almost 70% of the downloaded ‘php’ files were broken by 1-2 binary blocks inside the downloaded data (those blocks were runs of 0x00 bytes).
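The test described above can be sketched roughly as follows. This is a minimal Python sketch, not the original test script (which is not shown in this post); the base URL `http://localhost` and the helper names `make_test_files`, `find_nul_blocks` and `run_test` are assumptions made for illustration.

```python
import http.client
import re
import urllib.request

SIZE = 1024 * 1024  # 1 MB of spaces, as in the test described above


def make_test_files(directory="."):
    """Create test.php and an identical copy named test.php.bin."""
    payload = b" " * SIZE
    for name in ("test.php", "test.php.bin"):
        with open(f"{directory}/{name}", "wb") as fh:
            fh.write(payload)


def find_nul_blocks(data):
    """Return (offset, length) for every run of 0x00 bytes in data."""
    return [(m.start(), len(m.group())) for m in re.finditer(rb"\x00+", data)]


def run_test(base_url="http://localhost", rounds=100):
    """Download both files `rounds` times and count broken responses.

    base_url is a placeholder; point it at the server under test.
    """
    broken = {"test.php": 0, "test.php.bin": 0}
    for _ in range(rounds):
        for name in broken:
            try:
                with urllib.request.urlopen(f"{base_url}/{name}") as resp:
                    body = resp.read()
            except http.client.IncompleteRead:
                # The response ended before Content-Length was reached.
                broken[name] += 1
                continue
            if len(body) != SIZE or find_nul_blocks(body):
                broken[name] += 1
    return broken
```

Calling `make_test_files()` in the document root and then `run_test()` returns the number of broken downloads per file. Since the payload contains only spaces, any 0x00 byte in a response indicates corruption.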

Strace’ing the PHP FastCGI process showed that PHP sends all of the requested data to the web server, so PHP is not to blame. The next step was to trace the web server’s process (nginx at that moment). I started nginx in single-worker mode and ran the testcase script against it; the results were the same: 100 correct binary files and about 70 broken text files. Then I started strace and noticed really high CPU load because of the intensive logging. When the testcase script finished its work I was shocked: all 200 requests to the server were successful!

Over the next few hours I tried to ‘play’ with epoll/select/poll, some /proc/sys/net tunings, etc., but the results were the same: when the server was under high load everything was fine, and when there was no load, files were broken.
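When experimenting with /proc/sys/net settings like this, it can help to record the values before and after each change (or on two different servers) and compare them. Below is a small Python sketch of that idea. The post does not say which settings were changed, so the `tcp_` prefix and the function names are assumptions chosen for illustration.

```python
import os


def snapshot_sysctls(root="/proc/sys/net/ipv4", prefix="tcp_"):
    """Return {name: value} for readable settings under root that start with prefix."""
    settings = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if not name.startswith(prefix) or not os.path.isfile(path):
            continue
        try:
            with open(path) as fh:
                settings[name] = fh.read().strip()
        except OSError:
            # Some entries are not readable without root; skip them.
            continue
    return settings


def diff_snapshots(old, new):
    """Return {name: (old_value, new_value)} for settings that differ."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}
```

Taking one snapshot on the old server and one on the new server and passing both to `diff_snapshots` lists only the settings whose values differ, for example `tcp_window_scaling`.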

At that point I came up with a really weird workaround: I added a limit_rate setting of 128 Kbytes/sec to the nginx config file, and for now everything looks better. All tests work fine and there are no more problems in our browsers.

So, that’s it. What I want to ask my readers is to help me find the source of this problem, because I really don’t want to hit it again on another server without a proper solution. Thanks to all in advance.



21 Responses to this entry

SheLTeR: id’s blog » Blog Archive » SOS! says:

[...] At the company where I work with my friend Alexey Kovyrin, he has run into a really weird problem with php+apache/lighttpd/nginx after migrating a project to a new server. He worked on a solution for 34 hours, but with no (normal) results. He is asking for help in finding out why it happens and what causes it. A detailed description of the problem is in his blog. Any thoughts are welcome, thanks to all in advance! [...]

Dan Kubb says:

I see no one’s replied yet, so I’ll throw my hat in. Not sure how much help I can be — I’m not a PHP developer, but I have debugged PHP/FastCGI with LightTPD and Nginx before, so anything’s probably better than nothing…

The weird part about this is that you tried Apache, Nginx and LightTPD and you got the same result with each. This plus the fact that strace shows PHP is sending all the requested information to the server is really weird.

Have you tried communicating with the FCGI process using a unix socket AND an internet socket? Can you eliminate the web server from the equation, and write a script that uses the FastCGI protocol to talk to the FCGI process directly?

When I hit a wall, I usually try to reduce things to the smallest number of moving parts and then try again. Close the feedback loop as tightly as possible. Eliminate the web server if you can. Try to compile PHP in FastCGI mode with no external library dependencies. Heck, even see if you can execute the script from the command line with the ENV variables set to expected values and try to duplicate the problem.

Scoundrel says:

2Dan Kubb: Thanks for the advice. I’ve tried unix and tcp sockets without luck :-(

Your idea of writing some wrapper to be sure that php sends correct data is really great – will try to implement it tomorrow.

Dan Kubb says:

@Scoundrel: I wasn’t even suggesting writing a wrapper, I was more suggesting writing a “bot” that communicates with the script via the FastCGI protocol directly. You could run it directly against the app and raise an exception if the data comes out other than expected.

A wrapper seems like a more difficult problem to solve. A bot would simply need to simulate a FastCGI request, and log the response.
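A “bot” of this kind could look roughly like the Python sketch below. It is only an illustration of the idea, not code from this thread: it speaks the FastCGI record format directly over a TCP socket. The function names and the exact set of CGI parameters are assumptions; a real PHP setup may need more variables.

```python
import socket
import struct

FCGI_VERSION = 1
FCGI_BEGIN_REQUEST, FCGI_END_REQUEST = 1, 3
FCGI_PARAMS, FCGI_STDIN, FCGI_STDOUT, FCGI_STDERR = 4, 5, 6, 7
FCGI_RESPONDER = 1


def record(rec_type, content, request_id=1):
    """Build one FastCGI record: 8-byte header followed by the content."""
    header = struct.pack("!BBHHBx", FCGI_VERSION, rec_type, request_id, len(content), 0)
    return header + content


def name_value(name, value):
    """Encode one name-value pair (lengths above 127 use the 4-byte form)."""
    def length(n):
        return struct.pack("!B", n) if n < 128 else struct.pack("!I", n | 0x80000000)
    return length(len(name)) + length(len(value)) + name + value


def build_request(params, request_id=1):
    """Serialize a GET-style responder request with an empty stdin."""
    begin = struct.pack("!HB5x", FCGI_RESPONDER, 0)  # flags=0: app closes the socket
    body = b"".join(name_value(k.encode(), v.encode()) for k, v in params.items())
    return (record(FCGI_BEGIN_REQUEST, begin, request_id)
            + record(FCGI_PARAMS, body, request_id)
            + record(FCGI_PARAMS, b"", request_id)
            + record(FCGI_STDIN, b"", request_id))


def parse_response(stream):
    """Split a raw response byte string into (stdout, stderr)."""
    out, err, pos = b"", b"", 0
    while pos + 8 <= len(stream):
        _, rec_type, _, length, padding = struct.unpack("!BBHHBx", stream[pos:pos + 8])
        content = stream[pos + 8:pos + 8 + length]
        pos += 8 + length + padding
        if rec_type == FCGI_STDOUT:
            out += content
        elif rec_type == FCGI_STDERR:
            err += content
        elif rec_type == FCGI_END_REQUEST:
            break
    return out, err


def fetch(host, port, script_path):
    """Send one request straight to the FastCGI process; return (stdout, stderr)."""
    params = {
        "REQUEST_METHOD": "GET",
        "SCRIPT_FILENAME": script_path,
        "QUERY_STRING": "",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "GATEWAY_INTERFACE": "CGI/1.1",
        "REDIRECT_STATUS": "200",  # assumed: needed by PHP builds with force-cgi-redirect
    }
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(params))
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))
```

The stdout part returned by `fetch` contains the CGI headers, a blank line, and then the page body, so the body can be compared byte for byte with the expected output without any web server in between.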

DM says:

So, here is what I got:
a 1 MB file filled with zeros, PHP running as an Apache module. The load was 100% during the run (cpu: Pentium(R) 4 CPU 1400MHz, mem: 516680 kB)
I tested with ab -c 100 http://dm.test/test.php
If you tell me another way to download it 100 times, I’ll try it that way.

Server Software: Apache/2.2.3
Server Hostname: dm.test
Server Port: 80

Document Path: /test.php
Document Length: 1048576 bytes

Concurrency Level: 1
Time taken for tests: 474.532609 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 104877000 bytes
HTML transferred: 104857600 bytes
Requests per second: 0.21 [#/sec] (mean)
Time per request: 4745.326 [ms] (mean)
Time per request: 4745.326 [ms] (mean, across all concurrent requests)
Transfer rate: 215.83 [Kbytes/sec] received
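The figures in this ab report can be cross-checked with a little arithmetic, sketched here in Python. All numbers are copied from the output above; the variable names are illustrative. Note that, as far as I know, ab only compares response lengths, so a run like this would not reveal a response of the right size that contains blocks of 0x00 bytes.

```python
requests = 100
document_length = 1048576      # bytes, "Document Length"
html_transferred = 104857600   # bytes, "HTML transferred"
total_transferred = 104877000  # bytes, "Total transferred"
seconds = 474.532609           # "Time taken for tests"

# Every response body had the full length if the totals match exactly.
full_downloads = html_transferred == requests * document_length

# The remainder is HTTP header overhead, averaged per response.
header_bytes_per_response = (total_transferred - html_transferred) / requests

# ab reports the transfer rate in units of 1024 bytes per second.
rate_kbytes = total_transferred / seconds / 1024

print(full_downloads, header_bytes_per_response, round(rate_kbytes, 2))
```

The computed rate should agree with the “Transfer rate” line of the report.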

id says:

to _Andrey_:
No, we haven’t tested it, but the same thing shows up occasionally on one more server and has also happened on yet another one. So that’s 3 different servers with the same bug; IMO that rules out memory problems, and hardware problems in general.

Scoundrel says:

2Aleks: Thanks – I’ll take a look at this tcp scaling problem.

About chatting with me – you can reach me via any IM (contacts are on resume page).

I’ve deleted the strace/tcpdump/etc. dumps, but I think I can get new ones ;-)

Camel says:

Were your test scripts local or remote? I’ve had similar issues. Check your ethernet MTU; it couldn’t hurt to try lowering it to 1450.

Scoundrel says:

Hmmm… I’ve got a major update – it looks like there is no problem when the web server is not on port 80. I just tested it with the same config but on port 8080 – everything works fine at full speed without any problems (at least on my 100-files-download test). Really strange :-(

Scoundrel says:

2Camel: The test script was local, but the errors occurred on both local and remote clients – distance from the server does not matter.

Dimon says:

We have rebuilt the kernel – versions from 2.6.16.51 to 2.6.19.7 – nothing helped. Trying different NICs also didn’t help.

Siava says:

I had a similar problem after upgrading the kernel to 2.6.21.*; everything went back to normal after returning to the 2.6.16.* branch.