Parallel processing in PHP

Since PHP does not offer native threads, we have to get creative to do parallel processing. I will introduce three fundamentally different concepts to emulate multithreading as well as possible.

Using system calls

If you have some basic Linux knowledge, you will know that a background process can be started by appending an ampersand to the command (on Windows, it's the start command):

dav@david:/var/www$ php index.php &
[1] 3229

The PHP script is running silently in the background. The number printed to the shell (3229) is the process id, so that we are able to kill the process using

kill 3229

A problem with this approach is that any output of the script is lost, so we have to redirect the output stream to a file, like this:

php index.php > output.txt 2>&1 &

The purpose of the scary 2>&1 is to redirect stderr to stdout, so if your script produces any kind of PHP error, it is also captured in the output file. Putting everything together, we get:

$cmd = "php script.php";

$outputfile = "/var/www/files/out.";
$pidfile = "/var/www/files/pid.";

for ($i = 0; $i < $process_count; $i++)
    exec(sprintf("%s > %s 2>&1 & echo $! >> %s", $cmd, $outputfile.$i, $pidfile.$i));

Looks confusing, right? We've added echo $! >> %s to the command so that the process id of the background script is written to a file. This proves useful for keeping track of all running processes.
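
For example, a small cleanup script can read those pid files back and kill each worker individually (a minimal sketch; the file names match the example above):

// kill every worker whose pid we recorded earlier
foreach (glob("/var/www/files/pid.*") as $pidfile)
{
    $pid = (int) trim(file_get_contents($pidfile));

    if ($pid > 0)
    {
        exec("kill " . $pid);
    }

    unlink($pidfile);
}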

If you want to kill all PHP processes, the following command will do:

killall php

Needless to say, when you add the PHP shebang #!/usr/bin/php to the top of your script and make it executable using chmod +x script.php, the system command needs to be ./script.php instead of php script.php.

To check whether a process is still running, you might use some variation of the ps command, as done here (stolen from Steffen):

function is_running($pid)
{
	// list all processes with pid and state, then grep for ours
	$c = "ps -A -o pid,s | grep " . escapeshellarg($pid);
	exec($c, $output);

	if (count($output) && preg_match("~(\d+)\s+(\w+)$~", trim($output[0]), $m))
	{
		// D = uninterruptible sleep, R = running, S = sleeping
		$status = trim($m[2]);
		if (in_array($status, array("D", "R", "S")))
		{
			return true;
		}
	}

	return false;
}
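
Combined with the pid files from above, this lets a controlling script block until all workers are done. A minimal sketch (assuming the pid files from the exec example and the is_running function above):

$pids = array();

// collect the recorded process ids
foreach (glob("/var/www/files/pid.*") as $pidfile)
{
    $pids[] = (int) trim(file_get_contents($pidfile));
}

// poll once per second until every worker has finished
do
{
    $busy = 0;

    foreach ($pids as $pid)
    {
        if (is_running($pid))
        {
            $busy++;
        }
    }

    if ($busy > 0)
    {
        sleep(1);
    }
}
while ($busy > 0);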

Using fork()

Using the pcntl functions of PHP, you get the ability to fork a process (pcntl_fork, not available on Windows). Before you get too excited, read the following quote from a comment on php.net that exactly reflects my experience with forking in PHP:

You should be _very_ careful with using fork in scripts beyond academic examples, or rather just avoid it altogether, unless you are very aware of its limitations.
The problem is that it just forks the whole PHP process, including not only the state of the script, but also the internal state of any extensions loaded.
This means that all memory is copied, but all file descriptors are shared among the parent and child processes.
And that can cause major havoc if some extension internally maintains file descriptors.
The primary example is of course mysql, but this could be any extension that maintains open files or network sockets.

You have been warned! Look at the following example:

for ($i = 0; $i < 4; $i++)
{
    pcntl_fork();
}

echo "hi there! pid: " . getmypid() . "\n";

Output:

dav@david:/var/www$ php script.php
hi there! pid: 3534
hi there! pid: 3536
hi there! pid: 3538
hi there! pid: 3539
hi there! pid: 3540
hi there! pid: 3541
hi there! pid: 3542
hi there! pid: 3537
hi there! pid: 3543
dav@david:/var/www$ 
hi there! pid: 3544
hi there! pid: 3545
hi there! pid: 3546
hi there! pid: 3548
hi there! pid: 3547
hi there! pid: 3549
hi there! pid: 3550

As you can see, we get 2 ^ fork count processes, in this case 2^4 = 16. Somewhere in the middle of the output the original script finishes (notice the returning shell prompt), but some forks are still running. It's even possible to communicate with the processes you forked. Forking is a very interesting area of computer science; nevertheless, I don't recommend using fork in real-world PHP applications.
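
That said, if you want to fork responsibly, the usual pattern is to branch on the return value of pcntl_fork and let the parent wait for its children. A minimal sketch, not taken from the example above:

$pids = array();

for ($i = 0; $i < 4; $i++)
{
    $pid = pcntl_fork();

    if ($pid == -1)
    {
        die("fork failed\n");
    }
    elseif ($pid == 0)
    {
        // child: do the work, then exit so it doesn't fork further
        echo "child pid: " . getmypid() . "\n";
        exit(0);
    }

    // parent: remember the child's pid
    $pids[] = $pid;
}

// parent: reap all children to avoid zombie processes
foreach ($pids as $pid)
{
    pcntl_waitpid($pid, $status);
}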

Using curl

The last way to process multiple scripts in parallel is to abuse the webserver and curl. With the curl_multi functions, we are able to execute multiple requests in parallel (inspired by Gonzalo Ayuso).

$url = "http://localhost/calc.php";
$mh = curl_multi_init();
$handles = array();
$process_count = 15;

while ($process_count--)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

$running = null;

do 
{
    curl_multi_exec($mh, $running);
} 
while ($running > 0);

for ($i = 0; $i < count($handles); $i++) 
{
    $out = curl_multi_getcontent($handles[$i]);
    print $out . "\r\n";
    curl_multi_remove_handle($mh, $handles[$i]);
}

curl_multi_close($mh);

Here, we call the script calc.php 15 times. The content of calc.php is:

<?php
echo "my pid: " . getmypid();
?>

The output is as follows:

dav@david:/var/www$ php script.php
my pid: 1401
my pid: 1399
my pid: 1399
my pid: 1403
my pid: 1403
my pid: 1398
my pid: 1398
my pid: 1402
my pid: 3767
my pid: 3768
my pid: 3769
my pid: 3772
my pid: 3771
my pid: 3773
my pid: 3770

It is interesting to see the same process id appear a few times: the webserver reuses its worker processes to serve several requests. Keep in mind that you are triggering HTTP requests, so you lose some performance because the webserver has to do additional work. Furthermore, the called script will be working with the ordinary php.ini, not php-cli.ini.
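
One side note that goes beyond the original snippet: the do/while loop above busy-waits and burns CPU while the requests are in flight. A gentler variant (a sketch reusing the same multi handle) lets curl_multi_select sleep until at least one handle has activity:

$running = null;

do
{
    curl_multi_exec($mh, $running);

    // sleep up to one second instead of spinning
    if ($running > 0)
    {
        curl_multi_select($mh, 1.0);
    }
}
while ($running > 0);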

What about the speed? Benchmarks!

What would you take away from this post if you didn't know which parallel processing method is the fastest? I've written a little benchmark using the three methods described above, did 3 runs and calculated the average. Basically, this is my benchmark script calc.php:

$starttime = time();
$duration = 10; // each process works for 10 seconds

$filename = "/var/www/results/" . getmypid() . ".out";

$loops = 0;

while (true)
{
    // some cpu-bound busywork
    for ($i = 0; $i < 10000; $i++)
    {
        sqrt($i);
    }

    $loops++;

    if ($starttime + $duration <= time())
        break;
}

// write the achieved loop count to a per-process result file
file_put_contents($filename, $loops);

My system:

Ubuntu 10.10 (kernel 2.6.35-28)
4 GB RAM
Intel Core 2 Duo T7500 (2 × 2.2 GHz)

I'm fully aware that this benchmark is in no way representative: writing the result files to the harddisk might influence the other processes that are still running, and my time comparison may also be slightly inaccurate. Ah, before you ask: I haven't used set_time_limit because it sucks. So bring on the results!

Method  Proc.  Iterations

exec     1    2183 
exec     2    3953 
exec     4    4283 
exec     8    4378
exec    16    4586
exec    32    4868

curl     1    2203
curl     2    2843
curl     4    3029
curl     8    3556
curl    16    3986
curl    32    4373

fork     1    2274
fork     2    4299
fork     4    4245
fork     8    4309
fork    16    4177
fork    32    4577

As you can see: the more parallel processes, the more iterations in total. I haven't tested 64 processes or more because my system almost froze (memory usage and cpu utilization). Feel free to interpret the results in any way you want, but in the end it boils down to the exec method, because fork is evil and curl is not a serious alternative.

Finally, if you want to do some testing on your own, here is my benchmark file. Place it in the same folder as the calc.php from above, make the file executable and create a folder named results. The file is invoked as ./bench.php method processcount, so possible calls are:

./bench.php exec 16
./bench.php curl 8
./bench.php fork 32
./bench.php -> no parameters, displays the results

The file itself:

#!/usr/bin/php
<?php
$mode = isset($argv[1]) ? $argv[1] : "results";
$process_count = isset($argv[2]) ? (int) $argv[2] : 1;

//cleanup
if ($mode != "results" && count(glob("/var/www/results/*")))
{
    exec("rm /var/www/results/*");
}

if ($mode == "exec")
{
    $cmd = "php calc.php";

    $outputfile = "/var/www/results/out.";
    $pidfile = "/var/www/results/pid.";

    for ($i = 0; $i < $process_count; $i++)
        exec(sprintf("%s > %s 2>&1 & echo $! >> %s", $cmd, $outputfile.$i, $pidfile.$i));
}
elseif ($mode == "curl")
{
    $url = "http://localhost/calc.php";
    $mh = curl_multi_init();
    
    while ($process_count--)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, false);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
    }
    
    $running = null;
    
    do 
    {
        curl_multi_exec($mh, $running);
    } 
    while ($running > 0);
}
elseif ($mode == "fork")
{
    // each fork doubles the number of processes, so log2(n) forks
    // yield n processes (exact only if $process_count is a power of two)
    for ($i = 0; $i < log($process_count, 2); $i++)
    {
        pcntl_fork();
    }
    
    include "calc.php";
}
else
{
    $total = 0;

    foreach (glob("/var/www/results/*.out") as $f)
    {
        $loops = file_get_contents($f); // each file contains one loop count
        $total += $loops;
        echo $loops . "\r\n";
    }

    echo "Total: " . $total . "\r\n";
}


19 replies to "Parallel processing in PHP"

  1. mike says:

    IMHO the curl method is a horrible abuse of technology. You do actually say "not a serious alternative" as well – I fully agree. :)

    Why not talk about other options such as gearman? Then you are using technologies designed for parallel processing/non-blocking/async…

    1. david says:

      Hi, you are totally right – I should have mentioned it.

  2. Jason says:

    The problem with using & to fork processes is that the child processes are totally dependent on the parent process… meaning that if the parent process exits… so do the child processes. This isn't really multithreading if you ask me.

    One solution I've used is the "at" command.

    I agree the pcntl_fork option kind of sucks for php and the curl example is almost not worth mentioning.

  3. Erik says:

    Nice comparison, I’ve used the exec method before, didn’t know about the other two.
    @Jason you can use nohup to have the process break from the parent and continue running. I’ve used this method for initializing long mysql dumps or restores.

  4. You can also use the default stream extension with non-blocking options to parallelize requests. It also works fine for webservice-intensive applications. stream_select() will avoid the idle loop by providing you with the streams ready to be interacted with.

    Gearman is great when available.

  5. Indrek says:

    You have done a little wrong in the fork example. Better example:

    $pids = array();
    for ($i = 0; $i < 4; $i++) {
        if ($pid = pcntl_fork()) {
            $pids[] = $pid; // parent: remember the child
        } else {
            // Now I'm the child process: do the work, then exit
            exit;
        }
    }
    // Now must wait until all children are finished
    while ($pids) {
        $pid = pcntl_wait($status);
        // remove $pid from $pids
        $pids = array_diff($pids, array($pid));
    }

  6. Patrick says:

    Well, gotta stop using the curl method. ^^

  7. Nikvasi says:

    I think PHP is not suited for this at all. First, PHP is not thread safe, so we cannot use native threading with pcntl_fork().
    Second, exec() is good for one process, but with multiple of them you can easily overload the server, and you have no way to manage these processes (only via kill pid, which is very hard and not worth it).
    My best solution was curl or file_get_contents – both are simple, and Apache controls the resource usage. One huge minus of this approach is that you cannot kill it (in some cases you can) when you set_time_limit(0) and run the child as a daemon.

  8. javier says:

    The best info about parallel processing I could find on the whole net; the only thing I would add is gearman, but it's kinda different since you have to set up your PHP for it.
    Kind regards, David

  9. Adam says:

    I'm in two minds at the moment and can't decide on the best approach for developing a PHP application whose main purpose is scheduling tasks. I'm currently refactoring a previous version which utilized the Symfony Process component for running/tracking processes in parallel, which under the hood uses the PHP exec function.

    The only problem I currently see with this implementation is that the separate processes will run through the CLI and not be able to take advantage of an opcode cache, with the likes of OPcache, APC etc.

    For this reason, I'm almost favouring the CURL solution. I would be interested to know how these different approaches compare when you incorporate load balancing on the server using Nginx and have OPcache enabled. It also allows for (I won't say better, but) easier scalability. I'd like to hear your thoughts and whether your comment about curl not being a serious alternative still stands.

    Also, I'm aware that there are better ways of carrying out this type of work: Gearman, ZeroMQ etc. However, I'm tied to a Windows operating system for this project, which hasn't made it easy.
