Since PHP does not offer native threads, we have to get creative to do parallel processing. I will introduce three fundamentally different approaches to emulate multithreading as well as possible.
Using system calls
If you have some basic Linux knowledge, you will know that a background process can be started by appending an ampersand to the command (on Windows, it's the start command):
dav@david:/var/www$ php index.php &
[1] 3229
The PHP script is now running silently in the background. The number printed to the shell (3229) is the process id, which lets us kill the process using
kill 3229
A problem with this approach is that any output of the script is lost, so we have to redirect the output stream to a file, like this:
php index.php > output.txt 2>&1 &
The purpose of the scary-looking 2>&1 is to redirect stderr to stdout, so that any kind of PHP error your script produces also ends up in the output file. Putting everything together, we get:
$cmd = "php script.php"; $outputfile = "/var/www/files/out."; $pidfile = "/var/www/files/pid."; for ($i = 0; $i < $process_count; $i++) exec(sprintf("%s > %s 2>&1 & echo $! >> %s", $cmd, $outputfile.$i, $pidfile.$i));
Looks confusing, right? We've added echo $! >> %s to the command so that the process id of the background script gets written to a file. This proves useful for keeping track of all running processes.
If you want to kill all PHP processes, the following command will do:
killall php
Needless to say, when you add the PHP shebang #!/usr/bin/php to the top of your script and make it executable with chmod +x script.php, the command needs to change from php script.php to ./script.php.
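The exec call from above would then look roughly like this (paths are only illustrative):

// Hypothetical variant of the exec call above for an executable script with a shebang
exec("./script.php > /var/www/files/out.0 2>&1 & echo $! >> /var/www/files/pid.0");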
To check whether a process is still running, you might use some variation of the ps command, as done here (stolen from Steffen):
function is_running($pid) {
    $c = "ps -A -o pid,s | grep " . escapeshellarg($pid);
    exec($c, $output);
    if (count($output) && preg_match("~(\d+)\s+(\w+)$~", trim($output[0]), $m)) {
        $status = trim($m[2]);
        if (in_array($status, array("D", "R", "S"))) {
            return true;
        }
    }
    return false;
}
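A quick, hypothetical way to use it together with the pid files written above:

// Hypothetical usage: check every pid file written by the exec example above
foreach (glob("/var/www/files/pid.*") as $pidfile) {
    $pid = (int) trim(file_get_contents($pidfile));
    echo basename($pidfile) . ": " . (is_running($pid) ? "running" : "finished") . "\n";
}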
Using fork()
Using the pcntl functions of PHP, you get the ability to fork a process (pcntl_fork, not available on Windows). Before you get too excited, read the following quote from a comment on php.net that exactly reflects my experience with forking in PHP:
You should be _very_ careful with using fork in scripts beyond academic examples, or rather just avoid it altogether, unless you are very aware of its limitations.
The problem is that it just forks the whole php process, including not only the state of the script, but also the internal state of any extensions loaded.
This means that all memory is copied, but all file descriptors are shared among the parent and child processes.
And that can cause major havoc if some extension internally maintains file descriptors.
The primary example is of course mysql, but this could be any extension that maintains open files or network sockets.
You have been warned! Look at the following example:
for ($i = 0; $i < 4; $i++) {
    pcntl_fork();
}
echo "hi there! pid: " . getmypid() . "\n";
Output:
dav@david:/var/www$ php script.php
hi there! pid: 3534
hi there! pid: 3536
hi there! pid: 3538
hi there! pid: 3539
hi there! pid: 3540
hi there! pid: 3541
hi there! pid: 3542
hi there! pid: 3537
hi there! pid: 3543
dav@david:/var/www$ hi there! pid: 3544
hi there! pid: 3545
hi there! pid: 3546
hi there! pid: 3548
hi there! pid: 3547
hi there! pid: 3549
hi there! pid: 3550
As you can see, we get 2^(fork count) = 16 processes. Somewhere in the middle of the output, the original script has finished while some forks are still running. It's even possible to communicate with processes that you forked (a small sketch follows below). Forking is a very interesting area of computer science; nevertheless, I don't recommend using fork in real-world PHP applications.
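If you are curious what such parent/child communication could look like, here is a minimal sketch (my own illustration, not part of the benchmark) that passes a message from the child back to the parent over a socket pair:

// Minimal sketch: parent and child exchange a message after pcntl_fork()
$pair = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);

$pid = pcntl_fork();
if ($pid === -1) {
    die("fork failed\n");
} elseif ($pid === 0) {
    // Child: close the parent's end, send a result, exit
    fclose($pair[0]);
    fwrite($pair[1], "result from child pid " . getmypid() . "\n");
    fclose($pair[1]);
    exit(0);
}

// Parent: close the child's end, read the child's message, wait for it to finish
fclose($pair[1]);
echo fgets($pair[0]);
fclose($pair[0]);
pcntl_waitpid($pid, $status);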
Using curl
The last way to process multiple scripts in parallel is to abuse the web server and curl. With curl, we are able to execute multiple requests in parallel (inspired by Gonzalo Ayuso).
$url = "http://localhost/calc.php"; $mh = curl_multi_init(); $handles = array(); $process_count = 15; while ($process_count--) { $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_HEADER, 0); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_TIMEOUT, 30); curl_multi_add_handle($mh, $ch); $handles[] = $ch; } $running=null; do { curl_multi_exec($mh, $running); } while ($running > 0); for($i = 0; $i < count($handles); $i++) { $out = curl_multi_getcontent($handles[$i]); print $out . "\r\n"; curl_multi_remove_handle($mh, $handles[$i]); } curl_multi_close($mh);
Here, we call the script calc.php 15 times. The content of calc.php is:
<?php echo "my pid: " . getmypid(); ?>
The output is as follows:
dav@david:/var/www$ php script.php
my pid: 1401
my pid: 1399
my pid: 1399
my pid: 1403
my pid: 1403
my pid: 1398
my pid: 1398
my pid: 1402
my pid: 3767
my pid: 3768
my pid: 3769
my pid: 3772
my pid: 3771
my pid: 3773
my pid: 3770
It is interesting that the same process id shows up a few times: the web server reuses its worker processes for several requests. Keep in mind that you are triggering HTTP requests, so you lose some performance because the web server has extra work to do. Furthermore, the called script runs with the ordinary php.ini, not php-cli.ini.
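If you do go this route, the request itself is the only input channel to the worker script, so per-task parameters have to travel in the URL. A hypothetical variation of the loop above, plus the matching read in calc.php:

// Hypothetical: give every parallel request its own work item via the query string
curl_setopt($ch, CURLOPT_URL, $url . "?task=" . $process_count);

// and in calc.php:
$task = isset($_GET["task"]) ? (int) $_GET["task"] : 0;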
What about the speed? Benchmarks!
What would you take away from this post if you didn't know which parallel processing method is the fastest? I've written a little benchmark script using the three methods described above, did three runs and calculated the average. Basically, this is my benchmark script calc.php:
$starttime = time();
$duration = 10;
$filename = "/var/www/results/" . getmypid() . ".out";
$loops = 0;

while (true) {
    for ($i = 0; $i < 10000; $i++) {
        sqrt($i);
    }
    $loops++;
    if ($starttime + $duration <= time()) break;
}
file_put_contents($filename, $loops);
My system:
Ubuntu 10.10 (Kernel 2.6.35-28)
4 GB RAM
Intel Core 2 Duo T7500 (2 x 2.2 GHz)
I'm fully aware that this benchmark is in no way representative: writing the result files to the hard disk might influence other processes that are still running, and my time comparison may also be slightly inaccurate. Ah, before you ask: I haven't used set_time_limit because it sucks. So bring on the results!
Method   Processes   Iterations
exec             1         2183
exec             2         3953
exec             4         4283
exec             8         4378
exec            16         4586
exec            32         4868
curl             1         2203
curl             2         2843
curl             4         3029
curl             8         3556
curl            16         3986
curl            32         4373
fork             1         2274
fork             2         4299
fork             4         4245
fork             8         4309
fork            16         4177
fork            32         4577
As you can see, the more parallel processes, the more iterations in total. I haven't tested 64 processes or more because my system almost froze (memory usage and CPU utilization). Feel free to interpret the results in any way you want, but in the end it boils down to the exec method, because fork is evil and curl is not a serious alternative.
Finally, if you want to do some testing on your own, here is my benchmark file. Place it in the same folder as the calc.php from above, give it execute rights, and create a folder named results. It is invoked as ./bench.php method processcount, so possible calls are:
./bench.php exec 16
./bench.php curl 8
./bench.php fork 32
./bench.php            -> no parameter to display results
The file itself:
#!/usr/bin/php
<?php
$mode = isset($argv[1]) ? $argv[1] : "results";
$process_count = isset($argv[2]) ? $argv[2] : 1;

// cleanup
if ($mode != "results" && count(glob("/var/www/results/*"))) {
    exec("rm /var/www/results/*");
}

if ($mode == "exec") {
    $cmd = "php calc.php";
    $outputfile = "/var/www/results/out.";
    $pidfile = "/var/www/results/pid.";
    for ($i = 0; $i < $process_count; $i++)
        exec(sprintf("%s > %s 2>&1 & echo $! >> %s", $cmd, $outputfile.$i, $pidfile.$i));
} elseif ($mode == "curl") {
    $url = "http://localhost/calc.php";
    $mh = curl_multi_init();
    while ($process_count--) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, false);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);
} elseif ($mode == "fork") {
    for ($i = 0; $i < log($process_count, 2); $i++) {
        pcntl_fork();
    }
    include "calc.php";
} else {
    $total = 0;
    foreach (glob("/var/www/results/*.out") as $f) {
        $runtime = file_get_contents($f);
        $total += $runtime;
        echo $runtime . "\r\n";
    }
    echo "Total: " . $total . "\r\n";
}
IMHO the curl method is a horrible abuse of technology. You do actually say „not a serious alternative“ as well – I fully agree. :)
Why not talk about other options such as gearman? Then you are using technologies designed for parallel processing/non-blocking/async…
Hi, you are totally right – I should have mentioned it.
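For completeness, a rough, untested sketch of what a Gearman setup could look like (assuming the pecl gearman extension and a gearmand server on localhost):

// worker.php – start several of these in parallel; each blocks in work()
$worker = new GearmanWorker();
$worker->addServer("127.0.0.1", 4730);
$worker->addFunction("calc", function (GearmanJob $job) {
    $n = (int) $job->workload();
    for ($i = 0; $i < $n; $i++) { sqrt($i); }
    return (string) getmypid();
});
while ($worker->work());

// client.php – queue jobs without waiting for the results
$client = new GearmanClient();
$client->addServer("127.0.0.1", 4730);
for ($i = 0; $i < 16; $i++) {
    $client->doBackground("calc", "10000");
}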
The problem with using & to fork processes is that the child processes are totally dependent on the parent process… meaning that if the parent process exits, so do the child processes. This isn't really multithreading if you ask me.
One solution I've used is the „at" command.
I agree the pcntl_fork option kind of sucks for php and the curl example is almost not worth mentioning.
Nice comparison, I’ve used the exec method before, didn’t know about the other two.
@Jason you can use nohup to have the process break from the parent and continue running. I’ve used this method for initializing long mysql dumps or restores.
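For example (paths are just illustrative):

// Illustrative: nohup detaches the job from the launching process, so it keeps
// running even if the parent exits or the shell session ends
exec("nohup php script.php > /var/www/files/out.0 2>&1 & echo $! >> /var/www/files/pid.0");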
You can also use the default stream extension with non-blocking options to parallelize requests. It also works fine for webservice-intensive applications. stream_select() will avoid the idle loop by providing you with the streams ready to be interacted with.
Gearman is great when available.
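A rough sketch of the stream_select() approach described above (my own reading of that suggestion, not code from the post):

// Open several non-blocking HTTP connections and multiplex them with stream_select()
$urls      = array_fill(0, 4, "http://localhost/calc.php");
$streams   = array();
$responses = array();

foreach ($urls as $i => $url) {
    $host = parse_url($url, PHP_URL_HOST);
    $path = parse_url($url, PHP_URL_PATH) ?: "/";
    $s = stream_socket_client("tcp://" . $host . ":80", $errno, $errstr, 30);
    fwrite($s, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    stream_set_blocking($s, false);
    $streams[$i]   = $s;
    $responses[$i] = "";
}

// stream_select() only returns streams that have data ready, so there is no idle loop
while ($streams) {
    $read = array_values($streams);
    $write = null;
    $except = null;
    if (stream_select($read, $write, $except, 30) === false) {
        break;
    }
    foreach ($read as $s) {
        $i = array_search($s, $streams, true);
        $chunk = fread($s, 8192);
        if ($chunk !== false && $chunk !== "") {
            $responses[$i] .= $chunk;
        }
        if (feof($s)) {
            fclose($s);
            unset($streams[$i]);
        }
    }
}

print_r($responses);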
You have a small mistake in the fork example. Better example:
$pids = array();

for ($i = 0; $i < 4; $i++) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child process: do the actual work, then exit
        echo "hi there! pid: " . getmypid() . "\n";
        exit(0);
    }
    $pids[] = $pid; // parent keeps track of every child pid
}

// The parent must wait until all children have finished
while ($pids) {
    $pid = pcntl_wait($status);
    $pids = array_diff($pids, array($pid)); // remove $pid from $pids
}
Well, gotta stop using the curl method. ^^
I think PHP is not suited for this at all. First, PHP is not thread safe, so we cannot use native threading with pcntl_fork().
Second, exec() is good for one process, but with many of them you can easily overload the server and you have no way to manage these processes (only via kill pid, which is hard and not worth it).
My best choice was curl or file_get_contents – both are simple and Apache controls the resource usage. One huge minus of this approach is that you cannot kill it (in some cases you can) when you set_time_limit(0) and run the child as a daemon.
Best info about parallel processing I could find on the whole net; the only thing I would add would be gearman, but it's kinda different since you have to set up your PHP for it.
Kind regards David
I'm in two minds at the moment and can't decide on the best approach for developing a PHP application whose main purpose is scheduling tasks. I'm currently refactoring a previous version which utilized the Symfony Process component for running/tracking processes in parallel, which under the hood uses the PHP exec function.
The only problem I see currently with this implementation is that the separate processes will run through the CLI and not be able to take advantage of an opcode cache, with the likes of OPcache, APC etc.
For this reason, I'm almost favouring the curl solution. I would be interested in knowing how these different approaches compare when you incorporate load balancing on the server using Nginx and have OPcache enabled. It also allows for (I won't say better, but) easier scalability. I'd like to hear your thoughts and whether your comment about curl not being a serious alternative still stands.
Also, I’m aware that there’s better ways of carrying out this type of work, Gearman, ZeroMQ etc. However I’m tied to a Windows operating system for this project which hasn’t made it easy.