Mobify.me acts as a browser and a caching proxy, with all the benefits and limitations that entails. One thing we need to do a lot of is fetching URLs. We need to cope with every situation a browser may encounter, including sites with different types of state, cookies, POST and GET requests, redirects and error codes.
Getting what you want from the web isn’t as easy as you might like. We started out with httplib2, a pure-Python library that supported all the basic functionality we needed. We found its pure-Python implementation appealing: it wraps the tried and trusted httplib and adds easy caching, persistence, redirects, cookies and compression, among other things. Overall we were pretty happy with it, but it had one limitation: it uses pysocks for connecting to a proxy. We use a Squid caching HTTP proxy on our back end to speed up page retrieval, and also to throttle and manage connections. Unfortunately, pysocks currently supports only the CONNECT tunneling method of proxying. Squid permits tunneling, but does not apply any of its limits (such as maximum response body size) to CONNECT-tunneled responses.
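To make the difference between the two proxying styles concrete, here is a rough illustration of what the first request to the proxy looks like in each case. It is a sketch for explanation only, not code from our stack, and the function names are made up for this example. With a regular HTTP proxy request the proxy sees the full URL and handles the response itself; with CONNECT it is only asked to open a tunnel to a host and port.

```python
def forward_proxy_request(host, path="/"):
    """Request line sent to a regular HTTP proxy: the full URL is visible to it."""
    return "GET http://%s%s HTTP/1.1\r\nHost: %s\r\n\r\n" % (host, path, host)


def connect_tunnel_request(host, port=80):
    """Request asking the proxy to open a raw tunnel to host:port."""
    return "CONNECT %s:%d HTTP/1.1\r\nHost: %s:%d\r\n\r\n" % (host, port, host, port)


if __name__ == "__main__":
    # example.com is a placeholder host used only for illustration.
    print(forward_proxy_request("example.com", "/index.html"))
    print(connect_tunnel_request("example.com"))
```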
We’d been looking at pycurl for a while. Since it wraps libcurl, a native library, it has a reputation for performance. It is a very actively maintained project and is probably the most developed URL fetching mechanism living outside of a web browser. First, we tested the relative performance of pycurl and httplib2. Our results: both are very fast, but pycurl is about four times faster. If you aren’t doing hundreds of requests a minute, performance alone is not a good reason to choose pycurl. But for us, the performance and the ability to use an HTTP-based proxy were critical. Read on for more thoughts on the switch and tips for using pycurl.
Pycurl, being a thin wrapper around the C library libcurl, is not very Pythonic in behavior. Here are a few discoveries I made while working on our new pycurl core:
1) Pycurl hands you the response headers and body through callback functions rather than return values. These callbacks need to behave like the write method of a Python file object. Python’s StringIO is well suited to this task, and if you have it available, cStringIO will give you native performance.
2) Pycurl will handle the cookies for you, but requires a physical file name to load cookies from and a cookie jar file in which to store them. We recommend using the mkstemp function from the Python tempfile module to create a secure temporary file. This temporary file can then be populated with your desired cookies. The cookies can be in one of two formats: ‘Netscape / Mozilla cookie data format’, or regular HTTP-style headers. If you go with the latter (we did), you must use the Set-Cookie: syntax used when a cookie is being set, not the Cookie: syntax used when returning a cookie to the server. If you expect to be following redirects, it is important to use the cookiefile and cookiejar options of pycurl; otherwise, when pycurl follows a redirect it will not send back cookies that were set during the redirect. (Typically, these cookies arrive with a 302 response to an HTTP POST.)
3) There is a veritable bible of options for libcurl. This is the best reference page for them: curl_easy_setopt.3 man page.
4) If you are allowing pycurl to follow redirects, you may still want the header information from every page it visited. No problem! All of the headers are available to you in the header callback output, including those of pages prior to the last one visited, separated by a \n. Here’s a regular expression to help you out in splitting them: xpr = re.compile(r'^HTTP/1.*?\n$', re.M|re.U|re.S)
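Putting the four tips together, a fetch might look roughly like the sketch below. It is an illustration under stated assumptions, not our production code: the helper names (write_cookie_file, fetch, split_header_blocks) are invented for this example, it targets Python 3 (where io.BytesIO plays the role of StringIO), and instead of the regular expression above it splits the header output on blank lines, which is a simpler substitute that should give the same per-response blocks.

```python
import os
import re
import tempfile
from io import BytesIO


def write_cookie_file(cookie_lines):
    """Write Set-Cookie style header lines to a secure temp file; return its path."""
    fd, path = tempfile.mkstemp(suffix=".cookies")
    with os.fdopen(fd, "w") as handle:
        for line in cookie_lines:
            # Set-Cookie: syntax (as when a server sets a cookie), not Cookie:
            handle.write("Set-Cookie: %s\n" % line)
    return path


def fetch(url, cookie_path, proxy=None):
    """Fetch url, following redirects; return (status, raw_headers, body)."""
    # Third-party dependency, imported here so the helpers work without it.
    import pycurl

    headers, body = BytesIO(), BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.FOLLOWLOCATION, 1)
    curl.setopt(pycurl.MAXREDIRS, 5)  # arbitrary limit chosen for this sketch
    curl.setopt(pycurl.COOKIEFILE, cookie_path)  # cookies to load
    curl.setopt(pycurl.COOKIEJAR, cookie_path)   # where cookies are saved on close
    curl.setopt(pycurl.HEADERFUNCTION, headers.write)
    curl.setopt(pycurl.WRITEFUNCTION, body.write)
    if proxy:
        curl.setopt(pycurl.PROXY, proxy)  # e.g. a Squid host:port (assumption)
    try:
        curl.perform()
        status = curl.getinfo(pycurl.RESPONSE_CODE)
    finally:
        curl.close()
    return status, headers.getvalue(), body.getvalue()


def split_header_blocks(raw_headers):
    """Split the header callback output into one text block per response."""
    text = raw_headers.decode("iso-8859-1")
    blocks = re.split(r"\r?\n\r?\n", text.strip())
    return [block for block in blocks if block.startswith("HTTP/")]
```

A call would then be something like fetch("http://example.com/", write_cookie_file(["session=abc123; path=/"])), followed by split_header_blocks on the returned headers to see each hop of a redirect chain. Remember to delete the temporary cookie file when you are done with it, since mkstemp does not clean up for you.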
Ciao!
Peter

