First session of the day was on Scaling Python on the Web; rough notes which I may clean up later:
* How fast is fast enough?
** Don’t prematurely optimize
** *Know* where the bottlenecks are, and optimize those specifically
* Orders of magnitude: static (httpd), dynamic (python), db-queried
* Even 40 req/s is roughly 3.4m pages/day
* Hundreds to low thousands of dynamic page views is usually good enough
* Scaling isn’t about the language, it’s about:
** DRY: cache!
** share nothing
* built a sample photo-app, FlickrKillr, for demonstration purposes
** preloaded with 100k’s of users, 10-20 photos each
* first iteration: CGI
** roughly 23 requests/second
** problems:
*** loading Python interpreter for each request
*** all resources initialized for each request (inc. db connection)
** possible remedies:
*** run a Python web server (long-running process)
*** make one db connection per thread instead of one per request
** other remedies:
*** fastcgi
*** snakelets, twisted.web, RhubarbTart
*** mod_python
* second iteration: python app server (CherryPy used for this demo)
** roughly 139 requests per second
** problems
*** global interpreter lock — can only utilize one core on a dual core machine
*** sessions in the database — prefer an in-memory session store
** remedies:
*** run multiple instances of CherryPy (overcome the GIL)
*** but then we need to balance with something like nginx
** other options
*** cherrypy in mod_python
* version 3: load balancing with nginx
** 217 requests/sec
** outstanding problems
*** static files read from disk every time
*** and they’re being read/written from python
** solutions:
*** memcached
*** combine memcached w/ nginx
* version 4: caching
** 616 req/sec (benchmarking w/ homegrown tool)
** 1750 req/sec (benchmarking w/ ab)
* other notes:
** don’t forget to index
** without an index, the fourth iteration falls down to 28 requests/sec
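To make the indexing note concrete, here is a minimal sketch of my own (not taken from the talk or its demo code). It assumes SQLite via Python's standard `sqlite3` module and a hypothetical `photos` table with a `user_id` column; the session's actual database and schema were not recorded in these notes. It only shows how the query plan changes once an index exists, and makes no claim about the req/sec numbers above.

```python
import sqlite3

# Hypothetical schema for illustration only; not the FlickrKillr schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO photos (user_id, title) VALUES (?, ?)",
    [(user_id, f"photo {n}") for user_id in range(1000) for n in range(10)],
)

QUERY = "SELECT title FROM photos WHERE user_id = ?"


def query_plan(connection):
    """Return SQLite's description of how it would execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)  # full table scan
conn.execute("CREATE INDEX idx_photos_user_id ON photos (user_id)")
plan_with_index = query_plan(conn)     # lookup through the new index

print("before:", plan_without_index)
print("after: ", plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan of the whole table and the second a search using `idx_photos_user_id`.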
http://www.polimetrix.com/pycon/demo.tar.gz
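As a rough illustration of the "cache!" advice in the notes, here is a small cache-aside sketch. It is my own and is not from the demo tarball: `TTLCache` is a hypothetical in-process stand-in for memcached (get/set/delete with expiry), and `render_photo_page`, `get_photo_page` and `on_photo_edited` are made-up names. It is not thread-safe and is only meant to show the shape of the technique, including dropping a cached entry when the underlying data changes.

```python
import time


class TTLCache:
    """Hypothetical in-process stand-in for memcached (not thread-safe)."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)

    def delete(self, key):
        self._store.pop(key, None)


cache = TTLCache(ttl_seconds=30.0)
render_log = []  # records each expensive render, for demonstration


def render_photo_page(photo_id):
    """Placeholder for the expensive part (db query plus templating)."""
    render_log.append(photo_id)
    return f"<html>photo {photo_id}</html>"


def get_photo_page(photo_id):
    """Cache-aside read: try the cache first, render and store on a miss."""
    key = f"photo_page:{photo_id}"
    page = cache.get(key)
    if page is None:
        page = render_photo_page(photo_id)
        cache.set(key, page)
    return page


def on_photo_edited(photo_id):
    """Invalidate the cached page so the next read re-renders it."""
    cache.delete(f"photo_page:{photo_id}")
```

Calling `get_photo_page(1)` twice renders once; calling `on_photo_edited(1)` in between forces a second render. Deciding which keys to invalidate on each write is the part that gets hard on real sites.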
This is somewhat of a pet peeve, so excuse my rambling, but I miss an “outstanding problems” section for caching. Caching is /hard/, especially when dealing with dynamic web-sites such as CMS-driven sites. I couldn’t care less that 95% get a fast response through caching if my 5% of editing interactions are slow.
Caching is not a panacea and while the whole stuff about “optimize only what you need” is absolutely true and I dearly value the development speed that Python gives me, there are faster interpreters out there.
Each tool has its pros & cons. Each problem has its appropriate tools. Wrong mixes of problem & tools = bad results. Caching is hard, it’s not a panacea, Python is easy but slow (compared – for example – to Java).
Do you want fast development & fast code? Use Python, optimize it with your own C extensions. Yes, “optimize only what you need”. Optimize only that, only with C. And only if you can’t make your *algorithms* better, and can’t design your *database* more carefully. And don’t use CGI, if possible.
IN SHORT: Python, no CGI, optimized database & algorithms = fast development & good codebase. ONLY AFTER THAT tune your code & setup with C, caching, load balancing = hard part of development.
“easy but slow… compared to Java”?
I’ll buy the easy part, but I wouldn’t have pointed to Java as my exemplar of speed. I don’t think Java is slow, but then I don’t think Python is slow, either. I think both are “slower than C”, but that’s just a comparative statement, not absolute.