The Renderfarm (how it works)
Posted in Development, Production by campbell
One of my tasks here is to make sure Big Buck Bunny gets rendered, even though I’m not crazy about networks, ssh connections and figuring out why the Nth frame didn’t render. Some of you have shown interest in the renderfarm, so here’s a rundown of how it works ;)
Images are rendered on Sun’s Grid at http://network.com. They have a service where you can buy time on their systems for $1 per CPU-hour, and many tasks can be uploaded to run in parallel. I think they have 600 CPUs, but we’ve only ever managed to hog around 240, with an average of 150 CPUs at a time.
The systems have 8 GB of RAM and dual AMD Opteron CPUs: http://www.sun.com/servers/entry/v20z/specs.jsp
Luckily they were generous and gave us 50,000 hours, allowing an optimistic 4-5 hours per frame. Of course we’ll want to re-render a few times because of bugs in Blender or adjustments in the artwork, so 1-2 hours per frame is more realistic. We don’t do anything too tricky, just render 1 frame on each computer until all the frames are rendered.
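To make the budget arithmetic concrete, here is a rough Python sketch. Only the 50,000 hours and the 4-5 hours per frame come from the numbers above. The function names and the number of re-render passes are assumptions for illustration, and the hours are treated as one simple pool.

```python
# Rough budget arithmetic for the donated render time.
# DONATED_HOURS and the 4-5 h/frame figures come from the post;
# the number of re-render passes is an assumed, illustrative value.

DONATED_HOURS = 50_000


def implied_frame_count(hours_per_frame):
    """Frames that fit in the budget if each one gets hours_per_frame."""
    return DONATED_HOURS / hours_per_frame


def budget_per_pass(hours_per_frame, passes):
    """Hours left per frame per pass if every frame is rendered `passes` times."""
    return hours_per_frame / passes


if __name__ == "__main__":
    for hours in (4, 5):
        print(hours, "h/frame ->", implied_frame_count(hours), "frames")
    for passes in (2, 3, 4):  # hypothetical number of full re-renders
        print(passes, "passes ->", budget_per_pass(4.5, passes), "h/frame/pass")
```

With three full passes, for example, a 4.5 hour budget shrinks to 1.5 hours per frame per pass, which is where a figure in the 1-2 hour range comes from.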
One of the big advantages of Sun’s service is that they use a 64-bit operating system. This means Blender can use more than 2 GB of RAM, which is really important for rendering characters with millions of hairs. Other offers for rendering only ran 32-bit systems.
On the flip side, network.com had never been used for rendering anything on this scale, and the admins at Sun weren’t familiar with the problems related to this task. Peach is a good way to stress their systems infrastructure.
Installing Sun’s own Unix operating system, Solaris, was the first task. If you have used Linux or BSD you’ll have some idea of what it’s like. In fact you can run almost all common Linux applications on Solaris, and Blender3D is no exception!
The free Solaris download installed on my PC but the network drivers weren’t available. Running Solaris in a virtual machine on Ubuntu worked well, giving me a simple Solaris development environment to test with.
It seems nobody had compiled a 64-bit Blender3D on Solaris, so I had to compile the libraries Blender depends on (freetype, zlib, libsdl, openexr, libjpeg, libpng and python), then modify Blender’s Makefiles and source to add support for solaris-x86_64
(see http://wiki.blender.org/index.php/BlenderDev/BuildingBlender/Solaris for instructions).
With a working 64-bit Solaris Blender binary it was possible to test Blender on the Sun grid.
Initially Sun’s online web portal was useful for testing that Blender loaded, rendered simple scenes and saved frames without any hiccups.
Their online system is actually pretty cool: you zip all your application and working files, upload them, define the run command and press go! It tells you how many hours are being used and produces a zip with the generated files for download.
However, the Peach production files compress to around 2 GB, so re-uploading would be extremely slow and partial updates would be messy to manage.
Downloading all the frames as one big bundle isn’t workable either, so the usual way of running jobs could not work for us.
This meant running an interactive session, where you submit a job that runs an xterm on network.com with its display pointing to our modem’s IP. Our router forwards the X11 packets to our server, where the xterm magically appears!
With a dynamic DNS service the IP can be replaced by a normal hostname such as blender3d.no-ip.org. This is how you can avoid buying a domain or needing a static IP, which you don’t get with a cheap internet plan.
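For anyone curious what "display pointing to our IP" looks like in practice, here is a small Python sketch that only builds the strings involved. The display number 0 and the helper names are assumptions for illustration. The two facts it leans on are that xterm accepts the standard -display option and that X11 display :N listens on TCP port 6000 + N, which is the port the router would have to forward.

```python
# Sketch: build the X11 display string and xterm command for a remote session.
# The hostname is the example from the post; display number 0 is assumed.

X11_BASE_PORT = 6000  # X11 display :N uses TCP port 6000 + N


def display_string(host, display=0, screen=0):
    """Return an X11 display name such as 'example.org:0.0'."""
    return f"{host}:{display}.{screen}"


def x11_tcp_port(display=0):
    """TCP port the router needs to forward for the given display number."""
    return X11_BASE_PORT + display


def xterm_command(host, display=0):
    """Argument list for an xterm whose window opens on the remote display."""
    return ["xterm", "-display", display_string(host, display)]


if __name__ == "__main__":
    host = "blender3d.no-ip.org"
    print(" ".join(xterm_command(host)))
    print("forward TCP port", x11_tcp_port(0))
```

The X server on the receiving side also has to accept remote connections, which this sketch does not cover.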
To avoid sitting in the server room, VNC-Server lets me manage the renderfarm via a remote desktop from any PC, as well as log off my own computer without interruption.
The first big problem was that our internet connection kept cutting out. Eventually the ISP admitted it was their problem and we switched providers. Then the interactive session would quit and we didn’t know why. This meant that rendering hundreds of images would start, run for hours, then quit before it had finished.
This was insanely frustrating and the cause of many late nights and time wasted on incorrect assumptions about what the problem really was! Disk full? Out of memory? Corrupt temp files? Network filesystem timeout? Each had to be explored, and without direct access to the systems, and with every error closing my only window into their systems (which I needed to see the log files), it was very hard to debug.
I won’t go into details, but there were a number of problems, some with Blender, some with network.com. The guys at Sun did their best to support us and managed to use system logs to track down the errors.
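When a batch dies partway through, the practical question is which frames still need rendering. The Python sketch below is an illustration only, not the script used on the farm. The file naming pattern, the function names and the example file names are all assumptions. The Blender options shown (-b for background mode, -f to render a single frame) are standard command-line options.

```python
# Sketch: find frames that never got written and build commands to redo them.
# Assumes output files end in a frame number plus extension, e.g. peach_0042.png.
import re

FRAME_RE = re.compile(r"(\d+)\.(?:png|exr|jpg)$")


def rendered_frames(filenames):
    """Collect the frame numbers found in a list of output file names."""
    frames = set()
    for name in filenames:
        match = FRAME_RE.search(name)
        if match:
            frames.add(int(match.group(1)))
    return frames


def missing_frames(filenames, start, end):
    """Frames in the inclusive range start..end with no output file."""
    done = rendered_frames(filenames)
    return [frame for frame in range(start, end + 1) if frame not in done]


def blender_command(blend_file, frame):
    """Argument list that renders one frame in background mode."""
    return ["blender", "-b", blend_file, "-f", str(frame)]


if __name__ == "__main__":
    outputs = ["peach_0001.png", "peach_0002.png", "peach_0004.png"]  # made-up names
    for frame in missing_frames(outputs, 1, 5):
        print(" ".join(blender_command("scene.blend", frame)))
```

Since each frame is an independent job, re-running only the missing ones is cheap compared to restarting the whole batch.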
So now we’re on the home stretch and the renderfarm works!
Here’s a diagram to show the process we go through to render Peach!
More details on the renderfarm can be found here.