I’ll be attending WindyCityRails later this month. If you’re there, be sure to say “hi.”
I’m heading down with a motley mob of RUM members: Andy Atkinson, Sam Schroeder, Tom Brice, Barry Hess, and Chris Schumann. Should be a fun time!
I’m pleased to announce that FanChatter Stadium has a new customer: University of Oklahoma Sooners Football. We’re providing in-stadium photo sharing on their brand new HD LCD screen. And, as a new feature, we’re also allowing them to collect and display photos on their website between games.
The Sooners’ first game was on Saturday, where they rolled over UT-Chattanooga 57-2 in front of 84,715 fans. FanChatter Stadium performed admirably, and you can see the results here.
I’m proud of this rollout because we signed the contract on Thursday and delivered the goods on Saturday. Not a bad turn around!
Want to build something like FanChatter? Check out my PeepCode on integrating e-mail into your Rails application. Just want to pay us for it? Contact Marty Wetherall.
Whew! That was a hell of a speech last night, wasn’t it? I think I’ll go to BarakObama.com and check out what’s going on…
Splash page?! The tiny red box I’ve highlighted is the only way to escape the splash page and get to the real site.
As a web developer, I’ve believed for years that splash pages are evil. Jakob Nielsen wrote back in 1999 that “splash pages are useless and annoying. In general, every time you see a splash page, the reaction is ‘oh no, here comes a site that will be slow and difficult to use and that doesn’t respect my time.’”
And yet, on political sites, the donation splash page – especially after big events – is ubiquitous.
I have only one experience of building a site for electoral politics. In 2006, my friend David Krewinghaus and I won a bid to create a website for Senate candidate Amy Klobuchar (I set up the infrastructure. David designed everything. If you’re looking for a designer, David’s great!)
The campaign was strongly influenced by Hillary Clinton’s website. At the time Hillary Clinton was raking in dough for her puff-ball Senate re-election campaign (she raised so much money that she was able to transfer $10 million to her presidential campaign). Hillary had a donation splash page, and so the Klobuchar people wanted one, too.
We convinced them that this was a bad idea because it would annoy people and hurt the usability and searchability of the website; and couldn’t they put a big donation button on the home page? This satisfied the campaign. I felt I’d done my duty as a conscientious web developer by putting the users first.
But it was only temporary. As the election approached, the campaign pressed us again for a splash page, and this time they couldn’t be persuaded. So we built one. My contribution to splash page usability was that it wouldn’t be shown if you’d already seen it.
I still hate political splash pages. But campaigns use them for one reason: they work.
And Amy Klobuchar? They call her Senator Amy Klobuchar now.
Many Rails applications have this basic structure in their helpers folder:

    application_helper.rb
    accounts_helper.rb
    audits_helper.rb
    comments_helper.rb
    images_helper.rb
    orders_helper.rb
    posts_helper.rb
    sessions_helper.rb
    users_helper.rb
    ...
    etc.

The most important file, as we all know, is application_helper.rb, because this is where code goes to die. It’s often a few hundred lines of randomly added, unrelated methods. This is a confusing, scary place for methods to be. Here are a few tips for rescuing them:
Most projects use script/generate to make their controllers. This leaves a ton of empty helper files. Remove them to better focus on the task at hand:

    hg remove accounts_helper.rb audits_helper.rb images_helper.rb ...

Usually this will prune the list down to two or three files.
The easiest way to clean up the ApplicationHelper module is to remove it. This is a great way to ensure methods don’t stay there, or get inserted in the future. But, if they don’t belong in ApplicationHelper, where’s the best place for them?
Helpers are markup generators. If they’re not involved in generating markup, they’re not helpers and can be pushed into a model:
helpers/application_helper.rb

    module ApplicationHelper
      def birthday_in_words(child, prefix = 'born')
        "(#{prefix} #{child.birthday_in_words})" if child.birthday?
      end
    end

becomes a method on the model:

    class Child < ActiveRecord::Base
      def birthday_in_words(prefix = 'born')
        "(#{prefix} #{birthday})" if birthday?
      end
    end

Unfortunately Rails relies on this ambiguous ‘helper’ naming convention internally, making it tricky to change the naming in your own application. (I find the concept of a helper to be… unhelpful, and will be referring to them as ‘markup generators’ for the rest of this post.)
By default, Rails makes all markup generators available to any view via helper :all. The relationship between a model and markup generation tends to be incidental, and script/generate’s ‘ModelNameHelper’ convention is a bit sketchy. Better to name it like anything else, so a module that generates, say, HTML for tables, gets named TableHelper.

    module TableHelper
      def default_sort_column(title, direction)
        ...
      end

      def sort_column(title, direction)
        ...
      end
    end

Better yet, if the generation starts getting complex, take a page from one of Ryan Bates’s screencasts and turn it into a class.
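As a rough illustration of that idea, here is a minimal sketch of a markup generator written as a plain Ruby class. The SortColumn name, its arguments, and the link format are all assumptions made up for this example; they don’t come from the screencast or from Rails.

```ruby
require 'erb'

# Hypothetical markup generator for one sortable table header cell.
class SortColumn
  def initialize(title, direction = 'asc')
    @title = title
    @direction = direction
  end

  # The direction the link switches to when clicked.
  def next_direction
    @direction == 'asc' ? 'desc' : 'asc'
  end

  def to_html
    label = ERB::Util.html_escape(@title)
    param = ERB::Util.url_encode(@title.downcase)
    %(<th class="sort #{@direction}"><a href="?sort=#{param}:#{next_direction}">#{label}</a></th>)
  end
end

puts SortColumn.new('Name').to_html
```

Because the class has no dependency on a view, it can be tested without loading Rails at all, and a helper method can shrink to a one-liner that calls to_html.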
Well organized code is great. Tested, well organized code? Even better!

    require File.join(File.dirname(__FILE__), '..', 'test_helper')
    require File.join('action_view', 'test_case')

    class AlexaThumbnailHelperTest < ActionView::TestCase
      context "Generating Alexa image tags" do
        setup do
          @url = 'http://ted.com'
          # Expected markup is built from the same URL passed to the helper.
          @alexa_image_tag_html = %(<img src="http://ast.amazonaws.com/?...=#{@url}"/>)
        end

        should "return the image tag as html" do
          assert_equal @alexa_image_tag_html, alexa_image_tag(@url)
        end
      end
    end

On a related note, in a few cases it’s useful to allow your templates access to controller methods. Rails provides helper_method to handle this:

    class ApplicationController < ActionController::Base

      # Give views access to these methods:
      helper_method :current_user, :logged_in?

      protected

      def current_user
        ...
      end

      def logged_in?
        ...
      end
    end

In Rails 2.0 and later, all requests are wrapped in a block that enables query caching.
What this means is that if you execute the exact same query in a single request, the previous results of the query will be returned instead of fetching them from the database again.
Controller actions are wrapped with this automatically, but you can also enable it elsewhere like this:

    User.cache do
      # do stuff with caching turned on.
    end

However, sometimes you do not want this to happen. For example, if you want to fetch random records from the database, having this cached will cause you to get the same record each time you query.
Fortunately, the cache is easy to disable for parts of your code, with the uncached method (see also):

    class User < ActiveRecord::Base
      def self.random
        # query for example purposes only --
        # ordering by rand() is slow, see here:
        # http://jan.kneschke.de/projects/mysql/order-by-rand
        uncached do
          find(:first, :order => "rand()")
        end
      end
    end

Disabling the cache only affects the code within the block, so unlike clearing the cache (which would also work) the rest of your code will still get the benefit of the query cache.
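If the scoping is hard to picture, here is a toy model of it in plain Ruby. FakeDatabase and its methods are hypothetical stand-ins written for this sketch; this is not how ActiveRecord implements its query cache, only an imitation of the behaviour described above.

```ruby
# Toy stand-in for a query cache. Everything here is hypothetical and only
# mimics the cache / uncached scoping; it is not ActiveRecord code.
class FakeDatabase
  attr_reader :hits

  def initialize
    @hits = 0
    @cache = nil # nil means caching is switched off
  end

  # Turn caching on for the duration of the block.
  def cache
    previous, @cache = @cache, {}
    yield
  ensure
    @cache = previous
  end

  # Turn caching off for the duration of the block, then restore it.
  def uncached
    previous, @cache = @cache, nil
    yield
  ensure
    @cache = previous
  end

  def query(sql)
    return run(sql) if @cache.nil?
    @cache[sql] ||= run(sql)
  end

  private

  # Pretend to hit the database and count how often that happens.
  def run(sql)
    @hits += 1
    "result ##{@hits} for #{sql}"
  end
end

db = FakeDatabase.new
db.cache do
  db.query("SELECT * FROM users")                 # hits the database
  db.query("SELECT * FROM users")                 # served from the cache
  db.uncached { db.query("SELECT * FROM users") } # hits the database again
  db.query("SELECT * FROM users")                 # cached result is still there
end
puts db.hits
```

The last query is the point: the cached result survives the uncached block, which is the difference between disabling the cache for a block and clearing it.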
MapReduce is the architecture that Google uses to do things like index the web and calculate PageRank. It’s a somewhat popular topic for developers and bloggers, for two reasons. First, because Google uses it to such dramatic effect, and it’s easy to think that it must be the greatest and most powerful way to handle distributed processing. Second, it is a little hard to understand at first, which means that there is always a market for “intro to MapReduce” blog posts.
The thing is, there is nothing magical about MapReduce. It is fairly simple on the surface, once you understand a few basic concepts (like map and reduce, though MapReduce != map + reduce, as we’ll see soon). It also isn’t the “best” approach to distributed processing, because there are so many types of problems that need distributed processing, and MapReduce is only appropriate for a small subset.
So this post isn’t about DIY MapReduce. Instead, it’s about understanding MapReduce. Specifically, understanding it from a languages standpoint, with reference to map and reduce, rather than understanding it from a systems standpoint.
If you understand map and reduce, you’re about a third of the way to understanding MapReduce. But it is also important to note that MapReduce doesn’t even strictly need map or reduce functions, and is implemented at Google in C++ (not exactly a functional language). So map and reduce are more the conceptual foundation of MapReduce, rather than the underlying code.
But still, the MapReduce framework gets its name from these higher-order functions, and the basic pattern is simple:
1. Apply a function to each element of a collection (map)
2. Combine the results into a final value (reduce)

Because map just applies a function to an element in an array, with no side effects, order doesn’t matter. You could run map backwards or forwards, and the result would be the same. Therefore, map operations can easily be parallelized over different CPU cores, or across multiple machines. (This is one of the advantages we get from greater abstraction. We could replace map with an each iterator, reduce, or even a for or while loop, but by using map, we know immediately that parallelization is possible.)
MapReduce also parallelizes its reduce stage, even though reduce is not inherently parallelizable. It does this not by parallelizing a single reduce, but by distributing each reduction to a different machine. So while the map stage is a single map, distributed over several computers, the reduce stage is multiple reduces, each operating on a single machine, and each independent of the reduction happening on the next machine.
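To make the “order doesn’t matter” claim concrete, here is a small sketch that runs the same side-effect-free function forwards, backwards, and in one thread per element. parallel_map is a hypothetical helper written for this example, and threads on one machine are only a stand-in for separate servers.

```ruby
# Hypothetical helper: run the block for each element in its own thread,
# then collect the results in the original order.
def parallel_map(collection, &block)
  threads = collection.map { |item| Thread.new(item) { |i| block.call(i) } }
  threads.map { |thread| thread.value }
end

numbers = (1..8).to_a

sequential = numbers.map { |n| n * n }
backwards  = numbers.reverse.map { |n| n * n }.reverse
threaded   = parallel_map(numbers) { |n| n * n }

puts sequential == backwards && sequential == threaded
```

This only holds because the block has no side effects. As soon as the block appends to a shared array or updates a counter, the order of execution starts to matter again.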
reduce to MapReduce
What kind of problem can be solved with MapReduce? Counting words is a basic example. Let’s say you want to count the number of occurrences of each word in War and Peace. You could do this simply in terms of inject, returning a hash of key-value pairs that list a word and its number of instances, like {"war" => 225, "peace" => 341}.

    # Read the whole book and split it into words on whitespace.
    words = File.read("/path/to/war_and_peace.txt").split
    # Hash.new(0) gives unseen words a starting count of zero.
    word_counts = words.reduce(Hash.new(0)) do |results, word|
      results[word] += 1
      results
    end

This code is not parallelizable. Fortunately, it only takes Ruby about a second to count the instances of each word in War and Peace, which means that distributed MapReduce is only needed for much larger problems. But what if we wanted word counts of every book in Project Gutenberg? Or of every page on the entire internet? Or what if our calculation function took longer? There are 574,780 total words in the English translation of War and Peace that I’m using; if each word took a second to process, due to a network call or a complex calculation, it would take about 6.7 days (574,780 seconds divided by 86,400 seconds per day) to process the book. Proust would take three weeks. Yikes!
That’s where MapReduce comes in. Instead of processing the entire list with a single reduce, imagine splitting the text of War and Peace into 200 even chunks. These chunks would then be mapped to 200 different servers, with each server doing its own parallel word counting, like this:

    word_chunks.map do |chunk|
      assign_to_server(count_words(chunk))
      #[{"the" => 1}, {"cat" => 1}, {"the" => 1}, {"dog" => 1}]
    end

In reality, MapReduce works slightly differently; each chunk is represented as the value in a key-value pair, with the key being the identifier for that chunk (like 1..200 when using 200 even chunks, or the ID of a Google FileSystem cluster, or a filename when each server gets a different file). This key is useful for managing a MapReduce operation; if server XYZ goes down, the master program knows that XYZ was handling chunk 19, and we can process chunk 19 again. So what we have is more like this:

    word_chunks.each do |chunk_key, words|
      assign_to_server(count_words(chunk_key, words))
    end

You might be asking yourself: if the map phase always creates values of “1” attached to a particular word, so a mapper might end up with [{"the" => 1}, {"the" => 1}, {"the" => 1}], why even use a hash? Why not just create a big nested array of words ([["the","the","the"]]), group by word, and count the elements in each array?
Well, first, MapReduce isn’t always used for word counts, so in another use, the values returned by map might be more significant. Second, this lets us introduce a “combiner” stage between map and reduce to optimize the process. This stage creates a local count for a particular server, which reduces bandwidth and makes life easier for the reduce stage. Without this stage, a mapper might return [{"the" => 1}, {"cat" => 1}, {"the" => 1}, {"dog" => 1}, {"the" => 1}], leaving it up to whichever reducer handles “the” to sum these numbers (along with other instances of “the”). But the combiner creates local sums, meaning the mapper will actually return [{"the" => 3}, {"cat" => 1}, {"dog" => 1}].
When all the distributed word counts finish, the results are grouped by key. So all the “cat” key/value pairs are grouped into one list, and the “dog” key/value pairs are grouped into another list. These are then reduced for a final word count. Our reduce pseudocode might look something like this.

    grouped_results.each do |key, values|
      #key: "cat"
      #values: [1,3,12,9,1,2]
      total[key] = values.sum
    end

These reductions can be distributed, so each pass through grouped_results can be handled by a different server, because the summing of instances of “war” is completely independent of the summing of instances of “peace”.
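Putting the stages together, here is a single-process sketch of the whole word count: map with a combiner, group by key, then reduce. It runs on one machine with a tiny sample sentence, so it only shows the shape of the data at each stage; the method names (count_words, group_by_key, reduce_counts) are assumptions for this example rather than part of any MapReduce framework.

```ruby
# Map stage with a combiner: count words locally within one chunk.
def count_words(words)
  words.reduce(Hash.new(0)) do |counts, word|
    counts[word] += 1
    counts
  end
end

# Grouping stage: collect the per-chunk counts for each word into one list.
def group_by_key(partial_counts)
  partial_counts.reduce(Hash.new { |hash, key| hash[key] = [] }) do |grouped, counts|
    counts.each { |word, count| grouped[word] << count }
    grouped
  end
end

# Reduce stage: each word is summed independently of every other word.
def reduce_counts(grouped)
  grouped.reduce({}) do |totals, (word, counts)|
    totals[word] = counts.reduce(0) { |sum, n| sum + n }
    totals
  end
end

text = "the cat saw the dog and the dog saw the cat"
word_chunks = text.split.each_slice(4).to_a # stand-in for server-sized chunks

partial_counts = word_chunks.map { |chunk| count_words(chunk) }
totals = reduce_counts(group_by_key(partial_counts))

puts totals.inspect
```

In a real system the map over word_chunks and each pass through reduce_counts would run on different machines; the grouping step in the middle is the part the framework handles for you.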
There is more to MapReduce than this; other pieces handle the fault tolerance, the grouping of keys between the map and reduce stages, etc. And the actual distribution of processing introduces a lot more complexity. If you actually want to use MapReduce, take a look at Hadoop, or a few Ruby distributed processing systems inspired by MapReduce (Skynet, Starfish).
Hopefully this post will give you a conceptual understanding of MapReduce; it’s an interesting and powerful architecture. Just remember that it is not the end-all of distributed processing, and just because it’s appropriate for Google doesn’t mean it is appropriate for you. In fact, MapReduce can only handle a certain array of problems; if you want to distribute video transcoding across multiple machines, for example, MapReduce can’t really help you. Keep in mind too that you shouldn’t reinvent the wheel – if Hadoop can help you, it’s already built and built well. But it never hurts to understand something new.
Cluster Computing and MapReduce (video series)
Map and reduce are two of the most important internal iterators in functional programming. But in my experience as a Ruby developer, while map is frequently used, it should be used a bit more; and reduce (== inject) is underused and often misunderstood.
So how do you know when to use map or reduce on a collection? Simple. When iterating through an array, if you don’t want a return value from the operations, use each; and if you’re looking for a return value, use the iterator method that delivers the type of value you want returned. So if you want to take a collection and return a subset of that collection based on some criteria, use select. (See an earlier article for more.) If you want to return a transformed version of each element, use map. And if you want to return any value whatsoever, or a value that doesn’t match another iterator method, use reduce.
As an aside, do reduce and map have anything to do with the MapReduce architecture for distributed processing? Not surprisingly, the answer is “yes,” and I’ll talk more about that later this week.
inject, reduce, fold
One function, three names. If you’re a Ruby user and have access to Ruby 1.8.7, I suggest you forget the name inject altogether; I find it confusing, personally, and moving forward, inject has another name: reduce. This is much better, and I’ll discuss terminology in a minute (along with a third common name for this function: fold).
reduce takes in an array and reduces it to a single value. It does this by iterating through a list, keeping and transforming a running total along the way. This running total can be a single value (0, 3.7, “abcdefg”), a collection ([], {}), or anything else, really. Each iteration starts with the return value of the previous iteration and does something with it.
Formally, reduce takes three arguments: a collection, an initial value (which is used on the first pass), and a function to apply at each pass through the collection. Here is a Ruby example that uses reduce to sum a series of numbers:

    (5..10).reduce(0) do |sum, value|
      sum + value
    end

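Let’s walk through this example in more detail. The three arguments passed to reduce here are the collection (5..10), the initial value (0), and the block that adds each value to the running sum. Each pass then looks like this:
| Pass # | Collection Value | Running Total | Return Value |
|---|---|---|---|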
| 1 | 5 | 0 (initializer value) | 5 |
| 2 | 6 | 5 | 11 |
| 3 | 7 | 11 | 18 |
| 4 | 8 | 18 | 26 |
| 5 | 9 | 26 | 35 |
| 6 | 10 | 35 | 45 |
The return value from this function will be 45. At each pass, the function takes two values: the return value from the previous pass (or the initializer value for the first pass), and the current element in the array. (This is the |sum, value| part of the Ruby example.)
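If you’d rather watch the passes happen than read them off a table, this small sketch records what the block sees on each pass. The trace array is just bookkeeping added for the illustration.

```ruby
trace = []

total = (5..10).reduce(0) do |sum, value|
  result = sum + value
  trace << [value, sum, result] # collection value, running total, return value
  result
end

trace.each_with_index do |(value, running_total, result), index|
  puts "pass #{index + 1}: value=#{value} running total=#{running_total} returns #{result}"
end
puts total
```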
What would this example look like using each instead of reduce?

    sum = 0
    (5..10).each do |value|
      sum += value
    end

    sum

Any time you see this (anti)pattern – initializing a variable, looping to change the variable, and returning the variable – you know you need a new collection function. In this case, reduce does the trick.
Personally, while Ruby’s block syntax makes code beautifully readable, I sometimes have trouble keeping track of how this syntax relates to a straightforward functional syntax. After all, I described reduce as taking three arguments: a collection, a starter value, and a function. But in the Ruby example above, I’m only passing one argument (0) to reduce. So if it helps, here is another way to think about reduce, in pseudo-scheme.

    (reduce + 0 (range 5 10))

Here we’re explicitly passing three arguments to reduce: + (the addition operator), 0 (the seed value), and the range of numbers from 5 to 10 (our collection). Remember that (5..10).reduce(0) {|sum, value| sum + value } does exactly the same thing, just rearranged a bit.
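The same rearrangement can be spelled out in Ruby. Below is a sketch of a free-standing function that takes its three arguments in the pseudo-scheme order; reduce_with is a hypothetical name chosen so it doesn’t clash with the built-in reduce.

```ruby
# Hypothetical free-standing reduce: function, seed value, collection,
# in the same order as the pseudo-scheme call.
def reduce_with(function, seed, collection)
  result = seed
  collection.each { |value| result = function.call(result, value) }
  result
end

add = lambda { |sum, value| sum + value }

explicit   = reduce_with(add, 0, 5..10) # all three arguments spelled out
with_block = (5..10).reduce(0, &add)    # the block is the function argument
with_name  = (5..10).reduce(0, :+)      # Ruby 1.8.7+ also accepts an operator name

puts [explicit, with_block, with_name].inspect
```

All three calls do the same work. The block form just moves the collection to the front and the function to the end.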
Let’s look at a slightly more complicated case. reduce can be used to implement just about any other collection function, from map to sort to select. Here is a way to emulate select using reduce.

    (1..10).reduce([]) do |result, value|
      result << value if value > 5
      result
    end
    # [6, 7, 8, 9, 10]

You can also emulate map with reduce, like this:

    (1..10).reduce([]) do |result, value|
      result << value * value
      result
    end
    # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Of course, you wouldn’t want to do this. Whenever possible, you’re generally better off using a more specific function, like map in this case. If you want to sum numbers, use a sum function instead of reduce. If you want a hash, try build_hash. (I say “generally”, because there are also diminishing returns – creating a new reduce-style iterator for every possible use of reduce is overkill. Use your judgment.)
But this shows you the power of reduce; reduce can be used to implement any other internal iterator. Any time you want to take a collection and return something else – a value, another collection, etc. – reduce is capable.
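A few more shapes reduce can produce, as a quick sketch. The sample words are made up for the example; the point is that the same method returns a hash, a single element, or a reordered array depending on the seed and the block.

```ruby
words = %w[war peace war and peace war]

# Build a hash: word => number of occurrences.
counts = words.reduce(Hash.new(0)) do |result, word|
  result[word] += 1
  result
end

# Pick a single element: the longest word (first one wins on a tie).
longest = words.reduce do |best, word|
  word.length > best.length ? word : best
end

# Build another collection: the same words in reverse order.
reversed = words.reduce([]) { |result, word| [word] + result }

puts counts.inspect
puts longest
puts reversed.inspect
```

Note that the second call passes no seed at all, in which case reduce uses the first element of the collection as the starting value.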
This function has three names: “inject”, “reduce”, and “fold”. All make sense from one perspective.
A reduce function called on a 10 element array could return a 100 element array, or it could return a single integer, or a hash, or something else.
So that’s reduce. If you’re having trouble getting your mind around it, I recommend reading up a bit more, because it is an important concept. It is also important to understanding MapReduce.
map
map takes an array, applies a function to each element, and returns a new array with the results. Here is its equivalent using each.

    email_addresses = []
    users.each do |user|
      email_addresses << user.email
    end
    email_addresses

We can improve upon this using map.

    users.map do |user|
      user.email
    end

This is quite a bit simpler than reduce, and I’m not going to spend much time on it. If you’re an experienced Ruby programmer, you’ve probably used map hundreds of times. If it’s new to you, just remember that map takes an array and returns an array of exactly the same size. And think of some practical uses of map:
These aren’t the only important iterator functions, by any means. But map, reduce, and select are among the most important. Get them solidly under your belt, and you’ll write better code. They’ll also help you from a conceptual standpoint; MapReduce isn’t exactly map + reduce; it can even be implemented in languages that don’t have map or reduce capabilities. But they form the conceptual foundation of MapReduce, and MapReduce works because of specific properties of map and reduce. More on that later this week.
If you’re a programmer, you’ve probably worked through one or more books teaching you the syntax of a new language. I’ve had this experience with half a dozen languages, like C, JavaScript, and Perl. These books typically introduce loops midway through the syntax discussion, after datatypes and control flow, but before I/O and advanced features.
Loops are almost always presented according to this formula.
1. while loop, with difference between do while and while do.
2. for loop, the while loop’s crazy cousin.
3. foreach loop if language is sufficiently high-level.

And that’s it – you know how to loop through code; time to move on.
Not so fast. If you’re lucky enough to use a language that draws from functional programming, you shouldn’t loop like this.
From now on, I’m going to use Ruby for examples, but this article isn’t about Ruby. It is about transitioning from primitive loops to iterating through collections, and from generic collection functions (like each) to more specific functions (like map).
For the last several months, I’ve been working on Tumblon, a medium-sized Rails application. I’ve worked on 15-20 Ruby applications over the last three years, probably totaling 50,000 lines of Ruby code.
I’ve only used a primitive loop once.
That primitive loop was a loop {} loop, forever polling a task list looking for jobs. In other words, a loop with no exit condition beyond ^C or a server crash. As far as I know, Ruby doesn’t have a for loop at all, which would explain why I haven’t used it. It has a foreach loop (for item in arr), but that’s syntactic sugar for arr.each {}.
So the first reason why I’ve only used a simple loop in one case: the each concept is usually a better option. Its Ruby implementation will be familiar to anyone who’s seen Ruby code before:

    ["horse", "pig", "cow"].each do |animal|
      puts "Old MacDonald has a #{animal}"
    end

(Yes, I have a small child.)
This is far cleaner than its for or while loop alternatives. And it is a better abstract representation of what we’re doing: we aren’t looping with an exit condition, we are iterating through an array. But what if you want to do something a fixed number of times? Even that can be understood as traversing a list, like [1,2,3,4,5,6,7,8,9,10].each {}. Of course, Ruby provides a cleaner version: 10.times {}.
So if your loop is working through a list of some sort, each is a better abstraction of the problem. And in my experience building Ruby applications, every loop but one has been traversing a list. Parsing XML? Traversing a collection. Summing numbers? Traversing a collection. Reading in a textfile? Listening to STDIN? Working with rows in a database? Traversing a collection. That’s what each loops do well.
arr.each
But each isn’t the final word. It is a step up from a primitive for or while loop when working with a collection of values, but many each loops should be replaced with other array methods, like map, inject, and select.
When is each useful? Simple: when you want to create side-effects, like saving to the database, printing a result, or sending a web service call. In these cases, you’re not concerned with the return value; you want to change state on the screen, the disk, the database, or something else. Take a look at this code.

    User.find(:all).each do |user|
      Notification.deliver_email_newsletter(user)
    end

You don’t need a return value from this – you need emails to be delivered.
But don’t use each if you want to extract some new value from an array. That’s not what it’s for. Instead, take a look at three other powerful functions: map, inject, or select. To see why, let’s take a look at select. Here is code that takes in an array, and creates a new array from elements that match a certain condition, using each.

    active_users = []
    users.each do |user|
      active_users << user if user.active?
    end
    active_users

Man, the first and last lines are ugly. Why do you have to initialize and return active_users? Answer: because this is a misuse of each. You are much better off using select (or its equivalent, find_all):

    users.select do |user|
      user.active?
    end

Using select is shorter, easier to understand, and less bug-prone. And more importantly, it clearly encapsulates one common use of each (and looping in general).
Two other key functions – map and inject (or reduce) – complement select and follow a similar pattern. And not surprisingly, they form the foundation of the mapreduce approach to distributed processing. I’ve written more about map and reduce in another article, and here is shorthand for knowing which of these functions to use:
| Desired Return Value | Function |
|---|---|
| New array with same number of values | map |
| New array composed of part of the old array | select |
| Single value (though this value can be an array) | inject |
| none | each |
Use each for changing state. Otherwise, avoid side-effects and use “functional” array methods that return a value. Simple. Your code will be cleaner and less bug prone.
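Here are the four rows of that table side by side on one small collection. The struct and the sample users are hypothetical stand-ins for a real model, used only so the sketch runs on its own.

```ruby
# Hypothetical stand-in for a User model.
user_class = Struct.new(:name, :email, :active)

users = [
  user_class.new("Ann", "ann@example.com", true),
  user_class.new("Bob", "bob@example.com", false),
  user_class.new("Cy",  "cy@example.com",  true)
]

# New array with the same number of values.
emails = users.map { |user| user.email }

# New array composed of part of the old array.
active_users = users.select { |user| user.active }

# Single value.
name_length = users.inject(0) { |sum, user| sum + user.name.length }

# No return value needed: side effects only.
users.each { |user| puts "Emailing #{user.email}" }
```

Only the last line changes anything outside itself. The first three can be read, tested, and reordered without worrying about what else they touch.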
And remember the dead giveaway:
1. Initialize a variable (new_arr = [])
2. Loop with arr.each, changing the initialized value
3. Return the variable (return new_arr)

Whenever you see this pattern, you know you’ve got an each loop that needs swapping out.
(Edit: I’ve posted a follow-up article with more about map and reduce.)
Lanyards. If I had to sum up RubyFringe in one word, that would be it. It wasn’t the most profound or even an interesting part of the RubyFringe experience, but when I picked up my registration packet, the lanyards immediately caught my attention. That’s when I knew RubyFringe was going to be different. Why? No ads.
Conference badge lanyards are usually emblazoned with some sponsor’s logo or slogan, making the attendees unwilling billboards for whatever the sponsor is shilling. Not so at RubyFringe. They were simple black twine.
As you’ve probably heard by now, RubyFringe was awesome. It was definitely the best conference I’ve ever been to. The only competition I can think of is the super-high energy local BarCamps here in town.
Not taking sponsorship played a big part in RubyFringe’s success. It made the conference expensive (not more than RailsConf, which has a price that makes me expect caviar for lunch) but at every event, you knew you paid for it. It was all for you, not part of some sponsor’s largess.
RubyFringe was also single-track both in sessions and in parties. This also made a big difference. There was no “cool kids” party. Everyone was at the same party. And the parties were awesome. Great food, lots of drinks, fun people.
The “girlfriend babysitting” sightseeing track also made the conference more fun. Let’s face it, we are in a heavily male-dominated industry, and there is nothing less fun than partying with a bunch of dudes. Adding some women to the mix makes it more human. Plus, the girlfriends got a good taste of Toronto while we were nerding out. My wife sometimes comes with me to conferences, but I don’t think she usually has as much fun as she did at RubyFringe.
The talks were also of a universally high quality. Keeping the talks to 30 minutes really focused the speakers. Pete Forde and his fellow “curators” did an excellent job of selecting interesting speakers – who were then able to talk about whatever they wanted: Jazz. Philosophy. Life lessons. Entrepreneurship. Selling. The talks were surprisingly non-technical for the most part, but it worked out well. Much of the time, it’s more interesting to hear about someone’s experiences than Yet Another Ruby Library.
Keep your eyes peeled for the talks on InfoQ. They are all worth watching.
I’ve already posted a preliminary version of what I talked about. The full video is coming on InfoQ soon, and I will link to it when it does.
Additionally, here are my slides and the handout that I made.
Whew. That’s a long title. I am talking about taking an existing Rails application deployed with Apache and Mongrel and upgrading it to use Phusion Passenger and Ruby Enterprise Edition.
Here’s what I did.
First, if you’re going to use Phusion Ruby Enterprise Edition (I wanted to because of the recent Ruby security problems), I recommend installing it before Passenger. Installation is straightforward and does not conflict with your existing Ruby installation.
It supposedly can pick up gems installed with your old copy of Ruby, but I couldn’t figure out how to get that to work, so I had to install some gems again. You do that like this:
sudo /opt/ruby-enterprise-1.8.6-20080709/bin/gem install gem-name
We vendor all our gems except for a few that require native compilation or I just plain can’t get working in the vendor directory. I had to install hpricot, mime-types, and image_science.
The reason to start with Ruby Enterprise Edition is that you’ll have to reconfigure Passenger to use it if it’s not installed already.
Next, install Passenger. Again, this is straight-forward. However, make sure you install it with the Ruby Enterprise Edition binary.
sudo /opt/ruby-enterprise-1.8.6-20080709/bin/gem install passenger
Then build the Apache module. You may need to install the Apache development header files if they’re not already there.
sudo /opt/ruby-enterprise-1.8.6-20080709/bin/passenger-install-apache2-module
This will give you some lines of code to put in your httpd.conf file. Since you used Ruby Enterprise Edition to run the command, the PassengerRuby variable, the PassengerRoot, and the location of the mod_passenger.so will be inside your /opt/ruby-enterprise-X.X.X-YYYYMMDD directory tree.
Now you need to remove mod_proxy_balancer and mod_rewrite from your application’s Apache config file. The DocumentRoot you had before ought to be fine—Passenger can detect when Apache is serving up a Rails application.
This is what I had in my config file before:

    # Configure mongrel_cluster
    <Proxy balancer://tumblon_cluster>
      BalancerMember http://127.0.0.1:8000
      BalancerMember http://127.0.0.1:8001
    </Proxy>

    RewriteEngine On

    # Prevent access to .svn directories
    RewriteRule ^(.*/)?\.svn/ - [F,L]
    ErrorDocument 403 "Access Forbidden"

    # Check for asset hosts
    RewriteRule %{REMOTE_HOST} ^assets\d.* [L]

    # Check for maintenance file and redirect all requests
    RewriteCond %{DOCUMENT_ROOT}/system/maintenance.html -f
    RewriteCond %{SCRIPT_FILENAME} !maintenance.html
    RewriteRule ^.*$ /system/maintenance.html [L]

    # Rewrite index to check for static
    RewriteRule ^/$ /index.html [QSA]

    # Rewrite to check for Rails cached page
    RewriteRule ^([^.]+)$ $1.html [QSA]

    # Redirect all non-static requests to cluster
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
    RewriteRule ^/(.*)$ balancer://tumblon_cluster%{REQUEST_URI} [P,QSA,L]

Obviously, we are not using mod_proxy_balancer at all, so that can just go (also watch out for the last RewriteRule, which uses the balancer).
The RewriteRules – for the most part – aren’t needed because Passenger handles cached pages automatically. Also, Passenger conflicts with mod_rewrite and turns it off by default, so they don’t work anyway.
I don’t have an alternative for the maintenance page, but I don’t use that anyway.
Bonus tip: Set your Rails environment. If your application runs in anything other than production, you need to set it here. I have a staging server that runs in its own environment, so I set it like so:
RailsEnv staging
Only one more thing: update your deployment recipe. You don’t need to restart mongrel_cluster because there won’t be one.
I’m still in the dark ages using Capistrano 1.4.1, so this is the task I came up with:
    desc "Restart Phusion Passenger (restarts Apache)"
    task :restart_app, :roles => :app do
      sudo "#{apache_ctl} restart"
    end

    desc "Override the built-in restart task to restart Apache"
    task :restart, :roles => :app do
      restart_app
    end
Overriding restart is important because that’s what Capistrano will run. The default deprec version I was using restarted mongrel_cluster. This one restarts Apache.
You can restart Passenger by touching RAILS_ROOT/tmp/restart.txt but I prefer just to restart Apache.
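If you would rather go the restart.txt route, here is a minimal sketch of what "touching" the file amounts to in plain Ruby. The method name `request_passenger_restart` is made up for this illustration, and it assumes you pass in the root of the deployed application; it is not part of Passenger or Capistrano.

```ruby
require "fileutils"

# Hypothetical helper (not a Passenger or Capistrano API): creates or updates
# tmp/restart.txt under the given Rails root and returns the file's path.
def request_passenger_restart(rails_root)
  tmp_dir = File.join(rails_root, "tmp")
  FileUtils.mkdir_p(tmp_dir)            # make sure tmp/ exists
  restart_file = File.join(tmp_dir, "restart.txt")
  FileUtils.touch(restart_file)         # create the file or bump its mtime
  restart_file
end

# Example (the path is an assumption, substitute your own deploy directory):
# request_passenger_restart("/var/www/myapp/current")
```

In a deployment recipe you would run the equivalent `touch` on the server instead of restarting Apache.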
And that’s about all there is to it.
Go is an ancient strategy game with simple rules and a profound degree of complexity.
Software development is the art of managing complexity using a limited number of rules, structures, and patterns.
Programmers should play Go.
The beauty of Go is its combination of simplicity and complexity. On the one hand, Go has only a handful of rules. Place stones, don’t get completely surrounded, control territory. Like chess, the mechanics can be picked up in a few minutes, though Go only has a single type of “move”, and only one edge case (the ko rule). And like chess, one can spend a lifetime discovering the strategic and tactical layers of the game.
While chess is quite complex and rich, such that it took a 30-node supercomputer to defeat the reigning chess champion, no computer comes close to defeating even a skilled amateur Go player. There are 361 positions on a Go board, and with two players, there are 2.08168199382×10^170 valid positions. That’s quite a bit bigger than a googol (yes, that is the correct spelling). Realistically, there are something on the order of 10^400 possible ways that a typical game could play out. And the number of possible move sequences roughly follows 361!, which means that only about 40 moves in, there are already more than a googol possible ways that the game could shake down. (As a fun exercise, try plugging 361! into an online factorial calculator.)
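If you don't have a factorial calculator handy, Ruby's arbitrary-precision integers will do. This is a rough sketch under a simplifying assumption: it counts move sequences on an empty board as 361 × 360 × 359 and so on, ignoring captures, passes, and the ko rule. The method names are mine, for illustration only.

```ruby
# Rough count of move sequences on an empty 19x19 board, ignoring captures,
# passes, and ko: the first stone has 361 choices, the second 360, and so on.
BOARD_POINTS = 19 * 19
GOOGOL = 10**100

# Number of ordered ways to fill `moves` distinct points out of `points`.
def move_sequences(moves, points = BOARD_POINTS)
  (0...moves).inject(1) { |product, i| product * (points - i) }
end

# Number of decimal digits in a non-negative integer.
def digits(n)
  n.to_s.length
end

puts "361! has #{digits(move_sequences(BOARD_POINTS))} digits"

[20, 40].each do |moves|
  count = move_sequences(moves)
  puts "#{moves} moves: #{digits(count)} digits, more than a googol? #{count > GOOGOL}"
end
```

Under this simple count, the number of sequences passes a googol at around 40 moves, and the full 361! is far larger still.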
So how does one play Go, given this near-infinite complexity? On a tactical level, a player approaches Go like chess, thinking several moves ahead. But this only works in small spaces, like a tight battle in a small sector of the board. Beyond there, there are just too many possibilities. So on a strategic level, a player must think in shapes or patterns. These shapes provide shorthand ways of managing the complexity of Go. As a non-master, I may have no idea how things will proceed in one sector of the board, but I may be able to recognize strong and weak patterns of stones, vulnerable shapes and effective formations.
But there’s more: Go has several sorts of patterns. Beyond shapes, there are Go proverbs. These can be general: “Your opponent’s good move is your good move”; specific: “Don’t try to cut the one-point jump”; funny: “Even a moron connects against a peep”; and meta: “Don’t follow proverbs blindly.” These proverbs are principles which help a player make good decisions. They are less specific than shapes, and so they provide guidelines for whatever situations may arise on the Go board. Proverbs often conflict, and a player must determine when and how to apply them.
Finally, there are joseki. Joseki are patterns of play that are considered even for both sides. They typically happen in the corners of the board, and typically at the beginning of the game. Interestingly, there is a Go proverb that says “Learning joseki costs two stones,” meaning that memorizing these patterns isn’t helpful. Instead, a player should learn from joseki by understanding what is going on in each move.
Each of these Go patterns has a rough programming analogue.
Shapes in Go aren’t unlike software design patterns. While there is nothing preventing you from placing logic in your views, this shape is recognized to be a weak one. Think of Gang-of-Four design patterns: the MVC, Adapter, and Factory patterns are recognized to be helpful in some circumstances (and not appropriate in others). On a lower level, iteration and recursion have commonly recognized shapes, as do database normalization vs. denormalization. Even if you can’t hold an entire program or algorithm in your head at once, recognizing common shapes helps you to understand what is going on.
Go proverbs are like another type of pattern in software: CapitalizedPrinciples (for lack of a better term) made popular by Extreme Programming. Think DontRepeatYourself, YouArentGonnaNeedIt, CollectiveCodeOwnership, DailyBuild, TestFirst. These aren’t specific code “shapes”, like a singleton class – they are general principles that guide the practice of programming.
Because joseki is about exchange between competing parties, its programming parallel is a little less clear. The closest comparison, in my mind, is programming exercises. This article, for instance, suggests 9 exercises to help you become a better OO programmer, like:
In a real-world program, you’re unlikely to stick to these principles 100% of the time. But forcing yourself to write code in this way can be an eye-opening experience and can make you a better developer.
Obviously, these parallels are structural. Specific Go proverbs (“Your opponent’s good move is your good move”) may not have direct relevance to software development. So can Go really make you a better developer?
I think it can, and I’ll go one further. I think Go can make you smarter. There is a lot of anecdotal evidence to this effect [1] [2] [3], for example [4]:
In fact, all of our minds can benefit from playing Go, which officially has the capacity to make you smarter. Research has shown that that children who play Go have the potential for greater intelligence, since it motivates both the right and left sides of the brain.
The research mentioned isn’t footnoted, unfortunately, so take statements like this with a grain of salt.
But it makes sense: like chess, Go requires pattern recognition, a mix of strategic and tactical thinking, and comprehension of complex structures, though in Go the patterns are larger and the complexity is greater. A mind trained to think in these ways is going to have an easier time attacking similar problems in other spheres.
Like software development.
Image by andres_colmen: http://flickr.com/photos/andres-colmen/2539473895/
Next week at RubyFringe, I’ll be taking on one of the programming world’s favorite topics: testing.
Hear me out. Like everyone who’s had their bacon saved by a unit test, I think testing is great. In a dynamic language like Ruby, tests are especially important to give us the confidence our code works. And once written, unit tests provide a regression framework that helps catch future errors.
However, testing is over-emphasized. If our goal is high-quality software, developer testing is not enough.
This is important because of what Steve McConnell calls The General Principle of Software Quality. Most development time is spent debugging. “Therefore, the most obvious method of shortening a development schedule is to improve the quality of the product.” (Code Complete 2, p. 474.)
Developer testing has some limitations. Here are a few that I’ve noticed.
Programmers tend to write “clean” tests that verify the code works, not “dirty” tests that test error conditions. Steve McConnell reports, “Immature testing organizations tend to have about five clean tests for every dirty test. Mature testing organizations tend to have five dirty tests for every clean test. This ratio is not reversed by reducing the clean tests; it’s done by creating 25 times as many dirty tests.” (Code Complete 2, p. 504)
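To make the clean/dirty distinction concrete, here is a small sketch using Minitest. The `parse_quantity` method is a hypothetical helper invented for this example (it is not from any real codebase); the point is the shape of the test class, with one clean test and five dirty ones.

```ruby
require "minitest/autorun"

# Hypothetical helper, invented for this illustration: turns user input
# into a positive whole number or raises ArgumentError.
def parse_quantity(input)
  text = input.to_s.strip
  raise ArgumentError, "quantity is required" if text.empty?
  raise ArgumentError, "quantity must be a whole number" unless text =~ /\A\d+\z/
  quantity = text.to_i
  raise ArgumentError, "quantity must be at least 1" if quantity.zero?
  quantity
end

class ParseQuantityTest < Minitest::Test
  # The "clean" test: the happy path most of us write first.
  def test_parses_a_plain_number
    assert_equal 3, parse_quantity("3")
  end

  # The "dirty" tests: inputs the code is supposed to reject.
  def test_rejects_nil
    assert_raises(ArgumentError) { parse_quantity(nil) }
  end

  def test_rejects_blank_input
    assert_raises(ArgumentError) { parse_quantity("   ") }
  end

  def test_rejects_negative_numbers
    assert_raises(ArgumentError) { parse_quantity("-1") }
  end

  def test_rejects_zero
    assert_raises(ArgumentError) { parse_quantity("0") }
  end

  def test_rejects_trailing_garbage
    assert_raises(ArgumentError) { parse_quantity("3 apples") }
  end
end
```

The clean test took a few seconds to think of. The dirty ones are where the validation rules actually get pinned down.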
Robert L. Glass discusses this several times in his book Facts and Fallacies of Software Engineering. Missing requirements are the hardest errors to correct, because often times only the customer can detect them. Unit tests with total code coverage (and even code inspections) can easily fail to detect missing code. Therefore, these errors can slip into production (or your iteration release).
Tests alone won’t solve this problem, but I have found that writing tests is often a good way to suss out missing requirements.
Numerous studies have found that test cases are as likely to have errors as the code they’re testing (see Code Complete 2, p. 522).
So who tests the tests? Only review of the tests can find deficiencies in the tests themselves.
To cap it all off, developer testing isn’t all that effective at finding defects.
Defect-Detection Rates of Selected Techniques (Code Complete 2, p. 470)

| Removal Step | Lowest Rate | Modal Rate | Highest Rate |
|---|---|---|---|
| Informal design reviews | 25% | 35% | 40% |
| Formal design inspections | 45% | 55% | 65% |
| Informal code reviews | 20% | 25% | 35% |
| Modeling or prototyping | 35% | 65% | 80% |
| Formal code inspections | 45% | 60% | 70% |
| Unit test | 15% | 30% | 50% |
| System test | 25% | 40% | 55% |
The most interesting thing about these defect detection techniques is that they tend to find different errors. Unit testing finds certain errors; manual testing others; usability testing and code reviews still others.
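As a back-of-the-envelope illustration of why stacking techniques pays off, here is a sketch that combines three of the modal rates from the table. It assumes each technique catches defects independently of the others, which is my simplification and not something the table claims; real techniques overlap in some places and complement each other in others.

```ruby
# Modal detection rates taken from the table above.
MODAL_RATES = {
  "Formal code inspections" => 0.60,
  "Unit test"               => 0.30,
  "System test"             => 0.40
}

# Assumption (mine, not the book's): each technique misses defects
# independently, so the overall miss rate is the product of the
# individual miss rates.
def combined_detection_rate(rates)
  missed = rates.inject(1.0) { |product, rate| product * (1.0 - rate) }
  1.0 - missed
end

combined = combined_detection_rate(MODAL_RATES.values)
puts format("Unit test alone: %.1f%%", MODAL_RATES["Unit test"] * 100)
puts format("All three combined: %.1f%%", combined * 100)
```

Under that assumption, the three techniques together catch roughly 83% of defects, compared with 30% for unit tests alone.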

As mentioned above, programmers tend to test the “clean” path through their code. A human tester can quickly make mincemeat of the developer’s fairy world.
Good QA testers are worth their weight in gold. I once worked with a guy who was incredibly skilled at finding the most obscure bugs. He could describe exactly how to replicate the problem, and he would dig into the log files for a better error report, and to get an indication of the location of the defect.
Joel Spolsky wrote a great article on the Top Five (Wrong) Reasons You Don’t Have Testers—and why you shouldn’t put developers on this task. We’re just not that good at it.
Code reviews and formal code inspections are incredibly effective at finding defects (studies show they are more effective at finding defects than developer testing, and cheaper too), and the peer pressure of knowing your code will be scrutinized helps ensure higher quality right off the bat.
I still remember my first code review. I was doing the ArsDigita Boot Camp which was a 2-week course on building web applications. At the end of the first week, we had to walk through our code in front of the group and face questions from the instructor. It was incredibly nerve-wracking! But I worked hard to make the code as good as I could.
This stresses the importance of what Robert L. Glass calls the “sociological aspects” of peer review. Reviewing code is a delicate activity. Remember to review the code…not the author.
Another huge problem with developer tests is that they won’t tell you if your software sucks. You can have 1500% test coverage and no known defects and your software can still be an unusable mess.
Jeff Atwood calls this the ultimate unit test failure:
I often get frustrated with the depth of our obsession over things like code coverage. Unit testing and code coverage are good things. But perfectly executed code coverage doesn’t mean users will use your program. Or that it’s even worth using in the first place. When users can’t figure out how to use your app, when users pass over your app in favor of something easier or simpler to use, that’s the ultimate unit test failure. That’s the problem you should be trying to solve.
Fortunately, usability tests are easy and cheap to run. Don’t Make Me Think is your Bible here (the chapters about usability testing are available online). For Tumblon, we’ve been conducting usability tests with screen recording software that costs $20. The problems we’ve found with usability tests have been amazing. It punctures your ego, while at the same time giving you the motivation to fix the problems.
Unit testing forces us to think about our code. Michael Feathers gets at this in his post The Flawed Theory Behind Unit Testing:
One very common theory about unit testing is that quality comes from removing the errors that your tests catch. Superficially, this makes sense….It’s a nice theory, but it’s wrong….
In the software industry, we’ve been chasing quality for years. The interesting thing is there are a number of things that work. Design by Contract works. Test Driven Development works. So do Clean Room, code inspections and the use of higher-level languages.
All of these techniques have been shown to increase quality. And, if we look closely we can see why: all of them force us to reflect on our code.
That’s the magic, and it’s why unit testing works also. When you write unit tests, TDD-style or after your development, you scrutinize, you think, and often you prevent problems without even encountering a test failure.
So: adopt practices that make you think about your code; and supplement them with other defect detection techniques.
Why do we developers read, hear, and write so much about (developer) testing?
I think it’s because it’s something that we can control. Most programmers can’t hire a QA person or conduct even a $50 usability test. And perhaps most places don’t have a culture of code reviews. But they can write tests. Unit tests! Specs! Mocks! Stubs! Integration tests! Fuzz tests!
But the truth is, no single technique is effective at detecting all defects. We need manual testing, peer reviews, usability testing and developer testing (and that’s just the start) if we want to produce high-quality software.
This September, I’ll be presenting at RailsConf Europe on EC2, MapReduce, and Distributed Processing. The talk will explain the MapReduce approach to distributed processing, will show a few example implementations, and will discuss MapReduce vs. other distributed processing techniques.
Whether you’ll be there or not, if you’re interested in learning more about MapReduce, here are some resources. I’ll write a few more posts on the subject before the conference, so watch this space as well.
Cluster Computing and MapReduce is a great series of video lectures given to Google interns in 2007. The first two are the most appropriate: the first introduces distributed processing concepts, while the second covers MapReduce itself.
MapReduce: Simplified Data Processing on Large Clusters is the paper by Jeffrey Dean and Sanjay Ghemawat of Google that got things going in the first place.
MapReduce for Ruby: Ridiculously Easy Distributed Programming discusses MapReduce and introduces Starfish, a Ruby library for distributed processing. Starfish is not a MapReduce implementation, however – it takes a somewhat different approach to distributed processing.
Skynet (a few writeups: InfoQ, Dion Almaer) is another Ruby-based distributed processing system inspired by MapReduce.
Writing Ruby Map-Reduce programs for Hadoop discusses using Ruby to wrap Hadoop, a MapReduce-like system built in Java.
Introduction to Parallel Programming and MapReduce at Google Code University, a good overview of distributed processing and the MapReduce approach.
And finally, one article that you should avoid:
MapReduce: A major step backwards compares MapReduce to relational databases, and says that MapReduce loses out because it doesn’t support database indices, database views, Crystal reports, etc. Basically, the complaint is that MapReduce isn’t SQL compliant. WTF? Clearly, the author(s) didn’t understand what MapReduce is. The problem, as explained elsewhere, is that the authors thought that MapReduce == CouchDB/SimpleDB. Which is obviously not true. %s/MapReduce/SimpleDB the original article and it makes some sense. But long story short, this article will teach you nothing about MapReduce, and will likely confuse you further. So stay away.
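For a concrete picture of what MapReduce is (a processing pattern, not a database), here is a minimal single-process sketch of the classic word-count example. The method names are my own, and nothing here is distributed; in a real system the map and reduce calls would be farmed out to many machines, with the shuffle step routing each key to the worker that reduces it.

```ruby
# Map: turn each document into a list of [word, 1] pairs.
def map_phase(documents)
  documents.flat_map do |text|
    text.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
  end
end

# Shuffle: group the emitted values by key, e.g. "the" => [1, 1, 1].
def shuffle_phase(pairs)
  pairs.group_by { |word, _count| word }
       .transform_values { |group| group.map { |_word, count| count } }
end

# Reduce: collapse each key's list of values into a single total.
def reduce_phase(grouped)
  grouped.map { |word, counts| [word, counts.inject(0) { |sum, c| sum + c }] }.to_h
end

def word_count(documents)
  reduce_phase(shuffle_phase(map_phase(documents)))
end

p word_count(["the cat", "The dog and the cat"])
```

Because each map call only looks at one document and each reduce call only looks at one key, both phases can be split across as many workers as you have.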
If you’re just going to get one thing out of this article, it’s this: Vegas is a bad idea for RailsConf ‘09.
Here’s the story. At the end of RailsConf ‘08, in Portland for its second year, Chad Fowler tentatively announced the location of next year’s RailsConf, saying something like “We’re not sure what you’ll think about this, but what about Las Vegas?” Upon which the crowd erupted in cheering, thereby supposedly confirming the Vegas idea. But afterwards, I talked to a dozen people who said they thought Vegas was a bad idea. And in the month since then, I haven’t talked to a single person who was excited about it. That’s how crowds work, I guess – the 10% of Rails developers who enjoy gambling, strippers, and steak applauded loudly, and the 30% who don’t have a strong opinion one way or another got swept up in the excitement.
Of course, it doesn’t really matter – holding the next RailsConf in Las Vegas won’t kill Rails, and won’t set the Rails community on a future of drunkenness, adultery, and gambling. I’m sure most conference-goers will fly in, attend sessions, have dinner, and hack in their hotel lobbies, just like any other conference.
But it’s still a bad idea.
If you haven’t already, check out Giles Bowkett’s recent post on the situation. He talks about Ruby Central’s desire to “keep RailsConf weird.” What’s especially confusing, as Giles points out, is that the Vegas announcement came after DHH’s keynote, which said that we should get more sleep and use our advantages for good, not evil (in the form of hookers and fur coats). Charles Nutter made a similar point, saying that Rails + enterprise doesn’t have to mean steak and strippers.
As Giles puts it,
DHH is saying, “No hookers! Choose a life well-lived!” And RailsConf is like, “Screw Portland! We’re going to Vegas!”
Portland was a great place for a conference. Portland has great food & beer, cool hotels, the world’s biggest bookstore, and a dozen movie-theater-pubs. Every night after the conference, a thousand Rails geeks would descend upon the city’s first-rate brewpubs and coffee shops to meet people, eat and drink, and discuss programming, politics, philosophy, or whatever. $5.99 steak buffets just won’t be the same. I really enjoyed my two annual trips to Portland, and would love to go back again. If Ruby Central wants to keep RailsConf edgy, it can’t do much better than Portland.
Of course, if we don’t want to do RailsConfs ‘09, ‘10, ‘11, and ‘12 in the same place, there is no lack of great cities to consider. What about Seattle? Boston? Austin? San Diego? San Francisco? New York? Kansas City? Toronto? Vancouver? Minneapolis?
So enough about Vegas. I get asked from time to time if RailsConf is worthwhile. My answer is: “Yes (I think).”
The first RailsConf was small(er), high energy, and novel. Rails was just on the brink of mainstream acceptance, which is a fun time in the life of a technology – it’s growing rapidly but is still edgy. The sessions were generally high quality, and the keynotes were excellent (Dave Thomas, Martin Fowler, David Heinemeier Hansson, Paul Graham, and _why).
RailsConf 2007 was a completely different conference, what with O’Reilly and 1600 attendees. Ze Frank was great, but the overall session quality was pretty weak. There were some great talks, to be sure; but there weren’t enough really deep technical talks, and some of the presenters didn’t seem to have really practiced. Things were probably set off on the wrong note from the beginning: Day 0’s 3-hour tutorials were mostly disappointing. Also, there seemed to be an abundance of non-programmer business folks, probably there to check out this new thing called Rails.
This year’s conference was quite a bit better. It had the same polished feel that O’Reilly brings, which is both good and bad. Keynotes were mixed. But mainly, the sessions were mostly really good. Whatever David Black, Chad Fowler, and Rich Kilmer did to improve the session quality, it worked. Interestingly enough, it seemed like the non-programmers were gone this year.
RubyConf 2007 (the only one I’ve attended) felt a lot like RailsConf 2006. It had a similar size, a similar venue, and somewhat similar atmosphere. Sessions were good, and I learned quite a bit. Most of the people there were Rails developers, but they were the ones who were interested in Ruby as a language and not just as the technology behind Rails.
So my answer is that RailsConf is worth it, as long as it follows the 2008 path. It will never look like 2006 again; big RailsConfs are here to stay. But 2008 was reasonably graceful for a big conference. I just hope it doesn’t fall into the 2007 trap, which will happen if it tries to cater to managers and the mainstream. It’s impossible to know what next year’s conference will look like, though I think the success of 2008 was a conscious rejection of some of the failures of 2007. And if you’re looking for something smaller and edgier, there’s always RubyConf, RubyFringe, and the regional conferences.
(Two updates.
First, the title of this article was ambiguous, so I changed it from “Just say ‘no’ to RailsConf Las Vegas” to “Just say ‘no’ to Vegas, RailsConf”. The title was supposed to say “Vegas is a bad idea for RailsConf,” not “Don’t attend if it is in Las Vegas.” If you decide not to attend because of the location, that’s fine – but I’ll consider being there either way.
Second, thanks to Chad Fowler and David A. Black for weighing in. Conferences are a lot of work, as Luke knows on a smaller scale, and I can’t imagine what kind of work Ruby Central puts into RailsConf each year. Especially when they’d probably be hacking. :) So thanks for the work, and the conferences that result.)

As the top of the RubyFringe site says: “Only 0 days left to register!”
If you’ve been thinking about attending, now’s the time to pull the trigger. Registration ends tonight at midnight, EDT.