Notes on “Counting Tree Nodes”
Contents
Process
Before I get my hands dirty with the actual programming stuff, I’d like to note down a couple of things that occurred to me about how I work. (Feel free to skip ahead if you’re not interested in process — you won’t miss anything essential.)
Tools
I jumped straight into solving Drew’s problem without saying anything about my choice of tools.
The problem — counting nodes in trees — is fundamentally a simple one, so I reached for my usual equipment and didn’t try anything clever:
- Ruby is my go-to language for everyday programming, so that’s what I used to express my solution. A language with algebraic data types and pattern matching (e.g. OCaml or Haskell) would’ve made it easier to simply write out the answer, but I wanted to show my normal workflow and look for interesting things along the way rather than racing straight to the finish, so Ruby it was.
- I recently ended years of frustration with TextMate by going cold turkey on Vim, so I used Vim as my editor despite still being very clumsy with it. ctrlp.vim, vim-rspec, vim-endwise, vim-commentary and vim-gitgutter made things a bit easier.
- All of my projects live inside Git repositories to which I commit early and often, so I tried to do that here. (Performance anxiety made me forget about Git altogether for the first few minutes, which got me off to a shaky start.)
I also said almost nothing about my way of approaching the problem: I wrote a few simple acceptance tests in Cucumber, then dropped down to RSpec and wrote unit tests. I wish I’d taken a few minutes to explain that decision, but at the time it didn’t seem worth mentioning because it’s as much a part of my routine as Git, Vim and Ruby are.
I realise in hindsight that some people, particularly those who usually encounter Cucumber as part of a larger development environment (Rails + MySQL + Redis + Cucumber + Capybara + Selenium + …), may consider it overkill for a simple project like this, but I don’t find it to be. By itself, Cucumber is a simple tool for describing and organising acceptance tests — it began life as a feature of RSpec — and I encounter very little friction in using it to describe the outward behaviour of whatever I’m building.
For me, Cucumber works more like a mind hack than a testing tool. It reminds and invites me to think about the experience of using software separately from the details of its implementation. On this occasion I found it useful to have a dedicated space to talk to Drew about what the code was supposed to do and come up with some representative inputs for us to work with. There’s no reason why I had to use Cucumber for this — plain Ruby or RSpec acceptance tests would’ve been fine — but I’m comfortable with Cucumber so it felt natural to me.
Test doubles
When I’m writing unit tests I typically use test doubles (almost exclusively stubs and mocks) to isolate the code under test so that I can explore the problem in small, controlled pieces. That usually works well, but for some reason I didn’t feel comfortable doing it while working on this problem, and I’d like to better understand why.
It seems likely that my notion of what “isolate” means is too narrow-minded, too fine-grained, or just flat-out inappropriate for problems like this.
I mostly work on programs that can naturally be thought of as systems of collaborating objects, so it often makes sense to aim for “collaborator isolation”: treat a single object (and sometimes a single method) as the unit that’s being tested, and avoid exercising the rest of the system by using test doubles to stand in for any other objects that would otherwise be involved.
But in the case of the #descendant_count
and #leaf_count
methods I implemented in the episode, my instinct is that it’s less useful to think of isolation in those terms. That might mean that I need to change how I think about isolation, or it might mean that unit testing is less appropriate than other approaches (namely integration and acceptance testing) for this kind of code.
Off the top of my head, I can think of three possible reasons for that:
-
The methods are written in a purely functional style that doesn’t rely on mutating the internal state of objects, and it feels weird to use mocks to verify interactions that have no effect.
Perhaps that’s a result of the tension between object-oriented and functional programming: OO implicitly favours the “tell, don’t ask” style in which objects avoid asking other objects for information, but functional programming is all about functions calling other functions and making decisions based on the values they return.
So, alright, mocks are a bad fit for stateless functional code. But that doesn’t necessarily mean that other kinds of test double are a bad fit too; I’ve used stubs successfully in larger functional programs.
-
The methods are defined recursively: when they collaborate with another object, they’re calling themselves, just in a different context. In a sense the method is already “isolated” without replacing those collaborators with doubles, because the recursive calls don’t lead the flow of control outside of the code under test. Trying to force more isolation than this might be an unhelpful, dogmatic step too far.
-
The behaviour of one invocation of
#descendant_count
or#leaf_count
is so simple as to be uninteresting. What I really wanted to know was whether the behaviour across many recursive invocations was correct, which invited more of a classical (vs mockist) unit test, even straying into integration test territory.I generally find this kind of test less useful for designing programs and preventing bugs, although its black-box flavour does make it easier to refactor method internals without breaking the tests (this comes up in “Catamorphisms” below).
I don’t know if any of this is right — I’m just thinking out loud. There must be clever people out there who’ve already thought about this and come up with something more concrete than my vague speculations; if you’re one of them, or know someone who is, please tell me!
Programming
Alright, that’s enough hand-waving about process. Let’s get stuck into some actual code.
Immutable trees
I didn’t explain why I wanted the tree structure to be immutable, beyond some vague suggestion that I disliked the idea of clients messing with it. Like my decision to use Cucumber, this was really just an instinct that I didn’t feel the immediate need to discuss, but I now think some elaboration would’ve been useful.
Immutability is generally desirable for various reasons, but here it elegantly imposes an important constraint: a tree shouldn’t contain cycles.
Let’s say that we have a MutableNode
class which exposes an array of its children
:
>>class MutableNode attr_accessor :children def initialize self.children = [] end end
=> … >>tree = MutableNode.new
=> #<MutableNode @children=[]> >>tree.children = 2.times.map { MutableNode.new }
=> [#<MutableNode @children=[]>, #<MutableNode @children=[]>] >>tree.children.last.children = [MutableNode.new]
=> [#<MutableNode @children=[]>] >>tree
=> #<MutableNode @children=[ #<MutableNode @children=[]>, #<MutableNode @children=[ #<MutableNode @children=[]> ]> ]>
It’s easy to write a #height
method that recursively walks down a tree to find the number of nodes (versus edges, per Drew’s problem statement) in the longest path from its root to a leaf:
>>module Parent def leaf? children.none? end def height if leaf? 1 else children.map(&:height).max + 1 end end end
=> … >>MutableNode.include(Parent)
=> … >>tree.height
=> 3
But there’s nothing to stop us creating (perhaps accidentally) an infinitely-deep “tree” which contains itself:
>> tree.children.push(tree)
=> [
#<MutableNode @children=[]>,
#<MutableNode @children=[
#<MutableNode @children=[]>
]>,
#<MutableNode @children=[...]>
]
Now #height
will try to walk down the tree forever, eventually exhausting the Ruby interpreter’s call stack:
>> tree.height
SystemStackError: stack level too deep
This crash is symptomatic of a larger problem. We’d like to write straightforward code that makes reasonable assumptions about the structures it manipulates, but our MutableNode
implementation allows one such assumption to be violated.
An easy way of avoiding this is to make it more difficult to change the children
of a node after it’s been created:
>> class ImmutableNode
include Parent
def initialize(children)
self.children = children.dup
end
private
attr_accessor :children
end
=> …
(This isn’t true immutability, but it’s as close as we can conveniently get in such a permissive language; “Ruby is not a language to keep people away from horror”.)
Immutability solves the problem by forcing a particular order of operations. To build a tree out of immutable pieces, we must start by creating the leaves, then use those leaves to construct their parent nodes, and so on until we finish by creating the root node:
>>a, b = 2.times.map { ImmutableNode.new([]) }
=> [#<ImmutableNode @children=[]>, #<ImmutableNode @children=[]>] >>c = ImmutableNode.new([b])
=> #<ImmutableNode @children=[#<ImmutableNode @children=[]>]> >>tree = ImmutableNode.new([a, c])
=> #<ImmutableNode @children=[ #<ImmutableNode @children=[]>, #<ImmutableNode @children=[ #<ImmutableNode @children=[]> ]> ]> >>tree.height
=> 3
There’s no way to pass an ImmutableNode
or any of its ancestors into its own constructor, because neither the node nor its ancestors can possibly exist at the moment when the constructor is being called. The resulting tree is cycle-free by construction; the arrow of time prevents us from tying the knot.
Note that our #height
implementation treats the children
collection as Enumerable
, but of course Enumerable
makes no guarantees about finiteness, so we can still break #height
by supplying a collection of children
that goes on forever:
>>tree = ImmutableNode.new([ImmutableNode.new([])].cycle)
=> #<ImmutableNode @children=#<Enumerator: [#<ImmutableNode @children=[]>]:cycle>> >>require 'timeout'
=> … >>Timeout.timeout(60) { tree.height }
Timeout::Error: execution expired
I don’t know an elegant way of avoiding that problem in Ruby, although it seems less likely to be a problem in practice.
In other situations it may also be important to ensure that no children are shared between nodes — this is what differentiates trees from polytrees, multitrees and other kinds of directed acyclic graph — but that didn’t matter for any of the code I wrote this time.
Enumerators
The problem with ImmutableNode
is that it doesn’t let you see its children at all! We could fix that by adding a simple #children
getter, but then a client of an ImmutableNode
would be able to get hold of its children
array and mutate it directly. We could fix that by freezing children
, but freezing is a clumsy solution which introduces its own problems.
A more expressive alternative is to expose the children as an Enumerator
. An Enumerator
lets us generate a collection dynamically with code:
>>fruits = Enumerator.new do |yielder| yielder.yield 'apple' yielder.yield 'banana' yielder.yield 'cherry' end
=> #<Enumerator: #<Enumerator::Generator>:each> >>fruits.each do |fruit| puts fruit end
apple banana cherry => …
Because the Enumerator
class has the Enumerable
module mixed in, it supports all the usual Ruby collection methods:
>>fruits.count
=> 3 >>fruits.take(2)
=> ["apple", "banana"] >>fruits.map(&:upcase)
=> ["APPLE", "BANANA", "CHERRY"]
We can wrap any Enumerable
object (e.g. an array) with an Enumerator
by writing a block that explicitly iterates over it…
>>vegetables_array = %w(asparagus broccoli carrot)
=> ["asparagus", "broccoli", "carrot"] >>vegetables_enumerator = Enumerator.new do |yielder| vegetables_array.each do |vegetable| yielder.yield vegetable end end
=> #<Enumerator: #<Enumerator::Generator>:each> >>vegetables_enumerator.map(&:chop)
=> ["asparagu", "broccol", "carro"]
…but it’s much easier to use Ruby’s built-in #to_enum
method to achieve the same thing:
>>vegetables_enumerator = vegetables_array.to_enum
=> #<Enumerator: ["asparagus", "broccoli", "carrot"]:each> >>vegetables_enumerator.map(&:reverse)
=> ["sugarapsa", "iloccorb", "torrac"]
The point of all this is that an Enumerator
provides no way for a caller to modify its contents. Exposing an Enumerator
sends a strong message: “you may iterate over these objects, but you may not remove, add to or reorder them”:
>>vegetables_enumerator.push('daikon')
NoMethodError: undefined method `push' for #<Enumerator: ["asparagus", "broccoli", "carrot"]:each> >>vegetables_enumerator.delete('broccoli')
NoMethodError: undefined method `delete' for #<Enumerator: ["asparagus", "broccoli", "carrot"]:each>
(Note, however, that anyone with access to vegetables_array
can still mutate it, thereby indirectly changing the contents of vegetables_enumerator
. Again, Ruby’s permissiveness makes it hard to entirely prevent this sort of thing, so we have to content ourselves with designing interfaces that encourage the behaviour we want.)
A convenient and idiomatic way of exposing a node’s children is to provide an #each_child
method that takes a block argument. #each_child
calls #to_enum
on children
to create an Enumerator
, then immediately passes the block to the Enumerator
’s #each
method:
>> class EnumerableNode
def initialize(children)
@children = children.dup
end
def each_child(&block)
@children.to_enum.each(&block)
end
end
=> …
When #each_child
is called with a block, it’ll iterate over the children
in the obvious way:
>>tree = EnumerableNode.new([ EnumerableNode.new([]), EnumerableNode.new([ EnumerableNode.new([]) ]) ])
=> #<EnumerableNode …> >>tree.each_child do |child| puts child.class end
EnumerableNode EnumerableNode => …
But if we don’t provide a block, the #each
inside #each_child
will just return the Enumerator
to the caller untouched:
>> tree.each_child
=> #<Enumerator: …>
Once we have the Enumerator
containing all the children, we can call any Enumerable
method on it:
>> tree.each_child.count
=> 2
For our #height
method to work we only need children
to respond to #none?
and #map
, both of which are from Enumerable
, so we can implement #children
to simply return the Enumerator
and everything will be fine:
>>class EnumerableNode include Parent def children each_child end end
=> … >>tree.children
=> #<Enumerator: …:each> >>tree.height
=> 3
Adding and removing children
Although Enumerator
s are immutable, we can still use one as the basis for a modified collection by creating another Enumerator
:
>> more_fruits = Enumerator.new do |yielder|
fruits.each do |fruit|
yielder.yield fruit
end
yielder.yield 'damson'
end
=> #<Enumerator: #<Enumerator::Generator>:each>
This new Enumerator
iterates over fruits
, yielding every value it finds, and afterwards yields the extra value 'damson'
. As a result, more_fruits
behaves like fruits
with an extra member:
>>fruits.include?('damson')
=> false >>more_fruits.include?('damson')
=> true >>more_fruits.count
=> 4 >>more_fruits.drop(2).take(2)
=> ["cherry", "damson"]
We can do something similar to remove a member from a collection:
>> fewer_fruits = Enumerator.new do |yielder|
more_fruits.each do |fruit|
yielder.yield fruit unless fruit == 'banana'
end
end
=> #<Enumerator: #<Enumerator::Generator>:each>
This Enumerator
iterates over more_fruits
and yields all of its members except for 'banana'
. So fewer_fruits
behaves like more_fruits
with 'banana'
removed:
>>more_fruits.include?('banana')
=> true >>fewer_fruits.include?('banana')
=> false >>fewer_fruits.count
=> 3 >>fewer_fruits.each do |fruit| puts fruit end
apple cherry damson => …
Here the 'apple'
and 'cherry'
members are coming from the original fruits
collection, and 'damson'
is coming from more_fruits
, though we’re not getting its 'banana'
.
We can use this idea to support adding a child to an immutable node:
>> class EnumerableNode
def add_child(extra_child)
children = Enumerator.new do |yielder|
each_child do |child|
yielder.yield child
end
yielder.yield extra_child
end
self.class.new(children)
end
end
=> …
>>extra_child = EnumerableNode.new([ EnumerableNode.new([ EnumerableNode.new([]) ]) ])
=> #<EnumerableNode …> >>bigger_tree = tree.add_child(extra_child)
=> #<EnumerableNode @children=#<Enumerator: #<Enumerator::Generator>:each>> >>tree.children.include?(extra_child)
=> false >>bigger_tree.children.include?(extra_child)
=> true >>bigger_tree.height
=> 4
And similarly, we can remove a child too:
>> class EnumerableNode
def remove_child(unwanted_child)
children = Enumerator.new do |yielder|
each_child do |child|
yielder.yield child unless child == unwanted_child
end
end
self.class.new(children)
end
end
=> …
>>unwanted_child = bigger_tree.children.drop(1).first
=> #<EnumerableNode @children=[#<EnumerableNode @children=[]>]> >>smaller_tree = bigger_tree.remove_child(unwanted_child)
=> #<EnumerableNode @children=#<Enumerator: #<Enumerator::Generator>:each>> >>bigger_tree.children.include?(unwanted_child)
=> true >>smaller_tree.children.include?(unwanted_child)
=> false >>bigger_tree.children.count
=> 3 >>smaller_tree.children.count
=> 2
Whether this technique is actually more sensible or efficient than a more conventional approach (e.g. calling #entries
on the Enumerator
and mutating the resulting array) depends on the particular application, but it’s worth knowing about, because it’s one way of achieving the “structural sharing” used by many persistent data structures.
Catamorphisms
One of the methods I wrote during the episode was #descendant_count
:
def descendant_count
if leaf?
0
else
children.map(&:descendant_count).inject(:+) + children.count
end
end
Compare this to the definition of #height
from earlier:
def height
if leaf?
1
else
children.map(&:height).max + 1
end
end
There are obvious structural similarities between these two methods. They’re both recursive, and they both do basically the same job with only minor variations:
-
if the node is a leaf:
-
return a predetermined value
- for
#descendant_count
, that value is0
- for
#height
, it’s1
- for
-
return a predetermined value
-
if the node isn’t a leaf:
- call the current method on every child node and collect the resulting values
-
combine those values into an overall result
- for
#descendant_count
, sum the values and add the number of children - for
#height
, find the largest value and add1
to it
- for
- return the overall result
This kind of similarity is a strong signal that our program has some underlying structure waiting to be discovered. There’s value in identifying that structure: at the very least it’ll tell us something interesting about the inherent nature of the problem we’re trying to solve; on a good day it might help us better understand how to solve it; and ideally it’ll suggest a way to make our implementation more consistent and expressive.
In this case, we can extract the general pattern of recursive tree-walking into a method called #fold
:
>> module Parent
def fold(leaf_result, &combine_results)
if leaf?
leaf_result
else
child_results = children.map { |child|
child.fold(leaf_result, &combine_results)
}
combine_results.call(child_results)
end
end
end
=> …
Once we have #fold
, it’s easier to write methods like #descendant_count
and #height
. We just have to plug in a leaf value and a block for combining child results:
>>module Parent def descendant_count fold(0) { |counts| counts.inject(:+) + counts.count } end def height fold(1) { |heights| heights.max + 1 } end end
=> … >>[tree, bigger_tree, smaller_tree].map(&:descendant_count)
=> [3, 6, 4] >>[tree, bigger_tree, smaller_tree].map(&:height)
=> [3, 4, 4]
(Incidentally, this is an example of how using test doubles would’ve made the tests more brittle, because now we’re recursively calling #fold
on each child rather than #descendant_count
/#height
, even though the actual work being done hasn’t changed. I don’t know how to solve this without sacrificing some test isolation.)
And of course it’s just as easy to write other methods in a similar vein:
>>module Parent def leaf_count fold(1) { |counts| counts.inject(:+) } end end
=> … >>[tree, bigger_tree, smaller_tree].map(&:leaf_count)
=> [2, 3, 2]
The episode uses a version of #leaf_count
which takes a depth
argument, and fitting that into this pattern requires a bit more work. Firstly, we can use #fold
to write a method that returns a count of all leaves grouped by depth:
>>module Parent def leaf_count_by_depth fold(1 => 1) { |counts| increment_keys(sum_hashes(counts)) } end private def sum_hashes(hashes) hashes.inject(&method(:add_values)) end def add_values(h1, h2) h1.merge(h2) { |k, v1, v2| v1 + v2 } end def increment_keys(hash) hash.map { |k, v| [k + 1, v] }.to_h end end
=> … >>another_tree = EnumerableNode.new([ EnumerableNode.new([ EnumerableNode.new([]) ]), EnumerableNode.new([]), EnumerableNode.new([ EnumerableNode.new([]), EnumerableNode.new([]) ]) ])
=> #<EnumerableNode …> >>another_tree.leaf_count_by_depth
=> {3=>3, 2=>1}
Then #leaf_count
can simply pick the relevant value out of that hash:
>>module Parent def leaf_count(depth:) leaf_count_by_depth.fetch(depth, 0) end end
=> … >>another_tree.leaf_count(depth: 2)
=> 1 >>another_tree.leaf_count(depth: 3)
=> 3 >>another_tree.leaf_count(depth: 4)
=> 0
This version of #leaf_count
contains more code than the one in the episode, but it’s composed from simpler parts, which makes it easier to check that it’s doing the right thing.
#fold
’s job might remind you of Enumerable#inject
(a.k.a. #reduce
), which isn’t a coincidence; methods like #inject
are usually called folds in functional programming languages. Our #fold
takes the idea of folding a block over a flat collection of values and generalises it to a more complex data structure, in this case a tree. This pattern of using the shape of a data structure to guide the processing of its contents is called a catamorphism.
You might also be justifiably sceptical of this abstraction’s complexity, since it’s possible to write both #descendant_count
and #height
without even bothering to check whether the node is a leaf, by taking advantage of the behaviour of #inject
and #max
when children
is empty:
>>module Parent def descendant_count children.map(&:descendant_count).inject(0, :+) + children.count end def height (children.map(&:height).max || 0) + 1 end end
=> … >>[tree, bigger_tree, smaller_tree].map(&:descendant_count)
=> [3, 6, 4] >>[tree, bigger_tree, smaller_tree].map(&:height)
=> [3, 4, 4]
And yes, in this case I’m mentioning catamorphisms mostly because they’re interesting rather than because they’re necessarily an appropriate response to the task in hand. Abstractions like this must be used with care, because their sometimes esoteric benefits can so easily be outweighed by the cost of understanding them; naïve and obvious programs are usually better than clever and obfuscated ones.
On the other hand, the usefulness of this particular abstraction scales with the complexity of the data structure we’re processing. The trees in the episode don’t store any data, but imagine a different kind of tree built from instances of two classes, Leaf
(a node containing a value) and Branch
(a node with children):
>>class Leaf attr_reader :value def initialize(value) @value = value end end
=> … >>class Branch def initialize(children) @children = children.dup end def each_child(&block) @children.to_enum.each(&block) end end
=> …
A #fold
for this structure might look like this:
>>class Leaf def fold(make_result, combine_results) make_result.to_proc.call(value) end end
=> … >>class Branch def fold(make_result, combine_results) child_results = each_child.map { |child| child.fold(make_result, combine_results) } combine_results.to_proc.call(child_results) end end
=> …
We could use these trees to store instances of a Fruit
class:
>>Fruit = Struct.new(:name)
=> … >>fruit_tree = Branch.new([ Branch.new([ Leaf.new(Fruit.new('apple')), Leaf.new(Fruit.new('banana')) ]), Leaf.new(Fruit.new('cherry')), Branch.new([ Leaf.new(Fruit.new('damson')), Leaf.new(Fruit.new('elderberry')) ]) ])
=> #<Branch …>
And then we’d be able to easily write code to process the tree’s Fruit
in different ways:
>>fruit_tree.fold :name, :join
=> "applebananacherrydamsonelderberry" >>fruit_tree.fold \ -> fruit { fruit.name.length }, -> lengths { lengths.inject(:+) }
=> 33 >>fruit_tree.fold \ -> fruit { fruit if fruit.name.length == 6 }, -> fruits { fruits.flatten.compact }
=> [ #<struct Fruit name="banana">, #<struct Fruit name="cherry">, #<struct Fruit name="damson"> ]
There’s no reason why #fold
can’t also produce a tree. For example, we could reverse the order of every Branch
’s children:
>> fruit_tree.fold \
Leaf.method(:new),
-> children { Branch.new(children.reverse) }
=> #<Branch @children=[
#<Branch @children=[
#<Leaf @value=#<struct Fruit name="elderberry">>,
#<Leaf @value=#<struct Fruit name="damson">>
]>,
#<Leaf @value=#<struct Fruit name="cherry">>,
#<Branch @children=[
#<Leaf @value=#<struct Fruit name="banana">>,
#<Leaf @value=#<struct Fruit name="apple">>
]>
]>
Or we could #dry
all the tree’s Fruit
while leaving its structure unchanged:
>>class Fruit def dry self.class.new("dried #{name}") end end
=> … >>fruit_tree.fold \ -> fruit { Leaf.new(fruit.dry) }, Branch.method(:new)
=> #<Branch @children=[ #<Branch @children=[ #<Leaf @value=#<struct Fruit name="dried apple">>, #<Leaf @value=#<struct Fruit name="dried banana">> ]>, #<Leaf @value=#<struct Fruit name="dried cherry">>, #<Branch @children=[ #<Leaf @value=#<struct Fruit name="dried damson">>, #<Leaf @value=#<struct Fruit name="dried elderberry">> ]> ]>
So catamorphisms can save effort in more complex cases. When done judiciously, the extra convenience of using an abstraction may well justify the cost of understanding how it works.
Conclusion
I had a lot of fun working on this problem. At its best, programming is an adventure into the unknown; the goal is to unlock the trove of interesting gubbins hidden inside even the simplest task, and this was a great chance to do that. I enjoyed having an excuse to focus on something small and dig out fun things to think about. (I suppose this is what I enjoyed about writing a book about simple machines.)
I did find it surprisingly difficult to work while being recorded, and only partly because of the challenge of making my brain do several things at once. The situation is not exactly like a job interview, but it’s similarly nerve-racking: being watched heightens self-consciousness, which in turn inhibits clear thinking and makes silly mistakes more likely. I’d like to get better at doing this so that I can program in public without feeling nervous.
In hindsight it would’ve been nice to get further with solving the actual problem, too, but I’m hopeful that my diversions away from the destination were interesting enough to justify the slow progress.
Oh, and I definitely need to get better at Vim.