Part 4: Trees

Part four of Murad's Computer Science 101 series

Written by Murad Iusufov
On May 15, 2025

This article is part of a series on computer science fundamentals. If you’re new here, you should probably catch up with previous pieces before reading on!

Finally. We’re getting to the good stuff.

Let’s recap: recently, we introduced Linked Lists, which allowed us to link (see?) multiple values in a sequential manner. In a linked list, each value has one successive value, except for the final value, which points to nothing, a.k.a. nil or null.

Now, what if we took our Linked List, and allowed its nodes to have more than one successive value? For starters, let’s not get too wild and allow only two values instead of one. What would that look like?

If you squint really hard, you can imagine a real tree in that figure. Here, let me help you a bit.

In practice, trees are usually depicted from top to bottom, because that’s intuitively how hierarchies work. And trees are really good for modelling hierarchical relations. We’ll talk about that in a bit, but first, let’s look at some code.

type TreeNode struct {
  value int       // the actual value we store in this node
  left *TreeNode  // pointer to the left sub-tree
  right *TreeNode // pointer to the right sub-tree
}

I’m once again using Go to make the typing more obvious.

A TreeNode bears a remarkable resemblance to a LinkedListNode:

type LinkedListNode struct {
  value int            // the actual value we want to store
  next *LinkedListNode // pointer to the next node in our list
}

The obvious difference is that instead of a single next pointer, we have two, left and right. And that little difference opens up a myriad of possibilities.

👉 An astute reader might have noticed that left and right point to sub-trees. That’s right, if you look really close, you can see that the big tree consists of several smaller ones. Algorithms for working with trees are usually recursive and heavily utilise this handy trait.

How can a binary tree help with implementing a binary search?

👉 This is what we have just introduced: a tree where each node has at most two children, hence binary. Non-binary trees, where nodes can have any number of children, exist too.

What on earth is a binary search? Where did that come from?

Don’t worry, you didn’t miss anything. We haven’t talked about it before. I’m only introducing it now because it’s a very useful demonstration of what trees are capable of.

If you've ever run git bisect, it uses the same idea.

First, let me explain the idea behind the binary search.

Suppose we had an array filled with numbers, and we wanted to find a particular value in that array. A naïve implementation would have us go through each item in the array, compare it with the search value, and return it as soon as we found it. Because at worst we'd have to look at every item, we know that this is O(n): in the worst case, we go through the entire array without finding our value. See this post for an in-depth reminder.

However, if our array were sorted (as it conveniently is in the figure), we can significantly improve our search.

Let’s follow an example. Say, we’re searching for 69.

Instead of starting in the beginning, we start right in the middle and ask, “is this value equal to, greater than, or less than 69?”

In our case, we land on array index 3. 66 is less than 69. That tells us that 69 can only be somewhere to the right (yes, it's the same right that we have already seen in trees! we'll talk more about their similarities in a moment) of 66.

Okay, that’s great information, because now we only need to search the sub-array contained between array indices 3 (not inclusive) and 6. So what do we do? The same exact thing. Recursion, baby.

Once again, we look in the middle, index 5, and get 99. 99 is greater than 69. That tells us that 69 can only be somewhere to the left of 99.

So what do we do? The same exact thing. Recursion, baby.

This time, we search between indices 3 (not inclusive) and 5 (not inclusive). It’s an array of size 1, so we look in the middle of it, and find our 69. Nice.

The complexity of this algorithm is O(log_2 n). It’s very good. Let’s compare some numbers:

It would take only 24 iterations to find the number in a 10M array, as opposed to, well, 10M iterations. Logarithmic complexity is very good, have I said that already?

👉 I suppose I should explain why the complexity is logarithmic. With each iteration, we reduce the remaining search area by a factor of 2. E.g., we go from 128→64→32→16→8→4→2→1 in 7 steps. If we look at it from another angle, 2^7 = 128. So, log_a(n) is the operation inverse to a^n. We know that a^n is very bad, so log_a(n) is very good.

Notice how we replace 2 with a there. That’s because when we’re dealing with exponents (that’s the fancy name for taking a number to the power of n) and logarithms (that’s the fancy name for doing the reverse), we don’t really care about the precise value of a, as long as a>1. Exponential complexities are very bad either way, and logarithmic complexities are very good either way.

The a > 1 condition is very important. We're dealing with exponents, a.k.a. powers, which are repeated multiplication, and 1 is the identity value for the operation of multiplication. Identity means that if we multiply any value a by I (the identity value, or 1), we get the same value a back.

E.g., 3 × 1 = 3. Colloquially, 1^n = 1, because it doesn't matter how many times you multiply by 1, you get the same value back.

When you multiply by a value greater than 1, no matter how small it is, you get a bigger value. For a somewhat surprising example, when a = 1.01, then 1.01^1000 ≈ 21000.

Conversely, when you multiply by a value less than 1 (in absolute terms, we're not talking about negative values here), no matter how close to 1 it is, you get a smaller value. For example, when a = 0.99, then 0.99^1000 ≈ 0.00004.

Now, we could implement binary search with a binary tree instead of a sorted array. I think a picture is worth a thousand words here:

Notice how we start at 66, then move to the right sub-tree, look at 99, then move to the left sub-tree, and look at 69. We’ve just done the same exact thing, but with a binary tree!

How’s a tree better than a sorted array?

In a static world, it’s not.

However, trees offer more consistent performance when it comes to updating the data contained in them. Suppose we needed to insert a new value into our sorted array: we would have to shift up to n values in memory. And when you store millions or billions of values, you don't want to move them around all the time.

And where might you want to store billions of values in a single data structure? Databases, of course.

For example, one of our tables, payment_decisions, has 1.4B records. Databases tend to make sure that the ID values of those records are unique, and what's the best way to find out if a value is present in a ginormous list? Binary search, of course! It would take a mere 31 iterations to do it with 1.4B values.

Now, the cool thing about trees is that inserting, updating and deleting values is just as fast as looking them up, i.e. O(log n). That's because to insert a value, we could simply try to look it up, which takes O(log n), and then add the new value in O(1): it's as simple as creating a new TreeNode and adding a link from the (former) tree leaf to the new leaf.

👉 That’s a naïve algorithm for insertion, which can, in the worst case, devolve our tree into a linked list. E.g., if we started with an empty tree, and each new value that we inserted was greater than the last one (that would happen if, say, we were inserting IDs of new records: 1, 2, 3 and so on), we would always insert into the right sub-tree. The result would essentially be a linked list, with linked list performance implications.

I will say that it’s possible to implement an O(log n) insertion algorithm that does not have that kind of drawback, but I will not actually show it. I will only say that it involves tree rotations, which are an O(1) operation.

This enables transactional databases like Postgres or MySQL to be highly efficient when dealing with single record operations, like lookups, insertions, updates or deletions.

👉 Now, those DB engines do not use our simple binary trees. They use B-trees, where B does not stand for binary. We will not go into the details of B-trees, as that’s wildly beyond the scope of a 101-level course, but it helps to be aware of them.

Our data structures are getting pretty interesting, aren’t they?

They will get even more interesting. See you next time!
