• Reflections on Learning "How to Grow (Almost) Anything"

    Wow, what a journey this has been!

    Yesterday, Wednesday the 11th of May, marked the final project presentations for committed listeners in MIT’s synthetic biology course, How To Grow (Almost) Anything, class of 2022.

    When I started the course, sometime at the end of January, I had no idea what a blast this would be (nor did I have any idea of how much pressure it would put on me, but hey, I survived, and then some 😁 ).

    I credit this class with taking me from rudimentary biology knowledge to a fairly solid understanding of molecular biology and immunology, all in about 3-4 months.

    As you can probably imagine, this was… intense.

    But before I dive deeper into the experience and what I learnt in this course, let me give some background on the course and the context of my participation.

    Some background

    Anyone who knows me is aware of my awe at the potential of synthetic biology. There’s no doubt in my mind that it’s going to be the next exponential growth sector after the computer industry. The field itself is fairly nascent, but it’s going to pick up the pace in the next 5-10 years. And although I believe that all its subdomains hold massive potential, I’m especially keen on synthetic biology applications in human therapeutics.

    Now, I make no secret of the fact that I want to found a synthetic biology company in the (hopefully, not so distant) future.

    Given that, you can guess that I was generally on the lookout for any opportunity to dig a bit deeper into the field.

    With that in mind (along with a hefty bit of luck - fate? 😁 ), I came across a tweet by the course’s organiser, David Kong, advertising the course, along with the fact that this year it would admit a small cohort of students attending remotely.

    Wasting no time, I shot an email to David, and was surprised when I got a reply a couple of days later admitting me to the course and sharing Zoom links for the course’s first lecture.

    The Attendance Experience

    The course was set up as two classes per week:

    • one three-hour lecture, in which the first hour was spent reviewing the previous week’s homework, and the next two hours had different presenters give talks on various topics in synthetic biology, and
    • a two-hour class the next day, which included a recitation session - previewing the next week’s homework - along with a homework review for online listeners.

    To my surprise, a relatively high number of online students participated (~45, IIRC - about three times the number of MIT students physically attending), but only about a third of the online students made it through to the end.

    Attendance was taken every lecture, for both online listeners and the students physically present at MIT. We were also expected to do all the exercises for the week - with the TAs ensuring that this was the case.

    Midway through the course, we also had a handful of hands-on workshops (of which we had to attend some), aptly named the Festival of Growing.

    The Lectures

    The course was structured around a number of themes (therapeutics, biofabrication/biomaterials, DNA/RNA editing, microfluidics, etc.), and every week we had two (sometimes three) different presenters (ranging from university professors to company CEOs) speak on the topic at hand.

    The first couple of weeks were focused on teaching the foundations of molecular biology (on a very, very accelerated timeframe), and the following weeks assumed competence in molecular biology, building on it to present the state of the art in applications across various domains.

    It’s probably worth noting that the course attracted people from a wide variety of backgrounds - you could find architects, designers, computer scientists, chemists, etc. As you can imagine, many of these people didn’t have a background in biology (or had a very shallow one, myself included), and the intensity of the ramp-up in the first few weeks might have contributed to the dropouts.

    For the MIT/Harvard students, most of these lectures were also followed by a lab component. For online listeners, we would partially follow along with those processes (given the lack of wet-lab access for some of us), or, at other times, we would be given substitute exercises involving online tools (such as Benchling) to simulate the in-vitro experiments in-silico.

    The Festival of Growing

    Sometime around the middle of the semester, we paused the online lectures for a couple of weeks in favour of attending various workshops. Those workshops were structured around a number of different themes - examples included growing mycelium at home and performing DNA sequencing using a nanopore device.

    Some of these workshops were in-person and some were online. We did manage to watch one of the in-person workshops (the nanopore sequencing one) via a TA streaming it, which was nice. In retrospect, this would be a good idea for all of the in-person workshops - maybe that will happen from next year onwards?

    Both online and MIT/Harvard students had to attend at least one of the workshops (attendance was taken), with no upper limit on how many you could attend.

    I think the Festival of Growing was a nice element of the course’s structure, and it also contributed to making the course feel a lot more dynamic (compared to a hypothetical alternative consisting only of lectures).

    The Final Project

    A relatively big part of the course (and the main focus of the last two weeks) was the preparation of the final project. It had been made clear from the start that this was an integral part of the course, and that everyone who stuck it out to the end (online listener or on-premise student) would have to present one.

    We had a lot of leeway in choosing our project - though we were strongly encouraged to pick a topic that fell within the experience/knowledge range of the available TAs. For the on-premise students, the idea was that the project would include a lab component; for online students this was considered optional (given that only some people had access to a lab), but we still had to present on a topic considered feasible (if speculative), and the expectation was that the project would be fairly well researched.

    Each of us doing a final project was assigned a mentor (a TA or industry professional) who served as a point of contact for questions and guidance on the project.

    My Own Experience

    I have to start with the fact that I enjoyed all of the course, and that I’m grateful to MIT/the organisers for admitting us online listeners to the course.

    Having said that, the first couple of weeks were very hard - they required a lot of outside studying (we’re talking several textbooks’ worth of reading) to catch up to the required knowledge. If you have to juggle a job as well, and you’re not a full-time student, it can definitely feel pretty overwhelming at times.

    (By the way, this is coming from someone who already commits 3-4 hours of study on a daily basis, if that helps underline the above paragraph - though to be fair, I think that a lot of the pressure was coming from the tight weekly deadlines for exercises.)

    I recall that at one point during the first two weeks I felt the presenters were speaking something other than English. 😓 The nice thing is that persistence does pay off, and by the end you are capable of fluently conversing with everyone else in this weird little dialect of English.

    The Festival of Growing was also very nice, but, personally, I was a little miffed that the workshop I was most looking forward to (metabolic simulation in Python) got cancelled without explanation a day before it was to be presented.

    For me, the most enjoyable part of the course was the final project, because by that point I already knew enough to do some competent background research for it.

    In Conclusion

    In this final bit, I want to say that I wholeheartedly recommend this course to anyone who is keen on synthetic biology, if only to see some of the people on the forefront of the domain discuss the possibilities and what they are currently exploring.

    I also have a feeling that David and his course are going to be kingmakers of synthetic biology 😉 During the presentations you would very often hear that person X, CEO of Y, is an alumnus of HTGAA from a few years ago. I think this trend is going to continue for many more years to come.

    Again, many thanks to the organisers (David in particular), the lecturers, the TAs, and everyone else who contributed to this course.

    Special thanks from my end to the following people as well:

    • Jessica Weber, for entertaining a barrage of questions I posed during one of the lectures.
    • Karolina Sulich, for providing some early assistance during my project, and for introducing me to Amy Holt.
    • Amy Holt, who provided specific answers to some immunology-related questions I had regarding my final project.

    You can find my course notes, exercises, and final project at my course notes Notion page.

  • Nucleotide Counting the TDD Way - A Bioinformatics Stronghold Story

    A few days ago, I was going through my code archives and came across my old solutions to the Bioinformatics Stronghold by Project Rosalind. For those of you who don’t know, Project Rosalind is one of those problem-based skill-development platforms, akin to Project Euler but aimed at nascent bioinformaticians.

    Coincidentally, I had finished a book on Test-Driven Development (TDD) in Ruby around the same time. I particularly enjoyed TDD as the book presented it, and I was looking for some problems to apply it to - the Stronghold is the perfect lab space for those new ideas, because its problems are relatively simple from an algorithmic perspective.


    My initial foray into the Stronghold was an attempt at comparative solutions, wherein I attempted the exercises in a number of different programming languages (Python, Go, OCaml, Racket, etc.) while observing any differences in the style I chose, refactoring them, benchmarking them, and all around having some good fun with them 😄

    One thing I’m embarrassed to admit about that first attempt, though, is that while my solutions worked (or so I conveniently recall), they contained no reproducible documentation of their satisfying the requirements of the exercises - there were no assertions, no tests, nothing. 😅 I can only attribute it to my enthusiasm to get each solution done and move on to the next one.

    Bad me.

    I’m now revisiting these exercises to atone for my insolence. I’m going to go through the exercises again, but this time I’m going to be approaching them in a TDD/BDD style.

    But before I move on to actual code, what on Earth is TDD (and its cousin, BDD)? We already saw that TDD stands for Test-Driven Development. By that, we mean a programming approach that encourages writing tests for the feature code before the actual feature code is written, so that the tests guide the design of the code itself.

    Behaviour-Driven Development (BDD for short) is a sister approach to TDD; for the purposes of this article, BDD means writing the tests so that they read as a prose specification of the behaviour of the module under test.

    Okay, are you ready now? Let’s roll.


    I’m going to start with the first exercise, which is about counting nucleotides in a DNA strand, given as input in the form of a string.

    Now, if you visit the problem page, you will see that it also gives us a sample dataset, along with the expected output for that.

    (You may have noticed that the above sentence is just a particularly verbose way of describing a test case. 😉 )

    Our implementation language for this exercise is going to be Ruby, for two reasons.

    1) I started learning Ruby recently and I’ve been enjoying it a lot, and 2) Ruby comes with excellent built-in support for testing in the form of the Minitest library.

    Let’s quickly make a file called counting_nucleotides.rb, and add an empty test specification:

    require 'minitest/autorun'
    
    describe 'nucleotide counting function' do
    end
    

    At this point it’s worth pausing to think about our test cases before we move forward.

    We have already been given a sample input, as mentioned above, that we could use as a test case. The problem with that particular input, however, is that it describes the functionality of the module when it’s finished.

    That’s probably a bit too elaborate to use now, as we start designing our function.

    We probably want something much simpler - indeed, this is what TDD as an approach is advocating. What might be simpler for us?

    What about a strand with a very small length? Say, 8 bases long?

    Sounds like it should work.

    require 'minitest/autorun'
    
    describe 'nucleotide counting function' do
      it 'returns a count of 2 2 2 2 for strand "ATCGATCG"' do
        strand = 'ATCGATCG'
        nucleotide_count = '2 2 2 2'
        result = count_nucleotides(strand)
    
        assert_equal nucleotide_count, result
      end
    end
    

    Looks good for a first test. Let’s run it and see what happens.

    $ ruby counting_nucleotides.rb
    Run options: --seed 64068
    
    # Running:
    
    E
    
    Finished in 0.000265s, 3773.5860 runs/s, 0.0000 assertions/s.
    
      1) Error:
    nucleotide counting function#test_0001_returns a count of 2 2 2 2 for strand "ATCGATCG":
    NoMethodError: undefined method `count_nucleotides' for #<#<Class:0x000000015e831da0>:0x000000015b940028>
        counting_nucleotides.rb:8:in `block (2 levels) in <main>'
    
    1 runs, 0 assertions, 0 failures, 1 errors, 0 skips
    

    Ahh, it complains that we “forgot” to define our function count_nucleotides. Easy to fix, let’s add a function called that, taking in one parameter, but with an empty body.

    require 'minitest/autorun'
    
    def count_nucleotides(strand)
    end
    
    describe 'nucleotide counting function' do
      it 'returns a count of 2 2 2 2 for strand "ATCGATCG"' do
        strand = 'ATCGATCG'
        nucleotide_count = '2 2 2 2'
        result = count_nucleotides(strand)
    
        assert_equal nucleotide_count, result
      end
    end
    

    Nice, let’s run it again, and see what we get.

    $ ruby counting_nucleotides.rb
    Run options: --seed 20901
    
    # Running:
    
    F
    
    Finished in 0.000287s, 3484.3209 runs/s, 3484.3209 assertions/s.
    
      1) Failure:
    nucleotide counting function#test_0001_returns a count of 2 2 2 2 for strand "ATCGATCG" [counting_nucleotides.rb:12]:
    Expected: "2 2 2 2"
      Actual: nil
    
    1 runs, 1 assertions, 1 failures, 0 errors, 0 skips
    

    Okay, this seems a bit more intriguing. Now it doesn’t come back to us with an error. Rather, it comes back with a failure, indicating that the test got executed, but the expected and the actual results differ. In our case, we asserted in our test that we expect the result to be a string with the contents "2 2 2 2" in exactly that form (spaces and everything), but we got back nil from the actual execution.

    The reason we get nil in particular is that Ruby is an expression-based language.

    In an expression-based language every program fragment is an expression, meaning that it will return a value upon execution of that program fragment (we call that, in more technical terms, expression evaluation).

    A function, thus, is also an expression, and upon evaluation it returns the value of the last expression in its body. For a function with an empty body, there is no such expression, so a default of nil is returned.
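
    To see this in action, here’s a tiny sketch you could run in irb:

    # A method with an empty body evaluates to nil...
    def nothing; end
    nothing  # => nil
    
    # ...while a non-empty method returns its last expression's value.
    def four
      2 + 2
    end
    four     # => 4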


    Right, so we ran our test, and got back nil for a return value. This is our clue that the test is executing the function as we expect, but the function is not yet implemented (it has an empty body). Let’s crack on with that.

    In Ruby, there’s a very convenient method defined on the String type whose job is to count occurrences of characters from a given set (single characters, in our case).

    To no one’s surprise, it’s called count.
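
    (One caveat worth knowing: count takes a set of characters, not a substring. A quick sketch of its behaviour:)

    'ATCGATCG'.count('A')    # => 2
    'ATCGATCG'.count('AT')   # counts every A and every T => 4
    'ATCGATCG'.count('ATCG') # => 8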

    Let’s use that to count the number of nucleotides and return that in a string format.

    def count_nucleotides(strand)
        strand.count('A') + strand.count('C') + strand.count('G') + strand.count('T')
    end
    

    Let’s run our test again.

    $ ruby counting_nucleotides.rb
    Run options: --seed 3996
    
    # Running:
    
    F
    
    Finished in 0.000411s, 2433.0901 runs/s, 2433.0901 assertions/s.
    
      1) Failure:
    nucleotide counting function#test_0001_returns a count of 2 2 2 2 for strand "ATCGATCG" [counting_nucleotides.rb:13]:
    Expected: "2 2 2 2"
      Actual: 8
    
    1 runs, 1 assertions, 1 failures, 0 errors, 0 skips
    

    Whoops!

    That looks like it counted the occurrences correctly. However, because String#count returns an Integer, the + operator summed all the occurrences together instead of presenting them in a formatted string.

    Let’s fix that in the easiest way that comes to mind - using string concatenation, casting each count from integer back to string, and adding some spaces (so that we match the expected output):

    def count_nucleotides(strand)
        strand.count('A').to_s + " " + strand.count('C').to_s + " " + strand.count('G').to_s + " " + strand.count('T').to_s
    end
    

    Let’s see what happens now…

    $ ruby counting_nucleotides.rb
    Run options: --seed 37259
    
    # Running:
    
    .
    
    Finished in 0.000321s, 3115.2648 runs/s, 3115.2648 assertions/s.
    
    1 runs, 1 assertions, 0 failures, 0 errors, 0 skips
    

    It worked! I mean, our code is a bit atrocious and could be better, but here we have our first working version. 🎉


    Now that we have our first unit test passing, let’s think a bit more about the test cases we want.

    We want at least a representative sample of each of the following:

    • Positive cases,
    • Negative cases,
    • Degenerate cases.

    Positive cases are cases in which we exercise the happy path - the code path we were most anticipating when we were designing our code.

    Negative cases are cases in which we divert away from the happy path, and try to exercise error conditions, etc.

    Degenerate cases are cases in which we test around boundary conditions, such as empty lists, strings, etc, to see if our function can handle these cases.

    We already have a test for a positive case, so right now, it might make more sense for us to test against a negative case. So what would be a negative case for us?

    We are being passed a strand in as a string. Given that a string can have a lot more characters than just the four representing nucleobases, what happens if we have a string that has characters that don’t represent a nucleotide base? What happens if we have something like ATCGW for a strand? Let’s see.

    it 'throws an exception if a non-base encoding character is found in the strand' do
        strand = 'ATCGW'
        
        assert_raises(ArgumentError) { count_nucleotides(strand) }
      end
    

    Let’s run it to see what happened this time.

    $ ruby counting_nucleotides.rb
    Run options: --seed 54524
    
    # Running:
    
    .F
    
    Finished in 0.000506s, 3952.5692 runs/s, 3952.5692 assertions/s.
    
      1) Failure:
    nucleotide counting function#test_0002_throws an exception if a non-base encoding character is found in the strand [counting_nucleotides.rb:19]:
    ArgumentError expected but nothing was raised.
    
    2 runs, 2 assertions, 1 failures, 0 errors, 0 skips
    

    😱

    Yikes! We were expecting an exception to be raised, but none was. That means our code processed the invalid string without complaint. Let’s fix that.

    Let’s fix that the simplest way we can: by defining a list of illegal characters for the strand string and seeing if they are present in the string.

    That gets us the following version of count_nucleotides:

    def count_nucleotides(strand)
        illegal_chars = 'BDEFHIJKLMNOPQRSUVWXYZ'  # every uppercase letter except A, C, G, T
        illegal_chars.split('').each do |char|
            if strand.include?(char) then
                raise ArgumentError.new('Illegal character in strand ' + char)
            end
        end
        strand.count('A').to_s + " " + strand.count('C').to_s + " " + strand.count('G').to_s + " " + strand.count('T').to_s
    end
    

    Let’s see where we stand now:

    $ ruby counting_nucleotides.rb
    Run options: --seed 25460
    
    # Running:
    
    ..
    
    Finished in 0.000348s, 5747.1265 runs/s, 5747.1265 assertions/s.
    
    2 runs, 2 assertions, 0 failures, 0 errors, 0 skips
    

    Nice. That works - both our positive and negative test cases now pass.

    The only downside is that we are left with a count_nucleotides function that looks a bit hard to read - not to mention a bit wasteful, too.

    (Aside: it loops through the string far more times than it needs to - once per illegal character it checks for, and then once more per base it counts.)


    At this point, it’s worth pausing to reflect on where we are in the process so far.

    TDD is a loop of the following 3 steps:

    1. Write a failing test.
    2. Make the test pass.
    3. Refactor the implementation.

    (Refactoring is the reorganisation of code with the aim of improving it with regard to some metric - say, robustness, readability, or performance.)

    Up until this point, we have been focusing on the first two steps, but have done none of the third.

    We usually refactor once we get some of our implementation done, and all our tests are passing.

    In other words, now is as good a time as any to refactor our code.


    Let’s have a look at our feature code, the count_nucleotides function.

    What if, instead of looping so many times, we looped just once, collecting the counts and watching out for illegal characters at the same time?

    That does sound like it should improve our performance, now, doesn’t it?

    Let’s go ahead and do this, and see what happens.

    def count_nucleotides(strand)
        count_a = 0
        count_t = 0
        count_c = 0
        count_g = 0
    
        strand.split('').each do |base|
            if base == 'A' then
                count_a += 1
            elsif base == 'T' then
                count_t += 1
            elsif base == 'C' then
                count_c += 1
            elsif base == 'G' then
                count_g += 1
            else
                raise ArgumentError.new('Invalid character in strand ' + base)
            end
        end
    
        "#{count_a} #{count_t} #{count_c} #{count_g}"
    end
    

    Looks simpler to me. Does it work, though?

    $ ruby counting_nucleotides.rb
    Run options: --seed 48449
    
    # Running:
    
    ..
    
    Finished in 0.000378s, 5291.0053 runs/s, 5291.0053 assertions/s.
    
    2 runs, 2 assertions, 0 failures, 0 errors, 0 skips
    

    It does.

    And just like that, we saw the massive benefit of having automated tests: we made some pretty significant structural changes to the function under test, and yet we can be confident that its observable behaviour remains unchanged, to the extent of our tests’ coverage.
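
    (As an aside: a seasoned Rubyist might collapse this into a one-pass version built on Enumerable#tally. A sketch, assuming Ruby 2.7+; the hypothetical count_nucleotides_tally below is not the function we’ll keep testing:)

    def count_nucleotides_tally(strand)
      counts = strand.chars.tally                  # e.g. {"A"=>2, "T"=>2, ...}
      illegal = counts.keys - %w[A C G T]
      unless illegal.empty?
        raise ArgumentError.new('Invalid character in strand ' + illegal.first)
      end
    
      # emit the counts in A C G T order, defaulting missing bases to 0
      %w[A C G T].map { |base| counts.fetch(base, 0) }.join(' ')
    end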


    Now that we have both the positive and negative tests, can we write a test for the degenerate case?

    Turns out we can.

    A degenerate case for us would be an empty string (""), given that we anticipate the strand to exist (signified by a non-empty string).

    Before we go ahead and write our test case, let’s have a bit of a think around the behaviour of our function in the case of an empty string. Should it:

    1. Return a count of 0 0 0 0, or
    2. Raise an exception?

    Usually, a decision like this already forms part of our specification - in our case, though, the exercise gives no indication as to what is considered canonical, so we can choose either.

    Let’s go with expecting a count of 0 0 0 0 for this one (there doesn’t appear to be any significant benefit either way).

    it 'returns a count of 0 0 0 0 for a strand with zero bases (empty string)' do
        strand = ''
        nucleotide_count = '0 0 0 0'
        result = count_nucleotides(strand)
    
        assert_equal nucleotide_count, result
      end
    

    Let’s check our function’s behaviour:

    $ ruby counting_nucleotides.rb
    Run options: --seed 33926
    
    # Running:
    
    ...
    
    Finished in 0.000378s, 7936.5079 runs/s, 7936.5079 assertions/s.
    
    3 runs, 3 assertions, 0 failures, 0 errors, 0 skips
    

    Very nice.


    Are we done, now?

    Not so fast.

    There’s a requirement in our specification that we have ignored so far:

    Given: A DNA string s of length at most 1000 nt.

    (Emphasis mine.)

    Let’s quickly add a test case with an invalid length (> 1000 nucleotides) to see how our code behaves against this requirement:

    it 'throws an exception if a strand is more than 1000nt long' do
        strand = 'A' * 1005
    
        assert_raises(ArgumentError) { count_nucleotides(strand) }
    end
    
    $ ruby counting_nucleotides.rb
    Run options: --seed 15834
    
    # Running:
    
    ...F
    
    Finished in 0.000579s, 6908.4628 runs/s, 6908.4628 assertions/s.
    
      1) Failure:
    nucleotide counting function#test_0004_throws an exception if a strand is more than 1000nt long [counting_nucleotides.rb:52]:
    ArgumentError expected but nothing was raised.
    
    4 runs, 4 assertions, 1 failures, 0 errors, 0 skips
    

    The test fails, as we were anticipating an exception but none was raised.

    Let’s change our function to factor in this new requirement.

    def count_nucleotides(strand)
        count_a = 0
        count_t = 0
        count_c = 0
        count_g = 0
    
        if strand.length > 1000 then
            raise ArgumentError.new('A strand of at most 1000nt is expected')
        end
    
        strand.split('').each do |base|
            if base == 'A' then
                count_a += 1
            elsif base == 'T' then
                count_t += 1
            elsif base == 'C' then
                count_c += 1
            elsif base == 'G' then
                count_g += 1
            else
                raise ArgumentError.new('Invalid character in strand ' + base)
            end
        end
    
        "#{count_a} #{count_t} #{count_c} #{count_g}"
    end
    

    Let’s see how our test does now:

    $ ruby counting_nucleotides.rb
    Run options: --seed 14091
    
    # Running:
    
    ....
    
    Finished in 0.000355s, 11267.6056 runs/s, 11267.6056 assertions/s.
    
    4 runs, 4 assertions, 0 failures, 0 errors, 0 skips
    

    Very nice. Everything passes now.


    Before we wrap this up, let’s make one final addition to our test suite.

    Do you remember how I mentioned that the exercise specification describes a test case already?

    We can incorporate this into our test cases. As a matter of fact, we can substitute this one for the simpler positive case we had.

    it 'returns a count of 20 12 17 21 for the specification strand' do
        strand = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
        nucleotide_count = '20 12 17 21'
        result = count_nucleotides(strand)
    
        assert_equal nucleotide_count, result
      end
    

    Let’s see how we fare against the story test case (the one given to us in the specification).

    $ ruby counting_nucleotides.rb
    Run options: --seed 17159
    
    # Running:
    
    .....
    
    Finished in 0.000428s, 11682.2430 runs/s, 11682.2430 assertions/s.
    
    5 runs, 5 assertions, 0 failures, 0 errors, 0 skips
    

    Looks like everything works as expected. 🎉


    Before I wrap up, I would be remiss not to mention that this particular approach to designing code (Test-Driven Development) works very well when we know the expected output of our code (say, when we know a lot about the domain, or when our specification provides examples of expected input and output).

    It doesn’t work as well, however, when we don’t know what the output should be (for instance, when we’re doing exploratory programming, as in the case of exploring an API that’s been given to us).

    The complete code for this small exercise is listed below:

    require 'minitest/autorun'
    
    ## IMPLEMENTATION CODE
    
    def count_nucleotides(strand)
        count_a = 0
        count_t = 0
        count_c = 0
        count_g = 0
    
        if strand.length > 1000 then
            raise ArgumentError.new('A strand of at most 1000nt is expected')
        end
    
        strand.split('').each do |base|
            if base == 'A' then
                count_a += 1
            elsif base == 'T' then
                count_t += 1
            elsif base == 'C' then
                count_c += 1
            elsif base == 'G' then
                count_g += 1
            else
                raise ArgumentError.new('Invalid character in strand ' + base)
            end
        end
    
        "#{count_a} #{count_c} #{count_g} #{count_t}"
    end
    
    
    ## TEST CODE
    
    describe 'nucleotide counting function' do
      # Positive case
      it 'returns a count of 20 12 17 21 for the specification strand' do
        strand = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
        nucleotide_count = '20 12 17 21'
        result = count_nucleotides(strand)
    
        assert_equal nucleotide_count, result
      end
    
      # Negative cases
      it 'throws an exception if a non-base encoding character is found in the strand' do
        strand = 'ATCGW'
        
        assert_raises(ArgumentError) { count_nucleotides(strand) }
      end
    
      it 'throws an exception if a strand is more than 1000nt long' do
        strand = 'A' * 1005
    
        assert_raises(ArgumentError) { count_nucleotides(strand) }
      end
    
      # Degenerate cases
      it 'returns a count of 0 0 0 0 for a strand with zero bases (empty string)' do
        strand = ''
        nucleotide_count = '0 0 0 0'
        result = count_nucleotides(strand)
    
        assert_equal nucleotide_count, result
      end
    end
    
  • Distro forking 101: How do you fork a Linux distro?

    Defining the GNU/Linux distribution

    If you are here, we can safely assume that you already know what a GNU/Linux software distribution is, but for completeness’ sake, let’s define it so we all have the same context.

    A GNU/Linux distribution is a collection of system and application software, packaged together by the distribution’s developers so that it is distributed as a nicely integrated bundle, ready to be used by users and developers alike. Software typically included in such a distribution ranges from the compiler toolchain, to the C library, to filesystem utilities, to text editors.

    As you can imagine from the existence of so many different GNU/Linux distributions, there are multiple ways to combine all these different applications and their respective configurations - not to mention that you can include even more specialised software, depending on the target audience of the distribution (such as multimedia software for a distribution like Ubuntu Studio, or penetration-testing tools for one such as Kali Linux).

    The “f” word

    But even with such a great number of different software collections and their respective configurations, there still may not be one that appeals to your specific needs. That’s OK though, as you can still customize each and every distribution to your liking. Extensive customization creates a differentiation point - a potential forking point.

    Forking is a term that has been known to carry negative connotations. As Wikipedia puts it,

    the term often implies not merely a development branch, but a split in the developer community, a form of schism.

    Historically, it has also been used as leverage to coerce a project’s developers into merging code into their master branches that they didn’t originally want, or otherwise into taking decisions they wouldn’t have taken if not under the pressure of a “fork”. But why is that so?

    You see, traditionally, forking a project meant a couple of things. For starters, there were now two identical projects competing in the same solution space. Those two projects had different development hours, features, and bug fixes going into them, and eventually one of the two ended up obsolete. Apart from that, forking also created an atmosphere of intense competition between the two projects.

    However, by 2014, with the advent of distributed version control systems such as git and mercurial, and of social coding websites such as GitHub and Bitbucket, the term has finally taken on a more lax meaning: just another code repository that may or may not enjoy major (or even minor, for that matter) development.

    Forking a GNU/Linux distribution

    So, up until now we have discussed what a GNU/Linux distribution is, and what a fork is. However, we haven’t yet discussed what it means to fork a GNU/Linux distribution.

    You see, what differentiates each distro from the others, apart from the software collection it contains, is the way in which it provides (and deploys) that software. Yes, we are talking about software packages and their respective package managers. Distributions from the Debian (.deb) family use dpkg along with apt, synaptic, aptitude, or some other higher-level tool. RPM-based (.rpm) distributions may use rpm with yum, dnf, zypper, or another higher-level tool. Other distributions, not based on the aforementioned, may choose to roll their own combination of packages and package managers: Arch Linux uses its own pacman, Sabayon its Entropy package manager, and so on.

    Now, naturally, if you want to customize an application to your liking, there are many ways you could do that. One of them is downloading the tarball from the upstream’s website or FTP server, then ./configure, make, and make install it. But if you start customizing lots of applications this way, it becomes tedious and unwieldy all too soon. After all, what did that make install install, exactly? Will the next update replace those files? What were your configuration options? Did they replace the files the package manager installed?

    In this case, it really pays off to learn packaging software for your distribution of choice. What this means is learning the format of packages your distribution’s package manager accepts, as well as how to produce them. This way, instead of the ./configure && make && make install cycle, you get beautiful software packages that you can control more tightly, update more easily, and even distribute to your friends if you so desire. As an added bonus, the package manager now knows about those files, so you can install, remove, or update them much more easily. What’s not to like?

    After you have created some custom packages, you may also wish to create a repository to contain them and update straight from it. Congratulations, you have created your custom distribution, and a potential fork. While you are at it, if you really want to fork the distribution, you could just as well take the distribution’s base packages, customize them, rebuild them, and then distribute them. Congratulations again, now you have your true GNU/Linux distribution fork.

    That seems easy. More specifically?

    Yes, of course. Let’s take a look at how you might go about forking some well-known GNU/Linux distributions.

    Debian

    In Debian, the usual procedure for customizing a package is as follows:

    • First, make sure you have the essential build software installed: apt-get install build-essential devscripts debhelper
    • Then download the package’s build dependencies: apt-get build-dep $package_name
    • Now it’s time to download its sources, via apt-get source $package_name
    • Proceed with customizing it to your liking (update it, patch the sources, etc.)
    • Finally, rebuild it: debuild -us -uc

    Assuming all went fine, you should now have an $package_name.deb file in your current directory ready to be installed with dpkg -i $package_name.deb.
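
    For concreteness, an end-to-end session might look like this (a sketch, using GNU hello as a stand-in for your package of choice):

    # install the packaging toolchain
    sudo apt-get install build-essential devscripts debhelper
    
    # fetch the build dependencies and the sources
    sudo apt-get build-dep hello
    apt-get source hello
    cd hello-*/
    
    # ... patch or reconfigure to taste ...
    
    # rebuild, skipping GPG signing
    debuild -us -uc
    
    # the resulting package lands one directory up
    sudo dpkg -i ../hello_*.deb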

    Please take note that the above is not an extensive treatise on Debian packaging by any means. If you want to build custom Debian packages, the official documentation (such as the Debian New Maintainers’ Guide) is well worth reading.

    Now that you have your custom packages, it’s time to build a repository to contain them. There are many tools you can use to do that, including the official Debian package archiving tool known as dak, but if you want a personal repository without too much hassle, you’re better off using reprepro. I won’t go to full length on that here; a good reprepro setup guide will take you the rest of the way if you so desire.

    Fedora

    Building packages for Fedora is a procedure similar to the Debian one. Fedora, however, is more convenient in one aspect: it allows you to download a DVD image with all the sources in .rpm form, ready for you to customize and rebuild to your tastes.

    Apart from that, the usual procedure is the following:

    • Download the SRPM (source RPM) via any means. You could do that using the yumdownloader utility: yumdownloader --source $package_name. To use yumdownloader, you need to have yum-utils installed.
    • After you have downloaded the SRPM, unpack it: rpm -i $package_name.src.rpm
    • Next, customize the package to your liking (patch the sources, etc.)
    • Finally, cd into the SPECS folder, and then run rpmbuild -ba $package.spec

    Again, the above steps may not be 100% correct. If you want to go down this route, the packaging guides on the Fedora wiki have more information.

    Next up, is the repository creation step.

    • To create a yum repository, you need to yum install createrepo. After that, create a directory tree to use as the repository, like so: mkdir -p /var/ftp/repo/Fedora/19/{SRPMS,i386,x86_64}
    • After that, move your i386 packages to /var/ftp/repo/Fedora/19/i386, and the rest of the packages to their respective folders.
    • The next step is adding a configuration file under /etc/yum.repos.d/ that describes your repository to yum (sketched below).
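
    As an illustration, the repository side and the client side might look roughly like this (a sketch - the paths and repository name are made up):

    # generate the repodata for each architecture directory
    createrepo /var/ftp/repo/Fedora/19/i386
    createrepo /var/ftp/repo/Fedora/19/x86_64
    
    # /etc/yum.repos.d/custom.repo on the client machine:
    [custom]
    name=My custom packages
    baseurl=ftp://ftp.example.com/repo/Fedora/19/$basearch
    enabled=1
    gpgcheck=0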

    Again, this is not definitive; the createrepo and yum documentation cover the details.

    Arch Linux

    Arch Linux, at least in comparison to the .deb and .rpm distribution families, is very easy to customize to your liking. That’s to be expected, though, as Arch Linux is a distribution that sells itself on the customization capabilities it offers its users.

    In essence, if you want to customize a package, the general process is this:

    • Download the Arch tarball that contains the PKGBUILD file
    • Untar the tarball
    • (Optional) download the upstream tarball referenced in the PKGBUILD, and modify it to your liking
    • Run makepkg in the folder containing the PKGBUILD
    • Install (using pacman) the .pkg.tar.xz file produced once the whole process is finished.

    In order to download the build files for the official {core | extra | community} packages, you need to run abs as root. This will create a directory tree that contains the files required for building any package in the official repositories.

    Next up, you can create a custom local repository with the repo-add tool, and then proceed to edit /etc/pacman.conf, adding an entry for your repository there. The ArchWiki has more information on both.
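
    Roughly, those last steps might look like this (a sketch; the repository name and paths are made up):

    # add a freshly built package to a local repository database
    repo-add /srv/customrepo/customrepo.db.tar.gz mypackage-1.0-1-x86_64.pkg.tar.xz
    
    # /etc/pacman.conf entry pointing pacman at the new repository:
    [customrepo]
    SigLevel = Optional TrustAll
    Server = file:///srv/customrepo
    
    # then sync and install as usual
    pacman -Sy mypackage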

    To fork or not to fork?

    Well, that’s not an easy question to answer. My opinion is that it’s extremely educational to do a soft fork: clone the distribution’s core repository and, for some time, maintain your own distribution based on it - that is, update and customize all the repositories. Do that for some months, then go back to using your distribution of choice, now enlightened about how it works under the hood. The reason this is so educational is that it teaches you the ins and outs of your distribution: all the software in it, how it integrates, and what each piece’s role is. It teaches you packaging - a tremendously undervalued skill, as it lets you customize your experience to your liking - and it makes you appreciate the effort that goes into maintaining the distribution.

    As for doing a hard fork - that is, creating your own distribution that you commit to maintaining for a long time - my opinion is that it’s simply not worth it. Maintaining a distribution, whether by yourself or with your friends, is a tremendous amount of work, and it’s not worth it unless you have other goals you want to achieve by it. If all you want is to customize your distribution of choice to your liking, then go ahead: learn packaging for it, package the customized applications you want, and create your own repo - but always track the upstream. Diverging too much from the upstream is not worth the hassle, as you will end up spending more time maintaining the distribution than using it.

    tl;dr:

    If you want to do a small-scale, private fork in order to see what’s under the hood of your Linux distro, by all means go ahead.

    If you want to do a large-scale, public fork, then take your time to calculate the effort, whether it’s worth it, and whether you could instead just help the upstream distribution implement the features you want.

  • How the Compiler, the Library and the Kernel work - Part 3

    In the last part of this series, we talked about the compiler’s composition, including the assembler and the linker. We showed what happens when the compiler runs, and what the output of translation software such as cc1 or as looks like. In this final part of the series, we are going to talk about the C library, how our programs interface with it, and how it interfaces with the kernel.

    The C Standard Library

    The C Standard Library is a part of pretty much every UNIX-like operating system. It’s basically a collection of code - functions, macros, type definitions, etc. - that provides facilities such as string handling (string.h), mathematical computations (math.h), input and output (stdio.h), and so on.

    GNU/Linux operating systems generally use the GNU C Library implementation (glibc), but it’s common to find other C libraries in use (especially in embedded systems), such as uClibc, newlib, or, in the case of Android/Linux systems, Bionic. BSD-style operating systems usually have their own implementation of a C library.

    So, how does one “use” the C Standard Library?

    So, now that we are acquainted with the C Library, how do you make use of it, you ask? The answer is: automagically. :) Hold on right there; that’s not much of an exaggeration. You see, when you write a basic C program, you usually #include <some_header.h> and then go on to use the code declared in that header. We explained in the previous part of this series that when we use a function, say printf(), it’s really the linker that does the hard work and allows us to use it, by linking our program against libc’s .so (shared object). So, in essence, when you need to use the C Standard Library, you just #include the headers that belong to it, and the linker will resolve the references to the included code.
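
    You can observe this linkage for yourself. A quick sketch (exact paths and addresses will vary from system to system):

    $ cat hello.c
    #include <stdio.h>
    
    int main(void)
    {
        printf("hello\n");
        return 0;
    }
    $ cc hello.c -o hello
    $ ldd ./hello        # list the shared objects the binary depends on
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)
        ...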

    Apart from the functions defined in the Standards, however, a C Library might also implement further functionality. For example, the Standards say nothing about networking. As a matter of fact, most libraries today implement not only what’s in the C Standards, but also choose to comply with the requirements of the POSIX C library, which is a superset of the C Standard Library.

    Ok, and how does the C Library manage to provide these services?

    The answer to this question is simple: some of the services the library provides need no sort of special privileges, being normal, userspace C code, while for others it must ask the operating system’s kernel to provide the facilities on its behalf.

    How does it do so? By calling functions exported by the kernel to provide certain functionality, named system calls. System calls are the fundamental interface between a userspace application and the operating system kernel. For example, consider this:

    You might have a program with code like fd = open("log.txt", O_RDWR | O_CREAT, 0644); at some point. That open function is provided by the C Library, but the C Library itself cannot execute all of the functionality required to open a file, so it may invoke a sys_open() system call that asks the kernel to do what’s required to open the file. In this case we say that the library’s open call acts as a wrapper function around the system call.
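
    To make the wrapper idea concrete, here’s a minimal sketch (Linux-specific; on x86-64 SYS_open exists, though some newer ports only provide openat):

    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    
    int main(void)
    {
        /* Through the libc wrapper: */
        int fd1 = open("log.txt", O_RDWR | O_CREAT, 0644);
    
        /* The same request, issued as a raw system call: */
        long fd2 = syscall(SYS_open, "log.txt", O_RDWR | O_CREAT, 0644);
    
        close(fd1);
        close((int)fd2);
        return 0;
    }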

    Epilogue

    In this final part of our series, we saw how our applications interface with the C Standard Library available in our system, and how the Library itself interfaces with the Operating system kernel to provide the required services needed by the userspace applications.

    Further Reading:

    If you want to take a look at the system call interface of the Linux operating system, you can always read the man page for the Linux system calls: man 2 syscalls.

  • Introduction to xv6: Adding a new system call.

    xv6: An introduction

    If you are, like me, a low-level PC programmer, it’s hard not to have heard of xv6. xv6, for those who haven’t heard of it, is a UNIX Version 6 clone, designed at MIT to help teach operating systems.

    The reasoning behind it was fairly simple. Up until that point, MIT had used John Lions’ famous commentary on the Sixth Edition of UNIX. But V6 was challenging for a number of reasons: it was written in a near-ancient version of C (pre-K&R), and it contained PDP-11 assembly (a legendary machine for us UNIX lovers, but ancient nonetheless), which didn’t really help the students, who had to study both the PDP-11 and the (more common) x86 architecture on which they developed another (exokernel) operating system.

    So, to make things much simpler, the professors there decided to roll out a clone of UNIX Version 6 that was x86-specific, written in ANSI C, and supported multiprocessor machines.

    For a student (or a programmer interested in operating systems), xv6 is a unique opportunity to get introduced to kernel hacking and to the architecture of UNIX-like systems. At about 15k lines of code (IIRC), including the (primitive) libraries, the userland, and the kernel, it’s very easy to grok (or, well, at least easier than production-scale UNIX-like systems), and it’s also very easy to expand on. It also helps tremendously that xv6 as a whole has magnificent documentation, not only from MIT, but also from other universities that have adopted xv6 for use in their operating systems syllabus.

    An introduction to Ensidia: my very personal xv6 fork

    When I first discovered xv6, I was ecstatic. For the reasons mentioned above, I couldn’t pass up the opportunity to fork xv6 and use it as a personal testbed for anything I felt like exploring or testing out.

    As a matter of fact, when I first discovered xv6, I had just finished implementing (the base of) my own UNIX-like operating system, named fotix, so the timing of my discovery was great. xv6 had done what I had done, had also implemented most of what I was planning to work on in fotix (for example, ELF file loading), and was a solid base for further development. It also had a userland, which fotix at the time didn’t.

    After I forked xv6, I spent some time familiarizing myself with the code. I also cleaned up the source code quite a bit, structuring it in a BSD-like folder structure instead of having all of the code in the same folder, and made various small-scale changes.

    After that, I left Ensidia alone for quite some time and didn’t touch it much. However, I always felt like I wanted to develop it a bit more and play with its code in interesting ways. While trying to think of a simple way to get started with kernel hacking on it and get more acquainted with the kernel, I found a pdf with interesting project ideas for it. One of them was to add a system call. I figured that would be an interesting and quick hack, so hey, why not?

    Getting started with kernel hacking on xv6: Adding the system call.

    The system call I decided to introduce was the suggested one, and it sounded fairly simple too: introduce a new system call that returns the total number of system calls that have taken place so far. So let’s see how I went about implementing it.

    An introduction to system calls in xv6

    First of all, we should provide some context about what system calls are, how they are used, and how they are implemented in xv6.

    A system call is a function that a userspace application uses to ask the operating system for a specific service. For instance, with the sbrk(n) system call, a process can ask the kernel to grow its heap space by n bytes. Another example is the well-known fork() system call of the UNIX world, used to create a new process by cloning the calling process.

    The way applications signal the kernel that they need a service is by issuing a software interrupt. An interrupt is a signal that notifies the processor that it needs to stop what it’s currently doing and handle the interrupt. The same mechanism is also used to notify the processor that information it was seeking from the disks is sitting in some buffer, ready to be extracted and processed, or that a key was pressed on the keyboard - that’s called a hardware interrupt.

    Before the processor stops to handle the interrupt generated, it needs to save the current state, so that it can resume the execution in this context after the interrupt has been handled.

    The code that calls a system call in xv6 looks like this:

    # exec(init, argv)
     .globl start
     start:
       pushl $argv
       pushl $init
       pushl $0  // where caller pc would be
       movl $SYS_exec, %eax
       int $T_SYSCALL

    In essence, it pushes the arguments of the call onto the stack, and puts the system call number (in the above code, $SYS_exec) into %eax. The number is used to match the entry in an array that holds pointers to all the system calls. After that, it generates a software interrupt, with a code (in this case $T_SYSCALL) that’s used to index the interrupt descriptor table and find the appropriate interrupt handler.

    The code responsible for dispatching to the appropriate interrupt handler is trap(), found in the file trap.c. If trap() finds the trap number in the generated trapframe (a structure that represents the processor’s state at the time the trap happened) to be equal to T_SYSCALL, it calls syscall() (the software interrupt handler), found in syscall.c.

    // This is the part of trap that
    // calls syscall()
    void
    trap(struct trapframe *tf)
    {
      if(tf->trapno == T_SYSCALL){
        if(proc->killed)
          exit();
        proc->tf = tf;
        syscall();
        if(proc->killed)
          exit();
        return;
      }
      // ... the rest of trap() handles other kinds of traps ...
    }

    syscall() is finally the function that inspects %eax to get the number of the system call (to index the array of system call pointers), and executes the code corresponding to that system call.

    The implementation of the system calls themselves lives in two files: sysproc.c, which contains the implementations of the process-related system calls, and sysfile.c, which contains the implementations of the file system-related ones.
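
    For reference, the dispatch table in syscall.c looks roughly like this (an abridged excerpt; it uses designated initializers, with the indices defined in syscall.h):

    static int (*syscalls[])(void) = {
    [SYS_fork]    sys_fork,
    [SYS_exit]    sys_exit,
    [SYS_wait]    sys_wait,
    // ... one entry per system call ...
    };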

    The specific implementation of the numcalls() system call

    Implementing the system call itself is simple. I did it with a global variable in syscall.c called syscallnum, which is incremented every time syscall() dispatches a system call function - that is, every time the system call is valid.

    Next, we just need a function - the system call implementation - that returns that number to the userspace program asking for it. Below are the function itself, and syscall() after our change.

    // the global syscall counter, living in syscall.c
    int syscallnum = 0;
    
    // return the number of system calls that have taken place in
    // the system
    int
    sys_numcalls(void)
    {
        return syscallnum;
    }
    
    // The syscall() implementation after
    // our change
    void
    syscall(void)
    {
      int num;
    
      num = proc->tf->eax;
      if(num > 0 && num < NELEM(syscalls) && syscalls[num]) {
        syscallnum++; // increment the syscall counter
        proc->tf->eax = syscalls[num]();
      } else {
        cprintf("%d %s: unknown sys call %d\n",
                proc->pid, proc->name, num);
        proc->tf->eax = -1;
      }
    }

    After that was done, the next few steps were fairly straightforward. We had to add an index number for the new system call in syscall.h, expose it to user processes via user.h, add a new macro invocation to usys.S that defines the asm routine invoking that specific system call, and change the Makefile to accommodate our change. After doing so, we had to write a userspace testing program to exercise it. The glue is sketched below.
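
    Roughly, the glue looks like this (a sketch; the syscall number 22 is an assumption - use the next free index in your tree):

    // syscall.h: assign the new call the next free number
    #define SYS_numcalls 22
    
    // syscall.c: declare the handler and register it in syscalls[]
    extern int sys_numcalls(void);
    //   [SYS_numcalls] sys_numcalls,
    
    // user.h: the prototype userspace programs call
    int numcalls(void);
    
    // usys.S: generate the userspace stub that traps into the kernel
    //   SYSCALL(numcalls)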

    The result after doing all this is below :)

    cpu1: starting
    cpu0: starting
    init: starting sh
    $ ls
    .              1 1 512
    ..             1 1 512
    README         2 2 2209
    cat            2 3 9725
    echo           2 4 9254
    forktest       2 5 5986
    grep           2 6 10873
    init           2 7 9579
    kill           2 8 9246
    ln             2 9 9240
    ls             2 10 10832
    mkdir          2 11 9315
    rm             2 12 9308
    sh             2 13 16600
    stressfs       2 14 9790
    usertests      2 15 37633
    wc             2 16 10207
    zombie         2 17 9028
    syscallnum     2 18 9144
    console        3 19 0
    $ syscallnum
    The total number of syscalls so far is 643
    $ syscallnum
    The total number of syscalls so far is 705
    $ syscallnum
    The total number of syscalls so far is 767
    $ syscallnum
    The total number of syscalls so far is 829

    Epilogue

    I usually end my blog posts with an epilogue. Although this is a post that doesn’t necessarily need one, I wanted to write one just to tell you that you should try kernel hacking - that is, programming jargon for working on an operating system kernel - because it’s an experience that will undoubtedly teach you a great deal about how your computer actually works.

    Last but not least, take a look at the ongoing work on Ensidia, my fork of xv6. To see this particular work, take a look at the syscall branch.
