Tuesday, October 13, 2009

Moving out

At www.loup-vaillant.fr.

It will be polished. I promise. Eventually.

Edit: it is polished. At last.

Monday, June 29, 2009

Assignment Considered Harmful

In his essay "Go To Statement Considered Harmful", Edsger W. Dijkstra demonstrated how the use of goto made programming harder. Now goto is considered harmful, and has been replaced by more reasonable constructs. This essay is an attempt to demonstrate that assignment is harmful in the same way. But first, just to be clear:

  • A variable is a place in which a value is stored. Not to be confused with the value itself.

  • An assignment is the replacement of the value currently stored in a variable by another. Not to be confused with initialisation, which is the placement of a value in a new variable.

Simply put, assignment isn't mandatory, makes the word "variable" confusing, and hampers program analysis, variable naming, refactoring, and performance.

Assignment is not mandatory

Assignment has two purposes. The first is storing a value for later use. Initializing a new variable can do that, and is less disruptive. The second is the construction of loops. Recursive function calls can do that, but many programmers find plain loops more readable. Fortunately, even loops are rarely mandatory. The factorial is a case in point (here in pseudo-code):

fac n = product [1..n]

This is just a definition. The natural way to implement it in an imperative language is to use a loop (here in C):

int fac(int n)
{
  int acc = 1;
  while (n > 0)
    acc *= n--;
  return acc;
}

It hardly looks like the definition of a factorial. The product and the sequence of integers are there, but interleaved, somewhat hidden. There is a better way (here in Haskell):

fac n = product [1..n]

This is actual, runnable code. [1..n] denotes a list of integers, ranging from 1 to n. product is a function (not a primitive), which takes a list as argument and returns the product of its elements.
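
As mentioned earlier, plain recursion can also replace the loop outright; a minimal sketch:

-- A recursive definition: no loop, no assignment.
fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)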

This was just an example. In real code, there are all sorts of loops. However, they follow a few well-known patterns, just like goto did. These patterns have been captured in recent programming languages like Haskell, just like the patterns of goto were captured in imperative languages.
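
For instance, the accumulator loop in the C version above is one such pattern, captured by a fold; a small sketch:

-- foldl captures the "loop with an accumulator" pattern:
-- start from 1, then multiply the accumulator by each element of [1..n].
fac :: Integer -> Integer
fac n = foldl (*) 1 [1..n]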

Now, in a reasonable programming language, loops are hardly needed, and assignment is not needed at all.

Assignment makes the term "variable" confusing

Compare these two programs:

(* OCaml *)           |     /* C */
let x = ref 1  in     |     int x = 1;
let y = ref 42 in     |     int y = 42;
  x := !y;            |     x = y;
  print_int !x;       |     printf("%d", x);

In the OCaml program, the "ref" function and the "!" operator make a clear distinction between a variable and its value. In C, that distinction must be inferred from context.

All popular imperative programming languages are like C in this respect. This leads to many abuses of language, like "x is equal to 1, until it is changed to be equal to y". Taking this sentence literally means making three mistakes:

  1. x is a variable. It can't be equal to a value (1). A glass is not the drink it contains.

  2. x and y are not equal, and will never be. They represent two distinct places. They can hold the same value, though. That two different glasses contain the same drink doesn't make them one and the same.

  3. x doesn't change. Ever. The value it holds is merely replaced by another. A glass doesn't change when you replace its water by wine.

The gap between language abuse and actual misconception is small. If we hold any misconception about variables, even a temporary one, how can we hope to write correct programs?

Assignment makes program analysis harder

Compiler writers have understood this for quite some time. A typical compiler for an imperative language will first transform the source code into SSA (static single assignment) form, an intermediate representation in which assignment is essentially banned. This makes optimization simpler.
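
To give the flavour of it (my own illustration, not actual compiler output): every "assignment" gets a fresh, write-once name, which is essentially what a functional let-expression does.

-- The imperative fragment   x = 1; x = x + 41; print(x);
-- in SSA spirit: each update introduces a fresh, write-once name.
ssaSketch :: IO ()
ssaSketch =
  let x0 = 1          -- first definition of x
      x1 = x0 + 41    -- the "update" is a new binding; x0 is untouched
  in print x1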

This also applies to manual analysis. Imagine the everyday situation of trying to understand the code of a colleague:

int x = 42;
/*
 * Big Blob of Code
 */
printf("%d", x); /* What does it print? */

Maybe it prints 42. Maybe not, because x may have changed (whoops, sorry, may not contain 42 any more). To be sure, we have to look at that big blob of code. Forgetting that may introduce a bug. Without assignment, the dependency chain is obvious, and can't be ignored.

Assignment makes variable naming harder

When you know a variable will always have the same value, you name it after that value. If this value can change, you have to consider all possible values. A name rarely scales that well, which makes the code harder to understand.

For instance, in my C implementation of the factorial, I named the accumulator of the loop "acc". As a name for an accumulator, this is accurate. However, the last value of acc is the factorial of n, and a good name to reflect that would have been "fac_n". Neither name is satisfactory, because each misses something important.

Assignment makes refactoring harder

There is a very important, often overlooked, rule about programming: the more you allow, the more you prevent. For each thing you allow in a program, you have to drop a set of assumptions about it. As a result, some manipulations become unsafe or impossible.

In high school, a definition like "let a = x + 1" meant that any occurrence of "a" or "(x + 1)" can be replaced by the other without changing the meaning of what is written. They are equivalent, and therefore substitutable. Imperative programs are more complicated:

int x = 42;
x = 1;
printf("%d", x); // try and replace x by 42!

Without assignment, "x" and "42" would be equivalent and substitutable. Because they are not, refactoring is harder.
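
Without assignment, the high-school rule holds again; a minimal sketch in Haskell:

-- 'a' and '(x + 1)' are interchangeable wherever 'a' is in scope:
f :: Int -> Int
f x = let a = x + 1
      in a * a        -- same meaning as (x + 1) * (x + 1), and vice versa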

Assignment hurts performance

Optimization during compilation can be seen as a form of refactoring. Harder refactoring means harder optimization. It complicates the compiler and makes it generate less efficient code. SSA form mitigates this problem, but doesn't eliminate it.

Another thing you lose when you allow assignment is sharing. It becomes important when you manipulate relatively complicated data structures, such as associative maps. There are three obvious ways to manipulate a data structure:

  1. Modify it in place (assignment allows that).
  2. Create a new structure by copying the whole thing.
  3. Create a new structure by referencing the old one (the unchanged parts are shared).

Each way has a specific problem. Way 1 is effectively an assignment, with all the disadvantages mentioned above. Way 2 wastes time and memory. Way 3 is unsafe if you ever use way 1. If assignment is allowed, way 2 is often your only safe choice. If it is not, way 3 is safe, convenient, and efficient.
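
A minimal sketch of way 3, using Haskell's Data.Map (a persistent structure; my example, not tied to any particular application):

import qualified Data.Map as Map

-- The "new" map references the unchanged parts of the "old" one;
-- nothing is copied wholesale, and the old map stays valid and untouched.
sharingDemo :: (Map.Map String Int, Map.Map String Int)
sharingDemo =
  let old = Map.fromList [("a", 1), ("b", 2)]
      new = Map.insert "c" 3 old
  in (old, new)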

These problems are pervasive

Using assignment sparingly is one thing. Compromises must be made, for instance to achieve the best possible performance on a critical section of the code, or in high-performance applications like device drivers. Using assignment everywhere is another thing entirely, and often yields a big ball of mud.

A piece of advice

Now that I have made my point about how assignment is harmful, I would like you to take action so it is used less. So please:

  • Learn a purely functional language (I suggest Haskell). It will show you how you can do without assignment, and what the advantages of not having it are. Beware, though: it may be addictive.

  • Push for better languages. At the very least, demand garbage collection. Manual memory management is a big consumer of assignments.

  • Avoid making functions which modify their arguments, or object methods which modify the object. They're not easily composable and often lead to verbose programs. In short, don't make assignment-happy APIs (often difficult without automatic memory management). A small sketch follows this list.
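
Here is what such an API can look like; a minimal sketch (the function name is mine):

import Data.List (sort)

-- The function returns a new value instead of modifying its argument,
-- so it composes directly with other functions.
topThree :: [Int] -> [Int]
topThree = take 3 . reverse . sort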

Thursday, March 12, 2009

The danger of "language" in programming

I recently stumbled upon this blog entry from Steve Riley. There are a number of things in it that can be agreed with or argued about, but one in particular struck me as especially dangerous:

"[C++] has a million nuances (like the English language) that somehow make expressing exactly what you want to say seem easier than other languages."

I can interpret this sentence in three ways. The first is meaningless: just some pretty wording to impress people, only to say "C++ is good". I doubt Steve was that sloppy. The second is literal: C++ shares a trait with natural languages, the "million nuances". I don't think that was the intended meaning, though it was how I first understood it. The last (and most likely) is the analogy: C++ and English have more nuances than other languages, programming and spoken respectively. I call that biased towards his native tongue, but his point remains.

Anyway, comparing natural and programming languages disturbs me, because drawing the line between similarities and differences is hard. That makes the jump from right premises to wrong conclusions way too easy. Analogies are a great tool in many cases, but when talking about computers, they just don't work.

There is a fundamental difference between the two kinds of languages which is often overlooked: their usage. Natural languages are used to talk to people, while programming languages are used to talk to machines (and to people too, if not primarily). The difference between machines and people is this: a machine does as it's told, a person does what's needed.

In virtually every use of a computer, we have some specific need, which we describe to the computer using some programming language(s). (This need may evolve, but that's not the point.) The resulting description of this need is the program. What often happens at this point is a disagreement between the computer and the human about the meaning of this program, which is a bug.

There are two ways to deal with bugs: correcting them, and avoiding them. The production of a program of any kind always combines both approaches. Most of the time, the (explicit) focus is on correction. Sometimes, it is on avoidance.

Of these two, the only one that is significantly influenced by the structure of programming languages is avoidance. It is therefore crucial to insist on it when choosing (or designing) one. Several characteristics of programming languages can help or impede bug avoidance. Conciseness can help, for it is known that the defect rate per line of code is pretty constant across languages. A paranoid type system can help, if it is still convenient. But I think what helps most is obviousness. The more obvious, the better. Ideally no special case, no subtlety, no nuance. By itself, the language should be boring.

That leads us to why Steve's sentence above is so dangerous. Not because it's wrong (although it is), but because it's right: C++ does have many nuances. It is a very interesting and very subtle language, to the point that even machines (namely, compilers) disagree about its meaning. That is not good. That is a recipe for disaster.

Sunday, January 25, 2009

Version control for small projects

Basically, as a developer, you care about one thing and one thing alone: your code. You want to modify it, archive it, share it, and do many other wonderful things with it.

Without special tools, archiving is often achieved by copying the source code to some safe place from time to time. That can be dealt with, but it is definitely not ideal. Sharing is achieved the same way, the safe place being whatever email address your fellow developers happen to use. Way messier:

  • You (happy): Hello, Bud? I just made some bugfixes on the GUI. I have changed display.c, button.c, and main.c.
  • Bud (cautious): Just these files?
  • You: Oh, and the corresponding .h, of course.
  • Bud (doubtful): Are you sure? what about window.c?
  • You (panicked): ...
  • Bud (resigned): OK, just send me the whole code, I'll deal with it.

Sounds familiar? You just had to keep track of the exact changes you made since the last time you gave your code to Bud, yet you didn't. Neither could I.

Sorting it out

Fortunately, version control systems do exactly this. So let's start with centralized version control systems, like SVN. These are the easiest to grasp. When using SVN, your code is stored in two places: the central repository, and your working copy. These have two major differences: number and persistence.

The central repository is unique: it is the reference, to which every modification goes. The working copies are many (typically one per developer): sandboxes in which you create and test your modifications.

Nothing is ever lost in the central repository. The entire history of the code base is stored there. Need a safe place to store earlier versions of your code? This is it: Time Machine, twenty years before Apple. The working copy, on the other hand, is changed directly. If you want those changes to be recorded, you must push them to the repository.

The workflow in a centralized version control system is quite simple:

  1. Pull the modifications from the repository to your working copy.
  2. Make some changes to your working copy (and test them).
  3. Push the changes from your working copy to the repository.
  4. Repeat the cycle.

If you and your teammates try to change the same piece of code, the version control system will tell you, and tell you where. No more need for perfect memory or guesswork: you get straight to the point and solve the conflict.

Now the problem is setting it up. You need a central server, which ideally should be online 24/7. Not an option for many small projects. Even when this service is provided for free, it can be too much of a hassle.

DVCS to the rescue

Distributed version control systems like Darcs differ from centralized ones in one respect: repositories are now many. Yes, everyone has a fully-fledged repository, with all the history and such. This gives distributed VCSes two big advantages over centralized ones. First, they are easier to set up: no need for a central server, and a few commands are enough to make a repository out of an existing project. Second, you can keep working the way you did before you had them; it is just easier. Basically, the above phone conversation would become this:

  • You (happy): Hello, Bud? I just made some bugfixes on the GUI.
  • Bud (cautious): Just the GUI?
  • You (happy): Yup.
  • Bud (happy): OK, send me the patch, then.

Just easy. You only have to type one command to produce said patch, so you can email it to Bud. No need to keep track of each tiny little change you made: your DVCS knows. And if you and Bud changed the same thing, your DVCS will tell you, and tell you where. Use them.

Wednesday, January 31, 2007

Is Functional Programming Feasible?

I find Functional Programming quite natural. Recursion is trivial, and higher-order functions are not such a big deal. Except that I had forgotten how long it took me to fully grasp these concepts. I forgot that the first time I saw map and fold in action, I had a hard time understanding what the program actually did.

I often read that functional programs are easier to read, maintain, and debug. I also hear the same about imperative programs (Object Oriented as well as simply procedural ones). Currently, I am not sure which side is right. Personally, if I am given a powerful functional language, I will program faster and cleaner. End of discussion. But what about my fellow programmers? Will they do any good? Do they think they can do any good? I think there is some kind of trade-off: functional programs often use more powerful abstractions than imperative ones. The result is more concise code. On one hand, that means less code to write, maintain, and debug. On the other hand, each piece of code carries more meaning, and is therefore likely to be more difficult to write, maintain, and debug. My point is to find out which is more practical. By "practical", I mean usable in the "real world", where the big money is.

Let's start with an overview of the abstractions used by the imperative, procedural programming style that the functional style lacks.

Side effect. A side effect is something that defines a before and an after. For example, printing "Hello World!" on the screen is a side effect: before, nothing is displayed; after, the message is visible on the screen. The state of the screen has been modified. (Note that side effects are ill-named: more often than not, we mean to produce them.)

Flow. An imperative program does things, one at a time, in a precise, given order. The first thing to do comes first, the second comes second, and so on. When the program has to do things repeatedly, it is still in a precise, given order. We need the notion of flow because these things are mostly side effects.

Assignment. An assignment is a particular kind of side effect: storing a new value in a variable, overwriting the old. It is the most important thing imperative programs do. (This is different from initialization, where the value is stored in a new variable as it is created.)

Flow control structures. Basically, we have three of them: "if", "while", and "for". I will not insult you by explaining those. (Functional languages have an "if", actually: it works the same as the C ternary operator "?:".)
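
A one-line sketch of that last remark, in Haskell:

-- Haskell's if is an expression, like C's ternary operator  x < 0 ? -x : x
absVal :: Int -> Int
absVal x = if x < 0 then negate x else x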

To sum up quickly, the functional style is an imperative style without side effects: no flow of time, each variable is constant, and we don't have loops. Now, what can be done under such drastic restrictions? Computations. Given an input (a value), I can produce an output (another value). I just have to call some function or write some expression. Of course, if a program needs to perform IO or user interaction, we need imperative features. However, large parts of many programs are mere computations.

Conclusion: there is room for the functional style. Now, is it a good idea to actually use it where we can? To answer this, I recall the most important questions: Is my code correct? Is it readable? Correctness and readability are always an issue when writing software, for code is rarely bug-free, and programmers rarely work alone. Readability is often considered most important. For instance, if some piece of code looks scary, common wisdom says it can't be readable. Then we can't check its correctness. Conclusion: it belongs in the garbage, for we can't risk money and jobs on scary code. So, if functional programming is scary, we can't use it. And to many, functional programming is scary: to actually compute anything interesting, it encourages the use of features which are, for some reason and for many people, scary. (These people often say: impractical.)

The first of these features is recursion. Functions can be defined in terms of themselves. Recursion looks quite foreign, almost magical: how the hell can I use something I have not defined yet? If we need to check the correctness of a recursive function, for example, we first have to assume it is correct. This kind of "bootstrapped" proof often looks insane, although it is common in mathematics (I have known it since high school).
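
A small sketch of such an argument, written as comments on a recursive function:

-- The length of a list, defined recursively.
len :: [a] -> Int
len []       = 0            -- base case: the empty list has length 0
len (_ : xs) = 1 + len xs   -- inductive case: assume len is correct on the
                            -- smaller list xs; then it is correct here too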

Currently, any decent programming language, functional or not, supports recursion. I think this is because sometimes recursion is the only reasonable option (when implementing a sort, for example). Anyway, scary recursion is not encouraged by imperative languages. Some people still say it is inefficient (sadly, some implementations still prove them right). We have while loops, and we have known them for a long time, so why bother?

The second of these features is first-class functions. That is, a function is a value like any other. Like any other value, it can be computed, stored in a variable (a new one, unless the language has imperative features), or passed as an argument. Very few imperative languages support that. Who would ever want to pass a function as an argument? Actually, we do: we have listeners and callbacks, data structures are accessed through iterators, objects are used to mimic closures... We just don't think about these as functions. Because of this, higher-order functions are still perceived as an alien concept good for research, experts, or the garbage (remember the big money).
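
Two small sketches of functions passed as arguments (my examples):

import Data.List (sortBy)
import Data.Ord (comparing)

-- The "comparator object" of many OO APIs is just an ordinary function here.
sortByLength :: [String] -> [String]
sortByLength = sortBy (comparing length)

-- Likewise, map takes the "loop body" as an argument.
shout :: [String] -> [String]
shout = map (++ "!")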

I remember the first time I really wondered whether any programmer can handle a powerful language. My brother, who was studying set theory at the time, came to me with a problem. He hadn't grasped the difference between "A belongs to B" and "A is included in B". At first, it was easy to explain that in the former, "A is an element of the set B", and in the latter, "A is a subset of the set B". There, one can imagine the world as little crosses, the elements, surrounded by circles, the sets. Then we saw, in horror, this description: "the set of the subsets of A". So we have some sets (the subsets of A), each of them included in another set (A), while being an element of some "superset". My brother, still thinking in terms of crosses and circles (utterly inappropriate in this case), had a hard time understanding these "higher-order sets". As I had no problem with those, I tried to explain. I failed.

I think the main problem with recursion and higher-order functions is self-reference. With recursion, we have objects referencing themselves. With higher-order functions, we have a concept referencing itself. Self-reference is scary. Self-reference leads to paradoxes and contradictions. I really hope I am wrong here, because self-reference is a powerful tool for computing. It leads to recursion, which is convenient. It leads to higher-order functions, which are powerful. It leads to macros (of the Lisp kind: programs that write programs), which may be the best way to factor out code.

OK, so functional programming is scary. Why isn't imperative, procedural programming scary? Because each line of procedural code is very simple. The most complicated things are the assignment and the function call. A procedural program does one thing at a time, so each step can be understood easily. We don't even have to talk about the flow or the state of the program: these are implicit, intuitive. To represent them, one can imagine a little worker, or robot, or whatever, performing each action described by the code, one by one. To understand the whole program, one just has to walk through it from beginning to end, step by step. If the program is carefully documented, calls to procedures and functions can be understood without looking at their code. And most importantly, we can avoid scary features. We don't even have to think about them.

However, not being scary does not mean being better: the flow of time, while intuitive, is quite complicated to describe mathematically. Because of that, proving the correctness of even the simplest imperative program is a pain (try proving a while loop correct). Variables must be checked against unexpected changes, and even expected changes must occur in the right order. Each line is very simple, but one has to hold the whole program in her head to understand it anyway.

My conclusion is simple. As long as powerful abstractions are scary, functional programming will be considered impractical. Whether they are scary for good reasons or not, I do not know.

Friday, January 12, 2007

Does OOP really suck?

Foreword: Smalltalk zealots, don't be upset, I am mainly bashing C++ and Java.

Object Oriented Programming was quite a revolution. It even became mandatory in any sufficiently large project: if you're not using OOP for your current project, then it is doomed. Or is it? Why?

Sure, OOP has advantages. The two major ones I have heard of are code reuse and encapsulation. That sounds great. Code reuse means I will have less code to write, less code to read, and less code to debug. Encapsulation means I will not know irrelevant implementation details about what I use, which is quite useful for teamwork.

Just a few words about intuitiveness. Some people say OOP is meant to be intuitive, but I think such a concern is irrelevant. For example, representing numbers with a sequence of digits is most counter-intuitive (yes it is, for children). A much more intuitive way would be to represent them with a set of small balls. Or with a set of different things (the Egyptian numeration system was based on that: one symbol for the units, another for the tens, and so on for each power of ten). However, the positional system we use today is far more concise and easier to use, once we have learnt it. What I think is important is ease of use and, to some extent, ease of learning. Not intuitiveness.

Before OOP, we were doing structured programming. Remember the old days, when we had the data on one hand and the code on the other. A program was roughly a set of modules. A module is just some "place" (typically a couple of files) in which we store the definition of a data structure and the code which processes it (a set of functions). The nominal way to use a module is to call its functions on variables of its data type.

/* How to use a module */
/* 1: create some instance of the type shmilblick */
shmilblick data;
/* 2: Use it */
process(data);
process2(data, stuff);
/* ... */


Then, OOP provided us a new way to divide our programs. A program is roughly a set of classes. A class is some place (typically a couple of files) in which we store the definition of some data (the member data) and the code which processes it (a set of methods). We can now turn the code above into:

// How to use a class
// 1: create some instance of the class shmilblick
shmilblick data;

// 2: Use it
data.process();
data.process2(stuff);
// ...


I have deliberately chosen an example where the procedural code and the OO code are isomorphic. This may look unfair, but reality is often like this: OO syntax with procedural semantics. So let's move on to more serious stuff.

When doing OOP, directly accessing the data is considered evil, so OO languages provided us a way to forbid programmers from doing so: "private" members. I have an interface (the "public" methods), so I need not care about the implementation. Actually, it's fine: it is now more difficult for me to damage the program. Or is it? Encapsulated data is data accessible from a given place (a class, a module, or a file) and no other. We can achieve that even with languages which don't enforce such discipline. Data encapsulation may have been introduced by OOP, but it is not specifically OO. Sure, providing support for encapsulation is more convenient than not, but a language does not have to be OO to do this.

Besides, Object Oriented Programming is meant to promote code reuse. That is The Big Thing. The more I reuse my code (or others'), the less code I write. Less code has fewer bugs, is produced faster, and is likely to be more maintainable. Code reuse can mean two things: the code I write will be easier to reuse, or it will be easier for me to write code that can be reused. All other things being equal, I prefer the former over the latter, since the latter requires a conscious effort when writing my code, or at least a fair bit of refactoring. It is also possible to have a mix of the two, depending on what mechanisms are available. When we talk about OO, we generally talk about these: inheritance, design patterns, ad-hoc polymorphism, and genericity.

If not carefully used, inheritance is dangerous: some implementation changes in the base class can compromise the correctness of the derived ones. We can lose the benefit of encapsulation. (I don't remember the toy example I was told about. It had something to do with list processing.) Code sharing through inheritance is then quite difficult, unless a certain level of stability can be assumed from the base class. However, the combination of inheritance and late binding leads to ad-hoc polymorphism, which is quite useful when one wants a set of objects of the same type that behave differently. This feature is powerful, but not exclusively OO: the same behaviour can be achieved with dynamic typing. Just imagine a print function which behaves appropriately (and differently) with integers and strings. The real advantage of ad-hoc polymorphism through inheritance, then, is binding the code to the behaviour (and the class) rather than to the functionality. Apart from that, it looks like just a way of getting around a static type system.
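
For what it's worth, statically typed functional languages offer yet another route: type classes. This is not the dynamic typing mentioned above, just a related sketch (mine) of a print-like function that behaves differently for integers and strings:

{-# LANGUAGE FlexibleInstances #-}

-- Ad-hoc polymorphism without inheritance: the instance is chosen by type.
class Describe a where
  describe :: a -> String

instance Describe Int where
  describe n = "the integer " ++ show n

instance Describe String where   -- String instances need FlexibleInstances
  describe s = "the string " ++ s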

Design patterns are more a way of programming than a language feature. They just fit well with the current usage of the most popular OO languages these days. A design pattern is roughly an incomplete snippet of code that solves some class of problems. Each time one has a similar problem, she copies (and completes) the appropriate snippet. At the expense of writing the same kind of code a few times (if not many), we are supposed to get more maintainable and less buggy code. My concern here is: can't we factor out these patterns? I am personally more and more annoyed when I see similar pieces of code again and again (a good example being "for" loops, especially the verbose ones from C). Worse: some design patterns merely mimic functional programming, and exist just because the language lacks closures and anonymous functions.
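
One concrete case (my example, not from the original post): the "open, use, always close" boilerplate found in many OO code bases can be written once as a higher-order function.

import Control.Exception (bracket)
import System.IO (Handle, IOMode (ReadMode), hClose, hGetLine, openFile)

-- The resource-handling pattern, factored out: open the file, hand the
-- handle to the caller's function, and close it afterwards in any case.
withInput :: FilePath -> (Handle -> IO a) -> IO a
withInput path = bracket (openFile path ReadMode) hClose

firstLine :: FilePath -> IO String
firstLine path = withInput path hGetLine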

Genericity is probably the most important: we often need to apply the same behaviour to data of several (if not all) datatypes. Writing that code once per datatype is almost nonsense if we have a way around it. Genericity is such a way. Great, but definitely not an OO concept. Only true object languages, in which everything is an object, can naturally achieve this through mere inheritance. Otherwise, we have the ugly C++ templates, or type inference as in ML and Haskell, or just a dynamic type system.
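
A minimal sketch of the ML/Haskell flavour of genericity: one definition, every element type.

-- Parametric genericity: these work for any types a and b.
swapPair :: (a, b) -> (b, a)
swapPair (x, y) = (y, x)

count :: (a -> Bool) -> [a] -> Int
count p = length . filter p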

Conclusion: I don't know if OO really sucks. I don't know OO well enough. However, I am under the impression that OO design is often a way to empower an otherwise weak language, without giving it full power. For example, I am currently coding a neural network, for an application in the Real World. A neural network takes a sequence of real numbers as input, and returns a real number as output. This is a function. More accurately, this is a parametrized function: the parameters (a number of thresholds) are used to define the exact behaviour of the network. And I want to work with several networks at once. So the function we need happens to be a closure. I use a language which doesn't support closures, so I am forced to wrap each network into an object, using a quite verbose pattern. That sucks.
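
Here is roughly what I mean, sketched in Haskell (the names and the toy "network" are invented for illustration, not the real application):

-- The network's parameters are captured in a closure, leaving a plain
-- function from inputs to output.
type Network = [Double] -> Double

makeNetwork :: [Double] -> Network
makeNetwork thresholds =
  \inputs -> sum (zipWith keep thresholds inputs)
  where keep t x = if x > t then x else 0   -- stand-in for the real maths

-- Several networks at once are just several values of type Network.
twoNetworks :: (Network, Network)
twoNetworks = (makeNetwork [0.1, 0.5], makeNetwork [0.3, 0.3])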

Sunday, January 7, 2007

First Time

My first ever entry in a blog. So far, so good.