Monday, August 29, 2011

Listing Elements

Now that I've discussed a bit about bits and boolean operators, I will start a series of talks about data structures. In this first "episode", we will focus on different types of lists. Throughout the next few posts, when I mention a class name, unless otherwise specified, I will be referring to the Java class with that name (for example, today you will hear quite often about classes such as ArrayList and LinkedList). Even if you don't use or know Java, read these posts anyway... the content should not be too Java-centered (in the sense that most of what we discuss applies to other languages/platforms that contain similar classes). As with all my written e-mails, forum and blog posts, I will try to keep this as short as possible (FYI, more often than not, I fail miserably at this "keep it short" policy... just see the previous post as an example). This article is part of a series of at least 3 posts that you should expect in the following few days. If everything works out as planned, I will talk about lists (current), sets and maps (the last one is still in "draft" mode and its topic might change drastically - but maps will definitely be covered at some point).

First of all, some of you might be wondering why I mentioned the ArrayList class, but not Java's basic arrays. Indeed, Java uses arrays as the underlying storage for its ArrayList and Vector classes (in case you didn't already know, the Java Vector class is just a synchronized version of ArrayList - in other words, of the two, only Vector is thread safe). They are, of course, just wrappers that manage two of the array's properties: the array length and the used portion of the array. If, for example, you needed a list of numbers, you could just use an array, remember how big the array is (the capacity - in some languages this bookkeeping is also needed... in Java, for example, the array length is stored in the array object itself, but in C it is not) and also how many of the array elements you have already filled (the list length). Besides managing these parameters, the ArrayList class also handles array re-allocation whenever necessary (if no more elements fit in the array currently used by the list, a bigger array is allocated and the elements are copied from the old array to the new one). This re-allocation step is the downside of using arrays as backing storage for a list: if you need to re-allocate too often, you will get a slower run time (we will get to the pros and cons of ArrayList vs LinkedList after a brief review of the linked list approach).

With LinkedList, however, you only need to allocate space for one new node every time you add to the list. This is why adding to a linked list is always done in constant time, no matter how many elements are in the list (as opposed to array based lists, where adding an element to a full array can take linear time - depending on the implementation, of course). As an implementation detail, for each element in the list, the linked list also stores a "pointer" to the next element of the list (in Java, LinkedList also stores a pointer to the previous element, making it a doubly linked list).
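As a rough illustration (only a sketch, not how java.util.LinkedList is actually implemented - the class and method names here are mine), a singly linked list with constant-time addFirst might look like:

```java
public class SketchLinkedList<T> {
    // One node per element; each node points to the next one.
    private static class Node<E> {
        E value;
        Node<E> next;
        Node(E value, Node<E> next) { this.value = value; this.next = next; }
    }

    private Node<T> head;   // first element of the list (null when empty)
    private int size;

    // O(1): allocate a single node; no existing elements are copied or moved.
    public void addFirst(T value) {
        head = new Node<>(value, head);
        size++;
    }

    public T getFirst() { return head.value; }
    public int size() { return size; }
}
```

Note how addFirst touches only the new node and the head reference - this is why the cost does not depend on the number of elements already in the list.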

Now, before we discuss when to use an array based list vs a linked list, here is the cost of a few basic operations performed on lists ('n' denotes the number of elements currently in the list and 'k' an arbitrary position within the list):
Operation   Position   ArrayList                        LinkedList
ADD         start      O(n)                             O(1)
            k          O(n)                             O(k)
            end        O(1) amortized, O(n) worst case  O(1)
GET         start      O(1)                             O(1)
            k          O(1)                             O(k)
            end        O(1)                             O(1)
REMOVE      start      O(n)                             O(1)
            k          O(n-k)                           O(k)
            end        O(1)                             O(1)

The first thing you might notice is the "amortized" cost of the add operation. For those of you unfamiliar with the term, it's just a fancy way of saying "the average cost over a long sequence of operations". This average, of course, depends on the so-called 'growth policy' of your array list implementation (by how much to expand the array when no more elements fit in it). The two most common policies are:
1) growing by a fixed amount (for example, grow by 10 more elements)
2) growing proportional to the existing number of elements (for example, always double the array size)
Out of the two, the second one is the more common, since it provides the costs mentioned above. The first policy gives O(n) amortized cost as well as O(n) worst case. However, always doubling the array size can waste memory, therefore some implementations let you choose between the two (the 'Vector' class from Java, as well as 'vector' from the C++ STL, to name a few).
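To make the two policies concrete, here is a minimal sketch (the class and method names are mine) of the grow step for each; the System.arraycopy call is the expensive part that amortized analysis accounts for:

```java
public class GrowDemo {
    // Policy 2: grow proportionally (here: double the capacity).
    // The copy cost of O(n) happens rarely enough that appends
    // average out to O(1) amortized.
    static int[] growDoubling(int[] a) {
        int[] b = new int[Math.max(1, a.length * 2)];
        System.arraycopy(a, 0, b, 0, a.length); // the expensive O(n) copy
        return b;
    }

    // Policy 1: grow by a fixed amount. The copy now happens every
    // 'step' appends, so appends average O(n) amortized.
    static int[] growFixed(int[] a, int step) {
        int[] b = new int[a.length + step];
        System.arraycopy(a, 0, b, 0, a.length);
        return b;
    }
}
```

With doubling, a sequence of n appends triggers only O(log n) re-allocations, copying O(n) elements in total; with a fixed step, it triggers O(n) re-allocations copying O(n^2) elements in total.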

Based on the table above, you can deduce a few situations where an array based list should be used:
  • when you need fast access to random elements from the list (some sorting algorithms will work equally well on linked lists as on array lists if you use iterators, but others, like heapsort, may be a lot slower on linked lists - although there are ways to avoid this issue in this particular case)
  • when you don't need too many ADD operations (or you know a reasonable upper bound on the number of ADD operations); e.g., you need to construct a sublist of a given list containing only the elements with a certain property (just set the initial array list capacity to the original list size)
  • when you have a reasonable balance between ADD and REMOVE operations (this is actually why the Stack class from Java extends Vector - making Stack an array based list)

Also, a few rules to consider when choosing a linked list instead of an array list (this is actually why I wrote this entire article... I know a lot of people who use ArrayList in Java regardless of the situation and forget that LinkedList performs better in the following situations):
  • when you need a lot of add operations (or an unknown number of them) and afterwards you only need to iterate over the elements of the list
  • when you need add or remove operations at the beginning of the list (in fact, in Java, LinkedList implements the Queue and Deque interfaces - thus highlighting the fact that the addFirst and removeFirst operations can be used efficiently on this class)
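For example, since LinkedList implements Deque, operations at both ends run in constant time (the class and method names in this sketch are mine):

```java
import java.util.Deque;
import java.util.LinkedList;

// LinkedList implements both Queue and Deque, so it supports
// efficient addFirst/removeFirst - O(1), with no element shifting
// like an ArrayList would need.
public class DequeDemo {
    public static String frontAfterOps() {
        Deque<String> deque = new LinkedList<>();
        deque.addLast("b");         // [b]
        deque.addFirst("a");        // [a, b]  - O(1)
        deque.addLast("c");         // [a, b, c] - O(1)
        return deque.removeFirst(); // removes and returns "a" - O(1)
    }
}
```

Doing the same addFirst on an ArrayList (via add(0, element)) would shift every existing element one position to the right, costing O(n) per call.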

I feel that I should give more details about when to use array vs linked lists, but most of the situations that pop into my head right now are just derived from the rules above. Just to summarize:
ArrayList: random access, OK for a reasonable (or known) number of addLast operations (it also performs well when add operations are balanced by removeLast operations, making it a decent implementation for a Stack); NOT recommended when you have a lot of add operations
LinkedList: addFirst and removeFirst support (making it the ideal choice for almost any Queue), constant time for addLast operations (no matter how many such operations occur); NOT recommended for algorithms that require random access to the list elements.

I also wanted to talk a little about stacks and queues, since they are just special cases of lists, but most of it was already covered. A special mention goes to the fact that linked lists can easily be used for both stack and queue operations. Just a reminder: stacks are lists that add and remove at the same end of the list (it doesn't matter whether this is the beginning or the end of the list... the important part is the "last in, first out" behavior), while queues have a "first in, first out" behavior, which is accomplished by adding at one end while removing from the other.
As for when to use stacks and queues (and when to use regular lists), it's hard to formulate rules. However, here are a couple of tips:
  • Stacks:
    • recursion: if you can describe the algorithm easily using recursion, CONSIDER a stack, since the recursion mechanism in most programming languages uses a stack internally (a stack is not always necessary in this situation, but there is a high probability you will need one)
    • parentheses: if you have any mathematical expression, or "open elements" and "close elements" whose meaning depends on their relative positions (not only () {} [], but also XML-style elements), then stacks are almost MANDATORY - except in the case when you have a single, simple open/close pair of elements (e.g., only one type of parentheses), when keeping track of the current height of the stack is sufficient
  • Queues:
    • "fewest number of steps" - this may sound a little too general, but I think it covers all the situations where I ever used a queue; if you need the minimum number of state transitions, or you are just interested in the "first object" that has a specific property (starting from a set of objects and using the fewest transformations), then you will most likely need a queue (this is essentially a breadth-first search)
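As an illustration of the parentheses tip, here is a small balanced-brackets checker using a stack (I'm using ArrayDeque, which the Java documentation recommends over the legacy Stack class; the class and method names are mine):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BracketCheck {
    // Returns true if every (, [, { is closed by the matching
    // ), ], } in the right order; non-bracket characters are ignored.
    public static boolean isBalanced(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '(': case '[': case '{':
                    stack.push(c);          // remember the open bracket
                    break;
                case ')':
                    if (stack.isEmpty() || stack.pop() != '(') return false;
                    break;
                case ']':
                    if (stack.isEmpty() || stack.pop() != '[') return false;
                    break;
                case '}':
                    if (stack.isEmpty() || stack.pop() != '{') return false;
                    break;
            }
        }
        return stack.isEmpty(); // leftover open brackets mean unbalanced
    }
}
```

Note that with a single bracket type, the stack degenerates into a counter - exactly the "height of the stack is sufficient" case mentioned above.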

Well, I guess that about sums it up. If you want to remember anything from this post, remember this: for any two given solutions to a problem (in this case, lists), one of them will *always* have situations where it's better than the other. This, however, doesn't mean that if solution A has advantages over solution B, then solution B also has advantages over solution A. Sometimes, solution B is just worse than A on all counts... and should only be used for small cases (if it's easier to implement, of course) - e.g., heap sort vs selection sort (heap sort is faster, but harder to implement, so on small enough cases, selection sort can be considered).
Coming back to lists, try to remember that linked lists *also* exist, even if array based lists are ok for most situations (and especially remember when linked lists are better than array lists and use them accordingly).

If you have anything to add, feel free to leave a comment.

Friday, July 1, 2011

Boolean Logic

One of the first topics I wanted to discuss was boolean arithmetic. You will probably find out that there are quite a few things to say, even if it all comes down to only two values: "true" or "false". Most of you are already familiar with boolean functions and how to combine operands to form any expression, so this topic will be a bit shorter than the rest.
For the purpose of this post, I will denote by "~A", the opposite of the boolean expression A ("not" A). It is a bit easier to see than what I suspect most of you are used to: "!A" (the C-style boolean negation).

First, a few rules I wanted to write down:
  • ~(~A) = A (double negation cancels itself out)
  • A OR true = true
  • A OR false = A
  • A AND true = A
  • A AND false = false
  • A AND ~A = false
  • A OR ~A = true
  • ~(A AND B) = ~A OR ~B (De Morgan's laws)
  • ~(A OR B) = ~A AND ~B 
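The rules above can be checked mechanically by brute force over all truth assignments - a quick sketch in Java (the class and method names are mine):

```java
public class BoolLaws {
    // Brute-force check of the boolean identities listed above,
    // over every possible truth assignment of A and B.
    public static boolean lawsHold() {
        boolean[] values = { false, true };
        for (boolean a : values) {
            for (boolean b : values) {
                if (!(!a) != a) return false;               // double negation
                if ((a || true) != true) return false;      // A OR true = true
                if ((a || false) != a) return false;        // A OR false = A
                if ((a && true) != a) return false;         // A AND true = A
                if ((a && false) != false) return false;    // A AND false = false
                if ((a && !a) != false) return false;       // A AND ~A = false
                if ((a || !a) != true) return false;        // A OR ~A = true
                if (!(a && b) != (!a || !b)) return false;  // De Morgan's law #1
                if (!(a || b) != (!a && !b)) return false;  // De Morgan's law #2
            }
        }
        return true;
    }
}
```

With only two variables there are just four assignments to check, so exhaustive verification like this is trivial (and it's a handy trick whenever you are unsure about a boolean simplification).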

Of course, these are not the only relations out there, but they are probably the most used. Next, I will give you a few tables. I've seen this situation come up on more than one occasion, so pay attention, because it might come in handy sooner than you think. Suppose you have two boolean expressions (A and B) and you have to call one piece of code if a certain combination of values applies and another piece when it doesn't (the classic "if-then-else" block). In many languages it might look like this:
if (A ??? B) {
    // code 1
} else {
    // code 2
}
You might think that the operator you place in that ??? is always easy to find if you know what the code should do. It is (or it should be a lot easier by the time you finish reading this post), although some cases might seem more complicated than others.
It's pretty common to stumble on a situation where you don't know what to put as that operator (this is why I'm writing this). What I usually do is mark in a table (see below) when I should enter code block 1 and spot the necessary operator from the table's layout. In the following, I will show you how to recognize each boolean operation, depending on the layout of that table.

Case 1: all the same value
(a)        A    ~A
     B     1    1
     ~B    1    1

(b)        A    ~A
     B     0    0
     ~B    0    0
In this case, you obviously never enter one of the "then" or "else" branches. The condition is "if (true)", respectively "if (false)". You might as well ignore writing the "if" statement completely and just write the "then" block for subcase (a), respectively the "else" block for subcase (b).

For the following cases, I will not put the zero (false) values in the table, so the true values are easier to see.


Case 2: two 'true' values on the same line/column
(a)        A    ~A
     B     1    1
     ~B

(b)        A    ~A
     B
     ~B    1    1

(c)        A    ~A
     B     1
     ~B    1

(d)        A    ~A
     B          1
     ~B         1
In the first two tables, you can see that the value of condition A doesn't affect the outcome of the evaluation (the A and ~A columns are identical). Therefore, it's only logical that the entire condition you place in the "if" statement depends only on the variable that controls the rows, ignoring the columns. Similarly, in the last two cases, the condition depends only on the column, ignoring the rows. The conditions for each table are as follows:
  1. if (B)
  2. if (~B)
  3. if (A)
  4. if (~A)

Case 3: only one 'true' value
(a)        A    ~A
     B     1
     ~B

(b)        A    ~A
     B
     ~B    1

(c)        A    ~A
     B          1
     ~B

(d)        A    ~A
     B
     ~B         1
Probably the most common of all situations involving two boolean variables. You only enter the "true" block if both variables have exactly the values you want. The easiest way to explain it is: "enter only if ((A has the value I want) AND (B has the value I want))". Visually, just put the AND between the row and the column that intersect at that "true" value. The conditions for each table are as follows:
  1. if (A AND B)
  2. if (A AND ~B)
  3. if (~A AND B)
  4. if (~A AND ~B)

Case 4: three 'true' values (only one out of four excluded)
(a)        A    ~A
     B     1    1
     ~B    1

(b)        A    ~A
     B     1
     ~B    1    1

(c)        A    ~A
     B     1    1
     ~B         1

(d)        A    ~A
     B          1
     ~B    1    1
If the AND operator is the most common boolean operator, this one is definitely a close second. You enter the "true" block if either one of the variables has the value you want. So, if you try to put it into words: "enter if at least one of them has the value I want". Visually, just put the OR between the row and the column that don't intersect at the "false" value (or, how I sometimes think of it: OR the column and row that intersect in the cell diagonally opposite the "false" value). The conditions for each table are as follows:
  1. if (A OR B)
  2. if (A OR ~B)
  3. if (~A OR B)
  4. if (~A OR ~B)
As an exercise, try to verify De Morgan's laws using the tables in cases 3 and 4 (the AND and OR tables). Note that applying the NOT (~) operator to an entire table flips all the values in it.


Case 5: only two 'true' values, on opposite corners
(a)        A    ~A
     B          1
     ~B    1

(b)        A    ~A
     B     1
     ~B         1
If I said that the AND and OR cases are in a close fight for first place, this one comes in last. Maybe because it's so rarely encountered, most people don't handle it efficiently. I've seen this over and over again, and although what they write is in fact correct, it can be done in a shorter way. For sub-case (a), one of the most common conditions is "if ((A AND ~B) OR (~A AND B))". Again, this is correct and, for most people, it is the most logical way to write it. OK, by now, most of you are probably thinking: "hey, there was that other operator I only heard about, but never used... XOR (exclusive OR)". For those who are unfamiliar with it, in many languages it is represented either as "XOR" or "^". Indeed, this operator should definitely get a little more credit.

As a side story from high school, when I learned about the boolean operators in Pascal: my teacher intentionally skipped XOR because he considered that nobody uses it. When I tried to correct him and said that there is one more operator for boolean variables, he was a bit upset (probably because it was the third time that class that I had interrupted him, but that's not relevant here) and he gave me a special assignment: find at least three useful applications for the XOR operator. I could probably do an entire post on this subject alone, but maybe later.

Back to the topic at hand, the English translation of the condition would be something like: "if only one of the conditions applies (but not both)". In order to apply the operator properly, just put XOR between the row containing one of the 'true' values and the column containing the other 'true' value. Or, similarly, XOR a row and a column that have a 'false' at their intersection. The conditions for the two tables pictured above are:
  1. if (A XOR B)     or    if (~A XOR ~B)
  2. if (A XOR ~B)    or    if (~A XOR B)
A few properties of the XOR operator (some of them can be verified with the above tables, others may be easier to check after we discuss an alternative operator for XOR expressions):
  • ~(A XOR B) = ~A XOR B = A XOR ~B (negating a XOR expression is equivalent to negating one of the operands)
  • ~A XOR ~B = A XOR B (if you negate both operands, is the same as leaving them intact - use this to simplify your expressions)
  • A XOR A = false
  • A XOR ~A = true
  • A XOR true = ~A
  • A XOR false = A
One more thing to add regarding this case. Sometimes, there is another way to write these conditions without using the XOR operator. However, it applies only if your language has exactly one value for 'true' and one value for 'false' (or if you always use only one of the possible values for 'true' and only one of the possible values for 'false'). A few examples might clarify this: Java, C# and Pascal have only one value each for 'true' and 'false'.
On the other side, in C/C++, NULL == 0 == false and anything else evaluates as 'true' when placed in a condition ("int x = 7; if (x) {...}"). Similarly, in Javascript, "if (!some_var)" can mean any of:
- "if (some_var == false)"
- "if (some_var == undefined)"
- "if (some_var == null)"
So, as I said before, you can write a condition like "A XOR B" in another way if your language has only one 'true' and only one 'false' value (or you are 100% sure you use only one value of each every time - for example, you write your boolean functions in C with return type int, and always return zero or one, never any other values). There's no need to keep you in suspense: "A XOR B" is true if and only if "A != B" (or "A <> B" if you prefer). Similarly, "A XOR ~B" is the same as "A == B".
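In Java, where boolean has exactly two values, these equivalences can be verified directly (the class and method names are mine):

```java
public class XorDemo {
    // Checks, over all truth assignments, that for Java booleans:
    //   A XOR B   behaves exactly like  A != B
    //   A XOR ~B  behaves exactly like  A == B
    public static boolean checkEquivalences() {
        boolean[] values = { false, true };
        for (boolean a : values) {
            for (boolean b : values) {
                if ((a ^ b) != (a != b)) return false;   // A XOR B  <=>  A != B
                if ((a ^ !b) != (a == b)) return false;  // A XOR ~B <=>  A == B
            }
        }
        return true;
    }
}
```

One precedence gotcha worth knowing: in Java, == and != bind tighter than ^, so "a ^ b == c" parses as "a ^ (b == c)" - parenthesize XOR expressions explicitly.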



Conclusion: like I said at the beginning of this post, there's quite a lot to talk about even when it comes down to something as simple as 1 and 0 (I guess the length of this post stands as proof). Also, if you were a bit confused by the previous post (that long "introduction" in which I tried to explain what this blog is going to be about), well... it's stuff like this. Small tricks I've picked up over the years, stuff that made me the programmer I am today. For some, it may all be obvious and they won't bother reading it. Even so, reading this might help you remember stuff you have long forgotten.

PS: I know I said this post is going to be shorter than the rest... I was actually referring to the useful information in it, not the actual length (I got sidetracked along the line and it came out a lot longer than I anticipated).

Monday, June 27, 2011

Introduction

Most of you will probably skip this entire post, since you don't really care why I am writing this blog. And since it doesn't contain any useful information, you will probably find it pretty boring (I am not liable for any of you who fall asleep in your chairs while reading this).

I guess there are already many blogs about programming topics, so you may be wondering why this guy (aka "me") started one too. At first, to be honest, I didn't want to do a blog, but I will get to that a bit later. Before I do that, I want to tell you how this blog is going to be different from other ones on the same topic.
I intend to discuss some common topics that every programmer sooner or later is bound to come across. Low to mid level stuff. If you are expecting dynamic programming and minimum cost maximal network flow, you may be looking in the wrong place (even if at some point I may get to those topics too, it's not anywhere in the near future). To get a better idea, one of the first few topics you will find here is going to be about simple data structures like lists, arrays and hash tables. Mostly, tips and tricks I picked up over the years, or just plain stuff that most people (think they) know, but since almost everybody takes these things for granted, almost no one writes them down.
I have read quite a few books in the computer science field, but none so far approached programming from this perspective. While some of them cover a particular programming language from a beginner's point of view (and assume you will pick up a few programming tricks while learning the language itself), others already assume that you have a good knowledge of a programming language and give you a solid algorithmic background. Even though I am pro algorithms and I consider them the basis for any good programmer, I don't intend to write much about specific algorithms. At one point, I wanted to write this blog in order to convince people to engage in programming contests. I cannot help myself, so here is a good argument in that direction: participating in algorithmic contests doesn't only polish your algorithmic skills. Several other skills benefit from that competitive environment. Practice enough and you will see that debugging takes a lot less of your time, particular cases won't sneak by you so often... overall, your coding certainly improves, because in programming contests you need to implement code that is very sensitive to details. Most programmers will never require that level of detail in their day to day work. Besides, programming contests require "expertise" in a very wide area, while a programming job usually covers a much smaller one (most of the time, depending on the job, that small area is not even touched by programming contests).

Back to the topic of the post. Initially, I wanted to put all these thoughts in a book, but I changed my mind while looking for the software to write it. Not because I couldn't find any suitable software, but because I realized a couple of things. The first is that I don't really have the discipline to write a book (trust me: sooner or later, you will agree with me on this point). The second is that if there are people who will benefit from these paragraphs, they will most likely appreciate a blog more than a book. A book implies publishing costs (and people buying it in order to cover those costs) and it has to be delivered all in one piece. That last part was the decider: you can write a book over a year, or spread it out over a dozen blog posts, so it's available to those who are interested as soon as each part is ready (bonus: faster feedback from your readers). Plus, you can write a blog in a more relaxed manner. A book has several structural rules that you must obey; a blog is a more informal conversation with your reader.

Anyway, this post is long enough... I hope that some of you will get a few helpful insights from this blog. Since this is the first time I write a blog, any hints are useful, especially constructive criticism.