02.20.07

Artificial Stupidity

Posted in Uncategorized at 7:05 pm by Fernando Cacciola

I’m interrupting work to post this because I’m stunned.

I understand that spam detection is an art. When you are in the software profession you can appreciate the complexity of it by just reading the interesting articles on the subject.

In fact, given how complex it is, I’m always pleasantly surprised to see how good GMail is at it. Granted, from time to time one spam email leaks in, but considering that it correctly detects hundreds of them per week, it’s definitely more than good.

On the other hand, I still have a hotmail account.

If only I could search and replace one for the other in everyone’s contact list.

And I’m just equally surprised about how bad hotmail is at detecting spam. Roughly, 90% of the spam makes it into the inbox.

But I figured that it was mainly my fault for not taking the time to train hotmail by teaching it what was spam and what wasn’t (though I never did any such thing for GMail, FYI).

Recently, however, I started to think that hotmail uses AS techniques. Now I’m almost convinced: I just received a message from…hmm, a guy named “VIAGRA”!

OK, ok, I shouldn’t be so harsh on it… after all, how would it know that was spam?

12.19.06

Intrinsic Unit Testing

Posted in C#, cplusplus at 9:17 pm by Fernando Cacciola

I have a shameful confession to make: roughly 80% of the code of one of the projects I work on lacks unit tests. Not even for regressions.

I could blame it on the fact that there are only two programmers in the entire team and we must act as architects, designers, implementers, testers, deployers, coffee makers, etc… but I would be fooling myself.

Truth is, we just don’t master the art of convincing management that time spent on writing good unit tests is much better than time spent hunting the cause of bugs customers found.

Or maybe that isn’t really true… maybe we don’t want to invest time in proper testing, eager to see results fast.

In any case, I was trying to identify practical issues with unit testing, surely just to convince myself that it is way too difficult, but also, at least to some extent, to find simpler ways to do it.

The way I see it, conventional unit testing has three drawbacks:

  • It takes so much external code to instrument it.

  • It takes so much imagination to come up with significant fabricated input.

  • It takes so much time to fix the expected results that match the fabricated input.

Eventually, some fundamental questions came to me: why do I need to create external testing code whose job is to call the subject functions with some fabricated input and validate the output or resulting state? Aren’t functions supposed to have a contract? Why isn’t it sufficient to validate the preconditions, invariants and postconditions inside each function itself?

There are plenty of answers to those questions, so yes, testing only by means of contract assertion is a naive idea, but an idea that sparks an interesting insight nevertheless.

Typical unit testing code validates results for fabricated input, yet in most cases, the validity of the results can be tested algorithmically, that is, there is no real need to create a fixed map between input and output. That’s what contract assertion is all about.
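To illustrate what validating a result algorithmically (rather than against a fixed input/output table) might look like, here is a minimal sketch. The function names sorted_copy and is_sorted_version_of are hypothetical, chosen only for this example; the check simply states the postconditions of sorting: the output is in ascending order and holds the same elements as the input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if 'out' is a valid result of sorting 'in'.
bool is_sorted_version_of(const std::vector<int>& in, const std::vector<int>& out)
{
  if (in.size() != out.size()) return false;

  // Postcondition 1: elements are in ascending order.
  for (std::size_t i = 1; i < out.size(); ++i)
    if (out[i - 1] > out[i]) return false;

  // Postcondition 2: 'out' holds exactly the elements of 'in'.
  return std::is_permutation(in.begin(), in.end(), out.begin());
}

// Hypothetical unit under test: it validates its own result,
// whatever the input happens to be.
std::vector<int> sorted_copy(const std::vector<int>& in)
{
  std::vector<int> out(in);
  std::sort(out.begin(), out.end());
  assert(is_sorted_version_of(in, out));
  return out;
}
```

No expected output is hard-coded anywhere: the same check works for fabricated input and for real data alike.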

So, why do we put most validation code in the unit test instead?

From my experience, some validation code can be highly involved and complex, especially if it must be expressed algorithmically, so it might be just too time consuming to run. Thus, if such code is to be embedded into the functions themselves, it needs specialized asserts that can be turned on and off in a much cleverer way than is normally done to distinguish release from debug configurations. But such smart asserts are missing from a typical toolbox, so one just uses “conservative” asserts.

And why does most validation code use fixed expected results rather than algorithms?

I guess the canonical answer would be: because, by principle, the testing instrumentation must itself be reliable, so normally you don’t depend on the codebase being tested to obtain expected results.

But I’m unconvinced: can’t fabricated expected results be in error? Sure they can. In fact, many times I had to invest effort fixing wrongly calculated expected results.

Also, I wonder how bad it really is, in practice not theory, to depend on other parts of the code during validation. Consider the nastiest case possible: a cyclic dependency.

object decode(encoding e )
{
  object r = some_decoding_algo(e);   

  TEST( encode(r) == e ) ;   

  return r ;
}
encoding encode ( object o )
{
  encoding e = some_encoding_algo(o);   

 TEST ( decode(e) == o ) ;   

  return e ;
}

Is that really bad? There is a cycle, yes, but only two items are involved. If the test indeed fails, and only after that, I can always resort to conventional external unit testing to break the cycle.
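For the cycle to be usable at all, the nested TEST must not run while an outer TEST is already being evaluated, or the two functions would call each other forever. Here is a minimal sketch of one way to get that, under loud assumptions: the flag g_in_test and this particular TEST macro are made up for illustration (they are not the specialized asserts mentioned in this post), the "encoding" is a toy one-character shift, and the guard is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed guard: true while a TEST expression is being evaluated, so nested
// TESTs are skipped and the encode/decode cycle cannot recurse forever.
static bool g_in_test = false;

#define TEST(expr)                     \
  do {                                 \
    if (!g_in_test) {                  \
      g_in_test = true;                \
      const bool test_ok = (expr);     \
      g_in_test = false;               \
      assert(test_ok && #expr);        \
    }                                  \
  } while (0)

std::string encode(const std::string& o);

// Toy decoding: shift every character down by one.
std::string decode(const std::string& e)
{
  std::string r(e);
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = static_cast<char>(r[i] - 1);

  TEST(encode(r) == e); // validated through the inverse function

  return r;
}

// Toy encoding: shift every character up by one.
std::string encode(const std::string& o)
{
  std::string e(o);
  for (std::size_t i = 0; i < e.size(); ++i)
    e[i] = static_cast<char>(e[i] + 1);

  TEST(decode(e) == o); // validated through the inverse function

  return e;
}
```

Each call validates itself through its inverse exactly once; the inverse’s own TEST is skipped because the guard is set.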

So, I started playing with this thing, which I’m calling “Intrinsic Unit Testing”. That is, I’m using specialized asserts to embed as much validation code as I can in the functions themselves. Of course, there are still the external drivers that put the units at play in a controlled, determined context and with fixed input.

The specialized asserts differ from conventional asserts not only in the way they report errors but in the way they determine whether they need to be executed or not (in a future post I’ll give more details on this).

One upside of Intrinsic Unit Testing is that it allows the application to be tested while it is being used: that is, with real data. Of course you still need to run the automated tests because they cover, if not all, many more paths; but the embedded validation code can catch errors from real data just like the automated tests can.

Another possible advantage of this is that the specialized assertions, given their nature, can become a good mechanism for collecting interesting input data that can be logged and integrated into the test drivers. For example, it is often recommended to add as interesting input every item that was once found to produce an error. This is particularly important, at least to me, because in my experience, when someone finds a bug, even a manual tester, they often report just the observable defect, and you seldom get your hands on the input that produced the problem. Test-oriented validation code, unlike conventional asserts, would log as much data as available when a defect is found, from stack trace to input values.

 

12.14.06

Virtual Virtual Reality

Posted in Blogroll at 5:13 pm by Fernando Cacciola

If two negatives make a positive, do two virtuals make a physical?

Yesterday I came across a “site” named DeRegalo.com, clearly alluding to www.DeRemate.com, an online shopping site for the Latin American market. But this “site” has something special: it isn’t online… it’s a physical store! Yet its name suggests that it pretends to be an internet site.

In that case, is it a virtual internet site? How crazy: now not only do we simulate reality, we simulate simulated reality too!

12.13.06

Going back to where we started

Posted in cplusplus at 9:43 pm by Fernando Cacciola

Consider the following straightforward C++ code:

vector<int> v ;
int a[] = {0,1,2,3,4};
int b[] = {5,6,7,8,9};
copy(a,a+5,back_inserter(v));
copy(b,b+5,back_inserter(v));

Now suppose that we need an iterator range for the second batch insertion. That is, we need to know where the first element from ’b’ was inserted.
How could we do it?

Here’s a naive but tempting possibility

vector<int> v ;
int a[] = {0,1,2,3,4};
int b[] = {5,6,7,8,9};
copy(a,a+5,back_inserter(v)); 

// Mark the place where the first ’b’ will be inserted
vector<int>::iterator mark = v.end() ; 

copy(b,b+5,back_inserter(v)); 

// process from the point the first b was inserted
do_something(mark,v.end());

As right as that might appear to be, it has undefined behaviour in the case of a std::vector, even if .reserve() is called. Let me explain:

Inserting elements into a vector invalidates all iterators if and only if reallocation occurs. But a proper call to reserve() guarantees that reallocation won’t occur, so, to some extent, you don’t have to worry about invalidated iterators when appending into a vector with sufficient capacity. With one exception: the .end() iterator.

The reason is that after c.insert(pos,value), all elements from pos onward need to be shifted up, so logically, all iterators >= pos become invalid, whether there is reallocation or not.

c.push_back(value) is equivalent to c.insert(c.end(),value); thus, clearly, end() is always invalidated by insert() (and therefore by push_back()) and by erase().

OK, but what if the container is a std::list instead? After all, list iterators are still valid after insertion/removal.
Well, this still doesn’t work: .end(), being still valid for a list, keeps pointing to the end of the list, so it cannot be assumed to become a valid iterator to the next element to be added. The range [mark, end()) would simply be empty.
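To make the std::list case concrete, here is a minimal sketch (the function names are made up for illustration). It relies on std::list keeping its iterators valid across insertion, and shows that on typical implementations a mark taken from end() before appending still compares equal to end() afterwards, so the range it delimits is empty.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Size of [mark, end) when 'mark' is taken from end() *before* appending.
std::size_t naive_range_size()
{
  std::list<int> l;
  l.push_back(0);

  std::list<int>::iterator mark = l.end(); // taken too early

  l.push_back(5);
  l.push_back(6);

  // 'mark' is still a valid iterator, but it still denotes the
  // past-the-end position, not the first appended element.
  return static_cast<std::size_t>(std::distance(mark, l.end()));
}

// Size of [mark, end) when 'mark' is taken *after* appending
// and stepped back over the appended elements.
std::size_t fixed_range_size()
{
  std::list<int> l;
  l.push_back(0);
  l.push_back(5);
  l.push_back(6);

  std::list<int>::iterator mark = l.end();
  std::advance(mark, -2); // two elements were appended

  return static_cast<std::size_t>(std::distance(mark, l.end()));
}
```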

The correct way to do it becomes obvious if you consider the simplest case:

vector<int> v ;
v.push_back(0);   

// How do I get an iterator to the last element added?
// Like this of course:
vector<int>::iterator last = v.end();
-- last ;

Which extrapolates easily into the general case shown at the beginning:

vector<int> v ;
int a[] = {0,1,2,3,4};
int b[] = {5,6,7,8,9};
copy(a,a+5,back_inserter(v));
copy(b,b+5,back_inserter(v));   

vector<int>::iterator mark = v.end() ;
std::advance(mark,-5);   

// process from the point the first b was inserted
do_something(mark,v.end());
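An alternative that avoids hard-coding the number of appended elements is to remember the size of the vector (an index, not an iterator) before appending, since an index survives both insertion and reallocation. A minimal sketch, where the helper name append is hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Appends [first, last) to v and returns an iterator to the first appended
// element (or v.end() if the input range was empty).
std::vector<int>::iterator append(std::vector<int>& v, const int* first, const int* last)
{
  // Remember the position as an index: unlike an iterator, it stays
  // meaningful even if the vector reallocates while growing.
  const std::vector<int>::size_type old_size = v.size();

  std::copy(first, last, std::back_inserter(v));

  // Only now, after all insertions, turn the index into an iterator.
  return v.begin() + static_cast<std::vector<int>::difference_type>(old_size);
}
```

The returned iterator and v.end() then delimit the appended elements, as long as the vector is not modified again in between.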

12.07.06

Assignment vs Initialization

Posted in cplusplus at 11:57 pm by Fernando Cacciola

Take a look at the following C++ code snippet

SomeContainer* c_ptr = ..whatever..
for ( some_iterator it = c_ptr->begin()
    ; it != c_ptr->end() && !found
    ; ++ it
    )
{
  while ( some_nested_loop )
  {
    if ( some_condition )
    {
      found = true ;
      c_ptr = c_ptr->OTHER_RELATED_CONTAINER() ;
    }
  }
}

There is a very subtle problem here… the loop condition has undefined behaviour.

The problem is that, as you know well, the loop condition is evaluated first and then again after each iteration of the loop body, and this body changes c_ptr, causing its end() iterator to be compared against an iterator previously obtained from another container. As it turns out, comparing incompatible iterators is undefined behaviour, and, for instance, the checked iterators feature of Dinkumware’s STL (which ships with VC8) asserts on that.

But the problem is subtle because the loop is supposed to be aborted right after the assignment, via the boolean flag used to control the loop (a break statement wouldn’t suffice because the assignment occurs in a nested loop). So the actual “mistake” here is quite silly and subtle (it touches on a dark corner of the C++ standard nobody knows about): the boolean flag “found” should be tested first to avoid comparing incompatible iterators.
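As a minimal sketch of that reordering, here is a concrete stand-in for the snippet: the function scan, the "negative element" condition and the two-pass inner loop are all made up for illustration. Because && short-circuits, testing the flag first means the iterator is never compared against the end() of the newly assigned container.

```cpp
#include <cassert>
#include <vector>

typedef std::vector<int> Container;

// Hypothetical stand-in for the loop above: scans *c_ptr and, on the first
// negative element, switches c_ptr to 'other'. Returns the final c_ptr.
const Container* scan(const Container* c_ptr, const Container* other)
{
  bool found = false;

  for (Container::const_iterator it = c_ptr->begin();
       !found && it != c_ptr->end(); // flag first: && skips the comparison
       ++it)
  {
    // Stands in for the nested loop that rules out a plain 'break'.
    for (int pass = 0; pass < 2 && !found; ++pass)
    {
      if (*it < 0)
      {
        found = true;
        c_ptr = other; // 'it' still belongs to the old container
      }
    }
  }
  return c_ptr;
}
```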

So, here’s a fundamental question: can such a subtle mistake ever be avoided? Well, some mistakes can.

The basic approach is to learn which programming constructs are prone to error and avoid them as much as possible.

One example is using data to control flow. That is, aborting a loop via a boolean flag. There has been an interesting discussion about it, recently, in the ACCU general mailing list.

Another example is assignment, and I mean real assignment, not initialization (that is, when you change the value of a variable). Functional languages just lack assignment, and if you are a functional programming fan, you might even conclude that assignment is unnecessary, even evil.

I believe there’s some truth in that, after all, assignment changes state and that’s always a source of error… we just saw how a little subtle mistake resulted in a bug because of that.

Of course, in an imperative language like C++ you just can’t avoid assignment altogether, and there is no reason to do it. It would be like trying to avoid all mutating operations on objects.

Yet many times we do try to avoid mutating operations on objects. In fact, in C++ we even use const to force us to stay on the safe side, because a changing state, necessary as it is, is error prone.

So, should we avoid assignment? I’d say yes, as much as we can.

Can we write the above code without it then?

How about this?

SomeContainer* c_ptr = ..whatever..
SomeContainer* other_c_ptr = NULL ; // Conceptually uninitialized really
for ( some_iterator it = c_ptr->begin()
    ; other_c_ptr == NULL && it != c_ptr->end()
    ; ++ it
    )
{
  if ( some_condition )
  {
    other_c_ptr = c_ptr->OTHER_RELATED_CONTAINER() ;
  }
}

This comes close: the object determining the search range is held in a variable that never changes (is never assigned), which is great, no more risk of comparing incompatible iterators.

The result, OTOH, is held in a variable that, conceptually, is also just initialized, not really assigned, but with a subtlety: its initialization is conditional.

But yes…there’s no such thing as conditional initialization, of course.

So, can we do better? Let’s see:

SomeContainer* const find ( SomeContainer* const src )
{
  for ( some_iterator it = src->begin()
      ; it != src->end()
      ; ++ it
      )
  {
    if ( some_condition )
      return src->OTHER_RELATED_CONTAINER() ;
  }
  return NULL ;
}
SomeContainer* const source = ..whatever..
SomeContainer* const other_c_ptr = find(source);

Voila! No assignment.

Noticed the const qualifiers in the last example? They are not strictly needed (the code would compile anyway), but they enforce the intention at compile time, which is always an excellent idea.

Wrap up: assignment should be used only when it is strictly necessary to change the value of a variable, that is, when we can’t use a different variable to hold the new value.

12.05.06

Visual C++ debug CRTs and assert()

Posted in cplusplus at 8:17 pm by Fernando Cacciola

If your VC++ console application links to a debug CRT, a failed ‘assert()’ won’t, by default, print a message to stderr and abort(), as you might expect.

But this is required behaviour for unit testing, which naturally needs to run unattended.

After a little investigation, I found out how to control what happens when assert() fails (when a debug CRT is used):

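#include <crtdbg.h>
#include <stdio.h>
#include <stdlib.h>

// Report hook: print a failed assert() to stderr, then abort
int my_report_hook ( int type, char* msg, int* retval )
{
  if ( type == _CRT_ASSERT )
  {
    fputs(msg,stderr); // msg is data, never a format string
    *retval = 0 ;
    abort();
  }
  return 1 ; // TRUE: no further reporting needed
}

// Install the hook at startup, e.g. first thing in main()
_CrtSetReportHook(my_report_hook);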

You should know that abort() calls the hook a second time with the ‘type’ argument as ‘_CRT_ERROR’, so make sure to keep the if clause, unless you want to blow up the stack.

Illegal uses of Singular Iterator Values

Posted in cplusplus at 5:44 pm by Fernando Cacciola

Take a look at the following C++ code snippet:

Container::iterator i ;
Container::iterator j ;
i == j ;

Looks OK doesn’t it?

Well, to many people it does. Let’s see.

What’s the value of the variables i and j?

Formally speaking, a singular value. Practically speaking, it can be anything, because C++ doesn’t force any initialization (like zero, value or default initialization) of uninitialized variables. And yes, these are uninitialized variables for all you know.

A variable of a class type defined without any initializer list is default initialized, and you would be right expecting such a variable to have a non-singular value. At the same time, an iterator can be of a class type, and if it has a default constructor you could be right expecting it to introduce a non-singular iterator variable. But iterators are not required to do that, so an arbitrary uninitialized iterator variable is not default initialized.

More often than not, people treat uninitialized iterator variables as if they were default initialized iterator variables, ignoring the fact that iterators are not required to be class types. Thus, it is unfortunately more or less common to find erroneous comparisons like the one shown at the beginning.

Why am I bringing this up? Because the STL that ships with Visual C++ 8 has checked iterators by default (in a debug build), and illegal use of singular iterator values is one of the mistakes it checks for. Even though these iterators are of class type, trying to compare two uninitialized iterator variables will assert. And rightly so.

Not incidentally I happen to be figuring out how to fix some illegal uses of uninitialized iterators in a code base. The problem only showed up now thanks to the VC8 STL.

The code base uses lots of custom made iterators, most of them wrapping standard iterators. So at first I thought I could simplify the change by simply disabling all default constructors in all wrapping iterators and changing the code on a case by case basis. I figured: why would one need to define an uninitialized iterator variable anyway? Surely there’s always a better construct that avoids the uninitialized variable.

I figured wrong. There is one common idiom (used far too often to disregard) that needs uninitialized variables:

SomeIterator b,e ;
for ( boost::tie(b,e) = GetBothBeginAndEndAtOnce(Container); b != e ; ++ b )
...
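Where a call site can be changed, one way to write the same loop without any uninitialized iterator is to keep the returned pair and initialize the loop variables from it. This is only a sketch under assumptions: get_range is a hypothetical stand-in for GetBothBeginAndEndAtOnce(), and the function is assumed to return a std::pair of iterators.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef std::vector<int> Container;
typedef Container::const_iterator Iter;

// Hypothetical stand-in for GetBothBeginAndEndAtOnce().
std::pair<Iter, Iter> get_range(const Container& c)
{
  return std::make_pair(c.begin(), c.end());
}

// Sums the elements using a loop whose iterators are always initialized.
int sum(const Container& c)
{
  int total = 0;

  // The pair is initialized directly from the call, so neither iterator
  // ever holds a singular value.
  const std::pair<Iter, Iter> r = get_range(c);

  for (Iter b = r.first, e = r.second; b != e; ++b)
    total += *b;

  return total;
}
```

This costs one extra line per loop, which is why rewriting every existing use of the idiom may not be practical in a large code base.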

So the only thing I can do is run the entire project testsuite and hunt down each illegal use (mostly comparisons) one by one.

Fortunately, there is a trick that allows me to run the testsuite unattended:

This line disables (in VC) the pop-up window that is shown by default each time an assert fails:

_set_error_mode(_OUT_TO_STDERR);

Or if you are linking to a debug CRT, these lines instead (the above works only in release CRTs):


_CrtSetReportMode( _CRT_ASSERT, _CRTDBG_MODE_FILE );
_CrtSetReportFile( _CRT_ASSERT, _CRTDBG_FILE_STDERR );

This causes the assert message to go to stderr instead, which the test driver nicely captures in a log file.