Polymorphism without Inheritance — or — How I learned to stop worrying and love boost::any

20 May

Often I prefer to use templates in C++ over polymorphism to avoid confining myself to any particular inheritance hierarchy when writing algorithms. The fact that templated code can run on any object that offers the right interface is a power that is greatly underestimated by C++ critics, in my opinion.
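As a toy example (this function is mine, purely for illustration), a template like this works with any type that has a render() method, no base class required:

template <class Renderable>
void render_twice(Renderable& r)
{
    // compiles for any type with a zero-argument render() method;
    // no common base class is involved
    r.render();
    r.render();
}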

However, this design approach has its limitations, as I learned this week, when I needed to store a collection of objects and call the “render()” method on all of them. This sounds like a task for polymorphism! Of course, this would be easy if I had designed all of these objects to inherit from a “Renderable” base class, but, alas, they did not. The only thing they had in common is that they all implemented the Renderable concept, i.e. they all had a method named “render()”. Luckily, there’s a way to solve this problem without retrofitting all of my classes to inherit from Renderable, using the power of Boost. First, let’s review the traditional way to do this with inheritance, and then we’ll talk about how we can solve this without inheritance.

Let’s say I have a collection of objects of different types, but they all share the fact that they can be rendered. I would like to store and render them in a uniform way. For example:

RenderableA a;
RenderableB b;

// what type should `renderables` be? read on...
renderables.push_back(a);
renderables.push_back(b);
render_all(renderables);
Using polymorphism and inheritance, it is straightforward:
#include <iostream>
#include <vector>
using namespace std;

class Abstract_renderable
{
public:
    virtual void render() = 0;
    virtual ~Abstract_renderable() {}
};

class RenderableA : public Abstract_renderable
{
public:
    virtual void render()
    {
        cout << "RenderableA" << endl;
    }
};

class RenderableB : public Abstract_renderable
{
public:
    virtual void render() { cout << "RenderableB" << endl; }
};

// note: we store pointers, since storing base-class objects
// by value would slice off the derived parts
void render_all(const std::vector<Abstract_renderable*>& renderables)
{
    for(size_t i = 0; i < renderables.size(); i++)
    {
        renderables[i]->render();
    }
}

This approach is concise and idiomatic, but it only accommodates classes that inherit from Abstract_renderable. Why is this problematic? There are a number of reasons. Maybe you’re working with classes from several different third-party libraries that you can’t edit. Maybe you want to operate on abstract types in the same way as classes (admittedly, this is less relevant for “renderables”). Or maybe there simply isn’t a natural way to force all your classes into a tree hierarchy.

But you can’t store an array of heterogeneous types that don’t share a common base class, right?? WRONG! With the power of boost::any, I can store an array of, well, anything:

RenderableA a;
RenderableB b;

// this vector can store literally any type:
std::vector<boost::any> renderables;
renderables.push_back(a);
renderables.push_back(b);
...

Okay, great, I can store the objects, but how do I get them back out and call render() on them? Well, this is where things get tricky. boost::any erases the static type of whatever you store in it, so in a typical usage of boost::any, you have to cast it back to its original type (with boost::any_cast) before you can use it. The problem is, when you have a vector of boost::any, it’s often difficult to keep track of the type of every single element (even deciding how to represent a type at run-time is tricky). The solution I use is to create a wrapper class that stores the objects as well as a callback to the functionality you care about. This is easiest to explain using an example:
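For reference, typical boost::any usage looks like this:

#include <string>
#include <boost/any.hpp>

boost::any box = std::string("hello");

// to get the value back out, you must name the original type;
// casting to the wrong type throws boost::bad_any_cast
std::string s = boost::any_cast<std::string>(box);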

class Renderable_collection
{
public:
    template <class RenderableType>
    void push_back(const RenderableType& obj_in)
    {
        // step 1: copy into a smart pointer
        boost::shared_ptr<RenderableType> obj(new RenderableType(obj_in));

        // step 2: store object
        renderables_.push_back(obj);

        // step 3: "strip off" the render callback and store it separately
        callbacks_.push_back(boost::bind(&RenderableType::render, obj.get()));
    }

    void render_all()
    {
        // todo
    }

private:
    std::vector<boost::function0<void> > callbacks_;
    std::vector<boost::any> renderables_;
};

In the first step, we copy the incoming object into a smart pointer and push it into a vector of boost::any. The reason for this is a bit subtle. Notice that we’re storing two vectors: one vector of callbacks (callbacks_), and one vector of objects (renderables_). The callbacks hold pointers to the stored objects, so it’s essential that those pointers aren’t invalidated if the object vector is re-allocated during a resize. Storing the objects on the heap, behind pointers, solves this problem, and using a smart pointer instead of a raw pointer avoids a memory leak when the object vector is destructed.

In the third step, we “strip off” the render() method of the newly-constructed object and store it in the callback vector. We accomplish this using boost::bind, which in this context converts a routine that is called as a method (i.e. “obj.render();”) into a routine that can be called as a function (i.e. “render();”). It does this by storing a pointer to obj inside a function object, whose function-call operator is overloaded to invoke the method on the stored object pointer (yep, magic). The first argument to boost::bind() is a pointer to a function or method, and the remaining arguments are the parameters to pass to that function. For class methods, the first parameter is always the “this” pointer of the object you want to call the method on. We store the result in a boost::function0<void> object, which is simply a wrapper that can store any function or functor that takes zero parameters and returns void. I recommend becoming familiar with boost::bind if you aren’t already. It’s extremely powerful, and it has been integrated into the newly-finalized C++11 standard.
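Here’s a minimal, self-contained sketch of what boost::bind is doing for us (the Widget class is made up for illustration):

#include <iostream>
#include <boost/bind.hpp>
#include <boost/function.hpp>

struct Widget
{
    void render() { std::cout << "Widget::render" << std::endl; }
};

int main()
{
    Widget w;

    // bind the member function to a specific object pointer...
    boost::function0<void> f = boost::bind(&Widget::render, &w);

    // ...and call it later like an ordinary zero-argument function
    f();  // prints "Widget::render"
    return 0;
}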

Now that we’ve stored the object and the necessary callbacks, we can implement the render_all() method:

class Renderable_collection
{
public:
    template <class RenderableType>
    void push_back(const RenderableType& obj_in) { ... }

    void render_all() const
    {
        for(size_t i = 0; i < callbacks_.size(); i++)
        {
            // call each render callback in sequence
            callbacks_[i]();
        }
    }

private:
    std::vector<boost::function0<void> > callbacks_;
    std::vector<boost::any> renderables_;
};

Thus, our Renderable_collection class provides a way to store arbitrary collections of heterogeneous types, which are related solely by the fact that they all contain a method named “render” that takes zero arguments. Furthermore, it allows us to call the render method of all stored objects, even though their types are no longer known.
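Putting it all together, usage looks like this (a quick sketch, reusing RenderableA and RenderableB from earlier, which no longer need a common base class):

Renderable_collection collection;

RenderableA a;
RenderableB b;

// the two types share nothing but a zero-argument render() method
collection.push_back(a);
collection.push_back(b);

collection.render_all();  // prints "RenderableA", then "RenderableB"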

Improvements

The approach above requires that all objects being stored are copy constructible. This is a problem if you want to store non-copy-constructible objects (std::cout, anyone?), or if copying is too expensive to be practical. The solution is simple: overload the push_back() method to handle pointers specially.

class Renderable_collection
{
    ....

    template <class RenderableType>
    void push_back(const boost::shared_ptr<RenderableType> obj)
    {
        // step 1: skipped!

        // step 2: store object
        renderables_.push_back(obj);

        // step 3: "strip off" the render callback and store it separately
        callbacks_.push_back(boost::bind(&RenderableType::render, obj.get()));
    }

    ...
};

The example above overloads for smart pointers, but if you’re tied to raw pointers (yuck!) you can overload for those, too. You can probably omit step 2 in that case, but leaving it in wouldn’t break anything.
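For completeness, the raw-pointer overload might look something like this (a sketch; note that the collection no longer manages the object’s lifetime):

template <class RenderableType>
void push_back(RenderableType* obj)
{
    // steps 1 and 2 skipped: no copy is made and no ownership is taken,
    // so the caller must keep *obj alive while the collection is in use
    callbacks_.push_back(boost::bind(&RenderableType::render, obj));
}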

The approach above assumes there’s only one method that matters: render(). For more complex concepts with additional methods of interest, this approach is easy to extend. You’ll need to repeat step 3 above for each method to “strip off” additional callbacks, and you’ll need a callback vector for each method of interest.
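For instance, if the concept also had an update() method (hypothetical here), push_back() might grow to:

template <class RenderableType>
void push_back(const RenderableType& obj_in)
{
    boost::shared_ptr<RenderableType> obj(new RenderableType(obj_in));
    renderables_.push_back(obj);

    // one callback vector per method of interest
    render_callbacks_.push_back(boost::bind(&RenderableType::render, obj.get()));
    update_callbacks_.push_back(boost::bind(&RenderableType::update, obj.get()));
}

// members: one callback vector per method
std::vector<boost::function0<void> > render_callbacks_;
std::vector<boost::function0<void> > update_callbacks_;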

In this example, render() is a method (i.e. member function) that takes zero arguments and returns void. However, the approach should work for any function with any number of arguments and any return type. For functions with one or more arguments, you’ll need to use boost::bind’s “placeholder” objects to leave those arguments unbound until call time. For example:

// step 3: _1 is a placeholder that leaves the first argument unbound
callbacks_.push_back(boost::bind(&RenderableType::print, obj.get(), _1));

This call to boost::bind has an additional parameter: _1. This is a “placeholder” that leaves the corresponding parameter unbound. The resulting callback will have a single input parameter corresponding to this placeholder.
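Of course, the resulting callback no longer fits in a boost::function0; it needs a one-argument wrapper. Assuming the hypothetical print() takes an int, that would be:

// one-argument callbacks are stored in boost::function1
std::vector<boost::function1<void, int> > print_callbacks_;

// the unbound placeholder argument is supplied at call time:
print_callbacks_[i](42);  // calls print(42) on the stored object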

Drawbacks

Although this is a nice approach, it has some significant drawbacks. One drawback is performance-related: all the extra indirection of using shared_ptrs, boost::bind, and boost::function will incur some slight overhead compared to using raw pointers and virtual functions in a polymorphic approach.

However, the biggest drawback is that it is impossible to do a deep copy of Renderable_collection, because the true types contained inside the boost::any objects are unknown. On the bright side, since we used smart pointers, shallow copies are safe and free of memory leaks.

CUDA_ERROR_LAUNCH_FAILED: Backward migrating CUDA code from 3.2 to 3.1

31 Oct

I spent part of the last three weeks learning CUDA and developing a C++ CUDA framework for my lab’s computer vision library. I must say, aside from the boilerplate-heavy hassle that is the Driver API, I’ve found it surprisingly easy to work with. Most of my work involves rendering a model in OpenGL, reading the pixels back into main memory with glReadPixels(), and doing image analysis on them. Rinse. Repeat. One-hundred thousand times. Obviously, with such a high workload, efficiency is a priority, and I quickly noticed that glReadPixels() was a HUGE bottleneck. My image analysis wasn’t that complicated (effectively a per-pixel table-lookup followed by summing the result), and if I could do that computation on the graphics card, I’d only have to download one value from the graphics card every iteration instead of 20 million. Goodbye, bottleneck!! After a few false starts with GLSL shaders and OpenCL, I settled on CUDA, and have been rocking it ever since. The architecture took a while to learn, but the language is basically C and couldn’t be more intuitive. My early prototypes were showing between 20x and 100x speedup. The sky was blue, the sun was shining, and nothing could go wrong. That is, until…

For reasons I can’t even remember, I downgraded CUDA from 3.2 to 3.1, and suddenly none of my kernels would run. If I was lucky, the screen would simply flicker and the program would crash with the message:

CUDA_ERROR_LAUNCH_FAILED

Other times, I wasn’t so lucky, and the screen would simply lock up, forcing a hard reboot. “Okay,” I told myself, “THIS is the experience I was expecting when I decided to dive into GPU programming.” You know, black-outs, lock-ups, shooting flames, the ghosts of tortured souls swirling around my monitor. I’m an applications guy–a former PHP programmer for chrissake–and my first real foray into writing low-level down-to-the-metal code was exciting for me. And so upon seeing my code utterly cripple my entire machine, I was almost proud of myself: I was breaking things in new and wonderful ways that were never available to me before! That kid from 7th-grade gym class who insisted that he wrote a virus that would set my computer on fire would be proud. If it weren’t for the fact that I was 10 days from CVPR publication deadline and already behind schedule, I might have sent him a letter.

Anyway, it took almost four hours of trying nearly everything:

  • Poring over my changes since the last working SVN version (no revelations)
  • Rolling back my code to an earlier SVN revision (still failed)
  • Trying it on a remote machine with CUDA 3.2 installed (success!)
  • Stepping through every bit of code in GDB (it failed inside the CUDA kernel, beyond GDB’s clutches)
  • Silently weeping
  • Loudly screaming

…until I finally tracked down the problem: the size (and alignment) of CUdeviceptr changed from 4 to 8 bytes between CUDA 3.1 and CUDA 3.2. That means that when passing device pointers to a kernel in version 3.2, no alignment is necessary, and using sizeof(CUdeviceptr) as the data size and parameter offset is just hunky-dory, just as intuitive as can be! But when you try to run your code on an older version of CUDA, everything goes to shit. I modeled my code after the vectorAddDrv example in the NVIDIA CUDA 3.2 SDK, which makes absolutely no mention of this fact (matrixAddDrv does, though, go figure). Anyway, the result of this little adventure is that this code, which is broken in CUDA 3.1 and earlier:

    CUdeviceptr d_in;
    CUdeviceptr d_out;
    ...
    int offset = 0;

    // note: sizeof(CUdeviceptr) is 8 in CUDA 3.2 but only 4 in CUDA 3.1,
    // while the kernel itself always expects a pointer-sized parameter
    EGC(err = cuParamSetv(function, offset, &d_in, sizeof(d_in)));
    offset += sizeof(d_in);

    EGC(err = cuParamSetv(function, offset, &d_out, sizeof(d_out)));
    offset += sizeof(d_out);

    ALIGN_UP(offset, __alignof(N));
    EGC(err = cuParamSeti(function, offset, N));
    offset += sizeof(N);

    EGC(err = cuParamSetSize(function, offset));
    EGC(err = cuFuncSetBlockShape(function, threads, 1, 1));
    EGC(err = cuLaunchGrid(function, blocks, 1));

became this code that works in CUDA 3.1 (and presumably earlier):

    int offset = 0;

    void* ptr;

    ptr = (void*)(size_t) d_in;
    ALIGN_UP(offset, __alignof(ptr));

    EGC(err = cuParamSetv(function, offset, &ptr, sizeof(ptr)));
    offset += sizeof(ptr);

    ptr = (void*)(size_t) d_out;
    ALIGN_UP(offset, __alignof(ptr));
    EGC(err = cuParamSetv(function, offset, &ptr, sizeof(ptr)));
    offset += sizeof(ptr);

    ALIGN_UP(offset, __alignof(N));
    EGC(err = cuParamSeti(function, offset, N));
    offset += sizeof(N);

    EGC(err = cuParamSetSize(function, offset));
    EGC(err = cuFuncSetBlockShape(function, threads, 1, 1));
    EGC(err = cuLaunchGrid(function, blocks, 1));

Notice the key changes:

  • All CUdeviceptrs are cast to a void*
  • The alignment macro is applied before passing them to the CUDA kernel.

The first step is necessary because, even though sizeof(CUdeviceptr) is four in CUDA 3.1, an actual device pointer inside the CUDA kernel is the same size as a regular pointer (i.e. 8 bytes on a 64-bit system); casting ensures you’re passing an 8-byte value to the kernel and advancing the parameter offset appropriately. The second step ensures the parameter alignment is correct regardless of the size of the pointer, a good programming practice in general, and probably necessary on 32-bit systems (I haven’t tested this, but just do it anyway. And eat your broccoli, it’s good for you!)

So, let this be a lesson, boys and girls: always cast your device pointers to a (void*) and always align your parameters!!
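One last note: in the snippets above, ALIGN_UP is the offset-rounding macro from the NVIDIA SDK samples, and EGC is just my error-checking wrapper. Minimal versions look something like this (the EGC body here is a stand-in, not my exact macro):

#include <cstdio>
#include <cstdlib>
#include <cuda.h>

// round offset up to the next multiple of alignment
// (as defined in the NVIDIA CUDA SDK samples)
#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

// evaluate a driver API call and bail out on any error
#define EGC(call)                                        \
    do {                                                 \
        if ((call) != CUDA_SUCCESS) {                    \
            fprintf(stderr, "CUDA error at %s:%d\n",     \
                    __FILE__, __LINE__);                 \
            exit(EXIT_FAILURE);                          \
        }                                                \
    } while (0)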

Simulating Calibrated Cameras in OpenGL

3 Aug

Edit: The article below is out of date and incomplete. You can find a revised, expanded, and (hopefully) clearer version at my new blog. — Kyle, April 3, 2013

If you’ve ever tried to simulate a calibrated camera in OpenGL, you’ve probably realized it isn’t as straightforward as you might like. The glFrustum and gluPerspective functions can get you most of the way there, but they can pose problems. First, there isn’t an intuitive mapping to camera parameters like principal point and focal length. Second, there isn’t any way to represent camera axis skew or non-square pixels (exhibited by CCD arrays in cheap digital cameras). A more straightforward approach is to build the projection matrix directly, which allows you to represent all of the intrinsic camera parameters: focal length, pixel aspect ratio, principal point (x, y), and axis skew.

The glFrustum Matrix

Let’s start by looking at the matrix generated by an OpenGL call to glFrustum(left, right, bottom, top, near, far):

\text{glFrustum} = \left( \begin{array}{cccc} \frac{2 \; near}{right - left} & 0 & A & 0 \\ 0 & \frac{2 \; near}{top - bottom} & B & 0 \\ 0 & 0 & C & D \\ 0 & 0 & -1 & 0 \end{array} \right)

where

A = \frac{right + left}{right - left}
B = \frac{top + bottom}{top - bottom}
C = -\frac{far + near}{far - near}
D = -\frac{2 \; far \; near}{far - near}

Not everything in this matrix is obvious at first glance, but you should at least start to see similarities between this matrix and the intrinsic camera parameter matrix. Note that OpenGL’s “projection” also preserves z-depth, which explains the third row. Removing the third row, this has the same structure as an intrinsic camera matrix (a.k.a. “K” in Hartley and Zisserman) with zero axis skew:

K = \left( \begin{array}{cccc} \alpha & s & x_0 & 0 \\ 0 & \beta & y_0 & 0 \\ 0 & 0 & -1 & 0 \end{array} \right)

where
\alpha,\; \beta are the focal lengths in x and y pixel units (a.k.a. “scaling factors” in the x, y directions),
s is the axis skew, and
x_0,\; y_0 are the principal point offset.

However, you can’t just replace elements (0,0) and (1,1) of OpenGL’s projection matrix with your camera’s focal length. The reason is that this matrix actually does two things at once: (1) it performs a perspective projection, and (2) it rescales the coordinates to Normalized Device Coordinates. Part (1) is exactly analogous to multiplying by the intrinsic camera matrix, while part (2) has arguably nothing to do with cameras and is just required by the OpenGL architecture.

Decomposing the glFrustum Matrix

So how can we separate these two operations to get at the real camera matrix? Let’s start by looking at a projection that doesn’t add any perspective: glOrtho(). The matrix generated by glOrtho only performs part (2) above, namely, conversion into Normalized Device Coordinates. Let’s look at its matrix:

\text{glOrtho} = \left( \begin{array}{cccc} \frac{2}{right - left} & 0 & 0 & t_x \\ 0 & \frac{2}{top - bottom} & 0 & t_y \\ 0 & 0 & -\frac{2}{far - near} & t_z \\ 0 & 0 & 0 & 1 \end{array} \right)

where
t_x = -\frac{right + left}{right - left}
t_y = -\frac{top + bottom}{top - bottom}
t_z = -\frac{far + near}{far - near}

We see a number of similarities between the elements of glOrtho and the elements of glFrustum(). In fact, we can factor out glOrtho from the glFrustum matrix to get:

\text{glFrustum} = \text{glOrtho} * \left( \begin{array}{cccc} near & 0 & 0 & 0 \\ 0 & near & 0 & 0 \\ 0 & 0 & X & Y \\ 0 & 0 & -1 & 0 \end{array} \right)

where

X = near + far
Y = near \; far

We can ignore X and Y, as they don’t pertain to the calibrated camera matrix; they’re just used to map z-depths in OpenGL. Notice that the second matrix now looks strikingly like the intrinsic camera matrix, K. Let’s spend a moment to interpret this result. A 3D coordinate passing through this matrix is first multiplied by our intrinsic camera matrix, which performs a perspective transformation. Then it passes through glOrtho(), which simply scales and translates the point into Normalized Device Coordinates. Alternatively, we can think of the perspective transformation as converting a trapezoidal-prism-shaped viewing volume into a rectangular-prism-shaped viewing volume, which glOrtho() then scales and translates into the 2x2x2 cube of Normalized Device Coordinates.

Relation to Calibrated Camera

Now that we’ve decomposed the glFrustum matrix, we can draw a direct comparison to our calibrated camera. Namely, glFrustum is setting up a camera with:

• Focal length of “near”
• A pixel aspect ratio of (near / near) = 1
• Zero skew
• A principal point at the exact center (0,0)*

* I should note that glFrustum can represent a camera with an off-center principal point, but in our decomposition above, this occurs in the glOrtho matrix, not in the camera matrix. This will make it more convenient when we simulate the calibrated camera shortly.

Notice that element (3,2) of the projection matrix is ‘-1’. This is because the camera looks in the negative-z direction, which is the opposite of the convention used by Hartley and Zisserman. This inversion of the camera’s z-axis is achieved by multiplying the intrinsic camera matrix on the right by a z-inverting matrix:

\left( \begin{array}{cccc} \alpha & s & x_0 & 0 \\ 0 & \beta & y_0 & 0 \\ 0 & 0 & 1 & 0 \end{array} \right) * \left( \begin{array}{cccc} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&-1&0 \\ 0&0&0&1 \end{array} \right) = \left( \begin{array}{cccc} \alpha & s & -x_0 & 0 \\ 0 & \beta & -y_0 & 0 \\ 0 & 0 & -1 & 0 \end{array} \right)

Note that points with negative z-coordinates will now map to homogeneous 2D points with positive w-coordinates. This is important because OpenGL only renders points whose 2D homogeneous ‘w’ coordinates are positive.

An alternative interpretation is to obtain this matrix by negating the focal lengths and skew, and then multiplying the entire matrix by negative one (a no-op when dealing with homogeneous matrices). We may interpret focal length as the position of the virtual image plane lying in front of the camera, and since the camera’s direction has reversed, the focal lengths have become negative. Also, since inverting the z-axis means clockwise is now counter-clockwise, the skew parameter becomes negated, too. In other words:

-1 * \left( \begin{array}{cccc} -\alpha & -s & x_0 & 0 \\ 0 & -\beta & y_0 & 0 \\ 0 & 0 & 1 & 0 \end{array} \right) = \left( \begin{array}{cccc} \alpha & s & -x_0 & 0 \\ 0 & \beta & -y_0 & 0 \\ 0 & 0 & -1 & 0 \end{array} \right)

Transforming this intrinsic camera matrix back into OpenGL form gives:

\text{Proj} = \text{glOrtho} * \left( \begin{array}{cccc} \alpha & s & -x_0 & 0 \\ 0 & \beta & -y_0 & 0 \\ 0 & 0 & X & Y \\ 0 & 0 & -1 & 0 \end{array} \right)

where

X = near + far
Y = near \; far

With this matrix, we can simulate all five intrinsic camera parameters, as opposed to the three that glFrustum can express (a single focal length plus the (x, y) principal point). In addition, this representation decouples the focal length from the near plane, which are bound to each other in glFrustum(). When the skew and principal point offset are zero, the pixel aspect ratio is unity, and the focal length equals the near plane, the result is exactly the same as glFrustum().
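To make this concrete, here’s a sketch of how you might set this matrix up in fixed-function OpenGL (the function and variable names are mine, purely illustrative):

#include <GL/gl.h>

// Build the perspective part of the projection from the five intrinsic
// parameters, then let glOrtho supply the conversion to Normalized
// Device Coordinates. The array is column-major, as OpenGL expects.
void set_calibrated_projection(double alpha, double beta, double skew,
                               double x0, double y0,
                               double near_, double far_,
                               double left, double right,
                               double bottom, double top)
{
    double persp[16] = {
        alpha, 0.0,  0.0,            0.0,   // column 0
        skew,  beta, 0.0,            0.0,   // column 1
        -x0,   -y0,  near_ + far_,  -1.0,   // column 2
        0.0,   0.0,  near_ * far_,   0.0    // column 3
    };

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(left, right, bottom, top, near_, far_);  // part (2): NDC conversion
    glMultMatrixd(persp);                            // part (1): perspective projection
}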

Calling glOrtho Correctly

Before we close, I should note that implementing this correctly requires passing the proper parameters to glOrtho(). Specifically, you should pass the pixel coordinates of the left, right, bottom, and top of the window you used when performing calibration. For example, let’s assume you calibrated using a 640×480 image. If you used a pixel coordinate system whose origin is at the top-left, with the y-axis increasing in the downward direction, you would call glOrtho(0, 640, 480, 0, near, far). If you calibrated with the origin at the center and conventional rightward/upward x, y axes, you would call glOrtho(-320, 320, -240, 240, near, far).

If you’d prefer your implementation to be independent of your calibration approach, you should “standardize” your camera calibration matrix in a preprocessing step. Start by translating screen coordinates so the origin is in the center of the image:

K \leftarrow \left( \begin{array}{ccc} 1 & 0 & -\frac{width}{2} \\ 0 & 1 & -\frac{height}{2} \\ 0 & 0 & 1 \end{array} \right) K

Then, flip the y-axis if you used a coordinate system where y increases in the downward direction:

K \leftarrow \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{array} \right) K

Now, you can use the same call to glOrtho for all situations: glOrtho(-width/2, width/2, -height/2, height/2, near, far). Increasing width and height will result in showing more of the scene, as if you used a larger film surface in a pinhole camera, or a larger CCD array in a digital camera. Of course, correct simulation requires using the width and height values corresponding to the image used during calibration, so we still haven’t achieved an implementation that is independent of the specific calibration scenario. The last step is to transform camera coordinates into normalized device coordinates yourself:

K \leftarrow \left( \begin{array}{ccc} \frac{2}{width} & 0 & 0 \\ 0 & \frac{2}{height} & 0 \\ 0 & 0 & 1 \end{array} \right) K

If you work with your camera matrix in this form, you don’t have to call glOrtho at all. Now, changing the window or viewport size will scale the display, without showing any more or less of the scene. Of course, combining the preprocessing steps above amounts to exactly multiplying by the glOrtho matrix with the appropriate parameters described above, based on your calibration scenario.

So that’s it! I hope this is helpful when you’re designing your own simulation of a calibrated camera!


References

“Multiple View Geometry in Computer Vision”, Hartley and Zisserman
“Computer Vision: A Modern Approach”, Forsyth and Ponce
OpenGL glFrustum() Documentation
OpenGL glOrtho() Documentation
“OpenGL Projection Matrix”, songho.ca (Very nice: derives the glFrustum matrix)
“Why does sign matter in opengl projection matrix”, StackOverflow.com
“Questions concerning the ARToolkit”, ARToolkit Mailing List — Early inspiration for this approach