The algorithm is thrown off in the case of the sign because of parallax, or rather because there is more than one motion on the screen. As the sign gets larger, a greater percentage of the votes go to that motion rather than to the background, which is moving differently. If we could separate the two motions, we would be able to calculate P for one or the other, and hopefully for both, correctly.
Traditionally, this type of problem was solved by providing video orbits with an initial guess as a starting point. From this initial guess, supplied by a human user through visual inspection, the algorithm would work correctly. This is not an ideal solution.
It would be great if we could apply robust estimation to the scene and calculate P for both the sign and the background. Or rather, prevent the smaller percentage of motion from throwing off the calculation for the larger percentage of the motion.
I use existing tools and attempt a proof of concept. The idea is not to apply robust estimation directly to the entire image, but rather to cut the image up into little pieces and try to calculate P for each section using existing software. How can we compare the results? The metric will be mean square error (MSE). Because P is an exact solution, you can ``dechirp'' an image. Therefore, you ``dechirp'' the second image into the reference frame of the first one and compare the two images directly. Of course, some points will not exist in both images, but you take this into consideration by leaving them out of the comparison (there is also a complementary technique of ``cementing'' multiple dechirped images together).
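To make the comparison concrete, here is a minimal sketch of the dechirp-and-compare step. It is not the existing software mentioned above. It assumes grayscale images stored as 2-D NumPy arrays, and it assumes P has already been estimated and is written as a 3x3 matrix `H` that maps reference pixel coordinates to coordinates in the second image. The names `dechirp`, `block_mse`, `H` and the block size are my own placeholders, and the resampling is plain nearest-neighbour.

```python
import numpy as np

def dechirp(image, H):
    """Warp `image` into the reference frame (nearest-neighbour sampling).

    H is assumed to map reference pixel coordinates (x, y, 1) to
    coordinates in `image`. Returns the warped image and a boolean mask
    marking the reference pixels that actually had a source pixel.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    mapped = np.asarray(H, dtype=float) @ pts
    with np.errstate(divide="ignore", invalid="ignore"):
        u = mapped[0] / mapped[2]
        v = mapped[1] / mapped[2]
    valid = np.isfinite(u) & np.isfinite(v)
    # Push unusable coordinates just outside the image before rounding.
    u = np.clip(np.where(valid, u, -1.0), -1.0, w)
    v = np.clip(np.where(valid, v, -1.0), -1.0, h)
    ui = np.rint(u).astype(int)
    vi = np.rint(v).astype(int)
    valid &= (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    out = np.zeros(h * w)
    out[valid] = image[vi[valid], ui[valid]]
    return out.reshape(h, w), valid.reshape(h, w)

def block_mse(ref, warped, valid, block=16):
    """MSE per block, counting only pixels that exist in both images.

    Blocks with no valid pixels are reported as NaN.
    """
    h, w = ref.shape
    rows, cols = h // block, w // block
    sq = (ref.astype(float) - warped) ** 2
    mse = np.full((rows, cols), np.nan)
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * block, (r + 1) * block),
                  slice(c * block, (c + 1) * block))
            m = valid[sl]
            if m.any():
                mse[r, c] = sq[sl][m].mean()
    return mse
```

If a separate P is calculated for each section, the same comparison would be run once per section with that section's own `H`, keeping only the MSE of the corresponding block.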
Breaking up the image into smaller sections and graphing the MSE allows us to identify areas with different MSE. Firstly, I would like to take the regions of high MSE and mask them out in order to calculate a better P for the rest of the image. Secondly, I would like to take the areas of high MSE, mask out the rest of the image, and calculate the parameters P for those areas alone. This would allow you to track a second motion. Ideally, this could be extended to track multiple motions.
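The two masks described here could be built from a per-block MSE grid along the following lines. This is only a sketch under my own assumptions: the grid is a 2-D NumPy array with NaN for blocks that could not be compared, and a single fixed `threshold` separates ``high'' from ``low'' MSE. How to pick that threshold is not settled here, and `split_masks` is a hypothetical helper name.

```python
import numpy as np

def split_masks(mse, block, shape, threshold):
    """Turn a per-block MSE grid into two pixel-level boolean masks.

    Returns (low_mask, high_mask) with the given image `shape`.
    low_mask selects pixels in blocks with MSE <= threshold,
    high_mask selects pixels in blocks with MSE > threshold.
    Blocks with NaN MSE, and edge pixels not covered by a whole block,
    end up in neither mask.
    """
    def expand(grid):
        # Blow each grid cell up to a block x block patch of pixels.
        pix = np.repeat(np.repeat(grid, block, axis=0), block, axis=1)
        full = np.zeros(shape, dtype=bool)
        full[:pix.shape[0], :pix.shape[1]] = pix
        return full

    high = mse > threshold    # comparisons with NaN are False
    low = mse <= threshold
    return expand(low), expand(high)
```

The first mask would be handed to the estimator to recalculate P for the dominant motion, and the second to calculate P for the other motion, assuming the estimator accepts a mask of pixels to use.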
Why would this be useful? Say, for example, we have a mediated reality. As you are sitting on a bench feeding the pigeons, you look up and see a horrible billboard advertising some inane politician. You would like to open an xterm over it and check your e-mail. So, as you pan around looking from side to side, video orbits will need to calculate the motion of the entire scene as well as that of the billboard. You need the P for the billboard in order to project the xterm onto it correctly as it moves around in the frame. You wouldn't want that xterm sliding off the billboard and wandering around willy-nilly.
The above image was created by James Fung. He manually masked out the rest of the image and calculated the projective parameters P for the two billboards of interest.