Everything you know about computer vision may soon be wrong
Ubicept wants half of the world’s cameras to see things differently
Computer vision could be a lot faster and better if we skip the concept of still frames and instead directly analyze the data stream from a camera. At least, that’s the theory that the newest brainchild spinning out of the MIT Media Lab, Ubicept, is operating under.
Most computer vision applications work the same way: A camera takes an image (or a rapid series of images, in the case of video). These still frames are passed to a computer, which then does the analysis to figure out what is in the image. Sounds simple enough.
But there’s a problem: That paradigm assumes that creating still frames is a good idea. To humans, who are used to seeing photography and video, that might seem reasonable. Computers don’t care, however, and Ubicept believes it can make computer vision far better and more reliable by ignoring the idea of frames.
The company is a collaboration between its two co-founders. Sebastian Bauer is the company’s CEO and a postdoc at the University of Wisconsin, where he was working on lidar systems. Tristan Swedish is now Ubicept’s CTO. Before that, he was a research assistant and a master’s and Ph.D. student at the MIT Media Lab for eight years.
“There are 45 billion cameras in the world, and most of them are creating images and video that aren’t really being looked at by a human,” Bauer explained. “These cameras are mostly for perception, for systems to make decisions based on that perception. Think about autonomous driving, for example, as a system where it is about pedestrian recognition. There are all these studies coming out that show that pedestrian detection works great in bright daylight but particularly badly in low light. Other examples are cameras for industrial sorting, inspection and quality assurance. All these cameras are being used for automated decision-making. In sufficiently lit rooms or in daylight, they work well. But in low light, especially in connection with fast motion, problems come up.”
The company’s solution is to bypass the “still frame” as the source of truth for computer vision and instead measure the individual photons that hit an imaging sensor directly. That can be done with a single-photon avalanche diode array (or SPAD array, among friends). This raw stream of data can then be fed into a field-programmable gate array (FPGA, a reconfigurable chip that can be programmed for highly specialized processing) and further analyzed by computer vision algorithms.
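To make the idea of analyzing a photon stream without frames concrete, here is a minimal Python sketch. It is an illustration only, not Ubicept’s algorithm: the event format (a timestamp and a pixel index per photon), the function name track_position and the simulated moving dot are all assumptions made for this example.

```python
from collections import deque

def track_position(events, k):
    """Estimate where a light source is after each photon, using the mean
    pixel index of the k most recent photons. No frames are ever built."""
    recent = deque(maxlen=k)
    estimates = []
    for timestamp, pixel in events:
        recent.append(pixel)
        estimates.append((timestamp, sum(recent) / len(recent)))
    return estimates

# Hypothetical stream: one photon per time tick from a dot that
# moves one pixel to the right every 10 ticks.
events = [(t, t // 10) for t in range(40)]
estimates = track_position(events, k=5)

print(estimates[9])   # (9, 0.0)
print(estimates[39])  # (39, 3.0)
```

The estimate is refreshed with every photon rather than once per exposure, which is the general appeal of working on the stream itself.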
The newly founded company demonstrated its tech at CES in Las Vegas in January, and it has some pretty bold plans for the future of computer vision.
“Our vision is to have technology on at least 10% of cameras in the next five years, and in at least 50% of cameras in the next 10 years,” Bauer projected. “When you detect each individual photon with a very high time resolution, you’re doing the best that nature allows you to do. And you see the benefits, like the high-quality videos on our webpage, which are just blowing everything else out of the water.”
TechCrunch saw the technology in action at a recent demonstration in Boston and wanted to explore how the tech works and what the implications are for computer vision and AI applications.
A new form of seeing
Digital cameras generally capture a single-frame exposure by “counting” the number of photons that hit each of the sensor pixels over a certain period of time. At the end of the time period, the photons counted at each pixel are added together, and you have a still photograph. If nothing in the image moves, that works great, but the “if nothing moves” thing is a pretty big caveat, especially when it comes to computer vision. It turns out that when you are trying to use cameras to make decisions, everything moves all the time.
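A toy Python sketch shows why summing over an exposure smears anything that moves. The one-dimensional sensor, the simulated dot and the function names are assumptions made for illustration, not a model of any real camera.

```python
from collections import Counter

def moving_dot_events(n_ticks, ticks_per_pixel):
    """Hypothetical photon stream: one photon per time tick from a dot
    that moves one pixel to the right every `ticks_per_pixel` ticks."""
    return [(t, t // ticks_per_pixel) for t in range(n_ticks)]

def integrate_frame(events, t_start, t_end, n_pixels):
    """Conventional exposure: sum the photons that hit each pixel
    during [t_start, t_end) and discard their arrival times."""
    counts = Counter(pixel for t, pixel in events if t_start <= t < t_end)
    return [counts.get(p, 0) for p in range(n_pixels)]

events = moving_dot_events(n_ticks=40, ticks_per_pixel=10)

long_exposure = integrate_frame(events, 0, 40, n_pixels=4)
short_exposure = integrate_frame(events, 0, 10, n_pixels=4)

print(long_exposure)   # [10, 10, 10, 10]
print(short_exposure)  # [10, 0, 0, 0]
```

In the long exposure the dot is spread evenly across all four pixels, which is motion blur in miniature. The short exposure pins the dot to one pixel, but it collects only a quarter of the photons, which is the trade-off that bites in low light.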
Of course, with the raw data, the company is still able to combine the stream of photons into frames, which creates beautifully crisp video without motion blur. Perhaps more excitingly, dispensing with the idea of frames means that the Ubicept team was able to take the raw data and analyze it directly. Here’s a sample video of the dramatic difference that can make in practice: