A colleague recently sent me the following link describing a project by MIT's Fluid Interfaces Group, related to next generation interaction solutions. Besides being an interesting (and occasionally amusing) presentation, it reminded me of some thinking I participated in on a similar front, which I thought was worth sharing. For ease of reading, I'll break the subject into multiple posts, starting with image recognition.
Wouldn't it be great if my computer could see what I saw and tell me everything I wanted to know about it? I'd look at a product on a shelf and know everything about it. I could look at a billboard advertising a film and instantly know where and when I could see it. I'd never forget a name... This might sound like something out of Minority Report, but in fact the "wearable webcam" concept has been around for some years now, and I'm slightly surprised that it's still not made it commercially.
Microsoft have been very active in this space with their SenseCam, which started life as a simple camera on a neck chain that took an image every 5 or 10 seconds (meaning an end to lost keys, pens, phones etc...), but has grown over the years to incorporate body monitoring, GPS and so on. Without getting into a detailed discussion of the concept's advantages and disadvantages, automatic recording of daily or even second-by-second activity has been practical for some years, but hasn't come to market for some reason or other.
Contextual analysis of images is much more interesting and challenging. From my perspective this is a logical extension of semantic search (as practised by search engines). At the moment, search engines identify and link pictures at a macro level by reading their user-added metadata tags. Live images sadly lack meta tags, so recognising them depends on being able to rapidly match images. Again, Microsoft Live Labs have an interesting research project related to this, called Photosynth. This is likely to be a very processor-intensive task, as pictures inherently contain far more data than text. Location information will doubtless help to reduce processing time, by restricting the initial search to images matched by other users in the general vicinity.
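To make the vicinity idea concrete, here is a minimal sketch of pruning a geotagged image database by distance before the expensive visual-matching step runs. Everything in it is an illustrative assumption on my part (the record layout, the IDs, the 2 km radius), not a description of any real system:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearby_candidates(query_lat, query_lon, images, radius_km=2.0):
    """Keep only images tagged near the user's location, so the
    costly matching step runs over a far smaller candidate set."""
    return [img for img in images
            if haversine_km(query_lat, query_lon,
                            img["lat"], img["lon"]) <= radius_km]

# Hypothetical database of geotagged reference images
db = [
    {"id": "billboard-42", "lat": 51.5074, "lon": -0.1278},   # central London
    {"id": "car-advert-7", "lat": 51.5080, "lon": -0.1270},
    {"id": "times-square", "lat": 40.7580, "lon": -73.9855},  # New York
]

# A user standing in central London only triggers matching
# against the two London images; New York is filtered out.
candidates = nearby_candidates(51.5076, -0.1275, db, radius_km=2.0)
```

The point is simply that a cheap geographic filter can cut the search space by orders of magnitude before any pixel-level comparison happens.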
As the MIT team suggest, the Cloud is a realistic option for image search, provided that sufficient mobile data bandwidth exists at the user's location. That said, solid state storage is becoming ever more capacious, smaller and cheaper, so local caching of an encyclopedia of common 3D images is likely to be commercially possible in the next few years. From a commercial perspective, the opportunity for monetisation of this technology seems clear - location based advertising delivered at the point of identification, so that if I'm looking at a car, I can see the web link that will tell me all about it. Since banner advertising is a zero-sum game of sorts, I'm afraid the money is likely to come from TV and press advertising budgets...
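The cache-versus-Cloud trade-off above amounts to a simple lookup policy: try the local encyclopedia first, and only go out over the mobile network on a miss. A toy sketch, with every name and signature here being my own illustrative assumption:

```python
def identify(image_hash, local_cache, cloud_lookup):
    """Return an identification for image_hash, preferring the
    on-device cache and falling back to a (slower, bandwidth-
    dependent) cloud search only on a cache miss."""
    hit = local_cache.get(image_hash)
    if hit is not None:
        return hit
    result = cloud_lookup(image_hash)  # network round-trip
    if result is not None:
        local_cache[image_hash] = result  # remember for next time
    return result
```

The commercial question is then just how large the local cache of common images needs to be before most lookups never touch the network at all.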
My conclusion on contextual search is that it is technically possible today, but will not be commercially feasible until deep, widespread 4th generation mobile networks are available - 2011/12 or thereabouts in the most developed markets. Similarly, monetisation will depend on linking adverts to images in the same way as proposed for monetisation of online VOD.
The combination of always-on, line-of-sight image capture and rich, rapid identification could be a true killer app for mobile data; however, its value will be vastly diluted if it is not combined with next generation screen technology. More on that next time.