Saturday, August 8, 2009

Information Overload... Future?

It is clear that with all the Web 2.0 apps, real-time feeds, and expanding social networks, information overload is happening. We are bombarded every day with so much information that our attention spans are dramatically shrinking and our focus is diminishing. If we don't get our answer on Google within seconds, we get frustrated, and instead of researching from what we have, we re-search.

Increasingly, there will be a pressing need for organization, relevance, and personalization. One way to achieve this is the bottom-up semantic web approach, where data entities are expressed in RDF, interconnected via powerful ontologies, and aligned across the web. This can organize the data out there in a powerful way, which is a precursor to delivering relevance, something current algorithmic approaches cannot assure.
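To make the bottom-up idea concrete, here is a minimal sketch of what such annotated data looks like in Turtle, a compact RDF notation. The `example.org` URIs and the names are hypothetical; the FOAF and Dublin Core vocabularies are real, widely used ontologies:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .

# Hypothetical entities for illustration: a person and a post,
# linked to each other through shared ontology terms.
<http://example.org/people/alice>
    a foaf:Person ;
    foaf:name "Alice" ;
    foaf:knows <http://example.org/people/bob> .

<http://example.org/posts/42>
    dc:title   "Information Overload... Future?" ;
    dc:creator <http://example.org/people/alice> .
```

Because both documents use the same vocabularies and URIs, a machine can join data published on different sites without any site-specific glue, which is exactly the kind of alignment described above.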

However, there are some issues with the above. Expecting everyone to annotate their data in RDF, using the appropriate ontologies, is foolish, especially considering that many content publishers, and even many web designers, don't have strong software backgrounds.

Another issue is akin to a chicken-and-egg problem: since the benefits of annotating and embedding data cannot be realized until the practice is deployed at large scale, individuals will continue to have little incentive to spend the extra time annotating their data.

The second approach is top-down: it relies on statistical analysis, NLP, and algorithmic tools to embed semantics in unstructured data. Applications like Freebase, OpenCalais, and other tools that convert unstructured text to RDF fall into this category. While they help with the semantic transition, the chicken-and-egg problem remains. Plus, some of these tools are inaccurate since they are purely algorithmically driven.
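A toy sketch of what such top-down extraction does under the hood: scan unstructured text for known entities and emit RDF-style triples. This is a deliberately crude dictionary lookup, not how OpenCalais or similar services actually work, and all the URIs here are made up for illustration:

```python
import re

# Hypothetical entity dictionary; real systems use statistical NLP,
# not a hand-written lookup table like this.
KNOWN_ENTITIES = {
    "Twitter": "http://example.org/entity/Twitter",
    "Google": "http://example.org/entity/Google",
}

def extract_triples(text):
    """Emit (subject, predicate, object) triples for entity mentions in text."""
    triples = []
    for name, uri in KNOWN_ENTITIES.items():
        # Whole-word match so "Googled" doesn't count as "Google", etc.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            triples.append((uri, "http://example.org/prop/mentionedIn", text))
    return triples

triples = extract_triples("Google indexes tweets from Twitter in real time.")
```

Even this trivial version shows where the inaccuracy comes from: the quality of the output is bounded by the coverage and precision of the purely algorithmic matching step.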

I have a different perspective on how semantics is going to emerge:

1. It is going to be a movement within the web, one that cannot be imposed by external organizations like the W3C or anyone else. Content subscribers and publishers will come to feel the information overload, and we will realize that disconnected tags are inefficient and useless. The key here is that the web will realize this on its own; no one can force it to. Now might just not be the right time.

Let's take Twitter, for example. Its simple scheme for organizing related information is the hash tag ('#'). People using the same hash tag for the same topic in their tweets effectively turn it into an annotation tool. This was not invented or imposed by Twitter. Instead, it was started by Twitter users and later made a standard by Twitter...
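The hash-tag-as-annotation idea can be sketched in a few lines: extract tags from tweets and build a shared index, the grassroots equivalent of a vocabulary. The sample tweets are invented for illustration:

```python
import re
from collections import defaultdict

def index_by_hashtag(tweets):
    """Group tweets by the hashtags they contain, folksonomy-style."""
    index = defaultdict(list)
    for tweet in tweets:
        # "#semanticweb" and "#SemanticWeb" should land in the same bucket.
        for tag in re.findall(r"#(\w+)", tweet):
            index[tag.lower()].append(tweet)
    return dict(index)

tweets = [
    "Reading about RDFa today #semanticweb",
    "Hash tags as grassroots annotation #semanticweb #web20",
]
index = index_by_hashtag(tweets)
```

The point is that no central authority defined the tags; the shared convention alone is enough to let software group related content.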

2. It is going to be an iterative process. Once point 1 is reached, content providers are not going to immediately go all out and annotate everything; they will annotate some important parts. Once these annotations appear, EXISTING services like Google Search and other content aggregators will modify their solutions to account for them, because their aim is to provide relevance and annotations will help them do that. Annotation will in turn give content providers visibility in the delivery of that relevance. This pattern will go back and forth, which is how many other web technologies and trends have reached fruition.

3. When I say 'annotate' in point 2, I refer to annotation in the simplest semantic markup formats, namely RDFa and Microformats. These two formats will serve as intermediaries in the semantic transition since they integrate extremely easily with HTML, which many content publishers are already familiar with.
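As an illustration of how lightweight this is, here is a hypothetical author byline marked up with RDFa (the `example.org` URIs and the name are invented; the attributes and the FOAF vocabulary are real):

```html
<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="http://example.org/people/alice" typeof="foaf:Person">
  <!-- property/rel attributes layer machine-readable meaning
       onto otherwise ordinary HTML -->
  <span property="foaf:name">Alice</span> wrote this post.
  Her homepage is
  <a rel="foaf:homepage" href="http://example.org/~alice">here</a>.
</div>
```

The page still renders exactly as before for human readers; the semantics ride along in a handful of attributes, which is why the barrier to entry is so low.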

4. An equilibrium will be reached between the usefulness of annotating certain content and the value gained from organizing all the content from different sources. It may not be necessary to go all out and annotate the plethora of properties a given object has; the web will determine what is important to annotate. More extensive annotation and full RDF markup may be employed in certain vertical domains at the enterprise and academic level, just not on the mainstream web.

5. Meanwhile, continued research in NLP, data mining, and information retrieval will coalesce with annotated sources to provide more intelligent applications. Again, it will be an iterative process, depending on the degree of annotation achieved.

These are just some of the thoughts going through my head as I learn these technologies and observe how the web works. There is little use in predicting the future, since time is out of your control; I have worked with what I have today to arrive at the thoughts above. If something new comes up, please expect another post from me on how my views have radically changed :)

I think it is more important to observe, and see opportunities as they arise. It is also more fun that way!
