Saturday, August 8, 2009

Information Overload... Future?

It is clear that with all the web 2.0 apps, real-time feeds, and expanding social networks, information overload is happening. We are bombarded every day with so much information that our attention spans are dropping dramatically and our focus is diminishing. If we don't get our answer on Google within seconds, we get frustrated, and instead of researching from what we have, we re-search.

Increasingly, there will be an important need for organization, relevance, and personalization. One way this can be achieved is via the bottom-up semantic web approach, where data entities are expressed in RDF formats and interconnected via powerful ontologies, all aligned across the web. This can organize the data out there in a powerful way, which is a precursor to delivering relevance — something current algorithmic approaches cannot assure.
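To make the bottom-up idea concrete, here is a minimal sketch of what such interconnected RDF data looks like in Turtle syntax, using the real FOAF vocabulary (the `example.org` URIs are hypothetical placeholders):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .

# Two entities, each with properties, linked to one another.
ex:alice a foaf:Person ;
    foaf:name  "Alice" ;
    foaf:knows ex:bob .

ex:bob a foaf:Person ;
    foaf:name "Bob" .
```

Because both publishers use the same ontology (FOAF), any consumer can merge these statements with data from other sources and know that `foaf:knows` means the same thing everywhere.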

However, there are some issues with the above. Expecting everyone to annotate their data in RDF, using the appropriate ontologies, is unrealistic — especially considering that many content publishers, let alone web designers, don't have strong software backgrounds.

Another issue is akin to a chicken-and-egg problem: since the benefits of annotating and embedding data cannot be realized until the practice is deployed at large scale, there will continue to be little incentive for any individual to spend the extra time annotating his/her data.

The second approach is a top-down one, which relies on statistical analysis, NLP, and algorithmic tools to embed semantics in unstructured data. Applications like Freebase, OpenCalais, and other tools that convert unstructured text to RDF fall into this category. While they provide support in making the semantic transition, the chicken-and-egg problem still exists. Plus, some of these tools are inaccurate since they are purely algorithmically driven.
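The essence of the top-down approach can be sketched in a few lines of Python. This is a deliberately naive dictionary-based tagger, not how OpenCalais or similar services actually work (they use statistical NLP); the entity list and `example.org` URIs are purely illustrative:

```python
# Toy top-down semantics: scan unstructured text for known entities
# and emit RDF triples in N-Triples syntax. The fixed dictionary below
# stands in for what a real service derives via statistical NLP.
KNOWN_ENTITIES = {
    "Boston": "http://example.org/place/Boston",
    "MIT": "http://example.org/org/MIT",
}

def text_to_triples(doc_uri, text):
    """Emit one 'mentions' triple per known entity found in the text."""
    triples = []
    for name, uri in KNOWN_ENTITIES.items():
        if name in text:
            triples.append(f"<{doc_uri}> <http://example.org/mentions> <{uri}> .")
    return triples

for triple in text_to_triples("http://example.org/doc/1",
                              "MIT researchers met in Boston."):
    print(triple)
```

The inaccuracy problem mentioned above is visible even here: simple substring matching would happily tag "Bostonian" as a mention of Boston, and a purely algorithmic system has no human in the loop to catch that.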

I have a different perspective on how semantics is going to emerge on the web:

1. It is going to be a movement within the web, which cannot be forced upon it by external organizations like the W3C or anyone else. The information overload is going to be felt by content subscribers and publishers alike, and we are going to realize that disconnected tags are inefficient and useless. The key here is that the web will realize it on its own, and no one can force it to. Now might just not be the right time.

Let's take Twitter for example. Their simple scheme for organizing related information is the hash tag ('#'). When people use the same hash tag to refer to the same topic in their tweets, it effectively acts as an annotation tool. This was not invented and imposed by Twitter; instead, it was started by Twitter users and later made a standard by Twitter...
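The hashtag-as-annotation idea is simple enough to sketch: extracting the tags from tweet text is all it takes to see independent users converging on a shared topic label (the tweets below are made up for illustration):

```python
import re

def extract_hashtags(tweet):
    """Return the hashtag topics embedded in a tweet, lowercased
    so that #SemWeb and #semweb count as the same annotation."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]

# Tweets from different users converge on the same tag, effectively
# annotating unrelated pieces of text with a shared topic.
tweets = [
    "Great session on linked data today #semweb",
    "Reading about RDFa this weekend #SemWeb",
]
print([extract_hashtags(t) for t in tweets])  # [['semweb'], ['semweb']]
```

Nothing about this is enforced by Twitter's platform — the convention works purely because users adopted it, which is exactly the bottom-up dynamic described above.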

2. It is going to be an iterative process. Once point 1 is achieved, content providers are not going to immediately go all out and annotate all their content; they will annotate some important parts. Once these annotations begin, EXISTING services like Google Search and other content aggregators will modify their solutions to account for them, because their aim is to provide relevance and annotations will help them do that. Annotation will in turn help content providers gain visibility within that relevance. This pattern will go back and forth — akin to how many other web technologies and trends have come to fruition.

3. When I say 'annotate' in point 2, I refer to annotation in the simplest of semantic markup formats, namely RDFa and Microformats. These two formats will serve as intermediaries in the semantic transition since they integrate extremely easily with HTML, which many content publishers are familiar with.
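Here is a minimal illustration of why RDFa is such a low barrier: the semantic markup rides along on ordinary HTML attributes without changing what the page displays (the name and link target are hypothetical; the FOAF vocabulary is real):

```html
<!-- RDFa (1.0-style) embedded in plain HTML: the person and their
     interest are machine-readable, yet the visible page is unchanged. -->
<div xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person">
  <span property="foaf:name">Jane Doe</span> writes about the
  <a rel="foaf:interest" href="http://www.w3.org/2001/sw/">semantic web</a>.
</div>
```

A publisher who already knows HTML only has to learn a handful of attributes (`typeof`, `property`, `rel`) rather than a whole new serialization format.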

4. An equilibrium will be reached between the effort of annotating certain content and the value gained from organizing all the content from different sources. It may not be necessary to go all out and annotate the plethora of properties a given object has; the web will determine what is important to annotate. More extensive annotation and full RDF markup may still be employed in certain vertical domains at the enterprise and academic level, just not on the mainstream web.

5. Meanwhile, continued research in NLP, data mining, and information retrieval will coalesce with annotated sources to provide more intelligent applications. Again, it will be an iterative process, depending on the degree of annotation realized.

These are just some of the thoughts going on in my head as I learn these technologies and observe how the web works. There is no use predicting the future, because time is out of your control. I have worked with what I have today to come up with the above thoughts. If something new comes up, expect another post from me on how my views have radically changed :)

I think it is more important to observe and spot opportunities as they arise. It is also more fun that way!

Wednesday, August 5, 2009

"semantic web" startups like Twine

I don't understand their model. Let's take Twine for example -

Twine aims to be yet another ecosystem on the web where users can store all the information they find interesting. Twine then automatically suggests other information using semantic technology. Sure, they annotate everything in RDF, use rule engines, etc... but users don't use Twine because of that.

One thing I don't understand is why Twine reinvents the wheel by focusing its efforts on creating an ecosystem similar to existing ones like del.icio.us, Digg, Facebook, and numerous others... By doing this, it puts a limitation on the user: "To use Twine's intelligence, you need to bookmark and send data via the Twine API."

Isn't the whole purpose of the semantic web to aggregate all the data out there under a single knowledge base? In this knowledge base, the aim is to organize data from disparate sources in an intelligent and consistently "aligned" manner. Well, the data is already out there! The connections are also already out there in some format (RDBMS, XML, etc...). So why does a semantic web application like Twine need users to make Twine their central ecosystem in order to use its semantic intelligence?

Getting users to switch their bookmarking service will not necessarily be easy, especially if they can't realize the benefits of Twine's suggestion tool right away. This is a well-known chicken-and-egg problem.

Perhaps Twine should start initiating partnerships with the content silos out there on the web, whereby it annotates their data into RDF/OWL and maintains it in a knowledge base alongside data from other content silos. Then an average user could simply use Twine's services to get intelligently suggested content (Twine's core offering and differentiator) without needing to bookmark everything via Twine...

Tuesday, July 28, 2009

Getting my hands dirty...

Lately, I have been very intrigued by the technologies and opportunities associated with the semantic web. And being who I am, I like to get my hands dirty immediately. To most, this would mean starting to code and understanding the technology in depth. To me, 'getting my hands dirty' means that and more. I have begun reading Semantic Web Programming, which, I must say, provides a great introduction to semantic web technologies while offering some in-depth learning as well. I am currently on page 208 and plan to finish the book before college starts in September.

But 'getting my hands dirty' means a lot more than learning the technology. I am trying to find every opportunity to meet people who are already gurus in this area and gain additional perspectives.

I'm definitely planning on interacting with the MIT DIG research group in CSAIL, which is working on cutting-edge semantic web research. I recently attended the Semantic Technology Conference 2009 in California, where I enjoyed meeting enthusiasts and learning from them. I have also gotten in touch with Alexandre Passant in France, who has been kind enough to share his thoughts and the details of his project at EDF, which puts semantic web technologies to use at the enterprise level.

I am also on the lookout for part-time opportunities at semantic web startups in the Cambridge/Boston area to further my interest in this field.

Monday, July 27, 2009

Welcome!

I will be using this blog as a way to communicate my thoughts and ideas on the emerging semantic web and related technologies.