Giovanni Pirrotta

Just a curious person

Limits of the Web today

June 03, 2013

The Web was born in 1989 at the CERN laboratory, when scientists needed to share information and research results in real time. To accomplish this, Tim Berners-Lee conceived a hypertext system for linking documents that became, within a few years, the keystone of the World Wide Web (WWW). He introduced a markup convention, called HyperText Markup Language (HTML), to format and display documents and to link them through anchor words. Once HTML pages are published on Web servers located around the world, users can reach them with programs called browsers, which let them navigate the Web of linked documents.

The World Wide Web is mainly based on documents written in HTML but, unfortunately, the language defines only how to display data, without specifying anything about the meaning of the data shown.
All kinds of data, such as local sport events, weather information, flight times, major baseball statistics, television guides, and so on, are presented by numerous sites, but always in HTML. As a result, it is difficult to reuse these data in different contexts, and it is difficult for machines to process them automatically in order to carry out specific activities.

Consider, for example, how difficult it is for a search engine to find the information a user wants. When we perform a search, we expect the search engine to understand the nuance of meaning we intend, without problems of interpretation. For example, if we search for the Italian word espresso, how does the search engine know whether we are looking for information about:

  • espresso, the past participle of the Italian verb esprimere;
  • espresso, a kind of Italian train;
  • espresso, a kind of coffee;
  • espresso, the famous Italian newspaper.

The word espresso has many meanings depending on the context in which it appears. By contrast, in a semantic-based system each word is related to a series of semantic networks, which are not simply lists of words but dense webs of data links that make it possible to process complex information. A software agent should be able to understand whether different expressions identify the same concept and to identify all the relationships between a word and its properties, unlike current search engines, which index content by words and not by semantic networks.
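To make the difference concrete, here is a minimal sketch of the two indexing styles. The three documents and their sense tags are invented for illustration; a real system would have to derive the senses from context, which is the hard part.

```python
from collections import defaultdict

# Hypothetical mini-corpus: (document id, text, sense tag).
# The sense tags are assumptions made up for this sketch.
DOCS = [
    ("d1", "espresso is brewed under pressure", "coffee"),
    ("d2", "the espresso train stops in Messina", "train"),
    ("d3", "L'Espresso published a new issue", "newspaper"),
]


def tokenize(text):
    """Lowercase the text and split it into words (apostrophes act as spaces)."""
    return text.lower().replace("'", " ").split()


def build_word_index(docs):
    """Index documents by word only, as a keyword search engine would."""
    index = defaultdict(set)
    for doc_id, text, _sense in docs:
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def build_sense_index(docs):
    """Index documents by (word, sense), so each meaning has its own entry."""
    index = defaultdict(set)
    for doc_id, text, sense in docs:
        for word in tokenize(text):
            index[(word, sense)].add(doc_id)
    return index


word_index = build_word_index(DOCS)
sense_index = build_sense_index(DOCS)

# The word-only lookup mixes all meanings together.
print(sorted(word_index["espresso"]))
# The sense-aware lookup returns only the documents about coffee.
print(sorted(sense_index[("espresso", "coffee")]))
```

The word index returns every document containing the string, whatever it is about, while the sense index can answer the question the user actually had in mind.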

The documents published on the Web can be considered as belonging to a huge global file system. Since HTML pages are written by humans for humans, the semantics is implicit in the pages and links, and it becomes explicit only through human interaction. The data contained in Web pages are not organized and structured so that they can be processed, queried and interpreted unambiguously, in a model that makes the implicit semantics understandable to machines. In this context everything revolves around documents and their links, and not around data, i.e., the resources contained in the documents.

For example, with HTML we can list items for sale, but there is no capability within HTML itself to assert unambiguously that item number “ABC” is a DVD recorder with a price of euro 100,00. Rather, HTML can only say that the span of text “super” is something that should be placed near the text “DVD recorder”, and so forth. There is no way to say “This is an e-commerce site”, or even to establish that a DVD recorder is a kind of product or that euro 100,00 is a price.
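A small sketch may help here. The HTML fragment, the triple vocabulary (`is_a`, `subclass_of`, `price`, `currency`) and the helper `objects` are all hypothetical, chosen only to contrast what a program gets out of markup with what it gets out of explicit statements.

```python
from html.parser import HTMLParser

# Hypothetical markup for the item described above.
HTML_SNIPPET = "<li><b>super</b> DVD recorder - euro 100,00</li>"


class TextCollector(HTMLParser):
    """Collects the text chunks found between tags, which is all HTML offers."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


collector = TextCollector()
collector.feed(HTML_SNIPPET)
collector.close()
# Only strings come out: nothing says which one is a product or a price.
print(collector.chunks)

# The same facts as explicit (subject, predicate, object) statements.
# The predicate names are invented for this sketch.
TRIPLES = [
    ("item:ABC", "is_a", "DVDRecorder"),
    ("DVDRecorder", "subclass_of", "Product"),
    ("item:ABC", "price", 100.00),
    ("item:ABC", "currency", "EUR"),
]


def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


print(objects("item:ABC", "price"))
print(objects("DVDRecorder", "subclass_of"))
```

With the markup, a program would have to guess that “100,00” is a price. With the explicit statements it can simply ask.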

In summary, the problem with the current Web is that content is not structured and links are untyped and carry no meaning. Each Web server can be seen as a disconnected data silo, and the data in each silo cannot be reused outside its original context. We cannot run expressive queries over the data, and we cannot process content within applications or mash up data with information coming from different sources.
We have problems in terms of interpretation, integration and automatic processing.
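As a rough illustration of what breaking the silos would allow, suppose two sites published their data as simple statements that share identifiers for cities. Everything in this sketch (the data, the identifiers, the helper names `match` and `events_with_forecast`) is an assumption made up for the example.

```python
# Two hypothetical silos: one about sport events, one about the weather.
SPORT_SILO = [
    ("event:1", "type", "SportEvent"),
    ("event:1", "city", "city:Messina"),
    ("event:2", "type", "SportEvent"),
    ("event:2", "city", "city:Rome"),
]
WEATHER_SILO = [
    ("city:Messina", "forecast", "rain"),
    ("city:Rome", "forecast", "sun"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def events_with_forecast(triples, forecast):
    """Find events held in a city whose forecast matches the given value."""
    cities = {s for s, _p, _o in match(triples, predicate="forecast", obj=forecast)}
    return sorted(s for s, _p, o in match(triples, predicate="city") if o in cities)


# Inside a single silo the question cannot be answered at all.
print(events_with_forecast(SPORT_SILO, "rain"))
# Once the two sources are merged, one query spans both of them.
print(events_with_forecast(SPORT_SILO + WEATHER_SILO, "rain"))
```

The query works only because both sources use the same identifier for the same city, and that shared, machine-readable meaning is precisely what plain HTML pages do not give us.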

That’s all for now. In the next post I will talk about how machines can think. Stay tuned!