The Large Knowledge Collider

The Large Knowledge Collider, or LarKC, is a Large-Scale Integrating Project funded by the European Seventh Framework Programme, under the Intelligent Content and Semantics research area.

Project Objectives

The aim of LarKC is to go beyond the limited storage, querying and inference technology that is currently available for semantic computing. The fundamental assumption is that a suitable reasoning infrastructure for the Semantic Web must go beyond the current paradigms, which are strictly based on logic. By fusing reasoning (in the sense of logic) with search (in the sense of information retrieval), and by taking seriously the notion of limited rationality (in the sense of Herbert Simon), we will obtain the paradigm shift that is required for reasoning at Web scale.

To achieve this, the vision of LarKC is:

  • To build an integrated platform for semantic computing on a scale well beyond what is currently possible. The platform will serve sectors that depend on massive heterogeneous information sources, such as telecommunication services, bio-medical research, and [...]-discovery.
  • The platform will have a pluggable architecture in which it is possible to exploit techniques and heuristics from diverse areas such as databases, machine learning, cognitive science, Semantic Web, and others.
  • The platform will be implemented on a computing cluster and via “computing at home”, and will be available to researchers and practitioners from outside the consortium to run their own experiments and applications, using suitable plug-in components.

What

We will develop the Large Knowledge Collider, a pluggable algorithmic framework that will be implemented on a distributed computational platform. This will allow reasoning at Web scale by trading quality for computational cost and by embracing incompleteness and unsoundness.

Pluggable: Instead of being built only on logic, the Large Knowledge Collider will exploit a large variety of methods from other fields: cognitive science (human heuristics), economics (limited rationality and cost/benefit trade-offs), information retrieval (recall/precision trade-offs), and databases (very large datasets). A pluggable architecture will ensure that computational methods from these different fields can be coherently integrated.
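As a rough illustration of what "pluggable" could mean in practice, the sketch below registers two interchangeable components behind one common signature and chains them. It is not the LarKC plug-in API: the names `Plugin`, `REGISTRY`, `register`, `run_pipeline` and both example plug-ins are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict, List

# Hypothetical common signature: a plug-in takes a query and candidate facts
# and returns a (filtered or reordered) list of facts.
Plugin = Callable[[str, List[str]], List[str]]

# Hypothetical registry mapping plug-in names to implementations.
REGISTRY: Dict[str, Plugin] = {}


def register(name: str) -> Callable[[Plugin], Plugin]:
    """Decorator that adds a plug-in to the registry under the given name."""
    def wrap(fn: Plugin) -> Plugin:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("keyword_retrieval")
def keyword_retrieval(query: str, facts: List[str]) -> List[str]:
    # Information-retrieval style: keep facts sharing at least one word with the query.
    terms = set(query.lower().split())
    return [f for f in facts if terms & set(f.lower().split())]


@register("shortest_first")
def shortest_first(query: str, facts: List[str]) -> List[str]:
    # Heuristic style: prefer short statements (a stand-in for a cheap heuristic).
    return sorted(facts, key=len)


def run_pipeline(names: List[str], query: str, facts: List[str]) -> List[str]:
    """Apply the named plug-ins in order, feeding each one's output to the next."""
    for name in names:
        facts = REGISTRY[name](query, facts)
    return facts


facts = ["aspirin treats headache", "paris is in france", "headache is a symptom"]
print(run_pipeline(["keyword_retrieval", "shortest_first"], "headache", facts))
```

Because every component shares the same signature, a method from a different field can be swapped in by registering it under a new name, without touching `run_pipeline`.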

Distributed: The Large Knowledge Collider will be implemented on parallel hardware using cluster computing techniques, and will be engineered to be ultimately scalable to very large distributed computational resources, using techniques like those known from SETI@home.

Why

Mike Lynch, CEO and Founder of Autonomy, recently stated that “meaning-based computing is the way of the future as 80 per cent of information within enterprises is unstructured and that understanding this ‘hidden’ intelligence is at the heart of improving the way we interact with information”. Some of the most advanced use cases for such semantic computing today require reasoning about 10 billion RDF triples in less than 100 ms. These numbers originate from the telecom sector aiming to generate revenue streams through new context-sensitive and personalized mobile services, but this is just one example of a general demand. The Web has made tremendous amounts of information available that could be processed based on formal semantics attached to it.

Research efforts around the Semantic Web have developed a number of languages that use logic for this purpose. However, logic does not scale to the amount of information and the setting that is required for the Web. A reasoning infrastructure must be designed and built that can scale and that can be flexibly adapted to the varying requirements of quality and scale of different use cases. If such an infrastructure is not built, “meaning-based computing” will never happen on the Web and will remain confined to well-controlled datasets inside company intranets.

How

The Large Knowledge Collider (LarKC) will perform massive, distributed, and necessarily incomplete reasoning over Web-scale knowledge sources, as illustrated in the figure below. Massive inference is achieved by distributing problems across heterogeneous computing resources coordinated by LarKC. Some of the distributed computational resources will run highly coupled, high-performance inference engines on local parallel hardware before the results are communicated back to the distributed computation environment. In a Web context, complete information is an empty hope, and the distributed computation shown on the left of the figure includes some failed computations that do not thwart the overall problem-solving task.
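The idea that failed sub-computations need not thwart the overall task can be sketched with Python's standard `concurrent.futures` module. This is only an illustration on a single machine with threads, not the distributed platform described above; the helper names `collect_partial`, `ok` and `failing` are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def collect_partial(tasks: Iterable[Callable[[], T]], max_workers: int = 4) -> List[T]:
    """Run tasks in parallel and return the results of those that succeeded."""
    results: List[T] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                # A failed sub-computation is dropped; the overall answer is
                # incomplete but still usable.
                continue
    return results


def ok(value: int) -> Callable[[], int]:
    """Build a task that succeeds with the given value."""
    return lambda: value


def failing() -> int:
    """A task that simulates an unreachable knowledge source."""
    raise RuntimeError("source unreachable")


# Results arrive in completion order, so sort before printing.
print(sorted(collect_partial([ok(1), failing, ok(3)])))
```

The caller receives whatever subset of results was obtainable, which mirrors the deliberate acceptance of incompleteness in the text.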

The right side of the figure illustrates the architecture that achieves this. LarKC allocates resources strategically and tactically to:

  1. Retrieve raw content and assertions that may contribute to a solution,
  2. Abstract that information into the forms needed by its heterogeneous reasoning methods,
  3. Select the most promising approaches to try first,
  4. Reason, using multiple deductive, inductive, abductive, and probabilistic means to move closer to a solution given the selected methods and data, and
  5. Decide whether sufficiently many, sufficiently accurate and precise solutions have been found, and, if not, whether it’s worth trying harder.
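The five steps above can be read as a loop that spends more resources only while the answers are insufficient. The following toy sketch makes that control flow concrete under strong simplifying assumptions: facts are plain triples, "reasoning" is a single round of transitive inference over a `subClassOf` predicate, and the resource budget is simply the number of facts considered. All function names and parameters are hypothetical and are not taken from LarKC itself.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def retrieve(query: str, source: List[Triple]) -> List[Triple]:
    """Step 1: fetch triples that mention the query term as subject, predicate or object."""
    return [t for t in source if query in t]


def abstract(triples: List[Triple]) -> Set[Triple]:
    """Step 2: convert to the form the reasoner expects (here, a deduplicated set)."""
    return set(triples)


def select(facts: Set[Triple], budget: int) -> Set[Triple]:
    """Step 3: keep at most `budget` facts (sorted order stands in for a real ranking)."""
    return set(sorted(facts)[:budget])


def reason(facts: Set[Triple]) -> Set[Triple]:
    """Step 4: one round of transitive inference over 'subClassOf'."""
    inferred = set(facts)
    for (a, p, b) in facts:
        for (c, q, d) in facts:
            if p == q == "subClassOf" and b == c:
                inferred.add((a, "subClassOf", d))
    return inferred


def decide(answers: Set[Triple], wanted: int, budget: int, max_budget: int) -> bool:
    """Step 5: stop when enough answers exist or trying harder is no longer allowed."""
    return len(answers) >= wanted or budget >= max_budget


def solve(query: str, source: List[Triple], wanted: int = 1, max_budget: int = 8) -> Set[Triple]:
    budget = 1
    while True:
        facts = select(abstract(retrieve(query, source)), budget)
        answers = {t for t in reason(facts) if t not in facts}  # newly inferred only
        if decide(answers, wanted, budget, max_budget):
            return answers
        budget *= 2  # try harder: consider twice as many facts


source = [
    ("dog", "subClassOf", "mammal"),
    ("mammal", "subClassOf", "animal"),
    ("paris", "locatedIn", "france"),
]
print(solve("mammal", source))
```

With a budget of one fact nothing can be inferred, so the loop doubles the budget and then finds the answer. Capping the budget at `max_budget` is what lets the sketch return an incomplete (possibly empty) result instead of searching indefinitely.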