Einstein Enterprise

Einstein Enterprise is an open-source environment upon which to build intelligent software. It can be used to create electronic advisors, intelligent assistants, recommendation engines, and predictive, diagnostic, problem-solving, and planning applications. Einstein Enterprise was created by Marc Schneiderman, chief architect and founder of nTeligence Corporation.
Architecture
Einstein Enterprise includes both “design time” and “runtime” environments. Within the design-time environment, knowledge is extracted from patterns contained in historical data, the semantic decomposition of text-based documents, the detailed analysis of images, and interviews with subject matter experts. Unlike earlier intelligent software environments, which required that information be fed into the system manually, Einstein can autonomously acquire and expand its own knowledge by searching the Internet. In addition, Einstein can visually recognize people and engage in conversation with them directly.
Both mathematical and cognitive models of behavior are stored within a knowledge repository. At runtime, this knowledge is loaded, replicated, and cached within local memory across any number of processing nodes. There it is applied to event streams, or other “live” data, in real time to make recommendations, give advice, or provide suggestions.
The runtime architecture is based upon a pure peer-to-peer network that is capable of self-healing. All of the nodes automatically discover one another, significantly simplifying administration and configuration management of the environment. In the rare case that a node goes down, its existing jobs and tasks are reassigned to the remaining nodes. Upon restarting, the node automatically rejoins the cluster and is reassigned tasks, all within a matter of seconds.
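Einstein's clustering APIs are not published, so the following Java sketch only illustrates the failover pattern described above: a heartbeat-based failure detector that hands a lost node's tasks to the surviving peers. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal heartbeat-based failure detection and task reassignment (illustrative only).
public class FailoverSketch {
    static final long TIMEOUT_MS = 5_000;                        // silence threshold before a node is declared dead
    final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    final Map<String, List<String>> tasksByNode = new ConcurrentHashMap<>();

    void onHeartbeat(String nodeId) {                            // peers ping one another periodically
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
        tasksByNode.putIfAbsent(nodeId, new ArrayList<>());      // a restarted node rejoins and becomes assignable
    }

    void sweep() {                                               // run periodically on every peer
        long now = System.currentTimeMillis();
        for (String node : new ArrayList<>(tasksByNode.keySet())) {
            if (now - lastHeartbeat.getOrDefault(node, 0L) > TIMEOUT_MS) {
                List<String> orphaned = tasksByNode.remove(node);
                lastHeartbeat.remove(node);
                redistribute(orphaned);                          // hand the dead node's work to survivors
            }
        }
    }

    void redistribute(List<String> tasks) {
        List<String> alive = new ArrayList<>(tasksByNode.keySet());
        if (alive.isEmpty()) return;                             // no survivors to take the work
        for (int i = 0; i < tasks.size(); i++)                   // simple round-robin reassignment
            tasksByNode.get(alive.get(i % alive.size())).add(tasks.get(i));
    }
}
```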
Core Subsystems
Einstein Enterprise comprises the following core technologies:
(a) NoSQL Database - A non-relational, document-oriented database that requires no schema definition prior to use. Optimized for extremely low-latency reads and writes. Supports “eventual consistency” through full replication across multiple nodes, complete with failover capabilities. Supports “sharding”, which segments and distributes information based upon a key, so that portions of the database sit as close as physically possible to the nodes that will process them. Built-in understanding of JSON and BSON document structure.
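As an illustration of the key-based sharding just described, the sketch below uses a consistent-hash ring to map each document key to its owning node; the class and method names are hypothetical rather than Einstein's published API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hash ring: each key is owned by the first node clockwise from its hash.
public class ShardRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++)            // virtual nodes smooth the key distribution
            ring.put(hash(node + "#" + i), node);
    }

    public String nodeFor(String documentKey) {           // first ring entry at or after the key's hash
        SortedMap<Long, String> tail = ring.tailMap(hash(documentKey));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```

With a ring like this, adding or removing a node relocates only the keys adjacent to it, which is why sharded stores generally favor consistent hashing over a simple hash-modulo scheme.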
(b) Complex Events Processor - Allows for the aggregation, filtering, and correlation of business events as they take place. Supports multiple input streams, sliding time- and volume-based windows, and an SQL-like query language with inner and outer joins. Full support for temporal events, plus built-in math functions. Can be connected to industrial sensors in order to intelligently process event streams.
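The processor's own query language is not documented here; the following Java sketch shows the underlying sliding time-window mechanic, in which events that fall out of the window are evicted before each aggregate is computed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding time-based window: maintain a running aggregate over the last minute of events.
public class SlidingWindowAverage {
    private static final long WINDOW_MS = 60_000;             // one-minute window
    private final Deque<long[]> events = new ArrayDeque<>();  // each entry: {timestampMs, value}
    private long sum;

    public double onEvent(long timestampMs, long value) {
        events.addLast(new long[]{timestampMs, value});
        sum += value;
        while (!events.isEmpty() && events.peekFirst()[0] < timestampMs - WINDOW_MS) {
            sum -= events.pollFirst()[1];                     // evict events older than the window
        }
        return (double) sum / events.size();                  // current windowed aggregate
    }
}
```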
(c) Voice Recognition / Speech Synthesis - Support for speaker-independent voice recognition. Provides an exceptionally high level of accuracy within specific domains, achieved through the use of language modules and sample vocabularies.
(d) Computational GRID - Supports the real-time execution of mathematical and cognitive models across a massively parallel, highly redundant, distributed computing environment. Utilizes a true “peer-to-peer” architecture that supports automatic failover and recovery in real time. Provides an easy-to-use map/reduce API.
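The grid's map/reduce API itself is not documented, so the stand-in below uses Java parallel streams to show the same two-phase pattern: a map function applied to each input in parallel, followed by an associative reduction of the partial results.

```java
import java.util.Arrays;
import java.util.List;

// Map/reduce in miniature: count occurrences of "alpha" across a set of documents.
public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> docs = List.of("alpha beta", "beta gamma", "alpha alpha");
        long total = docs.parallelStream()                    // "map": count matches per document, in parallel
                .mapToLong(d -> Arrays.stream(d.split("\\s+")).filter("alpha"::equals).count())
                .sum();                                       // "reduce": combine the partial counts
        System.out.println(total);                            // prints 3
    }
}
```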
(e) Data GRID - A distributed in-memory data cache whose contents are replicated across all processing nodes within the system. Supports the creation of a global shared memory space if needed. Supports the memcached protocol, as well as others. Can be configured to persist data within a file system, or a database, if needed for recovery purposes.
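Because the data grid speaks the memcached protocol, any standard memcached client should be able to talk to it. The example below uses the open-source spymemcached client; the host, port, key, and value are assumptions made for illustration.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Talking to the data grid over the memcached protocol (endpoint and key assumed).
public class DataGridExample {
    public static void main(String[] args) throws Exception {
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        cache.set("model:risk-score", 3600, "cached-model-bytes"); // key, expiry in seconds, value
        Object value = cache.get("model:risk-score");              // served from the in-memory grid
        System.out.println(value);
        cache.shutdown();
    }
}
```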
(f) Inference Engine - A rule engine that supports the industry-standard Rete algorithm, as well as advanced pattern-matching capabilities. Includes a plug-in for the IDE, in which rules are written. Allows for testing and debugging of rules, as well as interactive viewing of the Rete tree, the contents of working memory, and the dynamic binding of variables during rule execution.
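Einstein's rule syntax is not documented here, so the Java sketch below shows only the match-act cycle that a Rete-based engine optimizes: rules whose conditions match facts in working memory fire, possibly asserting new facts, until a fixpoint is reached.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Naive forward chaining over a working memory of string facts (illustrative only;
// a Rete network avoids re-matching every rule against every fact on each cycle).
public class ForwardChainSketch {
    record Rule(String name, Predicate<Set<String>> when, String then) {}

    public static void main(String[] args) {
        Set<String> memory = new HashSet<>(List.of("temperature:high", "pressure:rising"));
        List<Rule> rules = List.of(
            new Rule("overheat",
                     m -> m.contains("temperature:high") && m.contains("pressure:rising"),
                     "alert:shutdown"));
        boolean fired = true;
        while (fired) {                                       // repeat until no rule asserts a new fact
            fired = false;
            for (Rule r : rules)
                if (r.when().test(memory) && memory.add(r.then())) fired = true;
        }
        System.out.println(memory);                           // now contains alert:shutdown
    }
}
```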
(g) Knowledge Repository - A knowledge warehouse containing all of the mathematical and cognitive models created within the design-time environment. Full support for version control, as well as industry-standard interfaces and communication protocols such as JCR, CMIS, WebDAV, and SharePoint. Provides an easy-to-use web-based interface for viewing and checking in files.
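Since the repository supports JCR, a standard javax.jcr client should be able to version and check in a model. In the sketch below, the credentials, the /models path, and the model file name are deployment-specific assumptions.

```java
import java.io.ByteArrayInputStream;
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.version.VersionManager;

// Storing and versioning a PMML model through the standard JCR API.
public class ModelCheckIn {
    public static void checkIn(Repository repo, byte[] pmml) throws Exception {
        Session session = repo.login(new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            Node root = session.getRootNode();
            Node models = root.hasNode("models") ? root.getNode("models") : root.addNode("models");
            Node file = models.addNode("churn-model.pmml", "nt:file");    // hypothetical model name
            Node content = file.addNode("jcr:content", "nt:resource");
            content.setProperty("jcr:mimeType", "application/xml");
            content.setProperty("jcr:data",
                    session.getValueFactory().createBinary(new ByteArrayInputStream(pmml)));
            file.addMixin("mix:versionable");                             // enable version control
            session.save();
            VersionManager vm = session.getWorkspace().getVersionManager();
            vm.checkin(file.getPath());                                   // create an immutable version
        } finally {
            session.logout();
        }
    }
}
```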
(h) Data Mining Toolkit - An integrated set of tools, algorithms, and a workbench for building mathematical models. Provides a GUI for creating data mining workflows, which can be stored in the industry-standard PMML format within the knowledge repository. Includes tools for the visualization and transformation of data, feature selection, cross-validation, algorithms (e.g., Bayesian, clustering, regression, neural networks, support vector machines (SVM), decision trees, and rule induction), as well as facilities to evaluate and compare model performance. Can use existing SAS and SPSS data files as input data sources. Add-on packages for text analytics, time series data, and product recommendations are also included.
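As a sketch of the cross-validation facility mentioned above, the following Java code performs a generic k-fold evaluation; the Model and ModelFactory interfaces are hypothetical placeholders for whatever algorithm the workbench produces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// k-fold cross-validation: each fold is held out once while the model trains on the rest.
public class CrossValidation {
    interface Model { double accuracy(List<double[]> test); }
    interface ModelFactory { Model train(List<double[]> train); }

    public static double kFold(List<double[]> data, int k, ModelFactory factory) {
        List<double[]> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(42));        // fixed seed for reproducibility
        int foldSize = shuffled.size() / k;
        double totalAccuracy = 0;
        for (int i = 0; i < k; i++) {
            List<double[]> test = shuffled.subList(i * foldSize, (i + 1) * foldSize);
            List<double[]> train = new ArrayList<>(shuffled);
            train.removeAll(test);                            // train on everything outside the fold
            totalAccuracy += factory.train(train).accuracy(test);
        }
        return totalAccuracy / k;                             // mean held-out accuracy
    }
}
```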
(i) Natural Language Processing - An engine that decomposes English sentences into their elemental parts of speech (e.g., noun, verb, adjective), so that the true “semantic” meaning can be determined. The parts of speech are based upon the industry-standard Penn Treebank tag set. Additional components perform Named Entity Recognition (NER) for locations, organizations, and people within bodies of text, without any prior markup or tags being defined. Custom, industry-specific NER components can also be created as needed. Interdependencies between the words in each sentence are also determined.
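Einstein's NLP API is not published; the example below uses the open-source Stanford CoreNLP library, which also emits Penn Treebank tags, purely to show what the described part-of-speech decomposition yields.

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Decomposing a sentence into Penn Treebank parts of speech with Stanford CoreNLP.
public class PosExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument doc = new CoreDocument("The engine parses English sentences.");
        pipeline.annotate(doc);
        // Prints each token with its tag, e.g. "parses -> VBZ", "sentences -> NNS".
        doc.tokens().forEach(t -> System.out.println(t.word() + " -> " + t.tag()));
    }
}
```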
(j) Prolog - A declarative programming language used for representing human knowledge. The underlying execution paradigm supports both forward and backward chaining. Bi-directional programming interfaces to the Java language allow full integration with other Einstein components. Includes a visual development environment, complete with a debugger.
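The exact form of Einstein's Java-Prolog bridge is not documented, so the example below uses SWI-Prolog's open-source JPL library to illustrate what calling a backward-chaining Prolog goal from Java looks like. It assumes a hypothetical rules.pl file defining an ancestor/2 predicate over parent/2 facts.

```java
import java.util.Map;
import org.jpl7.Query;
import org.jpl7.Term;

// Calling a Prolog goal from Java and iterating over its variable bindings.
public class PrologExample {
    public static void main(String[] args) {
        new Query("consult('rules.pl')").hasSolution();       // load the knowledge base
        Query q = new Query("ancestor(X, john)");             // backward-chaining goal
        while (q.hasMoreSolutions()) {
            Map<String, Term> binding = q.nextSolution();     // one proof's variable bindings
            System.out.println("X = " + binding.get("X"));
        }
    }
}
```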
(k) Distributed Processor - A batch-oriented distributed computing framework capable of executing map/reduce computations across clusters of low-cost machines. Incorporates facilities to dispatch and monitor jobs, as well as manage the execution of assigned tasks, re-dispatching them to alternative nodes if needed. Includes a highly fault-tolerant distributed file system capable of storing extremely large files across multiple nodes in the cluster. Can run existing applications written to execute within Hadoop (including Hortonworks and Cloudera) environments.
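Because the framework can run existing Hadoop applications, a canonical Hadoop mapper such as the word-count example below should, per the compatibility claim above, execute on the distributed processor unchanged.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Standard Hadoop word-count mapper: emits (token, 1) for every word in each input line.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);                         // the reducer sums these counts
        }
    }
}
```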
(l) Web Crawler - An industry-standard web crawler that walks any number of websites, identifying and traversing links and then downloading the associated content. Maintains a database of pages visited, their associated raw information, and properly formatted parsed data, as well as a graph of hypertext links. In addition to plain text and HTML, the web crawler can parse XML, MS Office documents, and PDF files via plug-in modules. Can run in a clustered environment utilizing the Distributed Processor, or in standalone mode.
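The crawler's own API is not published; the sketch below uses the open-source jsoup parser to illustrate the basic crawl loop just described: fetch a page, record the visit, and push its outgoing links onto the frontier.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Minimal breadth-first crawl loop (no politeness delays or robots.txt handling).
public class CrawlSketch {
    public static void crawl(String seedUrl, int maxPages) throws Exception {
        Set<String> visited = new HashSet<>();                // the "database of pages visited"
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seedUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;                  // skip pages already fetched
            Document page = Jsoup.connect(url).get();
            System.out.println(url + " -> " + page.title());
            for (Element link : page.select("a[href]"))
                frontier.add(link.absUrl("href"));            // traverse outgoing hypertext links
        }
    }
}
```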
(m) Computer Vision - A set of image and video processing tools and libraries. Supports both linear and non-linear filtering of images, geometric image transformations, and structural analysis. Includes motion estimation and object tracking algorithms. Provides facilities to both detect and classify pre-defined objects (e.g., faces, cars, buildings, industrial parts, medical images). Capable of determining the similarity of two images. Includes a subset of the core data mining algorithms.
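The capabilities described closely mirror those of the open-source OpenCV library, whose Java bindings are used below to illustrate detection of a pre-defined object class (faces); the cascade file and image paths are assumptions made for illustration.

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.objdetect.CascadeClassifier;

// Face detection with a pre-trained Haar cascade via OpenCV's Java bindings.
public class FaceDetectExample {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);                 // load the native OpenCV library
        CascadeClassifier faces = new CascadeClassifier("haarcascade_frontalface_default.xml");
        Mat image = Imgcodecs.imread("people.jpg");                   // hypothetical input image
        MatOfRect detections = new MatOfRect();
        faces.detectMultiScale(image, detections);                    // find face bounding boxes
        System.out.println("Faces found: " + detections.toArray().length);
    }
}
```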
See Also
Artificial Intelligence
