Tuesday, March 15, 2005

ST 2005 Note #7 - Further Reading

  • W3C Semantic Web page.
  • Sem Web Central - open source tools.
  • Mindswap. An academic group, the director of which co-authored the original Semantic Web paper with Tim Berners-Lee et al.
  • IBM Alphaworks semantic technologies page.
  • DAML Semantic Web Services.
  • Semantic Web community portal.
  • ST 2005 Note #6 - Using semantic technologies today

    Current state


    The semantic web is here, and the supporting standards (XML, RDF and OWL) are current W3C recommendations. Vendors and the open source community are providing tool support for ontologies. Some examples are:

    • Open Source

      • Jena 2 - HP's semantic web framework.
      • SWOOP - An ontology editor from Mindswap.
      • Protege - An ontology editor from Stanford. Comes with an OWL plug-in (amongst other things).
      • SWeDE - A semantic web development environment built on the Eclipse IDE.

    • Commercial


      • Unicorn have a commercial semantic offering, which they position as a complete architectural solution.
      • SchemaLogic provide tools for enterprise vocabulary and taxonomy management.
      • Network Inference provide a semantic web offering and ontology management.


    Certainly semantic technology has yet to cross the chasm, but Early Adopters are obtaining value from the application of the new technologies and existing large corporations (both private and government) have been using elements of the semantic conceptual stack for many years.



    Key Lessons


    Experience reports from ST 2005 suggest the following lessons:
    • Starting Points
      • Start small and focus on a specific business subject area.
      • Roll out a vocabulary or taxonomy first and allow this to become bedded in. In particular, this approach does not require RDF or OWL adoption.
      • Drive the semantic effort from EAI/EII initiatives to get good traction.
    • Approaches

      • ST 2005 practitioners are of the opinion that federated ontologies are good practice for large organisations. Organisations need to accept that they will end up with multiple overlapping ontologies. Provide high level upper ontologies to map these overlapping ontologies to and/or provide infrastructure for mapping between ontologies.
      • Define vocabularies, taxonomies and ontologies from both top down and bottom up perspectives. It is important that a top down modeling activity provides broad shape to semantics work. However, a bottom up approach is required to make sure the semantic activities capture information as it is today. Defining a 'to-be' model without understanding the existing semantics of systems is not recommended by practitioners met at ST 2005.
      • Examine the producing and consuming systems and ensure that the semantics of the existing models are captured in enough detail to validate any vocabulary work.
      • Avoid attempting to build an enterprise ontology from the ground up. These efforts tend to fail because of the time it takes to build them and the federated nature of large organisations. Use integration driven vocabularies to drive out elements of the Enterprise data model piece by piece. Only the concepts shared between applications are in the model, but it is a significant improvement and one that can happen in relatively simple stages.
      • Leverage existing ontologies (such as those from XML standards bodies like OASIS).
    • Governance

      • A vocabulary, taxonomy or ontology will need well defined governance and, in large organisations, a lightweight and nimble standards group to maintain the integrity of the system.
      • The UPS speaker (whose organisation has had a metadata repository since the 80's) recommended having a QA group responsible for the integrity of the models.
      • Control naming of data elements so they are in line with the vocabulary in the organisation. This allows schemas (database or XML) to be reconciled back to the vocabulary.

    ST 2005 Note #5 - Semantic Web Services

    Web Service proponents know that the web has only grown so effectively because it allows individuals to establish their own web sites, data repositories, models, APIs etc. W3C standards are really substrates that allow individual expressions to be integrated at some level.

    A Web Service world where every service in a subject area uses the same data model is unrealistic and does not reflect experience of the web to date. Web Service proponents need to find mechanisms to map between web services' different models of their domain.

    UDDI is a current standard for service discovery. However, to query a UDDI registry the requester must submit a query in the implicit ontology of the UDDI registry. What is actually required is the ability for a service provider to describe the capabilities of the service in one or more defined ontologies, and for a service requester to search for a service using capabilities defined in one or more ontologies. Provided the discovery service has mappings between these ontologies, it will be able to return matches for searches that might otherwise have returned no results.
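
As a rough illustration of the discovery idea above, here is a minimal Python sketch. It is not UDDI or any real registry API: the ontology names, terms, services and the mapping table are all hypothetical, and a capability is simply an (ontology, term) pair.

```python
# Hypothetical mapping between terms in two ontologies (illustrative names only).
TERM_MAPPINGS = {
    ("retail", "BookSeller"): ("commerce", "BookVendor"),
}

# Hypothetical services, each advertising capabilities as (ontology, term) pairs.
SERVICES = {
    "acme-books": {("commerce", "BookVendor")},
    "fast-freight": {("logistics", "Carrier")},
}


def equivalent_terms(term):
    """Return the term plus any terms mapped to it, in either direction."""
    result = {term}
    for source, target in TERM_MAPPINGS.items():
        if source == term:
            result.add(target)
        if target == term:
            result.add(source)
    return result


def discover(term):
    """Find services advertising the term or a mapped equivalent."""
    wanted = equivalent_terms(term)
    return sorted(name for name, caps in SERVICES.items() if caps & wanted)


# The requester searches in the "retail" ontology; the provider used "commerce".
print(discover(("retail", "BookSeller")))
```

Without the mapping table the search would return nothing, which is the situation the paragraph above describes.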

    Describing a web service using an ontology will not only aid searching but will also aid use. For a requester to use a service without additional coding, the service needs to specify the ontology for the terms used in the WSDL. In addition, a requester needs to understand the pre- and post-conditions for a service and the effects of a process or service. For example, if ordering a book from Amazon it would be helpful if a service requester could determine if an order would result in a book being dispatched or whether other process steps are needed. In the Semantic Web Services world OWL-S is attempting to provide some aspects of this functionality. A service can describe itself using OWL-S and document:

    • Pre and post-conditions.
    • Effects of operations.
    • The semantics of the terms used in the WSDL.
    • Capabilities of the service.

    With these facilities, OWL-S is providing foundations for semantic discovery, automated web service composition and the ability for a requester to determine what the inputs and outputs mean. Further information can be found in this paper.
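
To make the book ordering example concrete, the following Python sketch models a service description with preconditions and effects. It is only a stand-in for the idea: the class, field names and facts are hypothetical and do not use the OWL-S vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceDescription:
    """Illustrative stand-in for a semantic service description (not OWL-S itself)."""
    name: str
    capabilities: set = field(default_factory=set)
    preconditions: set = field(default_factory=set)
    effects: set = field(default_factory=set)

    def can_invoke(self, known_facts):
        """The service may be invoked only if every precondition holds."""
        return self.preconditions <= set(known_facts)

    def facts_after(self, known_facts):
        """Facts a requester can assume once the service has completed."""
        return set(known_facts) | self.effects


# Hypothetical description: ordering records the order but does not dispatch.
order_book = ServiceDescription(
    name="OrderBook",
    capabilities={"BookOrdering"},
    preconditions={"customer_authenticated", "payment_authorised"},
    effects={"order_recorded"},
)

facts = {"customer_authenticated", "payment_authorised"}
if order_book.can_invoke(facts):
    after = order_book.facts_after(facts)
    # A requester can see that dispatch needs a further process step.
    print("book_dispatched" in after)
```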

    ST 2005 Note #4 - The Semantic Conceptual Stack

    The Stack


    The conceptual stack in the semantic technology arena is composed of the following key elements:







    Layer        Description
    Syntax       The underlying representation of the structure. XML provides this foundation piece.
    Vocabulary   A collection of terms and their definitions.
    Taxonomy     A collection of terms organised into a classification scheme.
    Ontology     A specification of a conceptualisation[1]. More practically, data, structure, meaning and rules.

    As an organisation moves up the stack the degree of detail and coverage of the semantic content of the business grows. This in turn means that the utility of the models grows. However, the effort required to define and manage the models increases rapidly.

    Decomposing the stack


    Each layer in the stack can be decomposed into high level pieces. One such decomposition is below:

    This figure shows the constituents of each layer related to semantics.

    Vocabulary


    The first rung on the ladder, once a common syntax has been defined, is the vocabulary. At its simplest this contains terms (e.g. Currency) and a definition (e.g. medium of exchange, monetary system). This can be elaborated in a number of ways. Firstly, the terms can be mapped to an existing lexical database such as WordNet. In this manner the definition of the term Currency could simply be
    http://wordnet.princeton.edu/cgi-bin/webwn2.0?stage=2&word=currency&posnumber=1&searchtypenumber=2&senses=1&showglosses=1
    Secondly, the term could be defined in relation to surrounding terms. In particular, guidance on how to rule whether an item is a Currency is very useful. The medical profession uses this mechanism (rule-in/rule-out) to determine whether symptoms rule a specific disease in or out. Finally, a term may be associated with a canonical name and a short name (for use in database schemas etc).
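
A vocabulary entry of this kind could be sketched in Python as follows. The field names, the short name CCY and the rule-in/rule-out predicates are all hypothetical, chosen only to illustrate the shape of an entry.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    canonical_name: str
    short_name: str
    definition: str
    rule_in: list = field(default_factory=list)   # predicates that confirm membership
    rule_out: list = field(default_factory=list)  # predicates that exclude membership

    def classify(self, item):
        """Return True or False when a rule decides, or None when no rule applies."""
        if any(rule(item) for rule in self.rule_out):
            return False
        if any(rule(item) for rule in self.rule_in):
            return True
        return None


currency = Term(
    canonical_name="Currency",
    short_name="CCY",  # hypothetical short name for use in schemas
    definition="Medium of exchange, monetary system.",
    rule_in=[lambda item: item.get("iso_4217_code") is not None],
    rule_out=[lambda item: item.get("kind") == "commodity"],
)

print(currency.classify({"iso_4217_code": "USD"}))
print(currency.classify({"kind": "commodity"}))
print(currency.classify({}))
```

Rule-out is checked first here, so an excluding rule wins over a confirming one; that ordering is a design assumption, not something the note specifies.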

    Taxonomy


    The second rung, a taxonomy, presents terms within a classification framework. For example, Currency could be classified as a unit of measure, or as countable.

    Ontology


    The third rung, an ontology, takes terms and defines the data associated with them, the relationships between terms, and the constraints/rules that define how terms and relationships can be combined and what their lifecycles are. An ontology can be viewed as a data model with an associated constraint language or as a sequence of assertions of the form [subject, predicate, object].
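
The assertion view can be sketched with plain Python tuples. The terms and the predicate names is_a and instance_of are illustrative assumptions; the walk over is_a links also shows how a taxonomy's classification can be read out of the same assertions.

```python
# Assertions of the form (subject, predicate, object). All names are illustrative.
TRIPLES = {
    ("Currency", "is_a", "UnitOfMeasure"),
    ("UnitOfMeasure", "is_a", "Concept"),
    ("USD", "instance_of", "Currency"),
}


def objects(subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def is_kind_of(term, ancestor):
    """Walk is_a links upwards to see whether term falls under ancestor."""
    seen, frontier = set(), {term}
    while frontier:
        current = frontier.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            frontier |= objects(current, "is_a")
    return False


print(is_kind_of("Currency", "Concept"))
print(is_kind_of("Concept", "Currency"))
```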

    Practical application


    Elements of this stack are in use now in many organisations.

    Metadata repository


    For instance, UPS has a metadata repository (started in the late 80's) which stores a Taxonomy. Terms are associated with a definition, a canonical name, a short name and an abbreviation. UPS ensure that all schemas that reference terms use the standard names. This makes it relatively simple to understand the meaning behind the entity definitions in, for example, a database schema. UPS use the classification scheme to reason about the vocabulary. For example, one classification is code. This allowed UPS to identify that they had a growing list of code terms, and to move to establish a central code repository and identify sources for codes (such as standards organisations).
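
The naming discipline described above lends itself to a simple automated check. The sketch below is not UPS's repository or tooling: the short names, canonical terms and column names are invented for illustration.

```python
# Hypothetical repository entries: standard short name -> canonical term.
VOCABULARY = {
    "PKG_ID": "Package Identifier",
    "SVC_CD": "Service Code",
    "DEST_CD": "Destination Code",
}


def unknown_columns(columns):
    """Return column names that do not use a standard short name."""
    return [name for name in columns if name.upper() not in VOCABULARY]


def describe(columns):
    """Map each recognised column back to its canonical term."""
    return {
        name: VOCABULARY[name.upper()]
        for name in columns
        if name.upper() in VOCABULARY
    }


schema = ["pkg_id", "svc_cd", "dst_code"]
print(unknown_columns(schema))  # columns that need renaming or a new term
print(describe(schema))
```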

    B2B standards alignment


    Other organisations are establishing defined ontologies within specific domains. This has allowed groups to map an internal ontology to external standards. This activity has enabled standards alignment for B2B activities (both internally and externally).

    Web Services


    There is a lot of activity around semantics and web services. This is covered in ST 2005 Note #5 - Semantic Web Services.

    Enterprise Application and Information Integration


    Organisations use a common message bus and/or data bus[2] to make the integration activity cost effective in the medium term. Implicitly or explicitly, the data on these buses normally has a common data model and master/static/reference data source. Without these elements the bus becomes a conduit (an expensive one at that) for point to point interfaces. When a common bus is used it is important that the common models in use can be communicated clearly and all involved understand the semantics of the model. This degree of understanding involves the following elements in the model:

    • Entities
    • Relationships, including roles, cardinality, ownership etc.
    • The business meaning of each entity.
    • The business meaning of each relationship.
    • The lifecycle of the entities and relationships.

    It is this information that ontologies document and the current crop of W3C standards provides mechanisms for persisting this information in a machine readable form.

    Inference


    Ontologies provide the opportunity for organisations to infer implicit relationships between instances based on the explicit relationships in the ontology and associated business rules. However, there were few examples of its use in commercial organisations at ST 2005.
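
A very small example of such inference is a transitive relationship. The Python sketch below assumes a hypothetical part_of predicate and invented organisation names, and applies a single rule until no new facts appear; real inference engines are considerably more sophisticated.

```python
def infer_transitive(triples, predicate):
    """Apply (a p b) and (b p c) implies (a p c) until nothing new appears."""
    known = set(triples)
    while True:
        new = {
            (a, predicate, d)
            for a, p1, b in known if p1 == predicate
            for c, p2, d in known if p2 == predicate and b == c
        } - known
        if not new:
            return known
        known |= new


# Explicit relationships only (hypothetical names).
explicit = {
    ("DeskA", "part_of", "TradingDivision"),
    ("TradingDivision", "part_of", "NewCo"),
}

# The implicit relationship is not stated anywhere in the explicit data.
inferred = infer_transitive(explicit, "part_of") - explicit
print(inferred)
```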

    Summary



    • Organisations are already using elements of the semantic conceptual stack.
    • Vocabularies, taxonomies and ontologies are reducing the cost of information and application integration.
    • The entire stack does not have to be adopted at once.



    [1] Gruber, Tom. http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
    [2] Data bus could be a data warehouse or an operational data store.

    ST 2005 Note #3 - The Semantic Tech Stack

    The semantic web has a technological stack that looks like this (from a W3C perspective):

    +---------------------------------+
    |               XML               |
    +---------------------------------+
    |               RDF               |
    +---------------------------------+
    |           RDF-Schema            |
    +---------------------------------+
    | Ontology Vocabulary (e.g. OWL)  |
    +---------------------------------+
    |              Logic              |
    +---------------------------------+

    RDF and RDF Schema have been covered in brief in earlier notes. OWL is based heavily on work done as part of DAML+OIL. The aim of DAML was to provide the ability to infer facts from an instance document and its associated ontological schema. OWL's aim appears to be very similar. As we move up the stack we find increasing capabilities in a number of areas:

    1. Ability to define types, relationships and constraints (i.e. structural constraints).
    2. Ability to infer facts not explicitly represented in an instance document.
    3. Ability to define semantic constraints, business rules in other words[1].

    This comes at a price: ontology production[2] and validation costs rise, as does the sophistication required of the inference engine.



    [1] Not sure this is different from 1) in any meaningful way but it feels different.
    [2] As more sophisticated ontologies are modeled using complex logic assertions then more analysis and design time is required for each ontology.

    Monday, March 14, 2005

    ST 2005 Note #2

    The problem with the RDF in Note #1 was that we could not constrain the RDF in any way. A publisher of a movement could add any triples it liked. For many purposes this is not enormously useful.

    The last RDF snippet defined was this:

    <types:Movement rdf:about="http://www.newco.com/movements/1234">
    <terms:carries rdf:resource="http://www.newco.com/grades/5678"/>
    </types:Movement>

    <types:Grade rdf:about="http://www.newco.com/grades/5678">
    <terms:name>JET</terms:name>
    </types:Grade>


    What is needed is to ensure that Movement only has one property, carries, and that it can only map to instances of Grade. Imagine we are creating a schema at http://www.newco.com/movements. We'll define each class as an ID within that schema, i.e. the Movement class will have the URI http://www.newco.com/movements#Movement. We also need the namespace for RDF Schema.

    This means we need to change the namespace of our instance document, and we'll want to reference these namespaces within content so we'll define ENTITY declarations as well:

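    <!DOCTYPE rdf:RDF [
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <!ENTITY movements 'http://www.newco.com/movements#'>
    <!ENTITY rdfs 'http://www.w3.org/2000/01/rdf-schema#'>
    <!-- xsd is declared because the name property below references the xsd string type -->
    <!ENTITY xsd 'http://www.w3.org/2001/XMLSchema#'>
    ]>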
    <rdf:RDF xmlns:rdf="&rdf;"
    xmlns:movements="&movements;"
    xmlns:rdfs="&rdfs;">


    We've changed the types namespace (now movements) so it references IDs - note the trailing #. The RDF and RDF Schema namespaces are needed as well:


    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"

    ... and the XSD namespace if we reference XSD types.

    Anyway, back to the schema definition. Define Movement as a class:

    <rdfs:Class rdf:about="&movements;Movement"
    rdfs:label="Movement">
    <rdfs:subClassOf rdf:resource="&rdfs;Resource"/>
    </rdfs:Class>


    Likewise, let's declare a property carries, ensuring it maps to a grade and is associated with a Movement.

    <rdf:Property rdf:about="&movements;carries"
    rdfs:comment="Carries relationship mapped to Grade"
    rdfs:label="carries">
    <rdfs:range rdf:resource="&movements;Grade"/>
    <rdfs:domain rdf:resource="&movements;Movement"/>
    </rdf:Property>

    But we need to ensure Grade is defined:

    <rdfs:Class rdf:about="&movements;Grade"
    rdfs:label="Grade">
    <rdfs:subClassOf rdf:resource="&rdfs;Resource"/>
    </rdfs:Class>


    Now let us define a name property for Grade:

    <rdf:Property rdf:about="&movements;name" rdfs:label="name">
    <rdfs:domain rdf:resource="&movements;Grade"/>
    <rdfs:range rdf:resource="&xsd;string"/>
    </rdf:Property>

    We don't have to, but we can also make the fact xsd:string is a datatype explicit thus:

    <rdfs:Datatype rdf:about="&xsd;string"/>


    One particularly interesting facet is that the properties are defined outside of classes. They are then associated with classes using the rdfs:domain property.
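
The schema above can be mirrored in a few lines of Python to show what checking instance data against it might look like. Two assumptions are worth stating loudly: short names stand in for the full URIs, and domain and range are treated here as constraints to validate against, which is an application choice. RDF Schema's own semantics use domain and range to infer the types of resources rather than to reject data.

```python
# Schema facts mirroring the definitions above, with short names for readability.
DOMAIN = {"carries": "Movement", "name": "Grade"}
RANGE = {"carries": "Grade"}  # name has a literal (string) range; not checked here

# Declared types of the instance resources.
TYPES = {"movements/1234": "Movement", "grades/5678": "Grade"}


def check(triples):
    """Return (triple, problem) pairs for triples that break domain or range."""
    problems = []
    for s, p, o in triples:
        if p not in DOMAIN:
            problems.append(((s, p, o), "unknown property"))
            continue
        if TYPES.get(s) != DOMAIN[p]:
            problems.append(((s, p, o), "subject outside domain"))
        if p in RANGE and TYPES.get(o) != RANGE[p]:
            problems.append(((s, p, o), "object outside range"))
    return problems


good = [("movements/1234", "carries", "grades/5678"), ("grades/5678", "name", "JET")]
bad = [("grades/5678", "carries", "movements/1234"), ("movements/1234", "colour", "red")]

print(check(good))
for triple, problem in check(bad):
    print(problem, triple)
```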

    However, as the W3C primer notes, RDF Schema doesn't address:

    • cardinality constraints
    • specifying whether a property is transitive
    • specifying that a property is a unique identifier
    • specifying that two different classes (different URIs) represent the same class; ditto instances
    • disjoint classes
    • class specific range/cardinality constraints


    This is where OWL and other richer schemas come in.

    ST 2005 Note #1

    A number of organisations have been discussing how they are using RDF as their data interchange format. They describe their use of RDF and present examples of RDF encoded in XML.

    When representing data in XML it often boils down to how you use XML to represent a directed graph, where the arcs are labeled and have meaning. This in turn raises the value/identity issue. For example:

    <movements>
    <movement>
    <grade><name>JET</name></grade>
    </movement>
    <movement>
    <grade><name>JET</name></grade>
    </movement>
    </movements>

    In this example we assume that the message has to be self-contained, i.e. no external references. The example includes the grade JET, which is identified by value. There are a number of things we don't know. What is the relationship between a movement and the grade? Does a movement carry the grade? If a movement is deleted, is the grade deleted as well? Addressing the first issue involves making the relationship a first class element. Whilst doing that, let's ensure there is only ever one JET grade defined in the document.

    <movements>
    <movement>
    <carries>
    <grade id='1'><name>JET</name></grade>
    </carries>
    </movement>
    <movement>
    <carries href='#1'/>
    </movement>
    </movements>

    In this example the second carries element refers back to the grade defined in the first. The implication is that the JET grade has an identity which is important. In complex schemas this type of approach is often used, though the reference and the referenced element may be in different locations. Of course, there are issues around containment (i.e. when I delete the first movement element, does that delete the JET grade or not?). However, let's ignore that for the time being.
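
To show what a consumer of this home-grown convention has to do, here is a minimal Python sketch using the standard library's ElementTree. It assumes the second carries element is an empty reference (href='#1') and that ids are unique within the document; the helper name grade_of is made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<movements>
  <movement>
    <carries><grade id='1'><name>JET</name></grade></carries>
  </movement>
  <movement>
    <carries href='#1'/>
  </movement>
</movements>
"""

root = ET.fromstring(DOC)

# Index every element that declares an id so references can be resolved.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def grade_of(movement):
    """Return the grade a movement carries, following href if present."""
    carries = movement.find("carries")
    href = carries.get("href")
    if href is not None:
        return by_id[href.lstrip("#")]
    return carries.find("grade")


grades = [grade_of(m) for m in root.findall("movement")]
print([g.findtext("name") for g in grades])
# Both movements resolve to the very same grade element: identity, not value.
print(grades[0] is grades[1])
```

Every consumer has to write resolution logic like this by hand, because the href convention is specific to this document.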

    Note that an approach to representing relationships has been constructed 'on the fly'. It's not standard and the semantics aren't clear.

    We could express each movement as a set of triples (subject, predicate, object) thus:

    movement (some id) carries grade (some id)

    Or, with a liberal addition of URIs for identifiers:

    http://www.newco.com/movements/1234 http://www.newco.com/predicates/carries http://www.newco.com/grades/5678 .
    http://www.newco.com/grades/5678 http://www.newco.com/predicates/name JET .
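
The two triples above can be held as plain tuples. The Python sketch below renders them one per line in the style of N-Triples (URIs in angle brackets, literals in double quotes, a terminating full stop). It is only a sketch: the fourth tuple field is an assumption added to tell literals from URIs, and literal escaping is not handled.

```python
NEWCO = "http://www.newco.com/"

# (subject, predicate, object, object_is_literal)
TRIPLES = [
    (NEWCO + "movements/1234", NEWCO + "predicates/carries", NEWCO + "grades/5678", False),
    (NEWCO + "grades/5678", NEWCO + "predicates/name", "JET", True),
]


def to_ntriples(triples):
    """Render triples one per line: URIs in angle brackets, literals in quotes."""
    lines = []
    for s, p, o, is_literal in triples:
        obj = '"%s"' % o if is_literal else "<%s>" % o
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)


print(to_ntriples(TRIPLES))
```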

    This URI based triples model is what RDF uses. Moving to an XML representation of the triples, we'll assume the following preamble:

    <?xml version="1.0"?>
    <!DOCTYPE rdf:RDF [<!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">]>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:terms="http://www.newco.com/terms/"
    xmlns:grades="http://www.newco.com/grades/"
    xmlns:types="http://www.newco.com/types/">

    In RDF we could describe the movement and grade thus:

    <rdf:Description rdf:about="http://www.newco.com/movements/1234">
    <rdf:type rdf:resource="http://www.newco.com/types/Movement"/>
    <terms:carries rdf:resource="http://www.newco.com/grades/5678"/>
    </rdf:Description>

    <rdf:Description rdf:about="http://www.newco.com/grades/5678">
    <rdf:type rdf:resource="http://www.newco.com/types/Grade"/>
    <terms:name>JET</terms:name>
    </rdf:Description>

    Note we added type, so we know what type of thing we are dealing with. For brevity, we can remove the rdf:Description and rdf:type verbosity by using the type as an element name:

    <types:Movement rdf:about="http://www.newco.com/movements/1234">
    <terms:carries rdf:resource="http://www.newco.com/grades/5678"/>
    </types:Movement>

    <types:Grade rdf:about="http://www.newco.com/grades/5678">
    <terms:name>JET</terms:name>
    </types:Grade>
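
To see that the abbreviated form still carries the same triples, the Python sketch below reads it back with the standard library's ElementTree. This is not an RDF parser: it handles only the typed-node form shown here (rdf:about on the node, rdf:resource or text on each property), and in practice an RDF toolkit would be the better choice. The helper names uri and triples are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:terms="http://www.newco.com/terms/"
         xmlns:types="http://www.newco.com/types/">
  <types:Movement rdf:about="http://www.newco.com/movements/1234">
    <terms:carries rdf:resource="http://www.newco.com/grades/5678"/>
  </types:Movement>
  <types:Grade rdf:about="http://www.newco.com/grades/5678">
    <terms:name>JET</terms:name>
  </types:Grade>
</rdf:RDF>
"""


def uri(tag):
    """Turn ElementTree's '{namespace}local' tag into a plain URI."""
    return tag[1:].replace("}", "")


def triples(xml_text):
    """Extract triples from the abbreviated (typed node) form only."""
    found = []
    for node in ET.fromstring(xml_text):
        subject = node.get("{%s}about" % RDF_NS)
        # The element name itself supplies the rdf:type triple.
        found.append((subject, RDF_NS + "type", uri(node.tag)))
        for prop in node:
            resource = prop.get("{%s}resource" % RDF_NS)
            value = resource if resource is not None else prop.text
            found.append((subject, uri(prop.tag), value))
    return found


for t in triples(DOC):
    print(t)
```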

    What have we gained over the initial XML?

    • We have a formal model for representing information about an entity (triples) which we do not have to invent.
    • We have a well defined mapping from this model to XML and we didn't have to invent it.
    • We haven't had to invent a mechanism for handling the fact our grade/movement model isn't hierarchical, the types are peers.
    • We have not ended up in a world of xsi:type pain.
    • We have RDF tool support if we need it.
    • If we need collections with defined semantics then RDF supplies these; we do not have to invent them.

    Clearly RDF is a big topic, and there is also RDF-Schema and then OWL to layer some rules on top of structure. More on this later.


    [1] The ST 2005 Notes are consolidated notes from the 2005 Semantic Technology conference.