The Rule of One

Kurt Cagle 23/04/2018

I'm going to put my geek hat on for a bit. Over the last couple of months I've been exploring semantic modeling from the standpoint of "context-free" design, looking for patterns that seem to hold true regardless of the data topic itself. One pattern that I now feel comfortable identifying is "The Rule of One", or put another way, "Ted Codd was right".

When people come to RDF and semantics from XML or JSON, there is a question about the direction of an assertion. A class instance (a resource) can be said to "own" a property or relationship if the resource is a subject (the first part) in an assertion and the property is the predicate. A resource is referenced if the resource appears as the object (the last part) in an assertion.

Now, in an information representation such as XML or JSON, ownership tends to point to the idea that if you have a container (such as a class), then the items contained in it (i.e., instances of that class) are "owned" by the class: the arrows go from class to instance. Similarly, a taxonomy can be thought of as a succession of "narrower than" statements descending from a single high-level representation such as Thing. A thing may be an animal, vegetable or mineral, and so on all the way down the chain to variations of cat species. This seems logical. It is also, I believe, a poor way to model information.


Consider, for instance, the class of Batman's villains (that's a class?!). Anyone who collected comics, watched TV or movies can probably enumerate them - the Joker, Catwoman, the Penguin, Poison Ivy, the Riddler, Scarecrow, Harley Quinn, and so on. There are two different ways of stating this:

@prefix character: <http://www.example.com/ns/character/>.
character:Batman    character:hasAntagonist 
     character:Joker,
     character:Catwoman,
     character:Penguin,
     character:PoisonIvy,
     character:Riddler,
     character:Scarecrow,
     character:HarleyQuinn.

#or

@prefix character: <http://www.example.com/ns/character/>. 
character:Joker        character:isAntagonistOf   character:Batman.
character:Catwoman     character:isAntagonistOf   character:Batman.
character:Penguin      character:isAntagonistOf   character:Batman.
character:PoisonIvy    character:isAntagonistOf   character:Batman.
character:Riddler      character:isAntagonistOf   character:Batman.
character:Scarecrow    character:isAntagonistOf   character:Batman.
character:HarleyQuinn  character:isAntagonistOf   character:Batman.


The first approach has Batman "owning" each of his antagonists. Because this is typical of a hierarchical approach, it is usually called top-down modeling. The second approach has each of his villains "owning" the link to Batman. In this case there is still a hierarchy, but it moves from the bottom up.

If you are working with RDF exclusively, either approach is valid, but the bottom-up approach is preferable. Why? Because if you delete a given character (say the Joker) in a bottom-up model, where by delete I mean eliminating all triples with the Joker as subject, the Batman entry need not be changed. More generally, in a bottom-up model, deleting a given resource should never involve changing another resource. If the Joker entry is deleted in a top-down model, then every reference to him elsewhere in the database also needs to be deleted.
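
The difference can be sketched in a few lines of Python, treating each triple as a plain (subject, predicate, object) tuple. This is only an illustration under that assumption, not a real triple store API, and the helper names delete_subject and dangling_references are hypothetical.

```python
# Triples are modeled as plain (subject, predicate, object) tuples.
# This is an illustrative stand-in for a triple store, not a real RDF API.

BATMAN = "character:Batman"
VILLAINS = ["character:Joker", "character:Catwoman", "character:Penguin"]

# Top-down: Batman "owns" the links to his antagonists.
top_down = {(BATMAN, "character:hasAntagonist", v) for v in VILLAINS}

# Bottom-up: each villain "owns" the link to Batman.
bottom_up = {(v, "character:isAntagonistOf", BATMAN) for v in VILLAINS}


def delete_subject(triples, subject):
    """Delete a resource: drop every triple whose subject is that resource."""
    return {t for t in triples if t[0] != subject}


def dangling_references(triples, resource):
    """Triples owned by other resources that still point at the resource."""
    return {t for t in triples if t[2] == resource and t[0] != resource}


joker = "character:Joker"

# Bottom-up: removing the Joker's own triples leaves nothing pointing at him.
after_bottom_up = delete_subject(bottom_up, joker)
print(dangling_references(after_bottom_up, joker))  # set()

# Top-down: the Joker has no triples of his own here, so the delete is a
# no-op and Batman's entry still refers to him; it needs a second cleanup.
after_top_down = delete_subject(top_down, joker)
print(dangling_references(after_top_down, joker))
```

In the bottom-up set the delete is complete after one pass over the Joker's own triples, while the top-down set still holds a triple on Batman's entry that has to be found and removed separately.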

There's a second benefit to using a bottom-up approach: at any given point, each resource has at most one value for a given property. The Joker's antagonist is Batman, and nothing more needs to be said on the Joker's entry. That is the shape of a row in a SQL table, with the resource as the key and the property as a column, which means that in converting the RDF to SQL there is a one-to-one mapping between the two.
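
As a rough sketch of that mapping, the bottom-up triples can be loaded into a table using Python's built-in sqlite3 module. The table name character and the column name is_antagonist_of are assumptions made for illustration; they are not part of any standard RDF-to-SQL scheme.

```python
import sqlite3

# Bottom-up triples: each subject has at most one value per predicate.
triples = [
    ("character:Joker", "character:isAntagonistOf", "character:Batman"),
    ("character:Catwoman", "character:isAntagonistOf", "character:Batman"),
    ("character:Penguin", "character:isAntagonistOf", "character:Batman"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical table layout: the subject becomes the primary key and the
# single-valued predicate becomes a column.
conn.execute(
    "CREATE TABLE character (id TEXT PRIMARY KEY, is_antagonist_of TEXT)"
)
conn.executemany(
    "INSERT INTO character (id, is_antagonist_of) VALUES (?, ?)",
    [(s, o) for s, p, o in triples if p == "character:isAntagonistOf"],
)

# Reading the rows back yields the original triples again, one per row.
rows = conn.execute(
    "SELECT id, is_antagonist_of FROM character ORDER BY id"
).fetchall()
round_trip = [(s, "character:isAntagonistOf", o) for s, o in rows]
print(sorted(round_trip) == sorted(triples))  # True
```

The top-down form would not fit this layout directly, since Batman's single row would need to hold several values in one column.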

This is one reason why RDF often seems a bit backwards to people working with JSON or XML. The most common assertion in RDF is the type assertion:

character:Batman      a  class:Character.
character:Superman    a  class:Character.
character:Joker       a  class:Character.
character:HarleyQuinn a  class:Character.
character:Catwoman    a  class:Character.
character:Thor        a  class:Character.
character:BlackWidow  a  class:Character.


This is a bottom-up approach - the instance declares its type. There is only one "a" (which is a shorthand for rdf:type) property for each instance, rather than the general class enumerating all of its instances. In XML, on the other hand, the approach is top-down:

<Character>
     <character id="Superman">..</character>
     <character id="Batman">..</character>
     <character id="Joker">..</character>
     <character id="HarleyQuinn">..</character>
     <character id="Catwoman">..</character>
     <character id="Thor">..</character>
     <character id="BlackWidow">..</character>
</Character>


Notice the id attribute. Here's where attributes come into play. In a forest (aka, an XML database) you actually want only one entry that contains the relevant information for a resource, because otherwise you end up with the potential for duplication. This means that in practice any item that actually represents an entity should be its own top level document, especially if it is referenced elsewhere. This would be better modeled as:

<class:Character id="Superman">..</class:Character>
<class:Character id="Batman">..</class:Character>
<class:Character id="Joker">..</class:Character>
<class:Character id="HarleyQuinn">..</class:Character>
<class:Character id="Catwoman">..</class:Character>
...


<Heroes>
     <class:Character ref="Superman"/>
     <class:Character ref="Batman"/>
     <class:Character ref="Thor"/>
     <class:Character ref="BlackWidow"/>
</Heroes>


A sequence of references in XML or JSON is equivalent to a bottom-up approach in RDF.

character:Superman    a class:Hero.
character:Batman      a class:Hero.
character:Thor        a class:Hero.
character:BlackWidow  a class:Hero.
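
Nothing is lost by recording only the bottom-up assertions, because the top-down listing can be derived whenever it is needed. A minimal Python sketch, assuming the type assertions are held as simple (instance, class) pairs:

```python
from collections import defaultdict

# Bottom-up type assertions as (instance, class) pairs.
type_assertions = [
    ("character:Superman", "class:Hero"),
    ("character:Batman", "class:Hero"),
    ("character:Thor", "class:Hero"),
    ("character:BlackWidow", "class:Hero"),
]

# Invert the assertions to recover the top-down view on demand.
members = defaultdict(list)
for instance, cls in type_assertions:
    members[cls].append(instance)

print(members["class:Hero"])
```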


Now, suppose that you have a situation where a character is in fact in two classes. Consider, for instance, the character of Catwoman. She's ... complicated.

character:Catwoman  a class:Villain.
character:Catwoman  a class:Hero.


In modeling, this suggests that class:Villain and class:Hero are not exclusive, which is not ideal, because it means that in some instances you would need to construct some fairly complex table relationships. Instead, this is where it becomes necessary to turn a relationship into an object, by treating villain and hero as morality roles.

moralityRole:CatwomanHero
    a class:MoralityRole;
    moralityRole:character  character:Catwoman;
    moralityRole:morality   morality:Hero;
    . 

moralityRole:CatwomanVillain
    a class:MoralityRole;
    moralityRole:character  character:Catwoman;
    moralityRole:morality   morality:Villain;
    . 


This has decoupled a one-to-many relationship into a set of one-to-one relationships. I can query whether Catwoman is a hero or a villain (or both) by taking advantage of these distinct hybrid classes, as shown in SPARQL:

select ?character ?morality where {
    ?character  a  class:Character.
    ?morality   a  class:Morality.
    ?role       a  class:MoralityRole.
    ?role moralityRole:character ?character.
    ?role moralityRole:morality  ?morality.
}

==>

    ?character     |    ?morality
===================+=====================
character:Batman   |   morality:Hero
character:Catwoman |   morality:Hero
character:Catwoman |   morality:Villain
character:Joker    |   morality:Villain


SQL developers will recognize the above as equivalent to the second normal form. What's most significant about these is that if you delete an entry in a second normal form, it does not change any of the entries it is pointing to.
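
The same join can be sketched in plain Python, with the role objects held as (subject, predicate, object) tuples. Only the two Catwoman roles come from the text above; the Batman role is an assumed extra record, and the single_value helper is hypothetical, written here to enforce the Rule of One on each role.

```python
# Role "join objects" modeled as (subject, predicate, object) tuples.
# Only the two Catwoman roles appear in the text; the Batman role is an
# assumed extra record added so the result has more than one character.
triples = {
    ("moralityRole:CatwomanHero", "moralityRole:character", "character:Catwoman"),
    ("moralityRole:CatwomanHero", "moralityRole:morality", "morality:Hero"),
    ("moralityRole:CatwomanVillain", "moralityRole:character", "character:Catwoman"),
    ("moralityRole:CatwomanVillain", "moralityRole:morality", "morality:Villain"),
    ("moralityRole:BatmanHero", "moralityRole:character", "character:Batman"),
    ("moralityRole:BatmanHero", "moralityRole:morality", "morality:Hero"),
}


def single_value(triples, subject, predicate):
    """Return the one object for (subject, predicate), enforcing the Rule of One."""
    values = [o for s, p, o in triples if s == subject and p == predicate]
    if len(values) != 1:
        raise ValueError(f"{subject} has {len(values)} values for {predicate}")
    return values[0]


# Every role owns exactly one character link and exactly one morality link.
roles = sorted({s for s, p, o in triples if p == "moralityRole:character"})
result = sorted(
    (
        single_value(triples, role, "moralityRole:character"),
        single_value(triples, role, "moralityRole:morality"),
    )
    for role in roles
)
for character, morality in result:
    print(character, morality)
```

Catwoman appears twice in the result, yet no single resource ever carries two values for the same property; the plurality lives entirely in the number of role objects.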

A second consequence of the rule of one is that even in those situations where you think you have multiple atomic values (such as a list of non-preferred terms), if you have a one-to-many relationship then what you are actually looking at is a referential relationship between objects.


For instance, consider representations. A given character may have multiple representations. A character such as Catwoman has literally had more than nine lives, as she has been depicted by multiple actors and rendered by multiple artists over the course of her nearly eighty-year lifespan (it's one reason why I like to single out that particular character when talking about modeling).

A good example of a normalizing model for a media character looks something like this:

character:Catwoman
     a class:Character;
     character:name "Catwoman";
     .

character:Catwoman3
     a class:Character;
     character:name "Catwoman";
     character:alterEgo    character:SelinaKyle3;
     character:series      series:Batman1966;
     character:actor       actor:JulieNewmar;
     character:archetype   character:Catwoman;
     .

character:Catwoman4
     a class:Character;
     character:name "Catwoman";
     character:alterEgo    character:SelinaKyle3;
     character:series      series:Batman1966;
     character:actor       actor:EarthaKitt;
     character:archetype   character:Catwoman;
     .

character:Catwoman9
     a class:Character;
     character:name "Catwoman";
     character:alterEgo    character:SelinaKyle7;
     character:series      series:DarkKnight;
     character:actor       actor:AnneHathaway;
     character:archetype   character:Catwoman;
     .

character:Catwoman8
     a class:Character;
     character:name "Catwoman";
     character:alterEgo    character:SelinaKyle8;
     character:series      series:Catwoman;
     character:artist      artist:AdamHughes;
     character:archetype   character:Catwoman;
     .

character:Catwoman12
     a class:Character;
     character:name "Catwoman";
     character:alterEgo    character:SelinaKyle10;
     character:series      series:Gotham;
     character:actor       actor:CamrenBicondova;
     character:archetype   character:Catwoman;
     .


In this particular case, if there is an archetypal character called Catwoman, then there are multiple representations, based upon the actress, the artist and the series (whether film, TV, cartoon or comic), each of which also has a reference to the archetype. To get a list of all actresses that have played the character over the years, the SPARQL query looks like this:

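# Standard SPARQL has no trailing binding map, so ?archetype is bound with VALUES.
# Rows that come from the artist or alias branches leave ?actress unbound.
select ?actress ?artist ?alias where {
    values ?archetype { character:Catwoman }
    ?character character:archetype ?archetype.
    ?character character:series ?series.
    {
        {
        ?character character:actor ?actress.
        }
      UNION
        {
        ?character character:artist ?artist.
        }
      UNION
        {
        ?character character:alias ?alias.
        }
    }
}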


The union is there because a given character representation will have either an actor or an artist (in the case of an animated series, it will have both).

Character aliases represent another multiplicity case, one of the reasons that modeling comic books can be such a challenge. A character can have more than one alias. At the same time, as was true with the original Batman TV series, the same character may end up being portrayed by different actors. A fully decomposed entity needs to reduce all of these pluralities into individual entries.

Because both archetype and alterEgo take an object that is of type Character, this particular class contains its own join object. This can add a lot of objects to your data set, and the temptation to "simplify" this by breaking the Rule of One can be high.


However, it's a temptation that should be resisted, because a significant percentage of the modeling problems that arise with large datasets come from trying to assert top-down thinking. If you say "well, we won't have very many use cases where we deal with plurals", you are almost always going to find yourself re-engineering down the road.

As a rule of thumb, the more that you turn your many-to-many relationships into one-to-one relationships, the more naturally your models will decompose. Moreover, the cost to create a "table" in RDF is almost non-existent. It's likely that any object to object relationship would be better recast as a join object, unless you are absolutely certain that your relationship is purely hierarchical, and even then going bottom-up is preferable. This includes things like addresses - a person may have multiple addresses, an address may be associated with multiple people, which means that a habitation object makes sense (especially when coupled with a specific event interval):

habitation:HQGotham
    a class:Habitation;
    habitation:address address:Gotham;
    habitation:character character:HarleyQuinn;
    habitation:from "2005"^^xsd:gYear;
    habitation:to   "2011"^^xsd:gYear;
    .

habitation:HQConeyIsland
    a class:Habitation;
    habitation:address address:ConeyIsland;
    habitation:character character:HarleyQuinn;
    habitation:from "2012"^^xsd:gYear;
    .


One other consequence is that, when modeling this way, most resources become considerably easier to model, as the only things associated with them are typically going to be things that describe some aspect of their existence that remains completely unchanged over their lifespan, with the bulk of the heavy lifting being taken care of by join objects.
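
To give a sense of how such join objects are used, here is a small Python sketch that answers "where did this character live in a given year?" from the two habitation records above. The dictionary layout and the address_in_year helper are hypothetical, and treating the "to" year as inclusive (with a missing "to" meaning "still current") is an assumption.

```python
# Habitation join objects as dictionaries; "to" is None while still current.
# Field names mirror the habitation: properties; the helper is hypothetical.
habitations = [
    {"id": "habitation:HQGotham", "address": "address:Gotham",
     "character": "character:HarleyQuinn", "from": 2005, "to": 2011},
    {"id": "habitation:HQConeyIsland", "address": "address:ConeyIsland",
     "character": "character:HarleyQuinn", "from": 2012, "to": None},
]


def address_in_year(records, character, year):
    """Return the addresses a character lived at during the given year."""
    return [
        r["address"]
        for r in records
        if r["character"] == character
        and r["from"] <= year
        and (r["to"] is None or year <= r["to"])
    ]


print(address_in_year(habitations, "character:HarleyQuinn", 2008))
print(address_in_year(habitations, "character:HarleyQuinn", 2015))
```

Neither the character nor the address record needs to change when Harley Quinn moves; a new habitation object is added and the old one gains an end year.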

Summary

By following the Rule of One in your modeling, you make it easier to transform data between serialization and storage formats, better preserve the distinction between resource definitions and resource references, and can limit the amount of re-engineering of data models (and consequently of APIs) that will happen as you realize that what you thought were atomic properties turn out not to be.


Kurt Cagle

Tech Expert

Kurt is the founder and CEO of Semantical, LLC, a consulting company focusing on enterprise data hubs, metadata management, semantics, and NoSQL systems. He has developed large scale information and data governance strategies for Fortune 500 companies in the health care/insurance sector, media and entertainment, publishing, financial services and logistics arenas, as well as for government agencies in the defense and insurance sector (including the Affordable Care Act). Kurt holds a Bachelor of Science in Physics from the University of Illinois at Urbana–Champaign. 

   
