Data Science: What's In a Name?

Kurt Cagle 08/07/2021

I am Kurt Cagle, or, according to my birth certificate, Kurt Alan Cagle. My name is Kurt Cagle.

Now, think about that for a bit. The verb "to be" is remarkably slippery, and it is slippery in almost every language on the planet that has such a construct. For instance, consider the following statements:

I am Kurt Cagle.

I am a writer.

These are two of the most fundamental assertions in language. The first statement can be broken down as:

There exists a label associated with the referenced entity that (at least locally) identifies that entity to differentiate it from other entities.

The second statement can also be restated as:

There exists a set associated with the referenced entity that indicates membership of that entity within that set, which in turn has a label.

Makes sense, right? Welcome to the world of ontology!

The shape of names evolved over time. The concept goes way back: the Proto-Indo-European word for name (which had its origins in the Crescent Valley) was nomen (nuh-min), and outside of that family tree, the Chinese root for name is ming, which many linguists would recognize as a cognate form of nomen, meaning that people have been using names for at least six thousand years, and possibly for far longer.

The first names were likely given names, and were in essence "gifted" names: the name bestowed by others (typically the parent) to signify a given aspiration - such as Grace, Hope, Luke (shining) - or a beseechment or dedication to a deity, such as Mark (Mars-like or martial) or Michael (who is like God) or Gabriel (strength of God), with the suffix "el" in the latter two cases meaning Lord or Ruler (from the Sumerian and Phoenician Ba'el, reflected in the name Al-lah in Arabic and Muslim cultures).

Women's names were typically diminutives of men's names, where a diminutive was a shortened or "softened" form of a man's name that often stemmed from the roots for small, such as Gabriella or Marcia, softened forms of Gabriel and Mark respectively. They were also given names that reflected beauty, such as plant names (e.g., Holly, Ivy, or Lily), or gem names (Ruby, Pearl). Occasionally male names in different languages than the naming language would become feminized variants, such as the French Jean (John, in English) becoming the feminine form of John in England. In general, there are many more variants of female names than male ones.

Within family groups, this differentiation was sufficient to ensure uniqueness most of the time, though in small groups you might have adjectives that qualify these names - Big John, Tall John, Red John, and so forth. In some cases, especially among rulers, these qualifiers became parts of their name - Charlemagne was, as an example, Charles the Great. The word nickname, by the way, has nothing to do with the devil (Old Nick) but instead started out as ekename in Old English, where eke meant "also" or "alternative". As eke fell out of usage in comparison to also in OE, ekename became nekename, with the middle syllable eventually lost to become nickname. Alternative names, synonyms, or aliases tend to be weaker because they generally have weaker authority (a lesson that ontologists should pay especially close attention to).

Once cultures reached a certain size, given names were no longer adequate to fully differentiate members of that population. One solution to this, seen especially in northern cultures, was to use familial relationships: John, Son of James (John Jameson), was different from John, Son of John (John Johnson). Admittedly, this made more sense in villages where people knew one another's families reasonably well, but it also accounts for the reason that Johnson is one of the most common surnames in regions with strong Nordic roots. In other places (especially in England and Germany) profession names were used to differentiate family lines - Smith, Sawyer (a person who used saws to cut down trees, or a lumberjack), Miller, Tinker (a tin smith), Carpenter, and so forth often uniquely identified a person in that profession, and as family trades were frequently handed down, so too were the differentiating surnames.

Finally, family names also tended to echo prominent place features - Lake, Brook, Craig (a mountain), Fields, etc. - associated with the family. This was especially true of nobles and other officials, who often took the name of a given property or city that they had dominion over, though the use of originating cities or regions as qualifiers also goes way back.

The use of both a given name and a family or surname almost invariably was tied into tax collection. For instance, after the invasion of England by Willelm of Normandy (a.k.a. William the Conqueror) in 1066, one of the first orders of business was to identify the wealthy people and assets in the country, in a survey called the Domesday Book. These tax records served to freeze what had been until that time colloquial names (such as the use of professional names such as Smith or Miller as differentiators), while also formalizing "House Names" such as the Houses of York or Lancaster (lampshaded in George R.R. Martin's Game of Thrones series as House Stark and House Lannister respectively).

It's worth noting that taxonomists and ontologists refer to the given + family names or surnames as qualified names; the surname qualifies the given (or local) name. From a more formal code standpoint, the qualifier acts as a namespace for the terms (names) within that space, and typically denotes set or class membership. Such a system dramatically reduces the likelihood that a name may refer to more than one person. As such, it is a mechanism for determining uniqueness in a broader set.

Note that beyond the emergence of given and surnames, there are other qualifiers that can differentiate a name, such as patronymics (senior, junior, the third, elder, younger, etc.), and honorifics that ironically also qualify a person by profession or distinction (sir, which is a contraction of Senior, doctor, reverend, etc.) as well as gender identifiers, up to and including the latest fashion of specifying pronouns for address purposes.

Western European styles also reflect a cultural preference for putting the given name first in narrative prose, though in legal contracts and other communications, the reverse order of surname and given name, separated by a comma, is frequently used to facilitate sorting by family name. Asian countries, on the other hand (with notable exceptions including Thailand and the Philippines), always use the qualifying (sur) name first. As such, it is typical to store a common usage name in the Western style while also storing given names and surnames separately in order to facilitate sorting using either convention.
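
As a rough illustration of that storage convention, here is a minimal Python sketch. The StoredName class, its field names, and the "John Brook" entry are assumptions made up for this example; they are not part of any standard or of the model developed later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredName:
    """Hypothetical record: one display form plus separately stored parts."""
    display_name: str  # common-usage form, Western order
    given_name: str
    surname: str

    def surname_first_key(self):
        # Sort key for "Surname, Given" ordering, ignoring case.
        return (self.surname.casefold(), self.given_name.casefold())

    def given_first_key(self):
        # Sort key for "Given Surname" ordering, ignoring case.
        return (self.given_name.casefold(), self.surname.casefold())


# Illustrative records only.
stored = [
    StoredName("Kurt Cagle", "Kurt", "Cagle"),
    StoredName("Jane Doe", "Jane", "Doe"),
    StoredName("John Brook", "John", "Brook"),
]

by_surname = [n.display_name for n in sorted(stored, key=StoredName.surname_first_key)]
by_given = [n.display_name for n in sorted(stored, key=StoredName.given_first_key)]
```

Because the parts are stored separately, neither ordering has to be recovered by splitting the display string, which is unreliable for multi-word given names or surnames.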

Cardinality and Reification

It is dangerous to assume that there is always a one-to-one correspondence between an individual and a name. Indeed, for fifty percent of the population, it is likely that their name will change at least once in their lifetime. That segment, of course, is women. Until comparatively recently (the 1960s in the United States) if a woman married, she was expected to take the surname of her husband. The feminist movement started changing that, in part as a reflection of shifting expectations about property ownership, taxation, and a weakening of the ecclesiastical view of marriage and divorce. While still a fairly low percentage, women in more marriages than ever are choosing to keep their "maiden names" if they marry, or both partners (especially in same-sex relationships) are choosing to create hyphenated surnames that differ from their pre-marriage surnames.

Nonetheless, in modeling individuals, the assumption should be that surnames especially will change over time, and given names may very well change too. Once again, gender plays a role. A person may very well either physically change their sex through surgery or may at least publicly present themselves as the opposite gender, with names reflecting this event.

It's worth noting that there are always political dimensions when it comes to data modeling, and nowhere is that as intense as with identity modeling. Any modeling involves making certain assumptions, assumptions that are often informed by cultural norms and expectations. We are now entering an era where identity is fluid: it changes over time based upon gender intent, relational status, professional appellation (The Artist Formerly Known as Prince) and even social context. For instance, you are increasingly seeing gender pronoun preferences (he, him, his; she, her, hers; ze, zir, zis) in social media.

Yet at the same time this adds to the complexity of the model. From a semantics perspective, this recreates a structure that occurs whenever you have temporal evolution, what I'd call the now-then pattern.

The now part of the pattern is an assertion that, at the time the assertion is made, is true:

Her name is Jane Doe

The then part of the pattern, on the other hand, is a set of assertions that specify a range (possibly open-ended) identifying an event or state:

This is an event.

This event refers to a property called name.

The value of this property is Jane Doe.

This event began on March 16, 1993.

This event ended on June 17, 2021.

This event was reported by Kurt Cagle.
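
The then half of the pattern can be sketched in Python as a small record with a validity range. The NameEvent class, the "person:jane" identifier, and the choice to treat the end date as inclusive are all assumptions made for this illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class NameEvent:
    """Hypothetical 'then' record: a name assertion plus its time range."""
    subject: str
    value: str
    began: date
    ended: Optional[date]  # None means the range is still open
    reported_by: str

    def holds_on(self, day: date) -> bool:
        # Assumption: both the begin and end dates are inclusive.
        if day < self.began:
            return False
        return self.ended is None or day <= self.ended


def name_on(events, subject, day):
    """Return the first recorded name that holds for the subject on a day."""
    for event in events:
        if event.subject == subject and event.holds_on(day):
            return event.value
    return None


# The single event described in the text.
events = [
    NameEvent("person:jane", "Jane Doe", date(1993, 3, 16), date(2021, 6, 17), "Kurt Cagle"),
]
```

Asking name_on(events, "person:jane", some_date) answers the "what was her name then?" question, while the now assertion is simply whichever event currently holds.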

This second structure is known in semantic circles as an example of reification, meaning that the second set of assertions describes a single relationship. The this in this case is in fact the statement Her name is Jane Doe. For those familiar with SQL, reification typically corresponds to third normal form (3NF) constructions.

In more abstract terms, the initial statement can be broken down as:

r = {s->[p]->o}

where s is a reference to a subject entity, p is a reference to a relationship or property, and o is a reference to an object or value relative to that relationship. The reification is then a set of other relationships that refer to the given assertion or statement r:

r is a reification.

r has property p.

r has subject s.

r has object o.

r starts at time t1

r optionally ends at time t2.

r was reported by m.
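
The list above can be expanded mechanically into triples. The sketch below uses plain Python tuples rather than an RDF library. The rdf:Statement, rdf:subject, rdf:predicate, and rdf:object terms come from the standard RDF reification vocabulary, while every ex: name, the reify function, and the generated identifiers are hypothetical.

```python
from itertools import count

_statement_ids = count(1)


def reify(s, p, o, starts_at, reported_by, ends_at=None):
    """Return an identifier r and the triples that describe s->[p]->o."""
    r = f"ex:statement{next(_statement_ids)}"
    triples = [
        (r, "rdf:type", "rdf:Statement"),     # r is a reification
        (r, "rdf:subject", s),                # r has subject s
        (r, "rdf:predicate", p),              # r has property p
        (r, "rdf:object", o),                 # r has object o
        (r, "ex:startsAt", starts_at),        # r starts at time t1
        (r, "ex:reportedBy", reported_by),    # r was reported by m
    ]
    if ends_at is not None:
        # The end time is optional, so it is only emitted when known.
        triples.append((r, "ex:endsAt", ends_at))
    return r, triples


r, triples = reify(
    "ex:jane", "ex:name", "Jane Doe",
    starts_at="1993-03-16", reported_by="ex:kurt", ends_at="2021-06-17",
)
```

One original assertion becomes seven triples here, which is the cost side of the trade-off: each extra piece of metadata on the reification is another triple to store and query.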

The reification is significant because it specifies the time to live of a given relationship between two things. Reifications can also hold other metadata (for instance, specifying a pronoun type indicating preferred gender designation). However, it's also worth noting that you can have a great deal of information within a reification, but that also adds significantly to the number of assertions (triples) bound to that reification.

In terms of a graph, a reification is in fact the metadata associated with an edge between two objects. For instance, if s is an airport, o is also an airport, and p is an indication that a route exists between s and o, then r = {s->[p]->o} is in fact the route between s and o:

airport:_SEA airport:hasRoute airport:_DEN (Seattle has a route to Denver).

The route is in effect a reification (especially as routes, which are largely ephemeral and abstract entities, change far more quickly than airports do).

The route can assign a mean travel time as a property on the reification. This is, effectively, contextual information, information that belongs not to either airport but rather to the relationship that exists between the two.
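
A minimal way to picture this in Python is a dictionary keyed by the edge itself. The add_edge and edge_property helpers are hypothetical, and the travel time is a placeholder value, not a real flight time.

```python
# Edge metadata lives on the (subject, predicate, object) key,
# not on either of the two airport nodes.
graph_edges = {}


def add_edge(s, p, o, **metadata):
    """Record an edge and whatever contextual properties belong to it."""
    graph_edges[(s, p, o)] = dict(metadata)


def edge_property(s, p, o, key):
    """Look up a property of the relationship, or None if it is absent."""
    return graph_edges.get((s, p, o), {}).get(key)


# Placeholder value for illustration only.
add_edge("airport:_SEA", "airport:hasRoute", "airport:_DEN", mean_travel_minutes=150)

outbound = edge_property("airport:_SEA", "airport:hasRoute", "airport:_DEN", "mean_travel_minutes")
inbound = edge_property("airport:_DEN", "airport:hasRoute", "airport:_SEA", "mean_travel_minutes")
```

The reverse direction returns nothing because the property was attached to one specific relationship, which is exactly what makes it contextual.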

With regard to names, this introduces some interesting modeling issues. A personal name goes from being a simple label to being something with a structure, a presence, and a role or type. More on that in a bit, but before digging into the weeds, it's time to emphasize an important point here:

Reifications are almost invariably trade-offs between the need to deal with transients and the complexity of combinatorics. In the case of names, for instance, a given individual may have multiple names, though some may be birth names, some nicknames, some professional names, and some due to change in marital status or presentation status. A person may even have multiple names simultaneously. Names are, of course, not necessarily unique, but they still serve as one of the most commonly used identifiers for people, and for this reason as much as any other, this kind of reification makes sense.

Modeling Names (and a Sneak Peek of Templeton)

Given all of this, what would the best model for names look like? The now-then pattern suggests a two-pronged approach: first, model what a Personal Name should look like, then, from the set of all such names for the individual, choose the primary name for that person from the set, the name that is currently used to best represent that individual.

The following example is in what I'm calling Templeton (short for RDF Template Notation).

First a few words about the notation. The core of it (just as with SPARQL) is Turtle as a way of describing assertions (triples here). Variable names (beginning with a question mark) provide a label, and in some cases (such as ?fullName) a value used in multiple assertion templates. If a line is indented (and the preceding line ends with a semicolon) then the un-indented first term remains in force. For instance,

is short for

The hash mark (#) is a comment, but in the template it's used to signal cardinality. Thus #* indicates that the previous assertion may be repeated zero or more times, #+ indicates a one-or-more repetition, and #? indicates an optional assertion. If a variable starts with an uppercase letter, it indicates an IRI (or reference pointer); if it starts with a lowercase letter, though, then the value is an atomic value, defaulting to a string. Thus,

?PersonalName personalName:hasStartDate ?startDate; #? xsd:date

indicates that ?startDate in this particular case is a date.

The notation

%[a,b,c,...]% a class:PersonalNameType.

indicates that the items in the list are each subjects of the associated predicate and object, and is very useful for specifying type enumerations. Finally, the single a is a shorthand for rdf:type.

Note: Templeton is a shorthand templating notation I've been developing as a way of creating schemas that can be expanded to OWL, SHACL, XML Schema, or JSON-Schema. I'm working on a parser for it now.

Of Compositions, Associations, and the Now/Then Pattern

The modeling of PersonalName should seem straightforward, with a few caveats. First, it has been my observation working with dozens of ontologies over the years that almost every time you define a class, there is usually some kind of intent indicator needed. Such indicators do not materially change the definition of the class, but they do provide a level of context about what a particular instance is intended to do. For instance, PersonalNameType identifies whether something is a birth name, a married name, an alias, or a professional name (among others). These are differentiated from subclasses because they do not change any other properties.
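
In programming terms, an intent indicator of this kind behaves more like an enumerated field than like a subclass. The Python sketch below is only an analogy: the class and member names mirror the terms used above, but the structure, and the labeling of "Kurt Cagle" as an alias, are assumptions for illustration rather than the Templeton model itself.

```python
from dataclasses import dataclass, fields
from enum import Enum


class PersonalNameType(Enum):
    """Intent indicators; none of them adds or removes properties."""
    BIRTH = "birth name"
    MARRIED = "married name"
    ALIAS = "alias"
    PROFESSIONAL = "professional name"


@dataclass(frozen=True)
class PersonalName:
    full_name: str
    name_type: PersonalNameType


def names_of_type(names, name_type):
    """Filter by intent without needing one class per kind of name."""
    return [n.full_name for n in names if n.name_type is name_type]


# Illustrative instances only.
known_names = [
    PersonalName("Kurt Alan Cagle", PersonalNameType.BIRTH),
    PersonalName("Kurt Cagle", PersonalNameType.ALIAS),
]

birth_names = names_of_type(known_names, PersonalNameType.BIRTH)
property_names = [f.name for f in fields(PersonalName)]
```

Every instance carries the same set of properties whatever its type, which is the test the paragraph above applies for choosing a type indicator over a subclass.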

The second caveat has to do with modeling. UML differentiates between a composition and an association. An association typically describes a relationship between two disparate entities, and in the semantic parlance could be considered the same as a reification (or third normal form construction in SQL). A composition, on the other hand, occurs when there is an existential dependency between the subject and object. For instance, even if you have two people who have the same personal name, these two instances are distinct (having different start and end dates, for instance). Should a person be deleted from the database, all of the names associated with that person would also need to be deleted (which is not true for associations).

In my own modeling, compositions should always belong to the reference subject, or, put another way, the relationship points from the subject to the object semantically. Associations, on the other hand, generally are reifications - there is a reifying object, such as the route in our airport example, that binds two entities together. If you delete the reification (the route, here), you don't in this case delete the associated entities (the airports).
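
The difference in deletion behavior can be sketched with a toy in-memory store. The TinyStore class and its method names are hypothetical; a real triple store or database would express the same rules through its own delete or cascade mechanisms.

```python
class TinyStore:
    """Toy store contrasting composition (names) with association (routes)."""

    def __init__(self):
        self.person_names = {}  # person id -> names owned by that person
        self.airports = set()   # independent entities
        self.routes = set()     # (origin, destination) associations

    def add_person(self, person_id, *names):
        self.person_names[person_id] = list(names)

    def delete_person(self, person_id):
        # Composition: the names exist only as part of the person,
        # so removing the person removes them as well.
        self.person_names.pop(person_id, None)

    def add_route(self, origin, destination):
        self.airports.update((origin, destination))
        self.routes.add((origin, destination))

    def delete_route(self, origin, destination):
        # Association: only the binding object goes away;
        # the airports it connected are left untouched.
        self.routes.discard((origin, destination))


store = TinyStore()
store.add_person("person:jane", "Jane Doe")
store.add_route("airport:_SEA", "airport:_DEN")

store.delete_person("person:jane")
store.delete_route("airport:_SEA", "airport:_DEN")
```

After both deletions the store holds no names and no routes, but it still holds both airports.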

There are some objects that seem to skirt the boundaries. An address is a good example. If a person has an associated address, a naive modeling would make an address a composition. However, it's not. Multiple people can live at the same address. If one person moves away, that does not cause the address itself to "disappear". This also means that the association of a person with an address should be seen as being a reification. I use the term Habitation as the class for that reification, one that points to both a person and an address:

Regardless of whether something is a composition or an association, there are times where you just want to know what a person's current primary name is, without having to build complex queries to find it. This is where inferred triples come into play. An inferred triple is typically generated, either through a SPARQL Update query or as part of a CONSTRUCT (these are more or less the same, depending upon how inferred triples are persisted).

For instance, the following SPARQL query will change the primary name for a person to the specified value:

Inferred triples are frequently transitory assertions - they reflect the default value from a set of objects, but that can change, and frequently they provide a way of short-circuiting complex queries. For instance, Person:hasPrimaryNameString is the string representation of the default personal name. This can be made even more powerful by making that particular property the subproperty of something like skos:prefLabel (assuming a basic inference engine), so that a naive query, such as:

will return a list of all entities which have a primary label of "Jane Doe" in them. Note that this isn't a terribly efficient query, but it can be handy, nonetheless.

So when you're thinking about the design of your models, identify those properties that you'd intuitively want to see for the classes in question that can be inferred or derived, and in effect pre-generate or update these properties as the state of the object changes so that your users don't have to build complex queries. Remember, a triple store is an index, and such actions can be thought of as optimizing that index.
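
Outside of a triple store, the same idea looks like a cached field that is refreshed whenever the underlying state changes. In the Python sketch below, the classes, the rule that the most recently started current name wins, and the second name "Jane Roe" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PersonalName:
    full_name: str
    start_date: date
    end_date: Optional[date] = None  # None means still in use


@dataclass
class Person:
    names: List[PersonalName] = field(default_factory=list)
    # The derived value, analogous to an inferred triple.
    primary_name_string: Optional[str] = None

    def add_name(self, name: PersonalName, today: date) -> None:
        self.names.append(name)
        self.refresh_primary_name(today)

    def refresh_primary_name(self, today: date) -> None:
        current = [
            n for n in self.names
            if n.start_date <= today and (n.end_date is None or today <= n.end_date)
        ]
        # Assumed rule: the most recently started current name is primary.
        if current:
            self.primary_name_string = max(current, key=lambda n: n.start_date).full_name
        else:
            self.primary_name_string = None


person = Person()
person.add_name(PersonalName("Jane Doe", date(1993, 3, 16), date(2021, 6, 17)), today=date(2000, 1, 1))
first_primary = person.primary_name_string

# Hypothetical later name, added only to show the refresh.
person.add_name(PersonalName("Jane Roe", date(2021, 6, 18)), today=date(2021, 7, 1))
second_primary = person.primary_name_string
```

Readers of primary_name_string never have to walk the list of names or reason about dates; that work is done once, at update time.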

Summary

Modeling, when it comes right down to it, is the process of questioning your assumptions and optimizations. A big issue that arises with most traditional SQL systems is that many database modelers try to reduce complexity by minimizing the number of database tables and joins, but this also reduces the contextual metadata that is increasingly a requirement in today's data-rich world.

Kurt Cagle

Tech Expert

Kurt is the founder and CEO of Semantical, LLC, a consulting company focusing on enterprise data hubs, metadata management, semantics, and NoSQL systems. He has developed large scale information and data governance strategies for Fortune 500 companies in the health care/insurance sector, media and entertainment, publishing, financial services and logistics arenas, as well as for government agencies in the defense and insurance sector (including the Affordable Care Act). Kurt holds a Bachelor of Science in Physics from the University of Illinois at Urbana–Champaign.
