We live in the age of data. Around us, databases hold exabytes of information about us and the world we live in. There is a sense, especially among managers and decision makers, that all of that data should be able to tell us something, especially after nearly a decade in which "Big Data" has become one of the most marketed phrases on the planet.
Yet for all of that, for the millions upon millions of dollars spent on big Hadoop projects, enough data lakes to fill a data sea, the rise of the Data Scientist as rock star (okay, I'll admit, I didn't see that one coming), the reality has been ... underwhelming. That ten million dollar investment in high volume systems resulted in a one million dollar total (not net) return over the course of a couple of years. That data lake sits underutilized, because developers continue working with MySQL databases. They're faster. NoSQL has become "well, sort of kind of SQL". We've built multiple social media platforms that nobody is using, because the top spots on people's phones are already taken.
Those data scientists? They spend most of their time cleaning up the same crap data that they did before they became rock stars, just at a higher salary. Moreover, they can tell you things like how the data clusters, and that you have a 46% chance of making a 1.5% profit this quarter, a 38% chance of making a 0.5% profit, and a 16% chance of taking a 3% loss. Yet when you take that 3% loss, they get fired, and the remaining data scientists very quietly readjust their models accordingly.
This is not the fault of programmers or data scientists, and arguably not completely the fault of system architects (though they shoulder some of the blame). They are in fact doing what they are being asked to do. A bigger part of the responsibility for this is directly due to business managers who do not understand what should be the central concept of databases ... the notion of context.
Here There Be Dragons
Context is a remarkably subtle idea, one that is perhaps so subtle that it bypasses people who work with information every day. Context is the set of assumptions and relationships that information has based upon the environment around it. It determines uniqueness, categorization, what relationships are important and what aren't, what metrics are used and so forth.
A good example of a context problem was the Y2K problem, or the fast-approaching Y2.038K problem. Many database designers from the 1960s to the mid-1990s made a context assumption: that they were working in the twentieth century, and so could assume that the first two digits of a year would always be "19". Of course, as the twenty-first century loomed, someone realized that if something wasn't done, a lot of people were going to suddenly find themselves having been born in the future. This made COBOL programmers very rich at the end of the twentieth century.
A similar problem will happen in 2038, by the way. On most computer systems (and the databases written for them), dates are stored as the number of seconds since the start of 1970. As it turns out, many databases until a few years ago allocated a signed 32-bit integer for this, which leaves 31 bits for the magnitude (the remaining bit holds the sign). Two to the thirty-first power is about 2.147 billion seconds, or roughly sixty-eight years, after which a 32-bit date overflows and goes negative. 1970 + 68 is, you guessed it, 2038, so in about twenty years, any legacy databases still storing dates in 32 bits are going to go very wonky.
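The rollover is easy to check directly; a minimal sketch using only Python's standard library:

```python
from datetime import datetime, timezone, timedelta

# A signed 32-bit time_t overflows at 2^31 - 1 seconds past the Unix epoch.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
rollover = epoch + timedelta(seconds=2**31 - 1)
print(rollover.isoformat())  # 2038-01-19T03:14:07+00:00

# One tick later, the counter wraps to -2^31, which lands back in 1901:
wrapped = epoch + timedelta(seconds=-(2**31))
print(wrapped.isoformat())  # 1901-12-13T20:45:52+00:00
```

The exact moment of failure, 03:14:07 UTC on January 19, 2038, falls out of the arithmetic above.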
The context here is subtle: the representation of the data in a specific field was only valid within a certain range. This is why contextual problems are so hard to resolve; they are all too often at a deep or subtle enough level that even realizing there is a problem in the first place can be tricky. In retrospect, anyone could have seen it, but in practice, most people do not challenge their assumptions this deeply. They lack the time, expertise, or desire to do so, assuming that these kinds of problems are "edge cases". Indeed, an edge case is simply a context assumption that the problem won't bite the programmer while they still have a job. After that, it's someone else's problem.
A similar problem comes about due to identifiers. Most relational databases (and even many NoSQL databases) make use of numeric keys to identify specific resources. Early on, database administrators or programmers used sixteen-bit integers for those keys, meaning that so long as a given table never exceeded 65,536 rows, you were golden. After that, the counter would wrap back around, and suddenly certain queries would begin returning duplicate entries. So these were switched to long integers, because after all you'd never have two billion entries in a given database, right? Twitter produces that many entries in a day. Now, developers have finally begun working with UUIDs, which have two to the 128th power possible values, or roughly 340,000,000,000,000,000,000,000,000,000,000,000,000 potential values.
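The sixteen-bit wraparound is easy to simulate; a hypothetical sketch in Python, where the bitmask stands in for a fixed-width key column:

```python
def next_key16(key: int) -> int:
    """Increment a key, wrapping at 16 bits the way a fixed-width field would."""
    return (key + 1) & 0xFFFF  # keep only the low 16 bits

key = 65535            # the last distinct value an unsigned 16-bit field can hold
key = next_key16(key)
print(key)             # 0 -- the counter has wrapped; new keys now collide with old ones
```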
Now, the chance of a collision here (where two different resources are assigned the same GUID) is not zero, nor is it exactly one in 3.4x10^38. Many GUIDs are based upon specific randomizer algorithms, and creating a truly random seed is actually one of the hardest problems in computer science. However, the probability of a collision is small enough that for all but the most specialized of cases, the numbers will be unique.
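In practice that looks like the sketch below, using Python's standard uuid module. (A random version-4 UUID actually draws 122 random bits, since six bits are fixed by the version and variant fields, which does not change the practical conclusion.)

```python
import uuid

# Generate a batch of random (version 4) UUIDs and check for duplicates.
ids = {uuid.uuid4() for _ in range(100_000)}
print(len(ids))  # a collision in a set this size is astronomically unlikely
```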
Again, this identifier problem is contextual. A numeric identifier is only unique within a given database: the number assigned to a given book in a publisher's database will (most likely) be different from the one in a bookseller's database. The number is meaningful only if the context is fully qualified: publisherX:1953302 is different from bookseller:1953302, but may refer to the same book as bookseller:3295322. Master data management works upon the assumption that two books can be deemed the same if they happen to share an agreed-upon third identifier (say, an ISBN), or may be deemed probably the same if additional contextual evidence agrees (the book's publisher is the same, the author(s) are the same, the title is the same, and the version is the same).
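One way to make that context explicit is to carry the issuing database along with the number. A hypothetical sketch (the namespaces and numbers are illustrative, not from any real system):

```python
from typing import NamedTuple

class QualifiedId(NamedTuple):
    """An identifier paired with the context (the database) that issued it."""
    namespace: str
    local_id: int

a = QualifiedId("publisherX", 1953302)
b = QualifiedId("bookseller", 1953302)

print(a == b)  # False -- same local number, but different contexts
print(b == QualifiedId("bookseller", 1953302))  # True -- same context, same number
```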
Now, even the first condition above is not a guarantee - someone may have mis-keyed an entry. It happens. A lot. This means that in order to establish that two books are "probably" the same, you need to do forensic analysis: analysis of information after the fact.
Now, forensic analysis is hard in traditional databases, because in order to make the comparison, you need BOTH databases instantiated at the same time on the same server, and you then have to make sure that the properties of each table are equivalent. In the real world, that never happens. Ever. In very rare cases, standards are set up (and rigorously enforced) that identify common schemas, but even there, the likelihood that such schemas are consistently validated and constrained is virtually nil ... and that's a best-case scenario that I, in thirty-plus years of working with both structured SQL data and semistructured XML and JSON data, have never actually seen.
Most data scientists would love to spend all their time doing statistics and generating cool-looking reports. In reality, most data scientists spend the bulk of their time (90%+) attempting to unify data from disparate sources, cleaning it up, mapping it with some kind of transformation, then saving it out into a different data format that matches their working tools - a process known in the database community as Extract/Transform/Load (ETL). Of that, the extraction and loading processes are straightforward. The transformation process is ... not.
There are many reasons for this. Some are structural: one modeler may represent a user name as a single field, where others may represent it in two or three. One may store the name first-then-last, another as "last, first". These mappings require knowing how to write them in some programming language. Others may have revenue projections by month rather than by quarter, and it may not be obvious that this is the case. One source may define revenue as gross revenue, a second as net revenue. One may use FASB standards, others Basel III. These differences are contextual, and often may not even be known to the programmer at the time (most programmers have only a very limited understanding of accounting unless they have spent a long time in the financial sector).
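As a concrete instance of the structural case, here is a minimal sketch mapping a "last, first" name from one source onto the two separate fields another source expects (the field names are assumptions for illustration, not a standard):

```python
def split_last_first(name: str) -> dict:
    """Map a 'Last, First' string onto separate name fields."""
    last, _, first = name.partition(",")
    return {"first_name": first.strip(), "last_name": last.strip()}

print(split_last_first("Le Guin, Ursula"))
# {'first_name': 'Ursula', 'last_name': 'Le Guin'}
```

Even this trivial mapping embeds contextual assumptions - a single comma, no suffixes, no mononyms - which is exactly why transformation is the hard part of ETL.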
Teasing out the metadata that represents these contextual underpinnings takes time, and frequently involves injecting new data into the system to better establish the context. Thus, even if the same entity is identified in two different databases, the data each row contains will likely be very different.
This is one reason why open data, usually governmental in origin, is so important. That data doesn't (necessarily) give you the context, but if you work from it, it guarantees that you are sharing the same context, whether it is stated or not. Unfortunately, due to one of those larger contractions that occur at the end of large-scale periods of growth, countries are becoming increasingly nationalistic and disinclined to share that data, which means that the easy solution of having common data sources is disappearing. Companies will therefore need to focus more on building contextualization than they have in the past.
Semantic data stores represent one potential solution. These are a form of database that allows you to encode not only the data, but also the rules that identify the logical model of that data, in a consistent fashion. This makes master data management easier (multiple "graphs" can be stored and worked on simultaneously), and such a store can also hold additional information, such as the properties of relationships and common controlled vocabularies. Semantic stores make cross-graph comparisons much easier to do, and they make it possible to store new kinds of information without restructuring the existing database to do so.
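To make the idea concrete, here is a toy sketch in plain Python (not a real triple store, and the vocabulary prefixes are invented): facts live as subject-predicate-object triples, so a new kind of information is just another triple, with no table to restructure.

```python
# Facts as subject-predicate-object triples:
triples = {
    ("book:1953302", "ex:title", "Some Title"),
    ("book:1953302", "ex:publishedBy", "publisherX"),
    # A new kind of fact requires no schema change -- just another triple:
    ("publisherX", "ex:locatedIn", "New York"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

print(match(s="book:1953302"))  # every fact held about that book
```

Real triple stores add persistence, indexing, named graphs, and inference on top of this pattern, but the triple itself is the whole data model.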
What's more, once stored in this fashion, the data can be queried using a common query language called SPARQL or, in many cases, in a more SQL-like manner through a bridge (which increasingly is built into the store). Because SPARQL data is (mostly) normalized, it can map easily to SQL databases and can also map to denormalized structures such as JSON or XML. This is especially true as JSON works towards exposing a schema language of its own (Swagger, now known as OpenAPI), even as RDF (the generic framework for triple store databases) has both OWL and the recently established SHACL language, to match the XML Schema language (XSD).
This is a critical development, because data instances are notoriously difficult to reverse-engineer into a centralized schema. In JSON, for instance, a structure keyed by object identifiers is difficult to distinguish from a generic object, while an array of objects may carry no context information about what those objects are. Establishing schema languages across all three formats makes interchange (and mapping) a MUCH simpler problem. I started working with interchange issues for Microsoft's BizTalk server in 1997, and they have remained a major thorn for the majority of my professional working life.
The remaining challenge, in that respect, is, ironically, SQL. SQL is challenging because it has only a limited ability to be self-referential, it cannot handle inheritance well, and it generally cannot, within a query, reference information about its own schema - which means that SQL context usually exists only in the heads of a limited number of application developers. SQL also does not handle distributed data well. There are signs that serialization standards in particular are emerging, but there's a huge installed SQL/RDBMS base for which serialization remains the responsibility of the programmer, not the data store.
Serialization is the flip side of context. When you serialize information, you are usually taking a subset of that data, filtering it by some matching algorithm, and often renormalizing it (folding hierarchies). Yet the reality of most data is that it has multiple axes of connectivity, or mechanisms for categorization. Context is almost always what mathematicians refer to as a graph. When you analyze the resources described by that graph, especially when you incorporate categorization into the mix, you are choosing a bias about which connections to break and which categories (and how they are defined) are important.
There's a Hole in the Bucket
Ontologists, taxonomists, machine learning specialists and data scientists spend a huge amount of their time determining how data is categorized, because computationally it is usually much easier to work with discrete groups than with a continuum. It is how the brain handles much of its own processing.
A book or movie can be modeled on the basis of its genre - that is, as a cluster of plot elements that are frequently found in general proximity and that appeal to a certain market demographic. Does it have robots in it? It's science fiction. Does it have unicorns in it? It's fantasy. The problem comes when it contains both robots and unicorns (e.g., Blade Runner), or vampires and detectives, or any of a myriad other hybridizations. Ultimately, one of the biggest challenges in any modeling is that the real world is very seldom as orthogonally reducible as we'd like to believe, especially when dealing with conceptual rather than physical characteristics.
Context plays a part here as well: our choices of categorization are often modeled upon our own preconceptions, regardless of the real world. Gender, employment, race, education, industry, generational cohort, social class - all of these are arbitrary, defined largely by the bias of one organization or another, often to the detriment of those who fall through the cracks. Is a writer who gets paid through royalties employed? What about someone who is independently wealthy and writes books to influence others? Both are doing the same activity; all that has changed is the nature of the monetary transaction.
Semantic systems are generally better suited to capturing these nuances, because categorization can be defined inferentially. Machine learning can also be effective in this regard, because it has little fundamental bias in choosing clusters and so can identify when resources cluster, but it still becomes necessary to identify why those clusters occur. If you feed an image recognition system pictures of red squirrels and then ask it whether a gray squirrel is a squirrel, it may answer negatively. The system needs training, which in turn means that the selection of the training set is again a contextual constraint that introduces bias into the categorization.
Semantic (contextual) systems make it possible to capture the context of such categorization, to examine the rules (which may themselves be inferred) that determine why a given category exists, and that in turn can create subtle variations of classifications that more accurately represent the information about resources.
I like to think of context as a map with a marker saying "you are here". The marker by itself has no meaning; it is only when placed within the map that the marker means something. You need both marker and map to know where you are. Or as Buckaroo Banzai once said (in turn quoting Confucius), "No matter where you go ... there you are."