What We Talk About When We Talk About Structured Data (part 1)

August 18, 2010 at 3:59 pm

Still trying to find that needle...

Sometimes it feels like my job consists wholly of talking about structured data. I give definitions and abstract examples, but it’s not easy to help people understand its real benefits. And so, as I try to find the best way to get my point across, I often think of real-world examples of how structured data is useful in our daily lives and why it tends to be preferable to unstructured, narrative text.

With that in mind – let’s go out to the ball park!

THAT's the Chicago way!

Baseball. America’s Pastime. The Sweet Science. City of Lights. The Ghost and the Darkness.

Okay, I’m not the biggest baseball fan. I will watch the playoffs and the World Series. But I do know that a lot of baseball is about numbers: strikeouts, home runs, stolen bases, RBIs, wins, losses, saves, ERAs, pitches thrown. Each of these plays a crucial role in determining how well your team is doing or how a player is perceived by his fans. And, following theories like those laid out in Moneyball, these metrics can even determine who will be signed to your favorite team.

So what does this have to do with structured data? Glad you asked!

Structured data is present even in the land of the Jock! For example, ESPN’s site has a written account of the August 4, 2010 game between the Cleveland Indians and the Boston Red Sox. Here’s an excerpt:

Andy Marte hit a three-run homer in a five-run seventh inning, and Jayson Nix homered off the Fisk Pole to give Cleveland its fourth win in five games. The Indians took advantage of three Boston errors to score seven unearned runs, including all five in the seventh when they turned a 4-1 game into a 9-1 blowout.

The Indians, who put catcher Carlos Santana and designated hitter Travis Hafner on the disabled list this week, improved to 12-8 since the All-Star break and moved out of last place and into a tie for fourth. Nix was at DH on Wednesday, and Lou Marson went 1-for-4 at catcher.

“A lot of them are happy to be here and see the ballpark and all that,” manager Manny Acta said. “They’ve got an opportunity to impress us over the next two months.”

Masterson was the winning pitcher for Boston in Game 5 of the 2008 AL championship series when the Red Sox staged the biggest postseason rally since 1929, coming back from a 7-0 deficit to beat the Tampa Bay Rays. With Cleveland this year, he is 2-0 with a 0.64 ERA against Boston and 2-10 with a 6.06 ERA against everyone else.

Now, can you answer the following questions: What was the final score? What is Masterson’s ERA? What’s his record? How many runs did Andy Marte get? How many runs were scored in the seventh inning?

The answers are all contained in the above passage. Please note how long it took for you to read through it in order to find all of the answers. Please compare that with the following box score (again, from ESPN’s site):

Excel Spreadsheets - Not Just For Nerds Anymore

All of the answers to those questions I posed can be found in the tables on that site. For a third way of looking at the game, there’s the Play-by-Play version – in which each event is presented in compartmentalized sequences.

The same game, the same information, captured in multiple ways. Are any of these approaches better than others? Well, the obvious (and diplomatic) answer would be “no.” But that’s not true: each of these ways of capturing and communicating an event is better than the others depending on what you’re looking for in the data.

For example, if you were interested in how Ortiz’s batting average in this game compared with his overall average this season, or if you wanted to see who had the best batting average in that game, you would choose the “box score” view. The data is devoid of any subjectivity, so each figure is a pure number that is easy to compare and contrast with others in the same category. You are comparing metrics using the same language: batting average to batting average, for example.

This detachment is a double-edged sword, though, as it means that the data loses all context; if a player had been injured for half the season, his numbers would be completely out of whack compared to a healthy player’s. Lack of context also makes it hard to see who did what and when. Look at that box score: it says when the runs were scored, but not by whom; conversely, the other tables on the box score page show who scored the runs, but not when.

So this data is completely structured, pure and unified in a common language. It’s valuable to researchers but doesn’t capture every aspect of the event (the game). If you wanted to look at how well a particular player does in a given stadium, you would use this format. Or if you wanted to work out the average number of pitches Lester throws over the course of the season, tables like this would be the most helpful.
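To make the “batting average to batting average” point concrete, here is a minimal Python sketch of a box score held as structured rows. The `BattingLine` class and the `best_hitter` helper are names I made up for illustration, and apart from Marson’s 1-for-4 (which comes from the recap) the rows are placeholder numbers, not the real box score.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """One row of a box score: a player and his at-bats and hits for the game."""
    player: str
    at_bats: int
    hits: int

    @property
    def average(self) -> float:
        # Batting average for this game; 0.0 if the player never batted.
        return self.hits / self.at_bats if self.at_bats else 0.0


# Only Marson's 1-for-4 is taken from the recap.
# "Player A" and "Player B" are hypothetical placeholder rows.
game_lines = [
    BattingLine("Marson", at_bats=4, hits=1),
    BattingLine("Player A", at_bats=5, hits=3),
    BattingLine("Player B", at_bats=3, hits=0),
]


def best_hitter(lines):
    """Return the row with the highest batting average in this game."""
    return max(lines, key=lambda line: line.average)


best = best_hitter(game_lines)
print(f"{best.player}: {best.average:.3f}")
```

Because every row has the same fields, the comparison is a single `max` call. Notice what is missing, though: nothing in these rows says when a hit happened or why a player only batted three times.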

The flip side is the Recap version: a purely narrative account that is devoid of structure. This approach presents the event as a story, though not necessarily in chronological order. For example, it opens with the results of the game and then goes back into what happened. However, all of the context is present in this account; not only do we know what happened and when it happened, but we know how people reacted to it and how it all came about. For those who are interested in capturing the mindset of those involved, or those who wish to feel like “they were there,” this may be the best approach.

If you’re seeking specific information, though, this is the format that will vex you the most. If you had to answer all those questions without reading the piece first, you’d spend an inordinate amount of time hunting through each paragraph for the information you sought. Not only that, but in the excerpt above you’ll notice that sometimes numbers are spelled out as words and sometimes they are given as digits; there’s a lack of standardization within the piece. Some statistics are placed towards the beginning, while others can be found at the end. So the narrative provides a service by painting a picture from multiple angles, but at the expense of data clarity and unity. If time is not an issue and precision of data is not an issue, that is, if you are less interested in studying something or comparing figures, then narrative will be right for you.
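Here is a rough Python sketch of what “hunting through each paragraph” looks like when a program does it. The sentence is copied from the recap excerpt above; the two regular expressions are my own guesses at patterns that happen to fit this one sentence, not a general recap parser.

```python
import re

# Sentence copied from the recap excerpt quoted earlier in this post.
recap = (
    "With Cleveland this year, he is 2-0 with a 0.64 ERA against Boston "
    "and 2-10 with a 6.06 ERA against everyone else."
)

# Any decimal number immediately followed by the word "ERA".
eras = re.findall(r"(\d+\.\d+) ERA", recap)

# Any win-loss pair immediately followed by "with a".
records = re.findall(r"\b(\d+)-(\d+) with a", recap)

print(eras)
print(records)
```

The search finds two ERAs and two records, and the sentence never even names the pitcher (it just says “he”). Deciding which number answers “What is Masterson’s ERA?” still takes a human reading the surrounding paragraph, which is exactly the cost of narrative.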

Please note: arc void when discussing Inception

Lastly, there’s the mutant hybrid stepchild of these two approaches. It’s neither a sea of verbiage burying the information nor an emotionless set of numbers provided without any context. The Play-by-Play approach captures the sequence of events while cataloging the runs as they happen. The reader knows how everything unfolded and, if desired, could search each of these fields to see how many times Ortiz struck out or Marte hit a home run.

This analogy breaks down a bit here, as there’s not as much standardization in the Play-by-Play as there should be. Similar terms are used, and usually in the same fashion (example: [Player name] struck out or [Player name] walked), but there are enough subtle differences to make it harder to use as pure data. For example, I could still find who struck out by searching through the table for the word “struck,” either using a computer program or my own eyes. However, I might get thrown by all of the various additions in the sentences: struck out swinging, a ground-rule double versus a double, etc. So a superior version of this hybrid would be one that has uniform language and formatting (example: [Player name] [action] [number of pitches]), while also laying out how the events unfolded in sequence.
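The difference between searching loosely worded lines and searching uniform records can be sketched in Python. Everything here is hypothetical: the players, the wording of the free-text lines, the pitch counts, and the `Event` and `count_actions` names are all mine, written only to mirror the [Player name] [action] [number of pitches] idea.

```python
from collections import Counter
from typing import NamedTuple

# Free-text lines in the loose style described above (hypothetical wording).
free_text = [
    "Player A struck out swinging.",
    "Player B walked.",
    "Player C hit a ground rule double to deep right.",
    "Player A doubled to left.",
    "Player B struck out looking.",
]

# Searching for "struck out" works despite the trailing variations...
strikeouts_text = [line for line in free_text if "struck out" in line]
# ...but searching for "doubled" silently misses the ground rule double.
doubles_text = [line for line in free_text if "doubled" in line]


class Event(NamedTuple):
    """Uniform record: [sequence] [player] [action] [number of pitches]."""
    seq: int
    player: str
    action: str
    pitches: int


# The same five events in a uniform format (pitch counts are made up).
events = [
    Event(1, "Player A", "strikeout", 5),
    Event(2, "Player B", "walk", 6),
    Event(3, "Player C", "double", 3),
    Event(4, "Player A", "double", 2),
    Event(5, "Player B", "strikeout", 4),
]


def count_actions(events, action):
    """Count how many times each player did the given action."""
    return Counter(e.player for e in events if e.action == action)


print(len(doubles_text), "double(s) found by text search")
print(sum(count_actions(events, "double").values()), "double(s) found in records")
```

The `seq` field keeps the order of events, so the uniform version loses nothing of the “how it unfolded” that makes the Play-by-Play useful in the first place.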

Think of the Play-by-Play as a mog - half man, half dog. He's his own best friend!

When deciding how to capture data, you have to ask: what am I going to do with it?

How will this data ultimately be used?

For research into specific aspects of an event? In that case, you’ll need to ensure that the language and format are consistent throughout all documents.

For an explanation of why events happened or insight into how people think? Narrative paints a complete picture that can be absorbed and parsed over long periods of time.

For tracking when something happens, how it happened or how often it happened? Then you’ll need something with a unified language and format that’s consistent in every instance of data capture, but also maintains a sequence of events.

When you finish making your postoperative report, how do you envision the data in it ultimately being used? Is it for research? Is it for quality initiatives? Is it for posterity’s sake? Is it for legal protection?

If you answered “All of the Above,” then ask yourself: how am I ensuring that the data I capture will be put to that use in the most efficient way possible?




Disclosure Statement - The authors of this blog are paid employees of mTuitive Inc. and are compensated for their services.
