Thursday, December 23, 2010

Linked Data revisited: What I learned, what we created, and what's next

(Updated with concluding thoughts about the class and Semantic Web/Linked Data at end of post) You may remember a brief preview at the beginning of the fall semester of my Linked Data Ventures class, which was taught by Tim Berners-Lee. In the months since that post, we really rolled up our sleeves and got into the concepts and languages that support the Semantic Web -- and also created real applications and business ideas based on Semantic Web/Linked Data.

TBL taught some of the classes, but we also had some great technical sessions with Lalana Kagal and Ian Jacobi from MIT's CSAIL as well as business sessions with Reed Sturtevant and Katie Rae. Another organizer for the class was K. Krasnow Waterman, a 2006 MIT Sloan Fellow who told me about the history of the Linked Data Ventures Class when I met her at an alumni reception in New York earlier this month.

In addition, nearly every week we had guest speakers who work with these technologies or develop companies based on Linked Data, including OpenCalais; Jim Hendler, an RPI faculty member who has worked on the federal government's linked data initiatives; and numerous startup founders.

But what I wanted to show in this post was a summary of what we learned, from the point of view of someone who started the class with only a vague understanding of what the Semantic Web was. Here are some examples from my homework assignments for 6.898 in the early part of the semester (Note: There may be mistakes!). At the end of the post, I offer some concluding thoughts about the class and the broader SemWeb ecosystem.

Linked Data Assignments

Assignment 2 was a circles-and-arrows diagram. The goal was to get us to think about the relationships described in a paragraph of text in terms of subject-predicate-object "triples". Here's the assigned text:
Joe Lambda, a 25-year-old man, has a FOAF file. Joe has an AIM account "jlambda", and a Jabber account "", which is also his e-mail address. Joe is a graduate student at Foobar University, a university in Cambridge, Massachusetts (42.373611°N, 71.110556°W), the homepage of which is located at "".

Joe Lambda has two friends, Bill Foo and G. Baz. Normally, Joe lives in Somerville, Massachusetts (42.3875°N, 71.1°W), a city that borders Cambridge, with Bill. G. Baz is their neighbor. Joe, Bill, and G. have a number of different interests, but are all interested in Linked Data. Joe is also interested in Astronomy, and Cricket, Bill also enjoys American Literature and Baseball, and G. is interested in the TV show Arrested Development and Hockey.
And here's the diagram:

[Image: circles-and-arrows diagram for assignment 2]
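To make the triple idea concrete, here is a minimal Python sketch that writes a few facts from the assigned text as (subject, predicate, object) tuples. The short names such as "knows" and "interest" are illustrative labels of my own, not real vocabulary terms, and the two helper functions are hypothetical.

```python
# Each fact from the assigned text becomes one (subject, predicate, object)
# tuple. The predicate names are illustrative labels, not real vocabulary terms.
triples = [
    ("Joe", "name", "Joe Lambda"),
    ("Joe", "age", 25),
    ("Joe", "knows", "Bill"),
    ("Joe", "knows", "G"),
    ("Joe", "livesWith", "Bill"),
    ("Joe", "interest", "Linked Data"),
    ("Joe", "interest", "Astronomy"),
    ("Joe", "interest", "Cricket"),
    ("Bill", "interest", "Linked Data"),
    ("Bill", "interest", "Baseball"),
    ("G", "interest", "Linked Data"),
    ("G", "interest", "Hockey"),
]


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`, in list order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject that has `predicate` pointing at `obj`."""
    return [s for s, p, o in triples if p == predicate and o == obj]


print(objects("Joe", "interest"))           # ['Linked Data', 'Astronomy', 'Cricket']
print(subjects("interest", "Linked Data"))  # ['Joe', 'Bill', 'G']
```

Each circle in a diagram like this corresponds to a subject or object, and each arrow to a predicate.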

Then we moved on to the languages, starting with Turtle/N3, which expresses subject-predicate-object (SPO) relationships in a more human-readable format than the XML-based RDF syntax. A brief, imperfect sample, based on the text from assignment 2, above:

@prefix ex: .
@prefix dbp: .
@prefix sws57: .
@prefix sws72: .
@prefix sws26: .
@prefix foo: .
@prefix rdf: .
@prefix rdfs: .
@prefix foaf: .
@prefix gn: .
@prefix rel: .
@prefix geo: .
@prefix vivo: .
@prefix xsd: .

ex:me foaf:interest dbp:Cricket.
dbp:Cricket rdfs:label "Cricket"@en.
ex:me foaf:name "Joe Lambda"@en;
foaf:age "25"^^xsd:int;
foaf:gender foaf:male;
foaf:aimChatID "jlambda";
foaf:mbox "";
foaf:schoolHomepage foo:;
foaf:based_near sws57:;
rel:livesWith [rel:livesWith ex:me;
rdf:type foaf:Person;
foaf:based_near sws57:;
foaf:name "Bill Foo";
foaf:interest dbp:Baseball;
foaf:interest dbp:Linked_Data].
dbp:Linked_Data rdfs:label "Linked Data".
dbp:Baseball rdfs:label "Baseball".
foo:about#university foaf:homepage foo:;
rdf:type vivo:University;
rdfs:label "Foobar University";
foaf:based_near sws72:.
sws72: rdfs:label "Cambridge"@en;
geo:lat "42.373611"^^xsd:decimal;
geo:long "-71.110556"^^xsd:decimal;
gn:parentADM1 sws26:;
rdf:type gn:Feature;
gn:neighbour sws57:.
sws26: rdfs:label "Massachusetts"@en;
rdf:type gn:Feature.
sws57: gn:neighbour sws72:;
rdf:type gn:Feature;
gn:parentADM1 sws26:;
geo:lat "42.3875"^^xsd:decimal;
geo:long "-71.1"^^xsd:decimal;
rdfs:label "Somerville"@en.
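Prefixed names such as foaf:name in a Turtle listing are shorthand for full IRIs: the prefix is replaced by the namespace it was bound to. The Python sketch below illustrates that expansion. The FOAF and RDFS namespaces shown are the standard published ones; the namespace for ex is a made-up placeholder, and the expand function is a hypothetical helper, not part of any library.

```python
# Prefix-to-namespace bindings, playing the role of @prefix declarations.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/",  # hypothetical placeholder namespace
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name like 'foaf:name' into a full IRI string."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return prefixes[prefix] + local


print(expand("foaf:name"))   # http://xmlns.com/foaf/0.1/name
print(expand("rdfs:label"))  # http://www.w3.org/2000/01/rdf-schema#label
```

Because two graphs that use foaf:name expand to the same IRI, they are talking about the same property, which is what makes vocabularies reusable across datasets.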

We also designed our own ontologies, which define the terms, relationships, and other Semantic Web concepts for a given topic area. RDF and Turtle/N3 graphs can then reuse those ontologies (this is what the @prefix lines in the Turtle sample above refer to). For assignment #4, we had to create an ontology for top-level biology definitions. Mine looked like this:

@prefix owl: .
@prefix xsd: .
@prefix rdfs: .
@prefix rdf: .

owl:Class rdfs:subClassOf rdfs:Class .

Eukaryote a owl:Class.
[ a owl:Restriction;
owl:onProperty cell;
owl:allValuesFrom CellWithNucleus ].

NonEukaryote a owl:Class.
[ a owl:Restriction;
owl:onProperty cell;
owl:allValuesFrom CellNoNucleus ].

LivingThing a owl:Class;
owl:unionOf ( Eukaryote NonEukaryote ) .
NonLivingThing a owl:Class.
LivingThing owl:complementOf NonLivingThing.

CellWithNucleus a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty nucleus ] .

CellNoNucleus a owl:Class.
[ a owl:Restriction;
owl:cardinality "0"^^xsd:nonNegativeInteger;
owl:onProperty nucleus ] .

CellWithNucleus owl:complementOf CellNoNucleus.

cell rdf:type rdf:Property;
rdfs:domain LivingThing.

nucleus rdf:type rdf:Property;
rdfs:domain CellWithNucleus.

Species rdfs:subClassOf LivingThing.

speciesName rdf:type rdf:Property;
rdfs:domain LivingThing;
rdfs:range Species.

datedescribed rdfs:subPropertyOf speciesName;
a owl:DatatypeProperty;
rdfs:range xsd:date;
rdfs:domain Species.

describername rdfs:subPropertyOf speciesName;
rdfs:domain Person.

Animal a owl:Class,
[ a owl:Restriction;
owl:minCardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Animal owl:intersectionOf ( Eukaryote Species ).

HasTail rdfs:subClassOf Animal.
HasLegs rdfs:subClassOf Animal.
LeggedTailedAnimal a owl:Class;
owl:unionOf ( HasTail HasLegs ) .
numberOfLegs a owl:DatatypeProperty;
rdfs:domain HasLegs;
rdfs:range xsd:integer .

Fungi a owl:Class,
[ a owl:Restriction;
owl:minCardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Fungi owl:intersectionOf ( Eukaryote Species ).

Plants a owl:Class,
[ a owl:Restriction;
owl:minCardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Plants owl:intersectionOf ( Eukaryote Species ).

Bacteria a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Bacteria owl:intersectionOf ( NonEukaryote Species ).

Archaea a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Archaea owl:intersectionOf ( NonEukaryote Species ).

Protists a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Protists owl:intersectionOf ( Eukaryote Species ).

… But unfortunately it did not map too well to the ideal solution that we were shown after we handed it in. Creating a model of these relationships depends heavily on logic as well as an understanding of the capabilities of OWL, the language that ontologies are written in.
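As a rough illustration of the kind of logic involved, the Python sketch below follows subclass links transitively, which is one of the simplest inferences a reasoner draws from an ontology. It is a simplification under stated assumptions: the class names come from my ontology above, the Animal-to-Eukaryote and Animal-to-Species links are my reading of the intersectionOf statement, and the superclasses function is a hypothetical helper. Real OWL reasoning covers far more than this.

```python
# Direct "is a subclass of" links. HasTail, HasLegs and Species use the
# explicit rdfs:subClassOf statements; the Animal links are an assumed
# reading of "Animal owl:intersectionOf ( Eukaryote Species )".
SUBCLASS_OF = {
    "HasTail": {"Animal"},
    "HasLegs": {"Animal"},
    "Animal": {"Eukaryote", "Species"},
    "Species": {"LivingThing"},
}


def superclasses(cls, subclass_of):
    """Return every class reachable from `cls` by following subclass links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Anything that is a HasTail is also an Animal, a Eukaryote, a Species
# and a LivingThing.
print(sorted(superclasses("HasTail", SUBCLASS_OF)))
```

A mistake in one link, such as making Animal a subclass of Species, silently propagates to everything below it, which is part of why modeling these relationships is harder than it looks.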

Finally, we learned the Semantic Web query language, SPARQL. I had taken a SQL class years ago at the Boston College Woods College of Advancing Studies, and that experience was a good introduction to SPARQL, which basically involves generating new graphs of data from existing triples in a very SQL-like manner.

The following SPARQL example, which was shown to us in the class lab, lists countries from a triplestore based on the CIA World Factbook and restricts the results to countries whose population or total area exceeds a given threshold:

PREFIX factbook:
SELECT ?country ?population_total ?area
WHERE {?country factbook:population_total ?population_total .
?country factbook:name ?country_name .
?country factbook:area_total ?area .
FILTER (?population_total > "5000000"^^xsd:long || ?area > "500000"^^xsd:long ) . }
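To show what a query like this does, here is a small Python sketch that applies the same pattern to an in-memory list of triples. The country identifiers and all the numbers are made up for illustration and are not taken from the CIA World Factbook; select_countries is a hypothetical helper, not how a real triplestore evaluates SPARQL.

```python
# Hypothetical triples: (country, property, value). All values are invented.
triples = [
    ("A", "name", "Country A"),
    ("A", "population_total", 8_000_000),
    ("A", "area_total", 40_000),
    ("B", "name", "Country B"),
    ("B", "population_total", 300_000),
    ("B", "area_total", 900_000),
    ("C", "name", "Country C"),
    ("C", "population_total", 1_000_000),
    ("C", "area_total", 20_000),
    ("D", "name", "Country D"),
    ("D", "population_total", 9_000_000),  # no area_total triple
]


def select_countries(triples, min_population=5_000_000, min_area=500_000):
    """Mimic the WHERE patterns and FILTER: a country needs a population,
    a name and an area, and must exceed at least one of the thresholds."""
    populations = {s: o for s, p, o in triples if p == "population_total"}
    names = {s: o for s, p, o in triples if p == "name"}
    areas = {s: o for s, p, o in triples if p == "area_total"}
    results = []
    for country, population in populations.items():
        # Every triple pattern must match, as in the WHERE clause.
        if country not in names or country not in areas:
            continue
        area = areas[country]
        # The FILTER uses ||, so either condition is enough.
        if population > min_population or area > min_area:
            results.append((country, population, area))
    return results


print(select_countries(triples))
# [('A', 8000000, 40000), ('B', 300000, 900000)]
```

Country D is dropped even though its population is large, because it has no area triple to match the third pattern.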

But the class wasn't just about learning these languages and concepts. For the second half, we were tasked with forming teams and developing an actual application and business model built on Linked Data. The instructors for this segment were Reed Sturtevant and Katie Rae, but we got a lot of feedback from Tim Berners-Lee, Lalana Kagal and Ian Jacobi during the practice demo in late November. Startup founders and angels gave us some additional feedback on demo/pitch day on December 7. Our team consisted of two Sloan Fellows and an undergrad Computer Science/Media Lab student. We ended up creating a neat little educational app that teaches kids about different countries. You can see a brief demo in the following video (scroll ahead about two or three minutes to see it):

The winner of the demo contest was a neat restaurant review/location service. The people on the team seemed pretty serious about taking it to the next level, so we'll see how that progresses over the spring.

Future of the Semantic Web

There is also the question of the future of the wider Semantic Web/Linked Data world. For ten years people have been talking about the potential of the technology, and there have certainly been a slew of tools, projects, apps, and datasets made available. But there are also some limitations to the Semantic Web/Linked Data, as our study group found out when we were designing our mobile educational application. Performing live queries to the Web was a no-go, owing to the slow response time, and many of the datasets (including the widely used DBpedia graph) were inconsistent or had other flaws.

Yod, Mads and I went to TBL after our December 7 demo to discuss the "curation problem," and he offered some interesting suggestions. For instance, in choosing the best photos from flickrwrapper for the "places" part of the geography app, we could add some geocoded logic to find the best light/positioning (300 meters west of the object at a certain time of the day) and employ some to-be-determined algorithm or AI to "make sure Aunt Jenny isn't in the frame". He also suggested leveraging Google to programmatically derive the semantic meaning of certain terms that have additional definitions beyond geography. But the idea of using existing Linked Data, standard queries and ontologies without extensive programmed/human curation is just a dream ... at least for the time being.

Beyond the technical issues, there is also the lingering question of what sorts of killer apps might be derived from the Semantic Web. I think a key reason the 6.898 class exists is to help launch more Semantic Web-based startups, open-source tools, and new datasets, in the hope that one or more of these efforts will spark a truly innovative or ground-breaking app that moves LD and the Semantic Web into the mainstream in a highly visible way. I don't know if our educational app or the others from the class will move beyond the prototype phase, but there has been a lot of serious talk in our class about using these and other ideas as the basis of new ventures once we finish. I've been thinking about how the Semantic Web could vastly improve many common data-driven genealogy or history applications (areas which I have written about for years -- see "Google/ followup: Using outsourced Chinese labor to overcome OCR limits" and "Making a case for quantitative research in the study of modern Chinese history: The Xinhua News Agency and Chinese policy views of Vietnam, 1977--1993"), and over the next few months will do some additional research and reach out to people at MIT and elsewhere to evaluate the viability of such a venture (feel free to contact me at ian dot lamont -at- sloan dot mit dot edu if you want to discuss).

Lastly, I would like to offer my profuse thanks to K. Krasnow, Reed, Katie, Ian, Lalana and TBL for not only offering Linked Data Ventures this year, but also for making it a truly challenging and eye-opening experience. It really is one of the best classes I've had at MIT.

1 comment:

  1. The ontologies without extensive programmed/human curation is just a dream...

