Thursday, December 23, 2010

Linked Data revisited: What I learned, what we created, and what's next


(Updated with concluding thoughts about the class and Semantic Web/Linked Data at end of post)
You may remember a brief preview at the beginning of the fall semester of my Linked Data Ventures class, which was taught by Tim Berners-Lee. In the months since that post, we really rolled up our sleeves and got into the concepts and languages that support the Semantic Web -- and also created real applications and business ideas based on Semantic Web/Linked Data.

TBL taught some of the classes, but we also had some great technical sessions with Lalana Kagal and Ian Jacobi from MIT's CSAIL as well as business sessions with Reed Sturtevant and Katie Rae. Another organizer for the class was K. Krasnow Waterman, a 2006 MIT Sloan Fellow who told me about the history of the Linked Data Ventures Class when I met her at an alumni reception in New York earlier this month.

In addition, nearly every week we had guest speakers who work with these technologies or build companies based on Linked Data, including a speaker from OpenCalais; Jim Hendler, an RPI faculty member who has worked on the federal government's linked data initiatives; and numerous startup founders.

But what I wanted to show in this post was a summary of what we learned, from the point of view of someone who started the class with only a vague understanding of what the Semantic Web was. Here are some examples from my homework assignments for 6.898 in the early part of the semester (Note: There may be mistakes!). At the end of the post, I offer some concluding thoughts about the class and the broader SemWeb ecosystem.

Assignments:

My circles-and-arrows diagram for assignment 2. The goal was to get us to think about the relationships described in a paragraph of text in terms of subject-predicate-object "triples". Here's the assigned text:
Joe Lambda, a 25-year-old man, has a FOAF file. Joe has an AIM account "jlambda", and a Jabber account "joe.lambda@example.com", which is also his e-mail address. Joe is a graduate student at Foobar University, a university in the Cambridge, Massachusetts (42.373611°N, 71.110556°W), the homepage of which is located at "http://foobar.example.org/".

Joe Lambda has two friends, Bill Foo and G. Baz. Normally, Joe lives in Somerville, Massachusetts (42.3875°N, 71.1°W), a city that borders Cambridge, with Bill. G. Baz is their neighbor. Joe, Bill, and G. have a number of different interests, but are all interested in Linked Data. Joe is also interested in Astronomy, and Cricket, Bill also enjoys American Literature and Baseball, and G. is interested in the TV show Arrested Development and Hockey.
And here's the diagram:


Then we moved on to the languages, starting with Turtle/N3, which expresses subject-predicate-object (SPO) relationships in a more human-readable format than XML-based RDF. A brief, imperfect sample, based on the text from assignment 2, above:

@prefix ex: .
@prefix dbp: .
@prefix sws57: .
@prefix sws72: .
@prefix sws26: .
@prefix foo: .
@prefix rdf: .
@prefix rdfs: .
@prefix foaf: .
@prefix gn: .
@prefix rel: .
@prefix geo: .
@prefix vivo: .
@prefix xsd: .


ex:me foaf:interest dbp:Cricket.
dbp:Cricket rdfs:label "Cricket"@en.
ex:me foaf:name "Joe Lambda"@en;
foaf:age "25"^^xsd:int;
foaf:gender foaf:male;
foaf:aimChatID "jlambda";
foaf:mbox "mailto:joe.lambda@example.com";
foaf:schoolHomepage foo:;
foaf:based_near sws57:;
rel:livesWith [rel:livesWith ex:me;
rdf:type foaf:Person;
foaf:based_near sws57:;
foaf:name "Bill Foo";
foaf:interest dbp:Baseball;
foaf:interest dbp:Linked_Data].
dbp:Linked_Data rdfs:label "Linked Data".
dbp:Baseball rdfs:label "Baseball".
foo:about#university foaf:homepage foo:;
rdf:type vivo:University;
rdfs:label "Foobar University";
foaf:based_near sws72:.
sws72: rdfs:label "Cambridge"@en;
geo:lat "42.373611"^^xsd:decimal;
geo:long "-71.110556"^^xsd:decimal;
gn:parentADM1 sws26:;
rdf:type gn:Feature;
gn:neighbour sws57:.
sws26: rdfs:label "Massachusetts"@en;
rdf:type gn:Feature.
sws57: gn:neighbour sws72:;
rdf:type gn:Feature;
gn:parentADM1 sws26:;
geo:lat "42.3875"^^xsd:decimal;
geo:long "-71.1"^^xsd:decimal;
rdfs:label "Somerville"@en.
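To make the triple idea concrete, here is a minimal Python sketch (not part of the class material) that stores a few of the statements above as plain (subject, predicate, object) tuples and follows two "arrows" in the graph. The helper names `objects` and `label_of` are made up for illustration, and the prefixed names are kept as ordinary strings rather than expanded into full IRIs, as a real RDF library would do.

```python
# Each statement is a (subject, predicate, object) tuple. Prefixed names are
# kept as plain strings here; this is an illustration, not an RDF parser.
triples = [
    ("ex:me", "foaf:name", "Joe Lambda"),
    ("ex:me", "foaf:age", 25),
    ("ex:me", "foaf:aimChatID", "jlambda"),
    ("ex:me", "foaf:interest", "dbp:Cricket"),
    ("ex:me", "foaf:based_near", "sws57:"),
    ("sws57:", "rdfs:label", "Somerville"),
    ("sws57:", "gn:neighbour", "sws72:"),
    ("sws72:", "rdfs:label", "Cambridge"),
]


def objects(subject, predicate):
    """Return every object stored for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def label_of(node):
    """Look up the rdfs:label of a node, falling back to the node itself."""
    labels = objects(node, "rdfs:label")
    return labels[0] if labels else node


# Follow two arrows: where Joe is based, then that place's neighbour.
home = objects("ex:me", "foaf:based_near")[0]
neighbour = objects(home, "gn:neighbour")[0]
print(label_of(home), "borders", label_of(neighbour))
```

The `;` shorthand in Turtle simply repeats the subject, which is why every tuple here spells the subject out in full.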

We also designed our own ontologies, which define the terms, relationships, and other Semantic Web concepts for a given topic area. RDF and Turtle/N3 graphs can then reuse those ontologies (this is what the @prefix lines refer to in the previous example). In the following example for assignment #4, we had to create an ontology for top-level biology definitions. Mine looked like this:

@prefix owl: .
@prefix xsd: .
@prefix rdfs: .
@prefix rdf: .

owl:Class rdfs:subClassOf rdfs:Class .

Eukaryote a owl:Class,
[ a owl:Restriction;
owl:onProperty cell;
owl:allValuesFrom CellWithNucleus ].

NonEukaryote a owl:Class,
[ a owl:Restriction;
owl:onProperty cell;
owl:allValuesFrom CellNoNucleus ].

LivingThing a owl:Class;
owl:unionOf ( Eukaryote NonEukaryote ) .
NonLivingThing a owl:Class.
LivingThing owl:complementOf NonLivingThing.

CellWithNucleus a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty nucleus ] .

CellNoNucleus a owl:Class,
[ a owl:Restriction;
owl:cardinality "0"^^xsd:nonNegativeInteger;
owl:onProperty nucleus ] .

CellWithNucleus owl:complementOf CellNoNucleus.

cell rdf:type rdf:Property;
rdfs:domain LivingThing.

nucleus rdf:type rdf:Property;
rdfs:domain CellWithNucleus.

Species rdfs:subClassOf LivingThing.

speciesName rdf:type rdf:Property;
rdfs:domain LivingThing;
rdfs:range Species.

datedescribed rdfs:subPropertyOf speciesName;
a owl:DatatypeProperty;
rdfs:range xsd:date;
rdfs:domain Species.

describername rdfs:subPropertyOf speciesName;
rdfs:domain Person.

Animal a owl:Class,
[ a owl:Restriction;
owl:minCardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Animal owl:intersectionOf ( Eukaryote Species ).

HasTail rdfs:subClassOf Animal.
HasLegs rdfs:subClassOf Animal.
LeggedTailedAnimal a owl:Class;
owl:unionOf ( HasTail HasLegs ) .
numberOfLegs a owl:DatatypeProperty;
rdfs:domain HasLegs;
rdfs:range xsd:integer.


Fungi a owl:Class,
[ a owl:Restriction;
owl:minCardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Fungi owl:intersectionOf ( Eukaryote Species ).

Plants a owl:Class,
[ a owl:Restriction;
owl:minCardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Plants owl:intersectionOf ( Eukaryote Species ).

Bacteria a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Bacteria owl:intersectionOf ( NonEukaryote Species ).

Archaea a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Archaea owl:intersectionOf ( NonEukaryote Species ).

Protists a owl:Class,
[ a owl:Restriction;
owl:cardinality "1"^^xsd:nonNegativeInteger;
owl:onProperty cell ] .
Protists owl:intersectionOf ( Eukaryote Species ).

… But unfortunately it did not map too well to the ideal solution that we were shown after we handed it in. Creating a model of these relationships depends heavily on logic as well as an understanding of the capabilities of OWL, the language that ontologies are written in.
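One way to see why the logic matters: even simple subclass statements let software derive facts that were never written down. The following is a rough Python sketch of that idea, not an OWL reasoner. The dictionary `sub_class_of` and the function `superclasses` are names invented for this illustration, and the last entry assumes the usual OWL reading that a class defined as an intersection is a subclass of each member of that intersection.

```python
# Direct subclass links. The first three are stated in the ontology above;
# the "Animal" entry is written by hand from its owl:intersectionOf
# definition (an assumption of this sketch, not output from a reasoner).
sub_class_of = {
    "HasTail": {"Animal"},
    "HasLegs": {"Animal"},
    "Species": {"LivingThing"},
    "Animal": {"Eukaryote", "Species"},
}


def superclasses(cls):
    """Return every class reachable from cls by following subclass links."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in sub_class_of.get(current, set()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# HasTail is never declared a LivingThing, but the chain
# HasTail -> Animal -> Species -> LivingThing implies it.
print(sorted(superclasses("HasTail")))
```

A real reasoner handles far more than this (restrictions, complements, cardinalities), which is where most of the modeling difficulty comes from.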

Finally, we learned the Semantic Web query language, SPARQL. I had taken a SQL class years ago at the Boston College Woods College of Advancing Studies, and this experience was a good introduction to SPARQL, which basically involves generating new graphs of data from existing triples in a very SQL-like manner.

The following SPARQL example, which was shown to us in the class lab, generates a list of countries from a triplestore based on the CIA World Factbook and restricts it to countries above a certain population or total area:

PREFIX factbook:
SELECT ?country ?population_total ?area
WHERE {?country factbook:population_total ?population_total .
?country factbook:name ?country_name .
?country factbook:area_total ?area .
FILTER (?population_total > "5000000"^^xsd:long || ?area > "500000"^^xsd:long ) . }
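For readers who know Python better than SPARQL, here is a hedged sketch of what the query above does: match three triple patterns per country, then keep the rows that pass the FILTER. The three countries and their figures are invented sample data, not values from the CIA World Factbook, and `value` and `select_countries` are hypothetical helper names.

```python
# Hypothetical sample data: names and figures are invented for illustration.
triples = [
    ("c:A", "factbook:name", "Country A"),
    ("c:A", "factbook:population_total", 8_000_000),
    ("c:A", "factbook:area_total", 40_000),
    ("c:B", "factbook:name", "Country B"),
    ("c:B", "factbook:population_total", 300_000),
    ("c:B", "factbook:area_total", 100_000),
    ("c:C", "factbook:name", "Country C"),
    ("c:C", "factbook:population_total", 2_000_000),
    ("c:C", "factbook:area_total", 1_500_000),
]


def value(subject, predicate):
    """Return the first object for subject/predicate, or None if absent."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def select_countries(min_population=5_000_000, min_area=500_000):
    """Mimic the SELECT: bind the three patterns, then apply the FILTER."""
    rows = []
    for country in sorted({s for s, _, _ in triples}):
        population = value(country, "factbook:population_total")
        area = value(country, "factbook:area_total")
        name = value(country, "factbook:name")
        # A country only matches if all three triple patterns bind.
        if population is None or area is None or name is None:
            continue
        # FILTER: either condition is enough (the || in the query).
        if population > min_population or area > min_area:
            rows.append((country, population, area))
    return rows


for row in select_countries():
    print(row)
```

With this sample data, Country A passes on population alone and Country C on area alone, while Country B fails both tests and is dropped.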

But the class wasn't just about learning these languages and concepts. For the second half, we were tasked with forming teams and developing an actual application and business model built on Linked Data. The instructors for this segment were Reed Sturtevant and Katie Rae, but we got a lot of feedback from Tim Berners-Lee, Lalana Kagal and Ian Jacobi during the practice demo in late November. Startup founders and angels gave us some additional feedback on demo/pitch day on December 7. Our team consisted of two Sloan Fellows and an undergrad Computer Science/Media Lab student. We ended up creating a neat little educational app that teaches kids about different countries. You can see a brief demo in the following video (scroll ahead about two or three minutes to see it):



The winner of the demo contest was a neat restaurant review/location service. The people on the team seemed pretty serious about taking it to the next level, so we'll see how that progresses over the spring.

There is also the question of the future of the wider Semantic Web/Linked Data world. For ten years people have been talking about the potential of the technology, and there have certainly been a slew of tools, projects, apps, and datasets made available. But there are also some limitations to the Semantic Web/Linked Data, as our study group found out when we were designing our mobile educational application. Performing live queries to the Web was a no-go, owing to the slow response time, and many of the datasets (including the widely used DBpedia graph) were inconsistent or had other flaws.

Yod, Mads and I went to TBL after our December 7 demo to discuss the "curation problem," and he offered some interesting suggestions. For instance, in choosing the best photos from flickrwrapper for the "places" part of the geography app, we could add some geocoded logic to find the best light/positioning (300 meters west of the object at a certain time of the day) and employ some to-be-determined algorithm or AI to "make sure Aunt Jenny isn't in the frame". He also suggested leveraging Google to programmatically derive the semantic meaning of certain terms that have additional definitions beyond geography. But the idea of using existing Linked Data, standard queries and ontologies without extensive programmed/human curation is just a dream ... at least for the time being.

Beyond the technical issues, there is also the lingering question of what sorts of killer apps might be derived from the Semantic Web. I think a key reason the 6.898 class exists is to help launch more Semantic Web-based startups, open-source tools, and new datasets, in the hope that one or more of these efforts will spark a truly innovative or ground-breaking app that moves LD and the Semantic Web into the mainstream in a highly visible way. I don't know if our educational app or the others from the class will move beyond the prototype phase, but there has been a lot of serious talk in our class about using these and other ideas as the basis of new ventures once we finish. I've been thinking about how the Semantic Web could vastly improve many common data-driven genealogy or history applications (areas which I have written about for years -- see "Google/Ancestry.com followup: Using outsourced Chinese labor to overcome OCR limits" and "Making a case for quantitative research in the study of modern Chinese history: The Xinhua News Agency and Chinese policy views of Vietnam, 1977--1993"), and over the next few months will do some additional research and reach out to people at MIT and elsewhere to evaluate the viability of such a venture (feel free to contact me at ian dot lamont -at- sloan dot mit dot edu if you want to discuss).

Lastly, I would like to offer my profuse thanks to K. Krasnow, Reed, Katie, Ian, Lalana and TBL for not only offering Linked Data Ventures this year, but also for making it a truly challenging and eye-opening experience. It really is one of the best classes I've had at MIT.

Monday, December 20, 2010

Harvard's Open Learning Initiative, and more online education criticism

The Crimson reports the Harvard Extension School is finally participating in something that I've been urging for years: Freely available, not-for-credit video lectures featuring real Harvard faculty, including Harry Lewis, Peter Bol, and Gregory Nagy. The courses are part of the Open Learning Initiative. The same Crimson article also reports on one faculty member's approach to developing for-credit online courses. I think the faculty member, Paul Bamberg, is taking the right approach to experimentation with online video. But it also puts the Extension School's degree credit policies in a questionable light (more on that further down the page). 

First, the Open Learning Initiative. Many schools have started similar efforts, including Carnegie Mellon. The idea is to provide video lectures for free to members of the public, for the sake of sharing knowledge and providing educational materials. It's a super idea that MIT spearheaded a decade ago with OpenCourseWare, and it also reminds me of Khan Academy (see my blog post "MBA Math Review" for more about this). The Extension School's offerings are currently limited to the following classes:

CLAS E-116 The Heroic and the Anti-Heroic in Classical Greek Civilization
Gregory Nagy, PhD, Francis Jones Professor of Classical Greek Literature, Professor of Comparative Literature, and Director of the Center for Hellenic Studies, Harvard University.

Kevin McGrath, PhD, Associate in Sanskrit and Indian Studies, Harvard University.

CSCI E-2 Bits

Harry R. Lewis (pictured at right), PhD, Gordon McKay Professor of Computer Science, Harvard University.

CSCI E-52 Intensive Introduction to Computer Science Using C, PHP, and JavaScript

David J. Malan, PhD, Lecturer on Computer Science, Harvard University.

ENGL E-129 Shakespeare After All: The Later Plays

Marjorie Garber, PhD, William R. Kenan Jr. Professor of English and American Literature and Language and of Visual and Environmental Studies, Harvard University.

HIST E-1825 China: Traditions and Transformations

Peter K. Bol, PhD, Charles H. Carswell Professor of East Asian Languages and Civilizations, Harvard University.

William C. Kirby, PhD, T. M. Chang Professor of China Studies, Harvard Faculty of Arts and Sciences, Spangler Family Professor of Business Administration, Harvard Business School, and Harvard University Distinguished Service Professor.

HIST E-1890 World War and Society in the Twentieth Century: World War II

Charles S. Maier, PhD, Leverett Saltonstall Professor of History, Harvard University.

MATH E-222 Abstract Algebra
Benedict Gross, PhD, George Vasmer Leverett Professor of Mathematics, Harvard University.

MATH E-102 Sets, Counting, and Probability

Paul G. Bamberg, DPhil, Senior Lecturer on Mathematics, Harvard University.

(An aside: The Extension School OLI page mentions "The course ENVR E-117 Organizational Change Management for Sustainability was removed at the request of the instructor." No reason is given.)

According to the Crimson article, Professor Bamberg is also setting up a new distance education course, Math 23b, which will launch next month. Unlike the OLI classes, Math 23b is for credit. Prof. Bamberg and some Harvard College students seem to be fully aware of the potential limitations of such a class, as Crimson reporter Vipul Shekhawat described:
In the spring, Bamberg hopes to have an Extension School Math 23b section that meets solely through video conference. Students would present proofs in video format and would not need to come to class at all. This system could allow students from around the world to enroll in Math 23 under the Harvard Extension School.

But the class itself depends on collaboration through student study groups, cooperation on problem sets, and “proof parties” accompanied by food and drink. Could this all be replicated online?

“You can’t really have an online proof party, and you especially can’t do it with refreshments,” says Bamberg. Even lecture videos might be insufficient, according to Bamberg, since the general view among faculty is that a video is no substitute for a live lecture. Students seem to agree. “I think you’re more engaged in person,” says Kyle S. Solan ’14, a student in Computer Science 50, which has video lectures. “It’s more immersive than sitting there watching the lecture on your computer.”

Regardless of the difficulties they may face, Bamberg and the Extension School will begin testing the distance-learning delivery of Math 23b in January. “You never really know what’s going to happen,” says Bamberg. “But it’s worth giving a try.”
This prompted me to write the following comment at the end of the article:
The math class is an encouraging experiment. However, I am afraid that many of the Extension School's other online offerings do not offer the promise of true collaboration with instructors and other students. Email and infrequently used discussion boards are not a substitute for live discussion, something which Prof. Bamberg and other faculty seem to acknowledge ("the general view among faculty is that a video is no substitute for a live lecture").

I applaud the Extension School for its efforts to experiment with online education through the math class and OLI, but Harvard needs to reconsider how far it wants to take for-credit distance learning through the clumsy and limiting technologies that are the norm today. Does a Harvard education now mean watching lectures on a computer screen, without being able to raise your hand or engage in a spontaneous debate with other students or faculty? I don't think so, yet the Extension School lets students take large chunks of their degree requirements using online education credit (between 50% and 90% of course credit for most degrees).

Credit policies and online education at the Extension School need to be revisited, and not just by supporters in the DCE administration -- how about an open, frank discussion about distance education at Harvard that involves FAS and Mass Hall?
Someone who responded to that comment noted that some classes do use a live discussion tool for distance education sections. I in turn responded with this:
My comment about a lack of spontaneous debate specifically referred to the lectures, which consist of prerecorded video, according to the Extension School website. The reliance on prerecorded video also favors lecture-style classes, as opposed to smaller seminar-style classes with live discussion. 
The Extension School website also says "Much of the communication between teaching staff and students takes place via e-mail and the course website, for local as well as distance students." As a former local student, I found this comment quite interesting, as very little communication between students and faculty in my classes took place via email or class discussion boards, when such options were provided. It's not hard to see why -- talking is much easier and faster than typing, and email and discussion boards will exclude people who aren't cc'ed on the message, not checking the website message threads, or deliberately not responding to messages because they are too busy or would rather not talk. I know distance students have the same frustrations with asynchronous communication technologies -- the Extension School website even features the comments of a distance education student named Dan Hilferty, who says "I still wrestle with how to create more interaction between students in the class." 
I think the Illuminate software for section discussions and other experiments taking place at the Extension School (such as Professor Bamberg's Math 23 class) are a step in the right direction, and I think it's great that you and other distance students get so much from the classes and work so hard to do it (there was actually a Crimson article four or five years ago that quoted Bamberg revealing that a few of his distance students from the Extension School actually got the highest grades, beating out the in-class Harvard College counterparts). There is definitely a place for distance education at the Extension School. But what I strongly disagree with is having prerecorded lectures of top faculty become the wholesale replacement for live classes for degree candidates. It's a fundamentally different experience that offers addictive convenience, but falls short in terms of offering students a chance to spontaneously interact with faculty and other students. FAS and the University administration need to re-evaluate credit policies for online learning based on asynchronous technologies, and closely consider what a "Harvard education" means. 

Some additional context: I've been a critic of the Extension School's online education credit policies for some time, and earlier this year had an opportunity to take a for-credit math class through the University of California, Berkeley, Extension School (see my blog post about this, "My online math class: Convenience gets an "A", but at what cost?"). The convenience was great, but many of my worst suspicions about online education using asynchronous communications were confirmed.

Research and other sources: CMU OLI website, Extension School OLI website, Harry Lewis OLI video, The Crimson, my past articles on this blog and Harvard Extended.