
Review: Coursera Social Network Analysis class

I recently completed the Coursera Social Network Analysis class. This was my first time taking a Coursera class. In this post, I will describe my experience with Coursera generally, and review the Social Network Analysis class in particular.

Along with several of my colleagues, I took Martin Odersky’s Functional Programming Principles in Scala class at the same time. Although I finished my last assignment for that class weeks ago, the class isn’t technically complete, so I will reserve comments on it for a later date.

The Social Network Analysis class is an introductory-level class. No previous experience with social network analysis is presumed. The professor is Lada Adamic, of the University of Michigan. If you take the class, you will learn what social network analysis is, the basic mathematical techniques it uses, and a few interesting applications, and you will have the chance to use software tools such as Gephi, R, and NetLogo.

One point which is perhaps not obvious to outsiders is that “social network analysis” is actually relevant to analyzing almost any network which can be modeled as a directed or undirected graph. The class includes lectures on applying the techniques presented to networks which are not actually social networks.

There is quite a bit more math underlying real-world social network analysis than an outsider might expect. Most so-called “social media experts” would probably find it incomprehensible. However, this class really only skims the surface of the math. Students who are not “mathematically inclined” probably won’t have any problem getting through the class. Students who really like math will probably want to follow up with something more in-depth.

The class has an optional programming component. You can opt not to complete any of the programming assignments and still pass the class. However, I think the most valuable things I learned from the class came from completing the programming assignments. It’s one thing to hear about, say, a mathematical formula for determining some property of a network, but a formula read off a slide sometimes doesn’t really stick in my memory. If I actually implement the analysis in R, on the other hand, I have to think about what all of the variables mean. Also, in completing the programming assignments I had the opportunity to use some tools which I had never heard of before, like NetLogo, and others, like R, which I had heard about but never actually used.

If you do complete the programming assignments, then you will receive a “with distinction” on the certificate (a PDF, really) you receive on successful completion of the class. Coursera, however, is not at this time an accredited institution, so you should probably take this class to fulfill your love of learning, rather than because you think it will enhance your resume.

I think the programming assignments would be somewhat difficult for students without prior programming experience. None of them were difficult for me to complete, but somebody without a general sense of common programming syntax, and without real-world experience setting up tools which are not exactly novice-friendly, might be frustrated by the fairly low level of support included in the course. It is assumed that you are able to install and configure, for example, the R system, pick a text editor and a REPL, install libraries from various locations, figure out differing syntax in various systems, and deal with at least one required library which doesn’t always seem to return accurate results. No professional programmer will have a problem with any of this, but I can see several roadblocks for non-programmers. This is unfortunate, because the programming itself is not difficult, and really does help with understanding the material in the class.

Like just about any class I’ve ever taken anywhere, online or offline, there are “bugs” in the assignments. Judging by forum posts, many people seem to find this surprising, but, in my experience, it happens in traditional classrooms as well. In a few cases, a multiple-choice question had more than one correct answer. In other cases, selecting the correct answer would not give you credit. The Coursera staff would eventually fix these problems.

In contrast to the Functional Programming class, where your grade was completely determined by unit tests and static analysis against your Scala code, the Social Network Analysis class was more “traditional” in that there were multiple-choice-style homeworks and a timed final exam.

Because the class is free, the measure of its value is, “Is taking this course worth my time?” For me, the answer turned out to be a definite yes. I probably spent about 5 hours a week listening to the lectures and completing the assignments. Prior to taking the class, I spent some time reading about the subject matter. I found taking the class to be a more efficient way to really understand what the field was about. For that reason, I recommend it to anyone who needs to interact with networks, social or otherwise, and is not familiar with the field.


The Homomorphic Encryption Patent Land Rush

I noticed this morning that Google patent search returns 189 results for the query “homomorphic encryption.” I have written about homomorphic encryption in the past; it is a true mathematical breakthrough which has the potential to transform cloud computing security. But the emphasis, here, is on “potential.” There is no fully homomorphic encryption scheme which is efficient enough to be practical for real-world, general-purpose computation.

This, apparently, has done nothing to stop the patent land rush for any process which might conceivably utilize homomorphic encryption. Some of these are legitimate under the current rules of U.S. patents, even if you personally find those rules wrongheaded. For example, IBM has filed patent claims on Craig Gentry’s breakthrough algorithms.

On the other hand, it is not difficult to find patents which don’t so much describe an invention as speculate on what someone might invent in the future. It reminds me of the well-worn economics joke, “Assume a can opener…” Some of these contain amusing disclaimers, such as the following:

Currently there is no practical fully homomorphic cryptosystem, that is, there is no secure cryptosystem that allows for the homomorphic computation of additions and products without restrictions. There has been a recent contribution by Gentry that presents a cryptosystem based on ideal lattices with bootstrappable decryption, and it is been shown that it achieves a full homomorphism. Nevertheless, the authors of this method to concede that making this scheme practical remains an open problem.

Others invoke a form of rhetorical hand waving more commonly seen in freshman term papers than patent applications:

There exist well known solutions for secure computation of any function e.g. as described in [45]… The general method employed by most of these solutions is to construct a combinatorial circuit that computes the required function… It seems hard to apply these methods to complete continuous functions or represent Real numbers, since the methods inherently work over finite fields.

The “well-known” solutions, here, are both computationally impractical and covered by other patent applications. And “seems hard” is really not the right phrase to pick when you are describing a problem for which there is no known solution.

In 2004, Phillips filed a patent application for the very idea of homomorphic encryption. Never mind that no algorithm for fully homomorphic encryption existed at the time, and that Phillips did not invent the idea itself. Buried in paragraph 69 are the following weasel words:

An encryption scheme with these two properties is called a homomorphic encryption scheme. The Paillier system is one homomorphic encryption scheme, but more ones [sic] exist.

Note that Pascal Paillier does not appear to work for Phillips, that he published his encryption scheme 5 years prior to this patent application, and that Paillier encryption is only homomorphic with respect to addition.
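
To make that last point concrete: Paillier encryption is additively homomorphic, meaning that multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. Roughly, for a public key n, with E for encryption and D for decryption:

D(E(m1) · E(m2) mod n²) = (m1 + m2) mod n

There is no corresponding ciphertext operation which yields the product of the plaintexts, which is exactly why Paillier falls short of full homomorphism.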

Some approved patents venture into the absurd. Remember the string concatenation example I used in my first article on homomorphic encryption? Folks who follow my public speaking closely (hi, mom!) will remember that I sometimes joke about a startup selling cloud-based, homomorphically encrypted string concatenation (“Concatenatr”). It’s a joke, right? You can stop laughing now; SAP actually received a patent for that one.


Speaking at "Moving to Better Secure the Cloud"

I’ll be speaking at a Slashdot/Geeknet "virtual trade show" today.

Moving to Better Secure the Cloud: Governance, Risk, and Compliance Management

My presentation will be on the potential business impact on the web if an efficient, fully homomorphic encryption system is invented. I’ll be speaking sometime between 3:15 and 4:00 EST, for about 20 minutes. The target audience is CIOs.

Sorry for the short notice, but this came together at the last minute!


Ad-hoc SQL/POCO Queries in Entity Framework 4.0

Since version 4.0, the Entity Framework has had the ability to query un-mapped data and project it onto POCOs using ad-hoc SQL.

Here, for example, is how we check the current SQL Server version:

        internal class SqlVersionInfo
        {
            public string Edition { get; set; }
            public string ProductLevel { get; set; }
            public string ProductVersion { get; set; }
        }

        private static SqlVersionInfo GetSqlServerVersion(ObjectContext context)
        {
            const string sql =
             @"SELECT SERVERPROPERTY('productversion') as [ProductVersion],
                      SERVERPROPERTY('productlevel') as [ProductLevel],
                      SERVERPROPERTY('edition') as [Edition];";
            var reader = context.ExecuteStoreQuery<SqlVersionInfo>(sql);
            return reader.Single();
        }

Note that no mapping step is required — the code will just run with any ObjectContext at all.
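
A hypothetical call site might look like the following sketch, assuming an entity container named MyEntities (any ObjectContext-derived type will do) and a caller in the same class, since the helper above is private:

using (var context = new MyEntities())
{
    // Project the ad-hoc query results onto the POCO and print them.
    var info = GetSqlServerVersion(context);
    Console.WriteLine("SQL Server {0} ({1}, {2})",
        info.ProductVersion, info.ProductLevel, info.Edition);
}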


Sometimes, SELECT Really Is Broken

We got a bug report from a customer the other day; a certain query in one of our web services was giving the following error:

A column has been specified more than once in the order by list. Columns in the order by list must be unique.

Seems clear enough, except that

  1. There was no duplication in the ORDER BY. Our DBA discovered that the "problem" reference was [Project2].[C5] ASC, which is an alias, not a real DB column name. It certainly appeared only once, but removing it made the error go away. There is documentation which implies that this was an intentional change in SQL Server 2005, but:
  2. The same query worked fine on other SQL Server 2005 servers, and worked correctly in our local QA environment. But it failed on the customer’s database.
  3. We didn’t write the query; it was produced by the Entity Framework, and my experience tells me that the SQL Server provider for EF doesn’t tend to produce un-compilable SQL.

In order to isolate the problem, I ran the LINQ query in LINQPad. This allowed me to reproduce the problem outside the context of the application (useful, since the test case took a couple of minutes, whereas executing the query in LINQPad directly is almost instant), and also allowed me to switch DB servers easily.

I found a workaround — I could change the LINQ and produce SQL which would work on both servers. But I didn’t like this "fix" because I still didn’t understand the problem. I don’t want to "fix" one thing but cause another problem for some other customer. So I dug a little deeper.

Googling turned up information about the actual problem this error message is supposed to explain, but that wasn’t the case with our query. I did learn that we are not the first group to hit this problem, and that some people "fix" it by setting the database compatibility level to SQL Server 2000. But I don’t like this "fix" at all. Changing the database compatibility level would mean that I would also have to change the ProviderManifestToken in the application’s EDMX file, causing the Entity Framework to emit SQL Server 2000-compatible SQL. That would make the entire application less efficient. So I kept looking.

This Stack Overflow answer didn’t explain everything, but provided an important clue. We checked the server versions on both our local QA environment and the customer system. The customer server was running SQL Server 2005, SP 3. Our local QA environment was running SQL Server 2005, SP 4. Further experimentation showed that this was, in fact, the problem.

It’s generally true that "SELECT isn’t broken." But when you’ve ruled out every other possibility, it’s sometimes worth checking!


Testing Cross Cutting Concerns

So imagine you’ve been asked to implement the following requirement:

When a to-do list item is marked as complete, the CompletedOn date time property shall be set to the current time.

That doesn’t sound so hard to implement, but how can I test it? I don’t know exactly what the value of the CompletedOn property “should be,” since I don’t know precisely when the Complete setter will run.

Method #1: Fuzzy Logic

One option would be to see if the value is "close enough":

var todo = new TodoItem("Test item");

var testStart = DateTime.Now;
todo.Complete = true;
var testStop = DateTime.Now;

Assert.IsTrue(todo.CompletedOn >= testStart && todo.CompletedOn <= testStop,
    "CompletedOn not set to correct timestamp.");

There’s a lot to dislike about this technique. The test could fail if the system clock moves backwards while it runs, e.g., when daylight saving time ends. It’s imprecise, and clearly won’t work in cases where we need a total order across multiple timestamps.

On the other hand, it’s good enough to start writing an implementation, and with that in hand we can consider a better test.

Implementing TodoItem

Here’s a first pass at an implementation:

public class TodoItem
{
    public DateTime? CompletedOn { get; private set; }

    private bool _complete;
    public bool Complete
    {
        get { return _complete; }
        set
        {
             _complete = value;
             CompletedOn = value ? DateTime.Now : (DateTime?)null;
        }
    }
    // ... rest of class

Method #2: Ambient Context

In his excellent book, Dependency Injection in .NET (I’ll write a full review someday, but the book was just updated and I’m still reading that), Mark Seemann suggests replacing the call to DateTime.Now in the implementation with a call to a separate singleton which I can control. The implementation would change to:

public class TodoItem
{
    public DateTime? CompletedOn { get; private set; }

    private bool _complete;
    public bool Complete
    {
        get { return _complete; }
        set
        {
             _complete = value;
             CompletedOn = value ? TimeProvider.Current.Now : (DateTime?)null;
        }
    }
    // ... rest of class

…and the test becomes:

var todo = new TodoItem("Test item");
var expectedTime = new DateTime(2011, 01, 01, 1, 2, 3);
TimeProvider.Current = new OneTimeOnlyProvider { Now = expectedTime };

todo.Complete = true;

Assert.AreEqual(expectedTime, todo.CompletedOn, "CompletedOn not set to correct timestamp.");

The test is quite a bit better than the first attempt. I’m now testing for a specific value.
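
The book develops a more careful version, but here is a minimal sketch of what the ambient context might look like. The settable Now is my own shortcut; it exists only so that test doubles like OneTimeOnlyProvider can be frozen to a fixed value:

public abstract class TimeProvider
{
    private static TimeProvider _current = new DefaultTimeProvider();

    public static TimeProvider Current
    {
        get { return _current; }
        set { _current = value; }
    }

    public abstract DateTime Now { get; set; }
}

// The production provider delegates to the system clock.
public class DefaultTimeProvider : TimeProvider
{
    public override DateTime Now
    {
        get { return DateTime.Now; }
        set { throw new NotSupportedException("The system clock is read-only."); }
    }
}

// A test double which always returns the same, fixed time.
public class OneTimeOnlyProvider : TimeProvider
{
    public override DateTime Now { get; set; }
}

Note that Current is global, mutable state: a test which sets it should restore the default afterwards, or parallel test runs may interfere with one another.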

On the other hand, the readability of the code suffers somewhat. Most .NET programmers immediately know what DateTime.Now means, whereas with TimeProvider.Current.Now they have to guess. Also, developers on the team must learn to use a "non-standard" method of obtaining the current time.

Should legacy code, which might not even require a specific value of DateTime.Now for its tests, be updated to use this "new" method, just for the sake of consistency? What if that changes the performance characteristics of the code? It’s hard to dismiss this concern as "premature optimization" when we don’t want to change the functionality of the legacy code. This leaves me with the choice between a possibly (slightly) risky change to existing code and inconsistency in the code base.

At any rate, this seems like a good solution, but it’s not the only one.

Method #3: Moles

It’s pretty common to mock or stub methods not directly related to the system under test. Most mocking libraries don’t support this for static methods, but Moles, from Microsoft Research, does.

Moles is a kind of monkeypatching for .NET; you can replace just about any method at runtime.

With Moles, we stick with the original implementation of the system under test and instead change the test:

var todo = new TodoItem("Test item");
var expectedTime = new DateTime(2011, 01, 01, 1, 2, 3);
MDateTime.NowGet = () => { return expectedTime; }; // this replaces all calls to DateTime.Now

todo.Complete = true;

Assert.AreEqual(expectedTime, todo.CompletedOn, "CompletedOn not set to correct timestamp.");

Again, it’s a better test than method #1.

Is it better than #2? Well, it’s certainly different. Moles replaces all calls to System.DateTime.Now in the code under test, whether or not I’ve intended this. In my sample code, there’s only one. But in the real world the code under test is generally far more complicated.

Interestingly, the technique in use here — replacing a standard method with one we supply — is the same for Ambient Context and Moles. However, the place where we make this redirection is quite different. Although the new method is defined in the test project in both cases, the actual rewiring is done in the test with Moles, whereas with Ambient Context it’s a source code change.

Moles is clearly useful in cases where you need to change a dependency of code which is not your own. Imagine trying to test non-MVC ASP.NET code which depends on ASP.NET methods which internally refer to HttpContext.Current. Indeed, this was painful even in early versions of MVC, although not so much in MVC 3.

But that’s not the case with this code: I have a direct dependency on DateTime.Now, not an indirect dependency. So in this case I’m not forced to use something like Moles.

Moles is pre-release (though it is currently being "productized"), unsupported, and closed source. That, by itself, will disqualify it from many projects. On the other hand, it’s required for Pex, which is useful in its own right.

Method #4: Constructor Injection

We could, of course, use normal DI for the time feature:

public class TodoItem
{
    public DateTime? CompletedOn { get; private set; }

    private ITimeProvider _timeProvider;

    private bool _complete;
    public bool Complete
    {
        get { return _complete; }
        set
        {
             _complete = value;
             CompletedOn = value ? _timeProvider.Now : (DateTime?)null;
        }
    }
    // ... rest of class

…and the test becomes:

var expectedTime = new DateTime(2011, 01, 01, 1, 2, 3);
var todo = new TodoItem("Test item", new OneTimeOnlyProvider { Now = expectedTime });
todo.Complete = true;

Assert.AreEqual(expectedTime, todo.CompletedOn, "CompletedOn not set to correct timestamp.");

The test is simple and the code is clean. For this specific case it’s arguably the best option.
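
For completeness, here is a sketch of the interface and constructor the test above assumes; the parameter names are mine, and in this variant OneTimeOnlyProvider is just a plain class implementing the interface:

public interface ITimeProvider
{
    DateTime Now { get; }
}

public class OneTimeOnlyProvider : ITimeProvider
{
    public DateTime Now { get; set; }
}

// The constructor TodoItem would need:
public TodoItem(string description, ITimeProvider timeProvider)
{
    _timeProvider = timeProvider;
    // store the description, etc.
}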

However, if a significant fraction of all types start requiring a time provider constructor argument, especially if it’s mostly a transient reference to pass along to another type’s constructor, then it’s not a good choice.

Conclusions?

The "fuzzy logic" method is better than no test at all, but that’s about the best I can say for it. Moles can do things which no other method can, but that’s not required for this case. The Ambient Context method works fine here, but raises questions about how to apply it consistently. I can’t tell you which method is best to use; I’m still trying to work that out for myself! Your comments would be welcome.

Thanks to Mark Seemann for answering my questions on Twitter.


Would You Buy a Used Framework from This Tool?

I think the Web Platform Installer is a great tool, but I have to question the wisdom of its home page:

If you click on these, you see… nothing. A description would be nice. ("Application Request Routing? What’s that? EC-CUBE?")

But that’s not really the problem. The bigger problem is this: A "spotlighted installers" feature probably sounded great on the drawing board, but this tool is intended for public-facing web servers. It isn’t the App Store; public-facing web frameworks should not be impulse buys.

What should go here? I’m not certain. My first thought is security updates, but those should already come through Microsoft Update. Perhaps a link there, or a display of what is already installed, and whether it’s up to date?


Great CS Textbooks, Cheap

I’m probably late to this party, but I’ve discovered that you can find incredible deals on used CS textbooks at Amazon, especially for older editions.

For example, I recently ordered a copy of Programming Language Pragmatics, by Michael L. Scott. It’s $63 new for the hardcover or $43 on a Kindle. I got a used copy of the (somewhat older) second edition for $3 plus postage, for a total of $7. True, I don’t get the new chapter on VMs, but I can live with that. The release of the third edition totally dried up the market for the second, so there are really good deals available!

The book itself is good so far, though I haven’t read enough to give a full review. It covers substantially the same territory as the dragon book, but with more breadth on programming languages and computer architecture, and less depth on compiler internals. They’re both worth reading.


An Excuse Not to Roll Your Own Authentication Scheme

The Rails 3.1 Release Candidate announcement contained news of many new and useful features, plus these regrettable words:

has_secure_password: Dead-simple BCrypt-based passwords. Now there’s no excuse not to roll your own authentication scheme.

I will briefly provide an excuse.

"Simple BCrypt-based passwords" is a reasonable feature, but shouldn’t be mistaken for end-to-end authentication, or even a substantial subset of that problem. Web site authentication in the real world is a far harder problem than salting and hashing a password — which BCrypt does OK, as far as I know. You can prove this to yourself merely by observing that many frameworks which have correctly implemented salting and hashing have nonetheless had their authentication systems compromised by other means.

Having (correctly) validated a username and password, most web authentication frameworks use an encrypted authentication token stored in a cookie (or some other place) for future requests. This way the client (the browser) doesn’t need to remember the password or repeatedly prompt the user for it. However, once the token has been issued, having a copy of it is as good as having the password (for a short period of time, anyway — they typically expire). That’s how tools like Firesheep work on un-secured networks. If you produce or handle this token incorrectly, then you may as well not bother doing username and password authentication in the first place!

Remember last year when the ASP.NET membership provider used in large numbers of ASP.NET sites was discovered to be vulnerable to a padding oracle attack? ASP.NET, as far as anyone knows, is salting and hashing its passwords correctly. But that’s not enough to stop people from breaking into the system. The other links in the "security chain" have to be secure as well. In the case of ASP.NET, the server leaked information which was useful when attempting to break the encryption of the authentication token. This is sometimes called a side channel. Having succeeded in breaking the encryption, a client could then create a "fake" authentication token which the server would mistake for one it had issued itself. Hence, it was possible to break into a site without ever knowing the password. The authors of this exploit had previously found similar vulnerabilities in JSF and other frameworks.

All of this adds up to old news: Security is hard. Even experts in the field make mistakes that other experts in the field overlook for years. Anyone can build a system which they themselves can’t break into. The best solution for developers of sites not specifically intended to prototype new and potentially vulnerable security systems is to use proven, off-the-shelf authentication systems, and to instead put their efforts into delivering value to their end users!


Why Won’t Visual Studio Step Into This Code?

I helped another developer debug an interesting problem this morning. Let’s see if you can spot the problem. The code in question looked something like this simplified version containing only enough code to show the problem:

public void Execute()
{
    DoStuff();        // breakpoint 1
}

public IEnumerable<Coordinate> DoStuff()
{
    LaunchMissiles(); // breakpoint 2
    // rest of method here
}

Note that the result of the function DoStuff is not used by Execute. That result actually exists only for testing purposes; it’s essentially a log we use to monitor changes the method makes to external state. The unit tests in question passed, so it was clear that DoStuff worked correctly, at least in a test context. The problem was that when the code ran outside of a test context (i.e., in the real application), the DoStuff method would never run. The debugger would stop at breakpoint 1 but not at breakpoint 2 — and only in the “real” application. Similarly, attempting to step into DoStuff would not actually go into the method body. If we debugged the unit tests, the debugger would stop at both breakpoints, and the method worked.

Can you spot the bug?

Perhaps it would help if I showed more of the method:

public IEnumerable<Coordinate> DoStuff()
{
    LaunchMissiles(); // breakpoint 2
    yield return CurrentCoordinates();
}

Now do you see the bug? Remember, the unit tests pass. There is no special knowledge about our application needed to see the problem here; all of the information required to spot the bug is in the code snippets above. The problem is a code bug, not a setup or configuration issue.

Perhaps it would help if I showed you a version of DoStuff which "works."

public IEnumerable<Coordinate> DoStuff()
{
    LaunchMissiles(); // breakpoint 2
    return new List<Coordinate> { CurrentCoordinates() };
}

With this version, both the unit tests and the "real" application work correctly.

The Solution

At first glance, this might seem puzzling. I’ve changed only the last line, and both of those versions appear to do almost exactly the same thing. Why is the behavior of the breakpoint at the previous line different?

The answer is that using yield return causes the C# compiler to rewrite the entire method, not just that single line. The method body is compiled into a state machine, and the iterator returned from the “yield return” method is entirely lazy: it will not run the method body at all until you attempt to iterate the result of the method. But Execute ignores that result, so the method body never runs at all.
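
You can see the laziness directly: if Execute actually enumerated the result, breakpoint 2 would be hit again. A sketch (though, as the discussion below suggests, enumerating purely for side effects is exactly the kind of thing to avoid):

public void Execute()
{
    // Enumerating forces the compiler-generated iterator to run
    // the body of DoStuff, side effects and all.
    foreach (var coordinate in DoStuff())
    {
        // The coordinates themselves are ignored here.
    }
}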

Discussion

Some languages, like Haskell, go to great lengths to segregate expressions and side effects. C# isn’t one of them, but even so it’s common to try to improve quality by doing so. Eric Lippert, a member of the C# compiler team, once wrote:

I am philosophically opposed to providing [an IEnumerable<T>.ForEach() method], for two reasons.

The first reason is that doing so violates the functional programming principles that all the other sequence operators are based upon. Clearly the sole purpose of a call to this method is to cause side effects.

The purpose of an expression is to compute a value, not to cause a side effect. The purpose of a statement is to cause a side effect.

Side effects can change the value of an expression in mid-computation, which is problematic for debugging and quality, especially when some of the evaluations are lazy. But as this example demonstrates, the opposite is also true: adding expressions to a computation can change the side effects, too.

