Skip to content

Strange Loop Crossword

I wrote a 15*15, NYT-style crossword puzzle for Strange Loop. On the NYT difficulty scale, it’s roughly a Wednesday-level puzzle. However, it was written for Strange Loop and thus does presume familiarity with functional programming and math, and has a few "inside jokes."

You can find the puzzle and the solution on the Strange Loop wiki.

Tagged , , ,

Google’s Research on Interviewing Technical Candidates

Yesterday’s New York Times has a good article on Google’s analysis of what works and what does not work when interviewing candidates for technical jobs. This paragraph closely matches my experience:

Behavioral interviewing also works — where you’re not giving someone a hypothetical, but you’re starting with a question like, “Give me an example of a time when you solved an analytically difficult problem.” The interesting thing about the behavioral interview is that when you ask somebody to speak to their own experience, and you drill into that, you get two kinds of information. One is you get to see how they actually interacted in a real-world situation, and the valuable “meta” information you get about the candidate is a sense of what they consider to be difficult.

I have interviewed candidates this way for years, and I can’t recall ever being wrong in my assessment of a developer’s general intelligence and technical skills. However, I have made mistakes! In particular, this method doesn’t necessarily give you a good indication of how well the candidate will get along with other members of the team, and whether or not they will behave professionally. I’m not as good at assessing that as I am at assessing technical skill.

Unsurprisingly, Google also found

[...] that brainteasers are a complete waste of time. How many golf balls can you fit into an airplane? How many gas stations in Manhattan? A complete waste of time. They don’t predict anything. They serve primarily to make the interviewer feel smart.

The article doesn’t mention anything about whiteboard coding, but I also find that useful in technical interviews.

I have to disagree with the headline, however. While big data might not be able to tell you who to hire, it clearly told Google that they were doing it incorrectly!

Tagged ,

YAML and Remote Code Execution

YAML’s security risks are in no way limited to Rails or Ruby. YAML documents should be treated as executable code and firewalled accordingly. Deserializing arbitrary types is user-controlled, arbitrary code execution.

It’s Not Just Ruby

A few weeks ago, I had a need to parse Jasmine’s jasmine.yml in some C# code. I spent some time looking at existing YAML parsers for .NET and ended up deciding that spending a couple of hours writing a lightweight, purpose-specific parser for jasmine.yml made more sense for my use case than including an off-the-shelf YAML parser which invariably turned out to be quite a heavyweight project.

Having made this choice, I then spent a little bit of time reading the YAML specification. So when, a week or so later, the first of what would become a series of malicious-YAML-based attacks on Rails began to hit the news, I started paying attention. A couple weeks after that, yet another YAML-based security flaw was corrected in Rails, and then was compromised in still another malicious YAML attack separate from the Rails bugs. This one had the risk of compromising any machine which runs gem install on a regular basis, although it doesn’t seem like that actually happened at this point.

Because all of these attacks landed within the Ruby community, observers have occasionally characterized this as a crisis for Rails or Ruby. I think that’s (at least a little) misguided. The real focus of our attention should be YAML.

Update: Aaron Patterson, one of the maintainers of Ruby’s Psych parser, has an excellent discussion of the Ruby-specific aspects of this issue.

It is very easy to demonstrate that the same vulnerabilities exist in other platforms. Here’s a two-line "attack" on Clojure’s YAML parser from Justin Leitgeb. I have to use the term "attack" here loosely, because all he is doing is deserializing an arbitrary class, which the YAML spec allows. But, in most environments, deserializing an arbitrary class is tantamount to code execution. The PyYAML library for Python has nearly the same vulnerability (though there is a workaround for it). I think that YAML-based attacks have tended to target Ruby projects simply because use of YAML is quite a bit more common amongst Rubyists, and certain prominent Ruby libraries in an internal way — users of these libraries may have no idea that the JSON input they supply, e.g., might be vulnerable to an internal YAML parser.

User-Controlled, Arbitrary Object Instantiation is Remote Code Execution

The introduction to the YAML spec states,

YAML leverages these primitives, and adds a simple typing system and aliasing mechanism to form a complete language for serializing any native data structure.

That is, in practice, remote code execution. If an external actor, such as the author of a YAML document, can cause your application to instantiate an arbitrary type, then they can probably execute code on your server.

This is easier in some environments than others. If you happen to be running within a framework where one of the indispensable types calls eval on a property assignment, then it is very, very easy indeed.

On the other hand, even in environments which make a very careful distinction between data type construction and code execution, one can imagine vulnerable code. Consider the following Haskell data type:

type LaunchCodes = (Int, Int)

…and some code, elsewhere in the application:

    case input of
        LaunchCodes (targetId, presidentialPassword) -> launchMissiles( --…

Contrived, sure. But you may have more innocuously-named types which you don’t plan for random users to spin up. Haskell, indeed, makes such attacks harder, but it’s not a free pass.

It’s difficult to overstate the danger of remote code authentication. If someone can execute code on your server, they can probably own your data center.

The YAML spec is largely mute on the issue of security. The word "security" does not appear in the document at all, and malicious user input isn’t discussed, as far as I can see.

Types in the YAML Spec

The encoding of arbitrary types is discussed in the last section of the YAML spec, "Recommended Schemas." It specifies "tag resolution," which is, in practice, the mapping of YAML content to instantiated types during deserialization. This section defines four schemas which a compliant parser should understand. The first three, "Failsafe," "JSON," and "Core," define , define tag resolutions for common types like strings, numbers, lists, maps, etc., but don’t appear dangerous to me.

However, the last section of "Recommended Schemas" is a catchall called "Other Schemas." It notes,

None of the above recommended schemas preclude the use of arbitrary explicit tags. Hence YAML processors for a particular programming language typically provide some form of local tags that map directly to the language’s native data structures (e.g., !ruby/object:Set).

While such local tags are useful for ad-hoc applications, they do not suffice for stable, interoperable cross-application or cross-platform data exchange.

In practice, most YAML deserialization code will, by default, attempt to instantiate any type specified in the YAML file.

Some YAML parsers have a "safe" mode where they will only deserialize types specified in the tag resolution. For example, PyYAML has safe_load. Its documentation notes,

Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function. Check the yaml.safe_load function though.

(emphasis in original)

Notably, however, Ruby’s Psych parser has no such method at this time.

So Should We All Dump YAML?

Some Rubyists have questioned why YAML is so pervasive in the Ruby community when other formats, like JSON or Ruby itself (à la Rake), are perfectly usable in most cases. It’s a good question to ask, especially in cases where YAML parsing has not been explicitly requested.

On the other hand, it’s not hard to imagine cases where allowing arbitrary object instantiation makes sense, such as in an application configuration file. An example in the .NET space would be XAML files. If you are defining a form or a workflow, then you want to be able to allow the instantiation of custom controls. There is no standard way to do this with, say, JSON files, so using a format like YAML makes sense here. (So does using a non-domain-specific language like Ruby, but that presumes that Ruby is available and might not be suitable for cross-language work.)

For the most part, you never want to accept YAML from the outside world. The risks are too high, and the benefits of the YAML format are largely not relevant here. Note, for example, that while gem files include manifests in YAML format, the jQuery plugin repository does essentially the same thing with JSON documents.

Why Don’t YAML Parsers with a "Safe" Mode Use It By Default?

Good question.

Tagged , , , ,

Faking a placeholder Attribute for an Editable div, and Some CSS Tricks

HTML input elements have a placeholder attribute which you can use to show a bit of text to prompt the end user. Although you can make an editable div by using the contenteditable attribute, it will not support the placeholder attribute. I needed to do both, so I ended up reinventing the placeholder attribute for editable divs. Here’s how I did it.

I wanted the "placeholder" in the editable div to behave as much as possible like a "real" placeholder in an input element. So I started by making a list of requirements:

  1. The placeholder text should cosmetically resemble an input’s placeholder.
  2. All markup used should be valid.
  3. The placeholder text should not appear in the DOM.
  4. The placeholder text should disappear when the div contains any text or when it has input focus.

I decided to use a data-placeholder attribute in lieu of a placeholder attribute since this is valid for a div. In order to prevent the placeholder text from appearing in the DOM, I used the CSS :before pseudo-element to hold the text. Making the placeholder text disappear when the div has focus is easy, but making it disappear when the div contains text is harder, because the :contains pseudo-class doesn’t exist. So I had to use a bit of JavaScript and another data- attribute for that.

The markup for the div looks like this:

<div contenteditable='true' class='editable' data-placeholder='Enter some text'></div>

Here’s the CSS:

div[data-placeholder]:not(:focus):not([data-div-placeholder-content]):before {
    content: attr(data-placeholder);
    float: left;
    margin-left: 2px;
    color: #b3b3b3;

There’s a lot here, so let’s break that down:

  • div[data-placeholder] The placeholder should only be shown on divs which have a data-placeholder attribute.
  • :not(:focus) Hide the placeholder when the div is focused.
  • :not([data-div-placeholder-content]) Hide the placeholder when the div has a data-div-placeholder-content attribute. See JavaScript below. This is my workaround for the problem that :contains doesn’t exist.
  • :before This is a CSS pseudo element. The user will see the text, but it won’t be in the (standard) DOM.
  • content: attr(data-placeholder); The content of the pseudo element should be the text contained in the data-placeholder attribute.
  • The rest is just cosmetics.

In order to hide the "placeholder" text when the div contains user-supplied text, I had to resort to a bit of JavaScript/jQuery:

(function ($) {
	$(document).on('change keydown keypress input', 'div[data-placeholder]', function() {
		if (this.textContent) {
			this.dataset.divPlaceholderContent = 'true';
		else {

This code runs whenever any div containing a data-placeholder attribute on the document receives a keydown or keypress event. It then sets the data-div-placeholder-content attribute of that div if it contains any text and clears it if not. Hopefully, that much is obvious. But what’s up with the change and input events? Do those even exist?

Well, sort of. divs don’t normally fire the change event when the user types. However, I needed a way to set the attribute state correctly if I ever change the div text in JavaScript code. Since there is no obvious event to use, I just send change. So code like this:




Without explicitly triggering change, both the "foo" and the placeholder text would be shown, which is obviously undesirable.

input is a relatively new event and is not supported by all popular browsers at this time. It’s a better choice than keydown and keypress, but I include those two, as well as input, in an attempt to support as many browsers as possible.

I’ve published the source code as a plugin on GitHub. It’s arguably overkill to turn this into a plugin, but you don’t have to use it that way; you can just include the CSS and JavaScript inline with the rest of your code.

Tagged , , , ,

Or, As We Called It Back in 1999, "Tuesday"

So this tweet got a lot of attention:


alias yolo=’git commit -am "DEAL WITH IT" && git push -f origin master’

I laughed at this, not because it implies some kind of reckless disregard for process and community, but because, in 1999,  at a former employer, when our VCS was Microsoft SourceSafe, this was just the way that we went about our business. Times have changed!

Tagged , ,

Review: Coursera Social Network Analysis class

I recently completed the Coursera Social Network Analysis class. This was my first time taking a Coursera class. In this post, I will describe my experience with Coursera generally, and review the Social Network Analysis class in particular.

Along with several of my colleagues, I took Martin Odersky’s Functional Programming Principles in Scala class at the same time. Although I finished my last assignment for that class weeks ago, the class isn’t technically complete, so I will reserve comments on that class for a later date.

The Social Network Analysis class is an introductory level class. No previous experience with social network analysis is presumed. The professor is Lada Adamic, of the University of Michigan. If you take the class, you will learn what social network analysis is, the basic mathematical techniques used, a few interesting applications, and have the chance to use software tools such as Gephi, R, NetLogo, and others.

One point which is perhaps not obvious to outsiders is that “social network analysis” is actually relevant to analyzing almost any network which can be modeled as a directed or undirected graph. The class includes lectures on using the techniques presented here on networks which are not actually social networks.

There is quite a bit more math underlying real-world social network analysis then an outsider might expect. Most so-called "social media experts" would probably find it incomprehensible. However, this class really only skims the surface of the math. Students who are not “mathematically inclined” probably won’t have any problem getting through the class. Students who really like math will probably want to follow up with something more in-depth.

The class has an optional programming component. You can opt to not complete any of the programming assignments, and still pass the class. However, I think the most valuable things I learned from the class came from completing the programming assignments. It’s one thing to hear about, say, a mathematical formula for determining some property of the network, but just reading this formula on the slide sometimes doesn’t really stick in my memory. If I actually implement this analysis in R, on the other hand, I have to think about what all of the variables mean. Also, in completing the programming assignments I had the opportunity to use some tools which I had never heard of before, like NetLogo, and other tools, like R, which I had heard about, but never actually used.

If you do complete the programming assignments, then you will receive a “with distinction” on the certificate (a PDF, really) you receive on successful completion of the class. Coursera, however, is not at this time an accredited institution, so you should probably take this class to fulfill your love of learning, rather than because you think it will enhance your resume.

I think the programming assignments would be somewhat difficult for students without prior programming experience. None of them were difficult for me to complete, but somebody without a general sense of common programming syntax, and without real-world experience of setting up slightly-novice-unfriendly tools, might be frustrated by the fairly low level of support included in the course. It is assumed that you are able to install and configure, for example, the R system, pick a text editor and a REPL, install libraries from various locations, figure out differing syntax in various systems, and deal with at least one required library which doesn’t always seem to return accurate results. No professional programmer will have a problem with any of this, but I can see several roadblocks to non-programmers. This is unfortunate, because the programming itself is not difficult, and really does help with understanding the material in the class.

Like just about any class I’ve ever taken anywhere, online or offline, there are “bugs” in the assignments. Judging by forum posts, many people seem to find this surprising, but, in my experience, it happens in traditional classrooms as well. In a few cases, a multiple-choice question had more than one correct answer. In other cases, selecting the correct answer would not give you credit. The Coursera staff would eventually fix these problems.

In contrast to the Functional Programming class, where your grade was completely determined by unit tests and static analysis against your Scala code, the Social Network Analysis class was more “traditional” in that there were multiple-choice-style homeworks and a timed final exam.

Because the class is free, the measure of its value is, “Is taking this course worth my time?” For me, the answer turned out to be a definite yes. I probably spent about 5 hours a week listening to the lectures and completing the assignments. Prior to taking the class, I spent some time reading about the subject matter. I found taking the class to be a more efficient way to really understand what the field was about. For that reason, I recommend it to anyone who needs to interact with networks, social or otherwise, and is not familiar with the field.

Tagged , ,

The Homomorphic Encryption Patent Land Rush

I noticed this morning that Google patent search returns 189 results for the query “homomorphic encryption." I have written about homomorphic encryption in the past; it is a true mathematical breakthrough which has the potential to transform cloud computing security. But the emphasis, here, is on “potential.” There is no fully homomorphic encryption scheme which is efficient enough to be practical for real-world, general-purpose computation.

This, apparently, has done nothing to stop the patent land rush for any process which might conceivably utilize homomorphic encryption. Some of these are legitimate under the current rules of U.S. patents, even if you personally find those rules wrongheaded. For example, IBM has filed patent claims on Craig Gentry’s breakthrough algorithms.

On the other hand, it is not difficult to find patents which don’t so much describe an invention as speculation on what someone might invent in the future. It reminds me of the well-worn economics joke, "Assume a can opener…" Some of these contain amusing disclaimers, such as the following:

Currently there is no practical fully homomorphic cryptosystem, that is, there is no secure cryptosystem that allows for the homomorphic computation of additions and products without restrictions. There has been a recent contribution by Gentry that presents a cryptosystem based on ideal lattices with bootstrappable decryption, and it is been shown that it achieves a full homomorphism. Nevertheless, the authors of this method to concede that making this scheme practical remains an open problem.

Others invoke a form of rhetorical hand waving more commonly seen in freshman term papers than patent applications:

There exist well known solutions for secure computation of any function e.g. as described in [45]… The general method employed by most of these solutions is to construct a combinatorial circuit that computes the required function… It seems hard to apply these methods to complete continuous functions or represent Real numbers, since the methods inherently work over finite fields.

The “well-known” solutions, here, are both computationally impractical and covered by other patent applications. And “seems hard” is really not the right phrase to pick when you are describing a problem for which there is no known solution.

In 2004, Phillips filed a patent application for the very idea of homomorphic encryption. Never mind that no algorithm for fully homomorphic encryption existed at the time, and that Phillips did not invent the idea itself. Buried in paragraph 69 are the following weasel words:

An encryption scheme with these two properties is called a homomorphic encryption scheme. The Paillier system is one homomorphic encryption scheme, but more ones [sic] exist.

Note that Pascal Paillier does not appear to work for Phillips, published his encryption scheme 5 years prior to this patent application, and Paillier encryption is only homomorphic with respect to addition.

Some approved patents venture into the absurd. Remember the string concatenation example I used in my first article on homework encryption? Folks who follow my public speaking closely (hi, mom!) will remember that I sometimes joke about a startup selling cloud-based, homomorphically encrypted string concatenation ("Concatenatr"). It’s a joke, right? You can stop laughing now; SAP actually received a patent for that one.

Tagged ,

Speaking at "Moving to Better Secure the Cloud"

I’ll be speaking at a Slashdot/Geeknet "virtual trade show" today.

Moving to Better Secure the Cloud: Governance, Risk, and Compliance Management

My presentation will be on the potential business impact on the web if an efficient and fully homomorphic encryption system is invented. I’ll be speaking sometime in between 3:15 and 4:00 EST, for about 20 minutes. The target audience is CIOs.

Sorry for the short notice, but this came together at the last minute!

Tagged ,

Ad-hoc SQL/POCO Queries in Entity Framework 4.0

Since version 4.0, the Entity Framework has had the ability to query un-mapped data and project it onto POCOs using ad-hoc SQL.

Here, for example, is how we check the current SQL Server version:

        internal class SqlVersionInfo
            public string Edition { get; set; }
            public string ProductLevel { get; set; }
            public string ProductVersion { get; set; }

        private static SqlVersionInfo GetSqlServerVersion(ObjectContext context)
            const string sql =
             @"SELECT SERVERPROPERTY('productversion') as [ProductVersion],
                      SERVERPROPERTY('productlevel') as [ProductLevel],
                      SERVERPROPERTY('edition') as [Edition];";
            var reader = context.ExecuteStoreQuery<SqlVersionInfo>(sql);
            return reader.Single();

Note that no mapping step is required — the code will just run with any ObjectContext at all.

Tagged , ,

Sometimes, SELECT Really Is Broken

We got a bug report from a customer the other day; a certain query in one of our web services was giving the following error:

A column has been specified more than once in the order by list. Columns in the order by list must be unique.

Seems clear enough, except that

  1. There was no duplication in the ORDER BY. Our DBA discovered that the "problem" reference was [Project2].[C5] ASC, which is an alias, not a real DB column name. It certainly appeared only once, but removing it made the error go away. There is documentation which implies that this was an intentional change in SQL Server 2005, but:
  2. The same query worked fine on other SQL Server 2005 servers. The query work correctly on a local QA environment. But it failed on the customer’s database.
  3. We didn’t write the query; it was produced by the Entity Framework, and my experience tells me that the SQL Server provider for EF doesn’t tend to produce un-compilable SQL.

In order to isolate the problem, I ran the LINQ query in LINQPad. This allowed me to reproduce the problem outside the context of the application (useful, since the test case took a couple of minutes, whereas executing the query in LINQPad directly is almost instant), and also allowed me to switch DB servers easily.

I found a workaround — I could change the LINQ and produce SQL which would work on both servers. But I didn’t like this "fix" because I still didn’t understand the problem. I don’t want to "fix" one thing but cause another problem for some other customer. So I dug a little deeper.

Googling turned up information about the actual problem this error message is supposed to explain, but that wasn’t the case with our query. I did learn that we are not the first group to hit this problem, and that some people "fix" it by setting the database compatibility level to SQL Server 2000. But I don’t like this "fix" at all. Changing the database compatibility level would mean that I would also have to change the ProviderManifestToken in the application’s EDM ask, causing the Entity Framework to emit SQL Server 2000-compatible SQL. That would make the entire application less efficient. So I kept looking.

This Stack Overflow answer didn’t explain everything, but provided an important clue. We checked the server versions on both our local QA environment and the customer system. The customer server was running SQL Server 2005, SP 3. Our local QA environment was running SQL Server 2005, SP 4. Further experimentation showed that this was, in fact, the problem.

It’s generally true that "SELECT isn’t broken." But when you’ve ruled out every other possibility, it’s sometimes worth checking!

Tagged , ,

Bad Behavior has blocked 1314 access attempts in the last 7 days.