
Let’s Build a Compiler… In F#!

I’m building a small compiler for a toy language. The compiler is implemented in F# and emits .NET executables. Demo compilers are a dime a dozen, but there are a few things which make this project distinct:

  • No lexer or parser generators are used. The entire compiler is written from the ground up, and is intended to be fairly simple code.
  • The source code is idiomatic F#, and is almost totally purely functional.

The project started as a fairly simple port of Jack Crenshaw’s code from his classic series, "Let’s Build a Compiler," but deviated pretty quickly as I found I wanted to take the project in a slightly different direction.

  • In the name of simplicity, Crenshaw avoids using distinct lexing, parsing, etc. phases in his code. I completely understand his reasoning, but found that this was confusing to me personally, because the code I was writing did not resemble the code I read in compiler textbooks. Indeed, many "educational" compiler implementations swing the other way, using a "nanopass" (many distinct phases) design.
  • F# code directly ported from Pascal is ugly, and I wanted to use purely functional code whenever possible.
  • I had been attempting to port one "chapter" of Crenshaw’s code at a time, but I fairly quickly discovered that this is exactly backwards; it makes far more sense to implement a complete compiler and then break it down into chapter-sized steps. It is easier to design a road when you know what the destination will be! I do recognize the value of Crenshaw’s "learn by doing / just write the code" approach. I want to go back and "chapterize" the compiler when I’m done.
  • Crenshaw’s code emits 68000 ASM for SK*DOS as a string; mine emits a binary .NET assembly as a file.

As I develop the compiler, I have found that I use a back-and-forth cycle of adding features and beautifying code. New features tend to be pretty ugly until I understand how the code needs to work in the end, and then I go back and rewrite the uglier bits using the knowledge I have gained in the implementation.

For example, I have just added support for declaration and dereferencing of local variables and, with that, multiple-line programs. This resulted in a considerable expansion of both the parser and the code generator. So I’m going to go back and simplify a number of things, especially error handling. I’m going to use a combination of "railway-oriented" design and error nodes in the parser.
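For readers unfamiliar with the pattern, "railway-oriented" error handling composes functions which each return either a success or a failure, and short-circuits on the first failure. Here is a minimal sketch in Python (the `Ok`, `Error`, and `bind` names are illustrative; an F# version would typically use the built-in `Result` type):

```python
# Minimal "railway-oriented" sketch: each phase returns Ok(value) or
# Error(message), and bind short-circuits on the first Error.
class Ok:
    def __init__(self, value): self.value = value

class Error:
    def __init__(self, message): self.message = message

def bind(result, fn):
    """Apply fn to an Ok value; pass an Error through untouched."""
    return fn(result.value) if isinstance(result, Ok) else result

# Two toy compiler phases:
def parse_digit(s):
    return Ok(int(s)) if s.isdigit() else Error(f"not a digit: {s!r}")

def negate(n):
    return Ok(-n)

assert bind(parse_digit("7"), negate).value == -7
assert bind(parse_digit("x"), negate).message == "not a digit: 'x'"
```

The appeal is that the happy path stays a clean pipeline of composed phases, while errors flow along a parallel "track" without any phase needing explicit error-checking code.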

The code does have moments of beauty, though. Here’s the top level of the compiler:

let compile = lex >> parse >> optimize >> codeGen >> methodBuilder

That’s not an ASCII art diagram; it’s actual, executable F# code.

Here is one of the unit tests:

[<TestMethod>]
member x.``2+-3 should equal -1`` () =
   Compiler.compile("2+-3") |> shouldEqual <| -1

This test uses the compiler to emit a .NET assembly in memory, execute its main function, and compare the result with the value in the test.

Anyway, this is an ongoing project. You can see where I’m at by following the GitHub repository.


On Learning Programming and Math at Coursera

Coursera, Udacity, MIT Open Courseware, and other such sites are useful to me because they decouple the desire to learn college-level material from the expense and regulations of earning (another) diploma. The latter isn’t compelling to me today, but the former certainly is.

I’ve now taken three Coursera courses: Functional Programming Principles in Scala, Social Network Analysis, and Coding the Matrix: Linear Algebra Through Computer Science Applications. I also tried to take Calculus: Single Variable, but found that the workload was higher than I could manage with home and work obligations. I had varying degrees of experience with these subjects before beginning the courses; Functional Programming was largely review for me, Social Network Analysis was mostly new to me, and Linear Algebra was somewhere in between.

First, I found all three courses to be an effective way for me to learn the material presented.  This is mostly because all three professors did an excellent job of selecting the exercises to complete for the course. Most of what I learned in all three courses came from completing the exercises. The quality of the lectures and the availability of written materials for offline study varied from course to course. But it’s hard to be too critical considering that the courses are free. For someone like me who cares much more about increasing my knowledge than receiving credit or some sort of official certificate, it almost seems foolish not to take at least one course at any given time.

Completing the courses gave me a chance to practice with some programming languages I don’t use in my professional work right now, like Scala, Python 3, and R.

Perhaps surprisingly, given that all three courses use Coursera’s services, the means of submitting assignments was wildly different from class to class.

For the Scala course, assignments were submitted via sbt, and graded by unit tests on the server. These tests were occasionally fallible; when the server was especially loaded, it could decide that a submission was experiencing infinite recursion (a real hazard in functional programming) when, in fact, it was simply executing the code slowly. At most other times, however, the tests were accurate and the feedback was pretty reasonable.

For the Social Network Analysis course, the programming assignments were an optional add-on. You had to attach the homework via Coursera’s site, and it would be graded somehow. It’s not clear to me what the mechanism was, but it always accepted my submissions. The final programming assignment was graded using a peer review system where three other students would rate your submission, and you would evaluate the submissions of three more students. I really appreciated the feedback I got on this; it seems that the other students put real effort into their feedback.

In the Linear Algebra course, the programming assignments were all in Python. You submit them using a custom Python script which analyzes your code using regular expressions and unit tests on the client and server. This system was the buggiest of the three; the client and server code were occasionally out of sync, and frequently rejected correct code. These issues were fixed as the course went on, but it made for some very frustrating hours.

The use of instructor-supplied unit testing varied from class to class, as well. Unit tests were always included in the Scala class assignments, although the grader would run additional tests on your code. The Linear Algebra class assignments occasionally included unit tests, but mostly did not. The grader always ran unit tests, but they were not usable by the students. This deficit was made up for by other students who would post reasonable unit tests to the class forum. The Social Network Analysis class never included any unit tests in the assignments at all.

My biggest disappointment with all of these courses, however, is non-technical. Coursera, like nearly everyone in academia, has an honor code which forbids sharing work. This stands in stark contrast to most other collaborative systems for learning programming, like Exercism, Project Euler, 4Clojure (and arguably GitHub) where you are expected to first solve a problem yourself, and then work with other developers to refine your solution into the best possible implementation. You learn how to do the work well, not just how to make tests pass.

With Coursera (with the sole exception of the Social Network Analysis final problem), you just can’t do that at all. There are forums, where you are allowed to opaquely discuss the assignment and your approach to it, but you cannot show any actual code for the assignment itself. In a real college course, you could review your work privately with a TA, so you’d at least get some feedback beyond pass/fail. There is no such option at Coursera. For me, this is the biggest barrier to real learning in Coursera, and it seems unnecessary, given that "cheating" on a zero-credit course you take for fun robs only yourself.

There are other areas in which I think Coursera could profit by deviating from the way things are done in college. It’s not clear to me why a fixed schedule is required for these classes. Why not give students the option of "auditing" a course on their own time? Since the lectures are pre-recorded and the assignments are, for the most part, graded mechanically, it seems like this should be possible. A more flexible schedule would accommodate students who cannot commit the roughly 6 to 16 hours per week that most of the classes seem to require.

These criticisms aside, however, the big picture is that all of the classes I have taken so far have been good because the professors did an excellent job of designing the courses. As long as Coursera continues to recruit such qualified teachers, I think their classes will be a good investment of my time.


Strange Loop Crossword

I wrote a 15×15, NYT-style crossword puzzle for Strange Loop. On the NYT difficulty scale, it’s roughly a Wednesday-level puzzle. However, it was written for Strange Loop and thus does presume familiarity with functional programming and math, and has a few "inside jokes."

You can find the puzzle and the solution on the Strange Loop wiki.


Google’s Research on Interviewing Technical Candidates

Yesterday’s New York Times has a good article on Google’s analysis of what works and what does not work when interviewing candidates for technical jobs. This paragraph closely matches my experience:

Behavioral interviewing also works — where you’re not giving someone a hypothetical, but you’re starting with a question like, “Give me an example of a time when you solved an analytically difficult problem.” The interesting thing about the behavioral interview is that when you ask somebody to speak to their own experience, and you drill into that, you get two kinds of information. One is you get to see how they actually interacted in a real-world situation, and the valuable “meta” information you get about the candidate is a sense of what they consider to be difficult.

I have interviewed candidates this way for years, and I can’t recall ever being wrong in my assessment of a developer’s general intelligence and technical skills. However, I have made mistakes! In particular, this method doesn’t necessarily give you a good indication of how well the candidate will get along with other members of the team, and whether or not they will behave professionally. I’m not as good at assessing that as I am at assessing technical skill.

Unsurprisingly, Google also found

[...] that brainteasers are a complete waste of time. How many golf balls can you fit into an airplane? How many gas stations in Manhattan? A complete waste of time. They don’t predict anything. They serve primarily to make the interviewer feel smart.

The article doesn’t mention anything about whiteboard coding, but I also find that useful in technical interviews.

I have to disagree with the headline, however. While big data might not be able to tell you who to hire, it clearly told Google that they were doing it incorrectly!


YAML and Remote Code Execution

YAML’s security risks are in no way limited to Rails or Ruby. YAML documents should be treated as executable code and firewalled accordingly. Deserializing arbitrary types is user-controlled, arbitrary code execution.

It’s Not Just Ruby

A few weeks ago, I had a need to parse Jasmine’s jasmine.yml in some C# code. I spent some time looking at existing YAML parsers for .NET and ended up deciding that spending a couple of hours writing a lightweight, purpose-specific parser for jasmine.yml made more sense for my use case than including an off-the-shelf YAML parser; the existing options invariably turned out to be quite heavyweight projects.

Having made this choice, I then spent a little bit of time reading the YAML specification. So when, a week or so later, the first of what would become a series of malicious-YAML-based attacks on Rails began to hit the news, I started paying attention. A couple weeks after that, yet another YAML-based security flaw was corrected in Rails, and then rubygems.org was compromised in still another malicious YAML attack separate from the Rails bugs. This one had the risk of compromising any machine which runs gem install on a regular basis, although it doesn’t seem like that actually happened at this point.

Because all of these attacks landed within the Ruby community, observers have occasionally characterized this as a crisis for Rails or Ruby. I think that’s (at least a little) misguided. The real focus of our attention should be YAML.

Update: Aaron Patterson, one of the maintainers of Ruby’s Psych parser, has an excellent discussion of the Ruby-specific aspects of this issue.

It is very easy to demonstrate that the same vulnerabilities exist in other platforms. Here’s a two-line "attack" on Clojure’s YAML parser from Justin Leitgeb. I have to use the term "attack" here loosely, because all he is doing is deserializing an arbitrary class, which the YAML spec allows. But, in most environments, deserializing an arbitrary class is tantamount to code execution. The PyYAML library for Python has nearly the same vulnerability (though there is a workaround for it). I think that YAML-based attacks have tended to target Ruby projects simply because use of YAML is quite a bit more common amongst Rubyists, and certain prominent Ruby libraries use YAML internally; users of these libraries may have no idea that the input they supply (JSON, for example) might be handed to an internal YAML parser.

User-Controlled, Arbitrary Object Instantiation is Remote Code Execution

The introduction to the YAML spec states,

YAML leverages these primitives, and adds a simple typing system and aliasing mechanism to form a complete language for serializing any native data structure.

That is, in practice, remote code execution. If an external actor, such as the author of a YAML document, can cause your application to instantiate an arbitrary type, then they can probably execute code on your server.

This is easier in some environments than others. If you happen to be running within a framework where one of the indispensable types calls eval on a property assignment, then it is very, very easy indeed.

On the other hand, even in environments which make a very careful distinction between data type construction and code execution, one can imagine vulnerable code. Consider the following Haskell data type:

data LaunchCodes = LaunchCodes Int Int

…and some code, elsewhere in the application:

do
    case input of
        LaunchCodes targetId presidentialPassword -> launchMissiles --…

Contrived, sure. But you may have more innocuously-named types which you don’t plan for random users to spin up. Haskell, indeed, makes such attacks harder, but it’s not a free pass.

It’s difficult to overstate the danger of remote code execution. If someone can execute code on your server, they can probably own your data center.

The YAML spec is largely mute on the issue of security. The word "security" does not appear in the document at all, and malicious user input isn’t discussed, as far as I can see.

Types in the YAML Spec

The encoding of arbitrary types is discussed in the last section of the YAML spec, "Recommended Schemas." It specifies "tag resolution," which is, in practice, the mapping of YAML content to instantiated types during deserialization. This section defines four schemas which a compliant parser should understand. The first three, "Failsafe," "JSON," and "Core," define tag resolutions for common types like strings, numbers, lists, maps, etc., but don’t appear dangerous to me.

However, the last section of "Recommended Schemas" is a catchall called "Other Schemas." It notes,

None of the above recommended schemas preclude the use of arbitrary explicit tags. Hence YAML processors for a particular programming language typically provide some form of local tags that map directly to the language’s native data structures (e.g., !ruby/object:Set).

While such local tags are useful for ad-hoc applications, they do not suffice for stable, interoperable cross-application or cross-platform data exchange.

In practice, most YAML deserialization code will, by default, attempt to instantiate any type specified in the YAML file.
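To see why that default is dangerous, here is a toy deserializer in Python. It is not a YAML parser; it just models the tag-to-type step, and every name in it (`naive_deserialize`, `SAFE_TAGS`, and so on) is invented for illustration:

```python
import importlib

def naive_deserialize(tag, args):
    """Instantiate whatever type the document names -- the unsafe default."""
    module_name, _, attr = tag.rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(*args)

# A document naming a harmless type works as expected...
assert naive_deserialize("builtins.list", [["a", "b"]]) == ["a", "b"]

# ...but a document naming, say, subprocess.Popen would run a shell command.
# The fix is an allowlist of known-safe tags, which is what "safe" modes do.
SAFE_TAGS = {"builtins.list", "builtins.dict", "builtins.str"}

def safe_deserialize(tag, args):
    if tag not in SAFE_TAGS:
        raise ValueError(f"refusing to instantiate {tag!r}")
    return naive_deserialize(tag, args)
```

A "safe" mode is exactly the allowlist version: refuse any tag you did not explicitly expect.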

Some YAML parsers have a "safe" mode where they will only deserialize types specified in the tag resolution. For example, PyYAML has safe_load. Its documentation notes,

Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function. Check the yaml.safe_load function though.

(emphasis in original)

Notably, however, Ruby’s Psych parser has no such method at this time.

So Should We All Dump YAML?

Some Rubyists have questioned why YAML is so pervasive in the Ruby community when other formats, like JSON or Ruby itself (à la Rake), are perfectly usable in most cases. It’s a good question to ask, especially in cases where YAML parsing has not been explicitly requested.

On the other hand, it’s not hard to imagine cases where allowing arbitrary object instantiation makes sense, such as in an application configuration file. An example in the .NET space would be XAML files. If you are defining a form or a workflow, then you want to be able to allow the instantiation of custom controls. There is no standard way to do this with, say, JSON files, so using a format like YAML makes sense here. (So does using a non-domain-specific language like Ruby, but that presumes that Ruby is available and might not be suitable for cross-language work.)

For the most part, you never want to accept YAML from the outside world. The risks are too high, and the benefits of the YAML format are largely not relevant here. Note, for example, that while gem files include manifests in YAML format, the jQuery plugin repository does essentially the same thing with JSON documents.

Why Don’t YAML Parsers with a "Safe" Mode Use It By Default?

Good question.


Faking a placeholder Attribute for an Editable div, and Some CSS Tricks

HTML input elements have a placeholder attribute which you can use to show a bit of text to prompt the end user. Although you can make an editable div by using the contenteditable attribute, it will not support the placeholder attribute. I needed to do both, so I ended up reinventing the placeholder attribute for editable divs. Here’s how I did it.

I wanted the "placeholder" in the editable div to behave as much as possible like a "real" placeholder in an input element. So I started by making a list of requirements:

  1. The placeholder text should cosmetically resemble an input’s placeholder.
  2. All markup used should be valid.
  3. The placeholder text should not appear in the DOM.
  4. The placeholder text should disappear when the div contains any text or when it has input focus.

I decided to use a data-placeholder attribute in lieu of a placeholder attribute since this is valid for a div. In order to prevent the placeholder text from appearing in the DOM, I used the CSS :before pseudo-element to hold the text. Making the placeholder text disappear when the div has focus is easy, but making it disappear when the div contains text is harder, because the :contains pseudo-class doesn’t exist. So I had to use a bit of JavaScript and another data- attribute for that.

The markup for the div looks like this:

<div contenteditable='true' class='editable' data-placeholder='Enter some text'></div>

Here’s the CSS:

div[data-placeholder]:not(:focus):not([data-div-placeholder-content]):before {
    content: attr(data-placeholder);
    float: left;
    margin-left: 2px;
    color: #b3b3b3;
}

There’s a lot here, so let’s break that down:

  • div[data-placeholder] The placeholder should only be shown on divs which have a data-placeholder attribute.
  • :not(:focus) Hide the placeholder when the div is focused.
  • :not([data-div-placeholder-content]) Hide the placeholder when the div has a data-div-placeholder-content attribute. See JavaScript below. This is my workaround for the problem that :contains doesn’t exist.
  • :before This is a CSS pseudo element. The user will see the text, but it won’t be in the (standard) DOM.
  • content: attr(data-placeholder); The content of the pseudo element should be the text contained in the data-placeholder attribute.
  • The rest is just cosmetics.

In order to hide the "placeholder" text when the div contains user-supplied text, I had to resort to a bit of JavaScript/jQuery:

(function ($) {
	$(document).on('change keydown keypress input', 'div[data-placeholder]', function() {
		if (this.textContent) {
			this.dataset.divPlaceholderContent = 'true';
		}
		else {
			delete(this.dataset.divPlaceholderContent);
		}
	});
})(jQuery);

This code runs whenever any div containing a data-placeholder attribute on the document receives a keydown or keypress event. It then sets the data-div-placeholder-content attribute of that div if it contains any text and clears it if not. Hopefully, that much is obvious. But what’s up with the change and input events? Do those even exist?

Well, sort of. divs don’t normally fire the change event when the user types. However, I needed a way to set the attribute state correctly if I ever change the div text in JavaScript code. Since there is no obvious event to use, I just send change. So code like this:

$("#myDiv").text("foo");

becomes:

$("#myDiv").text("foo").trigger("change");

Without explicitly triggering change, both the "foo" and the placeholder text would be shown, which is obviously undesirable.

input is a relatively new event and is not supported by all popular browsers at this time. It’s a better choice than keydown and keypress, but I include those two, as well as input, in an attempt to support as many browsers as possible.

I’ve published the source code as a plugin on GitHub. It’s arguably overkill to turn this into a plugin, but you don’t have to use it that way; you can just include the CSS and JavaScript inline with the rest of your code.


Or, As We Called It Back in 1999, "Tuesday"

So this tweet got a lot of attention:

potch
@potch

alias yolo=’git commit -am "DEAL WITH IT" && git push -f origin master’

I laughed at this, not because it implies some kind of reckless disregard for process and community, but because, in 1999,  at a former employer, when our VCS was Microsoft SourceSafe, this was just the way that we went about our business. Times have changed!


Review: Coursera Social Network Analysis class

I recently completed the Coursera Social Network Analysis class. This was my first time taking a Coursera class. In this post, I will describe my experience with Coursera generally, and review the Social Network Analysis class in particular.

Along with several of my colleagues, I took Martin Odersky’s Functional Programming Principles in Scala class at the same time. Although I finished my last assignment for that class weeks ago, the class isn’t technically complete, so I will reserve comments on that class for a later date.

The Social Network Analysis class is an introductory level class. No previous experience with social network analysis is presumed. The professor is Lada Adamic, of the University of Michigan. If you take the class, you will learn what social network analysis is, the basic mathematical techniques used, a few interesting applications, and have the chance to use software tools such as Gephi, R, NetLogo, and others.

One point which is perhaps not obvious to outsiders is that “social network analysis” is actually relevant to analyzing almost any network which can be modeled as a directed or undirected graph. The class includes lectures on using the techniques presented here on networks which are not actually social networks.

There is quite a bit more math underlying real-world social network analysis than an outsider might expect. Most so-called "social media experts" would probably find it incomprehensible. However, this class really only skims the surface of the math. Students who are not “mathematically inclined” probably won’t have any problem getting through the class. Students who really like math will probably want to follow up with something more in-depth.

The class has an optional programming component. You can opt to not complete any of the programming assignments, and still pass the class. However, I think the most valuable things I learned from the class came from completing the programming assignments. It’s one thing to hear about, say, a mathematical formula for determining some property of the network, but just reading this formula on the slide sometimes doesn’t really stick in my memory. If I actually implement this analysis in R, on the other hand, I have to think about what all of the variables mean. Also, in completing the programming assignments I had the opportunity to use some tools which I had never heard of before, like NetLogo, and other tools, like R, which I had heard about, but never actually used.

If you do complete the programming assignments, then you will receive a “with distinction” on the certificate (a PDF, really) you receive on successful completion of the class. Coursera, however, is not at this time an accredited institution, so you should probably take this class to fulfill your love of learning, rather than because you think it will enhance your resume.

I think the programming assignments would be somewhat difficult for students without prior programming experience. None of them were difficult for me to complete, but somebody without a general sense of common programming syntax, and without real-world experience of setting up slightly-novice-unfriendly tools, might be frustrated by the fairly low level of support included in the course. It is assumed that you are able to install and configure, for example, the R system, pick a text editor and a REPL, install libraries from various locations, figure out differing syntax in various systems, and deal with at least one required library which doesn’t always seem to return accurate results. No professional programmer will have a problem with any of this, but I can see several roadblocks to non-programmers. This is unfortunate, because the programming itself is not difficult, and really does help with understanding the material in the class.

Like just about any class I’ve ever taken anywhere, online or offline, there are “bugs” in the assignments. Judging by forum posts, many people seem to find this surprising, but, in my experience, it happens in traditional classrooms as well. In a few cases, a multiple-choice question had more than one correct answer. In other cases, selecting the correct answer would not give you credit. The Coursera staff would eventually fix these problems.

In contrast to the Functional Programming class, where your grade was completely determined by unit tests and static analysis against your Scala code, the Social Network Analysis class was more “traditional” in that there were multiple-choice-style homeworks and a timed final exam.

Because the class is free, the measure of its value is, “Is taking this course worth my time?” For me, the answer turned out to be a definite yes. I probably spent about 5 hours a week listening to the lectures and completing the assignments. Prior to taking the class, I spent some time reading about the subject matter. I found taking the class to be a more efficient way to really understand what the field was about. For that reason, I recommend it to anyone who needs to interact with networks, social or otherwise, and is not familiar with the field.


The Homomorphic Encryption Patent Land Rush

I noticed this morning that Google patent search returns 189 results for the query “homomorphic encryption." I have written about homomorphic encryption in the past; it is a true mathematical breakthrough which has the potential to transform cloud computing security. But the emphasis, here, is on “potential.” There is no fully homomorphic encryption scheme which is efficient enough to be practical for real-world, general-purpose computation.

This, apparently, has done nothing to stop the patent land rush for any process which might conceivably utilize homomorphic encryption. Some of these are legitimate under the current rules of U.S. patents, even if you personally find those rules wrongheaded. For example, IBM has filed patent claims on Craig Gentry’s breakthrough algorithms.

On the other hand, it is not difficult to find patents which don’t so much describe an invention as speculation on what someone might invent in the future. It reminds me of the well-worn economics joke, "Assume a can opener…" Some of these contain amusing disclaimers, such as the following:

Currently there is no practical fully homomorphic cryptosystem, that is, there is no secure cryptosystem that allows for the homomorphic computation of additions and products without restrictions. There has been a recent contribution by Gentry that presents a cryptosystem based on ideal lattices with bootstrappable decryption, and it is been shown that it achieves a full homomorphism. Nevertheless, the authors of this method to concede that making this scheme practical remains an open problem.

Others invoke a form of rhetorical hand waving more commonly seen in freshman term papers than patent applications:

There exist well known solutions for secure computation of any function e.g. as described in [45]… The general method employed by most of these solutions is to construct a combinatorial circuit that computes the required function… It seems hard to apply these methods to complete continuous functions or represent Real numbers, since the methods inherently work over finite fields.

The “well-known” solutions, here, are both computationally impractical and covered by other patent applications. And “seems hard” is really not the right phrase to pick when you are describing a problem for which there is no known solution.

In 2004, Phillips filed a patent application for the very idea of homomorphic encryption. Never mind that no algorithm for fully homomorphic encryption existed at the time, and that Phillips did not invent the idea itself. Buried in paragraph 69 are the following weasel words:

An encryption scheme with these two properties is called a homomorphic encryption scheme. The Paillier system is one homomorphic encryption scheme, but more ones [sic] exist.

Note that Pascal Paillier does not appear to work for Phillips, published his encryption scheme 5 years prior to this patent application, and Paillier encryption is only homomorphic with respect to addition.
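Paillier’s additive homomorphism is concrete enough to demonstrate in a few lines. A toy Python sketch (the tiny hard-coded primes are purely for illustration; real keys use primes hundreds of digits long):

```python
import math
import random

def paillier_keygen(p=1789, q=1861):
    # Toy primes for illustration only; real keys use enormous primes.
    n = p * q
    lam = math.lcm(p - 1, q - 1)   # Carmichael's function of n
    g = n + 1                      # a standard, simple choice of generator
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

pub, priv = paillier_keygen()
c1, c2 = encrypt(pub, 17), encrypt(pub, 25)
# Multiplying ciphertexts adds plaintexts: E(17) * E(25) decrypts to 42.
assert decrypt(pub, priv, (c1 * c2) % (pub[0] ** 2)) == 42
```

Multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts; nothing analogous exists for multiplying plaintexts, which is why Paillier is homomorphic only with respect to addition.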

Some approved patents venture into the absurd. Remember the string concatenation example I used in my first article on homomorphic encryption? Folks who follow my public speaking closely (hi, mom!) will remember that I sometimes joke about a startup selling cloud-based, homomorphically encrypted string concatenation ("Concatenatr"). It’s a joke, right? You can stop laughing now; SAP actually received a patent for that one.


Speaking at "Moving to Better Secure the Cloud"

I’ll be speaking at a Slashdot/Geeknet "virtual trade show" today.

Moving to Better Secure the Cloud: Governance, Risk, and Compliance Management

My presentation will be on the potential business impact on the web if an efficient, fully homomorphic encryption system is invented. I’ll be speaking sometime between 3:15 and 4:00 EST, for about 20 minutes. The target audience is CIOs.

Sorry for the short notice, but this came together at the last minute!

