Pay off your technical debt by preferring API clarity to generation efficiency

I’ve built the technical side of my career on combining technologies from Microsoft – which are easy to sell into enterprises that want the confidence of extensive support contracts and a huge market footprint – with open source technologies that steer the direction of the industry ahead of the enterprise curve, eventually to be embraced by it.

Microsoft has always provided powerful tools for developers in its Visual Studio product line. It focuses on providing more features than any other vendor, and on giving developers the flexibility to design their software with the patterns that make the most sense to them. Because of this, the community is full of discussion, and there are always new ways to combine these technologies to do similar things – but with quite a bit of variance in the architecture or patterns used to get them done. It can be daunting as a new developer, or a new member of a team, to comprehend some of the architectural works of art created by well-intentioned astronauts.

After I learned my first handful of programming languages, I began to notice the things that differed between them. These differences were not logic constructs, but rather how easy or difficult it could be to express the business problem at hand. Few will dispute that a well-designed domain model is easier to code against from a higher-level layer of your application architecture than a direct API on top of the database – where persistence bleeds into the programming interface and durability concerns color the intent of the business logic.

In recent years, domain specific languages (DSLs) have risen in popularity, are employed to great effect in open source projects, and are just starting to be embraced in Microsoft’s technology stack. A domain specific language is simply a programming interface (or API) whose syntax is optimized for expressing the problem it’s meant to solve. The result is not always pretty – sometimes the problem you’re trying to solve shouldn’t be a problem at all due to bad design. That aside, here are a few examples:

  • CSS – the syntax of CSS is optimized to express the assignment of styling to markup languages.
  • Rake/PSake – the syntax of these two DSLs is optimized for expressing dependencies between buildable items and for creating deployment scripts that invoke operating system processes – typically command-line applications.
  • LINQ – the syntax of Language Integrated Query from Microsoft makes it easier to express relationship traversal and filtering operations from a .NET language such as C# or VB. Ironically, I’m of the opinion that LINQ is a cumbersome way to express the joins and filtering appropriate for returning optimized sets of persisted data (where T-SQL shines). That’s not to say T-SQL is the best syntax – but using an OO programming language for the job feels worse to me. However, I’d still consider its design intent that of a DSL (see the sketch after this list).
  • Ruby – the Ruby language itself has constructs that make it dead simple to build DSLs on top of it, leading to its popularity and success in building niche APIs.
  • YAML – originally “Yet Another Markup Language” (now the recursive “YAML Ain’t Markup Language”), its syntax is optimized for expressing nested sets of data, their attributes, and values. It doesn’t look much different from JSON at first glance, but if you haven’t yet used it on a real project, you’ll notice the efficiency once you do.
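
To make the LINQ point concrete, here is a minimal, hypothetical example – Customer and Order are made-up classes, not from any real schema – showing the query-syntax join and filter that T-SQL would express as a plain INNER JOIN with a WHERE clause:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical classes standing in for a Customers/Orders schema.
public class Customer { public int Id { get; set; } public string Region { get; set; } }
public class Order { public int CustomerId { get; set; } public decimal Total { get; set; } }

public static class LinqExample
{
    public static void Main()
    {
        var customers = new List<Customer> { new Customer { Id = 1, Region = "TX" } };
        var orders = new List<Order> { new Order { CustomerId = 1, Total = 250m } };

        // LINQ's query syntax: a DSL for traversal and filtering embedded in C#.
        // The equivalent T-SQL is an INNER JOIN with a WHERE clause.
        var bigTexasOrders = from c in customers
                             join o in orders on c.Id equals o.CustomerId
                             where c.Region == "TX" && o.Total > 100m
                             select new { c.Id, o.Total };

        foreach (var row in bigTexasOrders)
            Console.WriteLine("{0}: {1}", row.Id, row.Total);
    }
}
```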

Using a DSL leads to higher cognitive retention of the syntax, which tends to lead to increased productivity and a reduced need for tools. IntelliSense, code generation, and wizards can all take orders of magnitude longer to use than simply expressing the intended action in a DSL’s syntax once you’ve memorized the most commonly expressed statements, because the keyword and operator set is small and optimized within the context of one problem. This is especially apparent when you have to choose a code generator or wizard from a list of many other generators unrelated to the problem you’re trying to solve.

Because of this, you will reduce your cycle time by evaluating tools, APIs, and code-generation technologies based not on how much code your chosen IDE or command-line generator spits out, but on the clarity, comprehensibility, and flexibility of that code once written. I am all for code generation (“rails g” is still the biggest productivity game changer for architectural consistency of any software tool I’ve used), but there is still a cost to maintain that code once generated.

Here are a few things to keep in mind when considering the technical cost and efficiency of an API in helping you deliver value to customers:

  • Is the number of keywords, operators, and constructs optimized for expressing the problem at hand?
  • Are the words used, the way they relate to each other when typed, and even the way they sound when read aloud easy to comprehend by someone trying to solve the problem the API is focused on? Related to this: consider how easy it will be for someone else to comprehend code they didn’t write or generate.
  • Is there minimal bleed-over between the API and others that are focused on solving a different problem? Is the syntax really the best way to express the problem, or just an attempt at doing so with an existing language? You can usually tell this isn’t the case when you find yourself using language constructs meant for a different problem just to make the code easier to read. A good example is “fluent” APIs in C# or VB.NET that use lambda expressions for property assignment, when the intent of a lambda is to let a pipeline of code modify a value via separate functions. You can see the mismatch in the funky syntax, and in the low comprehension of someone new to the concept without an explanation (see the sketch after this list).
  • Are there technologies available that make the API easy to test, but have a small to (highly preferred) nonexistent impact on the syntax itself? This is a big one for me – I hate introducing interfaces solely to enable testability when dependency injection or convention-based mocking can do the job better.
  • If generation is used to create the code, is it easy to reuse the generated code once it has been modified?
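
As a sketch of the “fluent” mismatch mentioned above – Mapping<T> and Person are made-up names, not any real library – notice how the lambda exists only to point at a property rather than to run a pipeline of code:

```csharp
using System;
using System.Linq.Expressions;

// A made-up "fluent" mapping API, only to illustrate the lambda mismatch discussed above.
public class Person { public string Name { get; set; } }

public class Mapping<T>
{
    // The lambda is used purely to name a property, not to transform a value.
    public Mapping<T> Map(Expression<Func<T, object>> property, string columnName)
    {
        Console.WriteLine("Mapping {0} -> {1}", property.Body, columnName);
        return this; // returning 'this' is what makes the API read "fluently"
    }
}

public static class FluentExample
{
    public static void Main()
    {
        // Reads almost like configuration, but leans on lambda syntax just to name a property.
        new Mapping<Person>().Map(p => p.Name, "person_name");
    }
}
```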

You’ll notice one consideration I didn’t include in that list – how well the API integrates with existing libraries. That’s because a DSL shouldn’t need to – it should be designed from the ground up either to leverage that integration under the covers, or to leave that concern to another DSL.

When you begin to include these considerations in evaluating a particular coding technology, it becomes obvious that the clarity and focus of an API is many times more important than the number of lines of code a wizard or generator can create to help you use it.

For a powerful example of this, create an ADO.NET DataSet and look at the code it generates. I’ve seen teams spend hours trying to find ways to backdoor the generated code, or to figure out why it’s behaving strangely, only to discover that someone created a partial class to do so and placed it somewhere non-intuitive in the project. The availability of Entity Framework Code First is also a nod towards the importance of comprehension and a focused syntax over generation.

Why you should use Migrations instead of Visual Studio 2010 Database Projects

If you work on an application that uses a database, chances are you have to deal with releasing new versions of your software that make changes to it. The SQL language provides comprehensive support for making these types of changes and can reach even the advanced features of your chosen database platform. Schema changes are typically made through CREATE and ALTER statements, and data movement is performed using SELECTs and INSERTs.

When you release your software initially, deployment is straightforward because there is no existing data to deal with. As users exercise the features in your software, rows of data are added to tables, and future changes require more care so as not to destroy or invalidate the existing data.

In the past, DBAs or developers with sufficient SQL programming knowledge have written scripts to make the changes necessary to update database assets that contain existing data, paying special care to typical situations like adding a new NOT NULL column (you need to initialize it with data before enabling the constraint), splitting one column into two, or splitting some columns of a large table out into a new detail table.

For years seasoned developers have used the following approach for making changes to the database:

  • Add a table with one row that stores the “version” of the database. This data is not really application data per se, but more like metadata that identifies the state the schema is in. This version is usually initialized to the lowest version where development starts, let’s say 1.0.0.0.
  • When you have changes, create SQL scripts that check this row: if the version of the database the script is running against is lower than the “version” of your script, make your changes.
  • When your script is done making the changes, increment the version number of the database row to its new version (1.0.0.1 for example).

The great thing about this approach is that it supports deploying changes to multiple versions of the same database. If you are upgrading a version 1.0.0.0 database and your latest version is 1.0.0.5, any scripts with numbers between those two versions will run in ascending order. If you are upgrading from version 1.0.0.3 to 1.0.0.5, scripts that apply to versions prior to 1.0.0.3 will be skipped.
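
Here is a minimal sketch of what a runner for this approach might look like in C# with ADO.NET – purely illustrative, assuming a single-row SchemaVersion table that stores the version as text and upgrade scripts named like “1.0.0.3_add_orders.sql”:

```csharp
using System;
using System.Data.SqlClient;
using System.IO;
using System.Linq;

// Illustrative only: assumes a single-row SchemaVersion table storing the version as varchar,
// and a folder of upgrade scripts named like "1.0.0.3_add_orders.sql".
public static class VersionedScriptRunner
{
    public static void Upgrade(string connectionString, string scriptFolder)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Read the current "version" row from the target database.
            Version current;
            using (var read = new SqlCommand("SELECT Version FROM SchemaVersion", connection))
                current = Version.Parse((string)read.ExecuteScalar());

            // Apply only the scripts newer than the current version, in ascending order.
            var pending = Directory.GetFiles(scriptFolder, "*.sql")
                .Select(p => new
                {
                    ScriptPath = p,
                    ScriptVersion = Version.Parse(Path.GetFileName(p).Split('_')[0])
                })
                .Where(s => s.ScriptVersion > current)
                .OrderBy(s => s.ScriptVersion);

            foreach (var script in pending)
            {
                // Real scripts containing GO batch separators would need to be split first.
                using (var apply = new SqlCommand(File.ReadAllText(script.ScriptPath), connection))
                    apply.ExecuteNonQuery();

                // Bump the stored version so the next upgrade knows where to start.
                using (var bump = new SqlCommand("UPDATE SchemaVersion SET Version = @v", connection))
                {
                    bump.Parameters.AddWithValue("@v", script.ScriptVersion.ToString());
                    bump.ExecuteNonQuery();
                }
            }
        }
    }
}
```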

There are two gotchas with this approach:

  1. You need to test upgrading from any version you have in the field before deploying to that version. So if you are upgrading databases that can vary by 5 versions, you really need to test the upgrade process going from all 5 of these versions to the current version. This is more a consideration than a limitation as you always need to do this when supporting multiple upgrade paths for your software.
  2. Developers can make mistakes in their SQL scripts and check for the wrong version, or forget to update the database version to the new value after the operation succeeds.

When Ruby on Rails was released, the ActiveRecord team, along with David Heinemeier Hansson, provided the then-emerging Ruby community with a technology called migrations, which adds some extra help on top of this approach. Basically, any time you want to change the database, you run a command at your operating system prompt that generates a new script prefixed with a version number greater than that of the latest script in your source tree.

An example will help here.

  • You run the command “rails generate migration create_users” and a file 00001_create_users.rb is generated. In it you put code to both apply (up) AND roll back (down) changes related to the “users” table, for example.
  • You run the command “rails generate migration create_roles” and a file 00002_create_roles.rb is generated. Notice the tool recognized your latest script version and created a newer version automatically.

When you want to deploy to a database, you run another command “rake db:migrate” which tells the “rake” (ruby make) build engine to run all of your database migrations against the target database. The migrations engine automatically does the work of checking the target version of the database, running only those scripts that apply, and incrementing the database version to the latest one that succeeded.

This approach solves the problem of developers having to version things manually, and really simplifies deployment to multiple versions of a database. It also allows developers to incrementally make changes needed to support changes they are working on, without stepping on the toes of other developers.

Enter Visual Studio 2010 database projects

With the release of Microsoft Visual Studio 2010, another approach was provided to developers for managing database versions. This approach was made available by Microsoft’s acquisition of DBPro.

The VS 2010 DB project approach is to have a type of project in your solution that contains scripts that can create all the artifacts in your database: there are create scripts for stored procedures, schemas, roles, tables, views, etc. However, the tool is sold as not requiring developers to know as much SQL; instead, they are given a treeview panel in the Visual Studio IDE (referred to as “Schema View” in the documentation). They can interact with this tree to add tables, rename columns, and make other trivial changes via a GUI, and those changes are then saved as new SQL scripts in the project.

When you deploy your DB project, an engine that is part of the Visual Studio build system compares the target database with what a “new” database would look like based on the scripts in your project, then generates a script to alter the target so its structure matches the project’s source code. The engine works much like Red Gate’s SQL Compare tool in that it is fairly intelligent about detecting changes in schema and generating appropriate scripts.

At first glance, this seems like a superior solution, as it gives point-and-click programmers more productivity, removes version management from the picture, and eliminates the need to manually create alter scripts. In practice, however, this approach by itself will not meet the requirements of most deployment cycles.

Microsoft released an ALM Rangers guide to Visual Studio 2010 database projects that is meant to be primary guidance for developers, DBAs, and architects looking for best practices around VS 2010 DB projects. Part of this guide talks about “Complex Data Movement”, or what I will refer to here as “changing database assets containing data”, because that’s really what it is about.

Unfortunately, Microsoft’s solution for this “complex” scenario (which is common and routine, in my experience) is to subvert the diffing engine: revert to temporary tables and pre/post build scripts to trick the engine into thinking the schema doesn’t need to be changed, then fix things up afterwards. This issue is described in the ALM Rangers guide, and also in a blog post by Barclay Hill.

Jeremy Elbourn explains on the MSDN forums why this approach actually makes maintaining database changes over time even harder than the migration approach in a real-world environment. Microsoft also recently announced the availability of database migration support in ASP.NET MVC 4 (but only if you are using Entity Framework as well). These developments leave the folks responsible for choosing a database change management approach confused about where best practices are headed with respect to Microsoft’s vision.

It is my opinion that Visual Studio 2010 database projects should be avoided in favor of a migrations engine, for the following reasons:

  1. The success or failure of employing VS 2010 DB projects with real-world, enterprise-sized clients has yet to be demonstrated in any measurable capacity, and the technology is still relatively new. I’ve seen some press releases, but these are marketing announcements with no downloadable artifacts to evaluate. I have also been discussing the tradeoffs with a colleague who is using it on a single application for an enterprise client with many integrated applications.
  2. I tend to embrace tools that generate code for me or do work automatically only when they are comprehensive, well understood, and have limited “gotchas”. Schema and data change management is a complex topic, and the VS 2010 database project approach leads developers to think the solution is easy, while in practice it forces them to understand how the diffing engine works, the project structure and deployment lifecycle of a DB project build, and how to circumvent the diffing engine to change database assets containing data.
  3. The ALM guide proposes detecting existing schema state to determine when pre- or post-deployment scripts need to be run: “if this column exists, run this script”. This is an error-prone and naive approach. What if version 1 has this column, version 2 does not, and version 3 adds it back in? That kind of check will fail. Ironically, the workaround is to come up with a custom versioning and incremental migration strategy for your pre/post build scripts anyway, which is a red flag to me that the design is flawed.
  4. VS 2010 database projects abstract developers away from getting better at SQL, much like Web Forms did for HTML/CSS/JavaScript before ASP.NET MVC arrived on the scene. In my experience, developers are seriously lacking in database management skills and need to get better at all aspects of them. And per the ALM Rangers guide, there are several assets not supported by VS 2010 database projects that need to be scripted manually anyway.
  5. The best time to write tests for database changes and to review their impact is while making the changes, when the structural impact is fresh in the developer’s mind. With the diffing tool, the generated alter scripts still need to be reviewed prior to deployment, especially if you don’t have a high-coverage functional and acceptance test suite to ensure the change didn’t break anything. Chances are you have an operations person reviewing the changes before running them on production, and without comprehensive testing you are relying on them to make sure the changes are appropriate. I hope you are working closely with operations during the entire development lifecycle in this situation!

Migrations work for all technologies and are simpler to understand and maintain

If you would like to use migrations today without adopting both ASP.NET MVC 4 and Entity Framework, ThoughtWorks created an open source tool, DBDeploy (with a corresponding .NET version, DBDeploy.NET), that they use with their clients and that handles this elegantly. The only difference between it and the Rails migration approach is that Rails migrations use a DSL for making the changes, while DBDeploy uses plain SQL.

UPDATE (6/29/2012) I now recommend using RoundhousE as it has better support for more databases, uses .NET instead of Java, and gives you dedicated directories for stored procs, functions, and other assets that can get dropped and recreated each time without having existing data come into the picture.

Re-trusting check constraints in SQL doesn’t help for NULLABLE columns

I’ve been going through a large database for a client of mine, finding foreign key and check constraints that are marked “untrusted”. This happens when a constraint has been enabled without validating the existing rows (WITH NOCHECK) – for example, when some rows have foreign key column values with no match in the related table. When a constraint is untrusted, the SQL Server query optimizer can’t rely on it when looking up matches between the two tables, which results in sub-optimal query plans.

Unfortunately, I discovered today that if the foreign key column accepts NULL, you can run the query to re-enable the check constraint without error, but the constraint can still be marked as untrusted (is_not_trusted = 1 in sys.foreign_keys) and will not benefit from the query optimizations available to trusted keys!
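
If you want to hunt these down yourself, here is a small diagnostic sketch (the connection string is hypothetical) that lists untrusted foreign key and check constraints via the is_not_trusted flag in SQL Server’s catalog views:

```csharp
using System;
using System.Data.SqlClient;

// Lists constraints SQL Server considers untrusted.
// is_not_trusted lives in sys.foreign_keys and sys.check_constraints.
public static class UntrustedConstraintFinder
{
    public static void Main()
    {
        const string connectionString = "Server=.;Database=MyDatabase;Integrated Security=true"; // hypothetical

        const string sql = @"
            SELECT name, 'FOREIGN KEY' AS kind FROM sys.foreign_keys WHERE is_not_trusted = 1
            UNION ALL
            SELECT name, 'CHECK' AS kind FROM sys.check_constraints WHERE is_not_trusted = 1";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0} constraint '{1}' is untrusted", reader.GetString(1), reader.GetString(0));
            }
        }
    }
}
```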

Hopefully this helps someone out there reduce the work involved when determining a data optimization strategy for dealing with existing untrusted constraints.

I’m glad Microsoft is embracing convention over configuration

I read Agile Web Development with Rails while visiting San Diego a couple of years ago and was blown away by how well put together the framework is. What the book helps you realize is that if you follow certain naming conventions for your code artifacts (in this case Ruby source files), Rails automatically wires up communication between the different architectural layers of your application.

With the recent release of ASP.NET MVC 1.0, Microsoft’s answer to Ruby on Rails, Microsoft has provided what seems to me to be a simpler approach to web applications, one that lends itself to testability far better than the often-complicated event model of existing ASP.NET Web Forms applications.
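
To illustrate the convention at work, here is a tiny, hypothetical controller – the ProductsController name is just an example – showing how a request gets wired up without any configuration:

```csharp
using System.Web.Mvc;

// With ASP.NET MVC's default route, a request to /Products/Details/5 is dispatched to
// ProductsController.Details(5), and View() looks for a view under Views/Products/Details.
public class ProductsController : Controller
{
    public ActionResult Details(int id)
    {
        ViewData["ProductId"] = id; // hypothetical data handed to the view
        return View();
    }
}
```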

I also downloaded the Silverlight 3 beta, Expression Blend 3 beta, and Microsoft’s Rich Internet Application (RIA) toolkit preview. The new version of Silverlight has a ton of controls, and I love that editable forms with built-in wiring to validation are included out of the box!

With the RIA toolkit installed, you can create a data model with Entity Framework in your web application, create a special link to it in your Silverlight “client” project, and wire up similarly named domain objects for data binding in your Silverlight project; the data binding hits the server over REST transparently. It’s very slick.

ADO.NET Entity Framework 2+ Top 3 “Must Have” Features

I’ve spent quite a bit of time trying to use the ADO.NET Entity Framework for a large web application here and have run across some shortcomings that make it unusable for this project. As such, I’ve come up with my top three requests for the next release(s):

  • Multiple related designer files. Let me create multiple EF diagrams and re-use types across them from a single schema! There is a Using tag in the EF schema that lets you do this, but it can only be set up through the XML editor, and then you can only navigate one way across a relationship between two entities that span designers. Not a good solution! This would also solve the problem of only one person being able to have the designer checked out at a time, which is the most crippling parallel-development and team-scale issue of the current EF implementation for projects with large database schemas.
  • Better SQL statement optimization for SQL Server. Currently, LINQ to SQL generates much better-performing queries than EF for the same database and joins. This is due to the storage-to-conceptual mapping abstraction as it is currently implemented in EF. I’m sure this can be optimized to work better in a future release.
  • Full POCO support. I want to return classes from EF queries that have no base class or mapping attributes. There is an EF-contrib project that does this, but it’s not built in and needs to be more robust. This would let me return POCOs from EF behind other layers of my architecture without any coupling to the EF framework (see the sketch after this list).
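
To be concrete about the POCO request, here is roughly the shape I’m after – the Customer class and ICustomerRepository interface are hypothetical names, just to show that nothing references EF:

```csharp
// What "full POCO support" would mean in practice: a plain class with no EF base class
// or mapping attributes, so layers above never take a dependency on the framework.
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// A hypothetical repository that hides whichever persistence technology sits behind it.
public interface ICustomerRepository
{
    Customer GetById(int id);
    void Save(Customer customer);
}
```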

Large schema data access in agile environment

I’m currently working on a project for a client with a database schema of 150-200 tables. I’ve hit the limit of what’s feasible with the LINQ to SQL or ADO.NET Entity Framework designers, so I’m looking at code generation. I used CodeSmith with their NetTiers framework on a project at a previous company, and it was quite elegant. They ended up using NHibernate after I left to move to Austin and they got a new technical lead, but I think it would have done the job very well.

In most of the web applications I’ve designed over the years, I have used ORMs to generate classes that sit in front of database tables and let me set and get properties backed by fields, traverse relationships defined with foreign keys, and store and update rows. However, I have chosen not to go the domain-driven development route, where each object exposes additional non-persistence-related methods that provide behavior, for a couple of reasons.

First, most web applications seem to naturally evolve some portion of the API that serves the user interface layer (the business logic layer) into a web service at some point. If I create manager/gateway classes that provide CRUD and common queries in front of my domain, that makes it easy to expose as a web service, since I am basically re-exposing a subset of the same API. Of course, for some operations this doesn’t work – for example, when you want the web service to accept an “Order” object that is very rich and contains nested complex types. Second, many of today’s ORM tools are inherently tightly coupled to the database and don’t give you very flexible ways to add behavior to the generated classes.
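
As a rough sketch of that manager/gateway style (Order and OrderManager are made-up names), the shape is something like this, with each method simply delegating to the ORM underneath:

```csharp
using System.Collections.Generic;

// Hypothetical example of the manager/gateway style described above: CRUD plus common
// queries in front of the persistence layer, which maps naturally onto a web service later.
public class Order
{
    public int Id { get; set; }
    public decimal Total { get; set; }
}

public class OrderManager
{
    public Order GetById(int id) { /* delegate to the ORM */ return null; }
    public IList<Order> GetByCustomer(int customerId) { /* common query */ return new List<Order>(); }
    public void Save(Order order) { /* insert or update via the ORM */ }
    public void Delete(int id) { /* delete via the ORM */ }
}
```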

I’m looking into PLINQO, a set of templates for CodeSmith that do smart, non-destructive code generation and allow you to change the properties of your classes independently of the fields backing them, add CLR attributes to those properties, and add business rules that are validated with your class, without having these annotations overwritten when you regenerate the code after the database schema changes. PLINQO has the bonus of generating manager classes that include CRUD operations and get methods for the relationships, indexes, and primary key of each table.

I’m also looking at the LINQ-to-SQL T4 templates written by Oleg Sych. T4 was added to Visual Studio as a somewhat unknown feature, and you can read a lot more about it on Oleg’s blog than I could tell you here. The T4 templates are great when you want to use the LINQ-to-SQL designer to model your domain and actually generate both the database schema and the resulting code objects.

We have many interesting requirements. The client will take over the code base at some point, so we need them to accept the tool we use. We need the tool to be affordable and to provide future value on projects other than this one. We also need the tool to keep us agile enough when the schema and business rules change. I think one of these approaches will work out great in the end.

Late May goodies on Microsoft blogs

Here’s a quick rundown of some of what’s going on around the Microsoft blogosphere related to the development of applications like those created by the custom development teams here at Catapult Systems:

Mapping Rails to ASP.NET 3.5

If you’ve done some Rails work like I have and are starting to look at ASP.NET MVC for your .NET work, or you’re a .NET guy looking at Rails for the first time, you will want to see which technologies are available to do similar things on each platform. I hope the following chart is helpful:

Feature | Rails Technology | ASP.NET 3.5 Technology
Model View Controller architecture | Rails Core (Action Controllers, Views) | ASP.NET MVC
Domain model from relational database | Active Record | ADO.NET Entity Framework
REST | Active Resource | ADO.NET Data Services
Markup language for MVC frameworks | HAML | NHAML
Build platform | Rake | MSBuild

IronRuby development is taking off…

Prior to my recent move from Milwaukee to Austin, I helped a friend create a web application with Ruby on Rails on the side of my day job as a .NET architect. After six years at the architect level on Java and .NET projects, one of the things that struck me instantly as so beneficial was Rails’ standard project directories, which link models, views, and controllers together without any configuration. Documenting and tracking adherence to standards in project structure is a necessary but often repetitious task that can make the difference between a really maintainable enterprise-class solution and one that simply meets the requirements known at first implementation.

The ASP.NET MVC project currently being developed by Microsoft capitalizes on the success of Rails in a big way. It takes advantage of .NET in some ways that Ruby cannot, and it succeeds in bringing some of the rapidity of the Rails platform to ASP.NET without forcing developers familiar with the Visual Studio tools to drop their favorite IDE and join the “all I need is a text editor, man!” camp.

I have to admit, though, that Ruby is a great language: for all the great new things in C# 3.0, the Japanese fellow who created Ruby gave us a really clean syntax, and it’s obvious that the folks in Redmond agree. The IronRuby project, led by John Lam, who joined Microsoft last year, is a full-blown implementation of Ruby whose primary goal is the ability to run existing Ruby programs. Its second goal is to provide first-class interoperability with the existing .NET technology stack, without losing the spirit of the interpreted language that Ruby is.

John and the other folks working on IronRuby have succeeded in this by leveraging Microsoft’s upcoming DLR (Dynamic Language Runtime), a layer on top of the CLR that provides a standard set of services for building dynamic language implementations that target .NET.

What does all this mean? With the current bits you can write Ruby code to create WPF apps, ASP.NET pages, Silverlight applications, and hopefully at some point ASP.NET MVC apps – all with access to the goodness of .NET and, at the same time, to the thousands of open source Ruby projects such as those on RubyForge, where the IronRuby project’s source code is currently hosted.

I’ve been following IronRuby, mostly as an observer, and have seen development really pick up this past month. I made some updates to their project wiki the other day to organize things in anticipation of growth, and I hope more folks will soon join in to help make this a really great technology that might make you more productive on your next .NET solution.
