
Obstacles to an enterprise DVCS

On 26 January 2010 I gave a presentation to Software GR.  The talk was an overview of several trends that we have seen in the version control tools market over the last 40 years.  I often like to follow a talk like this by publishing the same content here on my blog in the form of a complete article.  This time I think I'll just eliminate a lot of the tldr problem and summarize the highlights:

  • The two big trends in version control today are Integration and Decentralization.
  • Integration is driven by ALM.  It is the desire to have all tools used by a development team fully integrated together.
  • Decentralization is driven by the recent wave of DVCS tools like Git and Mercurial.  They offer compelling benefits such as performance, a different kind of scalability, and more flexible workflows.
  • The ALM trend is happening in the enterprise market.  Enterprises want everything integrated with everything else, and they want everything to support their ability to enforce process.
  • The DVCS trend is happening in the open source community.  Born of the legendary cat fight between BitKeeper and the Linux kernel developers, Git and Mercurial are maturing and gathering momentum at a remarkable rate.
  • These two trends are going to clash in a big way.  SourceGear's graphic designer drew me a nice diagram to depict this.
  • The two trends cannot stay separate.  Each one has advantages which are too important for the other community to ignore.
  • But the two trends and their respective communities are a bit like oil and water.
  • Enterprises want tools that constrain.  The open source community wants tools that empower.
  • The benefits of a DVCS would be diluted by integrating it with a bunch of other tools that are highly centralized.
  • Enterprises need at least a little centralization for things like user administration.  In their eyes, complete decentralization without accountability and auditing features is a bug.
  • Even as enterprise attitudes about open source are changing, that change is happening slowly, and the GPL (used by both Git and Mercurial) is still considered the scariest license.
  • So Git and Mercurial are not even close to being enterprise-ready.  Similarly, none of the leading enterprise ALM tools are even close to being a DVCS.
  • I believe that the main enterprise ALM providers (IBM/Rational, Microsoft, Serena and Borland) will all attempt to add DVCS features to their products.  At least two of these companies (IBM/Rational, in a talk by Jean-Michel Lemieux at the Rational Conference in 2009, and Microsoft) have already made public remarks about a desire to move in that direction.
  • And I predict that they will all fail.  It is impossible to turn any of these systems into a true DVCS without a nearly complete rewrite.  The D in DVCS is not a feature which can be added.
  • But all of them will do it anyway, by making compromises.  They will try to add "just enough" Decentralization.  Some of their customers will find the results to be sufficient.
  • Meanwhile, the true DVCS tools will continue to move forward, but their progress toward credible ALM will be slow.  Enterprise-level integration is grunge work, not the kind of coding that hackers do as a labor of love.  Nobody does this stuff without getting paid.
  • So these two trends will continue to be distinct for a while, but the pressure and tension between them will remain, and the areas of overlap are going to continue getting messier.

 

Reflecting on our "SourceSafe Must Die" Campaign

"Do I really look like a guy with a plan?
You know what I am? I'm a dog chasing cars.
I wouldn't know what to do with one if I caught it.
You know, I just ... do ... things."
-- The Joker

On the product side of marketing, planning has served me well.

But on the marcomm side, you know, I just ... do ... things.

And since the whole point of marcomm is to draw attention, I try to do things which are at least a little outrageous:

 

 

Years ago we wanted to tell people about SourceOffSite as a telecommuting solution, so we gave away boxer shorts at trade shows and ran ads advising people to "work in your skivvies".

 

 

When Microsoft did their Software Legends campaign, we spoofed it with Not A Legend.

 

 

We started talking about Vault several months before its release.  Since it was vaporware, we showed up at Tech-Ed with a fog machine in our booth.

 

 

When Vault was released, we promoted the product as a "compelling replacement for Visual SourceSafe" with a movie themed campaign.  We hired Hal Douglas to voice our trailer.  And yes, he started with "In a world...".

 

 

We created a cartoon character called The Evil Mastermind, with twelve full-page print ads forming a complete story arc, plus two full-length comic books distributed at trade shows.

 

 

When we were doing Guitar Hero in our trade show booth, we gave away actual custom guitars with The Evil Mastermind graphics.

 

We did those things because nobody had done them before.  I'll try just about anything.  I just want to see what works.  And afterward, I usually report here on my blog about how these things go.

In the second half of 2009, we did a marketing campaign for Vault.

This is your brain on SourceSafe

As I mentioned above, Vault was originally positioned to be a compelling replacement for SourceSafe.  Since our 5.0 release contains a new Handoff feature which makes it easier than ever to make the transition, we decided to focus this campaign on the product's original positioning, as a SourceSafe replacement.

The creative on this campaign was ... edgy.  In a nutshell, we compared SourceSafe to an addiction.  The ads were loosely patterned after billboards and ads aimed at convincing people not to use illegal drugs.  The basic idea was to portray SourceSafe as something which might send your life into a downward spiral toward a 12-Step program or rehab.

We knew from the beginning that some people were going to be ticked off.  We just weren't sure how many people and who.

We tried testing the ads by taking them home to our spouses for feedback.  This resulted in a few tweaks, but we didn't get any criticism that caused us to change course.  One guy's wife suggested that the ads would be a better reflection of our industry if the models in the stock photos were less attractive.  :-)

So we moved ahead.  The first thing we did was run the ads in MSDN magazine, which is published by an outside firm in close coordination with Microsoft.  Nobody complained, so we kept going.  We ran banner ads on several Microsoft-centric websites.

And then we showed up in November at the Microsoft Professional Developers Conference with this campaign as the theme for the entire booth.  We gave away hundreds of T-shirts saying "VSS Must Die".

You can see the whole thing at vssisdead.com

This campaign was darker and more negative than anything we've done before.  We knew we were pushing the envelope.

Did we go too far?

Well, we certainly didn't expect to win any Most Admired Company awards by running a marketing campaign which portrays the users of our competitor's product as drug addicts.  :-)  But the campaign was intended to be funny, in a "humor noir" sort of way.

Somewhere during the execution of this campaign, I realized that SourceSafe is very much like that dorky kid in high school that gets teased by everybody.

Why do high schoolers pick on other kids?  Because it works.  In high school, popularity is correlated with several factors, but one of them is unkindness.  The meaner you are (to the right people), the more popular you are.  Once the crowd has observed who is getting bullied by the popular kids, others join in.  Even if they don't know the kid, they start ripping on them, just to try and identify with the "in" crowd.

As grownups, most of us know that this is reprehensible.  No kid deserves to be treated this way just because they're different.

But teenagers do it anyway.  And they do it because it gets them what they want.

This particular kid is an easy target.  SourceSafe isn't just a little bit dorky.  We're talking pocket protector, greasy hair, and a sport coat with elbow patches.

When it comes to poking fun, if any company is fair game, it's Microsoft.  And if any Microsoft product is safe to pick on, it's gotta be SourceSafe.  SourceSafe is the bullying target that everyone can agree on.

Dissing SourceSafe is so common that folks do it whether they have used SourceSafe or not.  Just like in high school, people join the bashing just because they think it makes them look cool.

And SourceSafe has basically nobody defending it.  When emacs people get an attitude, the vi fans speak up.  When Visual Studio fans start trash talking, the Eclipse crowd starts showing features.  But nobody stands up for SourceSafe.  People bash it, and SourceSafe just mopes down the hall wearing flood pants and a shirt with the top button done.

So anyway, we chose to go negative on this, but we figured we were in plenty of company.

And while I'm rationalizing and making lame excuses for being a jerk, please note that SourceSafe is NOT REALLY A PERSON.

Results

In general, this campaign worked.  People at PDC loved the T-shirts.  The click-through rate on the banner ads was the highest we have ever seen.

Some people were offended, but we received far more positive feedback than negative.

However, just like the awkward kid at school, SourceSafe doesn't really deserve this.  I've admitted it before, and I'll do it again now:  SourceSafe isn't really that bad.

In its day, SourceSafe was awesome.  When it was created by One Tree in the early 90s, it was nearly revolutionary.  SourceSafe brought ease of use in version control to a whole new level.  Microsoft acquired this product because it was outstanding.

Over the years, SourceSafe hasn't always aged well.  Its architecture didn't fit with the Internet.  By relying on file sharing protocols for network access, SourceSafe ended up with data corruption problems that triggered a tidal wave of criticism.

But overall, SourceSafe has been a very successful piece of software.  Most of us would love to create something that has thousands of happy users 15 years later.

At trade shows, people come up to us and ask why they should switch from SourceSafe to Vault.  We always respond by asking them if they are happy with SourceSafe.  If they say yes, we tell them not to switch.  This conversation has happened at every show I have ever attended.  Lots of people use SourceSafe every day without problems.

And despite those negative ads, I believe SourceGear is doing more for SourceSafe users than any other company, including Microsoft.  We recently shipped SourceOffSite 5.0, a major upgrade with improved performance, new features, and a fancy new ribbon UI.

Eric, if this is an apology, it sucks.

It's not.  I don't owe SourceSafe an apology.  SourceSafe isn't a person.  It's just a bunch of code.

And it's hard to imagine the need to apologize to Microsoft as a company when so many of its employees stopped by the booth at PDC to join the bashing.  Some of them took shirts.

So I'm not really apologizing.  I'm just sharing about my experience and my reactions to it.

One exception:  SourceSafe's principal author was Brian Harry.  In working through the Microsoft acquisition of Teamprise I came to know Brian and developed a great deal of admiration for him.  He is incredibly smart, and his accomplishments are amazing, including SourceSafe, the CLR, and Team Foundation Server.  Brian, if our campaign caused you any personal offense, please accept my public apology.

So anyway, there you have it.  I went negative.  And it worked.

I'm not eager to do it again.

But I have no regrets.  You know, I just ... do ... things.

Credit and blame

I wrote most of this piece in the first person, but the truth is I deserve more blame than credit.  If you were offended or disappointed by this marketing campaign, blame me.  On the other hand, if you liked this marketing campaign, credit John Woolley and Paul Roub.  The creative work here was mostly theirs.

 

 

Comments disabled

Short Version

Sorry folks, until further notice, my blog does not support comments.

Long Version

I've been using Haloscan for comments on this blog.

Haloscan is being turned off by the company that acquired it.

That company offered a transition to a new service, but that transition requires more effort than I am willing to invest (zero).

I downloaded all the old comments in some sort of XML file, but doing anything with that file would require effort.

Investigating other ways of providing comments for this blog would also require effort.

A day may come when the laziness of this blogger fails, when I forsake my procrastination and break all bonds of inertia, but it is not this day.

 

My excuses for not blogging about the Microsoft/Teamprise deal

People keep asking me why I haven't blogged about the Microsoft acquisition of our Teamprise division.

Well, it's kind of complicated.

It all started three days before the signing of the deal when my laptop died.  And I mean it's really dead.  It won't boot, from any device.

Great timing, eh?

Fortunately, all I really needed for working on the deal was email and Microsoft Word, so I just switched over to my netbook.

I completely forgot about the MacBrick Pro until this weekend when I realized that the press coverage was going to hit Monday morning and the only installation of my blogging software was trapped in a lifeless piece of aluminum on my office floor.

So I ran out and bought a new Mac laptop, hoping to get everything going in time to write my blog entry for Monday.

And then I figured, heck, as long as I was doing a completely new setup, why not start off right with an Intel X-25M instead of the stock hard disk?

Getting everything configured wasn't too difficult, but the SSD ended up costing me a lot of time because Monday morning I had to tell the other coders on my project team that I can do a full build in 24 seconds.  All that gloating killed a couple of hours, and by the time I got back to my desk I figured I should check and see how the press coverage was going.

Whoa.  The Microsoft PR machine is amazing!  They got over 230 articles published about the acquisition.  I couldn't get that kind of press coverage without committing a felony.

Right about then I got into an argument with my daughter because I wanted her to walk four blocks from her school over to my office and she said it was too far.  I wish my Mom would call me more often to tell me how much she appreciates the fact that I was a model teenager who never caused my parents any trouble.

So anyway, with hundreds of people already writing about the deal, I needed a new angle.  I figured I had to come up with something cool or not post anything at all.  So I started drafting something, but I got stuck when I couldn't find anybody to confirm whether Kanye West jokes are still funny or not.  (Yo Eric!  I'm really happy for you and I'mma let you finish, but Groove was the greatest Microsoft acquisition of all time!)

A short time later our sales VP walked in to let me know that SourceGear's name was mentioned in the "New York Freaking Times".  Cool.

The next morning I resumed working on this blog entry, or rather, on the infrastructure to support same.  I restored the VMware image from my Time Machine disk, but I couldn't get the product serial number to work.  So I figured maybe it was one of those stupid Snow Leopard bugs that everybody is complaining about, and decided to upgrade to 10.6.2.  But that took hours, because apparently every Steve Jobs disciple on the planet was upgrading their Mac on the same day, so Apple's download servers were really slow.

While I was doing that, the aforementioned daughter asked me to drive her to the mall and I refused.  So she walked FIVE MILES to get there by herself.

Keep that in mind next time you're having trouble understanding the mind of a teenager:  FIVE MILES to the mall is a shorter walk than FOUR BLOCKS to your Dad's office.

Suddenly I realized it had been a whole day since I told any of my coworkers that I can build the whole tree in 24 SECONDS, and well, you know what happened to the rest of my morning.

So then I walked across the street to the coffee shop to pick up a copy of the local newspaper.  As usual, they did a very nice job on the press coverage for us.  And, as usual, our story was below the fold because the main story of the day was about farming.

Keep that in mind next time you're having trouble understanding the mind of Champaign:  If you want your big-time corporate acquisition to be the top story, make sure you work something about corn yields into the deal.

For those of you keeping score at home, that's 232 points for the Microsoft PR team and one point for me.  I'm sure there's some PR guy at Microsoft trying to take credit for Don Dodson's piece in the Tuesday morning edition of the Champaign-Urbana News-Gazette, but that one was MINE.  They may be able to place stories in the New York Freaking Times, but I've got connections too.

I'm not kidding -- building this project's code on some machines can take several minutes, but my new Mac can do a whole build in 24 seconds.  The X-25M is way cool.  I am now seriously considering putting a $700 SSD into my $300 netbook.

This morning I gave up and paid VMware for a new serial number, and here I am writing in my blog once again.

As I write this, the realization hits me.  I got frustrated because I couldn't move my VMware installation to my new machine.  My company just had a liquidity event.  I could have paid VMware $79 to solve the problem, but instead, I decided it would be better to thrash on that problem for three days and THEN pay the $79.  Yep, I'm in the big leagues now.

So anyway, if you haven't heard, Microsoft announced Monday morning that it has acquired our Teamprise division.  I think the deal ended up being a nice win for both Microsoft and SourceGear.

I'll be at PDC next week.  Stop by the SourceGear booth and say hi.

 

Vault 5.0 has shipped

Hooray!  Vault 5.0 has shipped!

The release notes contain an overview of what's new.

 

Vault 5.0 Beta 2

Last week's beta 2 release means that the long-awaited version 5.0 of SourceGear Vault is coming soon.  This includes the regular edition of Vault as well as the "much more better" edition which has integrated bug-tracking.  (The latter product is actually called SourceGear Fortress and carries the version number 2.0, but its heart is still Vault.)

This release has numerous improvements, but for now I want to highlight one new feature which we call "VSS Handoff".  Basically, Handoff is a simpler and faster way of importing a SourceSafe database.  Instead of converting all your old history, Vault simply wraps your VSS database and makes it part of your Vault repository.  After that, all new checkins will go into the regular Vault database.  For history operations which need to access stuff that happened before the Handoff, the VSS database is seamlessly referenced.  The transition from SourceSafe can't get more painless than this.

Bottom line:  If you are still using SourceSafe, Vault 5 will remove your last excuse.

In fact, shortly after Vault 5 is released, I plan to go on a world tour.  If you are still clinging to SourceSafe, I will visit your office.  I will taunt you mercilessly and suggest an MRI to confirm that there is nothing between your ears but bone.  And I will drench you with my new Super Soaker Max Infusion Flash Flood Water Blaster.

And I will be morally justified.  You've been given many opportunities to switch to any one of several dozen competent version control tools.  And yet, it's 2009 and you're still using SourceSafe.  Surely you didn't expect this to end well?

BTW, for more details about Vault 5, check out the recent blog entries by Jeremy or Paul.  :-)

 

IBM Rational Software Conference

Anybody attending the Rational Software Conference in Orlando next week?

I've been making very last-minute plans to be there for some meetings, but I'll have some free time, and it's always cool to connect a face with an email address.  So if you're a reader of my blog and will be at the Rational conference next week, drop me an email.

And yes, yes I know this blog entry should really have been a tweet.  I just haven't gotten into the Twitter thing at all yet, but this very moment is the first time I've thought maybe I should.  :-)

 

Time and Space Tradeoffs in Version Control Storage

Storage is one of the most difficult challenges for a version control system.  For every file, we must store every version that has ever existed.  The logical size of a version control repository never shrinks.  It just keeps growing and growing, and every old version needs to remain available.

So, what is the best way to store every version of everything?

As we look for the right scheme, let's remember three things we consider to be important:

  • Data integrity is paramount.  In a version control tool, nothing can be considered to be more important than guarding the safety of the data.

  • Performance is critical.  Software developers have about as much patience as a German Shepherd sitting in front of a pot roast.

  • Space matters too.  We're going to be storing lots of data, much of which is being kept almost entirely for the purpose of archiving history.  We'd prefer to keep this archive as compact as possible.

In this blog entry I will report the results of some exploration I've been doing.  I am experimenting with different ways of storing the full history of one source code file.  In this case, the file comes from the source code for SourceGear Vault.  It has been regularly edited for almost seven years.  There are 508 versions of this file.

As I describe the various things I have tried, a running theme will be the classic tradeoff of space vs. speed.  In physics, we know that matter and energy are interchangeable.  In computer science, we know that time and space are interchangeable.  Usually, we can find a way to make things faster by using more space, or make things smaller by taking more time.

As I said, I'll be storing 508 versions of the same file.  It's a C# source code file.  For each attempt, I will report two things:

  • The total amount of space required to store all 508 versions.

  • The total amount of time required to retrieve (or decompress or decode) all 508 versions, one at a time.
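Both measurements can be taken with a small harness.  The following is a minimal Python sketch, not the code used for the experiments in this article.  The names `measure`, `encode_full`, and `decode_full` are hypothetical, and the sample data is made up.

```python
import time

def measure(versions, encode, decode):
    """Return (total_bytes, seconds) for one storage scheme.

    encode: list of version bytes -> list of stored blobs
    decode: (stored blobs, index) -> the original bytes of that version
    """
    stored = encode(versions)
    total_bytes = sum(len(blob) for blob in stored)

    start = time.perf_counter()
    for i in range(len(versions)):
        # Fetch every version, one at a time, and check that it round-trips.
        assert decode(stored, i) == versions[i]
    elapsed = time.perf_counter() - start
    return total_bytes, elapsed

# Baseline scheme: every version stored in full, uncompressed.
def encode_full(versions):
    return list(versions)

def decode_full(stored, i):
    return stored[i]

if __name__ == "__main__":
    fake_versions = [b"line\n" * n for n in range(1, 6)]  # made-up data
    size, secs = measure(fake_versions, encode_full, decode_full)
    print(size, "bytes,", round(secs, 6), "sec")
```

Any other scheme can be plugged in by supplying a different encode/decode pair.  Note that the round-trip check inside the loop is included in the timing, so the numbers are only useful for comparing schemes against each other.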

Before we get started, a few caveats:

  • I realize that these experiments would yield different results for a different kind of file.  If you're storing source code, there might be some things here you can apply.  If you're storing JPEG images, not so much.

  • All these experiments were done on my MacBook Pro laptop.  The CPU is a Core 2 Duo, which I consider to be decently fast.  But like most laptops, this machine has an I/O system which I consider to be quasi-crappy.  I would probably get somewhat different results if I were running on a more serious piece of hardware.

OK, how should we store these 508 versions of the file?

No compression at all

As a first attempt, let's just store them.  No compression or funky encoding.  Each of the 508 versions will be stored in full and uncompressed form.

This is the starting point, even if it is not very practical.

Size:  112,643 KB

Time:  2.5 sec

#ifdef DIGRESSION

Yes, dear reader, I admit that this file is far too long.

You can do the math.  If the archive takes 112 MB and there are 508 versions, then each one is about 222 KB.  That's pretty big for a source code file.

Actually, it's worse than you think.  The 222 KB figure is just the average.  The first version of the file is around 90 KB.  The latest version is over 400 KB.

In our defense, I'd like to point out that this piece of code needs to stay compatible with .NET 1.1, so the entire class must be in a single file.  However, I'd still have to answer to the charge of "First Degree Failure to Refactor".  Fine.  I'll have my attorney contact you to plead out on a lesser charge.  I'm thinking maybe "Third Degree Contributing to the Delinquency of an Intern", or something like that.

#endif

This "full and uncompressed" format uses an awful lot of space, but it is also the fastest.  We will find ways of making this smaller, but all of those ways will be slower.

The relevant questions are:

  • How much smaller?
  • How much slower? 

Some solutions will allow us to make this a lot smaller and only a little slower.  Those are interesting.  Other possibilities will be only a little smaller but a lot slower.  Those are not so interesting.

Simple compression

OK, for our next idea, let's just compress every version with zlib.

Size:  22,516 KB

Time:  4.0 sec

The results of this idea are surprisingly impressive.  The archive is over 80% smaller, and only about 60% slower.  That's darn good, considering that I didn't have to be terribly clever.
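As a rough illustration of this step, here is a Python sketch using the standard `zlib` module.  The function names are hypothetical, and the experiments in this article were not necessarily run this way.  The two helpers at the end simply redo the percentage arithmetic from the figures above.

```python
import zlib

def store_compressed(versions):
    """Compress each version independently with zlib."""
    return [zlib.compress(v) for v in versions]

def fetch_compressed(stored, i):
    """Any version can be recovered by itself with one decompress call."""
    return zlib.decompress(stored[i])

def percent_smaller(new_kb, old_kb):
    return 100.0 * (1 - new_kb / old_kb)

def percent_slower(new_sec, old_sec):
    return 100.0 * (new_sec / old_sec - 1)

if __name__ == "__main__":
    # Figures reported in the text: 112,643 KB -> 22,516 KB, 2.5 sec -> 4.0 sec.
    print(round(percent_smaller(22516, 112643), 1), "% smaller")
    print(round(percent_slower(4.0, 2.5), 1), "% slower")
```

Because every version is compressed on its own, fetching one version never requires reading any other.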

This tradeoff is probably worth it.  In fact, it establishes a new baseline that might be tough to beat.

How do we get better than this?

Deltas

Instead of just compressing every file independently, we could store things as deltas.  Think of a delta as simply the difference between one version and the next.

Compression with zlib takes one standalone thing and makes an equivalent standalone thing which is smaller.

In contrast, a delta is a representation of the differences between two files.  Suppose that somebody takes file X and makes a few changes to it, resulting in file Y.  With a delta algorithm, we could calculate the delta between X and Y, and call it D.  Then, instead of storing Y, we can store D.

The nice thing here is that D will be approximately the size of the edits, regardless of the size of the two files.  If X was a 100 MB file and Y was the same file with an extra 50 bytes appended to the end, then D will be somewhere around 50 bytes.

A delta is a concept which might be implemented in a lot of different ways.  In my case, the delta algorithm I am using is VCDIFF, which is described in RFC 3284.  We have our own implementation of VCDIFF.  Other implementations include xdelta and open-vcdiff.

The important thing to remember about deltas for storage is that you must have the reference item.  D is a representation of Y, but only if you have X handy.  X is the reference.
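To make the X, Y, and D idea concrete, here is a toy delta in Python.  It is a sketch under some loud assumptions: it works on lists of lines via the standard `difflib` module, and it is not VCDIFF, although its copy and add instructions are loosely in the spirit of VCDIFF's COPY and ADD.  `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher

def make_delta(reference, target):
    """Encode target as copy/add instructions against reference (lists of lines)."""
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse reference[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("add", target[j1:j2]))  # new material, stored literally
        # "delete" needs no instruction: those lines are simply never copied
    return ops

def apply_delta(reference, ops):
    """Rebuild the target.  This only works if the reference is at hand."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(reference[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

if __name__ == "__main__":
    x = ["line %d\n" % n for n in range(1000)]
    y = x + ["appended\n"]
    d = make_delta(x, y)
    print(d)                       # one copy instruction plus one added line
    print(apply_delta(x, d) == y)
```

In the demo, Y is X with one line appended, so D holds a single copy instruction and the one new line, no matter how long X is.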

OK, it should be obvious that this concept can be helpful in storing a repository, but how do we set things up?

One big delta chain

As a first attempt, let's store all 508 versions as a big chain of deltas.  Every version is stored as a delta against the version just before it.  Version 1 is the reference, and is the only version that is not stored as a delta. 

Size:  7,682 KB

Time:  Way too long to wait

Wow -- this is really small.  It's over 93% smaller than the full/uncompressed form.  It'll be hard to find a general purpose approach that is smaller than this.

But good grief this is slow.  Fetching version 508 takes an eternity, because first you have to construct a temporary version of 507.  And to construct version 507, you first have to construct a temporary version of 506.  And so on.
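A little arithmetic shows how bad this gets.  The sketch below counts delta applications only, and assumes each fetch starts from version 1 with no caching of intermediate results (an assumption of this sketch, not a description of the benchmark code).

```python
def chain_ops(n):
    """Delta applications needed to rebuild version n (1-based) from version 1."""
    return n - 1

worst = chain_ops(508)                             # 507
total = sum(chain_ops(n) for n in range(1, 509))   # 128778

if __name__ == "__main__":
    print(worst, total)
```

Fetching all 508 versions one at a time costs on the order of 128,000 delta applications, which grows with the square of the history length.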

Key frames

Let's try something else.  The problem with the chaining case above is that retrieving version 508 requires us to go all the way back to version 1, which is incredibly inefficient.  Instead, let's insert "key frames" every 10 versions.  We borrow this idea from the video world, where compressed video streams store most frames as deltas but periodically (say, every 10 seconds) insert a full frame that can be decoded on its own.

By using key frames with chaining deltas, we can cut the time required to fetch the average version of the file.  For example, with a key frame every 10 versions, we get most of the benefits of chaining, but in the worst case, we only need 9 delta operations to retrieve any version.

Size: 18,024 KB

Time: 41.0 sec

This is better, but still not very good.  The compression here isn't much better than zlib, and the perf is still a lot worse.  Compared to zlib, we don't want to pay a 10x speed penalty just to get 20% better compression.
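For readers who want to see the layout, here is a minimal Python sketch of chained deltas with key frames.  It uses a toy line-based delta built on `difflib` rather than VCDIFF, all names are hypothetical, and indexes are 0-based (index 0 is version 1).

```python
from difflib import SequenceMatcher

KEY_EVERY = 10  # one key frame per group of 10 versions, as in the text

def make_delta(ref, target):
    """Toy line-based delta (not VCDIFF): copy ranges of ref, add new lines."""
    ops = []
    sm = SequenceMatcher(None, ref, target, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("add", target[j1:j2]))
    return ops

def apply_delta(ref, ops):
    out = []
    for op in ops:
        out.extend(ref[op[1]:op[2]] if op[0] == "copy" else op[1])
    return out

def store_chain(versions):
    """Indexes 0, 10, 20, ... hold full copies; every other version is a
    delta against the version immediately before it."""
    stored = []
    for i, v in enumerate(versions):
        if i % KEY_EVERY == 0:
            stored.append(("full", v))
        else:
            stored.append(("delta", make_delta(versions[i - 1], v)))
    return stored

def fetch_chain(stored, i):
    """Rebuild version i; returns (lines, number_of_delta_operations)."""
    key = i - (i % KEY_EVERY)          # nearest key frame at or before i
    lines = stored[key][1]
    for j in range(key + 1, i + 1):    # walk forward through the chain
        lines = apply_delta(lines, stored[j][1])
    return lines, i - key

if __name__ == "__main__":
    history = [["line %d\n" % k for k in range(n)] for n in range(1, 31)]
    stored = store_chain(history)
    print(max(fetch_chain(stored, i)[1] for i in range(len(history))))
```

The worst case within any group is the last version before the next key frame, which needs 9 delta applications.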

All the key frames are stored as full and uncompressed files, and they're taking up a lot of space.  Maybe we should zlib those key frames?

Size: 9,092 KB

Time: 42.7 sec

Now at least the compression is starting to look interesting.  This is less than half the size of the zlib case, and 91.9% smaller than the full form, which is a level of compression that is probably worth the trouble.  But the overall perf is still quite slow.  In fact, it's even slower here than plain chaining with key frames, because we have to un-zlib the key frame.

Flowers

The big problem here is that chains of deltas are killing our performance.  Chained deltas can be used to make things very small because each delta matches up nicely with one set of user edits.  But chained deltas are slow because we need multiple operations to retrieve a given file.

Another approach would be to use each reference for more than one delta.  I call this the flower approach.  With a flower, we deltify a line of versions by picking one version (say, the first one) and using it as the reference to make all the others into deltas.

Flower deltas should be much faster, since any file can be reconstructed with just one undeltify operation.

So let's try to flower all 508 versions using version 1 as the reference for all of them.

Size:  35,851 KB

Time:  10.9 sec

As expected, the performance here is much better.

But the overall space savings is lousy.  Only version 2 was based directly on version 1.  Every version after that has less and less in common with version 1, so the delta algorithm can't draw as much stuff from the reference.

This particular approach isn't going to win.  Plain zlib is both smaller and faster.
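The drift away from version 1 is easy to reproduce on made-up data.  The Python sketch below uses the number of lines that cannot be copied from the reference as a rough stand-in for delta size.  The history is invented for illustration, and `added_lines` is a hypothetical helper, not part of any real delta library.

```python
from difflib import SequenceMatcher

def added_lines(ref, target):
    """Rough proxy for delta size: lines of target not copied from ref."""
    sm = SequenceMatcher(None, ref, target, autojunk=False)
    return sum(j2 - j1 for tag, _, _, j1, j2 in sm.get_opcodes()
               if tag in ("replace", "insert"))

# Made-up history: each new version rewrites one more line of a 100-line file.
versions = [["v1 line %d\n" % k for k in range(100)]]
for n in range(2, 12):
    nxt = list(versions[-1])
    nxt[n] = "edited in v%d\n" % n
    versions.append(nxt)

# Flower: every version is a delta against version 1.
flower = [added_lines(versions[0], v) for v in versions[1:]]
# Chain: every version is a delta against its predecessor.
chain = [added_lines(versions[i - 1], versions[i])
         for i in range(1, len(versions))]

if __name__ == "__main__":
    print("flower:", flower)   # grows with distance from version 1
    print("chain: ", chain)    # stays at the size of one edit
```

In this toy history the chained deltas stay at one line each, while the flower deltas grow steadily as the versions accumulate edits relative to version 1.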

Flowers with key frames

Maybe we should try the flower concept with key frames?

Like before, every 10 frames go together as a group.  But instead of chaining, we're going to run each group as a flower.  The first version in the group will serve as the reference for the other 9.  We can reasonably assume that the deltification of frame 10 won't be as good as frame 2, but hopefully 10 and 1 still have enough in common to be worthwhile.

Size:  18,648 KB

Time:  12.2 sec

Wow.  This looks a lot better than chaining.  The space used is about 17% smaller than zlib, but instead of being 10 times slower, it's only 3 times slower.
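The difference between the two grouped layouts comes down to which version each delta is stored against.  This Python sketch counts delta applications for each layout, assuming 1-based version numbers and groups of 10.  The function names are hypothetical, and operation counts are only a rough proxy for the measured times, since they ignore the cost of reading key frames and the larger size of flower deltas.

```python
GROUP = 10

def reference_for(n):
    """Version that version n is stored against, or None for a key frame."""
    first = n - (n - 1) % GROUP      # first version in n's group: 1, 11, 21, ...
    return None if n == first else first

def flower_ops(n):
    """Flower: at most one delta application, straight from the key frame."""
    return 0 if reference_for(n) is None else 1

def chain_ops(n):
    """Chain: one delta application per step away from the key frame."""
    return (n - 1) % GROUP

total_flower = sum(flower_ops(n) for n in range(1, 509))   # 457
total_chain = sum(chain_ops(n) for n in range(1, 509))     # 2278

if __name__ == "__main__":
    print(total_flower, total_chain)
```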

Of course, we can use the same trick we tried before.  Let's zlib all those key frames.

Size:  9,716 KB

Time:  13.6 sec

This seems like a potentially useful spot.  It's less than half the size of zlib.  The perf is still a lot slower than zlib, but at only about 3X slower, the tradeoff is the best we've seen so far.

OK.  So we've made a lot of progress on saving space, but 3X slower than zlib still seems like a high price to pay.  Do we really want to make that trade?  Do we have to?

Some things get retrieved more often than others

Let's look at the patterns for how this data is going to be accessed.

I've been reporting the total time required to fetch all 508 versions of the file.  However, this benchmark doesn't reflect real usage very well at all.  In practice, the recent stuff gets retrieved a LOT more often than the older stuff.  Most of the time, developers are updating their working copy to whatever is latest.

As a rough guess, I'm going to say that version 508 gets retrieved twice as often as 507, which gets retrieved twice as often as 506, and so on.  A timing test based on that assumption gives us results something like this:

Full:  1.1 sec

Zlib:  1.7 sec

One big flower:  4.0 sec

Flower with key frames:  5.1 sec

Chain with key frames:  24.5 sec

Not too surprising.
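For the curious, the "twice as often" assumption can be written down as a weighted average.  This is only a sketch of the idea.  The benchmark above may have been built differently (for example, as a total over some number of fetches), and the function name is made up.

```python
def weighted_fetch_time(fetch_times):
    """fetch_times[k] is the time to fetch version k+1, oldest first.

    The newest version gets weight 1, the one before it 1/2, then 1/4,
    and so on, so each version counts half as much as its successor.
    """
    n = len(fetch_times)
    weights = [2.0 ** -(n - 1 - k) for k in range(n)]
    total = sum(w * t for w, t in zip(weights, fetch_times))
    return total / sum(weights)
```

With weights like these, the last handful of versions decide almost everything.  Whatever scheme makes those few versions fast will look good on this benchmark.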

In the spirit of optimizing performance for the most common operations, why not keep all the more recent versions in a faster form?  We could still use something more aggressive for the older stuff, but we can probably get a nice performance boost if we just refuse to use deltification for the most recent 10 versions of the file.

But how should we store those 10 versions?  In full format?  Or zlib?  This is an arbitrary choice with a clear tradeoff.  For now, I choose zlib.  If we wanted a little more speed at the expense of using a little more space, we could just keep those 10 versions in full form.

By choosing zlib for the most recent 10 revisions, now my "get the recent stuff" benchmark runs in 1.7 seconds no matter what scheme I use.
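As a rough sketch, the policy looks something like this.  The function name and the labels are hypothetical, and the choice of flowers with key frames for the older versions is just one of the options discussed above.

```python
def plan_storage(version_count, recent=10, group_size=10):
    """Return a list where entry i says how version i+1 would be stored.

    The most recent `recent` versions are plain zlib.  Everything older
    is grouped into flowers with a key frame every `group_size` versions.
    """
    first_recent = max(0, version_count - recent)
    plan = []
    for i in range(version_count):
        if i >= first_recent:
            plan.append("zlib")       # hot: no deltification at all
        elif i % group_size == 0:
            plan.append("key frame")  # reference for its group
        else:
            plan.append("delta")      # delta against its key frame
    return plan
```

This is a snapshot for a fixed number of versions.  A version that falls out of the recent window would have to be re-encoded at some point, and this sketch does not model that.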

But we still care about performance for the case where somebody fetches an older version, even if that fetch doesn't happen as often.  That's the point of version control storage.  Every version has to be available.  And when somebody does fetch version 495, we want our version control system to still be reasonably fast.

Reversing the direction of the chains

Since the more recent versions are retrieved more often, obviously, our chains are all going the wrong direction.  If we had them go the other way, then retrieval would get slower as the versions get older instead of as the versions get newer.

But this approach doesn't lend itself well to the way version control repositories naturally grow in the wild.  In these tests, I have mostly ignored the issues of constructing each storage scheme.  I've already got all 508 versions, so I'm just fiddling around with different schemes of storing them all, comparing size and retrieval time.

In practice, those 508 versions happened one at a time, in order.  If we're going to store the versions with backward chains, then each time we commit a new version of the file, we're going to need to re-encode something that was previously stored.  This makes the commit operation slower.  It is also a questionable idea from the perspective of data integrity.  The safest way to maintain data is to not touch it after it has been written.  Once it's there, leave it alone.
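To make the re-encoding cost concrete, here is a toy backward chain.  It is a sketch under assumptions:  the delta is faked with a zlib preset dictionary rather than a real delta algorithm, and the class and its rewrites counter are hypothetical names.

```python
import zlib


def delta(data, reference):
    # Stand-in for a real delta encoder (an assumption for this sketch).
    c = zlib.compressobj(zdict=reference)
    return c.compress(data) + c.flush()


def undelta(blob, reference):
    d = zlib.decompressobj(zdict=reference)
    return d.decompress(blob) + d.flush()


class BackwardChain:
    """Newest version stored in full; each older one is a delta against
    the version that came after it."""

    def __init__(self):
        self.records = []
        self.rewrites = 0  # how many times stored data was touched again

    def commit(self, data):
        if self.records:
            old_head = self.records[-1]
            # The previous head was stored in full.  Now it has to be
            # replaced by a delta against the new head.
            self.records[-1] = delta(old_head, data)
            self.rewrites += 1
        self.records.append(data)

    def fetch(self, i):
        # Start at the newest version and walk backward to version i.
        current = self.records[-1]
        for j in range(len(self.records) - 2, i - 1, -1):
            current = undelta(self.records[j], current)
        return current
```

Fetching the newest version costs nothing extra, which is what we wanted.  But every commit after the first one rewrites a record that was already on disk.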

One case where we might want to be a bit more liberal toward rewriting data is in a "pack" operation, such as the one Git has.  It wouldn't be terribly crazy to consider a standalone pack operation in a DVCS to be better than rewriting data for each commit, for several reasons:

  • It allows us to keep commit fast.

  • Since pack would be done offline, its implementation can be focused more on data integrity and space savings than on performance.

  • Since the pack code can be separated from the commit code, all the risky code can be kept together where it is easier to maintain.

  • Since the pack operation is separate from commit, a user that does not want to run pack does not have to.

  • A pack operation in a DVCS is happening on just one instance (clone) of the repository, not on the only copy.

Anyway, a pack operation would allow us to use storage schemes that do not work well on the fly, incrementally updating as each version comes in.

Visualizing the results

This plot makes it easier to see which schemes are better than others. 

In my experimentation, I actually tried a lot more schemes.  For example, instead of key frames every 10 versions, I also tried every 5, 15 and 20.  However, all those extra data points really cluttered the graph.  So I only included the most important ones here.

  • In the lower right, we find "full".  Very fast and very large.

  • In the upper left, we find "chains".  Very slow and very small.

  • We can ignore any point which is both above AND to the right of any other point.  The "1flower" point is the one where I made one big flower, using version 1 as the reference for every other version.  This scheme ends up being useless since zlib is better in both ways that matter.

  • All the other points represent possible tradeoffs which might be interesting, depending upon our priorities.

Intuitively, the schemes which are closer to the origin are better.  This graph makes it easy to see that "zlib" and "flowers" are probably the two most interesting options I have discussed here.
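The "above AND to the right" rule is easy to apply mechanically.  Here is a small sketch, with a made-up function name, that takes size on one axis and time on the other and throws away every scheme that loses to some other scheme on both.

```python
def worth_considering(schemes):
    """schemes maps a name to (size_kb, seconds).

    A scheme is dropped if some other scheme is both smaller and faster,
    i.e. if its point is above and to the right of another point.
    """
    keep = {}
    for name, (size, secs) in schemes.items():
        dominated = any(
            other_size < size and other_secs < secs
            for other, (other_size, other_secs) in schemes.items()
            if other != name
        )
        if not dominated:
            keep[name] = (size, secs)
    return keep


# The two flower measurements reported earlier in this article.
measured = {
    "flowers with key frames": (18648, 12.2),
    "flowers with zlib'd key frames": (9716, 13.6),
}
survivors = worth_considering(measured)
```

Neither of those two points beats the other on both axes (one is smaller, the other is faster), so both survive.  Which one is better depends on what we care about.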

 

Ten Quirky Issues with Cross-Platform Version Control

A big chunk of the software industry today can mostly ignore the issues of multiple platforms, for one of the following reasons:

  1. They only support Windows.  It's got like 90% market share, so why not?

  2. They serve a web application and don't care what the end user is actually using as long as their browser works.

But version control tools involve more cross-platform concerns than most other kinds of software.  Neither of the reasons above tends to work very well.

  1. If a software team has 450 Windows users and 50 people on Mac or Unix, then a Windows-only solution just won't do.

  2. Since a primary task of a version control tool is to manage source code trees on the user's hard disk, a web application just won't do.

So, even as most coders have moved on to a world where they can remain blissfully ignorant of the problems of writing software for multiple operating systems, those of us who create version control tools are still wrestling with those problems.

And in fact, I claim that our challenges are tougher than most.  Version control users ask for the darndest things, especially in the big enterprise companies.  It's easy to believe that all you need is Windows, Mac, Linux and maybe Solaris.  Then you find out just how prevalent things like AIX and HPUX are.  Terms like "Irix" and "Win95" and "mainframe" get tossed around until you're numb and nothing surprises you anymore.  When somebody asks for a port to an arcane platform, you roll your eyes and wonder if it uses 8-bit bytes or not.

Worse than that, version control vendors aren't just porting to oddball operating systems.  We actually have to make our software interoperate across all those environments.

And that's where things start to get quirky.

  1. On a Linux system, create a file called "README".  In the same directory, create a file called "readme".  Check them both in.  Now go to a Mac and check them both out.  Since the Mac file system is [usually] case insensitive, something bad is going to happen.  Same goes for Windows/NTFS.

  2. On a Mac, checkin a file called "PRN".  Check it out on a Windows system.  That file name is not allowed under Windows, for backward compatibility with MS-DOS.

  3. Under Linux, checkin a file with a name that ends in a dot.  Check it out under Windows.  The trailing dot is probably gone.  Now check the file back in and go back to your Unix system.  If your version control system handled this badly, you've probably got two copies of the file, one with the trailing dot, and one without.  Same goes for a trailing space.

  4. On a Linux system, checkin a file with a path that is 261 characters long.  Check it out under Windows.  This might work.  It probably won't.  It kind of depends on whether .NET is involved or not.  There's a \\?\ trick to get around the limitations of the Win32 layer, but the .NET libraries don't use it.

  5. On a Mac, checkin a file that has a resource fork and some Finder info.  Check it out on a Linux machine.  What happens?  Did stuff show up as xattrs?  Should it have?  On that same Linux machine, make a change and check it back in.  Then check it out on the Mac again.  Is the Finder info still there?

  6. On a Linux machine, checkin a file with a colon in the name.  Check it out on a Mac.  Not sure what'll happen, but it probably won't be what you want.

  7. On a Windows machine, checkin a file with a name that begins with a dash.  Now check it out under Linux and try manipulating it with command-line utilities.  Apps will think the filename is a command-line option.  What if somebody creates a file named "-rf *" and a Linux user tries to rm it?  OK, bad example.  The point remains:  Filenames which begin with a dash may cause more problems on some platforms than on others.

  8. On a Linux machine, create a source code file and check it in.  Check it out on Windows and open it with notepad.  The line-endings are Unix-standard LF, but Windows apps expect CRLF, so notepad shows the entire file as one line.  Now open the same file under Visual Studio.  The file looks fine now.  Now edit a few lines in the middle of the file, check it back in, and check it out on Linux again.  The lines you edited are messed up.

  9. On a recent Ubuntu Linux system, create a file called "Español".  Do the same thing on Mac OS X 10.5.  These two files have the same name, but even though you are [probably] using the utf-8 encoding of Unicode on both systems, the bytes which represent that name do not match.  On the Linux machine, the file name [probably] will be in NFC normalized form (Espa\u00f1ol).  On the Mac, everything gets normalized to NFD (Espan\u0303ol).  When you check these files in and start working with them, bad things will happen unless your version control tool understands what's going on and deals with it appropriately.

  10. On a Unix machine, checkin a symbolic link.  Check it out on Windows.  What happens?
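A version control tool could at least warn about several of the items in this list before they bite.  The following is a rough sketch of what such a check might look like for items 1, 2, 3, 4, 6 and 7.  The function names are hypothetical, the paths are assumed to be repository-relative with forward slashes, and the length check ignores however long the working copy root happens to be.

```python
# Device names that Windows reserves, with or without an extension.
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}
WINDOWS_MAX_PATH = 260  # classic Win32 limit


def name_problems(path):
    """Return a list of portability warnings for one repository path."""
    problems = []
    for part in path.split("/"):
        stem = part.split(".")[0].upper()
        if stem in WINDOWS_RESERVED:
            problems.append(f"{part!r} is a reserved device name on Windows")
        if part.endswith((".", " ")):
            problems.append(f"{part!r} ends in a dot or a space")
        if ":" in part:
            problems.append(f"{part!r} contains a colon")
        if part.startswith("-"):
            problems.append(f"{part!r} begins with a dash")
    if len(path) >= WINDOWS_MAX_PATH:
        problems.append("path is too long for the classic Win32 limit")
    return problems


def case_collisions(paths):
    """Group paths that differ only by case."""
    seen = {}
    for p in paths:
        seen.setdefault(p.casefold(), []).append(p)
    return [group for group in seen.values() if len(group) > 1]
```

Warning is the easy part.  Deciding what to actually do when "README" and "readme" both need to land in the same directory on a Mac is where the real work is.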

Like I said, things get quirky.
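The byte mismatch in item 9 is easy to see for yourself.  This little sketch uses Python's standard unicodedata module.  The same_name helper is a made-up name, and comparing in NFC is just one reasonable choice for how a tool might treat the two spellings as equal.

```python
import unicodedata

name = "Espa\u00f1ol"

# What a Linux file system would [probably] hold, and what the Mac would.
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

nfc_bytes = nfc.encode("utf-8")  # the n-with-tilde is one code point
nfd_bytes = nfd.encode("utf-8")  # a plain n followed by a combining tilde


def same_name(a, b):
    """Compare two file names after normalizing both to the same form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two byte strings differ, and so do the two Python strings, even though they display identically.  Only a comparison that normalizes first, like same_name, says they match.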

 

Mercurial, Subversion, and Wesley Snipes

People keep asking me why I don't talk more about Mercurial in this series of blog entries.  There's a simple answer to that question:

Mercurial isn't very interesting.

Wait, that didn't come out quite right.  Let me try again:

Git is Wesley Snipes.

Mercurial is Denzel Washington.

Hmm, that probably needs further explanation.  First let me give a little background.

I am the founder of a version control software company.  I've done lots of writing about the topic here on my blog.  Currently I am in the process of revising and expanding all those writings to turn them into a book. 

As part of that effort, I have undertaken an exploration of the DVCS world.  Several weeks ago I started writing one blog entry every week, mostly focused on DVCS topics.  In chronological order, here they are:

  • The one where I gripe about Git's index
  • The one where I whine about the way Git allows developers to rearrange the DAG
  • The one where it looks like I am against DAG-based version control but I'm really not
  • The one where I fuss about DVCSes that try to act like centralized tools
  • The one where I complain that DVCSes have a lousy story when it comes to bug-tracking
  • The one where I lament that I want to like Darcs but I can't
  • The one where I speculate cluelessly about why Git is so fast

Along the way, I've been spending some time getting hands-on experience with these tools.  I've been using Bazaar for several months.  I don't like it very much.  I am currently in the process of switching to Git, but I don't expect to like it very much either.

Why am I using these tools if I don't like them?  Because I want the experience.  I don't want to write hearsay.  I want to live with these tools and see what I learn.

So why don't I write about Mercurial?  Because I'm pretty sure I would like it.

I chose Bazaar and Git for the experience.  But if I were choosing a DVCS as a regular user, I would choose Mercurial.  I've used it some, and found it to be incredibly pleasant.  It seems like the DVCS that got everything just about right.  That's great if you're a user, but for a writer, what's interesting about that?

Denzel Washington is a great actor.  Other than that, he lives a pretty normal life.  What's interesting about that?

In contrast, Wesley Snipes gives the world lots of things to write about.  Tax evasion.  Conviction for reckless driving.  Martial arts.

People admire Denzel Washington.  But Wesley Snipes is simply more interesting.

I admire Mercurial.  But Git is more interesting.  Like Snipes, Git is an odd juxtaposition of great power with some quirky flaws and failings.

One more thing:

People also ask me why I don't write more about Subversion.  That's easy too:

Subversion is Morgan Freeman.