Friday, January 29, 2010 Obstacles to an enterprise DVCS
On 26 January 2010 I gave a presentation to Software GR. The talk was an overview of
several trends that we have seen in the version control tools market over the
last 40 years. I often like to follow a talk like this by publishing the same
content here on my blog in the form of a complete article. This time I think
I'll sidestep the tl;dr problem and just summarize the highlights:
- The two big trends in version
control today are Integration and Decentralization.
- Integration is driven by ALM.
It is the desire to have all tools used by a development team fully
integrated together.
- Decentralization is driven by
the recent wave of DVCS tools like Git and Mercurial. They offer
compelling benefits such as performance, a different kind of scalability,
and more flexible workflows.
- The ALM trend is happening in
the enterprise market. Enterprises want everything integrated with
everything else, and they want everything to support their ability to
enforce process.
- The DVCS trend is happening in
the open source community. Born of the legendary cat fight between
BitKeeper and the Linux kernel developers, Git and Mercurial are maturing
and gathering momentum at a remarkable rate.
- These two trends are going to
clash in a big way. SourceGear's graphic designer drew me a nice diagram
to depict this.

- The two trends cannot stay
separate. Each one has advantages which are too important for the other community
to ignore.
- But the two trends and their
respective communities are a bit like oil and water.
- Enterprises want tools that
constrain. The open source community wants tools that empower.
- The benefits of a DVCS would
be diluted by integrating it with a bunch of other tools that are highly
centralized.
- Enterprises need at least a
little centralization for things like user administration. In their eyes,
complete decentralization without accountability and auditing features is
a bug.
- Even as enterprise attitudes
about open source are changing, that change is happening slowly, and the
GPL (used by both Git and Mercurial) is still considered the scariest
license.
- So Git and Mercurial are not even
close to being enterprise-ready. Similarly, none of the leading
enterprise ALM tools are even close to being a DVCS.
- I believe that the main enterprise
ALM providers (IBM/Rational, Microsoft, Serena and Borland) will all
attempt to add DVCS features to their products. At least two of these
companies (IBM/Rational, in a talk by Jean-Michel Lemieux at the Rational
Conference in 2009, and Microsoft)
have already made public remarks about a desire to move in that direction.
- And I predict that they will
all fail. It is impossible to turn any of these systems into a true DVCS
without a nearly complete rewrite. The D in DVCS is not a feature which
can be added.
- But all of them will do it
anyway, by making compromises. They will try to add "just enough"
Decentralization. Some of their customers will find the results to be
sufficient.
- Meanwhile, the true DVCS tools
will continue to move forward, but their progress toward credible ALM will
be slow. Enterprise-level integration is grunge work, not the kind of coding
that hackers do as a labor of love. Nobody does this stuff without
getting paid.
- So these two trends will
continue to be distinct for a while, but the pressure and tension between
them will remain, and the areas of overlap are going to continue getting
messier.
Friday, January 15, 2010 Reflecting on our "SourceSafe Must Die" Campaign
"Do
I really look like a guy with a plan?
You know what I am? I'm a dog chasing cars.
I wouldn't know what to do with one if I caught it.
You know, I just ... do ... things."
-- The Joker

On the product side of marketing, planning has served me
well.
But on the marcomm side, you know, I just ... do ... things.
And since the whole point of marcomm is to draw attention, I
try to do things which are at least a little outrageous:
Years ago we wanted to tell people about SourceOffSite as
a telecommuting solution, so we gave away boxer shorts at trade shows and ran
ads advising people to "work in your skivvies".
When Microsoft did their Software Legends campaign, we
spoofed it with Not A Legend.
We started talking about Vault several months before its
release. Since it was vaporware, we showed up at Tech-Ed with a fog machine in our booth.
When Vault was released, we promoted the product as a
"compelling replacement for Visual SourceSafe" with a movie-themed
campaign. We hired Hal Douglas to
voice our trailer. And yes, he
started with "In a world...".
We created a cartoon character called The Evil Mastermind, with twelve
full-page print ads forming a complete story arc, plus two full-length comic
books distributed at trade shows.
When we were doing Guitar Hero in our trade show booth, we
gave away actual custom
guitars with The Evil Mastermind graphics.
We did those things because nobody had done them before. I'll
try just about anything. I just want to see what works. And afterward, I
usually report here on my blog about how these things go.
In the second half of 2009, we did a marketing campaign for
Vault.
This is your brain on SourceSafe
As I mentioned above, Vault was originally positioned to be
a compelling replacement for SourceSafe. Since our 5.0 release contains a new Handoff feature
which makes it easier than ever to make the transition, we decided to focus
this campaign on the product's original positioning, as a SourceSafe
replacement.
The creative on this campaign was ... edgy. In a nutshell, we
compared SourceSafe to an addiction. The ads were loosely patterned after billboards
and ads aimed at convincing people not to use illegal drugs. The basic idea
was to portray SourceSafe as something which might send your life into a
downward spiral toward a 12-Step program or rehab.
We knew from the beginning that some people were going to be
ticked off. We just weren't sure how many people and who.
We tried testing the ads by taking them home to our spouses
for feedback. This resulted in a few tweaks, but we didn't get any criticism
that caused us to change course. One guy's wife suggested that the ads would
be a better reflection of our industry if the models in the stock photos were
less attractive. :-)
So we moved ahead. The first thing we did was run the ads
in MSDN magazine, which is published by an outside firm in close coordination
with Microsoft. Nobody complained, so we kept going. We ran banner ads on
several Microsoft-centric websites.
And then we showed up in November at the Microsoft
Professional Developers Conference with this campaign as the theme for the
entire booth. We gave away hundreds of T-shirts saying "VSS Must Die".
You can see the whole thing at vssisdead.com.
This campaign was darker and more negative than anything
we've done before. We knew we were pushing the envelope.
Did we go too far?
Well, we certainly didn't expect to win any Most Admired
Company awards by running a marketing campaign which portrays the users of our
competitor's product as drug addicts. :-) But the campaign was intended to be
funny, in a "humor noir" sort of way.
Somewhere during the execution of this campaign, I realized
that SourceSafe is very much like that dorky kid in high school that gets
teased by everybody.
Why do high schoolers pick on other kids? Because it
works. In high school, popularity is correlated with several factors, but one
of them is unkindness. The meaner you are (to the right people), the more
popular you are. Once the crowd has observed who is getting bullied by the
popular kids, others join in. Even if they don't know the kid, they start ripping
on them, just to try and identify with the "in" crowd.
As grownups, most of us know that this is reprehensible. No
kid deserves to be treated this way just because they're different.
But teenagers do it anyway. And they do it because it gets
them what they want.
This particular kid is an easy target. SourceSafe isn't
just a little bit dorky. We're talking pocket protector, greasy hair, and a
sport coat with elbow patches.
When it comes to poking fun, if any company is fair game,
it's Microsoft. And if any Microsoft product is safe to pick on, it's gotta be
SourceSafe. SourceSafe is the bullying target that everyone can agree on.

Dissing SourceSafe is so common that folks do it whether
they have used SourceSafe or not. Just like in high school, people join the
bashing just because they think it makes them look cool.
And SourceSafe has basically nobody defending it. When
emacs people get an attitude, the vi fans speak up. When Visual Studio fans
start trash talking, the Eclipse crowd starts showing features. But nobody
stands up for SourceSafe. People bash it, and SourceSafe just mopes down the
hall wearing flood pants and a shirt with the top button done.
So anyway, we chose to go negative on this, but we figured we
were in plenty of company.
And while I'm rationalizing and making lame excuses for
being a jerk, please note that SourceSafe is NOT REALLY A PERSON.
Results
In general, this campaign worked. People at PDC loved the
T-shirts. The click-through rate on the banner ads was the highest we have
ever seen.
Some people were offended, but we received far more positive
feedback than negative.
However, just like the awkward kid at school, SourceSafe
doesn't really deserve this. I've admitted it before, and I'll do it again now:
SourceSafe isn't really that bad.
In its day, SourceSafe was awesome. When it was created by
One Tree in the early 90s, it was nearly revolutionary. SourceSafe brought
ease of use in version control to a whole new level. Microsoft acquired this
product because it was outstanding.
Over the years, SourceSafe hasn't always aged well. Its
architecture didn't fit with the Internet. By relying on file sharing
protocols for network access, SourceSafe ended up with data corruption problems
that triggered a tidal wave of criticism.
But overall, SourceSafe has been a very successful piece of
software. Most of us would love to create something that has thousands of
happy users 15 years later.
At trade shows, people come up to us and ask why they should
switch from SourceSafe to Vault. We always respond by asking them if they are
happy with SourceSafe. If they say yes, we tell them not to switch. This conversation
has happened at every show I have ever attended. Lots of people use SourceSafe
every day without problems.
And despite those negative ads, I believe SourceGear is
doing more for SourceSafe users than any other company, including Microsoft.
We recently shipped SourceOffSite 5.0,
a major upgrade with improved performance, new features, and a fancy new ribbon
UI.
Eric, if this is an apology, it sucks.
It's not. I don't owe SourceSafe an apology. SourceSafe
isn't a person. It's just a bunch of code.
And it's hard to imagine the need to apologize to Microsoft
as a company when so many of its employees stopped by the booth at PDC to join
the bashing. Some of them took shirts.
So I'm not really apologizing. I'm just sharing about my
experience and my reactions to it.
One exception: SourceSafe's principal author was Brian
Harry. In working through the Microsoft acquisition of Teamprise I came to
know Brian and developed a great deal of admiration for him. He is incredibly
smart, and his accomplishments are amazing, including SourceSafe, the CLR, and
Team Foundation Server. Brian, if our campaign caused you any personal
offense, please accept my public apology.
So anyway, there you have it. I went negative. And it
worked.
I'm not eager to do it again.
But I have no regrets. You know, I just ... do ... things.
Credit and blame
I wrote most of this piece in the first person, but the
truth is I deserve more blame than credit. If you were offended or
disappointed by this marketing campaign, blame me. On the other hand, if you
liked this marketing campaign, credit John Woolley and Paul Roub. The creative
work here was mostly theirs.
Wednesday, December 30, 2009 Comments disabled
Short Version
Sorry folks, until further notice, my blog does not support
comments.
Long Version
I've been using Haloscan for comments on this blog.
Haloscan is being turned off by the company that acquired
it.
That company offered a transition to a new service, but that
transition requires more effort than I am willing to invest (zero).
I downloaded all the old comments in some sort of XML file,
but doing anything with that file would require effort.
Investigating other ways of providing comments for this blog
would also require effort.
A day may come when the laziness of this blogger fails, when
I forsake my procrastination and break all bonds of inertia, but it is not this
day.
Friday, November 13, 2009 My excuses for not blogging about the Microsoft/Teamprise deal
People keep asking me why I haven't blogged about the
Microsoft acquisition of our Teamprise division.
Well, it's kind of complicated.
It all started three days before the signing of the deal
when my laptop died. And I mean it's really dead. It won't boot, from any
device.
Great timing, eh?
Fortunately, all I really needed for working on the deal was
email and Microsoft Word, so I just switched over to my netbook.
I completely forgot about the MacBrick Pro until this
weekend when I realized that the press coverage was going to hit Monday morning
and the only installation of my blogging software was trapped in a lifeless
piece of aluminum on my office floor.
So I ran out and bought a new Mac laptop, hoping to get
everything going in time to write my blog entry for Monday.
And then I figured, heck, as long as I was doing a
completely new setup, why not start off right with an Intel X25-M instead of
the stock hard disk?
Getting everything configured wasn't too difficult, but the SSD
ended up costing me a lot of time because Monday morning I had to tell the
other coders on my project team that I can do a full build in 24 seconds. All
that gloating killed a couple of hours, and by the time I got back to my desk I
figured I should check and see how the press coverage was going.
Whoa. The Microsoft PR machine is amazing! They got over
230 articles published about the acquisition. I couldn't get that kind of
press coverage without committing a felony.
Right about then I got into an argument with my daughter
because I wanted her to walk four blocks from her school over to my office and
she said it was too far. I wish my Mom would call me more often to tell me how
much she appreciates the fact that I was a model teenager who never caused my
parents any trouble.
So anyway, with hundreds of people already writing about the
deal, I needed a new angle. I figured I had to come up with something cool or
not post anything at all. So I started drafting something, but I got stuck
when I couldn't find anybody to confirm whether Kanye West jokes are still
funny or not. (Yo Eric! I'm really happy for you and I'mma let you finish,
but Groove was the greatest Microsoft acquisition of all time!)
A short time later our sales VP walked in to let me know
that SourceGear's name was mentioned in the "New York Freaking Times". Cool.
The next morning I resumed working on this blog entry, or
rather, on the infrastructure to support same. I restored the VMware image
from my Time Machine disk, but I couldn't get the product serial number to
work. So I figured maybe it was one of those stupid Snow Leopard bugs that
everybody is complaining about, and decided to upgrade to 10.6.2. But that
took hours, because apparently every Steve Jobs disciple on the planet was
upgrading their Mac on the same day, so Apple's download servers were really slow.
While I was doing that, the aforementioned daughter asked me
to drive her to the mall and I refused. So she walked FIVE MILES to get there
by herself.
Keep that in mind next time you're having trouble
understanding the mind of a teenager: FIVE MILES to the mall is a shorter walk
than FOUR BLOCKS to your Dad's office.
Suddenly I realized it had been a whole day since I told any
of my coworkers that I can build the whole tree in 24 SECONDS, and well, you
know what happened to the rest of my morning.
So then I walked across the street to the coffee shop to
pick up a copy of the local newspaper. As usual, they did a very nice job on
the press coverage for us. And, as usual, our story was below the fold because
the main story of the day was about farming.
Keep that in mind next time you're having trouble
understanding the mind of Champaign: If you want your big-time corporate
acquisition to be the top story, make sure you work something about corn yields
into the deal.
For those of you keeping score at home, that's 232 points
for the Microsoft PR team and one point for me. I'm sure there's some PR guy
at Microsoft trying to take credit for Don Dodson's piece in the Tuesday
morning edition of the Champaign-Urbana News-Gazette, but that one was MINE. They
may be able to place stories in the New York Freaking Times, but I've got
connections too.
I'm not kidding -- building this project's code on some
machines can take several minutes, but my new Mac can do a whole build in 24
seconds. The X25-M is way cool. I am now seriously considering putting a $700
SSD into my $300 netbook.
This morning I gave up and paid VMware for a new serial
number, and here I am writing in my blog once again.
As I write this, the realization hits me. I got frustrated
because I couldn't move my VMware installation to my new machine. My company just
had a liquidity event. I could have paid VMware $79 to solve the problem, but
instead, I decided it would be better to thrash on that problem for three days
and THEN pay the $79. Yep, I'm in the big leagues now.
So anyway, if you haven't heard, Microsoft announced Monday
morning that it has acquired our Teamprise division. I think the deal ended up
being a nice win for both Microsoft and SourceGear.
I'll be at PDC next week. Stop by the SourceGear booth and
say hi.
Thursday, July 30, 2009 Vault 5.0 has shipped
Hooray! Vault 5.0 has shipped!
The release notes
contain an overview of what's new.
Monday, July 06, 2009 Vault 5.0 Beta 2
Last week's beta 2 release means that the long-awaited version
5.0 of SourceGear Vault is coming soon. This includes the regular edition of
Vault as well as the "much more better" edition which has integrated
bug-tracking. (The latter product is actually called SourceGear Fortress and
carries the version number 2.0, but its heart is still Vault.)
This release has numerous improvements, but for now I want
to highlight one new feature which we call "VSS Handoff". Basically, Handoff
is a simpler and faster way of importing a SourceSafe database. Instead of
converting all your old history, Vault simply wraps your VSS database and makes
it part of your Vault repository. After that, all new checkins will go into
the regular Vault database. For history operations which need to access stuff
that happened before the Handoff, the VSS database is seamlessly referenced.
The transition from SourceSafe can't get more painless than this.
Bottom line: If you are still using SourceSafe, Vault 5
will remove your last excuse.
In fact, shortly after Vault 5 is released, I plan to go on
a world tour. If you are still clinging to SourceSafe, I will visit your
office. I will taunt you mercilessly and suggest an MRI to confirm that there
is nothing between your ears but bone. And I will drench you with my new Super
Soaker Max Infusion Flash Flood Water Blaster.
And I will be morally justified. You've been given many
opportunities to switch to any one of several dozen competent version control
tools. And yet, it's 2009 and you're still using SourceSafe. Surely you
didn't expect this to end well?
BTW, for more details about Vault 5, check out the recent blog
entries by Jeremy
or Paul.
:-)
Thursday, May 28, 2009 IBM Rational Software Conference
Anybody attending the Rational Software
Conference in Orlando next week?
I've been making very last-minute plans to be there for some
meetings, but I'll have some free time, and it's always cool to connect a face
with an email address. So if you're a reader of my blog and will be at the
Rational conference next week, drop me an email.
And yes, yes I know this blog entry should really have been
a tweet. I just haven't gotten into the Twitter thing at all yet, but this
very moment is the first time I've thought maybe I should. :-)
Tuesday, April 28, 2009 Time and Space Tradeoffs in Version Control Storage
Storage is one of the most difficult challenges for a
version control system. For every file, we must store every version that has
ever existed. The logical size of a version control repository never shrinks.
It just keeps growing and growing, and every old version needs to remain
available.
So, what is the best way to store every version of
everything?
As we look for the right scheme, let's remember three things
we consider to be important:
- Data integrity is paramount. In a version control tool,
nothing can be considered to be more important than guarding the safety of
the data.
- Performance is critical. Software developers have about
as much patience as a German Shepherd sitting in front of a pot roast.
- Space matters too. We're going to be storing lots of
data, much of which is being kept almost entirely for the purpose of
archiving history. We'd prefer to keep this archive as compact as
possible.
In this blog entry I will report the results of some
exploration I've been doing. I am experimenting with different ways of storing
the full history of one source code file. In this case, the file comes from
the source code for SourceGear Vault. It has been regularly edited for almost
seven years. There are 508 versions of this file.
As I describe the various things I have tried, a running
theme will be the classic
tradeoff of space vs. speed. In physics, we know that matter and energy
are interchangeable. In computer science, we know that time and space are
interchangeable. Usually, we can find a way to make things faster by using
more space, or make things smaller by taking more time.
As I said, I'll be storing 508 versions of the same file.
It's a C# source code file. For each attempt, I will report two things:
- The total amount of space required to store all 508
versions.
- The total amount of time required to retrieve (or
decompress or decode) all 508 versions, one at a time.
Before we get started, a few caveats:
- I realize that these experiments would yield different
results for a different kind of file. If you're storing source code,
there might be some things here you can apply. If you're storing JPEG
images, not so much.
- All these experiments were done on my MacBook Pro laptop.
The CPU is a Core 2 Duo, which I consider to be decently fast. But like
most laptops, this machine has an I/O system which I consider to be
quasi-crappy. I would probably get somewhat different results if I were
running on a more serious piece of hardware.
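For anyone who wants to repeat this kind of experiment, the two measurements described above could be collected with a harness along these lines. This is only a sketch: the `benchmark`, `store_full` and `fetch_full` names and the tiny stand-in data are made up for illustration, and it assumes each scheme keeps its archive as a list of byte strings.

```python
import time

def benchmark(store, fetch, versions):
    """Return (total stored bytes, seconds to fetch every version once)."""
    archive = store(versions)
    size = sum(len(blob) for blob in archive)
    start = time.perf_counter()
    for n in range(1, len(versions) + 1):
        fetch(archive, n)          # retrieve versions one at a time
    return size, time.perf_counter() - start

# Baseline scheme: every version stored in full, uncompressed.
def store_full(versions):
    return list(versions)

def fetch_full(archive, n):
    return archive[n - 1]          # versions are numbered from 1

# Hypothetical stand-in data, not the real 508 versions of the C# file.
versions = [b"// version %d\n" % k + b"class C {}\n" * (10 + k) for k in range(1, 6)]
size, seconds = benchmark(store_full, fetch_full, versions)
```

Any other scheme can be plugged in by swapping the `store` and `fetch` functions.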
OK, how should we store these 508 versions of the file?
No compression at all
As a first attempt, let's just store them. No compression
or funky encoding. Each of the 508 versions will be stored in full and
uncompressed form.
This is the starting point, even if it is not very
practical.
Size: 112,643 KB
Time: 2.5 sec
#ifdef DIGRESSION
Yes, dear reader, I admit that this file is far too long.
You can do the math. If the archive takes 112 MB and there
are 508 versions, then each one is roughly 220 KB. That's pretty big for a source code
file.
Actually, it's worse than you think. The 220 KB figure is
just the average. The first version of the file is around 90 KB. The latest
version is over 400 KB.
In our defense, I'd like to point out that this piece of code
needs to stay compatible with .NET 1.1, so the entire class must be in a single
file. However, I'd still have to answer to the charge of "First Degree Failure
to Refactor". Fine. I'll have my attorney contact you to plead out on a
lesser charge. I'm thinking maybe "Third Degree Contributing to the
Delinquency of an Intern", or something like that.
#endif
This "full and uncompressed" format uses an awful lot of
space, but it is also the fastest. We will find ways of making this smaller, but
all of those ways will be slower.
The relevant questions are:
- How much smaller?
- How much slower?
Some solutions will allow us to make this a lot smaller and
only a little slower. Those are interesting. Other possibilities will be only
a little smaller but a lot slower. Those are not so interesting.
Simple compression
OK, for our next idea, let's just compress every version
with zlib.
Size: 22,516 KB
Time: 4.0 sec
The results of this idea are surprisingly impressive. The
archive is over 80% smaller, and only about 60% slower. That's darn good,
considering that I didn't have to be terribly clever.
This tradeoff is probably worth it. In fact, it establishes
a new baseline that might be tough to beat.
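In code, the idea is about as simple as it sounds. Here is a minimal sketch using Python's standard `zlib` module. The helper names and the stand-in data are hypothetical, and the actual measurements above were not produced by this code.

```python
import zlib

def store_versions(versions):
    """Compress each version independently; returns a list of zlib blobs."""
    return [zlib.compress(v) for v in versions]

def fetch_version(blobs, n):
    """Retrieve version n (1-based) with a single decompress call."""
    return zlib.decompress(blobs[n - 1])

# Hypothetical stand-in data: a slowly growing "source file".
versions = [b"".join(b"int field%d;\n" % i for i in range(50 + k)) for k in range(20)]
blobs = store_versions(versions)
full_size = sum(len(v) for v in versions)
zlib_size = sum(len(b) for b in blobs)
```

Every version stays independent of every other version, which is why retrieval cost is flat: one decompress call, no matter which version is requested.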
How do we get better than this?
Deltas
Instead of just compressing every file independently, we
could store things as deltas. Think of a delta as simply the difference
between one version and the next.
Compression with zlib takes one standalone thing and makes
an equivalent standalone thing which is smaller.
In contrast, a delta is a representation of the differences
between two files. Suppose that somebody takes file X and makes a few changes
to it, resulting in file Y. With a delta algorithm, we could calculate the
delta between X and Y, and call it D. Then, instead of storing Y, we can store
D.
The nice thing here is that D will be approximately the size
of the edits, regardless of the size of the two files. If X was a 100 MB file
and Y was the same file with an extra 50 bytes appended to the end, then D will
be somewhere around 50 bytes.
A delta is a concept which might be implemented in a lot of
different ways. In my case, the delta algorithm I am using is VCDIFF, which is
described in RFC 3284. We
have our own implementation of VCDIFF. Other implementations include xdelta and open-vcdiff.
The important thing to remember about deltas for storage is
that you must have the reference item. D is a representation of Y, but only if
you have X handy. X is the reference.
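To make the X, Y and D relationship concrete, here is a toy delta built on Python's `difflib`. Please note the assumptions: this is not VCDIFF and not the implementation used for the measurements in this post. It is just a copy/add instruction list with made-up names (`make_delta`, `apply_delta`), and `difflib` is far too slow for files of realistic size.

```python
from difflib import SequenceMatcher

def make_delta(ref: bytes, target: bytes):
    """Encode target as COPY (from ref) and ADD (literal bytes) instructions."""
    ops = []
    matcher = SequenceMatcher(None, ref, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:
            # 'replace' or 'insert': carry the new bytes literally
            ops.append(("add", target[j1:j2]))
        # 'delete' needs no instruction: those reference bytes are skipped
    return ops

def apply_delta(ref: bytes, ops) -> bytes:
    """Rebuild the target. This only works if the reference is available."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += ref[start:start + length]
        else:
            out += op[1]
    return bytes(out)

# X is the reference; Y is X with 50 extra bytes appended.
X = b"".join(b"line %d\n" % i for i in range(200))
Y = X + b"x" * 50
D = make_delta(X, Y)
literal_bytes = sum(len(op[1]) for op in D if op[0] == "add")
```

Here `D` carries only the 50 appended bytes as literal data; everything else is a copy instruction pointing back into `X`. Without `X`, `D` is useless.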
OK, it should be obvious that this concept can be helpful in
storing a repository, but how do we set things up?
One big delta chain
As a first attempt, let's store all 508 versions as a big
chain of deltas. Every version is stored as a delta against the version just
before it. Version 1 is the reference, and is the only version that is not
stored as a delta.
Size: 7,682 KB
Time: Way too long to wait
Wow -- this is really small. It's over 93% smaller than the
full/uncompressed form. It'll be hard to find a general purpose approach that
is smaller than this.

But good grief this is slow. Fetching version 508 takes an
eternity, because first you have to construct a temporary version of 507. And to
construct version 507, you first have to construct a temporary version of 506.
And so on.
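A quick count shows how bad this gets. The sketch below assumes the naive approach where every fetch starts over from version 1 and nothing is cached; the function name is made up for illustration.

```python
def deltas_to_apply(n: int) -> int:
    """Delta applications needed to rebuild version n of a forward chain
    whose only full copy is version 1."""
    return n - 1

# Fetching all 508 versions, one at a time, each starting from version 1.
total = sum(deltas_to_apply(n) for n in range(1, 509))
print(deltas_to_apply(508), total)
```

That is 507 delta operations for the latest version alone, and 128,778 to fetch all 508 versions individually.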
Key frames
Let's try something else. The problem with the chaining
case above is that retrieving version 508 requires us to go all the way back to
version 1, which is incredibly inefficient. Instead, let's insert "key frames"
every 10 versions. We borrow this idea from the video world where compressed
video streams store every frame as a delta, but every 10 seconds they insert a
full, uncompressed frame of video.
By using key frames with chaining deltas, we can cut the
time required to fetch the average version of the file. For example, with a
key frame every 10 versions, we get most of the benefits of chaining, but in
the worst case, we only need 9 delta operations to retrieve any version.
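As a sketch of the bookkeeping, assume the key frames land on versions 1, 11, 21 and so on (the exact placement is my assumption for illustration, and `locate` is a made-up name). Finding the starting point for any version is then simple arithmetic:

```python
def locate(n: int, interval: int = 10):
    """Return (key_frame_version, deltas_to_apply) for version n (1-based),
    assuming versions 1, 11, 21, ... are stored as key frames."""
    steps = (n - 1) % interval
    return n - steps, steps

worst_case = max(locate(n)[1] for n in range(1, 509))
total_deltas = sum(locate(n)[1] for n in range(1, 509))
key_frames = sum(1 for n in range(1, 509) if locate(n)[1] == 0)
```

Under that assumption, version 508 is rebuilt from key frame 501 with 7 delta operations, the worst case is 9, and the 508 versions need 51 key frames.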
Size: 18,024 KB
Time: 41.0 sec
This is better, but still not very good. The compression here
isn't much better than zlib, and the perf is still a lot worse. Compared to
zlib, we don't want to pay a 10x speed penalty just to get 20% better
compression.
All the key frames are stored as full and uncompressed
files, and they're taking up a lot of space. Maybe we should zlib those key frames?
Size: 9,092 KB
Time: 42.7 sec
Now at least the compression is starting to look
interesting. This is less than half the size of the zlib case, and 91.9%
smaller than the full form, which is a level of compression that is probably
worth the trouble. But the overall perf is still quite slow. In fact, it's
even slower here than plain chaining with key frames, because we have to
un-zlib the key frame.
Flowers
The big problem here is that chains of deltas are killing
our performance. Chained deltas can be used to make things very small because
each delta matches up nicely with one set of user edits. But chained deltas
are slow because we need multiple operations to retrieve a given file.
Another approach would be to use each reference for more
than one delta. I call this the flower approach. With a flower, we deltify a
line of versions by picking one version (say, the first one) and using it as
the reference to make all the others into deltas.
Flower deltas should be much faster, since any file can be reconstructed
with just one undeltify operation.

So let's try to flower all 508 versions using version 1 as
the reference for all of them.
Size: 35,851 KB
Time: 10.9 sec
As expected, the performance here is much better.
But the overall space savings is lousy. Only version 2 was
based directly on version 1. Every version after that has less and less in
common with version 1, so the delta algorithm can't draw as much stuff from the
reference.
This particular approach isn't going to win. Plain zlib is both
smaller and faster.
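The drift away from the reference is easy to see in a toy model. The sketch below uses a crude proxy for delta size (the number of lines in the target that do not appear in the reference) and a hypothetical history in which each new version rewrites one more line. It is not a real delta algorithm, and the data is invented.

```python
def changed_lines(ref: list, target: list) -> int:
    """Crude proxy for delta size: lines of target that do not appear in ref."""
    ref_set = set(ref)
    return sum(1 for line in target if line not in ref_set)

# Hypothetical history: each new version rewrites one more line of a 100-line file.
base = ["line %d" % i for i in range(100)]
versions = [base]
for k in range(1, 20):
    nxt = list(versions[-1])
    nxt[k] = "line %d (edited in v%d)" % (k, k + 1)
    versions.append(nxt)

# Chain-style: each version compared with the one just before it.
vs_previous = [changed_lines(versions[k - 1], versions[k]) for k in range(1, 20)]
# Flower-style: each version compared with version 1.
vs_first = [changed_lines(versions[0], versions[k]) for k in range(1, 20)]
```

Against the previous version, every delta covers one changed line. Against version 1, the count climbs steadily from 1 to 19, which is the same effect that bloats the one-big-flower archive.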
Flowers with key frames
Maybe we should try the flower concept with key frames?
Like before, every 10 frames go together as a group. But instead
of chaining, we're going to run each group as a flower. The first version in
the group will serve as the reference for the other 9. We can reasonably
assume that the deltification of frame 10 won't be as good as frame 2, but
hopefully 10 and 1 still have enough in common to be worthwhile.
Size: 18,648 KB
Time: 12.2 sec
Wow. This looks a lot better than chaining. The space used
is about 17% smaller than zlib, but instead of being 10 times slower, it's only
3 times slower.
Of course, we can use the same trick we tried before. Let's
zlib all those key frames.
Size: 9,716 KB
Time: 13.6 sec
This seems like a potentially useful spot. It's less than
half the size of zlib. The perf is still a lot slower than zlib, but at only
about 3X slower, the tradeoff is the best we've seen so far.
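With this many schemes on the table, it helps to line up the ratios. The snippet below just redoes the arithmetic on the sizes and times reported in this post; the dictionary keys and the `summarize` helper are names I made up for illustration.

```python
# Sizes (KB) and times (sec) as reported in this post.
results = {
    "full": (112_643, 2.5),
    "zlib": (22_516, 4.0),
    "chain + key frames": (18_024, 41.0),
    "chain + zlib key frames": (9_092, 42.7),
    "one big flower": (35_851, 10.9),
    "flower + key frames": (18_648, 12.2),
    "flower + zlib key frames": (9_716, 13.6),
}
full_kb, _ = results["full"]
zlib_kb, zlib_sec = results["zlib"]

def summarize(name):
    """Compare one scheme against the full and zlib baselines."""
    kb, sec = results[name]
    return {
        "smaller_than_full_pct": round(100 * (1 - kb / full_kb), 1),
        "size_vs_zlib": round(kb / zlib_kb, 2),
        "slowdown_vs_zlib": round(sec / zlib_sec, 1),
    }

for name in results:
    print(name, summarize(name))
```

The interesting rows are the ones where `size_vs_zlib` drops well below 1.0 while `slowdown_vs_zlib` stays small.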
OK. So we've made a lot of progress on saving space, but 3X
slower than zlib still seems like a high price to pay. Do we really want to
make that trade? Do we have to?
Some things get retrieved more often than others
Let's look at the patterns for how this data is going to be
accessed.
I've been reporting the total time required to fetch all 508
versions of the file. However, this benchmark doesn't reflect real usage very
well at all. In practice, the recent stuff gets retrieved a LOT more often
than the older stuff. Most of the time, developers are updating their working
copy to whatever is latest.
As a rough guess, I'm going to say that version 508 gets
retrieved twice as often as 507, which gets retrieved twice as often as 506,
and so on. A timing test based on that assumption gives us results something
like this:
Full                      1.1 sec
Zlib                      1.7 sec
One big flower            4.0 sec
Flower with key frames    5.1 sec
Chain with key frames    24.5 sec
Not too surprising.
In the spirit of optimizing performance for the most common
operations, why not keep all the more recent versions in a faster form? We
could still use something more aggressive for the older stuff, but we can
probably get a nice performance boost if we just refuse to use deltification
for the most recent 10 versions of the file.
But how should we store those 10 versions? In full format?
Or zlib? This is an arbitrary choice with a clear tradeoff. For now, I choose
zlib. If we wanted a little more speed at the expense of using a little more
space, we could just keep those 10 versions in full form.
By choosing zlib for the most recent 10 revisions, now my "get
the recent stuff" benchmark runs in 1.7 seconds no matter what scheme I use.
But we still care about performance for the case where
somebody fetches an older version, even if that fetch doesn't happen as often.
That's the point of version control storage. Every version has to be
available. And when somebody does fetch version 495, we want our version
control system to still be reasonably fast.
Reversing the direction of the chains
Since the more recent versions are retrieved more often,
obviously, our chains are all going the wrong direction. If we had them go the
other way, then retrieval would get slower as the versions get older instead of
as the versions get newer.
But this approach doesn't lend itself well to the way
version control repositories naturally grow in the wild. In these tests, I
have mostly ignored the issues of constructing each storage scheme. I've
already got all 508 versions, so I'm just fiddling around with different
schemes of storing them all, comparing size and retrieval time.
In practice, those 508 versions happened one at a time, in
order. If we're going to store the versions with backward chains, then each
time we commit a new version of the file, we're going to need to re-encode something
that was previously stored. This makes the commit operation slower. It is
also a questionable idea from the perspective of data integrity. The safest
way to maintain data is to not touch it after it has been written. Once it's
there, leave it alone.
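Here is a rough Python sketch of what a backward chain would look like if it were built on the fly. The class and function names are made up, and the line-based difflib delta is again just a stand-in for a real delta algorithm. Notice that every commit after the first has to rewrite the entry that was previously the head.

```python
import difflib


def make_delta(ref_lines: list, target_lines: list) -> list:
    """Describe target_lines as copy/insert operations against ref_lines."""
    matcher = difflib.SequenceMatcher(None, ref_lines, target_lines, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", target_lines[j1:j2]))
    return ops


def apply_delta(ref_lines: list, ops: list) -> list:
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(ref_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class BackwardChain:
    """Newest version stored in full; each older one is a delta against its successor."""

    def __init__(self):
        self.entries = []  # ("full", lines) for the head, ("delta", ops) for the rest
        self.rewrites = 0  # how many already-stored entries we had to re-encode

    def commit(self, text: str) -> None:
        new_lines = text.splitlines(keepends=True)
        if self.entries:
            _, old_lines = self.entries[-1]  # the previous head is still in full form
            self.entries[-1] = ("delta", make_delta(new_lines, old_lines))
            self.rewrites += 1
        self.entries.append(("full", new_lines))

    def fetch(self, index: int):
        """Return (text, hops). Hops grow as the requested version gets older."""
        lines = self.entries[-1][1]
        hops = 0
        for i in range(len(self.entries) - 2, index - 1, -1):
            lines = apply_delta(lines, self.entries[i][1])
            hops += 1
        return "".join(lines), hops
```

Fetching the head costs zero delta applications, which is what we want. The price is the rewrites counter: it goes up by one on every commit, and that is exactly the "touching data after it has been written" concern.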
One case where we might want to be a bit more liberal
toward rewriting data is in a "pack" operation, such as the one Git has. It
wouldn't be terribly crazy to consider a standalone pack operation in a DVCS to
be better than rewriting data for each commit, for several reasons:
- It allows us to keep commit fast.
- Since pack would be done offline, its implementation can
be focused more on data integrity and space savings than on performance.
- Since the pack code can be separated from the commit code,
all the risky code can be kept together where it is easier to maintain.
- Since the pack operation is separate from commit, a user
that does not want to run pack does not have to.
- A pack operation in a DVCS is happening on just one
instance (clone) of the repository, not on the only copy.
Anyway, a pack operation would allow us to use storage
schemes that do not work well on the fly, incrementally updating as each
version comes in.
Visualizing the results

This plot makes it easier to see which schemes are better
than others.
In my experimentation, I actually did a lot more schemes.
For example, instead of key frames every 10 versions, I also tried every 5, 15
and 20. However, all those extra data points really cluttered the graph. So I
only included the most important ones here.
- In the lower right, we find "full". Very fast and very
large.
- In the upper left, we find "chains". Very slow and very
small.
- We can ignore any point which is both above AND to the
right of any other point. The "1flower" point is the one where I made one
big flower, using version 1 as the reference for every other version.
This scheme ends up being useless since zlib is better in both ways that
matter.
- All the other points represent possible tradeoffs which
might be interesting, depending upon our priorities.
Intuitively, the schemes which are closer to the origin are
better. This graph makes it easy to see that "zlib" and "flowers" are probably
the two most interesting options I have discussed here.
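The "above AND to the right" rule is easy to apply mechanically. The Python sketch below takes (size, time) pairs and keeps only the schemes that no other scheme beats on both axes. The function name is made up, and the points in the usage example are arbitrary numbers, not my measurements.

```python
def worth_considering(points: dict) -> list:
    """Return the names of schemes that are not beaten on both size and time.

    points maps a scheme name to a (size, time) pair. A scheme is dropped when
    some other scheme is strictly smaller AND strictly faster, i.e. when it sits
    above and to the right of another point on the plot.
    """
    keep = []
    for name, (size, time) in points.items():
        beaten = any(
            other != name and s < size and t < time
            for other, (s, t) in points.items()
        )
        if not beaten:
            keep.append(name)
    return keep


# Arbitrary illustrative points: "d" is larger and slower than "c", so it drops out.
example = {"a": (10, 1), "b": (1, 10), "c": (5, 5), "d": (6, 6)}
survivors = worth_considering(example)
```

Whatever survives this filter is a real tradeoff. Choosing among the survivors is a question of priorities, not measurement.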
Monday, April 13, 2009 Ten Quirky Issues with Cross-Platform Version Control
A big chunk of the software industry today can mostly ignore
the issues of multiple platforms, for one of the following reasons:
- They only support Windows. It's got like 90% market
share, so why not?
- They serve a web application and don't care what the end
user is actually using as long as their browser works.
But version control tools involve more cross-platform
concerns than most other kinds of software. Neither of the reasons above tends
to work very well.
- If a software team has 450 Windows users and 50 people on
Mac or Unix, then a Windows-only solution just won't do.
- Since a primary task of a version control tool is to
manage source code trees on the user's hard disk, a web application just
won't do.
So, even as most coders have moved on to a world where they
can remain blissfully ignorant of the problems of writing software for multiple
operating systems, those of us who create version control tools are still wrestling
with those problems.
And in fact, I claim that our challenges are tougher than most.
Version control users ask for the darndest things, especially in the big
enterprise companies. It's easy to believe that all you need is Windows, Mac,
Linux and maybe Solaris. Then you find out just how prevalent things like AIX
and HPUX are. Terms like "Irix" and "Win95" and "mainframe" get tossed around
until you're numb and nothing surprises you anymore. When somebody asks for a
port to an arcane platform, you roll your eyes and wonder if it uses 8-bit
bytes or not.
Worse than that, version control vendors aren't just porting
to oddball operating systems. We actually have to make our software interoperate
across all those environments.
And that's where things start to get quirky.
- On a Linux system, create a file called "README". In the
same directory, create a file called "readme". Check them both in. Now
go to a Mac and check them both out. Since the Mac file system is
[usually] case insensitive, something bad is going to happen. Same goes
for Windows/NTFS.
- On a Mac, checkin a file called "PRN". Check it out on a
Windows system. That file name is not
allowed under Windows, for backward compatibility with MS-DOS.
- Under Linux, checkin a file with a name that ends in a
dot. Check it out under Windows. The trailing dot is probably gone. Now
check the file back in and go back to your Unix system. If your version
control system handled this badly, you've probably got two copies of the
file, one with the trailing dot, and one without. Same goes for a
trailing space.
- On a Linux system, checkin a file with a path that is 261
characters long. Check it out under Windows. This might work. It
probably won't. It kind of depends on whether .NET is involved or not.
There's a \\?\ trick to get around the limitations of the Win32 layer, but
the .NET libraries don't use it.
- On a Mac, checkin a file that has a resource fork and
some Finder info. Check it out on a Linux machine. What happens? Did
stuff show up as xattrs?
Should it have? On that same Linux machine, make a change and check it
back in. Then check it out on the Mac again. Is the Finder info still
there?
- On a Linux machine, checkin a file with a colon in the
name. Check it out on a Mac. Not sure what'll happen, but it probably
won't be what you want.
- On a Windows machine, checkin a file with a name that
begins with a dash. Now check it out under Linux and try
manipulating it with command-line utilities. Apps will think the
filename is a command-line option.
What if somebody creates a file
named "-rf *" and a Linux user tries to rm it? OK, bad example.
The point remains: Filenames which begin with a dash may cause more
problems on some platforms than on others.
- On a Linux machine, create a source code file and check it
in. Check it out on Windows and open it with notepad. The line-endings are Unix-standard
LF, but Windows apps expect CRLF, so notepad shows the entire file as one
line. Now open the same file under Visual Studio. The file looks fine
now. Now edit a few lines in the middle of the file, check it back in,
and check it out on Linux again. The lines you edited are messed up.
- On a recent Ubuntu Linux system, create a file called "Español".
Do the same thing on Mac OS X 10.5. These two files have the same name,
but even though you are [probably] using the utf-8 encoding of Unicode on
both systems, the bytes which represent that name do not match. On the
Linux machine, the file name [probably] will be in NFC normalized form (Espa\u00f1ol).
On the Mac, everything gets normalized to NFD (Espan\u0303ol). When you
check these files in and start working with them, bad things will happen
unless your version control tool understands
what's going on and deals with it appropriately.
- On a Unix machine, checkin a symbolic link. Check it out
on Windows. What happens?
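Several of the file name problems in the list above can be caught before they bite. Here is a rough Python sketch of the kind of check a version control tool might run at checkin time. The function names are made up, and casefold() plus NFC normalization is only an approximation of what a real case-insensitive or normalizing file system does.

```python
import unicodedata

# Device names that Windows reserves for backward compatibility with MS-DOS.
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def name_hazards(name: str) -> list:
    """List the ways a single file name may misbehave on another platform."""
    hazards = []
    if name.split(".", 1)[0].upper() in WINDOWS_RESERVED:
        hazards.append("reserved device name on Windows")
    if name.endswith((".", " ")):
        hazards.append("trailing dot or space may be dropped on Windows")
    if ":" in name:
        hazards.append("colon is not portable across platforms")
    if name.startswith("-"):
        hazards.append("leading dash looks like a command-line option")
    return hazards


def collision_key(name: str) -> str:
    """Names with the same key may land on the same file on some platforms."""
    return unicodedata.normalize("NFC", name).casefold()


def find_collisions(names: list) -> list:
    """Group names that differ only by case or by Unicode normalization form."""
    groups = {}
    for name in names:
        groups.setdefault(collision_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

For example, find_collisions(["README", "readme", "Espa\u00f1ol", "Espan\u0303ol"]) reports two groups: the case collision and the NFC/NFD collision. What the tool should do once it finds one (refuse the checkin, warn, rename) is a separate design question.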
Like I said, things get quirky.
Monday, April 06, 2009 Mercurial, Subversion, and Wesley Snipes
People keep asking me why I don't talk more about Mercurial
in this series of blog entries. There's a simple answer to that question:
Mercurial isn't very interesting.
Wait, that didn't come out quite right. Let me try again:
Git is Wesley Snipes.
Mercurial is Denzel Washington.
Hmm, that probably needs further explanation. First let me
give a little background.
I am the founder of a version control software company.
I've done lots of writing about the topic here on my blog. Currently I am in
the process of revising and expanding all those writings to turn them into a
book.
As part of that effort, I have undertaken an exploration of
the DVCS world. Several weeks ago I started writing one blog entry every week,
mostly focused on DVCS topics. In chronological order, here they are:
- The one
where I gripe about Git's index
- The one
where I whine about the way Git allows developers to rearrange the DAG
- The one
where it looks like I am against DAG-based version control but I'm really
not
- The one
where I fuss about DVCSes that try to act like centralized tools
- The one
where I complain that DVCSes have a lousy story when it comes to
bug-tracking
- The one
where I lament that I want to like Darcs but I can't
- The one
where I speculate cluelessly about why Git is so fast
Along the way, I've been spending some time getting hands-on
experience with these tools. I've been using Bazaar for several months. I
don't like it very much. I am currently in the process of switching to Git,
but I don't expect to like it very much either.
Why am I using these tools if I don't like them? Because I
want the experience. I don't want to write hearsay. I want to live with these
tools and see what I learn.
So why don't I write about Mercurial? Because I'm pretty
sure I would like it.
I chose Bazaar and Git for the experience. But if I were
choosing a DVCS as a regular user, I would choose Mercurial. I've used it
some, and found it to be incredibly pleasant. It seems like the DVCS that got
everything just about right. That's great if you're a user, but for a writer,
what's interesting about that?
Denzel Washington is a great actor. Other than that, he
lives a pretty normal life. What's interesting about that?
In contrast, Wesley Snipes gives the world lots of things to
write about. Tax evasion. Conviction for reckless driving. Martial arts.
People admire Denzel Washington. But Wesley Snipes is
simply more interesting.
I admire Mercurial. But Git is more interesting. Like
Snipes, Git is an odd juxtaposition of great power with some quirky flaws and
failings.
One more thing:
People also ask me why I don't write more about Subversion.
That's easy too:
Subversion is Morgan Freeman.