Interesting Merge problem

mlippert
Posts: 252
Joined: Wed Oct 06, 2004 10:49 am
Location: Cambridge, MA

Interesting Merge problem

Post by mlippert » Mon Sep 26, 2005 10:20 am

We just ran across an interesting merge problem that I can't really think of a good fix for.

We uncovered the problem after a Merge Branches, but I think any automatic merge may have the same issue.

It turns out that when we merged the branches we were on a computer with a Japanese system codepage set.

The automatic merge turned ü (u umlaut) characters, along with the character immediately after each one, into garbage (this was a German HTML file using charset windows-1252).

This makes sense, because ü is 0xFC in the Latin-1 ANSI codepage, but 0xFC is a lead byte in the Japanese Shift-JIS codepage.
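
Here's a small C# snippet that illustrates the failure mode I'm describing (purely an illustration of the byte-level problem, obviously not Vault's actual code):

// Illustration only: what happens when Windows-1252 bytes are decoded with
// the Japanese system codepage (Shift-JIS, codepage 932) and written back out.
using System;
using System.Text;

class CodepageMixup
{
    static void Main()
    {
        Encoding latin1Ansi = Encoding.GetEncoding(1252);  // Windows-1252
        Encoding shiftJis   = Encoding.GetEncoding(932);   // Japanese system codepage

        byte[] original = latin1Ansi.GetBytes("für");      // 0x66 0xFC 0x72

        // On a Japanese system, decoding with the default codepage treats 0xFC
        // as a lead byte, so 0xFC 0x72 is consumed as a single (invalid or
        // unrelated) double-byte character.
        string misread = shiftJis.GetString(original);

        // Re-encoding the misread string can never restore the original bytes,
        // so both the ü and the character after it are destroyed.
        byte[] corrupted = shiftJis.GetBytes(misread);

        Console.WriteLine(BitConverter.ToString(original));   // 66-FC-72
        Console.WriteLine(BitConverter.ToString(corrupted));  // something else entirely
    }
}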

Now, obviously there is no reliable way for the merge tool to determine what codepage a given text document uses, so assuming the system default is reasonable.

However, I do wonder about UTF-8 encoded text files. If they have the correct BOM at the front, will they be treated correctly?

The only partial solution I can think of is to treat these text files as not mergeable. However, if I set that in Vault today, I can no longer diff those files either, and diffs on those files (and probably on other "unmergeable" files) can be quite useful, particularly historical diffs. I've actually configured Vault to use Beyond Compare for diffs, and it can diff many types of files.

So, what's the answer to the question about UTF-8 (and UTF-16) files, and is it possible to separate the mergeable file types from the diff'able file types?

Mike

ericsink
Posts: 346
Joined: Mon Dec 15, 2003 1:52 pm
Location: SourceGear

Post by ericsink » Mon Sep 26, 2005 12:02 pm

Offhand, I would say that handling these situations correctly would require Vault to know the encoding of the file. Vault currently knows whether a file is text or binary, but if the file is text, it has no attribute specifying the encoding. Once the file has been read into memory as string data, the encoding no longer matters.

Right now, when the automerge code reads a file, it relies on the .NET framework to figure out the encoding. I don't know too much about how this works. Apparently the default encoding for the system is a big part of the answer, but if it doesn't properly handle the BOM for a UTF-8 file, I'd call that a bug in the framework.
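
For what it's worth, I'd expect the framework-level read to look roughly like this (a sketch of the behavior I'm describing, not the actual automerge source):

using System.IO;
using System.Text;

class EncodingSketch
{
    // Roughly how the framework resolves a text file's encoding: honor a
    // UTF-8/UTF-16 BOM if one is present, otherwise fall back to the
    // encoding passed in (here, the system default ANSI codepage).
    static string ReadTextFile(string path, out Encoding used)
    {
        using (StreamReader reader = new StreamReader(
            path, Encoding.Default, true /* detectEncodingFromByteOrderMarks */))
        {
            string text = reader.ReadToEnd();
            used = reader.CurrentEncoding;  // only meaningful after the first read
            return text;
        }
    }
}
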
Eric Sink
Software Craftsman
SourceGear

mlippert
Posts: 252
Joined: Wed Oct 06, 2004 10:49 am
Location: Cambridge, MA

Post by mlippert » Mon Sep 26, 2005 4:55 pm

Yep, that's why I was requesting that the list of mergeable extensions and the list of diff'able extensions be made separate.

Speaking of which, how does the merge branches wizard handle binary (non-mergeable) files? In particular, since Vault doesn't keep track of the branch point (for the base file of a 3-way merge), how does it decide whether it can overwrite the file or whether it should put it in a needs-merge state?

I just ran into a different merge problem, again not one I can think of a possible Vault solution to.

This time the file being merged was an XML file. Both the target (trunk) and the origin (branch) were UTF-8 encoded, but the target had a BOM and the origin had lost it.

Vault's merge seems to have interpreted the file with the BOM as UTF-8 and the file without a BOM as ANSI (Windows-1252 was the system codepage), and the resulting automerge was ugly.
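
In case it helps reproduce the problem, here's roughly what I think happened, sketched in C# (the file names and the helper are made up, and I'm only guessing at how the automerge reads files):

using System;
using System.IO;
using System.Text;

class BomMismatch
{
    static void Main()
    {
        Encoding ansi = Encoding.GetEncoding(1252);

        // target.xml gets a UTF-8 BOM, origin.xml does not.
        File.WriteAllText("target.xml", "<name>Müller</name>", new UTF8Encoding(true));
        File.WriteAllText("origin.xml", "<name>Müller</name>", new UTF8Encoding(false));

        string target = ReadWithBomDetection("target.xml", ansi);
        string origin = ReadWithBomDetection("origin.xml", ansi);

        // With BOM detection on, the target decodes correctly as UTF-8, but the
        // BOM-less origin falls back to the ANSI codepage, so the two-byte UTF-8
        // sequence for ü comes out as "Ã¼" -- the kind of garbage that then gets
        // woven into the merge result.
        Console.WriteLine(target);  // <name>Müller</name>
        Console.WriteLine(origin);  // <name>MÃ¼ller</name>
    }

    static string ReadWithBomDetection(string path, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }
}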

I'm going to remove the extension of these files from the mergeable list and see if the Merge Branches goes any better.

We tried unchecking the box that said "Automatically merge files", and that left this file in the "needs merge" state, but when we did a resolve merge status, instead of getting the origin file, we ended up with the target file, unmodified. Is that really how it is supposed to work?

Mike
Attachments
XMLMergeExampleFiles.zip
Example files from the XML merge where one file had a UTF-8 BOM and the other didn't.
(45.19 KiB) Downloaded 676 times

mlippert
Posts: 252
Joined: Wed Oct 06, 2004 10:49 am
Location: Cambridge, MA

Post by mlippert » Mon Sep 26, 2005 6:21 pm

First, for some reason "Show Differences..." was still active for files with the extension I removed from the mergeable list.

However, when I went through the Merge Wizard this time, those files were either flagged as "Overwrite" or as "Not Mergable".

But when I got to the step that attempts to check out the target files, I got the error "Attempt to check out files failed." What I'd like to know when I get an error like that is: why? My guess is that some files that are now not mergeable need to be checked out exclusively, and someone else has them checked out. OK, but which files, and who has them checked out? Or did something else cause the failure?

Anyway, at this point I could use some help as it is important that we get this merge done. I'll be back working on this tomorrow probably around 11:30am, although I will try to at least read messages posted before that.

Thanks,
Mike

ericsink
Posts: 346
Joined: Mon Dec 15, 2003 1:52 pm
Location: SourceGear

Post by ericsink » Tue Sep 27, 2005 10:02 am

1. In general, you'll need to deal with binary files manually. The merge branches wizard can't really offer much help.

2. Regarding your second problem, if I understand correctly, yes, that's how Vault is supposed to work. Resolve merge status doesn't do anything to the files. It merely tells Vault to stop fussing about the needs merge state. By using this command, you are assuring Vault that you did the actual work to resolve the merge. (Note, I'm not saying that the automerge result you got is how it is supposed to work. These issues with encodings still sound like a bug of some kind to me.)

There is a reason why the merge branches wizard performs work in the local working folder for the target, leaving it for you to check in. Ultimately, your goal is to get that working folder into the state you want. Only you can verify that what you're about to check in is correct. This may require you to edit some files by hand, or even copy files from the origin, in order for the merge to actually be resolved.

More of my philosophy on this topic can be found here:

http://software.ericsink.com/scm/scm_me ... nches.html

Even so, I am trying to defend the philosophy, not the weaknesses in our current implementation of merge branches. Our goal is for the merge branches wizard to be as automatic as possible, although it can never be smart enough to handle every situation. Unfortunately, you are bumping into a couple of these places where the merge branches wizard requires manual intervention. Our apologies. :(
Eric Sink
Software Craftsman
SourceGear

mlippert
Posts: 252
Joined: Wed Oct 06, 2004 10:49 am
Location: Cambridge, MA

Post by mlippert » Tue Sep 27, 2005 10:38 am

Regarding binary files and resolve merge: I expected that the target working directory would contain exact copies of the origin files that are binary (and not mergeable) or that are in a needs-merge state, so that doing a resolve merge would mean keeping all of the origin changes rather than throwing them away (if I wanted to throw them away, all I'd need to do is an undo checkout).

In support of that, consider that I'd want to build my target to test it before committing the merge. In addition, think about how resolve merge is usually used. I check out and modify version 3 of a file. Someone checks in version 4 with changes that are incompatible with mine. The file is in the "needs merge" state. I examine the changes in version 4 and incorporate them by hand into my local copy in my working directory. Then I tell Vault to "resolve the merge status". If I've just done a Merge Branches, then I expect the file in the target working directory to be an exact copy of the origin, since that was the change I was making locally.

As to the current automatic merge: based on what you said earlier, you are letting .NET process the text file, which means it is converted to Unicode (since all .NET strings are Unicode) when you read it in. However, you have to remember the file's encoding for when you write it out again, since presumably if the original file was ANSI you write the resulting merged file as ANSI, and similarly for UTF-8 and UTF-16 files.

If you know that much (whether the text file's encoding is ANSI, UTF-8, or UTF-16), I'd suggest NOT merging when the two text files being merged have different encodings.
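
Something like this hypothetical check is what I have in mind (clearly not Vault's internals, just the idea, and the BOM sniffing is the crudest possible classification):

using System.IO;

class MergeGuard
{
    // Very rough classification of a text file's encoding by its BOM.
    // Anything without a recognizable BOM is treated as ANSI/unknown.
    static string DetectByBom(string path)
    {
        byte[] head = new byte[4];
        int n;
        using (FileStream fs = File.OpenRead(path))
            n = fs.Read(head, 0, head.Length);

        if (n >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF) return "utf-8";
        if (n >= 2 && head[0] == 0xFF && head[1] == 0xFE) return "utf-16le";
        if (n >= 2 && head[0] == 0xFE && head[1] == 0xFF) return "utf-16be";
        return "ansi-or-unknown";
    }

    // The rule I'm suggesting: refuse to automerge when the encodings differ,
    // and leave the file for manual resolution instead of producing garbage.
    static bool SafeToAutomerge(string targetPath, string originPath)
    {
        return DetectByBom(targetPath) == DetectByBom(originPath);
    }
}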

I'd say that the one thing an automatic merge should NEVER do is produce a bad result file. When in doubt, it should do nothing and require manual intervention.

I do understand that in a merge there will always be files that have to be dealt with manually. The merge process should just make that as easy and straightforward as possible, and right now I don't think it's either.

Mike

mlippert
Posts: 252
Joined: Wed Oct 06, 2004 10:49 am
Location: Cambridge, MA

Post by mlippert » Tue Sep 27, 2005 12:37 pm

My next question is how does the merge branches wizard decide which binary files can be overwritten and which need manual intervention?

I just did my merge again with the XML file type (.xmcd) removed from the mergeable list. The majority of those files were put in the pending changeset and were listed by the wizard as "Overwrite". A few were listed by the wizard as "Not Mergeable" and showed up in the report at the end of the wizard:
WARNING: Not all changes from the origin were applied to the target.
You may wish to pay special attention to the following changes in the
origin which were not applied to the target:

Add File Documentation/Help/JP/Graphics/op_for.gif
Add File Documentation/Help/JP/Graphics/op_while.gif
Modify File Documentation/ResourceCenter/FR/Qsheet/Vector_and_Matrix/spctypes.xmcd
Modify File Documentation/ResourceCenter/FR/Qsheet/Visualization/unitsplt.xmcd
Modify File Documentation/ResourceCenter/FR/Qsheet/Visualization/tangraph.xmcd
Modify File Documentation/ResourceCenter/JP/qsheet/symbols.xmcd

So I'm curious what made those four .xmcd files different. Also, I would really like the wizard to tell me WHY these particular changes (Adds and Modifies) were not applied to the target.

Thanks, Mike

ericsink
Posts: 346
Joined: Mon Dec 15, 2003 1:52 pm
Location: SourceGear

Post by ericsink » Tue Sep 27, 2005 4:24 pm

I don't have the Vault source code handy to verify this right now, but here's how I *think* this works:

There are three files in play:

1. The target file.

2. The beginning version of the origin file.

3. The ending version of the origin file.

For a normal text file, automerge takes the differences between 2 and 3 and tries to apply them as a context diff to 1. This is just background you already know.

For a binary file, it can't do this. I *think* what it does is this:

If the target file happens to be identical to the beginning version of the origin file, then it overwrites the target with the ending version of the origin file.

Or in other words, if 1 and 2 are identical, then choose 3.

The basic idea here is this: If 1 and 2 are identical, then we assume we don't really have a merge to deal with. The file has been modified in the origin, but not in the target. So we just take the changes from the origin as being correct.
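
Expressed as a quick C# sketch (again, this is from memory, not the actual Vault source):

// For a non-mergeable file, only overwrite the target when it is identical to
// the beginning version of the origin, i.e. when there is nothing to merge.
enum BinaryMergeAction { OverwriteWithOriginEnd, NeedsManualMerge }

class BinaryMergeRule
{
    static BinaryMergeAction Decide(byte[] target, byte[] originBegin)
    {
        if (target.Length == originBegin.Length)
        {
            bool identical = true;
            for (int i = 0; i < target.Length; i++)
                if (target[i] != originBegin[i]) { identical = false; break; }

            if (identical)
                return BinaryMergeAction.OverwriteWithOriginEnd;  // take file 3 as-is
        }
        return BinaryMergeAction.NeedsManualMerge;  // target diverged; don't guess
    }
}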

Does this make sense?
Eric Sink
Software Craftsman
SourceGear

mlippert
Posts: 252
Joined: Wed Oct 06, 2004 10:49 am
Location: Cambridge, MA

Post by mlippert » Tue Sep 27, 2005 4:34 pm

Yes, that makes sense. Thanks. I figured there had to be a "base" file in play somewhere, but since Vault isn't really tracking the branch and merge operations (at least for this purpose), I wasn't sure where the base was coming from. I should have been able to guess it would be the beginning version of the origin changes.

Mike
