UTF charset issue with Json files

Support for our DiffMerge utility.

Moderator: SourceGear


Posts: 5
Joined: Wed Jun 28, 2017 3:26 am
Posted: Wed Jun 28, 2017 3:34 am
Hi, I'm using DiffMerge to compare Json files that we use to update the text content on our web portal, and I'm having problems with UTF-8 and special characters.

For example, in the last Json file I made on Windows, Notepad++ (Windows), TextEdit (Mac) and DiffMerge all see the text "Curaçao".

But when I copy the contents of this Json into a new text document on Mac using TextEdit, DiffMerge sees that same text as "Cura√ßao", so it's spotting a lot of discrepancies between different versions of the Json doc, making updates to our website difficult!

Any help would be much appreciated, thanks.

Posts: 3471
Joined: Tue Dec 16, 2003 1:17 pm
Location: SourceGear
Posted: Wed Jun 28, 2017 8:09 am
All of this will be dependent on what is stored in the Byte Order Mark (BOM) in the beginning of your file. Do you know if one is present?

Next, take a look at the individual Rulesets for the extensions of your configuration files. Is the BOM option checked in the Ruleset? Perhaps you need to define or try a specific encoding?
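
If you are not sure whether a BOM is present, the first few bytes of the file can be checked directly. Below is a minimal Python sketch of that check; it is not part of DiffMerge, and the function names `detect_bom` and `file_bom` are made up for illustration.

```python
import codecs

# Known BOM signatures. The UTF-32 entries come first because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def file_bom(path):
    """Read only the first four bytes of a file and check them for a BOM."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4))
```

Calling `file_bom("somefile.json")` on each file (the file name is only an example) returns `None` when no BOM is present.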
Jeff Clausius
SourceGear

Posts: 5
Joined: Wed Jun 28, 2017 3:26 am
Posted: Wed Jun 28, 2017 10:14 am
Thanks for your reply. As far as I can tell, the Json file doesn't contain a BOM at the beginning of the text; I'll look into the BOM option in the ruleset. If my colleague compares the two Json files on Windows in DiffDog, they are identical, but for me on Mac in DiffMerge, those special character discrepancies are showing up.

Posts: 5
Joined: Wed Jun 28, 2017 3:26 am
Posted: Wed Jun 28, 2017 10:21 am
I tried disabling custom rule set, then enabling but unticking 'Search for Unicode BOM' - and making sure that the Named Character Encoding was set to UTF-8. No difference.

In the Json I make on my Mac, DiffMerge sees "Cura√ßao" -- even if it's just an empty plain text document and I copy/paste the original json (where it shows in DiffMerge as "Curaçao") into it.

I'm a bit stumped, because I don't know what the difference would be between these files. Should the "ç" show correctly in UTF-8?

Posts: 3471
Joined: Tue Dec 16, 2003 1:17 pm
Location: SourceGear
Posted: Wed Jun 28, 2017 12:25 pm
Can you try a quick test on the Mac?

a) Make a copy of both files you are working with, and rename the copies so their extensions are '.utf8'. For example, 'somefile.json' would become 'somefile.utf8'.

b) Run these two *.utf8 files in DiffMerge. What does the diff look like for these two files?
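
If it helps, the raw bytes of the two files can also be compared outside DiffMerge, which shows whether the files really differ or are only being displayed differently. This is a rough Python sketch under that assumption; the helper names and the file names in the usage note are hypothetical.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One input is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def show_context(data: bytes, offset: int, width: int = 8) -> str:
    """Hex dump of up to `width` bytes on either side of `offset`."""
    start = max(0, offset - width)
    chunk = data[start:offset + width]
    return " ".join(f"{byte:02x}" for byte in chunk)


def compare_files(path_a, path_b):
    """Print where two files first differ at the byte level and return the offset."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    pos = first_difference(a, b)
    if pos is None:
        print("files are byte-for-byte identical")
    else:
        print(f"first difference at byte {pos}")
        print("A:", show_context(a, pos))
        print("B:", show_context(b, pos))
    return pos
```

For example, `compare_files("original.utf8", "copy.utf8")`. In UTF-8 the 'ç' is stored as the two bytes `c3 a7`, so a file showing a different byte sequence at that position was saved in some other encoding.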
Jeff Clausius
SourceGear

Posts: 5
Joined: Wed Jun 28, 2017 3:26 am
Posted: Fri Jun 30, 2017 8:24 am
1. I downloaded our current json file, opened it in TextEdit and copied the entire contents to a new document saved as UTF-8. Compared in DiffMerge, the copy still displays "Cura√ßao" while the original displays "Curaçao".

2. I made a duplicate of both and changed the file extension to ".utf8". Comparing them in DiffMerge still shows differences in how special characters are displayed.

Posts: 3471
Joined: Tue Dec 16, 2003 1:17 pm
Location: SourceGear
Posted: Fri Jun 30, 2017 9:09 am
I made a sample here.

If you take the 1st set of files (files encoded with the byte order marks), those use the correct encoding on Windows, Mac, and Linux. However, if a file is missing the BOM, it will depend on what character set ends up displaying the text. The default character set on the Mac apparently doesn't like the 'ç'.
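
To illustrate what happens when the character set is guessed wrongly, here is a small Python sketch that encodes the text as UTF-8 and then decodes the same bytes with a legacy Mac code page. Using Python's `mac_roman` codec as the stand-in for "the default character set on the Mac" is an assumption on my part, and the function name `misread` is made up.

```python
def misread(text: str, wrong_encoding: str = "mac_roman") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong code page."""
    return text.encode("utf-8").decode(wrong_encoding)


original = "Curaçao"

# In UTF-8 the 'ç' (U+00E7) takes two bytes, c3 a7.
print(original.encode("utf-8").hex())  # 43757261c3a7616f

# Read as Mac Roman, those two bytes become two separate characters.
garbled = misread(original)
print(garbled)  # Cura√ßao

# The damage is reversible as long as the garbled text has not been re-saved.
print(garbled.encode("mac_roman").decode("utf-8"))  # Curaçao
```

The single 'ç' turns into a '√' followed by a second stray character, much like the garbled text reported earlier in this thread. Other single-byte code pages produce different stray characters from the same two bytes.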

Try to open the four files on the Mac. The set of QuizKid1.utf8 and QuizKid2.utf8 do not open because they do *not* contain any byte order marks. The set of QuizKid3.utf8 and QuizKid4.utf8 do open correctly.

Now try this. In the ruleset that applies to json files, change the character encoding option to 'Ask for Each File in Each Window'. Also, rename the file extensions so they all end in .json (i.e. QuizKid?.json). Try again in DiffMerge. When asked, choose 'Western European'. Now try all 4 files. Do they now display correctly? You should notice that QuizKid3.json and QuizKid4.json do not prompt for an encoding because they have identifying BOMs.

A couple of suggestions:

- Switch to an editor that will properly insert a BOM into the UTF-8 file
- Configure a different encoding with your ruleset for the files.
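
If changing editors is not practical, the BOM can also be added after the fact. The following is a minimal Python sketch of the first suggestion, assuming the file is already valid UTF-8; the function names and file names are hypothetical.

```python
import codecs


def with_bom(data: bytes) -> bytes:
    """Return data with a UTF-8 BOM prepended, unless it already has one."""
    data.decode("utf-8")  # raises UnicodeDecodeError if the input is not valid UTF-8
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def add_utf8_bom(src_path, dst_path):
    """Write a copy of src_path to dst_path that starts with a UTF-8 BOM."""
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(with_bom(data))
```

For example, `add_utf8_bom("somefile.json", "somefile-bom.json")`. One caution: the JSON specification (RFC 8259) says a BOM must not be added to JSON text sent over a network, and some parsers reject it, so it is worth checking that whatever consumes the file accepts a BOM before uploading the converted copy.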
Attachments
QuizKid-WithBOM.zip
2 utf8 files saved with BOM
(746 Bytes) Downloaded 16 times
QuizKid.zip
2 utf8 files saved without BOM
(726 Bytes) Downloaded 17 times
Jeff Clausius
SourceGear

Posts: 5
Joined: Wed Jun 28, 2017 3:26 am
Posted: Mon Jul 03, 2017 11:05 am
Thanks for all your help. As I didn't have any more time to figure this out, I ended up switching to a different program, which didn't have this issue.

Posts: 3471
Joined: Tue Dec 16, 2003 1:17 pm
Location: SourceGear
Posted: Mon Jul 03, 2017 3:25 pm
NP.

If you end up coming back and cannot change editors, choosing the Western European code page would probably do the trick.

Thanks for the update.
Jeff Clausius
SourceGear
