IMmerge
Trillian/ICQ/MSN Instant Messaging Log Merger by zAlbee

IMmerge 1.03 – Smarter Display Name Resolution

March 10, 2011

Today, I’d like to announce the release of IMmerge 1.03. Several important bug fixes are included (thanks to all who reported them!), so I highly recommend you download the new version. Aside from that, this release mainly improves the accuracy and usability of display name resolution: IMmerge now asks you for confirmation far less often than 1.02 did. In the rest of this post, I will explain why display name resolution is needed in the first place, and how IMmerge solves the problem.

Names Without an Identity

Display name resolution is mainly needed when IMmerge converts from one log format to another. If you are simply merging one type of log and don’t need conversion, there’s no issue! IMmerge will copy over the information exactly, without caring which part is someone’s display name and which part is what they said. But when format conversion is needed, IMmerge needs to parse the log and understand it. Most plain-text formats like Trillian LOG, and even some XML formats like Windows Live Messenger (MSN), only log the person’s name next to their message, and not the user ID. However, other log formats might want to know who really sent the message (e.g. to colour code the messages, like Trillian Pro), and some formats (such as ICQ) only store whether a message is incoming or outgoing, and do not store the display name at all! So to do the conversion properly, we need a reliable way to match each display name to its user ID.

Back in 2007, I solved this problem for MSN using a heuristic algorithm that compared new names to previously seen names, with good success. When implementing Trillian LOG conversion in IMmerge 1.0, I re-used the same logic. Unfortunately, when run on real-world Trillian logs, it produced a lot of user prompts and many false positives. In IMmerge 1.03, I have made several tweaks to the algorithm that improve its behaviour on Trillian logs.

The Problem with Plain Text

With Trillian plain-text logs, a single message looks like this:

Alice: Hi Bob!

MSN logs use an XML format, and it looks more like this (greatly simplified):

<Message>
 <From>Alice</From>
 <To>Bob</To>
 <Text>Hi!</Text>
</Message>

This takes a lot longer for a human to read, but the advantage of the structured XML format is that it delineates all the fields, so we are always sure which part is the display name and which part is the message. With Trillian logs, we need to guess based on context.
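To illustrate why the structured format is easier to work with, here is a minimal Python sketch that reads the simplified snippet above. It only handles that simplified shape, not the real MSN log schema, and the function name `read_message` and the dictionary keys are made up for this example.

```python
import xml.etree.ElementTree as ET

# The simplified message from the post, not the real MSN log schema.
SIMPLIFIED = """<Message>
 <From>Alice</From>
 <To>Bob</To>
 <Text>Hi!</Text>
</Message>"""


def read_message(xml_text):
    """Return the sender, receiver and text of one simplified message."""
    node = ET.fromstring(xml_text)
    return {
        "sender": node.findtext("From"),
        "receiver": node.findtext("To"),
        "text": node.findtext("Text"),
    }
```

Every field comes back separately, so nothing has to be guessed about where the name ends and the message begins.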

The following is a more complicated example (left: original, right: indented):

Alice: Hi Bob, here is my info:              Alice: Hi Bob, here is my info:
Address: 123 Alley Way                           Address: 123 Alley Way
Phone: 456-7890                                  Phone: 456-7890
Bob: Thanks Alice!                           Bob: Thanks Alice!

All 4 lines on the left look the same: a single word followed by a colon, then more words. Yet there are only 2 true messages here, first from Alice, then from Bob. The indented version (right) makes this clear. Indeed, IMmerge already assumes that indented lines are not new messages; however, many IM clients (including several versions of Trillian) do not indent the continuation lines of a message. Another hint is timestamps: if every message is timestamped, then a line without a timestamp cannot be a separate message. Unfortunately, the timestamp option in Trillian is turned off by default, so the method still has to be reliable for users who log without timestamps. IMmerge may include this timestamp heuristic in the future, but not in this version.
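The indentation rule can be sketched roughly as follows. This is an illustration in Python, not IMmerge's actual parser: the `CANDIDATE` pattern and the `split_candidates` function are hypothetical, and the sketch assumes a candidate name is any unindented text before the first ": ".

```python
import re

# Hypothetical pattern: unindented text up to the first colon, then a space.
CANDIDATE = re.compile(r"^(?P<name>[^:\s][^:]*): (?P<text>.*)$")


def split_candidates(lines):
    """Group raw log lines into (candidate_name, text) pairs.

    Indented lines, and lines without a "name: " prefix, are treated as
    continuations and appended to the previous message.
    """
    messages = []
    for line in lines:
        # An indented line is never considered the start of a new message.
        match = None if line[:1].isspace() else CANDIDATE.match(line)
        if match:
            messages.append((match.group("name"), match.group("text")))
        elif messages:
            name, text = messages[-1]
            messages[-1] = (name, text + "\n" + line.strip())
        else:
            # Continuation with nothing before it: keep it, with no name.
            messages.append((None, line.strip()))
    return messages
```

On the indented version of the example this yields two candidates, Alice and Bob. On the unindented version it yields four, including "Address" and "Phone", which is exactly the ambiguity that name resolution has to deal with.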

With MSN logs, we knew exactly where the name starts and ends, so a name could only belong to one of the two people. With Trillian, there is a 3rd possibility: the line is not a new message at all, but a continuation of the previous one.

What happens in this scenario? In previous IMmerge releases, the program might ask:

IMmerge thinks
"Address" is <Alice>.
"Address" is NOT <Bob>.
Is this correct? (yes/no)

Users were confused, and rightfully so, because IMmerge only got half of it right — it’s not <Bob>, but it’s definitely not <Alice> either!

In 1.03, IMmerge will give the user up to 4 choices:

  1. NAME is person 1
  2. NAME is person 2
  3. NAME is not any person in the conversation
  4. NAME is someone else (e.g. in a chat room)

This lets the user convert their logs correctly. However, it does not reduce the number of times the user is prompted. Let’s look at ways to actually make the algorithm more intelligent.

The MSN Algorithm (for lack of a better name…)

When porting the display name resolution algorithm from MSN to Trillian, we suffered a usability problem. IMmerge started asking the user a lot of questions. It seems it was a lot less certain about display names than ever before. This is partially understandable — there could be lots of false positives like in the above example with the address. However, there is a second reason why MSN logs were more “well-behaved.” MSN gave us more comparisons to work with.

The idea behind IMmerge’s original name detection algorithm was to compare the unknown display name to a set of all previously known display names. If the name has an exact match, the answer is easy, but what if they are not equal? Well, running the similarity metric against all previous names could get very slow, so we only run it on the most recently seen one. The second nice thing about MSN’s XML format is that each message is logged with both the display name of the sender and the receiver. So we run a similarity metric against both sender and receiver names, and whoever is closest wins. There are 4 comparisons we can do:

a) New sender name vs Person 1’s old name
b) New sender name vs Person 2’s old name
c) New receiver name vs Person 1’s old name
d) New receiver name vs Person 2’s old name

High similarity in (a) and (d) both suggest the message sender is Person 1. High similarity in (b) and (c) both suggest the sender is Person 2. The final score is a+d-b-c in favour of Person 1. The larger the difference in scores, the more certain we are.
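As a rough illustration of the a+d-b-c score, here is a Python sketch. The similarity metric IMmerge really uses is not shown in this post, so `difflib.SequenceMatcher` from the standard library stands in for it, and the function names are invented for the example. Note that for (d) to support Person 1 as the sender, the receiver name has to be compared against Person 2's old name, which is how the sketch reads the list above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in metric in [0, 1]; an assumption, not IMmerge's real metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def msn_score(new_sender, new_receiver, p1_old, p2_old):
    """Return a + d - b - c; positive values favour Person 1 as the sender."""
    a = similarity(new_sender, p1_old)    # sender looks like Person 1
    b = similarity(new_sender, p2_old)    # sender looks like Person 2
    c = similarity(new_receiver, p1_old)  # receiver looks like Person 1
    d = similarity(new_receiver, p2_old)  # receiver looks like Person 2
    return a + d - b - c
```

A positive score points to Person 1 as the sender and a negative score to Person 2. The further the score is from zero, the more confident the guess.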

This worked great for MSN. Even though MSN users tend to change their display names more frequently than users of other services, this algorithm can often make a decision with absolute certainty. Say two people have frequent conversations. It is unlikely that both of them will change their names at the same time; usually only one will. Since MSN stores both the sender’s and the receiver’s names, if we match just one of them exactly, then we can safely infer the identity of the unknown, never-seen-before name!

In contrast, with the Trillian LOG format, we can only do two comparisons:

a) New sender name vs Person 1’s old name
b) New sender name vs Person 2’s old name

So unlike MSN, we will encounter uncertainty with every display name change. As such, users of IMmerge 1.0 – 1.02 may notice a higher number of prompts asking for confirmation, which can be quite tedious.

Trillian’s Saving Grace or “Why didn’t I see that before?”

For every conversation (or session), Trillian writes a session header in the log. The header is very important because it divides and timestamps the log into sessions, but we overlooked the fact that Trillian also stores the display name of the contact there! Since a session ends when the Trillian user closes the chat window, and most people close their chat windows after they’re finished talking, it is highly unlikely that either user would change his/her display name while the conversation is ongoing. This almost makes the task trivial :-). If the name matches the session display name, the message was sent from the contact, otherwise we use the similarity metric.
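The session-header shortcut might look something like the sketch below. It assumes the contact's display name has already been extracted from the session header (the header format itself is not reproduced here), and it again uses `difflib.SequenceMatcher` as a stand-in for the real similarity metric. The function and parameter names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in metric in [0, 1]; an assumption, not IMmerge's real metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_sender(name, session_contact_name, owner_old, contact_old):
    """Guess who sent a line: "contact", "owner", or "tie"."""
    # Exact match with the name stored in the session header: the contact.
    if name == session_contact_name:
        return "contact"
    # Otherwise fall back to the two comparisons available for Trillian logs.
    owner_score = similarity(name, owner_old) if owner_old else 0.0
    contact_score = similarity(name, contact_old) if contact_old else 0.0
    if owner_score > contact_score:
        return "owner"
    if contact_score > owner_score:
        return "contact"
    return "tie"
```

Only names that miss the exact match reach the similarity metric, so within a session the contact's lines should rarely need a prompt.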

Tie-Breaking

Finally, there are the cases where the similarity metric is a tie. Usually it happens because the new name bears no similarity to either of the previously known names, or we are just starting out, so we don’t know any previous names. Aha! If the latter case is true, then we can assign the new name to whomever doesn’t yet have a name. If both people already have names, and there still isn’t any similarity, then it’s a good bet that the new name is a false positive and not a name at all. So IMmerge will now make the following guess in case of ties:

  1. If log owner does not have a previously known name: owner
  2. Else If contact does not have a previously known name: contact
  3. Else: Neither/Not a Name

Since it’s a tie, the algorithm’s certainty score is by definition low, so IMmerge will still ask you to confirm (unless you chose fully automatic mode).
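The three tie-breaking rules translate almost directly into code. This Python sketch uses made-up names and return values, and treats `None` or an empty string as "no previously known name".

```python
def break_tie(owner_known_name, contact_known_name):
    """Pick a default guess when the similarity scores are tied."""
    if not owner_known_name:
        return "owner"       # rule 1: the log owner has no name yet
    if not contact_known_name:
        return "contact"     # rule 2: the contact has no name yet
    return "not a name"      # rule 3: both are named, likely a false positive
```

The result is only the pre-selected answer for the prompt, since a tie still requires confirmation outside fully automatic mode.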

Final Touches

The last optimization is a usability one. After repeatedly testing the improvements, I got just as tired as anyone else of answering all the prompts. So in both the GUI and CLI versions of IMmerge, the best choice will be automatically selected, and if it’s right, you only need to press Enter to confirm. Otherwise, a couple of up- or down-arrow keystrokes should do it. I’ve also included an accuracy counter. When I first started working on this version, the accuracy was a dismal 15% on my test set. With all the tweaks, IMmerge now gets 29/30 (about 97%) of its guesses right. Of course, your mileage may vary. Enjoy the new version!

-zAlbee


Filed under: Algorithms, IMmerge v1, Release