Monday 28 April 2008

Easy product or class rating system

So you've got a lovely little ratings system going on your site. All of a sudden though you get slashdotted, dugg or just your marketing starts working and you have thousands of users all rating your products / services / systems / posts / videos etc and your pages start to creak.

"It's the shared web space you're on," say your techies, "it can't handle the users" and duly bounce you to a better hosting environment at triple the cost along with the migration charges.

From time to time I come across this problem when I've either picked up code from someone else or else a techie asks me how to optimise a page that's running really slowly. In this particular instance it was caused by a ratings system in the style of Amazon or YouTube - basically a user is displayed a product and then people rate it as to whether it's any good. The real problem came when they had a list of products, each of which had it's individual ratings displayed.

The cause of this very slow page however had nothing to do with shared hosting or otherwise or direct server load - it was all down to some naive coding executing what my old CS lecturer would call an O(n)2 process.

What the coder had done was get a list of products, then for each product gone back to the database and got a list of all the rankings ever made and then averaged them out. Nice and simple but frightfully inefficient and that which caused the problem I've highlighted.

This isn't the first time I've seen this and I've been asked how to build them numerous times as well so here's a well optimised method of doing it in general terms.

Consider first that calculating the average when you insert into the database is going to be computationally less expensive than calculating it every time you perform a select when a user hits the page. This sounds obvious but it's stunning how often it's overlooked.

Make two extra fields for your product table, one called average and the other called user_count or something. On your insert of the rating into the ratings table, run a trigger or else add some code that will update the product table with the updated count and a new average calculated from the ratings info.

Now when you select the product data you pull down the average and user count as part of that select and they are just simple static fields, thus adding no more computational load than the original select or view does already.

This gives you a nice little rating system that's not heavy in terms of processor load. However we can improve things once step further if you aren't interested in the data.

The option I'm providing below is good if you are just after a running average and don't care about the individual ratings being kept. I did a project recently where we weren't worried about keeping individual ratings data because the site wasn't going to be up for very long and it didn't add anything to our system to have it.

This option uses a running weighted average in order to just update the data in the product table without requiring a ratings table at all.

Some useful background maths though:

If I have a set {3, 4, 4} and take it's average I need to add the numbers and divide by the number of entries. Thus this set's average is (3+4+4)/3 = 3.67

Now suppose I've precalculated this average as I've suggested above and stored it without the individual ratings, I now want to add another rating, 2 to the set.

Intuition says to do something like this: (2 + 3.67)/2 = 2.83 which is actually wrong. Looking at the set {3, 4, 4, 2} we can guestimate that the average is going to be somewhere more between 3 and 4 than it is 2 and 3 as we've calculated above.

Thankfully a technique from statistics gives us an option here which is to use a weighted average instead. This is useful for adding sets together that have different numbers of elements within them but maintain the averages by skewing the data using proportional averages (or a weighted average).

The general formula for this is:

Avgw = (Avg1 * (n1 / (n1+n2))) + (Avg2 * (n2/ (n1+n2)))

Where:

Avg1 is the average of the first set
Avg2 is the average of the second set
n1 is the number of elements in the first set
n2 is the number of elements in the second set

In our example this simplifies even further because our second set is actuall only one item. So let's work this through:

Avgw = (3.67 * (3/(3+1))) + (2 * (1/(3+1)))

= (3.67 * 3/4) + (2 * 1/4)

= 2.75 + 0.5

= 3.25

Which is the answer we're after for our average.

As we know all the base line average data in the product table and we know the value of the rating we're tracking, it's a very simple function to update this instead of doing another insert into a ratings table and we just keep on doing it for every rating that has been added.

Computationally this is a very inexpensive process and whilst I'm more than happy to be shown otherwise I think this is about as good as it gets in terms of optimisation.

The key thing is we've now reduced an O(n)2 operation to O(n) which is a drastic improvement as n tends towards infinity.

Tuesday 22 April 2008

Phorm over function?

Phorm is, and will continue to be for some time I think a hugely divisive issue online. BBC have another story today about it, this time having spoken to the various security companies like F-Secure, McAffee etc about whether they will flag a message to the user about whether Phorm has been enabled or not.

Phorm management have come out saying "it's only a cookie", the same as many other sites use to provide tracking (such as Google Analytics), interactivity (such as shopping carts or ID maintenance on numerous retail sites), or a small amount of memory (configuration information for the BBC home page for example).

The difference, though, is that the information is being used differently because data is being shared.

This is what got the Information Commissioners Office's back up because sharing data between companies without users opting in is a breach of the Data Protection Act - "But not if it's anonymous data" say the legal eagles from Phorm - and technically they are correct. This is a case of adopting the letter of the law rather than the spirit of it.

Tim Berners-Lee came out saying he would move ISP if he found out they were using Phorm and whilst I admire his line I fear the vast majority of consumers won't care or rather just won't be bothered to switch - just see how many people actually switch bank or utilitiy companies.

For me this is a case of the slow erosion of privacy at the hands of our ISPs. In a massively competitive market where margins are being squeezed ever tighter, the sale of their user data to Phorm must have seemed like the proverbial golden goose.

It won't take long for someone to cotton onto the flip side of this and market aggressively on the privacy front. Talk Talk made huge inroads as an ISP on the back of their "The Internet should be free" campaign with regard to price (being bundled as it was with other services). Who will be the first to play the "Internet should be private" card and sign up to a deal not using Phorm or other tracking software?

In my cynical world view, I think the security firms have realised this and it is 99% of the reason for why they are looking at it all as the anti-spy, -mal and -virus software is worth billions.

In real terms Phorm isn't actually that clever a piece of technology - most of what has been achieved is in the brokering of deals between ISPs and content owners and then a bit of clever gluing in the middle.

In the end Phorm will either be a great white elephant and just slip off the radar the way many technologies and companies have done or else it may actually be a spur to drive privacy legislation forward in line with our digital behaviour - how long it will take to do this however is the question as government is typically a long way behind technology in terms of law-making.

Monday 21 April 2008

Can Yahoo really get things so wrong?

Update - The guys at Yahoo came to our rescue after tracing through the "network" somewhat to find someone that knows someone at Yahoo to help us out. Unfortunately their techies couldn't explain why we'd been bloack listed either but we are now officially on their whitelist so big thanks to the guys for helping us out.

Yahoo are one of the original dotcoms. They've been around for a long time so they should know their business. Imagine my surprise when one of my clients starts complaining that their confirmation emails to yahoo email accounts are permanently being binned as is everything else they send - including personal communications.

Like most mail providers, free or otherwise, Yahoo have a spam policy that will look at an inbound email and then drop it in your inbox or spam folder depending on how it is classified.

As with most techies I have about a dozen email addresses at various providers in order to test exactly these sorts of issues. Especially given that the goalposts are changing all the time.

Sure enough even a personally addressed confirmation email was killed as it came into my yahoo account. "Ah ha," said I, "they've been blacklisted". So off one goes and checks the various blacklisting sites and there's nothing there. Hmmm.

It transpires that yahoo have just taken it on themselves to block that domain. Weirdly though, a personally addressed mail to me from the client with only the word "test" in the subject line is still considered Spam yet an email from some random address that doesn't reply, containing several instances each of the words "penis", "cock", "viagra" and "cialis" made it through to my inbox completely unscathed. At this point the phrase about arses and elbows definitely comes to mind.

Trying to get Yahoo to do anything about this issue is similarly problematic as there are no feedback channels to deal with this problem at all.

So overall we've just had to advise people to not use Yahoo or to check their junk mail periodically and read the mail there.