A node.js bot in XMPP

Yesterday Danbri and I invested a few hours in a rapid node.js prototype for NoTube. We’ll blog about it properly elsewhere, but this is just a little note about a bit of curious behaviour we found, in case anyone runs into the same problem.

Just before Christmas Dan made a node.js XMPP bot that could appear in N-Screen and take commands via drag and drop (if you don’t care about this, the point is that it uses node.js with node-xmpp). Dan got it showing up in the N-Screen interface and reflecting back whatever you dropped on it. I thought it would be trivial to hook in one of the item-to-item recommenders I have working via a JSON HTTP API. Basically, to do HTTP GETs in node.js you need something like this:

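var http = require('http');

// pid is assumed to be defined already: the programme id to get suggestions for.
// http.createClient() is deprecated, so this uses http.get(), which also
// calls request.end() for you.
var options = {
  host: 'nscreen.notu.be',
  port: 80,
  path: '/iplayer_dev/api/suggest?pid=' + pid
};

http.get(options, function (response) {
  response.setEncoding('utf8'); // get strings back rather than Buffers
  var myStr = "";
  response.on('data', function (chunk) {
    console.log('BODY: ' + chunk);
    myStr += chunk;
  });
  response.on('end', function () {
    // the whole body has arrived
    console.log(myStr);
  });
}).on('error', function (e) {
  console.log('problem with request: ' + e.message);
});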

Then you can parse myStr e.g.

var j = JSON.parse(myStr);

What we wanted to do was send that myStr as a string content of an xmpp message like this:

var xmpp = require('node-xmpp');

// jid and password are assumed to be set to the bot's account details
var cl = new xmpp.Client({
  jid: jid + '/bot',
  password: password
});

cl.on('stanza', function (stanza) {
  var msg = { to: stanza.attrs.from, type: 'chat' };

  // ... get myStr of recommendations based on that message

  cl.send(new xmpp.Element('message', msg).c('body').t(myStr));
});

It really should have been simple, but in practice the myStr existed when printed out to console but was never sent. At first we thought it was something to do with Buffers – if you don’t do response.setEncoding(…) you get Buffers back and perhaps some stringification of those wasn’t working, but no. To add to the confusion we could create JSON objects from the string but they still would not send, even if copied to new strings.

In retrospect it’s obvious it was the XMPP that was the issue, but it took line by line trial and error sending of some sample input before I found it: & characters in the input were causing silent failure when the message was sent – presumably due to an XML parsing problem that for some reason didn’t show up as an error (I know almost nothing about node.js so there could have been some user error there too). Escaping these & didn’t seem to work, so for now we’ve just removed them and it all works fine. We need to raise a bug report, but thought I’d blog for now in case it helps anyone else. The code is here.

Archiving a Mediawiki Installation

NoTube uses a Mediawiki installation kindly provided by sti2.org for its internal documentation. As the project draws to a close (we finish officially on January 31st 2012, with our final review in late March) we wanted to make sure we had a copy of everything we had done over the last few years. Much of this is and will remain private to the partners but there are some interesting ideas and usecases we wrote down early on that we don’t want to lose track of. I hadn’t realised that by default Mediawiki has an API, but once I did, it was pretty simple to download all the pages. I’ve put the Ruby script on github in case it’s useful to anyone else. Basically the only fiddly bit is the cookies. You do, of course, need a username and password for the wiki you want to download, but thereafter, there’s an API call you can call recursively to get a list of all pages, and then download them individually.
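For what it’s worth, the “get a list of all pages” step can be sketched roughly as below. This is a minimal sketch and not the script on github: it assumes the list=allpages query and the “continue” block that current MediaWiki versions return (older installations used a differently shaped “query-continue” block), it loops rather than recursing, and fetch is a hypothetical callable standing in for the authenticated HTTP GET against api.php, cookies and all.

```ruby
# Collect every page title by following the API's continuation tokens.
# `fetch` is a hypothetical callable: it takes a Hash of query parameters and
# returns the parsed JSON response as a Hash. In real use it would do an
# authenticated GET against the wiki's api.php.
def all_page_titles(fetch)
  titles = []
  params = { 'action' => 'query', 'list' => 'allpages',
             'aplimit' => '500', 'format' => 'json' }
  loop do
    response = fetch.call(params)
    pages = response.fetch('query', {}).fetch('allpages', [])
    titles.concat(pages.map { |page| page['title'] })
    # The API says where to resume; no 'continue' block means we are done.
    continuation = response['continue']
    break unless continuation
    params = params.merge(continuation)
  end
  titles
end
```

Each title can then be downloaded individually with a second call per page.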

Web [on|and|in|for|with|via|through] TV Workshop

In September I participated in the programme committee of the W3C’s Web On TV workshop, which was held in Japan. Because of some existing commitments I was not able to go to the face-to-face meeting, so to try and make up for it, I read through all the papers instead of just my allocated ones. My notes are below. These are just my personal opinions, and I’m not an expert in the TV field (although Web and TV is my thing – I work on the NoTube project). All the papers are public. There is also a Draft Web and TV Interest Group Charter. The title of this post is stolen from danbri who was pointing out that Web AND TV need not be Web ON TV.

These reviews are very short – most of the papers are themselves very short, being expressions of interest. In some cases I just use a representative quote from the paper. The attempt here is to summarise not to evaluate them, though I indicate where I am interested in a particular topic. The workshop summary is here.

Summary

A large group were interested in BML and explaining why it’s important, perhaps indicating that they do not see a reason to change from using that.

A group are interested in HTML5 and how it might work with BML for interactive applications, and a subgroup interested in user interfaces for TV and common UIs for TV and other devices using HTML5.

There is an overlapping group who see the TV as being a hub for home entertainment, which seems to mean that everything is controlled via the TV, web pages are viewed on the TV etc.

There is also a group interested in APIs for TV and other devices (such as controls).

There is a strong sense that IPTV is very important and standards for it are important, especially DRM and efficiency.

I get the impression that there are a lot of participants who have specific scenarios in mind and also a number who are looking for interesting applications of HTML5 to TV.

Papers 21, 27 and 30 are the most interesting from my point of view. 31 makes an important point. I’ve a lot of sympathy with 36. 39 and 41 are also interesting.

Papers

1. Shinichi Matsui (Panasonic)

Would like to attend in a personal capacity. His view is that “TVs are the most important components, not only for displaying contents, but of “Ubiquitous Home Appliances” which will evolve to “Web Appliances” surrounding consumers.”

2. Tatsuo Matsuoka (Innovative IP Architecture Center, NTT Communications Corporation)

They have made an IPTV service. They are interested in what functions are done by different devices and APIs, and IPTV standards and DRM.

3. Katsuhiko Kageyama (Hitachi)

Interested in consumer electronics: user interfaces for TV especially HTML5 capabilities, control and communications between devices.

4. Sunghan Kim (ETRI/W3C Korea Office)

Interested in the relationships between devices, and content provision, e.g. start watching on one device and continue on another, and the various W3C and other standards that could be employed to make it happen.

5. Wayne Carr (Intel)

Interested in HTML5 as a way to provide web experience across a range of devices, TV in particular.

6. Masakazu Muraoka (HTML5-WEST.jp)

Interested in APIs to TV and HTML on TV.

7. Aaron Zhang (Huawei)

They are an IPTV provider and suggest an architecture for improving the user experience of the web on TV (avoid bad UI experiences of early PCs)

8. Masakazu Kobayashi (KDDI)

Interested in HTML5 as a common interface to Web TV, avoiding situations such as the different standards for e-books.

9. Yusuke Kawabe (NTV (Nippon Television))

Would like to talk about BML and the usecases for it, and see what any new usecases are.

10. Hidekazu Bunne (TV Asahi)

As (9)

11. Tomokazu Yamada (IPTV Forum)

Would like to talk about IPTV, specifically DRM and EPG metadata, and have some usecases to share.

12. Tatsuto Murayama (NTT)

Describe their requirements for HTML5:

“1. Layout optimization with reflowable materials
2. Requirements for vertical writing/reading and ruby annotations
3. HTML5 widgets as containers for digital books”

Seem most interested in digital books, but also talk about easy to use layout optimisation on TV screens.

13. Koichi MARUYAMA (NTT Cyber Solutions Lab.)

Interest is in a markup language for IPTV with
” – Easy multimedia description like BML/LIME
– Interactivity as rich as that of native application
– Service integration and linkage for multiple devices”

Social networking, performance and DRM are their main interests.

14. Limin Yu (DragonTec)

They have developed a BML IPTV browser and next are doing a LIME browser. They would like to demonstrate their browser. They are interested in standards suitable for the Chinese market. They think that W3C technologies have potential for interactive TV.

15. Shigeru Owada (Sony CSL)

Interesting ideas of devices as ‘fairies’ that can communicate with each other and that humans can communicate with. “We are interested more on fun usage of ubiquitous home network than protocol layer implementation”

16. Yoshikazu Seki (Fuji Television)

As (9)

17. Kazunori Tanikawa (NEC)

Interested in IPTV, scenarios, and the potential of HTML5.

18. Kenji Sugihara (TV Tokyo)

Similar to (9), especially for broadcaster-controlled interactive applications using BML.

19 is missing

20. Hiroyuki Aizu (Toshiba)

Would like to show some usecases of HTML5 on TV as the hub within a home network, and some ideas about communication technology and TV.

21. Shuhei Habu (Allied Resources Communications)

Interesting usecases and a proposal for privacy for TV in HTML5 based on BML and APIs for TV.

22. Kenji Fukuda (Wowow)

Similar to (18)

23. Jan Lindquist (Ericsson)

A member of the Open IPTV Forum (OIPF) standardization group who would like to talk about his experiences in standardisation in the subgroup responsible for the web platform (javascript, embedded video).

24. Yoshiaki Ohsumi (Panasonic R&D)

Interested in possible future usecases and smarter integration of TV and web technologies; TV as a hub.

25. Ishidoshiro Takashi (Melco)

Make TVs and peripherals. Interested in the future relationship between BML and HTML5, and between traditional over-the-air broadcast and HTML5, and in how to improve the experience from the user’s point of view.

26. Keiya Motohashi (NHK)

Interested in public service usecases such as disaster information, BML and interactive applications, connecting TV with the web.

27. Hyojin Park (KAIST)

Researchers on TV. Interested in device APIs for the browser to control the TV, architecture and standards to allow appropriate UIs for different devices.

28. Toshio Watanabe (Tokyo Broadcasting System Television, Inc.)

The paper is about BML, which is a markup language widely used in Japan and ‘is basically an extension for existing Web standards, e.g., XHTML 1.1’.

They would be able to provide usecases and are interested in seeing how TVs will become more of a hub for entertainment in the home, and how these changes fit with html5.

29. Makoto Nishimura (Cisco Systems)

“Our interest is the integration of LIME and HTML5 on to our video products such as IP-STB, RF-STB and other related solutions.”

30. Hiroshi Omata (Jig.jp)

They have made remote controls from mobile phones and are therefore interested in device APIs for TV. Also interested in standardising HTML5 for TV.

31. Naomi Nakamura (ACCESS)

Think that people don’t use TV but watch it – i.e. a lean-back experience; therefore new usecases will need to be thought through that accept this.

32. Tatsuya Igarashi (TDG, Sony Corporation)

Again interested in the TV as a hub for home entertainment and integrated web technology; interested in HTML5; provide usecases.

33. Shozo FUKUI (Tomo-Digi Corporation)

They would like to explain why BML was useful, explain the difference between BML and HTML5 and have several usecases to discuss, including extensibility in the future.

34. Tatsuki Matsuda (NTT-Resonant Inc.)

They would like to join the workshop, but don’t offer a paper – they are providers of web portal services and would like to be able to integrate with TV services.

35. Masahito Kawamori (ITU-T)

From ITU-T: would like to present their experiences standardising for IPTV: declarative languages, Lua, SVG, ECMAscript.

36. Charles McCathieNevile (Opera Software)

Proposes concrete steps (e.g. testcases) for ensuring “use of HTML on TV [is] more closely aligned with its usage in general”, and that this should happen in W3C or in close collaboration with W3C.

37. Daniel Park (Samsung)

“We are supporting on developing best practices and guidelines for Web on TV as well as easy of connection with other Web-capable devices from Web application.”

38. Diot Christophe (Technicolor)

They can help bring the views of services providers and content producers to the table. Interested in web on TV applications, why HTML5 not CE-HTML.

39. Asanobu Kitamoto (NII)

Describes the concept of ‘Bayesian TV’ – not just TV on the web or vice versa, but a personalised push system, rather than the pull of the web, with recommendations and user interactions.

40. Manabu Shimobe (UIEvolution)

“we are very interested in contributing to defining the additional standards needed for smarter integration of web technologies and broadcast services” particularly user interface aspects.

41. Kiyoshi Oura (Airframe)

Interesting points made – the only one to mention advertising – describing some of the different watching scenarios of the future including different devices. Interested in HTML5 and the potential for the continuing evolution of content, especially flexible data storage mechanisms.

Some FOAF stats

Some FOAF stats from Sindice for something I had to write last week.

All classes

“Agent”, 3.84 million
“Document”, 6.15 million
“Group”, 5.78 thousand
“Image”, 711.23 thousand
“OnlineAccount”, 15.47 thousand
“OnlineChatAccount”, found 324
“OnlineEcommerceAccount”, found 242
“OnlineGamingAccount”, found 240
“Organization”, 10.05 thousand
“Person”, 2.64 million
“PersonalProfileDocument”, 11.7 thousand
“Project”, found 726

All properties

“accountName”, 8.02 thousand
“accountServiceHomepage”, 7.24 thousand
“aimChatID”, 9.54 thousand
“based_near”, 7.35 thousand
“birthday”, 2.48 thousand
“currentProject”, found 648
“depiction”, 696.31 thousand
“depicts”, 617.16 thousand
“dnaChecksum”, found 65
“family_name”, 2.46 thousand
“firstName”, 4.2 thousand
“fundedBy”, found 237
“geekcode”, found 107
“gender”, 15.8 thousand
“givenname”, 24.17 thousand
“holdsAccount”, 9.88 thousand
“homepage”, 1.22 million
“icqChatID”, 22.8 thousand
“img”, 684.38 thousand
“interest”, 64.77 thousand
“isPrimaryTopicOf”, 1.54 million
“jabberID”, 2.98 thousand
“knows”, 1.08 million
“logo”, found 374
“made”, 1.97 million
“maker”, 1.97 million
“mbox”, 3.7 thousand
“mbox_sha1sum”, 43.9 thousand
“member”, 5.53 thousand
“membershipClass”, found 58
“msnChatID”, 7.68 thousand
“myersBriggs”, found 154
“name”, 1.77 million
“nick”, 96.7 thousand
“openid”, 80.24 thousand
“page”, 5.84 million
“pastProject”, found 179
“phone”, found 999
“plan”, found 139
“primaryTopic”, 278.11 thousand
“publications”, found 202
“schoolHomepage”, found 644
“sha1”, found 60
“surname”, 25.32 thousand
“theme”, found 282
“thumbnail”, 2.51 thousand
“tipjar”, found 73
“title”, 2.02 thousand
“topic”, 3.13 million
“topic_interest”, found 90
“weblog”, 300.06 thousand
“workInfoHomepage”, found 505
“workplaceHomepage”, 1.68 thousand
“yahooChatID”, 6.72 thousand

Displaying Guardian book reviews for quick buying on Amazon

I read the Saturday Guardian every week, and quite often buy a bunch of books reviewed in it. But equally, I don’t buy quite a lot of them as they’re only available in expensive and bulky hardback (plus I resent being market segmented like that, sorry). The Guardian’s reviews are very good but they only really review hardbacks in any depth or breadth, so it’s hit and miss whether I actually get to read any of them by the time they get to paperback. I just forget. I bet a lot of people do this.

Anyway, a couple of months ago I realised there was a Guardian content API as well as a data API. I applied for a developer key and, to my surprise, got one (the docs said they were giving out very few). This weekend I finally got around to having a play with it. It’s pretty neat. I’ve not explored it very thoroughly – I’m sure people can think of much more profound applications to make – but for book reviews there is lots of interesting data, and it’s available in JSON and XML.

My initial plan was to programmatically create an Amazon list – but this isn’t possible using the Amazon ECS API. However it is possible to search (on books, title, and authors) and get XML back, including a link to the Amazon page that describes it. I made a very simple page that does a request for book reviews with the appropriate date, and then for each result returned, identifies the author and title and does an Amazon lookup to get the URL (I just pick the first one returned – I’m feeling lucky). It’s not as convenient as I’d hoped, but it does make it that tiny bit easier to:

  • Buy things from the list straight away
  • Put things that are only available in hardback into my wishlist so I don’t forget about them

There are a couple of issues:

  • The title and author aren’t available as separate fields in the Guardian API. Usually the linktext is very formulaic and the information can be parsed out of that, but sometimes there are non-standard items and these fail
  • Characters with accents are returned as HTML entities so those need to be swapped back to characters in order to do the Amazon search
  • There’s no data about whether the book is in paperback or not, annoyingly. Amazon seems to mostly return the paperback version first if available, but maybe this is just good luck, and it probably needs more thought
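The first two issues can be sketched roughly like this. It’s only a sketch, not the code linked below: it assumes the link text looks like “Title by Author” (the real format may well differ), the entity table is a tiny illustrative sample, and the names decode_entities and split_title_author are made up for the example.

```ruby
# A few named entities for illustration only; a real table would be much longer.
NAMED_ENTITIES = {
  'amp' => '&', 'quot' => '"', 'lt' => '<', 'gt' => '>',
  'eacute' => "\u00e9", 'egrave' => "\u00e8", 'ouml' => "\u00f6"
}.freeze

# Swap HTML entities (named, decimal or hex) back to plain characters.
# Unknown named entities are left untouched.
def decode_entities(text)
  text.gsub(/&(#[xX][0-9a-fA-F]+|#\d+|\w+);/) do
    ref = Regexp.last_match(1)
    if ref =~ /\A#[xX]/
      [ref[2..-1].hex].pack('U')
    elsif ref.start_with?('#')
      [ref[1..-1].to_i].pack('U')
    else
      NAMED_ENTITIES.fetch(ref) { "&#{ref};" }
    end
  end
end

# Split link text of the assumed form "Title by Author" into [title, author].
# Splits on the last " by ", and returns nil for non-standard items so the
# caller can skip them.
def split_title_author(link_text)
  match = decode_entities(link_text).strip.match(/\A(.+)\s+by\s+(.+)\z/)
  match && [match[1].strip, match[2].strip]
end
```

Anything that comes back nil is one of the non-standard items and just gets skipped.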

The result isn’t too bad though and maybe I’ll buy a few more books. The Ruby code is here – you’ll need your own API keys for the Guardian and for Amazon though (they are both free and you can just get an Amazon one if you have an account with them).

Generating specs from RDFS / OWL docs

I’ve been hacking away at danbri’s version of specgen so we can rev the foaf spec. The idea is that you take an RDFS / OWL schema and generate some human-readable HTML from it, by taking the classes and properties and writing out their basic constituents. Optionally you can add some introductory text in a template, plus some individual bits of text for each property and class, eventually in different languages too.

I slapped in some RDFa yesterday because we needed a replacement for the ugly addition of RDF directly into the HTML, which makes it invalid. I realise some people may think this is back to front, but the FOAF spec’s ‘original’ format has always been RDFS/OWL so it makes sense for us. I’m not actually sure we need two RDF versions (as there is an alternate link pointing to the RDFS/OWL version from the HTML) but heck why not, and danbri’s consulting the community so there’s probably an argument I’ve missed.

There are several specgens available and at some point it might be nice to rationalise, or maybe go for functional equivalence. These are probably better in some senses than the one I’ve been working on, especially as I’m new to Python.

The ones I’ve found:

I think the two things that unite the first three are that they are (a) self-described hacks and (b) in Python. The FOAF one uses RDFlib rather than Redland because danbri was having trouble with the Redland installation on the Mac, I believe.

Next things I’d like to look at are

  • Generating specs from sample data (maybe someone’s done this already? It wouldn’t be complete but could be a start)
  • Defining application profiles or Argots and using them to generate, say, useful Sparql queries
  • Pictures!

CharBotGreen for Identica

CharBotGreen is still suspended on Twitter but fortunately she’s still announcing away on Identi.ca.

It’s trivial to move a bot from one to the other. In the source for CharBotGreen there’s a line

u = "http://twitter.com/statuses/update.json"

Using the Twitter-compatible Identica API I can just replace that line with:

u = "http://identi.ca/api/statuses/update.json"

The only thing to watch for is that Identica stores names as lowercase and the authorisation fails if you don’t send it in lowercase.

Doesn’t work in Identi.ca:

req.basic_auth 'CharBotGreen', 'sekret'

Works in Identi.ca:

req.basic_auth 'charbotgreen', 'sekret'

That’s it though – easy!
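To make the lowercase gotcha harder to trip over, the request could be built in one place that always downcases the name. This is just a sketch and not the actual CharBotGreen source: build_status_request is a made-up helper name, and it assumes the update is sent as a form-encoded “status” field, as in the Twitter-compatible API.

```ruby
require 'net/http'
require 'uri'

# Build (but don't send) a status update request for a Twitter-compatible API.
# The username is always lowercased, since Identi.ca rejects mixed-case names.
def build_status_request(endpoint, username, password, status)
  uri = URI.parse(endpoint)
  req = Net::HTTP::Post.new(uri.path)
  req.basic_auth(username.downcase, password)
  req.set_form_data('status' => status)
  req
end

req = build_status_request('http://identi.ca/api/statuses/update.json',
                           'CharBotGreen', 'sekret', 'hello')
```

The same helper then works for either service, as only the endpoint changes.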

Web Unperson

A couple of times this week people pinged me to say their browser was reporting my site as a phisher like this. I thought little of it since we’d been hacked before on Dreamhost and WordPress and assumed we had got on a blacklist somewhere. I rechecked the site, couldn’t find anything, and reported it as an error.

Last night though I found that my twitter bot, CharBotGreen had been suspended as a phisher, and tonight I find I’ve been suspended from twitter too. This is a bit of a blow, and the cause in both cases seems to be that I linked to my blog.

Using Google webmaster tools I discovered that several pages had links to viagra etc pages on them, invisible except in the source, with html inserted through the header php. Firefox and Safari made it difficult to find this out by inserting buggy ‘this is a phisher’ text (with broken links) over the source as well as the page itself.

I’ve now moved off Dreamhost completely – though it might have been simply that I had not updated WordPress enough. I’m on wordpress.com now, so I hope that’ll remove this riskiness.

The whole episode has made me rather depressed. Google has basically killed my online identity. I’m on various lists asking to be taken off, but there’s been no movement since last night, and I had no warning. It seems that there’s a blacklist being used in both cases, not completely sure what it is yet.

Anyway, if it happens to you, take it seriously and deal with it as soon as you can.

Update: I’m actually not on google’s suspended list any more. Hurrah! But still no Twitter. Guess it’s time to move to Identica with that and the madness of #fixreplies. Meh!

2nd Update: I got my Twitter account back this morning (2nd June, 3 days later). CharBotGreen is still suspended.

Useful links:

Google – My Site’s been hacked
Google webmaster tools
Google apps admin page: Google MX Records

iPhone working with PoGo

I’m so chuffed about this -

I bought a Polaroid PoGo inkless bluetooth mini sticker printer having been entranced by psd’s one, but knowing it didn’t work with the iPhone and that I’d have to get my laptop out to print anything. The PoGo is a lovely toy but I was getting a bit irritated by this limitation. The problem was twofold:

  •  iPhone bluetooth is crippled – you can only use bluetooth headphones, and not use it for file transfer. Annoying.
  • iPhone stores pictures as (peculiar) pngs and PoGo only accepts jpegs (which I found by trial and error – I can’t find any PoGo docs on that at all)

The first issue was easily solved – I have a jailbroken 1st gen phone and I just installed iBluetooth with Cydia, which is an app installer based on .deb packages.

The second was more tricky. I looked at ImageMagick for iphone (it’s on Cydia) but didn’t get anywhere. I think I needed to install gcc which was a step too far. Instead I put ssh on it (pretty cool in itself), found some hints on the web, and found that iPhone actually creates jpgs as well as pngs (in /private/var/mobile/media/DCIM/100APPLE – the pngs are in /private/var/mobile/media/DCIM/999APPLE). Weird! Anyway, iBluetooth allows you to browse the filesystem, and send files you find there, and that worked.

So all you really need is iBluetooth as it turns out. Hope this is useful to someone.

Companies House XML and Rewired State

I was at Rewired State last weekend and so a week or so ahead, I got around to applying for an XML Gateway account in order to get some interesting data out of there. This post was supposed to be about a few technical aspects of using the gateway, but first, I hope you’ll forgive a shortish rant about the difficulties of getting data from Companies House and the highly annoying economy around the Companies House data. If you like, skip to the technical bit.

Companies House Data

First a little background. Companies House contains all details about all the companies in the UK, including names, company number (their primary identifier), status (if suspended, function, in liquidation etc), the official filings of the companies such as annual reports, and information about company directors and other appointments, including usually, the home addresses of the directors (except for some exceptions for security concerns, MPs and the like). You can get some of this information for free, and some you have to pay a bit for, either as XML or RTF.

Companies House has a SOAP gateway, called the ‘XML gateway’. It’s a pretty simple SOAP interface with good documentation (pdf). The costs are the same as for the RTF format – a pound a piece for the more interesting bits, free for the basic information (still interesting) but you pay 6 quid a month for access (prices), which seems pretty reasonable. It does however take a few days to get the account, as it’s a credit account designed for businesses who want to resell the information, so you need to get a temporary account, do a test, then apply on paper to create a direct debit; they aim for 5 working days maximum from when they get the forms. I sent mine last Friday, it got here today, Thursday, so by my calculation they made it with a day to spare.

I had misunderstood what the timings meant (not 5 days from first contact), so it became clear by last Friday that my XML account wouldn’t be ready in time, but I wanted to show what could be done if we had that information.

Now, in theory at least anyone can buy this information from Companies House directly, on the day needed and available immediately using their WebCheck service (which bizarrely claims only to be open 7am to midnight). Reports and lists of directors cost a pound apiece and basic information about a company is free on the site (name, company number, main contact person and address, and status). In practice it’s a laborious task to actually get the information about the company, partly because the site’s fairly unusable, partly because sometimes it’s hard to know which specific company you are interested in because there are so many with similar names. Companies Open House is trying to remedy some of these issues.

My interest was in getting a few lists of directors in order to demonstrate foafcorp UK, a kind of data-focused They Rule at Rewired State. I bought a few (which went fine, uses a credit card and worldpay) and then tried to download them. You search for it in the web interface, which is hard to use because it’s very stateful and you can’t link to a particular company; you create a login; you buy it (‘Appointments report’), you get an email about it straight away, and the report gets put in your ‘download area’. Clicking on this makes a window pop up instructing you to right-click and download. This is because it’s an ftp file! Anyway, the problem is that you can’t download it:

curl -O ftp://wck2.companieshouse.gov.uk/image/5b/29/c7/1b/d1/75/e7/c3/42/77/da/1f/b0/bd/c0/60/repA_01631639_506-143015-03619061_12.rtf


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:03:13 --:--:--     0
curl: (56) FTP response reading failed

I started to think that maybe the FTP ports were blocked – not impossible since the Rewired State team had had to ask for specific ports to be opened, and some of the guys at lunch explained to me that ftp was quite complex in terms of ports, choosing random ones – so I tried from a remote machine, but still nothing.

I did, finally, manage to get 4 of the 17 I’d paid for down, just by repeatedly trying with curl. An answer to my email enquiry (phones are only open Monday – Friday) came on Monday. It said you needed to wait for an hour before downloading as the documents could appear to be there when they weren’t. I’m not sure that this was the problem because I’m still having the issue (the reports are available for 10 days). But it seems fairly clear that few people are using this technique to get this information.

Instead they use the various resellers of that information. Try doing a search for “uk company directors” and see what you find for sponsored links. You’re tempted in by free searches (the same information available for free on the Companies House site) and then you can buy what looks like the same information as costs a quid on the CH site for between 6 – 10 pounds. I’m wondering if these companies are simply using the XML gateway to create these reports? hm.

Ruby and SOAP and the XML Gateway

Anyway, down to the technical stuff. You can apply right away for a temporary username and password to access the XML gateway. The CH people are very efficient and you get this pretty much right away by email. This is the same as the real system except you only get information about one company back, so you can test the SOAP interface, and you can show that you could actually use it.

I looked at some SOAP libraries for Ruby but then just decided to try with net/http, uri and post. Probably not ideal, but time was short, the libraries were undocumented (even by the standards of Ruby) and this was a very quick way of seeing if their system worked and what information it would return.

The basic idea was posting some data to a url:


require 'net/http'
require 'uri'

# POST an XML document to the given URL and print the response body.
def post(u, data)
  begin
    puts "Checking url #{u}"
    url = URI.parse u
    http = Net::HTTP.new(url.host, url.port)
    # http.post returns a single response object in current Ruby versions
    res = http.post(url.path, data,
                    {'Content-type' => 'text/xml;charset=utf-8'})
    case res
    when Net::HTTPSuccess, Net::HTTPRedirection
      puts "response #{res.body}"
    else
      puts "problem"
    end
  rescue URI::InvalidURIError
    puts "URI is no good"
  end
end

The URL (the SOAP endpoint) for CH is

http://xmlgw.companieshouse.gov.uk/v1-0/xmlgw/Gateway

This method just prints out the XML response you get back (res.body) – but you could then use an XML parser like Hpricot to get the data out after that (Hpricot’s really an HTML parser so isn’t great at namespaced elements, but it can do them – and CH XML doesn’t have namespaces anyway).

The other part you need to do is authentication, again very simple. CH uses the name and password they give you, plus a random transaction identifier you provide:


require 'digest/md5'

user = "XMLGatewayTestUser" # you need to request these from CH; or ask me nicely
pass = "XMLGatewayTestPass"
transactionId = rand(7) # note: rand(7) only ever gives an integer from 0 to 6
digest = Digest::MD5.hexdigest("#{user}#{pass}#{transactionId}")

Then you just slot the digest into the XML that you need to send. There are lots of examples of the XML and the general documentation is in this Data Usage guide PDF. The FAQ is also useful.

You can see the code I made here. It’s not nice, but does get the results back. Here are some samples: search, directors, details.

And that’s it!