Companies House XML and Rewired State

I was at Rewired State last weekend, so a week or so beforehand I got around to applying for an XML Gateway account in order to get some interesting data out of Companies House. This post was supposed to be about a few technical aspects of using the gateway, but first I hope you’ll forgive a shortish rant about the difficulties of getting data from Companies House and the highly annoying economy around that data. If you like, skip to the technical bit.

Companies House Data

First a little background. Companies House holds details of all the companies in the UK, including names, company number (their primary identifier), status (suspended, functioning, in liquidation etc), the official filings of the companies such as annual reports, and information about company directors and other appointments, usually including the home addresses of the directors (with some exceptions for security concerns, MPs and the like). You can get some of this information for free, and some you have to pay a bit for, either as XML or RTF.

Companies House has a SOAP gateway, called the ‘XML gateway’. It’s a pretty simple SOAP interface with good documentation (pdf). The costs are the same as for the RTF format: a pound apiece for the more interesting bits, free for the basic information (still interesting), but you pay 6 quid a month for access (prices), which seems pretty reasonable. It does however take a few days to get the account, as it’s a credit account designed for businesses who want to resell the information. You need to get a temporary account, do a test, then apply on paper to create a direct debit; they aim for 5 working days maximum from when they get the forms. I sent mine last Friday and it got here today, Thursday, so by my calculation they made it with a day to spare.

I had misunderstood what the timings meant (it is not 5 days from first contact), and by last Friday it was clear that my XML account wouldn’t be ready in time. I still wanted to show what could be done if we had that information.

Now, in theory at least, anyone can buy this information from Companies House directly, on the day it is needed and available immediately, using their WebCheck service (which bizarrely claims only to be open 7am to midnight). Reports and lists of directors cost a pound apiece, and basic information about a company is free on the site (name, company number, main contact person and address, and status). In practice it’s a laborious task to actually get the information about the company, partly because the site’s fairly unusable, and partly because it’s sometimes hard to know which specific company you are interested in when there are so many with similar names. Companies Open House is trying to remedy some of these issues.

My interest was in getting a few lists of directors in order to demonstrate foafcorp UK, a kind of data-focused They Rule, at Rewired State. I bought a few (which went fine; it uses a credit card and WorldPay) and then tried to download them. You search for the company in the web interface, which is hard to use because it’s very stateful and you can’t link to a particular company; you create a login; you buy the report (‘Appointments report’); you get an email about it straight away; and the report gets put in your ‘download area’. Clicking on this makes a window pop up instructing you to right-click and download. This is because it’s an FTP file! Anyway, the problem is that you can’t download it:

curl -O ftp://wck2.companieshouse.gov.uk/image/5b/29/c7/1b/d1/75/e7/c3/42/77/da/1f/b0/bd/c0/60/repA_01631639_506-143015-03619061_12.rtf


% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:03:13 --:--:-- 0
curl: (56) FTP response reading failed

I started to think that maybe the FTP ports were blocked – not impossible since the Rewired State team had had to ask for specific ports to be opened, and some of the guys at lunch explained to me that ftp was quite complex in terms of ports, choosing random ones – so I tried from a remote machine, but still nothing.

I did, finally, manage to get 4 of the 17 I’d paid for down, just by repeatedly trying with curl. An answer to my email enquiry (phones are only open Monday to Friday) came on Monday. It said you needed to wait for an hour before downloading, as the documents could appear to be there when they weren’t. I’m not sure that this was the problem, because I’m still having the issue (the reports are available for 10 days). But it seems fairly clear that few people are using this technique to get this information.

Instead they use the various resellers of that information. Try doing a search for “uk company directors” and see what you find for sponsored links. You’re tempted in by free searches (the same information available for free on the Companies House site) and then you can buy what looks like the same information as costs a quid on the CH site for between 6 – 10 pounds. I’m wondering if these companies are simply using the XML gateway to create these reports? hm.

Ruby and SOAP and the XML Gateway

Anyway, down to the technical stuff. You can apply right away for a temporary username and password to access the XML gateway. The CH people are very efficient and you get this pretty much right away by email. This is the same as the real system except you only get information about one company back, so you can test the SOAP interface and show that you could actually use it.

I looked at some SOAP libraries for Ruby but then just decided to try with net/http, uri and post. Probably not ideal, but time was short, the libraries were undocumented (even by the standards of Ruby) and this was a very quick way of seeing if their system worked and what information it would return.

The basic idea was posting some data to a url:


require 'net/http'
require 'uri'

# POST an XML payload to the given URL and print the response body.
def post(u, data)
  puts "Checking url #{u}"
  url = URI.parse(u)
  http = Net::HTTP.new(url.host, url.port)
  # post returns a single response object; the body is read from res.body
  res = http.post(url.path, data,
                  { 'Content-Type' => 'text/xml;charset=utf-8' })
  case res
  when Net::HTTPSuccess, Net::HTTPRedirection
    puts "response #{res.body}"
  else
    puts "problem: #{res.code} #{res.message}"
  end
rescue URI::InvalidURIError
  puts "URI is no good"
end

The URL (the SOAP endpoint) for CH is


http://xmlgw.companieshouse.gov.uk/v1-0/xmlgw/Gateway

This method just prints out the XML response you get back (res.body) – but you could then use an XML parser like Hpricot to get the data out after that (Hpricot’s really an HTML parser so isn’t great at namespaced elements, but it can do them – and CH XML doesn’t have namespaces anyway).
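As a rough sketch of that parsing step, here is one way to pull values out of a response using REXML, which ships with Ruby, rather than Hpricot. The sample XML and its element names (Company, CompanyName, CompanyNumber) are made up for illustration and are not taken from the CH schema, so swap in the names you see in a real response.

```ruby
require 'rexml/document'

# Made-up sample response; the element names are assumptions,
# not the real CH schema.
SAMPLE_XML = <<~XML
  <Response>
    <Company>
      <CompanyName>EXAMPLE LIMITED</CompanyName>
      <CompanyNumber>00000000</CompanyNumber>
    </Company>
  </Response>
XML

# Return an array of hashes, one per <Company> element found.
def extract_companies(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//Company').map do |company|
    {
      # elements[...] returns nil when the child is missing, hence &.
      name: company.elements['CompanyName']&.text,
      number: company.elements['CompanyNumber']&.text
    }
  end
end

extract_companies(SAMPLE_XML).each do |c|
  puts "#{c[:number]}: #{c[:name]}"
end
```

In the post method above you would pass res.body to extract_companies instead of printing it.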

The other part you need to do is authentication, again very simple. CH uses the name and password they give you, plus a random transaction identifier you provide:


require 'digest/md5'


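user = "XMLGatewayTestUser" # you need to request these from CH; or ask me nicely
pass = "XMLGatewayTestPass"
transactionId = rand(7) # rand(7) returns an integer from 0 to 6
# MD5 hex digest of user, password and transaction id, concatenated in that order
digest = Digest::MD5.hexdigest("#{user}#{pass}#{transactionId}")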

Then you just slot the digest into the XML that you need to send. There are lots of examples of the XML, and the general documentation is in this Data Usage guide PDF. The FAQ is also useful.
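To make the “slot the digest in” step concrete, here is a minimal sketch that fills placeholders in a request template. The template below is a hypothetical stand-in, not the real CH envelope: its element names are invented, and I am assuming the user name and transaction id travel alongside the digest. Copy the real XML from the CH examples and put the %{...} placeholders where the values belong.

```ruby
require 'digest/md5'

# Hypothetical template; replace with real XML from the CH examples.
TEMPLATE = <<~XML
  <Request>
    <TransactionID>%{transaction_id}</TransactionID>
    <SenderID>%{user}</SenderID>
    <Digest>%{digest}</Digest>
  </Request>
XML

# MD5 hex digest of user, password and transaction id, in that order.
def auth_digest(user, pass, transaction_id)
  Digest::MD5.hexdigest("#{user}#{pass}#{transaction_id}")
end

# Fill the named placeholders in the template. The password itself is
# never inserted; only the digest derived from it is.
def build_request(template, user, pass, transaction_id)
  format(template,
         user: user,
         transaction_id: transaction_id,
         digest: auth_digest(user, pass, transaction_id))
end

puts build_request(TEMPLATE, "XMLGatewayTestUser", "XMLGatewayTestPass", 3)
```

The resulting string is what you would hand to the post method as data. Note that format treats % specially, so a literal percent sign in a real template would need to be written as %%.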

You can see the code I made here. It’s not nice, but it does get the results back. Here are some samples: search, directors, details.

And that’s it!