CharBotGreen for Identica

CharBotGreen is still suspended on Twitter but fortunately she’s still announcing away on Identi.ca.

It’s trivial to move a bot from one to the other. In the source for CharBotGreen there’s a line

u = "http://twitter.com/statuses/update.json"

Using the Twitter-compatible Identica API I can just replace that line with:

u = "http://identi.ca/api/statuses/update.json"

The only thing to watch for is that Identica stores names as lowercase, and the authorisation fails if you don’t send the name in lowercase.

Doesn’t work in Identi.ca:

req.basic_auth 'CharBotGreen', 'sekret'

Works in Identi.ca:

req.basic_auth 'charbotgreen', 'sekret'
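One way to avoid the mismatch is to lowercase the name in code before building the request. This is only a minimal sketch, not the bot's actual code: `build_update_request` is a hypothetical helper, the status text is a placeholder, and it assumes the `status` form parameter of the Twitter-compatible API. It builds the request but doesn't send it.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a status update request,
# lowercasing the username so the case of the stored name doesn't matter.
def build_update_request(endpoint, username, password, status)
  url = URI.parse(endpoint)
  req = Net::HTTP::Post.new(url.path)
  req.basic_auth(username.downcase, password)
  req.set_form_data('status' => status)   # 'status' parameter is an assumption
  req
end

req = build_update_request("http://identi.ca/api/statuses/update.json",
                           "CharBotGreen", "sekret", "placeholder status text")
puts req['authorization']
```

Sending it would then be a matter of passing `req` to `Net::HTTP#request` on a connection to the host.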

That’s it though – easy!

Web Unperson

A couple of times this week people pinged me to say their browser was reporting my site as a phisher like this. I thought little of it since we’d been hacked before on Dreamhost and WordPress, and I assumed we had got on a blacklist somewhere. I rechecked the site, couldn’t find anything, and reported it as an error.

Last night though I found that my Twitter bot, CharBotGreen, had been suspended as a phisher, and tonight I find I’ve been suspended from Twitter too. This is a bit of a blow, and the cause in both cases seems to be that I linked to my blog.

Using Google webmaster tools I discovered that several pages had links to viagra etc pages on them, invisible except in the source, with html inserted through the header php. Firefox and Safari made it difficult to find this out by inserting buggy ‘this is a phisher’ text (with broken links) over the source as well as the page itself.

I’ve now moved off Dreamhost completely – though it might have been simply that I had not updated WordPress enough. I’m on wordpress.com now, so I hope that’ll remove this riskiness.

The whole episode has made me rather depressed. Google has basically killed my online identity. I’m on various lists asking to be taken off, but there’s been no movement since last night, and I had no warning. It seems that there’s a blacklist being used in both cases, though I’m not completely sure what it is yet.

Anyway, if it happens to you, take it seriously and deal with it as soon as you can.

Update: I’m actually not on google’s suspended list any more. Hurrah! But still no Twitter. Guess it’s time to move to Identica with that and the madness of #fixreplies. Meh!

2nd Update: I got my Twitter account back this morning (2nd June, 3 days later). CharBotGreen is still suspended.

Useful links:

Google – My Site’s been hacked
Google webmaster tools
Google apps admin page: Google MX Records

iPhone working with PoGo

I’m so chuffed about this -

I bought a Polaroid PoGo inkless bluetooth mini sticker printer having been entranced by psd’s one, but knowing it didn’t work with the iPhone and that I’d have to get my laptop out to print anything. The PoGo is a lovely toy but I was getting a bit irritated by this limitation. The problem was twofold:

  • iPhone bluetooth is crippled – you can only use bluetooth headphones, and not use it for file transfer. Annoying.
  • iPhone stores pictures as (peculiar) pngs and PoGo only accepts jpegs (which I found by trial and error – I can’t find any PoGo docs on that at all).

The first issue was easily solved – I have a jailbroken 1st gen phone and I just installed iBluetooth with Cydia, which is an app installer based on .deb packages.

The second was more tricky. I looked at ImageMagick for iphone (it’s on Cydia) but didn’t get anywhere. I think I needed to install gcc which was a step too far. Instead I put ssh on it (pretty cool in itself), found some hints on the web, and found that iPhone actually creates jpgs as well as pngs (in /private/var/mobile/media/DCIM/100APPLE – the pngs are in /private/var/mobile/media/DCIM/999APPLE). Weird! Anyway, iBluetooth allows you to browse the filesystem, and send files you find there, and that worked.

So all you really need is iBluetooth as it turns out. Hope this is useful to someone.

Companies House XML and Rewired State

I was at Rewired State last weekend, so a week or so ahead I got around to applying for an XML Gateway account in order to get some interesting data out of Companies House. This post was supposed to be about a few technical aspects of using the gateway, but first I hope you’ll forgive a shortish rant about the difficulties of getting data from Companies House and the highly annoying economy around that data. If you like, skip to the technical bit.

Companies House Data

First a little background. Companies House contains all details about all the companies in the UK, including names, company number (their primary identifier), status (if suspended, function, in liquidation etc), the official filings of the companies such as annual reports, and information about company directors and other appointments, including, usually, the home addresses of the directors (except for some exceptions for security concerns, MPs and the like). You can get some of this information for free, and some you have to pay a bit for, either as XML or RTF.

Companies House has a SOAP gateway, called the ‘XML gateway’. It’s a pretty simple SOAP interface with good documentation (pdf). The costs are the same as for the RTF format – a pound apiece for the more interesting bits, free for the basic information (still interesting) – but you pay 6 quid a month for access (prices), which seems pretty reasonable. It does however take a few days to get the account, as it’s a credit account designed for businesses who want to resell the information, so you need to get a temporary account, do a test, then apply on paper to create a direct debit; they aim for 5 working days maximum from when they get the forms. I sent mine last Friday, it got here today, Thursday, so by my calculation they made it with a day to spare.

I had misunderstood what the timings meant (it’s not 5 days from first contact), so it became clear by last Friday that my XML account wouldn’t be ready in time. I still wanted to show what could be done if we had that information.

Now, in theory at least, anyone can buy this information from Companies House directly, on the day needed and available immediately, using their WebCheck service (which bizarrely claims only to be open 7am to midnight). Reports and lists of directors cost a pound apiece and basic information about a company is free on the site (name, company number, main contact person and address, and status). In practice it’s a laborious task to actually get the information about the company, partly because the site’s fairly unusable, partly because sometimes it’s hard to know which specific company you are interested in because there are so many with similar names. Companies Open House is trying to remedy some of these issues.

My interest was in getting a few lists of directors in order to demonstrate foafcorp UK, a kind of data-focused They Rule, at Rewired State. I bought a few (which went fine, it uses a credit card and worldpay) and then tried to download them. You search for the company in the web interface, which is hard to use because it’s very stateful and you can’t link to a particular company; you create a login; you buy the report (‘Appointments report’); you get an email about it straight away; and the report gets put in your ‘download area’. Clicking on this makes a window pop up instructing you to right-click and download. This is because it’s an ftp file! Anyway, the problem is that you can’t download it:

curl -O ftp://wck2.companieshouse.gov.uk/image/5b/29/c7/1b/d1/75/e7/c3/42/77/da/1f/b0/bd/c0/60/repA_01631639_506-143015-03619061_12.rtf


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:03:13 --:--:--     0
curl: (56) FTP response reading failed

I started to think that maybe the FTP ports were blocked – not impossible since the Rewired State team had had to ask for specific ports to be opened, and some of the guys at lunch explained to me that ftp was quite complex in terms of ports, choosing random ones – so I tried from a remote machine, but still nothing.

I did, finally, manage to get 4 of the 17 I’d paid for down, just by repeatedly trying with curl. An answer to my email enquiry (phones are only open Monday – Friday) came on Monday. It said you needed to wait for an hour before downloading as the documents could appear to be there when they weren’t. I’m not sure that this was the problem because I’m still having the issue (the reports are available for 10 days). But it seems fairly clear that few people are using this technique to get this information.

Instead they use the various resellers of that information. Try doing a search for “uk company directors” and see what you find for sponsored links. You’re tempted in by free searches (the same information available for free on the Companies House site) and then you can buy what looks like the same information as costs a quid on the CH site for between 6 – 10 pounds. I’m wondering if these companies are simply using the XML gateway to create these reports? hm.

Ruby and SOAP and the XML Gateway

Anyway, down to the technical stuff. You can apply right away for a temporary username and password to access the XML gateway. The CH people are very efficient and you get this pretty much right away by email. This is the same as the real system except you only get information about one company back, so you can test the SOAP interface, and you can show that you could actually use it.

I looked at some SOAP libraries for Ruby but then just decided to try with net/http, uri and post. Probably not ideal, but time was short, the libraries were undocumented (even by the standards of Ruby) and this was a very quick way of seeing if their system worked and what information it would return.

The basic idea was posting some data to a url:


require 'net/http'
require 'uri'

# Post an XML document to the given URL and print the response body.
def post(u, data)
  puts "Checking url #{u}"
  url = URI.parse(u)
  http = Net::HTTP.new(url.host, url.port)
  # On current Ruby, Net::HTTP#post returns only the response object;
  # the body is read from res.body.
  res = http.post(url.path, data,
                  'Content-Type' => 'text/xml;charset=utf-8')
  case res
  when Net::HTTPSuccess, Net::HTTPRedirection
    puts "response #{res.body}"
  else
    puts "problem"
  end
rescue URI::InvalidURIError
  puts "URI is no good"
end

The URL (the SOAP endpoint) for CH is

http://xmlgw.companieshouse.gov.uk/v1-0/xmlgw/Gateway

This method just prints out the XML response you get back (res.body) – but you could then use an XML parser like Hpricot to get the data out after that (Hpricot’s really an HTML parser so isn’t great at namespaced elements, but it can do them – and CH XML doesn’t have namespaces anyway).
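As a rough illustration of that parsing step, here is a sketch that uses REXML rather than Hpricot (a swapped-in parser, chosen only because it ships with Ruby). The sample XML and its element names (`Response`, `Company`, `Name`, `Number`) are invented placeholders, not the real CH response format, and `companies` is a hypothetical helper.

```ruby
require 'rexml/document'

# Invented sample response: these element names are placeholders,
# NOT the real Companies House schema.
sample = <<~XML
  <Response>
    <Company>
      <Name>EXAMPLE LIMITED</Name>
      <Number>00000000</Number>
    </Company>
    <Company>
      <Name>EXAMPLE HOLDINGS LIMITED</Name>
      <Number>00000001</Number>
    </Company>
  </Response>
XML

# Hypothetical helper: returns [name, number] pairs from a response body.
def companies(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//Company').map do |c|
    [c.elements['Name'].text, c.elements['Number'].text]
  end
end

companies(sample).each { |name, number| puts "#{number} #{name}" }
```

With a real response you would swap in the element names from the CH documentation.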

The other part you need to do is authentication, again very simple. CH uses the name and password they give you, plus a random transaction identifier you provide:


require 'digest/md5'


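user = "XMLGatewayTestUser" # you need to request these from CH; or ask me nicely
pass = "XMLGatewayTestPass"
# rand(7) returns an integer from 0 to 6; pass a larger argument for a wider range of ids
transactionId = rand(7)
# the digest is the MD5 hex string of user, password and transaction id joined together
digest = Digest::MD5.hexdigest("#{user}#{pass}#{transactionId}")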

Then you just slot the digest into the XML that you need to send. There are lots of examples of the XML and the general documentation is in this Data Usage guide PDF. The FAQ is also useful.
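To make the "slotting in" step concrete, here is a small sketch. The digest calculation follows the snippet above, but the XML template is only a stand-in: `Request`, `User`, `TransactionId` and `Digest` are invented element names, and `auth_digest` and `request_xml` are hypothetical helpers. The real request format is the one in the CH documentation.

```ruby
require 'digest/md5'

# MD5 hex digest of user + password + transaction id, as in the snippet above.
def auth_digest(user, pass, transaction_id)
  Digest::MD5.hexdigest("#{user}#{pass}#{transaction_id}")
end

# Placeholder template: element names are invented for illustration only.
def request_xml(user, pass, transaction_id)
  <<~XML
    <Request>
      <User>#{user}</User>
      <TransactionId>#{transaction_id}</TransactionId>
      <Digest>#{auth_digest(user, pass, transaction_id)}</Digest>
    </Request>
  XML
end

puts request_xml("XMLGatewayTestUser", "XMLGatewayTestPass", rand(1_000_000))
```

The same transaction id has to go into both the digest and the request, which is why it is passed in once and reused.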

You can see the code I made here. It’s not nice, but does get the results back. Here are some samples: search, directors, details.

And that’s it!

CharBotGreen – a Twitter Radio 4 announcement bot

Update – it’s now charbotgreen2, as twitter never unsuspended charbotgreen.

I wanted to try out the Twitter API and since I was finding myself repeatedly going through the tedium of flipping browser tabs to see what was on Radio 4, I figured I’d make a bot that tweeted what was on Radio 4 instead. This had the added advantage that I could use some half-written code I’d started for a more complex event bot that was turning out to be too hard. I neglected to do a twitter search, however, which would have shown me that there were at least two similar services already working. Ah well. Here’s CharBotGreen

Thanks to: Damian for the name and technology suggestions, @psd for the picture, and Charlotte Green for being a great Radio 4 announcer (as are they all!)

Be warned – do not use my Ruby code as an example of good practice, as it most certainly is not.

What it does

Once a day – pulls down the Radio 4 programmes JSON (details – what an excellent service that is – beeb++) – and stores it in an H2 database like this, having wiped the database overnight (sometime between 1am and 5.20am, when it’s on the World Service and no detailed schedule is available anyway):


CREATE TABLE if not exists beeb(DT TIMESTAMP, PID VARCHAR(8), D DATE, T TIME, NAME VARCHAR(255));

So basically I start the Radio 4 day with an SQL representation of today’s schedule page. I started with PID as UNIQUE but then realised that the same PID could be broadcast twice a day.

Every 5 minutes – checks in the database for anything starting in the next 5 minutes and sends a tweet, either ‘starting now’ or ‘starting in a few minutes’ depending on the exactness of the match

SELECT * FROM beeb WHERE D = '#{d}' AND T >= '#{t}' AND T < '#{t1}';

where t is the current time and t1 is the time in 5 minutes (d is today’s date).
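The window calculation can be sketched like this. It is not the bot's actual script: `window_query` and `announcement` are hypothetical helpers, and treating a match on the same minute as "starting now" is an assumption about what "exactness of the match" means.

```ruby
# Build the query for programmes starting in the next 5 minutes.
# `now` is passed in so the window can be checked with a fixed time.
def window_query(now)
  later = now + 5 * 60               # Time + seconds
  d  = now.strftime('%Y-%m-%d')      # today's date
  t  = now.strftime('%H:%M:%S')      # current time
  t1 = later.strftime('%H:%M:%S')    # time in 5 minutes
  "SELECT * FROM beeb WHERE D = '#{d}' AND T >= '#{t}' AND T < '#{t1}';"
end

# Assumed rule: same minute counts as 'starting now'.
def announcement(start_time, now)
  if start_time.strftime('%H:%M') == now.strftime('%H:%M')
    'starting now'
  else
    'starting in a few minutes'
  end
end

puts window_query(Time.now)
```

Note that a window crossing midnight (say 23:57 to 00:02) gives a t1 smaller than t, so the query as written would match nothing for that run.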

Technology

I use Ruby and H2 over JDBC. You can see the every 5 minutes and daily scripts and the readme.txt. Why these technologies? Well, I wanted to learn Ruby, and using JRuby means that you can use many Ruby libraries but you can also access Java classes, which is handy for using the H2 database. Why H2? Well, it’s a self-contained, in-memory, SQL-compatible database written in pure Java, so I could keep everything in one directory. For something this lightweight there’s almost no point in using SQL but I wanted it for something a little more complex as well so it made sense (and makes it nice and easy). I use Json pure for the JSON parsing (it has to be pure to use it with JRuby). If you want to use Ruby rather than JRuby the SQL bit will take some fiddling with; the rest should be ok as is.

Hashtags

I jumped into a little chat on twitter about what hashtags to use and settled on #pid: and then the PID (such as b00h4r7x). I’m still not sure about this; I put the URL in as well.
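For illustration, a tweet in that style could be put together like this. It is a sketch under assumptions: `tweet_text` is a hypothetical helper, the programme title is a placeholder, and the layout, the 140-character trimming and the use of a bbc.co.uk/programmes URL for the PID are guesses rather than what the bot necessarily does.

```ruby
# Hypothetical formatter: lead text, hashtag and URL layout are assumptions.
def tweet_text(name, pid, lead = 'Starting now')
  url    = "http://www.bbc.co.uk/programmes/#{pid}"
  suffix = " #pid:#{pid} #{url}"
  room   = 140 - lead.length - 2 - suffix.length   # 2 for the ": " separator
  # Trim over-long titles so the whole tweet stays within 140 characters.
  title  = name.length > room ? name[0, room - 3] + '...' : name
  "#{lead}: #{title}#{suffix}"
end

puts tweet_text('Example Programme', 'b00h4r7x')
```

Keeping the hashtag and URL in a fixed suffix means only the title is ever shortened.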

It’s all super-simple

But good fun to do. Psd suggested that some Charlotte Green-style amusing incidents would be fun to put in there, though I’ve not worked out how to do that. Another improvement would be if it gave you a little more notice about what’s coming up as @bbcradio4live does.