Some FOAF stats

Some FOAF stats from Sindice for something I had to write last week.

All classes

“Agent”, 3.84 million
“Document”, 6.15 million
“Group”, 5.78 thousand
“Image”, 711.23 thousand
“OnlineAccount”, 15.47 thousand
“OnlineChatAccount”, found 324
“OnlineEcommerceAccount”, found 242
“OnlineGamingAccount”, found 240
“Organization”, 10.05 thousand
“Person”, 2.64 million
“PersonalProfileDocument”, 11.7 thousand
“Project”, found 726

All properties

“accountName”, 8.02 thousand
“accountServiceHomepage”, 7.24 thousand
“aimChatID”, 9.54 thousand
“based_near”, 7.35 thousand
“birthday”, 2.48 thousand
“currentProject”, found 648
“depiction”, 696.31 thousand
“depicts”, 617.16 thousand
“dnaChecksum”, found 65
“family_name”, 2.46 thousand
“firstName”, 4.2 thousand
“fundedBy”, found 237
“geekcode”, found 107
“gender”, 15.8 thousand
“givenname”, 24.17 thousand
“holdsAccount”, 9.88 thousand
“homepage”, 1.22 million
“icqChatID”, 22.8 thousand
“img”, 684.38 thousand
“interest”, 64.77 thousand
“isPrimaryTopicOf”, 1.54 million
“jabberID”, 2.98 thousand
“knows”, 1.08 million
“logo”, found 374
“made”, 1.97 million
“maker”, 1.97 million
“mbox”, 3.7 thousand
“mbox_sha1sum”, 43.9 thousand
“member”, 5.53 thousand
“membershipClass”, found 58
“msnChatID”, 7.68 thousand
“myersBriggs”, found 154
“name”, 1.77 million
“nick”, 96.7 thousand
“openid”, 80.24 thousand
“page”, 5.84 million
“pastProject”, found 179
“phone”, found 999
“plan”, found 139
“primaryTopic”, 278.11 thousand
“publications”, found 202
“schoolHomepage”, found 644
“sha1”, found 60
“surname”, 25.32 thousand
“theme”, found 282
“thumbnail”, 2.51 thousand
“tipjar”, found 73
“title”, 2.02 thousand
“topic”, 3.13 million
“topic_interest”, found 90
“weblog”, 300.06 thousand
“workInfoHomepage”, found 505
“workplaceHomepage”, 1.68 thousand
“yahooChatID”, 6.72 thousand

Displaying Guardian book reviews for quick buying on Amazon

I read the Saturday Guardian every week, and quite often buy a bunch of books reviewed in it. But equally, I don’t buy quite a lot of them as they’re only available in expensive and bulky hardback (plus I resent being market segmented like that, sorry). The Guardian’s reviews are very good but they only really review hardbacks in any depth or breadth, so it’s hit and miss whether I actually get to read any of them by the time they get to paperback. I just forget. I bet a lot of people do this.

Anyway, a couple of months ago I realised there was a Guardian content API as well as a data API. I applied for a developer key and, to my surprise, got one (the docs said they were giving out very few). This weekend I finally got around to having a play with it. It’s pretty neat. I’ve not explored it very thoroughly – I’m sure people can think of much more profound applications to make – but for book reviews there is lots of interesting data, and it’s available in JSON and XML.

My initial plan was to programmatically create an Amazon list – but this isn’t possible using the Amazon ECS API. However, it is possible to search (on books, by title and author) and get XML back, including a link to the Amazon page for each result. I made a very simple page that requests book reviews for the appropriate date and then, for each result returned, identifies the author and title and does an Amazon lookup to get the URL (I just pick the first one returned – I’m feeling lucky). It’s not as convenient as I’d hoped, but it does make it that tiny bit easier to

  • Buy things from the list straight away
  • Put things that are only available in hardback into my wishlist so I don’t forget about them
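
The request side of this can be sketched roughly like so – note that the tag filter and parameter names here are illustrative guesses rather than lifted from my actual script, and you’d substitute your own key:

```ruby
require 'uri'

# Rough sketch of building the Guardian review search request.
# Endpoint, tag filter and parameter names are illustrative – check
# them against the docs that come with your developer key.
GUARDIAN_SEARCH = 'http://content.guardianapis.com/search'

def guardian_review_url(date, api_key)
  params = { 'tag'       => 'books/books+tone/reviews', # illustrative filter
             'from-date' => date,
             'to-date'   => date,
             'api-key'   => api_key }
  GUARDIAN_SEARCH + '?' + URI.encode_www_form(params)
end
```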

There are a couple of issues:

  • The title and author aren’t available as separate fields in the Guardian API. Usually the linktext is very formulaic and the information can be parsed out of that, but sometimes there are non-standard items and these fail
  • Characters with accents are returned as HTML entities so those need to be swapped back to characters in order to do the Amazon search
  • There’s no data about whether the book is in paperback or not, annoyingly. Amazon seems to mostly return the paperback version first if available, but maybe this is just good luck, and it probably needs more thought
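
For what it’s worth, the linktext parsing and entity swapping can be sketched like this – the ‘Title by Author’ shape and the entity table are assumptions, and the table is deliberately partial:

```ruby
require 'cgi'

# CGI.unescapeHTML covers &amp;, &lt;, &gt;, &quot; and numeric entities,
# but not named accented ones, so keep a (deliberately partial) table.
EXTRA_ENTITIES = { '&eacute;' => 'é', '&egrave;' => 'è',
                   '&ouml;'   => 'ö', '&uuml;'   => 'ü' }

def decode_entities(s)
  EXTRA_ENTITIES.inject(CGI.unescapeHTML(s)) { |t, (ent, ch)| t.gsub(ent, ch) }
end

# Assumes the formulaic 'Title by Author' linktext shape;
# returns nil for the non-standard items that don't fit.
def split_linktext(linktext)
  m = decode_entities(linktext).match(/\A(.+)\s+by\s+(.+)\z/)
  m && { :title => m[1], :author => m[2] }
end
```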

The result isn’t too bad though, and maybe I’ll buy a few more books. The Ruby code is here – you’ll need your own API keys for the Guardian and for Amazon though (they are both free, and you can just get an Amazon one if you have an account with them).

Generating specs from RDFS / OWL docs

I’ve been hacking away at danbri’s version of specgen so we can rev the foaf spec. The idea is that you take an RDFS / OWL schema and generate some human-readable HTML from it, by taking the classes and properties and writing out their basic constituents. Optionally you can add some introductory text in a template, plus some individual bits of text for each property and class, eventually in different languages too.
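
The real specgen is Python over RDFlib, but just to make the idea concrete, here’s a toy Ruby equivalent that pulls term names out of plain RDF/XML (handling only the two most basic declaration types):

```ruby
require 'rexml/document'

# Toy version of the specgen idea: list the classes and properties
# declared in an RDFS document. A fuller version would also handle
# owl:Class, owl:ObjectProperty etc., and pull in labels and comments.
def spec_terms(rdfs_xml)
  doc = REXML::Document.new(rdfs_xml)
  terms = { :classes => [], :properties => [] }
  REXML::XPath.each(doc, '//rdfs:Class') do |e|
    terms[:classes] << e.attributes['rdf:about']
  end
  REXML::XPath.each(doc, '//rdf:Property') do |e|
    terms[:properties] << e.attributes['rdf:about']
  end
  terms
end
```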

I slapped in some RDFa yesterday because we needed a replacement for the ugly addition of RDF directly into the HTML, which makes it invalid. I realise some people may think this is back to front, but the foaf spec’s ‘original’ format has always been RDFS/OWL, so it makes sense for us. I’m not actually sure we need two RDF versions (as there is an alternate link pointing to the RDFS/OWL version from the HTML) but heck, why not, and danbri’s consulting the community, so there’s probably an argument I’ve missed.

There are several specgens available and at some point it might be nice to rationalise, or maybe go for functional equivalence. These are probably better in some senses than the one I’ve been working on, especially as I’m new to Python.

The ones I’ve found:

I think the two things that unite the first three are that they are (a) self-described hacks and (b) in Python. The FOAF one uses RDFlib rather than Redland because danbri was having trouble installing Redland on the Mac, I believe.

The next things I’d like to look at are:

  • Generating specs from sample data (maybe someone’s done this already? It wouldn’t be complete but could be a start)
  • Defining application profiles or Argots and using them to generate, say, useful Sparql queries
  • Pictures!

CharBotGreen for Identica

CharBotGreen is still suspended on Twitter but fortunately she’s still announcing away on Identi.ca.

It’s trivial to move a bot from one to the other. In the source for CharBotGreen there’s a line

u = "http://twitter.com/statuses/update.json"

Using the Twitter-compatible Identica API, I can just replace that line with:

u = "http://identi.ca/api/statuses/update.json"

The only thing to watch for is that Identica stores names as lowercase and the authorisation fails if you don’t send it in lowercase.

Doesn’t work in Identi.ca:

req.basic_auth 'CharBotGreen', 'sekret'

Works in Identi.ca:

req.basic_auth 'charbotgreen', 'sekret'

That’s it though – easy!
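
Put together, the update request looks roughly like this – a sketch rather than the bot’s actual plumbing, with obviously made-up credentials:

```ruby
require 'net/http'
require 'uri'

# Sketch of how the bot's status update request is put together.
def build_update_request(endpoint, user, password, status)
  uri = URI.parse(endpoint)
  req = Net::HTTP::Post.new(uri.path)
  req.basic_auth(user.downcase, password) # Identi.ca needs lowercase
  req.set_form_data('status' => status)
  req
end

# Sending it:
# uri = URI.parse('http://identi.ca/api/statuses/update.json')
# req = build_update_request(uri.to_s, 'CharBotGreen', 'sekret', 'chirp')
# Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```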

Web Unperson

A couple of times this week people pinged me to say their browser was reporting my site as a phisher, like this. I thought little of it since we’d been hacked before on Dreamhost and WordPress and assumed we had got on a blacklist somewhere. I rechecked the site, couldn’t find anything, and reported it as an error.

Last night though I found that my twitter bot, CharBotGreen, had been suspended as a phisher, and tonight I find I’ve been suspended from Twitter too. This is a bit of a blow, and the cause in both cases seems to be that I linked to my blog.

Using Google webmaster tools I discovered that several pages had links to viagra etc. pages on them, invisible except in the source, with HTML inserted through the header PHP. Firefox and Safari made it difficult to find this out by inserting buggy ‘this is a phisher’ text (with broken links) over the source as well as the page itself.

I’ve now moved off Dreamhost completely – though it might simply have been that I hadn’t updated WordPress often enough. I’m on wordpress.com now, so I hope that’ll reduce the risk.

The whole episode has made me rather depressed. Google has basically killed my online identity. I’m on various lists asking to be taken off, but there’s been no movement since last night, and I had no warning. It seems that there’s a blacklist being used in both cases; I’m not completely sure what it is yet.

Anyway, if it happens to you, take it seriously and deal with it as soon as you can.

Update: I’m actually not on Google’s suspended list any more. Hurrah! But still no Twitter. Guess it’s time to move to Identica, what with that and the madness of #fixreplies. Meh!

2nd Update: I got my Twitter account back this morning (2nd June, 3 days later). CharBotGreen is still suspended.

Useful links:

Google – My Site’s been hacked
Google webmaster tools
Google apps admin page: Google MX Records