Monday, October 13, 2008

Statements that come back to haunt you

Just read this article in Forbes, which says:
"I laugh when I hear that you can make a map by community input alone," says Tele Atlas founder De Taeye. He says that if tens of thousands of users travel a road without complaining, then Tele Atlas can be fairly certain that its map of the road is correct.
The first statement is demonstrably false already (see my earlier post about Oxford University and OpenStreetMap for just one example). But then it's a little bizarre that he follows that up by saying that the reason they know that their data is correct is through (lack of) feedback from the community. I've done my share of interviews and it's quite possible that these two statements were taken out of context, but it's an odd juxtaposition. I'm sure the owners of Encyclopedia Britannica laughed at the notion that you could produce a comparable online encyclopedia by community input alone, but they aren't laughing any more and are even moving to accommodate community input, as are most of the main mapping data providers of course (including Tele Atlas).

Community input isn't the answer to all data creation and maintenance problems, but it provides an excellent solution in a number of areas already, and the extent of its applicability will increase rapidly.

Sunday, October 12, 2008

New whereyougonnabe release, including support for Fire Eagle and GeoRSS

This week we added quite a number of new capabilities to whereyougonnabe, including support for Fire Eagle and GeoRSS, substantially expanded help documentation, improved synchronization capabilities, support for SMS, and exclusive use of Google Local Search in the whereyougonnabe browser application, as discussed previously (having introduced this approach for external sync in the previous release). For more detail on the new functionality from a user perspective see the whereyougonnabe blog - I'll touch on aspects of this here, but also discuss a little more of the technical background.

As most readers of this blog probably know, Fire Eagle is a service from Yahoo that is essentially an independent location broker supported by a number of applications which use location. You can sign up for free, and authorize applications to update and/or read your location. In the case of whereyougonnabe, we update your location automatically based on the plans you have told us about in advance. So for example, if you are using our Tripit interface, just by forwarding a single email with your itinerary, we will update your location throughout your trip, indicating the airports and cities you are in.

I really like our new RSS feed capability, which lets you easily view data from whereyougonnabe from outside the application, for example in web home pages like My Yahoo or iGoogle. The following is an example:

whereyougonnabe RSS feed

The content of the feed uses a similar algorithm to the one we use for WYTV, with a somewhat random mix of activities from your friends which will close to you, and some which won't. While a major focus of the system is of course identifying opportunities to meet friends, another important aspect is just keeping in touch with your friends. This is similar to using Facebook status or Twitter, but we have more information about an activity, including its location and time, which lets us select "interesting" items more effectively (for example, activities further from home will probably be more interesting than those close to home, in general).

You can click through on any element in the RSS feed to see more details in whereyougonnabe:

whereyougonnabe RSS feed clickthrough to map

This ability to link directly to a specific activity in whereyougonnabe is new with this new release. We now use these links in various places where you see data outside whereyougonnabe, including Facebook status and newsfeeds, Twitter messages, and iCal feeds. People will only be able to see the details if they are authorized to do so (i.e. typically if they are your friend on Facebook). We see this as an important development in terms of integration with other systems, and in encouraging people to use whereyougonnabe more often.

As I mentioned above, the RSS feed is actually a GeoRSS feed, so you can display elements on a map in software which supports GeoRSS, such as Google Maps (just copy the feed URL into the search box):
whereyougonnabe GeoRSS feed

Lastly, we have made various enhancements to the synchronization capabilities, including a sync on demand capability which gives you feedback on what is happening with the sync, which is helpful in confirming that it is working as expected, and/or diagnosing any problems.

There are several other functional improvements in the release too - as I mentioned, you can read more about these at the whereyougonnabe blog.

Wednesday, October 8, 2008

OpenStreetMap mapping party in Denver

Just a quick note to say that there will be a mapping party in Denver the weekend of October 18-19, following on from the first one we had back in July. Richard Weait from Cloudmade will be here to coordinate things, and I'll be hosting it at my loft in downtown Denver, as I did last time. I think we'll be starting at 1pm each day, and migrating to the Wynkoop Brewing Company downstairs at around 5pm.

If you haven't been to a mapping party before (or if you have!) I encourage you to come along. Last time we had a decent turnout, about 5 or 6 people each day, and we made some good progress on mapping downtown Denver.

OpenStreetMap in downtown Denver

The screen shot above shows some details that you won't find in Google or Microsoft maps of Denver, including various footpaths and the light rail line. Click here for a live link. I hope we will finish up a pretty complete map of the downtown area at this level of detail over that weekend. And of course if you have other interests like cycle paths, etc, you are free to work on whatever you like! If you have any questions let me know. More information will be posted at Upcoming shortly.

Which city is this airport in? (An exercise in using geonames)

Today I spent some time working updating on updating the airport dataset we use with whereyougonnabe. I thought that it was an interesting process to iteratively come up with a reasonable algorithm for what I needed to do, so I thought I would explain it here. I'd be interested in any feedback both on improvements to the technical approach (or entirely different approaches), and also on what the "right answer" should be from a user perspective in some of the questionable cases I mention.

As I've mentioned previously, we use Google Local Search for geocoding in general, but in our experience this really doesn't work well for geocoding airports based on the three letter airport codes. When we import travel itineraries from Tripit, the three letter airport codes are the best way to unambiguously identify which airports you will be in, so we need a reliable way of geocoding those.

We have been using the geonames dataset to do this, a rich open source database, which includes cities, airports and other points of interest from around the world. There are about 6.6 million points in the whole dataset, of which about 23000 are airports, and of those about 3500 have three letter IATA codes. Coverage seems to be fairly complete, but we have found a few airports with IATA codes missing - Victoria, BC, is one (code YYJ), and today I found that Knoxville was missing its code (TYS). I updated Victoria in the source data online, and it shows up in the online map but for some reason still does not appear in the downloaded data, I have that on my list of things to look into.

In general, the geonames airport records usually have null data in the city field. This is something I really wanted for each airport, so I can show a short high level description like "departing from Denver" or "arriving in London". Often the city name is included in the airport name, but not always - for example, the main airport in Las Vegas is just named "McCarran International Airport", which would not be an immediately obvious location to many people. And even where the city is included, the name is often longer than I want for a short description, for example "Ronald Reagan Washington National Airport".

For our first quick pass at populating the city for each airport, we used the Yahoo geocoding service, as it has a reverse geocoding function (and we thought we would try it out). You just pass this a coordinate and it gives you a city name (and other data) back. This is the data we are using in the live version of whereyougonnabe today. However, I noticed that this doesn't always give me what I would like - for example, it tells me that London Heathrow airport is in Ashford, and Vancouver airport is in Richmond. This may be technically correct (or not, I don't know for sure), but I'd really find it more useful to see the main city nearby in my high level description - so it will say I am arriving in London or Vancouver, in these two examples.

One nice aspect of the geonames dataset is that it includes the population of (many) cities, so we can use this in determining which is the largest nearby city. So in my first pass at solving the problem, I ran a PostGIS query for each airport to look for "large" cities nearby. Somewhat arbitrarily I initially specified population thresholds of 250000, 100000 and 0, and distance thresholds of 5, 25 and 50 miles. I would first search for cities with population > 250000 within 5 miles, and if I didn't find any I would extend to 25 miles and then 50 miles. If I still hadn't found anything, I would decrease the population to the next threshold and try again. If I found more than one in a given category I chose the closest, in this initial approach (the largest is obviously another alternative, but neither will be the right answer in all cases).

This worked reasonably well, returning London and Vancouver for the examples already mentioned. But it gave some wrong answers too, including Denver Airport which was associated with Aurora (a large suburb of Denver), and for a lot of small town airports it returned a larger town nearby, which in some cases might have been reasonable but in many cases wasn't.

So I decided to extend my algorithm to include matching on names - if there was a city called 'Denver' near 'Denver International Airport' then that is the one I was after. This required a little bit of SQL trickery. If I knew the city name and wanted to search for airports this is easy, you just use a SQL clause like:

where name like '%Denver%'

But it was the other way around, the value I had was a superstring of the value in the field I was querying. I found that in PostgreSQL you can do a statement like the following:

where 'Denver International Airport' like '%' || name || '%'

(|| is the string concatenation in PostgreSQL, so if the value in the field "name" is "Denver", the expression evaluates to 'Denver International Airport' like '%Denver%', which is true). I wouldn't really want to be running that sort of clause alone against the very large geonames table, for performance reasons, but since I also had clauses based on location and population to use as primary filters, performance was fine. And in fact to eliminate issues of upper versus lower case, the clause actually looked like:

where 'denver international airport' like lower('%' || name || '%')

So this would match the city names I found to any substring of the airport name (whether the city name was at the beginning or not). If I found a matching city name within a specified distance (I am currently using 40 miles), I use that, if not then I fall back to using the first approach to look for a "large city" nearby. On my first pass through with this approach, I was able to match names for 2356 of the 3562 airports with an IATA code.

I noticed that some cases were not matching when they looked like they should, where the relevant city names had accented characters. I realized that in some cases, the airport name had been entered with non-accented characters, so it was not matching the version of the city name which did have accented characters. The geonames data includes both an accented and non-accented (ascii) version of each city name, so I extended my where clause to do the wild card matching above on either the (accented) name field or the (non-accented) asciiname field. This increased the number of matching names to 2674, which was a pretty good improvement.

I noticed that in a few places, there were two potential "city" names in the airport name - in particular London Gatwick was matching on a "city" called Gatwick, rather than London, and Amsterdam Schiphol was matching a "city" called Schiphol. Looking at maps of the local area, I'm actually not convinced there is a city (or even a village) called Gatwick, and on further examination the geoname record shows its population as zero, and the same turned out to be true of Schiphol. But regardless, I decided to handle this by checking whether there was more than one city name match with the airport name, within the search radius, and if so to choose the largest city found (rather than the closest). There are some airports which serve two real cities also, and have both in their title, so this is a reasonable approach in those cases (no offense intended to the smaller cities!). This change returned me London instead of Gatwick and Amsterdam instead of Schiphol, as I wanted.

I was now getting pretty close, but I stumbled on a couple of small local airports where I was choosing a really really small town which happened to be closest, rather than a not quite so small town nearby which was really the obvious choice. For example, for Garfield County Airport in the mountains of Colorado, I was picking the town of Antlers (population listed as 0, though I think it might be a few more than that) rather than Rifle, population 7897! This was fixed just by added one more value to my population thresholds (I used 5000).

Obviously tweaking the population and distance thresholds would change some results, but from a quick skim I think the current results are pretty reasonable. One example which I am not sure whether to regard as right is JeffCo airport, a small airport which is in Broomfield, just outside Denver. The current thresholds associate this with Denver, which is probably reasonable, but you could argue that Broomfield is large enough to be named (45000).

If you're interested in checking out the full list, to see if your favorite airport is assigned to the city you think it should be associated with, you can check out this text file (the encoding of accented characters is garbled here, though it is correct in our database - but I haven't taken the time to figure out how to handle this using basic line printing in Java). Let me know if you think anything is wrong! The format of a line in the report is as follows:

13: GRZ Graz-Thalerhof-Flughafen -- 8073 Feldkirchen to Graz (Graz), true

The IATA airport code is first (GRZ), then the airport name from geonames (Graz-Thalerof-Fluhafen), then the city name we got from Yahoo reverse geocoding (8073 Feldkirchen), then the associated city name generated by our algorithm (Graz), followed by the ascii version of the name in parentheses (the same in this case), and lastly the "true" indicates that a match was found on the city name, rather than just using proximity and population.

So anyway, I thought this was an interesting example of how you can use the geonames dataset. I would be happy to contribute these city names back to geonames if people think this would be valuable - it is on my list to look into how best to do this, but if anyone has particular suggestions on this front let me know.