I was asked a couple of questions about converting an Excel sheet into another format. Seems the target format had a city field, but the source format did not. The source only had an address field. This is not where the problems started tho, the problems started with the fact that the address field contained addresses from various sources, and that those addresses were completely unstructured. How do you parse completely unstructured data...
To give an example, the following could have been an address from the source Excel sheet:
10620 Jasper Avenue NW, Edmonton, AB T5J 2A3, Canada
Now this could have been easily parsed using a string tokenizer and getting the second token. However, in other cases an address could just as well look like (and no, I am not kidding!):
Now try and parse the city out of this one...
I ended up telling this person to use another field that was called branch and that this field contained a city name in a lot of cases, but not always (e.g. sometimes the branch was not a city name, but the name of a department). This would give us the best result possible, as trying to parse the unstructured address data would have been an impossible task.
But was it really impossible?
I kept thinking about the problem and I soon realized that maybe there was another approach. Instead of parsing the city out of the address field, this approach would scan the address field to see if it contained a city name from a table of known cities. E.g. if we had a list of cities like "Edmonton, Leduc, Nisku, Spruce Grove", we could take each city and see if we could find a match in the address field. There was only one problem: the sheet contained addresses from around the world. How do you find a complete list of all the known cities in the entire world, let alone all the known cities for one country... A colleague told me that there is such a list tho, available at some US geographical department. Googling around I found:
The National Geospatial-Intelligence Agency's GEOnet Names Server (GNS)
You can download a zip file for a country, which contains a text file with information of the cities of that country. For every record in that file, one could take the name of the city and then scan the address field for this name. If the city name was found in the address, then we known what city the address is in, we can map this city to the city field in the new format and we would all live happily ever after. Even though the amount of processing would be tremendous, it would probably yield a much more reliable result than just taking the value of the branch field. The processing can be reduced if the country is known (e.g. this Excel sheet I received contained addresses from India, so using the India city names file only would suffice).
Now I was getting all excited about how this would be a very interesting exercise and how it would be even more interesting to see the results. Unfortunately, using e.g. the Canada file, the following address in Edmonton:
2874 Calgary Trail Edmonton, Alberta, T6J 6V7 Canada
Would cause Outback Steakhouse in Edmonton to end up in Calgary...