Tuesday, September 14, 2010

Creating spatial data files

At my school, there is a class on Geographic Information Systems (GIS). When you complete the class, you now know how to analyze data using ArcGIS. But you do not know how to create the data files you used. And you do not know how to do anything that ESRI did not include in ArcGIS. As a researcher, I find that unacceptable. This year, there is a senior project team including someone who has taken this class doing a project with the local chapter of the Red Cross that will involve using spatial data. But since they cannot create a spatial file, I will create the data sets that they will use to populate their models. If I am going to do something, I may as well learn something along the way. Those somethings are Spatialite and Google Maps API for geocoding.

Step 1. Given a list of addresses, find their latitude and longitude. The official way to do this is to get the TIGER line files of street maps. These encode each block by name and the end points of addresses for each block along with their geospatial coordinates. Then you can interpolate. The more pragmatic way of doing this is to find someone who has already processed the TIGER line files and created an API or other interface to work with them. There are many that will then create Keyhole Markup Language (KML) files as output, and I have used one before.

The way I will do it is to use the Google Maps API. There is a Python package called googlemaps that will find lat/long given a properly formulated address. (note: long is a keyword in Python, so we use lng for longitude)

from googlemaps import GoogleMaps
def getlatlong(gmaps, streetaddress, city, state, zipcode):
address = streetaddress + " "+ city + ", "+ state +" "+ zipcode
lat, lng = gmaps.address_to_latlng(address)
return lat, lng
# get api_key from http://code.google.com/apis/maps/signup.html
api_key = "APIKEY"
gmaps = GoogleMaps(api_key)
. . .
lat, lng = getlatlong(gmaps, streetaddress, city, state, zipcode)

Step 2. Create an Spatialite database. Spatialite is the SQLite database with extensions for Simple Features for SQL, i.e. a relational database that allows for queries on geometric objects. The other options are Oracle Spatial, DB2 Spatial, PostGIS and MS Server with spatial extensions. There are also spatial extensions to MySQL, but they do not meet the SFSQL standard, and they are simplistically implemented (which would lead to many errors if you actually needed correct answers instead of approximations.) Of these, Spatialite does not require an administrator and contains its data in a single file, both of which simplify their use. To use Spatialite, the PROJ and GEOS libraries should already be on your machine. (see Spatialite installation instructions)

To create a database, use spatialite the same way as you would have used sqlite. But after creating the database, load init_spatialite.sql This SQL file loads some required tables. Sometimes, Spatialite will create the tables on its own, but sometimes not (especially if the database was originally an SQLite database).

Next, load the data (using any of a variety of different methods for SQLite databases.) Given address data, determine lat/long and enter each of them into their own fields. (you probably have to create these fields after importing the address data.

Step 3. Create the geometric representation of the data. What makes data spatial is the connection of a geometric representation of the data and the information about the spatial object. Before you do this, it would help to look at other datasets you are using. Spatial data is usually distributed as ESRI Shapefiles. Each Shapefile is actually a collection of files that include the polygons of the geographic objects, the data associated with those polygons, and a projection that describes how to make a representation of a surface of a sphere into the two dimensions of the computer screen (or piece of paper). The projection of a shapefile is usually found in a *.prj file in the same directory as the *.shp file. Most GIS viewers allow you to look at the shapefile properties from within the GIS viewer. The value needed is an SRID, usually a 4-digit number (sometimes 5). From here, a set of spatial queries are run from within Spatialite to add the geometry to the database. See prior post on loading lat/long data in Spatialite. When these spatial functions/queries are run, Spatialite will perform a series of operations that use the information in tables loaded by the init_spatialite.sql to generate the geometry representation or convert representations between projections. At this point, the data is now spatially enabled.

Step 4. View/Export. At this point, only a few GIS viewers can use Spatialite databased directly. One is QGIS. So after performing any wanted spatial queries or other operations you should export the data to another format, such as Shapefiles. The OGR/GDAL tools of the Geospatial Data Abstraction Library are able to do this. Within Spatialite you can use the .dumpshp function to create a shapefile.

The format of .dumpshp is
.dumpshp table Geometry sitelocations CP1252 POINT
table is the TABLE to be exported. I think this could have been a VIEW as well. Geometry is the name of the field within the table that has the geometric representation. Note that Spatialite stores this as a BLOB object. sitelocations is the name of the shapefile to be created. CP1252 is the name of a character set. Presumably there are others, but I have been using CP1252 for now. POINT is the spatial data type. Other types include LINESTRING, POLYGON, and MULTIPOINT.

What would be very useful right now is if I could automate steps 3 and 4 so I could run them from within Python instead of doing them inside Spatialite. However, this only works if SQLite was compiled with the ability to load extensions. Most distributers do not do so. One workaround is to install the pysqlite2 extension and compile it so that it can load extensions (as opposed to using the sqlite3 extension that is included in Python).

No comments: