
Friday, September 24, 2010

Installing pyspatialite

Goal: To use spatial databases to perform operations such as centroid, nearest point, and distance in order to generate a table of distances along a road network from area to area, using the node nearest to the centroid of an area to represent that area.

Backup plan: Create a list of nearest node to centroid of areas. Use some code I have previously written in R to implement Dijkstra's algorithm.

Preferred method: Programmatically access Spatialite and perform all functions within Spatialite.

Barrier: Most distributions of SQLite (including the sqlite3 library included in Python) do not enable loading of extensions.

Solution: Install Python module pyspatialite.

Installing pyspatialite is done through the usual setuptools route: download the source and run sudo python setup.py install from the source directory.

The documentation mentions that you need the geos and proj libraries installed. What it does not tell you is that you need the development versions of these libraries (the ones that include the header files), as well as the Python headers. After installing the development packages, pyspatialite builds and installs cleanly.
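Once the build succeeds, a quick smoke test from Python confirms that the spatial functions are wired in. This is a minimal sketch; spatialite_version(), geos_version(), and proj4_version() are standard Spatialite SQL functions:

from pyspatialite import dbapi2 as db

# An in-memory database is enough to check that the extension works.
conn = db.connect(":memory:")
cur = conn.cursor()
cur.execute("SELECT spatialite_version(), geos_version(), proj4_version()")
print(cur.fetchone())   # should print a tuple of three version strings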

Tuesday, September 14, 2010

Creating spatial data files

At my school, there is a class on Geographic Information Systems (GIS). When you complete the class, you know how to analyze data using ArcGIS. But you do not know how to create the data files you used, and you do not know how to do anything that ESRI did not include in ArcGIS. As a researcher, I find that unacceptable. This year, a senior project team (including someone who has taken this class) is working with the local chapter of the Red Cross on a project that will involve spatial data. Since they cannot create a spatial file, I will create the data sets that they will use to populate their models. If I am going to do something, I may as well learn something along the way. Those somethings are Spatialite and the Google Maps API for geocoding.

Step 1. Given a list of addresses, find their latitude and longitude. The official way to do this is to get the TIGER/Line files of street maps. These encode each block by name along with the address ranges at its end points and their geospatial coordinates; from there you can interpolate. The more pragmatic way is to find someone who has already processed the TIGER/Line files and created an API or other interface to work with them. There are many services that will produce Keyhole Markup Language (KML) files as output, and I have used one before.

The way I will do it is to use the Google Maps API. There is a Python package called googlemaps that will find lat/long given a properly formulated address. (Note: long is the name of a built-in type in Python, so we use lng for longitude.)

from googlemaps import GoogleMaps

def getlatlong(gmaps, streetaddress, city, state, zipcode):
    address = streetaddress + " " + city + ", " + state + " " + zipcode
    lat, lng = gmaps.address_to_latlng(address)
    return lat, lng

# get api_key from http://code.google.com/apis/maps/signup.html
api_key = "APIKEY"
gmaps = GoogleMaps(api_key)
. . .
lat, lng = getlatlong(gmaps, streetaddress, city, state, zipcode)

Step 2. Create a Spatialite database. Spatialite is the SQLite database with extensions for Simple Features for SQL, i.e. a relational database that allows for queries on geometric objects. The other options are Oracle Spatial, DB2 Spatial Extender, PostGIS, and MS SQL Server with spatial extensions. There are also spatial extensions to MySQL, but they do not meet the SFSQL standard and are simplistically implemented (which would lead to many errors if you actually needed correct answers instead of approximations). Of these, Spatialite does not require an administrator and keeps its data in a single file, both of which simplify its use. To use Spatialite, the PROJ and GEOS libraries should already be on your machine (see the Spatialite installation instructions).

To create a database, use the spatialite command-line tool the same way you would have used sqlite3. But after creating the database, load init_spatialite.sql. This SQL file creates some required metadata tables. Sometimes Spatialite will create the tables on its own, but sometimes not (especially if the database was originally a plain SQLite database).
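The same steps can be scripted through pyspatialite, which behaves like the standard sqlite3 module. A sketch; the file names here are examples:

from pyspatialite import dbapi2 as db

conn = db.connect("redcross.sqlite")   # created if it does not exist
cur = conn.cursor()
# init_spatialite.sql creates the metadata tables (spatial_ref_sys,
# geometry_columns) that the spatial functions depend on.
cur.executescript(open("init_spatialite.sql").read())
conn.commit()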

Next, load the data (using any of the variety of methods that work with SQLite databases). Given address data, determine lat/long and enter each into its own field. (You will probably have to create these fields after importing the address data; a sketch follows below.)
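Putting Steps 1 and 2 together in Python might look roughly like this. A sketch: the sites table and the sample address are hypothetical, and getlatlong()/gmaps come from the geocoding snippet above, cur/conn from the pyspatialite snippet:

# Hypothetical table holding the address data plus new lat/lng fields.
cur.execute("CREATE TABLE IF NOT EXISTS sites "
            "(address TEXT, lat REAL, lng REAL)")
addresses = [("5000 Forbes Ave", "Pittsburgh", "PA", "15213")]  # example
for street, city, state, zipcode in addresses:
    lat, lng = getlatlong(gmaps, street, city, state, zipcode)
    cur.execute("INSERT INTO sites (address, lat, lng) VALUES (?, ?, ?)",
                (street + " " + city + ", " + state + " " + zipcode,
                 lat, lng))
conn.commit()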

Step 3. Create the geometric representation of the data. What makes data spatial is the connection between a geometric representation and the information about the spatial object. Before you do this, it helps to look at the other datasets you are using. Spatial data is usually distributed as ESRI Shapefiles. Each Shapefile is actually a collection of files that include the polygons of the geographic objects, the data associated with those polygons, and a projection that describes how to flatten the surface of a sphere onto the two dimensions of a computer screen (or a piece of paper). The projection of a shapefile is usually found in a *.prj file in the same directory as the *.shp file, and most GIS viewers let you inspect it from within the viewer. The value needed is an SRID, usually a 4-digit number (sometimes 5).

From here, a set of spatial queries is run from within Spatialite to add the geometry to the database. See my prior post on loading lat/long data in Spatialite. When these spatial functions run, Spatialite uses the information in the tables loaded by init_spatialite.sql to generate the geometry representation or convert between projections. At this point, the data is spatially enabled.
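Concretely, continuing the Python sketch from above, the Step 3 queries use the standard Spatialite functions AddGeometryColumn, GeomFromText, and Transform. The SRIDs are examples (4269 is NAD83 lat/long; 2272 is NAD83 / Pennsylvania South in US feet), and sites is the hypothetical table from the earlier sketch:

# Add a 2-D POINT geometry column in the target projection (2272),
# then populate it by building lat/long points (4269) and reprojecting.
cur.execute("SELECT AddGeometryColumn('sites', 'geom', 2272, 'POINT', 2)")
cur.execute("UPDATE sites SET geom = Transform("
            "GeomFromText('POINT(' || lng || ' ' || lat || ')', 4269), "
            "2272)")
conn.commit()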

Step 4. View/Export. At this point, only a few GIS viewers can use Spatialite databases directly; one is QGIS. So after performing any spatial queries or other operations you want, you should export the data to another format, such as Shapefiles. The OGR/GDAL tools (GDAL is the Geospatial Data Abstraction Library) are able to do this. Within Spatialite you can use the .dumpshp command to create a shapefile.

The format of .dumpshp is
.dumpshp table Geometry sitelocations CP1252 POINT
table is the TABLE to be exported (I think this could have been a VIEW as well). Geometry is the name of the field within the table that holds the geometric representation; note that Spatialite stores this as a BLOB. sitelocations is the name of the shapefile to be created. CP1252 is the name of a character set; presumably there are others, but I have been using CP1252 for now. POINT is the spatial data type; other types include LINESTRING, POLYGON, and MULTIPOINT.

What would be very useful right now is if I could automate steps 3 and 4 and run them from within Python instead of typing them into the Spatialite shell. However, this only works if SQLite was compiled with the ability to load extensions, and most distributors do not do so. One workaround is to install the pysqlite2 module and compile it so that it can load extensions (as opposed to using the sqlite3 module that is included in Python).
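With such a pysqlite2 build, the workaround looks roughly like this (a sketch; the shared-library name varies by platform):

from pysqlite2 import dbapi2 as sqlite

conn = sqlite.connect("redcross.sqlite")
conn.enable_load_extension(True)
# libspatialite.so on Linux; libspatialite.dylib on a Mac,
# spatialite.dll on Windows.
conn.execute("SELECT load_extension('libspatialite.so')")
conn.enable_load_extension(False)   # turn it back off once loaded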

Monday, June 15, 2009

Using shp2pgsql to import Shapefiles to PostGIS

I covered some of this in a previous post on Converting Lat/Long data to PostGIS. But now my problem is to import Shapefiles into PostGIS.

First, why do this? While Shapefiles are good for distributing and looking at data, it is difficult to do any analysis with them; to do anything, you end up using OGR/GDAL to examine the shapefile and then act on it. PostGIS takes the spatial and data components of the Shapefile and exposes them in database form, so anything that can be done with a database can be done with the Shapefile. In addition, it implements Simple Features for SQL, which allows set operations on geographic objects (think of all the Venn diagram pictures from high school, applied to points, lines, and polygons), plus the other spatial operations that come to mind when dealing with points, lines, and shapes.

The key, of course, is to get the PostGIS database started. So, first, within PostgreSQL, create a database

CREATE DATABASE PA;

Next, load PostGIS onto the database. The first step is to load PL/PGSQL

CREATE LANGUAGE plpgsql;

Next, load lwpostgis.sql and spatial_ref_sys.sql into the database. I used the pgAdmin III GUI.

Next, use the shp2pgsql tool. The format is:

shp2pgsql [options] shapefile tablename dbname > output.sql

This should be run from the directory with the .shp file. Here, shapefile refers to the name of the shapefile; tablename is the name of the table to be created within the database (you can enter this in schema.tablename form if you want); and dbname is the name of the database on the local server that the data is being entered into. Usual practice is to pipe the output to an SQL file that is read in later (with all the necessary permissions), e.g. with psql -d dbname -f output.sql. The most common option is -s followed by the SRID/EPSG code (if known).

So, for example, using the NHPN shapefile for Pennsylvania (FIPS: 42), which is to be imported into a database named PACities, in the 'public' schema, as the table 'NHPN', with the projection NAD 83 (ft) (Pennsylvania South), EPSG=2272, I get:

shp2pgsql -s 2272 S42NHPN public.NHPN PACities > nhpnImport.sql

Run the generated SQL file against the database (e.g. psql -d PACities -f nhpnImport.sql), and now it is in.

Friday, February 06, 2009

Book Review: Desktop GIS: Mapping the Planet with Open Source by Gary Sherman

Desktop GIS: Mapping the Planet with Open Source by Gary Sherman

My review

rating: 4 of 5 stars
Desktop GIS covers open source software for use as a Geographic Information System (GIS). In particular, it covers the following programs and libraries:

- Quantum GIS
- uDig
- GRASS (and its Java front end JGrass, http://www.jgrass.org)
- PROJ.4
- GDAL/OGR
- PostGIS
- FWTools
- GMT

At the beginning of the book, the author outlines three classes of users: a casual user who only needs to look at data found elsewhere, an intermediate user who visualizes but also creates or converts data, and an advanced user who needs to do spatial analysis. Covering all of these is an ambitious goal, which is further diluted by the author's felt need to cover the entirety of open source mapping in one book.

What the author does well is to identify tools and explain what each can do, with enough detail to get you started. So Quantum GIS and uDig are the viewers, able to read almost any GIS format (in particular the readily available ESRI Shapefiles as well as PostGIS). GDAL and OGR can convert anything to anything, including delimited text (these are often distributed as FWTools). PROJ.4 converts projections from one to another (and is embedded in everything). PostGIS is the spatial database that enables spatial analysis. And GRASS is the full-fledged can-do-everything-but-is-hard-to-learn tool. Then there are some random programs that either do something completely different (e.g. OSSIM, which is an imagery analysis tool) or can make a picture of a map with lots of options (e.g. GMT).

What he provides are the basics for the casual or intermediate user. The advanced GIS analyst gets only a taste of what GRASS can do, but would not know what can be done with it. Similarly, while the intermediate user will get a sense of what PostGIS can do, the lack of space to cover the spatial extensions to SQL that PostGIS supports loses its value to the advanced user. What could have made this book better was more focus. If the author was compelled to survey all of open source mapping, that could have gone into an appendix with a few paragraphs on each of the miscellaneous tools. For the book itself, one good focus would have been the GIS stack comprising data storage, data analysis, and data viewing: basically the open source counterpart to the ArcGIS/Oracle Spatial stack. Everything that does not play a role in that stack gets pushed into the appendix.

What could have been done? The book tended to be organized by tool. But once past the casual user, almost all tasks required multiple tools. I would have gone:

1. Viewing data (raster, vector, introduction to QGIS, uDig, GRASS)
2. Converting data/data formats (GDAL/OGR, Maybe PROJ.4)
3. Creating/Editing data (digitizing, importing)
4. Spatial databases (PostGIS)
5. Geoprocessing/spatial analysis - GRASS, PostGIS, R-spatstats
6. Tools integration (QGIS/uDig with GRASS/PostGIS)
7. Scripting
8. Customization

and everything that did not fall into this gets a page in an appendix.

Knowing where to start is a big help. Most websites either focus on one product or try to teach everything as being equally important. Gary Sherman at least identifies the main building blocks (QGIS or uDig, GDAL/OGR/PROJ.4, PostGIS, GRASS) and gives enough to get started, which is very helpful to the starting analyst.

What would be next? For the person still working within the GIS stack (as opposed to a completely different topic, like imagery analysis which is OSSIM's territory) there are a few obvious topics.

1. PostGIS - Spatial databases with SQL. Maybe even connections with ArcGIS. Even a short (10 pg) appendix would have done wonders here.
2. GRASS - The author devotes an additional appendix to this. But to do this right, you probably need to refer to Open Source GIS: A GRASS GIS Approach
3. R spatial statistics packages. Applied Spatial Data Analysis with R would cover this.

Mostly, a good book to get started in GIS using Open Source tools. Casual users would be well served. Intermediate users would get started and can find the rest using the internet. Advanced users are going to miss a lot (to the point that they may not even realize these tools are a worthy alternative).



View all my reviews at Goodreads.

Wednesday, February 04, 2009

Loading lat/long data into PostGIS

Base scenario: I have a data table with longitude and latitude data that I want to look at using spatial tools, and I have shapefiles for the metropolitan area that I want to use with it. So, in other words, I have the following:

  1. ESRI shapefiles
  2. Data table with latitude and longitude

The plan is to load the data table into PostGIS, then I can pull them into a GIS system like uDig or Quantum GIS or various ESRI products. Because PostGIS can do transforms, but ESRI shapefiles are what they are, the procedure is as follows:

  1. Identify SRID (spatial reference ID) for the ESRI shapefiles. This should include units (feet, meters, or degrees)
  2. Identify a corresponding SRID that uses lat/long degrees
  3. Load data table into PostGIS
  4. Add geometry (points) data to datatable, including transformation from degrees to feet.
  5. Optional: export PostGIS to ESRI Shapefile if needed.

1. Identify SRID (spatial reference ID) for the ESRI shapefiles. This should include units (feet, meters, or degrees)

Shapefiles usually come as a set of files inside a directory. When they come from an official source, one of these should be a *.prj file. This is an example of one:

PROJCS["NAD_1983_StatePlane_Pennsylvania_South_FIPS_3702_Feet",
GEOGCS["GCS_North_American_1983",
DATUM["D_North_American_1983", SPHEROID["GRS_1980",6378137.0,298.257222101]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.0174532925199433]], PROJECTION["Lambert_Conformal_Conic"], PARAMETER["False_Easting",1968500.0],
PARAMETER["False_Northing",0.0],
PARAMETER["Central_Meridian",-77.75],
PARAMETER["Standard_Parallel_1",39.93333333333333],
PARAMETER["Standard_Parallel_2",40.96666666666667],
PARAMETER["Latitude_Of_Origin",39.33333333333334],
UNIT["Foot_US",0.3048006096012192]]

The first line identifies the projection, using something close to a WKT ("Well Known Text") name of a projection. In PostGIS, when the PostgreSQL database was geographically enabled, a table "spatial_ref_sys" was created that lists every SRID with its name. Searching this table should reveal a projection whose name is very similar to this one; that will define the target SRID for the whole project. The quick way to do this is to write a query:

SELECT srid, auth_name, auth_srid, srtext, proj4text
FROM spatial_ref_sys
WHERE srtext LIKE '%Pennsylvania%'
ORDER BY srtext;

In this case, we find:

2272; "PROJCS["NAD83 / Pennsylvania South (ftUS)", GEOGCS["NAD83", DATUM["North_American_Datum_1983",SPHEROID["GRS 1980", 6378137,298.257222101, AUTHORITY["EPSG","7019"]],

Note that in this case the key parts were NAD83, Pennsylvania South, and ft.

2. Identify a corresponding SRID that uses lat/long degrees
The search above through the "spatial_ref_sys" table found only projections associated with feet or meters, so a broader search is needed:


SELECT srid, auth_name, auth_srid, srtext, proj4text
FROM spatial_ref_sys
WHERE srtext LIKE '%NAD83%'
ORDER BY srtext;

This eventually reveals one projection that fits:

4269;"GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]], AUTHORITY["EPSG","6269"]], PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]], UNIT["degree",0.01745329251994328, AUTHORITY["EPSG","9122"]], AUTHORITY["EPSG","4269"]]"

The key here is that the 'UNIT' is "degree" as opposed to "Survey feet" or "Meters".

3. Load data table into PostGIS

First, by this point the PostgreSQL database was created using a template that has PostGIS enabled. I created an SQL file with all the Create Table commands ahead of time. They look like this:

CREATE TABLE table2004
(
  case_no character(7) NOT NULL,
  week integer,
  "month" character(3),
  eventdate date,
  bg_lat numeric(12,6),
  bg_long numeric(12,6),
  serial2004 serial NOT NULL,
  CONSTRAINT table2004_pk PRIMARY KEY (serial2004)
)
WITH (OIDS=TRUE);
ALTER TABLE table2004 OWNER TO userid;

The key point is that I need an integer field that is a primary key, or at some point indexing becomes a problem. So I made a serial field to be the index (my case_no field is a text field, so it was unsuitable). Another issue is that the geographic indexing also likes an OID field, so I set OIDS=TRUE. Most current PostgreSQL documentation mentions that OIDs are deprecated, but PostGIS databases would be an exception.

Next, a geographically enabled (geometry) field is added; PostGIS has a PL/pgSQL function named "AddGeometryColumn" that does this. Note that earlier I established that the target SRID is 2272. Then, because there will be spatial searches, I create an index on the geometry column.

SELECT AddGeometryColumn('public', 'table2004', 'longlat', 2272, 'POINT', 2);

CREATE INDEX idx_table2004_longlat ON table2004
USING GIST (longlat);

Next, the data is loaded. I had exported the data table to a csv file, then used VIM to surround each line with an INSERT INTO statement so it looked like this:

INSERT INTO table2004 (week, month, eventdate, case_no, bg_lat, bg_long) VALUES (1,'JUL','07/01/2004','6666097',40.382649,-79.803297);

This loads the data into PostgreSQL. The next step is to actually create the geographic data.
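Incidentally, instead of hand-editing in VIM, a short Python script could generate the same INSERT statements from the CSV. A sketch, assuming the column order shown above and clean values with no embedded quotes:

import csv

# Emit one INSERT per CSV row; columns assumed to be
# week, month, eventdate, case_no, bg_lat, bg_long.
reader = csv.reader(open("table2004.csv"))
for week, month, eventdate, case_no, lat, lng in reader:
    print("INSERT INTO table2004 "
          "(week, month, eventdate, case_no, bg_lat, bg_long) "
          "VALUES (%s,'%s','%s','%s',%s,%s);"
          % (week, month, eventdate, case_no, lat, lng))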

4. Add geometry (points) data to datatable, including transformation from degrees to feet.

Now that there is a geometry column, and the lat/long information is in the database as a source, the point geometry needs to be created. This is done through an UPDATE statement, using the PostGIS functions "transform" and "PointFromText". Note that the SRID corresponding to a lat/long representation was 4269. Also note that a POINT is expressed as x, y, so longitude comes before latitude.

UPDATE table2004 SET longlat = transform(PointFromText('POINT(' || bg_long || ' ' || bg_lat || ')', 4269), 2272);

Now, this is done, and the table can be viewed using a GIS viewer that can connect to a spatial database.

Some of the data is messy; in particular, I was not able to get lat/long for all the addresses. So I created a view that removes these from the result (otherwise I get a bunch of points at 0,0):

CREATE OR REPLACE VIEW table2004work AS
SELECT table2004.week, table2004.month, table2004.eventdate, table2004.case_no, table2004.bg_lat, table2004.bg_long, table2004.serial2004, table2004.longlat
FROM table2004
WHERE table2004.bg_lat > 1 AND table2004.bg_long < (-1);

ALTER TABLE table2004work OWNER TO userid;
COMMENT ON VIEW table2004work IS 'Remove 0,0 data from map';

So this provides a cleaner dataset, and it is also accessible from a GIS viewer.

5. Optional: export PostGIS to ESRI Shapefile if needed.

To export from the PostGIS database to an ESRI Shapefile (if it needs to be given to someone who only needs to look at it), the PostgreSQL database should be running. Then pgsql2shp can be run; this should be done after creating a data directory for the shapefile to live in. From the terminal, cd into the target directory and run

pgsql2shp -f table2004working.shp datatable table2004work

This creates a shapefile consisting of .shp, .dbf, and .shx files. Note that you may need to create a .prj file as well, but this can be done from within a GIS viewer.


Note: Much of this was found through the great documentation available at Refractions PostGIS site and the Boston GIS website, which has multiple tutorials on PostGIS and PostgreSQL.

[Edit] In the 'srtext' field, PROJCS identifies the SRID as belonging to a 'projection', which maps a curved surface onto a flat plane. When there is no projection, it is a pure GEOGCS (geographic coordinate system), and the coordinates are in degrees lat/long.