Importing EPA WQX Domains into MySQL Tables

By John C. Zastrow

I wanted fresh WQX domain values from the STORET web service that are updated daily ( http://www.epa.gov/storet/wqx/wqx_getdomainvalueswebservice.html ) – which are provided in XML format. But, it’s not a friendly format for most software. Excel and Access “see it” but they don’t make useful tables out of them. It turns out that MySQL’s LOAD XML command (https://dev.mysql.com/doc/refman/5.5/en/load-xml.html) works for this, but you need to change the structure of the XML a bit. When the manual says that the XML must be one of the 3 formats – it’s not kidding and it’s not as flexible as the text makes it sound. So, the following instructions and horribly-brute force script will transform the EPA XML files into something that MySQL will understand and then load the data for you.

First I created this table to hold my data (you might need to add more columns).

CREATE TABLE `wqx21_domains` (
  `DOM_ID` int(11) NOT NULL AUTO_INCREMENT,
  `WQXElementName` varchar(500) DEFAULT NULL,
  `UniqueIdentifier` varchar(500) DEFAULT NULL,
  `Code` varchar(500) DEFAULT NULL,
  `Type` varchar(500) DEFAULT NULL,
  `TribalCode` varchar(500) DEFAULT NULL,
  `LastChangeDate` varchar(500) DEFAULT NULL,
  `Description` varchar(500) DEFAULT NULL,
  `Name` varchar(500) DEFAULT NULL,
  `ContextCode` varchar(500) DEFAULT NULL,
  `QualifierType` varchar(500) DEFAULT NULL,
  `Rank` varchar(500) DEFAULT NULL,
  `ExternalID` varchar(500) DEFAULT NULL,
  `STORETID` varchar(500) DEFAULT NULL,
  `SRSID` varchar(500) DEFAULT NULL,
  `SampleFractionRequired` varchar(500) DEFAULT NULL,
  `PickList` varchar(500) DEFAULT NULL,
  `CASNumber` varchar(500) DEFAULT NULL,
  `CountyFIPSCode` varchar(500) DEFAULT NULL,
  `CountyName` varchar(500) DEFAULT NULL,
  `StateCode` varchar(500) DEFAULT NULL,
  `CREATED_DT` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`DOM_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=52401 DEFAULT CHARSET=utf8

then run the following bash code from within a directory that looks like below.

WQX XML domain file list

WQX XML domain file list

in the last sed line below, note that you must reset the delimeter to (or something besides / as you need to escape those chars in the values you need to replace. This is the easiest way I think.

UPDATE: the post following this one shows how to create the directory of .zip files this script needs with a one-liner using wget.

<pre class="lang:sh decode:true" title="bash script to run through all the .zip files and load them to mysql">#!/bin/sh
clear
# optional stuff for checking output
# echo "I see this many zip files: "
# ls *.zip | wc -l
# echo "and they are: "
# ls *.zip
echo ""
echo ""
echo ""
echo ""

echo "************* START ******************"
pwd
echo ""
for zipper in *.zip
do
echo $zipper
echo ""
mkdir "$zipper"_folder
echo ""

unzip -u $zipper -d "$zipper"_folder
echo ""
echo ""
cd "$zipper"_folder
pwd
# I'm going to just leave these around as the directories are easy to delete later.
cat Results.xml | sed -e 's|WQXElementRowColumn|field|' > Results2.xml
cat Results2.xml | sed -e 's/WQXElementRowColumn>/field>/' > Results3.xml
cat Results3.xml | sed -e 's/colname="/name="/' > Results4.xml
cat Results4.xml | sed -e 's/value="/">/' > Results5.xml
cat Results5.xml | sed -e 's/" ">/">/' > Results6.xml
cat Results6.xml | sed -e 's|"></field>|</field>|' > Results7.xml
mysql -uuser -p'password' --local-infile -e "use wqx;LOAD XML LOCAL INFILE 'Results7.xml' INTO TABLE wqx21_domains ROWS IDENTIFIED BY '<WQXElementRow>';"
# rm *.xml
cd ..
echo ""
pwd
echo " ----- DONE ------- "$zipper
done
mysqldump -uuser -p'password' wqx > wqx_lookup_dump.sql

and you will get a MySQL table with contents as follows

--- +---------------------------------+---------+
| WQXElementName                  | Records |
+---------------------------------+---------+
| ActivityGroupType               |       8 |
| ActivityMedia                   |      16 |
| ActivityMediaSubdivision        |      64 |
| ActivityType                    |     100 |
| AddressType                     |       6 |
| AnalyticalMethod                |    6204 |
| Assemblage                      |      15 |
| BiologicalIntent                |       6 |
| CellForm                        |       5 |
| CellShape                       |      10 |
| Characteristic                  |    3766 |
| CharacteristicPickListValue     |    1747 |
| ContainerColor                  |       7 |
| ContainerType                   |      33 |
| Country                         |      16 |
| County                          |    3292 |
| Detection/QuantitationLimitType |      12 |
| ElectronicAddressType           |       3 |
| FrequencyClassDescriptor        |      65 |
| Habit                           |       9 |
| HorizontalCollectionMethod      |      38 |
| HorizontalReferenceDatum        |      16 |
| MeasurementUnit                 |     335 |
| MethodSpeciation                |      15 |
| MetricType                      |      21 |
| MetricTypeContext               |       3 |
| MonitoringLocationType          |      73 |
| NetType                         |       3 |
| Organization                    |     817 |
| PhoneType                       |      10 |
| ReferenceLocationType           |       4 |
| RelativeDepth                   |      16 |
| ResultDetectionCondition        |       5 |
| ResultLabComment                |      34 |
| ResultMeasureQualifier          |      53 |
| ResultStatisticalBase           |      28 |
| ResultStatus                    |      17 |
| ResultTemperatureBasis          |      19 |
| ResultTimeBasis                 |     106 |
| ResultValueType                 |       5 |
| ResultWeightBasis               |       4 |
| SampleCollectionEquipment       |     177 |
| SampleFraction                  |      24 |
| SampleTissueAnatomy             |      83 |
| SamplingDesignType              |       2 |
| State                           |      68 |
| Taxon                           |   65502 |
| ThermalPreservative             |      20 |
| TimeZone                        |      46 |
| ToxicityTestType                |       4 |
| Tribe                           |    1126 |
| VerticalCollectionMethod        |      28 |
| VerticalReferenceDatum          |      12 |
| Voltinism                       |      10 |
| WellFormationType               |       6 |
| WellType                        |      40 |
+---------------------------------+---------+
56 rows in set (0.92 sec)

Other resources:

http://stackoverflow.com/questions/8582837/load-xml-local-infile-with-inconsistent-column-names

http://blog.mclaughlinsoftware.com/2010/09/26/load-xml-local-infile/