GSoC 2024 - IOOS, ERDDAP™
Abstract
This blog post covers the work I contributed to ERDDAP™ during my Google Summer of Code contribution period with the organization IOOS, alongside my mentors Chris and Tylar.
About the project
My project for GSoC 2024 was adding enhancements to ERDDAP™, a scientific data server written in Java. The work focused on features that improve the experience of ERDDAP™ administrators. The majority of it was writing a new parser for the XML configuration file (datasets.xml), along with some changes and fixes to the existing codebase.
Pre contribution period
These are some things that helped me get familiar with ERDDAP™.
My first ever contribution involved fixing a bug that caused incorrect data to be written to the log file, polluting the logs. It was a short fix; the challenging part was figuring out which part of the code was causing it. Through this contribution I learned how to set up ERDDAP™ locally and how the development lifecycle works. Check it out here!
Following this, I helped out a little with things such as serving as a test user for the Jetty environment setup and finding test cases that failed during the migration of the tests to Maven and JUnit. The majority of those conversations can be found in this GitHub discussion.
Major Contributions!
As mentioned, the majority of the work was building the datasets parser for ERDDAP™. It already has a custom-written parser called SimpleXmlReader, so why write a new one? There are a few reasons. The codebase is fairly old and does not support modern XML features such as XInclude and XML namespaces, which makes the datasets.xml configuration file harder to write. We could have modified the older parser, but overall writing a new one seemed like the better approach: building on modern libraries makes it easier to add features, enables XML validation and improves the overall security of datasets parsing. The next step was finding a library that works best for our use case, and after a lot of discussion and detours with my mentors we decided to use Java's event-based SAX parser.
The difficult part of using SAX is that the existing parser reads the datasets file sequentially from top to bottom, with the calling code in control of when the next tag is read, whereas SAX uses a push-based model in which the parser runs our callback functions whenever it encounters a tag.
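To make the push-based model concrete, here is a minimal sketch of SAX callbacks. It is not ERDDAP™'s code: the class names `DatasetIdCollector` and `SaxPushDemo` are made up for illustration, and the XML is a heavily simplified stand-in for datasets.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: the parser calls startElement for every opening tag. */
class DatasetIdCollector extends DefaultHandler {
  final List<String> datasetIds = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    // We do not ask for the next tag; we only react when the parser pushes one to us.
    if ("dataset".equals(qName)) {
      datasetIds.add(attributes.getValue("datasetID"));
    }
  }
}

class SaxPushDemo {
  /** Parses the given XML string and returns the datasetID of every dataset tag. */
  static List<String> collectIds(String xml) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    DatasetIdCollector handler = new DatasetIdCollector();
    parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return handler.datasetIds;
  }

  public static void main(String[] args) throws Exception {
    String xml =
        "<erddapDatasets>"
            + "<dataset type=\"EDDGridFromDap\" datasetID=\"gridA\"/>"
            + "<dataset type=\"EDDTableFromNcFiles\" datasetID=\"tableB\"/>"
            + "</erddapDatasets>";
    System.out.println(collectIds(xml)); // prints [gridA, tableB]
  }
}
```

Because the parser decides when each callback fires, any state that a top-to-bottom reader would keep in local variables has to live in the handler instead, which is what made porting the old logic tricky.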
The development took place in the following phases:
To get more familiar with the codebase, my initial step was writing tests, which also helped me debug things better. I wrote tests for a class called 'LoadDatasets.java' which, you guessed it, loads the datasets into memory for ERDDAP™ to use and runs in a separate thread. Link to PR.
Now that we understood how the datasets were being loaded and how the old parser worked, it was time to isolate the old parsing code, which meant a lot of refactoring. Why? The new parser would be optional, enabled through a setting in the config file, so cleaning up and refactoring the old code would make it significantly easier to add the code for the new parser. Link to PR.
The file to parse is fairly complicated. It consists of a number of 'top level tags' that contain some important data, and the <dataset> tag, which acts as the parent tag for each of the datasets that we need to parse. To handle all this we first had to configure and initialize the parser, the code for which looks something like this:
private void parseUsingSAX(
    int[] nTryAndDatasets,
    StringArray changedDatasetIDs,
    HashSet<String> orphanIDSet,
    HashSet<String> datasetIDSet,
    StringArray duplicateDatasetIDs,
    StringBuilder warningsFromLoadDatasets,
    HashMap tUserHashMap)
    throws ParserConfigurationException, SAXException, IOException {
  // Bundle everything the handlers need into a single context object.
  var context = new SaxParsingContext();
  context.setNTryAndDatasets(nTryAndDatasets);
  context.setChangedDatasetIDs(changedDatasetIDs);
  context.setOrphanIDSet(orphanIDSet);
  context.setDatasetIDSet(datasetIDSet);
  context.setDuplicateDatasetIDs(duplicateDatasetIDs);
  context.setWarningsFromLoadDatasets(warningsFromLoadDatasets);
  context.settUserHashMap(tUserHashMap);
  context.setMajorLoad(majorLoad);
  context.setErddap(erddap);
  context.setLastLuceneUpdate(lastLuceneUpdate);
  context.setDatasetsRegex(datasetsRegex);
  context.setReallyVerbose(reallyVerbose);

  // Turn on XInclude and namespace support before creating the parser.
  SAXParserFactory factory = SAXParserFactory.newInstance();
  factory.setXIncludeAware(true);
  factory.setNamespaceAware(true);
  SAXParser saxParser = factory.newSAXParser();

  // Parsing starts in the top level state.
  SaxHandler saxHandler = new SaxHandler();
  TopLevelHandler topLevelHandler = new TopLevelHandler(saxHandler, context);
  saxHandler.setState(topLevelHandler);
  saxParser.parse(inputStream, saxHandler);
}
Here we use a context object: a simple POJO class with a bunch of getters and setters, which was needed to make retrieving objects and properties cleaner and more concise. Link to PR.
Once the top level tags were parsed came the most difficult and most important part of the project: parsing the <dataset></dataset> tag, which can contain multiple inner datasets, each with its own properties and structure. Many tags repeat within these datasets, so we needed a way to tell the parser how to deal with each scenario. To tackle this, my mentors suggested the "State" design pattern. As someone with no experience applying design patterns, this proved to be quite the task for me and ended up taking a few weeks, along with a lot of discussion and input from my mentors, but we finally had it in place. Following this we had to bring over a lot of logic from the older parser, such as error handling, logging, the logic for when to skip datasets and the logic for how to process them, and I wrote tests for all the code I added. Link to PR.
Now we had most of the logic in place to start parsing datasets. ERDDAP™ has many dataset types that the old parser loaded:
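For readers who have not met the State pattern either, here is a rough sketch of the idea under simplifying assumptions. One handler is registered with the SAX parser and forwards every event to whichever state is currently active, and states swap themselves in and out as tags open and close. All class names here (`ParseState`, `DelegatingHandler`, `TopLevelState`, `DatasetState`, `StatePatternDemo`) are hypothetical and much smaller than the real handlers in the PR.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** A state receives the SAX events for as long as it is the active state. */
abstract class ParseState {
  void startElement(String qName, Attributes attrs) {}
  void characters(String text) {}
  void endElement(String qName) {}
}

/** The only handler the parser knows about; it forwards events to the active state. */
class DelegatingHandler extends DefaultHandler {
  private ParseState state;
  final List<String> results = new ArrayList<>();

  void setState(ParseState state) { this.state = state; }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attrs) {
    state.startElement(qName, attrs);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    state.characters(new String(ch, start, length));
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    state.endElement(qName);
  }
}

/** Active outside any dataset; hands control to a DatasetState when one opens. */
class TopLevelState extends ParseState {
  private final DelegatingHandler handler;

  TopLevelState(DelegatingHandler handler) { this.handler = handler; }

  @Override
  void startElement(String qName, Attributes attrs) {
    if ("dataset".equals(qName)) {
      handler.setState(new DatasetState(handler, this, attrs.getValue("datasetID")));
    }
  }
}

/** Active inside one dataset; returns control to its parent when the dataset closes. */
class DatasetState extends ParseState {
  private final DelegatingHandler handler;
  private final ParseState parent;
  private final String datasetId;
  private final StringBuilder text = new StringBuilder();

  DatasetState(DelegatingHandler handler, ParseState parent, String datasetId) {
    this.handler = handler;
    this.parent = parent;
    this.datasetId = datasetId;
  }

  @Override
  void startElement(String qName, Attributes attrs) {
    if ("dataset".equals(qName)) {
      // An inner dataset gets its own state, with this state as its parent.
      handler.setState(new DatasetState(handler, this, attrs.getValue("datasetID")));
    } else {
      text.setLength(0); // start collecting the text of a new child tag
    }
  }

  @Override
  void characters(String s) { text.append(s); }

  @Override
  void endElement(String qName) {
    if ("sourceUrl".equals(qName)) {
      handler.results.add(datasetId + " -> " + text.toString().trim());
    } else if ("dataset".equals(qName)) {
      handler.setState(parent);
    }
  }
}

class StatePatternDemo {
  static List<String> parse(String xml) throws Exception {
    DelegatingHandler handler = new DelegatingHandler();
    handler.setState(new TopLevelState(handler));
    SAXParserFactory.newInstance()
        .newSAXParser()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return handler.results;
  }
}
```

The benefit is that the same tag name can mean different things depending on the active state, so each state only needs to know about the tags that can appear in its own part of the file.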
EDDGridAggregateExistingDimension
EDDGridCopy
EDDGridFromAudioFiles
EDDGridFromDap
EDDGridFromEDDTable
EDDGridFromErddap
EDDGridFromEtopo
EDDGridFromMergeIRFiles
EDDGridFromNcFiles
EDDGridFromNcFilesUnpacked
EDDGridLonPM180
EDDGridLon0360
EDDGridSideBySide
EDDTableAggregateRows
EDDTableCopy
EDDTableFromAsciiServiceNOS
EDDTableFromCassandra
EDDTableFromDapSequence
EDDTableFromDatabase
EDDTableFromEDDGrid
EDDTableFromErddap
EDDTableFromFileNames
EDDTableFromAsciiFiles
EDDTableFromAudioFiles
EDDTableFromAwsXmlFiles
EDDTableFromColumnarAsciiFiles
EDDTableFromHttpGet
EDDTableFromInvalidCRAFiles
EDDTableFromJsonlCSVFiles
EDDTableFromHyraxFiles
EDDTableFromMultidimNcFiles
EDDTableFromNcFiles
EDDTableFromNcCFFiles
EDDTableFromNccsvFiles
EDDTableFromOBIS
EDDTableFromSOS
EDDTableFromThreddsFiles
EDDTableFromWFSFiles
Each of these has its own parsing logic, after which the parsed data is passed to a constructor that creates the dataset object to load into memory. To build similar logic with the SAX parser, I wrote handlers for all these dataset types, containing the callback functions and the logic for parsing the information. This was the majority of the code written during the project: thousands of lines, together with tests for the datasets whose handlers had been implemented. After a few weeks of coding, debugging and a few pull requests, we finally had a parser that was close to working and that people can use just by adding the line
<useSaxParser>true</useSaxParser> in the setup.xml file.
This completed most of the work, apart from a few changes to the ERDDAP™ 'status.html' page, which now shows the datasets that failed to load along with the reasons for the failure. Previously this had to be checked in the email sent by ERDDAP™, so the change makes it a little easier and quicker for admins to see why datasets are failing.
The following are the links to the pull requests that contain all the code for the dataset handlers and the final changes to ERDDAP™.
Impact and summary of the work
<useSaxParser>true</useSaxParser> (default false) can be added to the setup.xml file to turn on parsing with the SAX parser.
Features like namespaces, XInclude and XML validation were added, making it much easier to write the datasets.xml file.
The status page now shows the reasons datasets failed to load, for both SimpleXmlReader and the SAX parser.
Tests written with modern JUnit, such as those for LoadDatasets.java, TopLevelHandlerTests.java and DatasetHandlerTests.java.
Encouraging good XML writing practices in the datasets.xml configuration file through the use of the SAX parser.
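To give a feel for what XInclude support allows, here is a small self-contained sketch that splits a dataset definition into a second file and pulls it in with an xi:include element. The file names, the class name `XIncludeDemo` and the stripped-down XML are assumptions made for illustration; only the two factory settings mirror the parser setup shown earlier.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class XIncludeDemo {
  /** Parses mainFile with XInclude enabled and returns every datasetID it sees. */
  static List<String> datasetIds(Path mainFile) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // XInclude processing relies on namespaces
    factory.setXIncludeAware(true);
    List<String> ids = new ArrayList<>();
    factory
        .newSAXParser()
        .parse(
            mainFile.toFile(),
            new DefaultHandler() {
              @Override
              public void startElement(
                  String uri, String localName, String qName, Attributes attrs) {
                if ("dataset".equals(localName)) {
                  ids.add(attrs.getValue("datasetID"));
                }
              }
            });
    return ids;
  }

  /** Writes a main file plus an included file to a temp directory and parses them. */
  static List<String> runDemo() throws Exception {
    Path dir = Files.createTempDirectory("xinclude-demo");
    // Hypothetical file holding one dataset definition on its own.
    Files.writeString(
        dir.resolve("extra.xml"), "<dataset type=\"EDDGridFromDap\" datasetID=\"included\"/>");
    // Main file: one inline dataset plus a reference to extra.xml.
    Files.writeString(
        dir.resolve("datasets.xml"),
        "<erddapDatasets xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
            + "<dataset type=\"EDDTableFromNcFiles\" datasetID=\"local\"/>"
            + "<xi:include href=\"extra.xml\"/>"
            + "</erddapDatasets>");
    return datasetIds(dir.resolve("datasets.xml"));
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runDemo());
  }
}
```

With XInclude enabled, the handler should see the dataset from extra.xml as if it had been written inline, which is what lets admins break a large datasets.xml into smaller files.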
Future plans and improvements
The SAX parser is something that will be adopted over time by the ERDDAP™ user community, and there are a few things around it that can still be improved: better error messages that include line numbers, new documentation on using the SAX parser along with its pros and cons, maybe some benchmarks to see how both parsers perform, and more. I hope to stick around ERDDAP™ and add some of these in the future myself!
Closing words
My summer with IOOS and ERDDAP™ was a very memorable one. I went from knowing the very basics of Java to contributing thousands of lines of code changes to an important open source project that is used by amazing people around the world. There were ups and downs, tense moments when things failed and feelings of achievement when things worked, and a lot of this was thanks to my mentors Chris and Tylar, with whom I had many meetings and great times during this summer. A big thanks to all the people involved with IOOS and Google Summer of Code for this amazing experience. To everyone else reading this, I highly recommend giving open source contribution a shot and getting involved in these communities. You will meet amazing people, learn things exponentially and help improve the ecosystem, which is healthy for the projects and everyone involved.