Data Standardization for California’s Future
Jennifer Pahlka (founder of Code for America) recently wrote a good article outlining some of the details of Gavin Newsom’s new initiative, the Office of Digital Innovation (ODI), and how the new office could build momentum and support for badly needed improvements to the government’s technological infrastructure in the near future. Along with the governor’s strong commitment to California’s housing goals, there are plenty of reasons for Californians to be optimistic going forward, since there’s a much stronger sense of “let’s get things done!” now than I’ve seen before.
Although I live in Los Angeles now, until about a year ago I was working at the Mayor’s Office of Housing and Community Development as a Data Specialist up in San Francisco. At the time, the office was under the direction of Mayor Edwin M. Lee, before London Breed took over the role shortly thereafter. (That was after all the crazy happenings in the run-up to the special election, though that’s another story altogether.)
My job at the time was basically to take the data that we had, most of which was sitting around in Excel files and outdated databases (MS Access, anyone?), and get everything into a single, modernized database system. During my time there (~1 year) I was able to get a few of their smaller programs (Displaced Tenants Housing Preference Program [DTHP], Certificate of Preference [COP]) into the new system, which is an accomplishment that I’m pretty proud of, if I may say so myself. My stint at the Mayor’s Office was relatively short but I did enjoy my time there, mostly because of the great people I was able to work with day to day.
Now that I’ve had some time away to reflect, I feel like I have a bit of perspective on what went well there and what could be improved. And I would say that if the ODI wants to get the best bang for their buck, the best thing they could do is to push for a state-level data standard to help unify the definitions and procedures around virtually all the things that the government does every day.
Data Standards: Why They’re Needed, Badly
During my time working on data at the Mayor’s Office, much of the work I was doing involved having long conversations about what things *meant*: was “SSN” supposed to be “Social Security Number” or was it actually “Somewhat Spicy Noodles”? A lot of the distinctions might sound “obvious” to most people, but when you’re going through hundreds of them, some more vague than others, it can turn into a long, drawn-out process. Since we were dealing with sensitive data that could potentially affect people’s livelihoods significantly (we’re talking about housing, after all), we weren’t really allowed to make any assumptions in this regard. And that’s how it should be, honestly.
After my contract at the Mayor’s Office ended, I felt like I did have an impact on their procedures there: information made a little bit cleaner, a little bit more organized. And there’s a lot of low-hanging fruit in the GovTech space right now for this type of work, since it’s a problem that basically every department, every office, and every staff member has to deal with every single day.
But I do worry about where all of these efforts are going in the long run, because it seems like people are solving these issues independently rather than working towards a unified standard, which is what good data practices should entail. Our department decided that they would call social security numbers “SSN”, but the office down the street could be using “SSN#”, or an office somewhere in Los Angeles County could be using “Social Sec. #” or some other variation that doesn’t quite line up with the others. If there is ever a need for us to connect these data points together (and the need will be there if we ever want to build any of the cool stuff), then we’d have to go through the redefinition process all over again, on a project-to-project basis.
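To make the mismatch concrete, here is a minimal sketch in Python. The three tiny "exports" and every value in them are made up for illustration (they are not real records or real office schemas), and the helper names `shared_columns` and `rename` are my own. It shows that three files about the same person share no column to join on until someone hand-builds an alias table for that one project.

```python
# Hypothetical exports from three offices; every name and value is invented.
office_a = [{"SSN": "000-00-0001", "Program": "DTHP"}]
office_b = [{"SSN#": "000-00-0001", "Waitlist": "yes"}]
office_c = [{"Social Sec. #": "000-00-0001", "County": "Los Angeles"}]


def shared_columns(*tables):
    """Return the column names that appear in every row of every table."""
    column_sets = [set(row.keys()) for table in tables for row in table]
    return set.intersection(*column_sets)


# The three files describe the same person, yet share no column to join on.
print(shared_columns(office_a, office_b, office_c))  # set()

# The per-project workaround: a hand-built alias table (an assumption here),
# which has to be rediscovered for every new project.
PROJECT_ALIASES = {"SSN#": "SSN", "Social Sec. #": "SSN"}


def rename(table, aliases):
    """Rename known variants; leave every other column name untouched."""
    return [{aliases.get(key, key): value for key, value in row.items()}
            for row in table]


cleaned = [rename(t, PROJECT_ALIASES) for t in (office_a, office_b, office_c)]
print(shared_columns(*cleaned))  # {'SSN'}
```

The alias table is the part that took all those long conversations to build, and without a shared standard it gets rebuilt from scratch each time two offices need to compare notes.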
This is actually a very simple problem and the solution for fixing it is already quite obvious: we should all be using the same words for things that mean exactly the same thing! Easier said than done, though, because it will require strong leadership on the part of the State of California to really get this right, and the path to getting there will probably take more work than most people think. Will the Office of Digital Innovation be the one to do it? If California were to become the first state to set clear and workable data standards for all the cities and counties under it, it could potentially become the model for how things are done on a national level as well.
Even as a cryptocurrency/blockchain enthusiast (you can check my personal site or other posts on Medium for those) excited about the California Blockchain Working Group to come this summer, I know that without clean data, blockchain projects have basically no hope of succeeding. The blockchain itself can be thought of as just another type of database, after all. If we try to decentralize government data systems before the data is clean, all we’d be doing is spreading the problems that we currently have all over the place, potentially making things even worse.
So if we want to take a “pay it forward” approach to GovTech and Civic Tech in general, we will need to give the notion of clean data some serious thought, and fairly soon. Clean data allows us to capture people’s volunteering efforts more effectively and encourage accountability among elected officials and staff, while opening the floodgates to services and projects that were previously impossible. To name a few: interactive maps, tracking/monitoring systems, auto-generated stats/charts, predictive data science models, cross-regional collaborative systems, and many more that we won’t really know about until the infrastructure itself is in place.
Data Standards: The Approach
Right now it’s unclear what sorts of roles and positions will be involved with the Office of Digital Innovation, so there’s not much to be said there, but we can start to talk about what the implementation of the standards might look like, at least hypothetically.
- Step 1: Collect and create a comprehensive list of data definitions specific to State-level procedures. A lot of definitions will be specific to the State but entries like “First Name”, “Last Name”, “Address”, “Social Security Number”, etc. are definitions that are applicable more universally. The task then becomes to start finding the commonalities between the City and State.
- Step 2: Pass the list of standards down to the City level, where the agreement becomes: if the State already has a definition, the city will use that one instead. If the city currently has “F. Name” as its naming scheme, for example, it will simply change it to “First Name”, as established by the State.
- Step 3: After each City establishes its own standards in line with the State, it then passes the standards down to its respective departments, where a similar process of renaming happens.
- Step 4: Repeat the process as necessary, even down to the level of individual Excel files for individual staff. It may or may not be necessary to enforce these practices all the way down to that level, but in most cases staff members will probably appreciate the direction since they’re often forced to come up with these definitions on their own. (At the end of the day it doesn’t really matter to them if something is called “SSN” or “Social Security Number” since it doesn’t affect the nature of their work.)
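The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `STATE_STANDARD` table, its list of variants, and the sample city and department field names are all hypothetical, not an actual State of California standard. The idea is that the State publishes canonical names along with known variants, and each level below renames whatever the State already defines while keeping its local-only fields as they are.

```python
# Step 1: a hypothetical State-level standard, canonical names plus known variants.
STATE_STANDARD = {
    "First Name": {"F. Name", "FName", "first_name"},
    "Last Name": {"L. Name", "LName", "last_name"},
    "Social Security Number": {"SSN", "SSN#", "Social Sec. #"},
}


def build_alias_table(standard):
    """Invert the standard into a variant -> canonical name lookup."""
    aliases = {}
    for canonical, variants in standard.items():
        for variant in variants:
            aliases[variant] = canonical
    return aliases


def adopt_standard(local_names, standard):
    """Steps 2-4: rename fields the State defines; leave local-only fields alone."""
    aliases = build_alias_table(standard)
    return [aliases.get(name, name) for name in local_names]


# Made-up field lists for a city and one of its departments.
city_fields = ["F. Name", "L. Name", "SSN#", "Lottery Number"]
dept_fields = ["FName", "Social Sec. #", "Unit Size"]

print(adopt_standard(city_fields, STATE_STANDARD))
# ['First Name', 'Last Name', 'Social Security Number', 'Lottery Number']
print(adopt_standard(dept_fields, STATE_STANDARD))
# ['First Name', 'Social Security Number', 'Unit Size']
```

The same function works at every level, which is the point: once the State’s list exists, the city, the department, and the individual spreadsheet all defer to one table instead of each inventing their own.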
Removing the majority of incongruities that arise from loose definitions gives data practitioners some breathing room to see themselves more as custodians of information that could lead to new and exciting services (what they signed up for), rather than as firefighters constantly trying to fix a system on the verge of falling apart, which is the reality of most data jobs right now, unfortunately.
Timeline for this project? The challenge here is more organizational than technical, so it could take a few months or a few decades, depending on how badly people want to get it done. The standards will have to be maintained, updated, and revised just like any other project, but even an incomplete list would still be a huge improvement over what we have today.
Clean Data for a Better Future of California
So my hope is that the ODI will give serious consideration to establishing a data standard practice for California as one of its priorities this year. When it comes to “Big Data”, nothing is bigger than the data that the government has, so it’s literally a huge resource that’s just waiting to be used. A lot of people think that the “Big Data” hype was a passing fad that happened some years ago, but the reality is that the potential for a data revolution has been there all along. It’s just that nobody has bothered taking it to its full potential. Not yet, at least.
There are lots of people (technologists, volunteers, government officials, activists, PACs, etc.) just waiting for something like this to happen, so having the ODI take the lead on something like this would send a positive signal to them, as well as to the tech industry as a whole, that California is on the right track with regard to its technological policies going forward.