Frequently asked questions

Questions and answers about this project and the underlying CAL-ACCESS database.

Why is all this necessary?

California’s jumbled, dirty and difficult campaign-finance database, known as CAL-ACCESS, sprawls across 80 database tables and weighs in at more than 650 megabytes.

A significant effort is required to understand its esoteric structure and prepare the records for meaningful analysis.

That barrier blocks anyone seeking to interpret the information. Untangling the database takes weeks of study and significant guesswork, which discourages most analysts from trying and raises the risk that those who do will make a critical error.

That is the problem we aim to solve by leading an open-source effort to perfect the numerous transformations, filters and computer operations necessary to refine the raw data into an easy-to-use product.

What is the California Civic Data Coalition?

The California Civic Data Coalition is an open-source team of journalists and computer programmers from news organizations across America.

The coalition was formed in 2014 by Ben Welsh and Agustin Armendariz to lead the development of open-source software that makes California’s public data easier to access and analyze. The effort has drawn hundreds of contributions from developers and journalists at dozens of competing news outlets.

Learn more on our about page.

How far back does the CAL-ACCESS database go?

According to the California Secretary of State, electronic disclosure documents started being filed in CAL-ACCESS on Jan. 1, 2000. Historical analysis of the database should start from that date, the documentation says.

Do you offer all tables in the CAL-ACCESS database?

No. We’ve compared the list of tables in the daily database dump that powers this project to what’s described in the official CAL-ACCESS documentation. As many as 73 tables are excluded.
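
As a rough illustration, the comparison described above amounts to a set difference between the documented table names and the exported ones. The sketch below is not the project's actual code: the function name find_excluded_tables and the EXAMPLE_TABLE_* names are hypothetical, and the input lists are abbreviated placeholders rather than real inventories.

```python
def find_excluded_tables(documented, exported):
    """Return documented table names that are absent from the export, sorted."""
    def normalize(names):
        # Ignore case and stray whitespace so the two lists compare cleanly.
        return {name.strip().upper() for name in names}

    return sorted(normalize(documented) - normalize(exported))


# Hypothetical, abbreviated inputs for illustration only.
documented = ["ELECTIONS", "FEES", "EXAMPLE_TABLE_A", "EXAMPLE_TABLE_B"]
exported = ["example_table_a", "EXAMPLE_TABLE_B "]

excluded = find_excluded_tables(documented, exported)
print(len(excluded), excluded)
```

Normalizing names before comparing guards against a table being counted as missing only because of differences in case or padding between the two sources.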

Some of these missing tables have names or descriptions suggesting they could contain sensitive information, such as user credentials and bank account numbers. It’s understandable that these tables would not be released.

However, many of these tables contain information that we believe should be publicly available.

For instance, there is a series of tables that describe elections, races and candidates that are not included in the daily exports, even though the list of candidates for the current election is published on the CAL-ACCESS website.

When we reached out to the California Secretary of State, asking that they include any elections-related tables in the daily dumps, we were told that “[g]iven our current resource constraints, staff cannot modify the database export to include that other data.”

Here’s a sampling of missing tables we think should be made public:

Name                         Description
CASH_RECON_REPORT_WRK        Table description contains this mysterious comment: "J M needs to describe this table. Cox - 4/28/2000"
CODE_LIST                    This table contains a list of CAL file codes. Examples include entity codes, office codes and expense codes.
CORRESPONDENCE               Table description contains this mysterious comment: "J M needs to describe this table. Cox - 4/28/2000"
DISCLOSURE_PROCEDURES        Table description contains this mysterious comment: "J M needs to describe this table."
ELECTION_CANDIDATES          This table indicates if a candidate for a given race is an incumbent.
ELECTION_LINKS               No description
ELECTION_RACES               No description
ELECTION_TYPES               This table links election types and their descriptions.
ELECTIONS                    No description
ERRORS_AND_OMISSIONS         This table contains results of audit checks and the validation process.
FEDERAL_FORMS                Table used to log receipt of federal filings.
FEES                         Fees, descriptions and their value
FILER_CORRESPONDENCE_BUILD2  Table description contains this mysterious comment: "J M needs to describe this table."
FILER_ELECTIONS              Table description contains this mysterious comment: "J M needs to describe this table. He indicates it is for future use."
FILER_NOTICE_GENERATION_DEF  "J M needs to describe this table. He indicates it is for future use."
FILER_OBLIGATIONS            Table description contains this mysterious comment: "J M needs to describe this table. He indicates it is for future use."
FILER_TYPES_TO_FORMS         Table description contains this mysterious comment: "J M needs to describe this table. It is in his list of tables designed for future releases."
FILING_ERROR_TYPES           This lookup table provides a cross reference for errors and their messages.
FILING_ERRORS                This table contains the errors associated with a given filing and each of its amendments.
FILING_ID_TEMP               No description
FORM_CODES                   This lookup table associates record types to forms.
FORMS                        This table describes the form set.
LATE_CONT_IND_EXP_REPORT     Table description contains this mysterious comment: "J M needs to describe this table."
LOCAL_FORMS                  This table is used to log receipt of local paper filings.
PRD_DATA_AUDIT               No description
PRD_FINE_DETAIL              Detail information on how a fine was calculated.
PRD_FINES                    Fine summary data table.
PRD_LIMITS                   Table description contains this mysterious comment: "J M needs to describe this table."
PRD_WAIVERS                  Table description contains this mysterious comment: "J Mo needs to describe this table."
TVIEW_CONTRIBUTIONS3         Campaign Disclosure reporting tables. "Need to get DH's Documentation to describe."

Do you offer all of the available data?

Almost. The raw data provided by the state contains some errors in how values are escaped, quoted and delimited. As a result, a small number of records that we cannot yet parse automatically are lost during our process.

However, according to our tracking, 99.9998% of records in the downloaded source files are offered here for download.
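
At the rate stated above, roughly two records in every million are lost. The sketch below shows one way such tracking could work, and it is not the project's actual loader. It assumes tab-delimited text with a header row and treats any row whose field count differs from the header as unparseable. The function names, column names and sample rows are all hypothetical.

```python
import csv
import io


def split_clean_rows(raw_text, delimiter="\t"):
    """Separate rows that match the header's field count from rows that do not."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    header = next(reader)
    clean, rejected = [], []
    for row in reader:
        # A stray delimiter or broken quoting changes the number of fields.
        if len(row) == len(header):
            clean.append(row)
        else:
            rejected.append(row)
    return header, clean, rejected


def percent_loaded(clean_count, total_count):
    """Share of source records that survived parsing, as a percentage."""
    return 100.0 * clean_count / total_count if total_count else 0.0


# Hypothetical sample: the second record has an extra, unescaped tab.
sample = (
    "FILING_ID\tAMOUNT\tNAME\n"
    "1\t100.00\tSmith\n"
    "2\t250.00\tJones\tstray field\n"
    "3\t75.50\tLee\n"
)

header, clean, rejected = split_clean_rows(sample)
total = len(clean) + len(rejected)
print(f"{percent_loaded(len(clean), total):.4f}% of {total} records loaded")
```

Keeping the rejected rows instead of discarding them silently makes it possible to report the loss and to revisit those records once the parser improves.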

Do you modify the source data?

We make every effort to carefully parse and load the bulk CAL-ACCESS data from the state “as is.” Therefore, any undocumented modification of the data made during this process is considered a bug in our software.

There is one exception: We strip the time from any date field in the raw data. We consider this modification to be of little consequence because, for the most part, the time part of every value is 12:00:00 AM.

Based on our inspection of the raw data, very little information is being lost and whatever is lost has questionable utility.
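
The date handling described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the raw layout in RAW_FORMAT (for example "1/1/2000 12:00:00 AM") is an assumption about how the export writes dates, and both function names are hypothetical.

```python
from datetime import datetime, time

# Assumed raw layout, e.g. "1/1/2000 12:00:00 AM"; the real export may differ.
RAW_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def strip_time(raw_value):
    """Parse a raw date-time string and keep only the calendar date."""
    return datetime.strptime(raw_value.strip(), RAW_FORMAT).date()


def has_midnight_time(raw_value):
    """Report whether the time part is 12:00:00 AM, so nothing is lost by stripping it."""
    parsed = datetime.strptime(raw_value.strip(), RAW_FORMAT)
    return parsed.time() == time(0, 0, 0)


# Hypothetical sample values.
values = ["1/1/2000 12:00:00 AM", "6/30/2014 3:45:10 PM"]
for value in values:
    print(strip_time(value).isoformat(), has_midnight_time(value))
```

A check like has_midnight_time is one way to count how many values carry a time other than midnight, which is the information that stripping discards.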