The United States Census Bureau conducts an annual survey of approximately 3.5 million households called the American Community Survey (ACS), and releases the data in two formats that we use: a Summary file and a Public Use Microdata Sample (PUMS) file. The data is widely used by the federal government to shape legislation and funding for the nation.
In contrast to the decennial census, which is designed to count every person living in the US, the ACS is administered by sampling: a long-form questionnaire is sent to a small percentage of the population every year. To protect privacy in the microdata, some values, such as financial gains and losses or a person's age, are top- and/or bottom-coded, and geographical location is only broadly known. The summary files, by contrast, can share aggregated information from the households for very small geographical areas such as ZIP codes or even blocks without endangering respondent privacy. The household PUMS data is reported across Public Use Microdata Areas (PUMAs), each of which contains at least 100,000 people; PUMAs are redefined after each decennial census.
The microdata files include a wide range of social, demographic, economic, and housing data that make it possible to analyze relationships between various data points in a large community. For example, Ididio is able to describe the characteristics of people with given careers or degrees by aggregating the individual responses. The microdata is available in 1-year and 5-year formats; we use the 5-year data because its larger sample gives greater statistical accuracy for the information presented.
Formatting and viewing the microdata files is not an easy task. However, people without data wrangling knowledge have terrific access to this data through the Census Bureau's American FactFinder, or, alternatively, through IPUMS. We access the microdata files through the Census FTP Server. Our access dates have been:
At Ididio, we used the house-by-house ACS microdata to investigate outcomes associated with people (1) who earned bachelor's degrees in various fields, or (2) who were working in various occupations.
When reporting outcomes for people based on the field of an earned bachelor's degree, we limit the microdata to those people who have attained a bachelor's degree. Some people report two undergraduate majors, and for such individuals, we split the corresponding survey record into two records. Each of the new records contains a single major, and the new record's survey weight is equal to half of the original's weight. This allows us to aggregate according to a single major, given that all possible combinations of double-majors would not yield statistically useful results.
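The record split described above can be sketched in a few lines. This is a minimal illustration rather than Ididio's production code: the `PersonRecord` type and its field names are hypothetical, and a real pipeline would presumably halve each replicate weight in the same way as the main weight.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class PersonRecord:
    # Hypothetical field names chosen for this example only.
    weight: float              # person survey weight
    major_1: Optional[str]     # first reported field of degree
    major_2: Optional[str]     # second reported field of degree, if any
    income: float


def split_double_majors(records: List[PersonRecord]) -> List[PersonRecord]:
    """Return records that each carry a single major.

    A double-major record becomes two records, each with half of the
    original survey weight, so the total weight is unchanged.
    """
    result = []
    for rec in records:
        if rec.major_1 is None:
            continue  # no bachelor's field reported: outside this analysis
        if rec.major_2 is None:
            result.append(rec)
            continue
        half = rec.weight / 2
        result.append(replace(rec, weight=half, major_2=None))
        result.append(replace(rec, weight=half, major_1=rec.major_2, major_2=None))
    return result


people = [
    PersonRecord(weight=100.0, major_1="Mathematics", major_2=None, income=50_000),
    PersonRecord(weight=80.0, major_1="Physics", major_2="Music", income=60_000),
]
single_major_records = split_double_majors(people)
```

Because the two halves sum to the original weight, population totals are preserved while each major receives half of the respondent's contribution.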
The assumptions made in compiling workforce data have a significant effect on the results that we present on Ididio. Each assumption is based on an extensive study of the assumptions made by BLS and the Census Bureau in other data products, as well as a comparative analysis of the data that resulted when using differing assumptions.
Except when creating statistics related to age, we limit the person records to those who report occupations and who are 65 or younger. We found many more outliers among older workers, as post-retirement careers can be atypical.
With the exception of calculations involving employment status and part-time/full-time status for all who report occupations, we only calculate workforce data for individuals who report working 35 hours a week or more. This is our effort to create salary and other statistics for full-time work only.
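The age and hours restrictions above amount to a simple record filter. The sketch below is illustrative only: the `Worker` type and its field names are hypothetical (in the PUMS person file they would likely correspond to columns such as `AGEP`, `OCCP`, and `WKHP`, which is an assumption not stated in the text), and it covers only the salary-style statistics, not the age or employment-status calculations that are exempt from these limits.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_AGE = 65            # records for people older than this are dropped
FULL_TIME_HOURS = 35    # minimum usual weekly hours counted as full-time


@dataclass(frozen=True)
class Worker:
    # Hypothetical field names chosen for this example only.
    age: int
    occupation: Optional[str]   # None when no occupation is reported
    hours_per_week: int
    weight: float


def full_time_sample(workers: List[Worker]) -> List[Worker]:
    """Keep people who report an occupation, are 65 or younger,
    and usually work 35 or more hours a week."""
    return [
        w for w in workers
        if w.occupation is not None
        and w.age <= MAX_AGE
        and w.hours_per_week >= FULL_TIME_HOURS
    ]


workers = [
    Worker(age=40, occupation="Registered nurse", hours_per_week=40, weight=120.0),
    Worker(age=70, occupation="Consultant", hours_per_week=45, weight=90.0),
    Worker(age=30, occupation="Barista", hours_per_week=20, weight=110.0),
    Worker(age=25, occupation=None, hours_per_week=0, weight=100.0),
    Worker(age=65, occupation="Electrician", hours_per_week=35, weight=95.0),
]
kept = full_time_sample(workers)
```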
Many individuals report both wage and salary income (paid by an employer) and self-employment income, and we combine these two values into a single income for each individual. However, ACS top-codes these values at differing levels, so it is difficult to identify or quantify top-coding errors in our resulting estimates for high wage earners. It is also not possible to tell from the survey questions whether both wage earnings and self-employment earnings come from the same reported occupation; without additional information, we assume that both types of earnings apply to the reported occupation.
At Ididio, we use the ACS microdata to create five-number summaries of the spread of responses:
We follow the method described by the National Institute of Standards and Technology (NIST) to calculate percentiles from a set of numbers, and we use interpolation to calculate values from binned data.
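As a rough illustration of both ideas, the sketch below implements the percentile rule from the NIST/SEMATECH e-Handbook (rank = p(N + 1), with linear interpolation between neighboring order statistics) and a generic linear interpolation inside a bin for grouped data. The function names are hypothetical, the binned version is a common textbook approach assumed here rather than a documented Ididio procedure, and the first function ignores survey weights for simplicity.

```python
import math
from typing import Sequence


def nist_percentile(values: Sequence[float], p: float) -> float:
    """Percentile for 0 <= p <= 1 using the rank p * (N + 1) rule."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    rank = p * (n + 1)
    k = math.floor(rank)      # integer part of the rank
    d = rank - k              # fractional part of the rank
    if k <= 0:
        return ordered[0]     # below the first order statistic
    if k >= n:
        return ordered[-1]    # at or beyond the last order statistic
    # Interpolate between the k-th and (k+1)-th smallest values (1-based).
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])


def binned_percentile(edges: Sequence[float], counts: Sequence[float], p: float) -> float:
    """Percentile from binned data, assuming values are spread evenly in each bin.

    edges holds len(counts) + 1 ascending bin boundaries; counts holds the
    (possibly weighted) count in each bin.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    target = p * total
    cumulative = 0.0
    for lower, upper, count in zip(edges, edges[1:], counts):
        if count > 0 and cumulative + count >= target:
            fraction = (target - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    return edges[-1]


median = nist_percentile([1, 2, 3, 4, 5], 0.5)
binned_median = binned_percentile([0, 10, 20, 30], [10, 20, 10], 0.5)
```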
Each person record in the ACS is given a weight, so that the summed weights of all people estimate the total population. Each person is also given 80 replicate weight values that can be used to estimate the uncertainty of statistics inferred using the person record. For all statistics Ididio creates using the PUMS data, we calculate the percent standard error using the replicate weights and the Technical Documentation for ACS data methods.
We do not share any statistic whose standard error is greater than 25% of the statistic's value.
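The replicate-weight calculation and the 25% cutoff can be sketched as follows. The formula used here is the one given in the Census Bureau's PUMS accuracy documentation for 80 replicate weights, SE = sqrt((4/80) × Σ (X_r − X)²), where X is the estimate from the main weight and X_r is the same estimate recomputed with the r-th replicate weight. The function names are hypothetical, and the handling of a zero estimate is an assumption made for this example.

```python
import math
from typing import Sequence

NUM_REPLICATES = 80
MAX_PERCENT_SE = 25.0   # statistics above this percent standard error are withheld


def replicate_standard_error(estimate: float, replicate_estimates: Sequence[float]) -> float:
    """Standard error from 80 replicate estimates of the same statistic."""
    if len(replicate_estimates) != NUM_REPLICATES:
        raise ValueError("expected exactly 80 replicate estimates")
    squared_diffs = sum((r - estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(4.0 / NUM_REPLICATES * squared_diffs)


def percent_standard_error(estimate: float, replicate_estimates: Sequence[float]) -> float:
    """Standard error expressed as a percentage of the estimate."""
    return 100.0 * replicate_standard_error(estimate, replicate_estimates) / abs(estimate)


def is_publishable(estimate: float, replicate_estimates: Sequence[float]) -> bool:
    """Apply the 25% rule; a zero estimate is treated as unpublishable (assumption)."""
    if estimate == 0:
        return False
    return percent_standard_error(estimate, replicate_estimates) <= MAX_PERCENT_SE


# Toy replicate estimates: half slightly above and half slightly below the estimate.
stable = [110.0] * 40 + [90.0] * 40
noisy = [130.0] * 40 + [70.0] * 40
```

In practice each replicate estimate is produced by rerunning the whole calculation (for example, a weighted median salary) with one replicate weight in place of the main person weight.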
The ACS microdata uses a special kind of top-coding for certain variables in order to protect the privacy of respondents. As an example, every reported wage and salary income above a certain percentile of responses is replaced with a single new value, chosen so that the average over all data equals the average of the unchanged data. The trigger and replacement salary values vary by state and year.
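A toy example may make the mechanism concrete. One value that keeps the overall average unchanged is the mean of the values being replaced, which is what this sketch uses. The function name and the threshold are hypothetical, the data is invented, and survey weights are ignored; the actual thresholds and replacement values are set by the Census Bureau.

```python
from typing import List, Sequence


def top_code(values: Sequence[float], threshold: float) -> List[float]:
    """Replace every value above the threshold with the mean of those values."""
    above = [v for v in values if v > threshold]
    if not above:
        return list(values)
    replacement = sum(above) / len(above)
    return [replacement if v > threshold else v for v in values]


incomes = [30_000.0, 50_000.0, 200_000.0, 400_000.0]   # invented data
coded = top_code(incomes, threshold=150_000.0)          # hypothetical threshold

original_mean = sum(incomes) / len(incomes)
coded_mean = sum(coded) / len(coded)
```

Here both incomes above the threshold become 300,000: the overall mean is preserved, but the person who reported 200,000 now appears to earn considerably more than they actually reported.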
As a result of this policy, some reported values may be considerably higher than the values originally given, and the resulting errors may not be possible to infer. The academic article Top-Coding and Public Use Microdata Samples from the US Census Bureau is a very helpful explanation of this top-coding approach and its ramifications. The Census Bureau also provides a list of the impacted variables, their top-coded values, and the levels that trigger top-coding.