usdarnass
provides an alternative for downloading
various USDA data from https://quickstats.nass.usda.gov/
through R. You must sign up for an API key from the
mentioned website in order for this package to work.
The USDA’s documentation on Quick Stats can be found throughout https://www.nass.usda.gov/Quick_Stats/index.php. A short description of what the data entail can be summarised from the Quick Stats description on data.gov:
Quick Stats is the National Agricultural Statistics Service’s (NASS) online, self-service tool to access complete results from the 1997, 2002, 2007, and 2012 Censuses of Agriculture as well as the best source of NASS survey published estimates. The census collects data on all commodities produced on U.S. farms and ranches, as well as detailed information on expenses, income, and operator characteristics. The surveys that NASS conducts collect information on virtually every facet of U.S. agricultural production.
There are two main USDA sources within Quick Stats: censuses and surveys. The census values in Quick Stats start in 1997 while the survey values can range all the way back to 1850 and then annually since 1866. Although Agricultural Censuses occur once every 5 years (1997, 2002, 2007, 2012, and most recently 2017), USDA will administer other censuses, for instance the Census of Aquaculture, which explains the additional years of available census data. At this time, there is no support for documenting the various kinds of data which can be extracted from Quick Stats and this package only serves as an R interface for downloading Quick Stats data. It is up to the user to understand the source of the data they download, most of which can be found at https://www.nass.usda.gov/Surveys/.
usdarnass
is currently on CRAN and can be installed with
the typical method of using install.packages("usdarnass")
in R. However, if you would like the development version then it can be
installed with the use of devtools
:
Load package:
If a query works on the https://quickstats.nass.usda.gov/
interface, then it will work with the usdarnass
package.
Keep in mind that there is a 50,000 observation limit for both the web
interface and data queries with this package.
There are three main functions for this package with the first as the workhorse:
nass_data()
this will return a data.frame to the
specifications of the query from all of the arguments set in the
function call. This mimics the simple “GET DATA” command off of https://quickstats.nass.usda.gov/
and requires an API key. There is a 50,000 limit for each call.nass_param()
returns all of the possible values for a
parameter in a query. Helpful to understand how to subset a query if it
runs into the 50,000 limit.nass_count()
returns the number of records for a query.
Very useful in conjunction with nass_param()
to determine
what queries can return data with a nass_data()
call.All of these functions require an API Key for each query, which can be set with another function.
nass_set_key("YOUR_KEY_IN_QUOTATIONS", overwrite = TRUE)
# First time, reload your enviornment so you can use the key without restarting R.
readRenviron("~/.Renviron")
# You can check it with:
Sys.getenv("NASS_KEY")
The above script will add a line to your .Renviron
file
to be re-used whenever you are using the package. If you are not
comfortable with the package writing to your home directory, you can add
the following line to your .Renviron
file manually to
produce the same result. The function
usethis::edit_r_environ()
should open up your
.Renviron
file for you to make edits to with the following
line (use your actual API key for the right hand side value):
NASS_KEY = 'YOUR_KEY_IN_QUOTATIONS'
If you are not comfortable with either of these options, then to use
the package you need to ensure that the key
parameter is
set to your API key in each of your function calls that queries USDA
Quick Stats.
As previously stated, the workhorse function is the
nass_data()
function which will make query calls and return
a data.frame as long as the query will return 50,000 or fewer
observations. Each argument in the function call corresponds to the
categories which are displayed on the web interface of Quick Stats. The
arguments accept character values and even character vectors in the case
of multiple selections in a category. By way of an example, we can put
in a query to Quick Stats for the value of agricultural land (and
buildings) in Durham and Wake County North Carolina for 2012 (warning:
there is a lot of information :
nc_vals <- nass_data(year = 2012,
short_desc = "AG LAND, INCL BUILDINGS - ASSET VALUE, MEASURED IN $",
county_name = c("Durham", "WAKE"),
state_name = "NORTH CAROLINA")
str(nc_vals)
The output of this query has a lot to digest. The main focus is on
the Value
variable in the resulting data.frame. Please note
that the returned Value
for the query is of the class
character. The parameter numeric_vals
can be set to
TRUE
to have this return a numeric value, however the
default is to be a character type as there are some suppressed values
which will be coerced to NA
when numeric_vals
is set to TRUE
.
Aside from the output, the particular query used four parameters for
its output: year
, short_desc
,
county_name
, and state_name
. Each of these
parameters have a particular set of values which can generate of query,
which can be figured out using the nass_param()
function.
The year
argument is the only argument that accepts either
a character or numeric value, all other arguments only accept character
vectors which can be multiple values in the case of
county_name
in the call above. The arguments for
usdarnass
are not case sensitive. Whether the county is
called “Durham”, “durham”, “DURHAM”, or “dUrHaM” the parameter will be
ultimately passed to the API the same.
In querying the Quick Stats API, you usually do not need to subset
many parameters to get to the 50,000 limitation but if you do not get
within the 50,000 limitation the call will produce an error. As of right
now, there are 18 parameters for each query, although most of those will
be NULL
values and not required to specify an output.
There are 18 parameters which can be included in each query to Quick Stats however some of them are concatenations of other parameter values. If there is a name in quotations following a particular parameter, then that is the drop down menu via the Quick Stats web interface that gives you a value. Not all parameters will have a drop down menu though:
source_desc
“Program” - Source of data (“CENSUS” or
“SURVEY”). Census program includes the Census of Ag as well as follow up
projects. Survey program includes national, state, and county
surveys.sector_desc
“Sector” - Five high level, broad
categories useful to narrow down choices. (“ANIMALS & PRODUCTS”,
“CROPS”, “DEMOGRAPHICS”, “ECONOMICS”, or “ENVIRONMENTAL”)group_desc
“Group” - Subsets within sector (e.g., under
sector_desc = “CROPS”, the groups are “FIELD CROPS”, “FRUIT & TREE
NUTS”, “HORTICULTURE”, and “VEGETABLES”).commodity_desc
“Commodity” - The primary subject of
interest (e.g., “CORN”, “CATTLE”, “LABOR”, “TRACTORS”,
“OPERATORS”).short_desc
“Data Item” - A concatenation of six
columns: commodity_desc, class_desc, prodn_practice_desc,
util_practice_desc, statisticcat_desc, and unit_desc.domain_desc
“Domain” - Generally another characteristic
of operations that produce a particular commodity (e.g., “ECONOMIC
CLASS”, “AREA OPERATED”, “NAICS CLASSIFICATION”, “SALES”). For chemical
usage data, the domain describes the type of chemical applied to the
commodity. The domain_desc = “TOTAL” will have no further breakouts;
i.e., the data value pertains completely to the short_desc.domaincat_desc
“Domain Category” - Categories or
partitions within a domain (e.g., under domain_desc = “SALES”, domain
categories include $1,000 TO $9,999, $10,000 TO $19,999, etc).agg_level_desc
“Geographic Level” - Aggregation level
or geographic granularity of the data. (“AGRICULTURAL DISTRICT”,
“COUNTY”, “INTERNATIONAL”, “NATIONAL”, “REGION : MULTI-STATE”, “REGION :
SUB-STATE”, “STATE”, “WATERSHED”, or “ZIP CODE”)statisticcat_desc
“Category” - The aspect of a
commodity being measured (e.g., “AREA HARVESTED”, “PRICE RECEIVED”,
“INVENTORY”, “SALES”).state_name
“State” - State full name.asd_desc
“Ag District” - Ag statistics district
name.county_name
“County” - County name.region_desc
“Region” - NASS defined geographic entities
not readily defined by other standard geographic levels. A region can be
a less than a state (SUB-STATE) or a group of states (MULTI-STATE), and
may be specific to a commodity.zip_5
“Zip Code” - US Postal Service 5-digit zip
code.watershed_desc
“Watershed” - Name assigned to the
HUC.year
“Year” - The numeric year of the data and can be
either a character or numeric vector. Conditional values are also
possible, for example a character vector of “>=1999” of “1999<=”
will give years greater than or equal to 1999. Right now the logical
values can either be greater/less than or equal to with the logical at
either the beginning or end of a string with the year.freq_desc
“Period Type” - Length of time covered
(“ANNUAL”, “SEASON”, “MONTHLY”, “WEEKLY”, “POINT IN TIME”). “MONTHLY”
often covers more than one month. “POINT IN TIME” is as of a particular
day.reference_period_desc
“Period” - The specific time
frame, within a freq_desc.The descriptions of parameters here are minimal as most of the actual
parameters have a vast amount of available options. This is where the
nass_param()
function comes into play in order to give the
full set of values for each parameter.
The nass_param()
function will return a vector of all
the possible values for a parameter conditional on the other parameter
subsets given above. We can see that there are only two sources of
datasets for the Quick Stats queries by asking what the values for the
source_desc
parameter are:
The first argument in nass_param()
is the parameter of
interest, which can take on any of the 18 values from the parameters
section. This argument must be passed in a character format, so make
sure to use quotations in your calls. The characters are also not case
sensitive in the calls.
This function is most helpful in determining what variables are available for a certain subset. For example, if I was interested in what county level variables in Ohio are available in 2000 I might start by determining what “Group” is available at that level:
Now, if I want to further figure out what commodities are available for the “DAIRY” subset of this data but only after 2000, I would make a call of:
nass_param("commodity_desc",
group_desc = "dairy",
state_name = "OHIO",
agg_level_desc = "COUNTY",
year = ">2000")
While the year
parameter does not need to be a character
vector, it does accept relational operators which can modify the subsets
even further for queries.
Every query with nass_data()
has a 50,000 limit of
observations. In order to determine the number of observations in a
query, the nass_count()
function will accept all the same
parameters as nass_data()
but its output is a numeric of
the observations in a query:
Here we see how many observations are currently in the Quick Stats as of 2025-02-20. Clearly, the entire data can not be downloaded with a query of their API. If you are concerned about downloading all of the Quick Stats data, then it would be more efficient to use their ftp site ftp://ftp.nass.usda.gov/quickstats/.
By way of another example, we can look at how many observations are available related to agricultural land at the county level:
This particular query would not be able to be run for
nass_data()
because the number of observations greatly
exceeds 50,000 and indeed that query returns an error:
At this point in time a bit of understanding of the data and the
user’s goals are needed. If there is only one state of interest for the
study, then subsetting the data further to a state is likely the best
strategy. However, it is more likely that the user wants the all county
level data related to agricultural land. My strategy would be to look at
the number of observations for each year of interest to my data request.
I would then take a vector of potential values for years and iterate a
call to nass_count
over each year to determine the count of
values for each year:
years <- 2000:2017
names(years) <- 2000:2017
sapply(years,
function(x) nass_count(year = x,
commodity_desc = "AG LAND",
agg_level_desc = "COUNTY"), USE.NAMES = TRUE)
The observations here are only related to 2002, 2007, and 2012 which are agricultural census years and it is highly likely that there are a lot of variables in the category what would likely not be excessive information. It is then best to look at the descriptions of the variables to figure out what data would be most useful:
agland_params <- nass_param("short_desc",
commodity_desc = "AG LAND",
agg_level_desc = "COUNTY",
year = c(2002, 2007, 2012))
length(agland_params)
sapply(agland_params[1:6],
function(x) nass_count(short_desc = x,
commodity_desc = "AG LAND",
agg_level_desc = "COUNTY",
year = c(2002, 2007, 2012)))
There are many parameter values in that query, but only the first 6
are displayed to save space. While most of these data items fit within
the 50,000 limit, not all do. Take for example the “AG LAND - TREATED,
MEASURED IN ACRES” category exceeds the limit and would not be able to
be downloaded. This is because the treated category actually has
multiple domains which can be seen by combining the
nass_param()
and nass_count()
:
agland_domain <- nass_param("domain_desc",
short_desc = "AG LAND - TREATED, MEASURED IN ACRES",
commodity_desc = "AG LAND",
agg_level_desc = "COUNTY")
sapply(agland_domain,
function(x) nass_count(domain_desc = x,
short_desc = "AG LAND - TREATED, MEASURED IN ACRES",
commodity_desc = "AG LAND",
agg_level_desc = "COUNTY"))
To download the above data, one would need to iterate over each of
the domain_desc
values in order to fit within the limits.
The lapply
method is likely the easiest way to accomplish
this: