Open data

Our group strives to make our software and datasets easily accessible to promote reproducible research. Our open source code is available under the permissive UofI/NCSA open source license and published on GitHub. Open datasets are published here.

2010-2013 New York City Taxi Data

This dataset was obtained through a Freedom of Information Law (FOIL) request from the New York City Taxi & Limousine Commission (NYCT&L). It covers four years of taxi operations in New York City and includes 697,622,444 trips. Thanks to a generous hosting policy by the University of Illinois at Urbana-Champaign, we are able to make this large dataset publicly available.

You are free to use the data as you wish. We only ask that you consider citing the following works if you plan to publish subsequent results using the dataset:

Brian Donovan and Daniel B. Work. “Using coarse GPS data to quantify city-scale transportation system resilience to extreme events.” Presented at the Transportation Research Board 94th Annual Meeting, January 2015. Preprint, source code.

Brian Donovan and Daniel B. Work. “New York City Taxi Trip Data (2010-2013)”. 1.0. University of Illinois at Urbana-Champaign. Dataset. http://dx.doi.org/10.13012/J8PN93H8, 2014.

Download the data here:  http://dx.doi.org/10.13012/J8PN93H8

The data is stored in CSV format, organized by year and month. In each file, each row represents a single taxi trip. Table 1 below gives a small sample of this data. As there are several entries per second for four years, the raw trip data takes up about 116GB in text CSV format. The data has been compressed (zip) to reduce download time.
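Because each monthly file is large, it can help to stream rows straight out of the zip archive rather than unpacking it first. The following is a minimal sketch, not part of the published source code. It assumes that each archive contains a single CSV file with a header row matching the column names in Table 1; the function name `iter_trips` and the archive layout are assumptions made for illustration.

```python
import csv
import io
import zipfile


def iter_trips(zip_path, member=None):
    """Yield one dict per trip from a zipped CSV without unpacking it to disk.

    Assumes the CSV has a header row (an assumption, not stated by the source).
    """
    with zipfile.ZipFile(zip_path) as archive:
        # If no member is named, read the first file in the archive.
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text, skipinitialspace=True):
                # Strip stray whitespace around header names; skip overflow columns.
                yield {key.strip(): value for key, value in row.items() if key}
```

Rows are produced lazily, so memory use stays small even though the uncompressed data totals about 116GB. A caller would loop over `iter_trips(path)` with the path of a downloaded archive.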

The data is organized as follows:

  • medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID. See also medallions.
  • hack_license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID. See also hack license.
  • vendor_id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT), implemented as part of the Technology Passenger Enhancements Project.
  • rate_code: taximeter rate, see NYCT&L description.
  • store_and_fwd_flag: unknown attribute.
  • pickup_datetime: start time of the trip, yyyy-mm-dd hh:mm:ss (24-hour clock, as in the sample rows of Table 1), EDT.
  • dropoff_datetime: end time of the trip, yyyy-mm-dd hh:mm:ss (24-hour clock), EDT.
  • passenger_count: number of passengers on the trip; the default value is one.
  • trip_time_in_secs: trip time measured by the taximeter in seconds.
  • trip_distance: trip distance measured by the taximeter in miles.
  • pickup_longitude and pickup_latitude: GPS coordinates at the start of the trip.
  • dropoff_longitude and dropoff_latitude: GPS coordinates at the end of the trip.

The medallion and hack license IDs are reassigned each year, so it is only possible to track drivers and vehicles within each year. This is necessary to render the data pseudo-anonymous, since de-anonymized data from 2013 can be reconstructed from existing published datasets; see the note on anonymity below.

Table 1. A small subset of the New York City taxi trip data. Each row corresponds to an occupied taxi trip. Scroll sideways to view all columns.
| medallion | hack_license | vendor_id | rate_code | store_and_fwd_flag | pickup_datetime | dropoff_datetime | passenger_count | trip_time_in_secs | trip_distance | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010000001 | 2010000001 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:34:00 | 1 | 34 | 14.05 | -73.948418 | 40.72459 | -73.92614 | 40.864761 |
| 2010000002 | 2010000002 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:33:00 | 1 | 33 | 9.65 | -73.997414 | 40.736156 | -73.997833 | 40.736168 |
| 2010000003 | 2010000003 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:07:00 | 1 | 7 | 1.63 | -73.967171 | 40.764236 | -73.956299 | 40.781261 |
| 2010000004 | 2010000004 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:33:00 | 1 | 33 | 26.61 | -73.789757 | 40.646526 | -74.136749 | 40.601543 |
| 2010000005 | 2010000005 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:28:00 | 2 | 28 | 3.15 | -73.99955 | 40.731152 | -73.977448 | 40.763031 |
| 2010000006 | 2010000006 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:27:00 | 1 | 27 | 11.15 | -73.993698 | 40.736946 | -73.861435 | 40.756256 |
| 2010000007 | 2010000007 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:18:00 | 3 | 18 | 4.30 | -74.006058 | 40.739925 | -73.957405 | 40.765686 |
| 2010000008 | 2010000008 | VTS | 1 | | 2010-01-01 00:00:00 | 2010-01-01 00:27:00 | 1 | 27 | 9.83 | -73.874245 | 40.773739 | -74.0028 | 40.760498 |
| 2010000009 | 2010000009 | CMT | 1 | 0 | 2010-01-01 00:00:00 | 2010-01-01 00:18:13 | 1 | 18.219999999999999 | 3.40 | -74.004868 | 40.751656 | -73.988342 | 40.718399 |
| 2010000010 | 2010000010 | CMT | 1 | 0 | 2010-01-01 00:00:02 | 2010-01-01 00:36:27 | 2 | 36.420000000000002 | 12.40 | -73.95546 | 40.787731 | -73.961739 | 40.666935 |

Please note that the dataset contains a large number of errors. For example, there are several trips where the reported meter distance is significantly shorter than the straight-line distance between pickup and dropoff, which is geometrically impossible. For some periods the field trip_time_in_secs is reported in seconds; in others it is reported in minutes (see the first record above). Generally, the trip time can be safely computed by subtracting pickup_datetime from dropoff_datetime. Additionally, many trips report GPS coordinates of (0,0), or cover impossible distances, times, or velocities. All of these obvious trip errors should be discarded in any analysis. In our preliminary investigations, these errors account for roughly 7.5% of all trips. More details about these errors are available in the above article and corresponding open source code. Currently, only the raw data (no error filtering) is available for download via this site.
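The checks described above can be sketched as a simple per-row filter. This is an illustrative sketch only, not the filtering used in the article or its source code. The speed limit `MAX_SPEED_MPH`, the slack factor `STRAIGHT_LINE_SLACK`, and the function names are assumptions chosen for the example. Trip time is computed from the two timestamps, as recommended above, rather than from trip_time_in_secs.

```python
import math
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the sample rows in Table 1
EARTH_RADIUS_MILES = 3958.8
MAX_SPEED_MPH = 100.0        # assumed threshold, not taken from the article
STRAIGHT_LINE_SLACK = 0.9    # assumed allowance for GPS noise


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def trip_seconds(row):
    """Trip duration from the timestamps (dropoff minus pickup), in seconds."""
    start = datetime.strptime(row["pickup_datetime"], TIME_FORMAT)
    end = datetime.strptime(row["dropoff_datetime"], TIME_FORMAT)
    return (end - start).total_seconds()


def is_plausible(row):
    """Return False for rows showing the obvious errors described in the text."""
    try:
        p_lon = float(row["pickup_longitude"])
        p_lat = float(row["pickup_latitude"])
        d_lon = float(row["dropoff_longitude"])
        d_lat = float(row["dropoff_latitude"])
        meter_miles = float(row["trip_distance"])
        seconds = trip_seconds(row)
    except (KeyError, TypeError, ValueError):
        return False  # missing or unparseable fields
    # GPS coordinates of (0, 0) at either end of the trip.
    if (p_lon, p_lat) == (0.0, 0.0) or (d_lon, d_lat) == (0.0, 0.0):
        return False
    # Non-positive durations or distances.
    if seconds <= 0 or meter_miles <= 0:
        return False
    # Meter distance significantly shorter than the straight-line distance.
    straight = haversine_miles(p_lon, p_lat, d_lon, d_lat)
    if meter_miles < STRAIGHT_LINE_SLACK * straight:
        return False
    # Impossible average velocity.
    if meter_miles / (seconds / 3600.0) > MAX_SPEED_MPH:
        return False
    return True
```

The thresholds are deliberately loose. A stricter analysis would tune them or follow the filtering rules documented in the article's source code.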

Fare data is also available for 2010-2013. The fare data takes about 75GB in raw text CSV format, and is also zipped to reduce download times. A sample of the fare data is shown in Table 2 below. The files are also organized by year and month, and contain the following attributes:

  • medallion: a permit to operate a yellow taxi cab in New York City; it is effectively a (randomly assigned) car ID. See also medallions.
  • hack_license: a license to drive the vehicle; it is effectively a (randomly assigned) driver ID. See also hack license.
  • vendor_id: e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT), implemented as part of the Technology Passenger Enhancements Project.
  • pickup_datetime: start time of the trip, yyyy-mm-dd hh:mm:ss (24-hour clock, as in the sample rows of Table 2), EDT.
  • payment_type: cash or credit card (the sample rows in Table 2 use codes such as CAS, Cas, and Cre).
  • fare_amount: the meter fare, in USD; it should include the Newark surcharge.
  • surcharge: extra fees, such as rush hour and overnight surcharges, in USD.
  • mta_tax: Metropolitan commuter transportation mobility tax, in USD.
  • tip_amount: tip amount, in USD.
  • tolls_amount: total price paid for tolls, summed across all tolls for the trip, in USD.
  • total_amount: all charges that are presented to the passenger at time of fare payment (includes tip for non-cash trips), in USD.

Again, note the medallion and hack licenses change each year.
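As a basic sanity check on the fare files, one can test whether total_amount equals the sum of the listed charges, which holds for the sample rows in Table 2. The sketch below is illustrative. The rule that the five charge fields always sum to the total, the one-cent tolerance, and the use of (medallion, hack_license, pickup_datetime) as a key for matching fare rows to trip rows are all assumptions, not statements from NYCT&L or the article.

```python
from decimal import Decimal, InvalidOperation

CHARGE_FIELDS = ("fare_amount", "surcharge", "mta_tax", "tip_amount", "tolls_amount")


def fare_is_consistent(row, tolerance=Decimal("0.01")):
    """Check that total_amount matches the sum of the individual charges."""
    try:
        # Decimal avoids binary floating-point rounding on currency values.
        charges = sum(Decimal(row[name]) for name in CHARGE_FIELDS)
        total = Decimal(row["total_amount"])
    except (KeyError, TypeError, InvalidOperation):
        return False  # missing or unparseable amounts
    return abs(total - charges) <= tolerance


def join_key(row):
    """Assumed key for matching a fare row to its trip row within one year."""
    return (row["medallion"], row["hack_license"], row["pickup_datetime"])
```

Since the IDs change each year, any join based on `join_key` is only meaningful between trip and fare files from the same year.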

Table 2. A small subset of the New York City taxi fare data. Each row corresponds to an occupied taxi trip. Scroll sideways to view all columns.
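| medallion | hack_license | vendor_id | pickup_datetime | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 2010000001 | 2010000001 | VTS | 2010-01-01 00:00:00 | CAS | 34.1 | 0.5 | 0.5 | 0 | 0 | 35.1 |
| 2010000002 | 2010000002 | VTS | 2010-01-01 00:00:00 | CAS | 27.3 | 0.5 | 0.5 | 0 | 0 | 28.3 |
| 2010000003 | 2010000003 | VTS | 2010-01-01 00:00:00 | CAS | 6.9 | 0.5 | 0.5 | 0 | 0 | 7.9 |
| 2010000004 | 2010000004 | VTS | 2010-01-01 00:00:00 | Cre | 56.1 | 0.5 | 0.5 | 10 | 9.14 | 76.24 |
| 2010000005 | 2010000005 | VTS | 2010-01-01 00:00:00 | CAS | 14.5 | 0.5 | 0.5 | 0 | 0 | 15.5 |
| 2010000006 | 2010000006 | VTS | 2010-01-01 00:00:00 | CAS | 27.7 | 0.5 | 0.5 | 0 | 0 | 28.7 |
| 2010000007 | 2010000007 | VTS | 2010-01-01 00:00:00 | CAS | 13.3 | 0.5 | 0.5 | 0 | 0 | 14.3 |
| 2010000008 | 2010000008 | VTS | 2010-01-01 00:00:00 | Cre | 25.7 | 0.5 | 0.5 | 0 | 4.57 | 31.27 |
| 2010000009 | 2010000009 | CMT | 2010-01-01 00:00:00 | Cas | 12.5 | 0.5 | 0.5 | 0 | 0 | 13.5 |
| 2010000010 | 2010000010 | CMT | 2010-01-01 00:00:02 | Cas | 31.7 | 0.5 | 0.5 | 0 | 0 | 32.7 |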

A note on anonymity. The published datasets on this site have been pseudo-anonymized to obscure personally identifiable information. It is well known that location data is notoriously difficult to anonymize; see, for example, the works of Marco Gruteser or John Krumm. Moreover, a subset of the raw dataset obtained via a FOIL request (published by Chris Whong) has already been de-anonymized by Vijay Pandurangan. Because the true IDs can still be recovered with a FOIL request and by following the techniques described in the above links, we only aim to make recovering the true IDs slightly more work than writing a new FOIL request to NYCT&L.

How we pseudo-anonymized the datasets. The medallion and hack license IDs were pseudo-anonymized by assigning randomly generated IDs, instead of using the hashed medallion and hack license values provided by NYCT&L. Each year contains a new set of medallion and hack license IDs. This means it is possible to track a driver through all of 2010, but it is NOT possible to track the same driver in 2011, for example. We are not able to provide a consistent medallion or hack license ID across the complete dataset because the 2013 data has already been de-anonymized, and doing so would trivially compromise the remaining data. Finally, the dataset may still be vulnerable to statistical or other attacks to recover the IDs, and thus we do not claim it is anonymous.
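The general idea of per-year random reassignment can be illustrated with a short sketch. This is not the code used to produce the published files, whose exact procedure is not described here. The function name `build_id_map` and the ID layout (the year followed by a six-digit counter, inferred from sample values such as 2010000001) are assumptions for illustration.

```python
import random


def build_id_map(original_ids, year, rng=None):
    """Map each original ID seen in one year to a random, year-prefixed ID.

    A separate map is built per year, so the same original ID receives
    unrelated replacement IDs in different years.
    """
    # SystemRandom draws from the OS entropy source; pass a seeded
    # random.Random only for reproducible tests.
    rng = rng or random.SystemRandom()
    unique = sorted(set(original_ids))
    rng.shuffle(unique)  # random order decides which ID gets which number
    # Assumed output format: year followed by a zero-padded six-digit counter.
    return {orig: f"{year}{index:06d}" for index, orig in enumerate(unique, start=1)}
```

The mapping must be discarded after use. Anyone who holds it can reverse the assignment, and, as noted above, random reassignment alone does not protect against statistical attacks on the location data.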

We ultimately decided to publish this dataset in an effort to make our own research reproducible, and to aid other researchers interested in taxi operations.

The NYCT&L does not restrict publishing the data, as determined from personal communication with the Commission. “The data was disclosed pursuant to the NYS Freedom of Information Law, therefore there is no licensing restriction on your publication of the data.”

Moreover, the University of Illinois at Urbana-Champaign Institutional Review Board reviewed our request to publish this dataset. “Since you received this information via the Freedom of Information Law, and will be analyzing trip and fare data, you are not considered interacting or intervening with human subjects, therefore, it has been determined that this project as described does not meet the definition of human subjects research as defined in 45CFR46(d)(f) or at 21CFR56.102(c)(e) and determined publication does not constitute human subjects research.”

 

 

2008 Mobile Century experiment GPS trajectory data

Mobile Century was an experiment run at UC Berkeley to test the potential to use GPS data to estimate traffic conditions. The dataset contains 8 hours of GPS trajectory data from 100 vehicles on a ~10 mile stretch of I-880 in California, as well as inductive loop detector data from PeMS, and travel times recorded by license plate recognition. The dataset remains one of the most comprehensive public GPS datasets for traffic monitoring research.

The key reference paper for the dataset is:

J.-C. Herrera, D. Work, J. Ban, R. Herring, Q. Jacobson, and A. Bayen. “Evaluation of traffic data obtained via GPS-enabled Mobile Phones: the Mobile Century experiment.” Transportation Research Part C, 18(3), pp. 568–583, 2010. DOI: 10.1016/j.trc.2009.10.006. Download: preprint, manuscript. Most Cited Transportation Research Part C: Emerging Technologies Article Since 2008 (June 2013).

Download the data here: http://traffic.berkeley.edu/project/downloads/mobilecenturydata