Stat 425 Final Project (Fall 2015)

This exam is a take-home project. Write your answers as a report of at most 15 pages, including figures and tables. A longer report is not necessarily a better one.

Bike Sharing Demand

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different location on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. The data generated by these systems make them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded.

You are asked to analyze the data from the Capital Bikeshare program in Washington, D.C. to answer the questions below. The data, the evaluation metric, and other relevant information can be found on Kaggle.
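For checking predictions locally before submitting, the following is a minimal sketch of the evaluation metric. It assumes the metric is the root mean squared logarithmic error (RMSLE) used by the Kaggle Bike Sharing Demand competition; confirm this on the competition page. The function name `rmsle` is illustrative, not taken from any posted code.

```r
# Root mean squared logarithmic error (RMSLE).
# Assumption: this matches the metric stated on the Kaggle competition page.
rmsle <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted),
            all(actual >= 0), all(predicted >= 0))
  # log1p(x) computes log(1 + x), so counts of zero are handled safely
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Perfect predictions give an error of 0
rmsle(c(10, 20, 30), c(10, 20, 30))
```

Because the metric works on the log scale, it penalizes relative rather than absolute errors, which may be worth keeping in mind when choosing a response transformation.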

What do you need to include in your report?

  • Section 1: Introduction
    Provide a brief introduction to the goal of this final project. What is it about? Where did you get the data? What is the background of the data?
  • Section 2: Exploratory Data Analysis
    Include some graphical displays of the data. Also comment on any patterns or characteristics of the data that you find interesting or relevant to your later analysis. Consider adding some new variables related to time, such as hour, wday (day of week), year, and month. Provide a brief explanation/summary of the variables you plan to include in your analysis.

    • Which variables are categorical and which are numerical?
    • Should we keep wday or workingday in our analysis?
    • Should we keep season or month in our analysis?
    • Should we treat hour as numerical or categorical? If the latter, should we use all 24 levels or merge them into a smaller number of levels?
    • For categorical variables, should we include any interactions?
    • For numerical variables, is there any evidence supporting nonlinear trends?
  • Section 3: Method
    You are required to build three prediction models. For each of the three methods, you should obtain a score by submitting your predictions on Kaggle. Report these scores in your report. For each method, include a description of the methodology, and a description of the implementation if the implementation is not trivial.
    No Peeking into the Future: For each test sample, we are only allowed to use information up to the time stamp of that test sample.

    • Section 3.1: Start with a simple model, a model that doesn’t require much training. For example, in my R code, I predict the counts based on the average counts over the same month, the same wday, and the same hour. Of course, you can build your own simple model.
    • Section 3.2: Predict with linear regression models…
    • Section 3.3: Predict with random forests
    • Section 3.4 (Optional): You can also try some other methods.
  • Section 4: Discussions
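As a starting point for Sections 2 and 3.1, here is a minimal sketch in base R of deriving the suggested time variables and of a simple averaging model that respects the no-peeking rule. This is not the instructor's code. The toy data, the column names `datetime` and `count`, and the helper names `add_time_vars` and `predict_simple` are all assumptions made for illustration; check them against the actual Kaggle files.

```r
# Toy data standing in for the Kaggle training file.
# Assumption: timestamps are strings in "YYYY-MM-DD HH:MM:SS" format.
train <- data.frame(
  datetime = c("2011-01-01 08:00:00", "2011-01-08 08:00:00",
               "2011-01-15 08:00:00", "2011-01-02 08:00:00"),
  count = c(10, 20, 30, 4),
  stringsAsFactors = FALSE
)

# Derive hour, wday, month, and year from the timestamp string.
add_time_vars <- function(df) {
  t <- as.POSIXlt(df$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  df$time  <- as.POSIXct(t)   # kept for comparing time stamps
  df$hour  <- t$hour
  df$wday  <- t$wday          # 0 = Sunday, ..., 6 = Saturday
  df$month <- t$mon + 1       # POSIXlt months run from 0 to 11
  df$year  <- t$year + 1900   # POSIXlt years count from 1900
  df
}

# Simple model: average count over earlier training rows that share the
# test row's month, wday, and hour. The strict "<" on time enforces the
# no-peeking rule. Returns NA when no earlier matching row exists.
predict_simple <- function(train, test_row) {
  past <- train[train$time  <  test_row$time &
                train$month == test_row$month &
                train$wday  == test_row$wday &
                train$hour  == test_row$hour, ]
  if (nrow(past) == 0) return(NA_real_)
  mean(past$count)
}

train <- add_time_vars(train)
test  <- add_time_vars(data.frame(datetime = "2011-01-22 08:00:00",
                                  stringsAsFactors = FALSE))

# Three earlier Saturdays at 8am have counts 10, 20, 30
pred <- predict_simple(train, test[1, ])
pred  # 20
```

How to handle test rows with no matching history (the `NA` case) is left open here; falling back to a coarser average is one option.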

What do you need to submit on Compass?

  • Submit the following on Compass (Assignment Dropbox) by midnight, December 16, 2015:
    • report (in pdf),
    • R code (in .R or .txt, or .pdf from R markdown) and
    • submission files in the format required by Kaggle (at least three)
  • Summarize your numerical results using tables/figures instead of listing R output in your report.
  • All the figures and tables (if there are any) must be labeled, and you should comment on the results displayed there in the main text.
  • Add comment lines in your R script so it’s easy for us (me and the TA) to follow, e.g., “# Generate figure 1 in Sec 2”, “# Model I: linear regression with the following variables ...... The corresponding prediction is in Submission File 1.” It may be a good idea to prepare your R script using R markdown. Please check the sample R markdown file I posted on the discussion board.

Rules

  • You are NOT allowed to discuss the exam with anyone else. If you have questions, please email me or post your question on the discussion board.
  • You are allowed to use online resources. A good place to start would be the Forum section and Script section on Kaggle. It’ll be a good idea to have an “Acknowledgment” Section at the end of your report where you acknowledge the author (or authors) of the online resources.
  • I did some initial analysis, and [here] is my code. If you find some useful resources online and want to share them with your classmates (although I don’t know why you would want to do that :-)), please post them on the discussion board.
  • You are NOT allowed to copy any sentences verbatim from others’ work (a paper, a blog, or a post on the Forum) into your report. You have to either paraphrase or cite the source. Check some websites on “how to avoid plagiarism”.