DATASCI 415

Course project

The project is intended to engage you in a non-trivial application of data mining and machine learning. You may complete the project in teams of up to 4 students. We expect larger teams to deliver a more substantial project than smaller teams. The deliverables are

  1. a project proposal due at noon ET on Fri, Oct 11,
  2. a draft report (for peer review) due at noon ET on Mon, Nov 25,
  3. a peer review due at noon ET on Mon, Dec 2, and
  4. a final report due at noon ET on Mon, Dec 9.

The overall grade on the final project is a combination of the grades on the deliverables: 10% project proposal, 10% peer review, 80% final report. Although the draft report does not contribute to the overall (project) grade, teams that do not submit a draft report will not be assigned a project to review (so they will receive zero credit for the peer review).

Picking a project

Here are some possible project ideas. They are intended to provide you with a sense of the appropriate scope for a project. You can pick one of the ideas here or come up with your own idea. We encourage graduate students in the course to combine the project with their research. If you come up with your own idea, you must get it approved by the course staff.

Netflix Prize: From 2006 to 2009, Netflix sponsored a competition to improve its movie recommendation system with a $1M grand prize. Their system is based on predicting what rating a user will give to a particular movie (on a 1-5 star scale). The data is a table in which the rows and columns correspond to users and movies respectively. Some entries are filled with past ratings, but most entries are unknown. The task is predicting the unknown entries of the table from the known entries (and any side information).
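
To give a sense of the prediction task, here is a minimal sketch of a simple baseline: predict an unknown rating as the global mean plus the average offset of that user and of that movie. This is only an illustrative starting point, not the method used by any Netflix Prize entry. The function names and the toy ratings are made up for this example.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit a mean-plus-offsets baseline from (user, movie, rating) triples."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    by_user = defaultdict(list)
    by_movie = defaultdict(list)
    for user, movie, r in ratings:
        by_user[user].append(r - global_mean)
        by_movie[movie].append(r - global_mean)
    # Average deviation from the global mean for each user and each movie.
    user_offset = {u: sum(d) / len(d) for u, d in by_user.items()}
    movie_offset = {m: sum(d) / len(d) for m, d in by_movie.items()}
    return global_mean, user_offset, movie_offset


def predict(model, user, movie):
    """Predict a rating; unseen users or movies get a zero offset."""
    global_mean, user_offset, movie_offset = model
    raw = global_mean + user_offset.get(user, 0.0) + movie_offset.get(movie, 0.0)
    return min(5.0, max(1.0, raw))  # clip to the 1-5 star scale


# Toy known entries of the user-by-movie table (hypothetical data).
known = [("u1", "m1", 5), ("u1", "m2", 4), ("u2", "m1", 3), ("u3", "m2", 2)]
model = fit_baseline(known)
print(predict(model, "u2", "m2"))  # an entry that is unknown in the table
```

A project on this task would compare stronger methods (e.g. matrix factorization) against a baseline of this kind on held-out ratings.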

Deanonymizing Netflix Prize data: In light of the success of the Netflix Prize, Netflix planned a second competition with trickier problems and even more detailed consumer data. However, they canceled the competition after Narayanan & Shmatikov showed that it is possible to identify subscribers in the (anonymized) Netflix Prize dataset by combining the data with the Internet Movie Database (IMDB) user profiles. The task is replicating Narayanan & Shmatikov’s results.

Energy demand forecasting: The IEEE Power and Energy Society sponsored a competition to predict the total energy load for a US utility across multiple zones, as well as the sum of the energy load across all these zones, based on temperature data. (Note that the labels for the weeks meant to be predicted in the original contest are no longer available, so you should choose a random subset of other weeks to use as a test set.)
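
One way to hold out a random subset of weeks is sketched below. It assumes each record carries a week identifier; the `week` and `load` field names, the 20% test fraction, and the toy records are assumptions for illustration and do not reflect the actual competition files. Whole weeks are held out so that no test week contributes records to the training set.

```python
import random


def split_by_week(records, test_fraction=0.2, seed=0):
    """Hold out whole weeks, chosen at random, as the test set.

    records: list of dicts with a 'week' key (hypothetical schema).
    """
    weeks = sorted({r["week"] for r in records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_test = max(1, round(test_fraction * len(weeks)))
    test_weeks = set(rng.sample(weeks, n_test))
    train = [r for r in records if r["week"] not in test_weeks]
    test = [r for r in records if r["week"] in test_weeks]
    return train, test, test_weeks


# Toy data: 10 weeks with 7 daily records each (made-up load values).
records = [{"week": w, "day": d, "load": 100.0 + w + d}
           for w in range(10) for d in range(7)]
train, test, test_weeks = split_by_week(records)
print(sorted(test_weeks), len(train), len(test))
```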

Human activity recognition: Human activity recognition is the problem of classifying sequences of accelerometer data recorded by smartphones into known, well-defined movements. The dataset was collected from 30 subjects performing different activities with a smartphone attached to their waists. The goal is to classify sequences of sensor data into human activities such as walking, walking upstairs, walking downstairs, sitting, standing and laying.
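
A common first step for this kind of task is to cut each sensor sequence into fixed-length windows and summarize each window with a few statistics that a classifier can use. The sketch below assumes a single-axis signal stored as a list of floats; the window size, step, and choice of features are illustrative assumptions, not the preprocessing used to build the dataset.

```python
from statistics import mean, pstdev


def sliding_windows(signal, size, step):
    """Split a 1-D signal into (possibly overlapping) fixed-length windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def window_features(window):
    """Summarize one window by its mean and (population) standard deviation."""
    return (mean(window), pstdev(window))


# Toy single-axis accelerometer trace (made-up values).
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
windows = sliding_windows(signal, size=4, step=2)  # 50% overlap
features = [window_features(w) for w in windows]
print(len(windows), features[0])
```

Each feature tuple would then be paired with the activity label of its window and passed to a classifier.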

Forecasting stock price trends with market sentiment data: The task is predicting stock price trends from market sentiment data. Stock data (e.g. price history, trading volume, earnings, etc.) is available from Google/Yahoo Finance, and market sentiment data is available from a variety of sources:

Improving LLM reasoning capabilities via reasoning chains: OpenAI’s latest o1 model has impressive reasoning capabilities (see OpenAI’s o1 introduction) because it generates a long internal chain of thought before generating its (final) output. In this project, you will investigate the efficacy of such reasoning chains in improving the reasoning capabilities of other LLMs. There are some open-source efforts in this area (see, e.g., g1), and you should build on their (promising) results.

Project proposal

The proposal is due at noon ET on Fri, Oct 11. It should be no more than 2 pages in NeurIPS format (excluding the contributions section, references, and any appendices). It is intended to get you started on the project and solicit feedback from the course staff. The proposal must include

  1. the title of the project,
  2. the names of the team members,
  3. a description of the task and your (tentative) approach: What is the problem you are tackling? What dataset(s) and algorithm(s) will you use? What are some expected challenges? If your project entails data collection, describe the data collection protocol. What methods/metrics will you use to evaluate the performance of your approach? If you are working on a well-studied problem, describe baseline methods that you will compare against.
  4. a review of the relevant literature (see guidelines for the related work section in the final report below),
  5. a to-do list for the draft report: if your project entails data collection, then we expect you to have collected all the data; if your project uses pre-processed data (e.g. from Kaggle), then we expect experimental results (e.g. performance on baselines).

Please follow the guidelines for mathematical writing.

Draft report and peer review

The draft report is due at noon ET on Mon, Nov 25. It is intended to solicit feedback from your classmates, so it should be close to the final report. We expect the draft report to include some experimental results demonstrating the efficacy of the approach on the task. Please follow the guidelines for mathematical writing.

The peer review process is double-blind; i.e. the reviewer(s) are hidden from the author(s) and vice versa. Thus the draft report must be anonymous; i.e. do not include team member names in the draft report. The peer review of your assigned project is due at noon ET on Mon, Dec 2. It is intended to provide constructive feedback to your classmates. The review should include the following sections (adapted from NeurIPS reviewer guidelines):

  1. Summary: Briefly summarize the paper and its contributions. This is not the place to critique the paper; the authors should generally agree with a well-written summary.
  2. Strengths and Weaknesses: Please provide a thorough assessment of the strengths and weaknesses of the paper, touching on each of the following aspects:
    • originality: Are the tasks or methods new? Is the work a novel combination of well-known techniques? (This can be valuable!) Is it clear how this work differs from previous contributions? Is related work adequately cited?
    • quality: Is the submission technically sound? Are claims well supported (e.g., by theoretical analysis or experimental results)? Are the methods used appropriate? Is this a complete piece of work or work in progress? Are the authors careful and honest about the strengths and weaknesses of their work?
    • clarity: Is the submission clearly written? Is it well organized? (If not, please make constructive suggestions for improving its clarity.) An expert reader should be able to easily reproduce the results in a well-written paper.
  3. Questions: Please list any questions and suggestions for the authors. Think of the things where a response from the author can change your opinion, clarify a confusion or address a limitation.
  4. Limitations: Have the authors adequately addressed the limitations and potential negative societal impact of their work? If not, please include constructive suggestions for improvement.

Final report

The final report is due at noon ET on Mon, Dec 9. It should be no more than 8 pages in NeurIPS format (excluding the contributions section, references, and any appendices). The report should include (but is not limited to) the following sections (adapted from Stanford’s CS 229 final report guidelines):

  1. Introduction (0.5 to 1 page): Explain the problem and why it is important. Clearly state what the inputs and outputs are (e.g. our algorithm accepts a histopathological image as input and predicts whether the central region contains any tumor tissue).
  2. Related work (0.5 to 1 page): You should find relevant papers, group them into categories based on their approaches, discuss their strengths and weaknesses, and compare them with your approach. Which approaches were clever/good? What is the state-of-the-art? You should cite at least a dozen relevant papers. Google Scholar is very useful for finding relevant papers.
  3. Dataset and features (0.5 to 1 page): Describe the dataset you are using. How many training/validation/test examples do you have? Did you preprocess the data in any way? What features did you extract? Space permitting, show some examples from your dataset.
  4. Methods (1 to 2 pages): Describe your learning pipeline, including any algorithm(s). Make sure to include relevant mathematical details. For each algorithm, give a short description (1 to 2 paragraphs) of how it works. If you are using cutting edge or niche algorithms (or any algorithm not covered in class), provide enough detail so that your classmates can understand the algorithm. You should also describe how you chose (hyper)parameters (e.g. what was your mini-batch size and why).
  5. Experiments/Results (1 to 2 pages): Present your results with a mixture of tables and plots. For example, if you are solving a classification problem, you should include a confusion matrix or ROC/precision-recall curves (with the corresponding AUC/AUPRC values). Your figures should include legends and axis labels, and have font sizes that are legible when printed. Make sure to describe (mathematically if necessary) any metrics you report and refer to any figures/tables in your main text. You should have both quantitative and qualitative results.
  6. Conclusion/Discussion (1 to 2 pages): Summarize your report and reiterate the main points. What worked and what didn’t work (and why)? Discuss the advantages and disadvantages of your method (e.g. provide examples of where your algorithm failed/succeeded). For future work, how can the method be improved?
  7. Contributions: If you are working on a project as part of a team, you must also include a contributions section at the end of the report (where acknowledgements usually appear) describing the contributions of the team members. If there are discrepancies among the contributions of the team members, the grades will be adjusted. This section does not count towards the page limit.
  8. References: Include citations for: (1) any papers mentioned in the related work section, (2) papers describing algorithms that you used which were not covered in class, (3) references for datasets and software. Any reference format that includes author(s), title, conference/journal, year is acceptable.
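
For the classification metrics mentioned under Experiments/Results, the following is a minimal sketch of how a confusion matrix and the area under the ROC curve can be computed from scratch. It is meant to make the definitions concrete; in practice you would likely use a library implementation. The function names and the toy labels and scores are made up for this example.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true label, columns index the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive example receives a
    higher score than a random negative example (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels, predictions, and scores).
cm = confusion_matrix(["walk", "sit", "walk", "sit"],
                      ["walk", "walk", "walk", "sit"],
                      labels=["walk", "sit"])
auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.1])
print(cm, auc)
```

The pairwise definition of AUC used here is quadratic in the number of examples, which is fine for illustration but slow on large test sets.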

You must submit, on Canvas, a PDF file of your project report and a repository (e.g. on GitHub) of properly commented code that reproduces any computer output in the project report. The report must be typeset. Submissions will be evaluated on three aspects: