Project organisation

Organising a project

It is helpful to have a consistent structure for research projects for a few reasons:

  • Code compatibility: A consistent structure makes it easier to find and run code across multiple projects (e.g., if code outputs are always saved to /results)
  • Familiarity: A consistent structure makes it easier for others (and yourself) to understand and navigate your project (e.g., if all data is always saved to /data)
  • Tidy code: A consistent structure can help you keep your code tidy and organised

Project structure

In general, a project should have the following structure:

├── data                    # Raw data files
├── figures                 # Figures generated by the code
├── models                  # Pre-trained models
├── notebooks               # Jupyter notebooks
├── results                 # Processed data and final results
├── scripts                 # Scripts used to run analyses
├── tests                   # Unit tests
├── code                    # The code itself (will need to be named according to the project)
├── .gitignore
├── README.md
└── setup.py

Some projects may not need certain directories (e.g., models if you are not using pre-trained models), and some projects may need additional directories. However, this should form a good starting point for most projects.
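
As a rough illustration, the layout above could be created with a short helper script. This is only a sketch: the function name create_project and the placeholder package name my_project are assumptions made for the example, and the top-level files are created empty.

```python
from pathlib import Path

# Directory and file names taken from the layout described above
PROJECT_DIRS = ["data", "figures", "models", "notebooks", "results", "scripts", "tests"]
TOP_LEVEL_FILES = [".gitignore", "README.md", "setup.py"]


def create_project(root, package_name="my_project"):
    """Create the standard directories and empty top-level files under root.

    package_name is a placeholder for the project-specific code directory.
    """
    root = Path(root)
    for name in PROJECT_DIRS + [package_name]:
        (root / name).mkdir(parents=True, exist_ok=True)
    # An __init__.py lets the code directory be imported as a package
    (root / package_name / "__init__.py").touch()
    for name in TOP_LEVEL_FILES:
        (root / name).touch()
    return root
```

Calling create_project("new_study", package_name="new_study") would then give a starting point that can be trimmed or extended as the project requires.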

Description of folders

data

This directory should contain all raw data files. It is important to keep raw data separate from processed data to ensure that you can always go back to the original data if needed.
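
One way to keep raw and processed data apart is to only ever read from data and write cleaned output elsewhere (here, results). The sketch below assumes CSV input and a hypothetical cleaning rule (dropping rows with empty cells); the function name clean_csv is made up for the example.

```python
import csv
from pathlib import Path


def clean_csv(raw_path, out_dir):
    """Read a raw CSV, drop rows containing empty cells, and write the result to out_dir.

    The raw file is opened read-only and is never modified.
    """
    raw_path = Path(raw_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{raw_path.stem}_clean.csv"

    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            # Keep only rows where every cell has a value
            if row and all(cell.strip() for cell in row):
                writer.writerow(row)
    return out_path
```

For example, clean_csv("data/survey.csv", "results") would write results/survey_clean.csv and leave data/survey.csv untouched (survey.csv is a hypothetical file name).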

figures

This directory should contain all figures generated by the code. This can include plots, images, etc.

models

This directory should contain any pre-trained models that are used in the project. This is particularly useful if you are using machine learning models that take a long time to train, but can be saved and loaded quickly.
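
A minimal sketch of saving and loading a trained model from this directory is shown below. It assumes the model object can be serialised with Python's built-in pickle module; many machine learning libraries provide their own save formats, which are usually preferable. The function names save_model and load_model are invented for this example.

```python
import pickle
from pathlib import Path


def save_model(model, name, models_dir="models"):
    """Pickle a model object to <models_dir>/<name>.pkl and return the path."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / f"{name}.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path


def load_model(name, models_dir="models"):
    """Load a previously pickled model from <models_dir>/<name>.pkl."""
    path = Path(models_dir) / f"{name}.pkl"
    with path.open("rb") as f:
        return pickle.load(f)
```

Note that unpickling can execute arbitrary code, so only load model files from sources you trust.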

notebooks

This directory should contain all Jupyter notebooks used in the project. This can include exploratory data analysis, data cleaning, model training, etc.

results

This directory should contain all processed data and final results. This can include cleaned data, model predictions, etc.

scripts

This directory should contain all scripts used to run analyses. This can include Python scripts, shell scripts, etc.

tests

This directory should contain all unit tests for the code. This is particularly useful if you are developing a package, as you can ensure that all functions are working as expected. I would recommend using the pytest library for writing tests.
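
As an illustration, a test file might look like the sketch below. pytest collects functions whose names start with test_ from files named test_*.py and reports any failing assert. The normalise function is a hypothetical example defined inline so the snippet is self-contained; in a real project it would be imported from the project's code directory.

```python
# tests/test_normalise.py (hypothetical file name)


def normalise(values):
    """Scale values so that they sum to 1 (stand-in for a function from the package)."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [v / total for v in values]


def test_normalise_sums_to_one():
    # 1/4, 1/4 and 2/4 are exactly representable as floats
    assert normalise([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_normalise_rejects_zero_total():
    # Plain try/except keeps this sketch free of imports;
    # with pytest installed, pytest.raises(ValueError) is the usual idiom
    try:
        normalise([0, 0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Running pytest from the project root would then discover and run both tests.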

code

This directory should contain all the code for the project, organised as an importable Python package. It will need to be named according to the project (e.g., my_project).

For more information on using custom Python packages, see this tutorial