Contributing
We welcome contributions! There are many ways to help. For example, you can:
- Help us track bugs by filing issues
- Suggest and help prioritise new functionalities
- Develop a new functionality!
- Help us make the library as straightforward as possible, simply by asking questions about anything that does not seem clear to you.
Please do not hesitate to suggest functionalities you have developed and want to incorporate into eds-scikit. We will be glad to help! Non-technical contributions (e.g. lists of ICD-10 codes curated for a research project) are also welcome.
Development installation
To be able to run the test suite, run the example notebooks and develop your own functionalities, you should clone the repo and install it locally.
Spark and Java
To run tests locally, you need both Spark and Java. Spark will be installed as a dependency of PySpark, but you may need to install Java yourself. Please check the installation procedure.
# Clone the repository and change directory
$ git clone https://github.com/aphp/eds-scikit.git
$ cd eds-scikit
# Create a virtual environment
$ python -m venv venv
$ source venv/bin/activate
# Install dependencies and build resources
$ pip install -e ".[dev, doc]"
# And switch to a new branch to begin developing
$ git switch -c "name_of_my_new_branch"
To make sure the pipeline will not fail because of formatting errors, we added pre-commit hooks using the pre-commit Python library. To enable them, simply install the hooks:
$ pre-commit install
The pre-commit hooks defined in the configuration will automatically run when you commit your changes, letting you know if something went wrong.
The hooks only run on staged changes. To force-run them on all files, run:
$ pre-commit run --all-files
Proposing a merge request
At the very least, your changes should:
- Be well-documented;
- Pass every test, and preferably come with tests of their own;
- Follow the style guide.
Testing your code
We use Pytest to run the test suite.
The following command will run the test suite. Writing your own tests is encouraged!
python -m pytest ./tests
Most tests are designed to run with both Pandas and Koalas DataFrames as input. However, to save time, only Pandas testing is done by default. The above command is equivalent to
python -m pytest ./tests -m "not koalas"
However, you can also run tests using only Koalas input:
python -m pytest ./tests -m "koalas"
or using both inputs:
python -m pytest ./tests -m ""
Finally, when developing, you might want to run the tests of a single file, or even a single test function. To do so:
# Run every test in a single file
python -m pytest ./tests/my_file.py
# Run a single test function (note the double colon)
python -m pytest ./tests/my_file.py::my_test_function
Style Guide
We use Black to reformat the code. While other formatters only enforce PEP8 compliance, Black also makes the code uniform. In short:
Black reformats entire files in place. It is not configurable.
Moreover, the CI/CD pipeline enforces a number of checks on the "quality" of the code. In particular, code that is not Black-formatted will make the test pipeline fail. We use pre-commit to keep our codebase clean.
Refer to the development install tutorial for tips on how to format your files automatically. Most modern editors offer extensions that format files on save.
On conventional commits
We try to follow the conventional commits guidelines as much as possible. In short, prepend each commit message with one of the following prefixes:
- fix: when patching a bug
- feat: when introducing a new feature
If needed, you can also use one of the following: build:, chore:, ci:, docs:, style:, refactor:, perf:, test:
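As an illustration, the prefix rule above can be checked with a few lines of Python. This is only a sketch: the function name `has_conventional_prefix` is hypothetical and not part of eds-scikit or its pre-commit configuration. The optional `(scope)` and `!` markers are accepted because the Conventional Commits specification allows them.

```python
import re

# Prefixes listed in the guidelines above.
ALLOWED_PREFIXES = (
    "fix", "feat", "build", "chore", "ci",
    "docs", "style", "refactor", "perf", "test",
)

# "<type>: <description>", with an optional "(scope)" and "!" as allowed by
# the Conventional Commits specification.
_PATTERN = re.compile(
    r"^(" + "|".join(ALLOWED_PREFIXES) + r")(\([\w\-./]+\))?!?: \S"
)


def has_conventional_prefix(message: str) -> bool:
    """Return True if the first line of `message` starts with an allowed prefix.

    Hypothetical helper, written for illustration only.
    """
    stripped = message.strip()
    if not stripped:
        return False
    first_line = stripped.splitlines()[0]
    return _PATTERN.match(first_line) is not None
```

For instance, `has_conventional_prefix("fix: handle empty cohort")` is true, while a message such as `"Fixed a bug"` is rejected.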
Documentation
Make sure to document your improvements, both within the code with comprehensive docstrings, as well as in the documentation itself if need be.
We use MkDocs for eds-scikit's documentation. You can preview the changes you make with:
# Install the requirements
$ pip install ".[doc]"
# Run the documentation
$ mkdocs serve
Go to localhost:8000 to see your changes. MkDocs watches for changes in the documentation folder and automatically reloads the page.
Warning
MkDocs will automatically build the code documentation by going through every .py file located in the eds_scikit directory (and its subdirectories). It expects to find an __init__.py file in each directory, so make sure to create one if needed.
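To spot a missing __init__.py before building the documentation, a small script along these lines may help. It is a minimal sketch using only the standard library; the helper name `find_dirs_missing_init` is hypothetical and not shipped with eds-scikit.

```python
from pathlib import Path


def find_dirs_missing_init(package_root):
    """Return directories under `package_root` (included) that contain
    .py files but no __init__.py.

    Hypothetical helper, written for illustration only.
    """
    root = Path(package_root)
    # The root itself, followed by every subdirectory in a stable order.
    candidates = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    missing = []
    for directory in candidates:
        has_python_files = any(directory.glob("*.py"))
        if has_python_files and not (directory / "__init__.py").exists():
            missing.append(directory)
    return missing
```

Calling `find_dirs_missing_init("eds_scikit")` from the repository root returns the offending directories, or an empty list if every directory is a proper package.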
Developing your own methods
Even though the Koalas project aims to cover most Pandas functions for Spark, there are some discrepancies. For instance, the pd.cut() function has no Koalas alternative.
To ease development and switch efficiently between the two backends, we advise you to use the BackendDispatcher class and its collection of custom methods.
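The general idea behind backend dispatching can be sketched as follows. This toy example is NOT the BackendDispatcher API: the class `ToyDispatcher`, its methods and the "head" operation are all hypothetical, and two standard-library containers stand in for the Pandas and Koalas backends so that the snippet runs without any dependency. Refer to the BackendDispatcher documentation for the actual interface.

```python
from array import array
from collections import deque
from itertools import islice


class ToyDispatcher:
    """Hypothetical illustration only; NOT eds-scikit's BackendDispatcher."""

    def __init__(self):
        # Maps (method name, backend name) to the function implementing it.
        self._implementations = {}

    def register(self, method, backend):
        """Decorator registering `func` as `method` for a given backend."""
        def decorator(func):
            self._implementations[(method, backend)] = func
            return func
        return decorator

    @staticmethod
    def get_backend(obj):
        # The backend is identified by the top-level module of the object's type.
        return type(obj).__module__.split(".")[0]

    def call(self, method, obj, *args, **kwargs):
        backend = self.get_backend(obj)
        try:
            func = self._implementations[(method, backend)]
        except KeyError:
            raise NotImplementedError(
                f"{method!r} has no implementation for backend {backend!r}"
            ) from None
        return func(obj, *args, **kwargs)


toy = ToyDispatcher()


@toy.register("head", backend="array")
def _head_array(values, n):
    # array.array supports slicing directly.
    return values[:n].tolist()


@toy.register("head", backend="collections")
def _head_deque(values, n):
    # deque has no slicing, so this backend needs its own implementation.
    return list(islice(values, n))


print(toy.call("head", array("i", [1, 2, 3]), 2))  # [1, 2]
print(toy.call("head", deque([1, 2, 3]), 2))       # [1, 2]
```

The calling code stays the same whichever container it receives, and an unsupported backend fails loudly with a NotImplementedError instead of silently misbehaving.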