Table of contents:
Where to start?
Working with the code
Docstrings Guidelines
Writing tests
Contributing your changes to datasist
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
First-time contributors can find pending issues on the GitHub “Issues” page. A number of them are marked as "good first issues", which are a good place to start. Once you’ve found an interesting issue and have an improvement in mind, the next step is to set up your development environment.
Now that you have an issue you want to fix, enhancement to add, or documentation to improve, you need to learn how to work with GitHub and the datasist code base.
The datasist code is hosted on GitHub. To contribute you will need to sign up for a free GitHub account. We use Git for version control to allow many people to work together on this project.
Some great resources for learning Git:
GitHub's documentation has instructions for installing Git, setting up your SSH key, and configuring Git. These steps need to be completed before you can work seamlessly between your local repository and GitHub.
You will need your own fork to work on the code. Go to the datasist project page and hit the Fork button.
Next, you will clone your fork to your local machine:
git clone https://github.com/your-user-name/datasist.git
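This creates the directory datasist, with your fork configured as the origin remote. To also track the main project repository, add it as a second remote (conventionally named upstream) using git remote add.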
To test out code changes, you’ll need to build datasist from source, which requires a Python environment.
Next, you'll create an isolated datasist development environment:
Install either Anaconda or Miniconda
Make sure your conda is up to date (conda update conda)
Make sure that you have cloned the repository.
Follow the steps below:
cd datasist
python setup.py build
pip install -e .
import datasist
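If the import succeeds, it can still be worth confirming that Python is picking up your local clone rather than a previously installed copy. The snippet below is a minimal sketch of one way to check this with the standard library; the helper name package_location is made up for illustration and is not part of datasist.

```python
import importlib.util


def package_location(name):
    """Return the file Python would import `name` from, or None if it is not installed."""
    # Hypothetical helper for illustration only; not part of datasist.
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints None when datasist is not installed in the active environment.
print(package_location("datasist"))
```

After pip install -e . the printed path should normally point inside your cloned datasist directory. If it points into site-packages instead, an older non-editable install is probably taking precedence.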
Docstrings are an important part of coding, and we encourage you to write clear and concise docstrings for your functions, methods, and classes. The datasist documentation is generated automatically from these docstrings using the pdoc library. Some guidelines for writing docstrings are:
Sample docstrings:
def add_df(df1=None, df2=None):
    '''
    Add two dataframes together element-wise.

    Parameters:
    ------------
    df1: DataFrame, Series
        The first dataframe to add.
    df2: DataFrame, Series
        The second dataframe to add.

    Returns:
    ---------
    DataFrame: Element-wise sum of the two dataframes.
    '''
    # Check that both inputs were provided
    if df1 is None:
        raise ValueError('df1: Expected a DataFrame, got None')
    if df2 is None:
        raise ValueError('df2: Expected a DataFrame, got None')

    final_df = df1 + df2
    return final_df
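To make the behaviour of the sample concrete, here is a short usage sketch. It assumes pandas is installed and repeats a condensed add_df so that the snippet runs on its own; the frames a and b are made-up data. Note that the + operator adds values element-wise after aligning on index and column labels. It does not concatenate the frames.

```python
import pandas as pd


def add_df(df1=None, df2=None):
    """Condensed copy of the sample function: element-wise sum of two dataframes."""
    if df1 is None:
        raise ValueError('df1: Expected a DataFrame, got None')
    if df2 is None:
        raise ValueError('df2: Expected a DataFrame, got None')
    return df1 + df2


# Made-up example data with matching index and column labels.
a = pd.DataFrame({"x": [1, 2], "y": [10, 20]})
b = pd.DataFrame({"x": [3, 4], "y": [30, 40]})

total = add_df(a, b)
print(total["x"].tolist())  # [4, 6]
print(total["y"].tolist())  # [40, 60]

# A missing argument is reported with the name of the offending parameter.
try:
    add_df(a, None)
except ValueError as err:
    print(err)  # df2: Expected a DataFrame, got None
```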
We strongly encourage contributors to write tests for their code contributions. Like many packages, datasist uses pytest.
All tests should go into the tests subdirectory and be placed in the test file for the corresponding module. This folder contains some current examples of tests, and we suggest looking to these for inspiration.
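To compare DataFrame or Series objects, you can use the pandas.testing module (older pandas versions exposed the same helpers as pandas.util.testing, which was deprecated in pandas 1.0 and removed in pandas 2.0). The easiest way to verify that your code is correct is to explicitly construct the result you expect (expected), then compare it to the actual result (output).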
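As a minimal sketch of that pattern, the example below builds the expected frame by hand and compares it to the output with assert_frame_equal, imported from pandas.testing (its public location in current pandas releases). The function double_sizes and its data are hypothetical, made up only for illustration; they are not part of datasist.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def double_sizes(df):
    # Hypothetical function under test: returns a copy with the 'size' column doubled.
    out = df.copy()
    out["size"] = out["size"] * 2
    return out


def test_double_sizes():
    data = pd.DataFrame({"country": ["Nigeria", "Ghana"], "size": [280, 20]})
    expected = pd.DataFrame({"country": ["Nigeria", "Ghana"], "size": [560, 40]})
    output = double_sizes(data)
    # Raises AssertionError with a readable diff if values, dtypes, or labels differ.
    assert_frame_equal(output, expected)
```

Comparing whole frames this way also checks dtypes and index labels, which a plain comparison of column names would miss.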
Here we show an example of a test case we wrote for the drop_redundant function in the feature_engineering module. This test is placed in the test_feature_engineering.py file inside the tests folder.
...
def drop_redundant(data):
    '''
    Removes features with the same value in all cells. Also drops a feature
    if NaN is its second unique class.

    Parameters:
    -----------------------------
    data: DataFrame or named series.

    Returns:
    DataFrame or named series.
    '''
    if data is None:
        raise ValueError("data: Expecting a DataFrame/ numpy2d array, got 'None'")

    # Get the columns to drop
    cols_2_drop = _nan_in_class(data)
    print("Dropped {}".format(cols_2_drop))
    df = data.drop(cols_2_drop, axis=1)
    return df
The corresponding test for the function above is:
import pytest
from datasist import feature_engineering
import pandas as pd
import numpy as np

# 'language' holds a single value plus NaN, so it should be dropped
df = pd.DataFrame({'country': ['Nigeria', 'Ghana', 'USA', 'Germany'],
                   'size': [280, 20, 60, np.nan],
                   'language': ['En', 'En', 'En', np.nan]})


def test_drop_redundant():
    expected = ['country', 'size']
    output = list(feature_engineering.drop_redundant(df).columns)
    assert expected == output
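The drop_redundant function also raises a ValueError when it receives None, and that path deserves a test of its own. The sketch below shows one way to write such a test with pytest.raises. Here drop_redundant_stub is a hypothetical stand-in that only mimics the None check, so the snippet runs without datasist installed; in a real test you would call feature_engineering.drop_redundant instead.

```python
import pytest


def drop_redundant_stub(data):
    # Hypothetical stand-in: reproduces only the None check of drop_redundant.
    if data is None:
        raise ValueError("data: Expecting a DataFrame/ numpy2d array, got 'None'")
    return data


def test_drop_redundant_rejects_none():
    # Passes only if a ValueError with a matching message is raised inside the block.
    with pytest.raises(ValueError, match="got 'None'"):
        drop_redundant_stub(None)
```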
To run the tests, open a command prompt in the root folder of the project (the folder that contains the tests subdirectory) and run the following command:
pytest tests
Learn more about pytest in the official pytest documentation.
Once you’ve made changes, you can see them by typing:
git status
Next, you can stage your changes so that Git tracks them, using:
git add .
Next, you commit changes using:
git commit -m "Enter any commit message here"
When you want your changes to appear publicly on your GitHub page, you can push to your forked repo with:
git push
If everything looks good, you are ready to make a pull request. A pull request is how code from a local repository becomes available to the GitHub community and can be reviewed and eventually merged into the master version. To submit a pull request:
Navigate to your repository on GitHub
Click on the Pull Request button
Write a description of your changes in the Preview Discussion tab
Click Send Pull Request.
This request then goes to the repository maintainers, who will review the code and, if everything looks good, merge it into master.
Hooray! You’re now a contributor to datasist. Now go bask in the euphoria!