Contributing to datasist¶

Table of contents:

Where to start?
Working with the code
- Version control, Git, and GitHub
- Getting started with Git
- Forking
- Creating a development environment
Docstrings Guidelines
Writing tests
- Using pytest
- Running the test suite

Contributing your changes to datasist
- Committing your code
- Pushing your changes
- Review your code and finally, make the pull request

Where to start?¶

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

First-time contributors can find pending issues under the GitHub “Issues” tab. Several of them are labelled "good first issue", which makes them a good place to start. Once you’ve found an interesting issue and have an improvement in mind, the next step is to set up your development environment.

Working with the code¶

Now that you have an issue you want to fix, enhancement to add, or documentation to improve, you need to learn how to work with GitHub and the datasist code base.

Version control, Git, and GitHub¶

The datasist code is hosted on GitHub. To contribute you will need to sign up for a free GitHub account. We use Git for version control to allow many people to work together on this project.

Some great resources for learning Git:

Official GitHub pages.

Getting started with Git¶

Follow the instructions for installing Git, setting up your SSH key, and configuring Git. Complete these steps before you start, so that you can work seamlessly between your local repository and GitHub.

Forking the datasist repo¶

You will need your own fork to work on the code. Go to the datasist project page and hit the Fork button.

Next, you will clone your fork to your local machine:


    git clone https://github.com/your-user-name/datasist.git

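This creates the directory datasist, with your fork configured as the origin remote. Cloning alone does not connect your local repository to the upstream (main project) repository; to track it as well, add it as a remote with git remote add upstream followed by the main repository's URL.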

Creating a development environment¶

To test out code changes, you’ll need to build datasist from source, which requires a Python environment.

Creating a Python environment¶

Next, you'll create an isolated datasist development environment:

- Install either Anaconda or Miniconda.
- Make sure your conda is up to date (conda update conda).
- Make sure that you have cloned the repository.

Follow the steps below:

cd to the datasist source directory

    cd datasist

Build and install datasist

    python setup.py build
    pip install -e .

Test that datasist was successfully installed by running the following in a Python interpreter:

    import datasist

If there is no error after the previous command, then you're ready to start contributing. Now you can fire up your favorite IDE and start implementing your changes.
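If you prefer to script this check, the sketch below asks Python's import system whether a package can be found without actually importing it. The helper name is_installed is made up for this example and is not part of datasist.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the import system can locate the named top-level package."""
    # find_spec returns None when no such top-level package exists.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("datasist"):
        print("datasist is installed")
    else:
        print("datasist was not found; re-run 'pip install -e .' from the source directory")
```

This only confirms that the package is discoverable. Running import datasist remains the more thorough check, because it also executes the package's own import-time code.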

Docstrings Guidelines¶

Docstrings are an important part of coding, and we encourage you to write clear and concise docstrings for your functions, methods, and classes. The datasist documentation is generated automatically from these docstrings using the pdoc library. Some guidelines for writing docstrings are:

- Define what the function does.
- Define all parameter types and what they do.
- State the return values.
- Use the correct spacing and indentation, as this affects the documentation generated by pdoc.

Sample docstrings:

    def add_df(df1=None, df2=None):
        '''
        A function to add two dataframes together.

        Parameters:
        ------------
        df1: DataFrame, Series
            The first dataframe to add.

        df2: DataFrame, Series
            The second dataframe to add.

        Returns:
        ---------
            DataFrame: Element-wise sum of the two dataframes
        '''

        # check that both dataframes were provided
        if df1 is None:
            raise ValueError('df1: Expected a DataFrame, got None')

        if df2 is None:
            raise ValueError('df2: Expected a DataFrame, got None')

        final_df = df1 + df2

        return final_df
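To see why spacing and indentation matter, it can help to look at a docstring the way a tool reads it. The sketch below uses the standard library's inspect.getdoc, which returns the docstring with its common leading indentation removed. The function scale is a made-up example, and pdoc's own processing may differ in detail.

```python
import inspect


def scale(values, factor=2):
    '''
    Multiply every value by a factor.

    Parameters:
    ------------
    values: list of numbers
        The values to scale.

    factor: int or float, default 2
        The multiplier applied to each value.

    Returns:
    ---------
        list: The scaled values.
    '''
    return [v * factor for v in values]


# getdoc strips the shared indentation, so the relative indentation inside
# the docstring is what survives into the generated text.
doc = inspect.getdoc(scale)
print(doc)
```

Printing the cleaned docstring before you commit is a quick way to spot a misaligned parameter description.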

Writing tests¶

We strongly encourage contributors to write tests for their code contributions. Like many packages, datasist uses pytest.

All tests should go into the tests subdirectory and be placed in the test file for the corresponding module. This folder contains some current examples of tests, and we suggest looking to these for inspiration.

To compare DataFrame or Series objects, you can use the pandas.testing module (named pandas.util.testing in older pandas releases). The easiest way to verify that your code is correct is to explicitly construct the result you expect (expected), then compare it to the actual result (output).
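As a minimal sketch of the expected-versus-output pattern, the test below compares two DataFrames with pandas.testing.assert_frame_equal. The add_df function here is a small stand-in written for this example (an element-wise sum), not code taken from datasist.

```python
import pandas as pd
import pandas.testing as tm


def add_df(df1, df2):
    """Illustrative stand-in for the function under test: element-wise sum."""
    return df1 + df2


def test_add_df():
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    df2 = pd.DataFrame({"a": [10, 20], "b": [30, 40]})

    # Construct the expected result by hand.
    expected = pd.DataFrame({"a": [11, 22], "b": [33, 44]})
    # Compute the actual result from the function under test.
    output = add_df(df1, df2)

    # Raises AssertionError describing the difference if the frames differ.
    tm.assert_frame_equal(output, expected)
```

Unlike a plain == comparison, which returns a DataFrame of booleans, assert_frame_equal checks values, column labels, index, and dtypes in a single call.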

Using pytest¶

Here we show an example of a test case we wrote for the drop_redundant function in the feature_engineering module. This test is placed in the test_feature_engineering.py file inside the tests folder.

...

    def drop_redundant(data):
        '''
        Removes features with the same value in all cells. Also drops a feature if NaN is its second unique class.

        Parameters:
        -----------------------------
            data: DataFrame or named series.

        Returns:
        -----------------------------
            DataFrame or named series.
        '''

        if data is None:
            raise ValueError("data: Expecting a DataFrame/ numpy2d array, got 'None'")

        # get the columns to drop
        cols_2_drop = _nan_in_class(data)
        print("Dropped {}".format(cols_2_drop))
        df = data.drop(cols_2_drop, axis=1)
        return df

The corresponding test for the function above is:

    import pytest
    from datasist import feature_engineering
    import pandas as pd
    import numpy as np

    # 'language' holds a single value plus NaN, so it should be dropped
    df = pd.DataFrame({'country': ['Nigeria', 'Ghana', 'USA', 'Germany'],
                       'size': [280, 20, 60, np.nan],
                       'language': ['En', 'En', 'En', np.nan]})


    def test_drop_redundant():
        expected = ['country', 'size']
        output = list(feature_engineering.drop_redundant(df).columns)
        assert expected == output
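It is also worth testing the error path. The sketch below uses pytest.raises to check that a None input is rejected. To keep the example self-contained it tests drop_redundant_stub, a hypothetical stand-in that copies only the None check from drop_redundant; in a real test you would call feature_engineering.drop_redundant instead.

```python
import pytest


def drop_redundant_stub(data):
    """Hypothetical stand-in that mirrors only the None check shown above."""
    if data is None:
        raise ValueError("data: Expecting a DataFrame/ numpy2d array, got 'None'")
    return data


def test_drop_redundant_rejects_none():
    # The with-block passes only if a ValueError is raised and its message
    # matches the given pattern.
    with pytest.raises(ValueError, match="got 'None'"):
        drop_redundant_stub(None)
```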

Running the test case¶

To run the tests, open a command prompt in the datasist source directory (the one that contains the tests folder) and run the following command:

    pytest tests

Learn more in the official pytest documentation.

Contributing your changes to datasist¶

Committing your code¶

Once you’ve made changes, you can see them by typing:

    git status

Next, stage your changes so that Git tracks them:

    git add .

Then commit the staged changes with a short message describing what you changed:

    git commit -m "Enter any commit message here"

Pushing your changes¶

When you want your changes to appear publicly on your GitHub page, push them to your forked repo with:

    git push

Review your code and finally, make the pull request¶

If everything looks good, you are ready to make a pull request. A pull request is how code from a local repository becomes available to the GitHub community and can be reviewed and eventually merged into the master version. To submit a pull request:

- Navigate to your repository on GitHub.
- Click on the Pull Request button.
- Write a description of your changes in the Preview Discussion tab.
- Click Send Pull Request.

This request then goes to the repository maintainers, who will review the code and, if everything looks good, merge it into master.

Hooray! You're now a contributor to datasist. Now go bask in the euphoria!