Predicting Pump functionality of water points in Tanzania

Photo by Pexels Boss from Pexels

Access to clean water is a necessity for life. As in many low-income nations, people in Tanzania have difficulty accessing clean and sanitary water. Groundwater wells are a major source of water for many people in Tanzania; however, many of the water points have become non functional or do not provide clean water.

We can examine the functionality of these water points to find out what causes them to become non functional and whether any actions can be taken to keep them operating optimally. My analysis will attempt to predict the functional status of each water point to identify which are in need of repair.

My goal is to prioritize non functional water points, so I will simplify this to a binary problem by combining “functional” and “functional needs repair” into a single class.

Around 38% of water points are non functional.

Some features have missing values, which I will address while exploring. But first, based on the feature descriptions, some features may contain similar information.

Feature engineering

While exploring other features, I found some that could benefit from modification.

Funder and installer both had a high number of unique values. I decided to keep only the 10 most frequent values of each and assign everything else to “other”.
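
The grouping of rare funder and installer values could be sketched roughly as follows. This is a minimal pandas example, not the original code: the helper name `keep_top_n`, the toy data, and the lowercase label "other" are assumptions made for illustration.

```python
import pandas as pd


def keep_top_n(series: pd.Series, n: int = 10, other: str = "other") -> pd.Series:
    """Keep the n most frequent values of a column and relabel the rest."""
    # value_counts() ignores missing values, so NaN never ends up in the top n
    top_values = series.value_counts().nlargest(n).index
    # where() keeps entries that satisfy the condition and replaces the others
    return series.where(series.isin(top_values), other)


# Toy data standing in for the real funder column (values are made up)
df = pd.DataFrame({"funder": ["a", "a", "a", "b", "b", "c", "d"]})
df["funder"] = keep_top_n(df["funder"], n=2)
print(df["funder"].value_counts())
```

Note that with this approach missing values are also mapped to "other", which may or may not be what you want.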

Before I can feed the data into a model, I need to separate the target from the features and convert it to numerical values. Since I am focusing on predicting non functional water points, I will assign 0 to functional and 1 to non functional.
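
One way to build this binary target is sketched below. The column name `status_group` and the exact label spellings are assumptions; adjust them to whatever the data actually contains.

```python
import pandas as pd

# Hypothetical label column; the real name and spellings may differ
labels = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

# 1 = non functional (the class of interest), 0 = every other status,
# which merges "functional" and "functional needs repair" into one class
y = (labels == "non functional").astype(int)
print(y.tolist())  # [0, 1, 0, 0]
```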

Categorical features also need to be encoded before they go into the model. To do this I will create a pipeline that one-hot encodes all categorical features. First I will create a column transformer that holds the preprocessing steps to be passed through the pipeline. Even though I only need one encoder, a pipeline makes it easier to keep track of preprocessing steps if I need to add others.
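
A minimal sketch of such a pipeline with scikit-learn is shown below. The column names, the toy data, and the choice of `LogisticRegression` as the final step are illustrative assumptions, as is `handle_unknown="ignore"`, which simply avoids errors on categories not seen during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame; column names and values are illustrative only
X = pd.DataFrame({
    "funder": ["a", "b", "a", "other", "b", "a"],
    "gps_height": [10, 250, 30, 400, 5, 120],
})
y = [0, 1, 0, 1, 1, 0]

categorical_cols = ["funder"]

# The column transformer holds all preprocessing steps in one place
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # numeric columns pass through unchanged
)

model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Three one-hot columns for funder plus the numeric column
encoded = model.named_steps["preprocess"].transform(X)
print(encoded.shape)
```

Swapping the classifier later only means replacing the last pipeline step; the preprocessing stays untouched.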

Now I can split the data into train and test sets. I will use the default 75% / 25% train/test split.
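
As a rough illustration, the split could look like the sketch below. It uses synthetic data from `make_classification` in place of the real water point data, and the `stratify` and `random_state` arguments are my own assumptions rather than something stated above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 38% positives, like the real target
X, y = make_classification(n_samples=1000, weights=[0.62, 0.38], random_state=0)

# test_size is left at its default, which gives a 75% / 25% split;
# stratify keeps the class balance the same in both parts (an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```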

I can now test out different classification algorithms to see which performs the best. I will be optimizing for f1 score to get a good balance between precision and recall.

Logistic Regression

Decision Tree

Random Forest

Calculating the mean f1 score, precision, and recall from cross validation, I got the results below.
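
The comparison of the three algorithms could be sketched along these lines. The data is synthetic, and the number of folds and the model settings are assumptions, so the printed scores will not match the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded training data
X, y = make_classification(n_samples=500, weights=[0.62, 0.38], random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

metrics = ("f1", "precision", "recall")
results = {}
for name, model in models.items():
    # cross_validate returns one score per fold under the key "test_<metric>"
    scores = cross_validate(model, X, y, cv=5, scoring=list(metrics))
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}
    print(name, results[name])
```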

Randomized Search CV with random forest

Random forest performed the best. I will use randomized search to see if we can find any hyperparameters that perform better in terms of f1 score.
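
A randomized search of this kind might look like the following sketch. The search space, the number of iterations, and the synthetic data are all assumptions chosen to keep the example small; they are not the values used in the analysis.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, weights=[0.62, 0.38], random_state=0)

# Illustrative search space; distributions are sampled, lists are picked from
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": [None, 5, 10],
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of random parameter combinations to try
    scoring="f1",    # optimize for f1, as in the model comparison
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` holds the model refit on all the data passed to `fit` with the best parameters found.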

Permutation Importances
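
Permutation importance measures how much a model's score drops when the values of a single feature are shuffled. A minimal sketch with scikit-learn's `permutation_importance` follows; the synthetic data, the generic feature names, and the choice of a held-out split for scoring are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, of which 3 are informative
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature n_repeats times and record the drop in f1
result = permutation_importance(
    model, X_val, y_val, scoring="f1", n_repeats=5, random_state=0
)

# Print features from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```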

Examining a few of the important features we can see the following:

Evaluating on test set

The final model was a random forest with tuned hyperparameters. I will now check the model’s performance on the test set.

The f1, precision, and recall scores are similar to our cross validation scores. This is a good sign that our model is not overfitting the training data.
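
The comparison between cross validation and test scores can be sketched as below, shown here for f1 only. The data is synthetic and the model settings are placeholders, so the numbers only illustrate the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=600, weights=[0.62, 0.38], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Mean f1 across folds, computed on the training data only
cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()

# Refit on the full training set, then score once on the held-out test set
model.fit(X_train, y_train)
test_f1 = f1_score(y_test, model.predict(X_test))

# A test score far below the cross validation score would hint at overfitting
print(f"cv f1: {cv_f1:.3f}, test f1: {test_f1:.3f}")
```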

Confusion matrix of test set

A plot of the confusion matrix shows that we have around 708 false positives and 1539 false negatives.
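
For reference, the counts in a confusion matrix can be read off as in this small sketch, which uses hand-made labels rather than the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = non functional, 0 = functional
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives: {fp}, false negatives: {fn}")

# Precision and recall follow directly from these counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

Here a false positive is a working water point flagged as non functional, and a false negative is a non functional water point that the model missed.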

Decision Threshold

Depending on the cost of false positives, we can adjust the decision threshold of the model to increase precision at the cost of recall. The default threshold is 0.5. Below is a graph of the model’s precision and recall scores at different thresholds.

Adjusting the threshold to 0.6, we get a precision of 90% and a recall of 67%. The confusion matrix below, with the threshold set at 0.6, shows that false positives have decreased while false negatives have increased.
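
Applying a custom threshold comes down to comparing predicted probabilities against it instead of calling `predict`. The sketch below assumes a fitted classifier with `predict_proba` and uses synthetic data, so its scores will differ from the 90% and 67% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.62, 0.38], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Probability of the positive class (1 = non functional)
proba = model.predict_proba(X_test)[:, 1]

# Precision and recall at every candidate threshold, e.g. for plotting
precisions, recalls, thresholds = precision_recall_curve(y_test, proba)

scores = {}
for t in (0.5, 0.6):
    # Predict non functional only when the probability reaches the threshold
    y_pred = (proba >= t).astype(int)
    scores[t] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    print(f"threshold {t}: precision {scores[t][0]:.2f}, recall {scores[t][1]:.2f}")
```

Raising the threshold can never increase recall, since every point predicted positive at 0.6 is also predicted positive at 0.5. Precision usually rises, but that is not guaranteed.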

I was able to predict non functional water points with decent results. The model’s feature importances show areas that can be focused on to best avoid water points becoming non functional. Depending on the costs of false positives or false negatives when predicting non functional water points, we can adjust the decision threshold accordingly.

Future Considerations
