In this investigation, I looked at the factors that could help predict the trip duration of Citi Bike users. I focused on time-related features and a few other features that could influence trip duration. Not every feature in the dataset was examined; I concentrated on those I judged most important to the investigation.
The initial data had millions of observations and 14 features, including the target variable. While investigating the data, I found that the features provided would not be enough to give deeper insight into the problem, so I engineered some new features that I thought would be instrumental in getting the insights needed.
After investigating and cleaning the data, I removed some data points from the original dataset because they contributed little to the investigation.
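As an illustration of the kind of feature engineering described above, here is a minimal sketch that derives time-based features from a trip start timestamp. The text does not list the engineered features, so the function name `time_features`, the timestamp format, and the choice of features (hour, weekday, month, weekend flag) are all assumptions made for the example.

```python
from datetime import datetime


def time_features(start_time, fmt="%Y-%m-%d %H:%M:%S"):
    """Derive simple time-based features from a trip start timestamp.

    The timestamp format is an assumption; adjust `fmt` to match the data.
    """
    ts = datetime.strptime(start_time, fmt)
    return {
        "hour": ts.hour,                  # hour of day, 0-23
        "weekday": ts.weekday(),          # 0 = Monday ... 6 = Sunday
        "month": ts.month,                # 1 = January ... 12 = December
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }


# Hypothetical start time, not taken from the real dataset.
print(time_features("2019-07-15 08:30:12"))
```

Features like these make it possible to group trips by hour of day, day of week, and month.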
Trip duration in the original dataset takes on a very large range of values, from about 60 seconds at the lowest to about 6 million seconds at the highest. Because more than 95% of the data lies between 60 and 4000 seconds, I removed samples with a trip duration above 4000 seconds from the final dataset.
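The cutoff described above can be sketched in plain Python as follows. The sample durations are made up for illustration, and the helper names `drop_long_trips` and `share_within` are hypothetical; only the 4000-second threshold comes from the text.

```python
MAX_DURATION = 4000  # cutoff in seconds, as stated in the text

# Hypothetical trip durations in seconds (not real Citi Bike data).
durations = [60, 420, 815, 1300, 3999, 4000, 4001, 86400, 6_000_000]


def drop_long_trips(values, cutoff=MAX_DURATION):
    """Keep only trips whose duration does not exceed the cutoff."""
    return [v for v in values if v <= cutoff]


def share_within(values, cutoff=MAX_DURATION):
    """Fraction of trips at or below the cutoff."""
    return sum(v <= cutoff for v in values) / len(values)


kept = drop_long_trips(durations)
print(len(durations) - len(kept), "trips removed")
print(f"{share_within(durations):.1%} of trips are within the cutoff")
```

Checking the share of trips within the cutoff before filtering is a quick way to confirm how much data the threshold discards.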
Plotted without transformation, the trip duration distribution is right-skewed with a long tail. Plotted on a logarithmic scale, the distribution is slightly skewed to the left, with the peak between about 400 and 800 seconds.
On a standard scale, the trip_dist variable is highly skewed, with a long tail to the right and a peak around 0.015, which calls for a transformation. I applied a logarithmic transformation, but trip_dist contains zero values, for which the logarithm is undefined. To handle this, I added a constant to every value before taking the log. This produces a cluster of values around 0 in the plot, which is inconsequential to the subject matter. The peak of the distribution lies between 0.05 and 0.07 trip distance.
Note: I added 0.001 as a constant to all values so that the variable could be log-transformed.
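The offset-then-log step can be sketched as below. The 0.001 constant comes from the note above; the base of the logarithm is not stated in the text, so base 10 is an assumption here, as are the sample distances and the function name `log_transform`.

```python
import math

OFFSET = 0.001  # constant added to every value, as stated in the note


def log_transform(values, offset=OFFSET):
    """Shift values by a small constant, then take log base 10.

    Without the shift, a value of 0 would raise a ValueError in math.log10.
    Base 10 is an assumption; the text does not name the base.
    """
    return [math.log10(v + offset) for v in values]


# Hypothetical trip distances, including a zero-distance trip.
trip_dist = [0.0, 0.015, 0.05, 0.07, 0.3]
for original, transformed in zip(trip_dist, log_transform(trip_dist)):
    print(f"{original:>6} -> {transformed:.3f}")
```

Because every zero maps to the same transformed value, zero-distance trips form their own spike in the transformed histogram.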
I wanted to see the distribution of four other features important to the investigation, so I plotted some categorical and numerical variables together.
More than 90% of Citi Bike users are Subscribers. The peak hours are 8:00-10:00 in the morning and 17:00-19:00 in the evening. Riding activity is higher on weekdays, and the summer months are the peak months in the year under review.
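A proportion such as the Subscriber share above can be computed with a simple count. This is a minimal sketch on made-up records; the helper name `category_shares` is hypothetical, and the sample is constructed for illustration rather than taken from the dataset.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total number of records."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Hypothetical user-type column: nine Subscribers and one Customer.
user_types = ["Subscriber"] * 9 + ["Customer"]
shares = category_shares(user_types)
for category, share in shares.items():
    print(f"{category}: {share:.0%}")
```

The same helper works for any categorical column, such as hour of day or day of week.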
This plot uses the log-transformed trip duration and trip distance. Visually, it shows no clear linear relationship between trip distance and duration. However, the correlation coefficient between the two is about 0.628, the highest among the numerical variables.
The plot shows no clear linear relationship between trip duration and rider age, although the correlation coefficient indicates a negative relationship between the two. The visual also shows unusual data points at age 0.
When I checked the correlation coefficients of the numerical variables with the target variable (trip duration), the results confirmed that trip distance has the strongest relationship with trip duration, even though the earlier scatter plot shows little to no sign of such a relationship. There is also a slight negative relationship between trip duration and rider age. Month of the year has the weakest relationship with the target variable.
In conclusion, trip distance is likely to contribute the most to predicting trip duration.
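For reference, the correlation check discussed above can be sketched with a hand-written Pearson coefficient. The text does not say which correlation measure was used, so Pearson is an assumption, and the sample values are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, not real Citi Bike data.
trip_dist = [0.01, 0.02, 0.03, 0.05, 0.08]
trip_duration = [300, 420, 510, 900, 1100]
print(round(pearson(trip_dist, trip_duration), 3))
```

A coefficient near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one, which is why a scatter plot and the coefficient are worth reading together.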
Trip duration rises in the evening across all gender classes. People tend to ride longer on weekends than on weekdays across all gender classes. The summer months contribute the most to trip duration in the year under review.
The same pattern holds across user types: trip duration rises in the evening, on weekends, and in the summer months.
Customers mostly have an unknown gender and an unknown age (recorded as 0). Users with these attributes tend to ride more, with slightly longer trip durations.