Citibike Data-Viz using Tableau
If you wish to check on the workbook before reading, please find via this link: https://public.tableau.com/views/citibike-trips/Story1?:language=en-GB&:display_count=n&:origin=viz_share_link
Where do Citi Bikers ride? When do they ride? How far do they go? Which stations are most popular? What days of the week are most rides taken on? These were the most frequent questions for CitiBike and the bank provides some data since 2013. In this work, new_york_citibike dataset from Google BigQuery public database has been used. Because the trips data is huge, I have randomly sampled 0.01 of it. And the years are 2013 to 2017.
In the BigQuery public database, new_york_citibike dataset has two tables: citibike_stations and citibike_trips. citibike_stations has the data for stations specifically and citibike_trips table is all the trips made from 2013 to 2018. When I count the rows, it seems like trip counts increased significantly each year, but 2018 has quite a low number. Therefore I excluded 2018.
citibike_trips table has more than 50 million rows but about 5 million of them are null. So I excluded the null values and randomly sampled 0.01 of the table using this query:
SELECT
*
FROM
`bigquery-public-data.new_york_citibike.citibike_trips`
WHERE
start_station_id IS NOT NULL AND
EXTRACT(YEAR from starttime) != 2018 AND
RAND() < 0.01;
-- RAND() function generates a random value between 0 and 1 uniformly.
-- Therefore, a value is less than 0.01 means approximately 0.01 of the table.
citibike_stations table should be used wholly but some of the values in station_id are inconsistent. Most of the ids are numeric and less than 3 digits but some of the values are 36 character long strings. So I excluded these inconsistent rows and select all the others using this query:
SELECT *
FROM `bigquery-public-data.new_york_citibike.citibike_stations`
WHERE LENGTH(station_id) != 36;
Then I exported the result as a BigQuery table into the personal project on my account. After connecting the data with the Tableau desktop, it is all ready to work.
I would like to start with a density map. Below is an interesting insight into user type and station preferences. Most used stations by customers are some tourist sites, especially south of Central Park whereas subscribers are more diverse in Manhattan.
Which days and what hours do users prefer riding these bikes? My first guess would be 8–9 in the morning and 5–6 afternoon on weekdays. When we look into this, it is indeed the case. Note the two peaks for weekdays and wider distribution for the weekend.
A density visualization below can provide another angle for this fact. Darker color indicates higher number of trips.
The next visualization shows the difference between parts of the day. There is also distinction between genders. On the screenshot, not every text details are shown but you can hover your mouse and see tooltips on the link I provided with tableau public. Note that evening rides are more than morning rides only on weekends.
Lastly some demographics on users. Colors discern genders and bars show the age groups.
These were some brief insights about the data and if you wish to check out, there are more to find on the workbook I have uploaded on Tableau Public. Thanks for reading!
P.S. I have uploaded the data on Kaggle if anyone wants to check it out: https://www.kaggle.com/datasets/fatihb/citibike-sampled-data-2013-2017