This post continues from a previous one, where I ran a logistic regression on a dataset to see which types of news stories get shared the most.
This time I will talk about how to build visualizations while exploring the data.
One of the first visualizations I made was a scatterplot, which gives a good picture of how the data points are spread out across two variables.
# pass the columns by keyword; recent seaborn versions no longer accept x and y positionally
sns.scatterplot(data=df, x='n_unique_tokens', y='shares', hue='label')
Next was a bar plot to see on which day of the week stories got shared the most, using the median number of shares for each day.
share = []
for d in ['weekday_is_monday', 'weekday_is_tuesday', 'weekday_is_wednesday', 'weekday_is_thursday',
          'weekday_is_friday', 'weekday_is_saturday', 'weekday_is_sunday']:
    share.append(df[df[d]==1]['shares'].median())
ax = sns.barplot(x= ['Mon','Tue','Wed','Thu','Fri','Sat','Sun'], y= share)
ax.set(xlabel='day of week', ylabel='median_shares')
plt.show()
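If you want to check the per-day numbers before plotting them, the loop above can be wrapped in a small helper. This is only a sketch: the `toy` DataFrame and the `median_shares_by_day` function are made up for illustration, and just two weekday columns are included to keep it short. It also shows why the median is a reasonable choice here, since one very popular story does not drag it upward.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values):
# one-hot weekday columns plus a shares column.
toy = pd.DataFrame({
    "weekday_is_monday":  [1, 1, 0, 0, 0],
    "weekday_is_tuesday": [0, 0, 1, 1, 1],
    "shares":             [100, 300, 50, 70, 900],
})

day_cols = ["weekday_is_monday", "weekday_is_tuesday"]


def median_shares_by_day(frame, day_cols):
    """Return the median of 'shares' for the rows flagged by each one-hot day column."""
    return pd.Series(
        {col: frame.loc[frame[col] == 1, "shares"].median() for col in day_cols}
    )


medians = median_shares_by_day(toy, day_cols)
print(medians)
```

In the toy data, Tuesday has one story with 900 shares. The Tuesday median stays at 70, while the mean would be 340, so the bar plot of medians reflects a typical story rather than the outlier.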
Finally, to go deeper than single-variable distributions, I grouped the stories by data channel and label and plotted the mean of num_hrefs for each group in a bar plot.
df.groupby(['data_channel','label'])['num_hrefs'].mean().plot(kind='bar')
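The grouped result has a two-level index (channel, label), so every channel and label pair gets its own bar. One option, sketched below on made-up data, is to unstack the label level so each channel becomes a row and each label a column. The `toy` DataFrame, its channel names, and the 0/1 label values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values).
toy = pd.DataFrame({
    "data_channel": ["tech", "tech", "tech", "world", "world", "world"],
    "label":        [0, 1, 1, 0, 0, 1],
    "num_hrefs":    [4, 10, 12, 6, 8, 3],
})

# Same aggregation as above: mean num_hrefs per (channel, label) pair.
grouped = toy.groupby(["data_channel", "label"])["num_hrefs"].mean()

# Move the label level into columns: one row per channel, one column per label.
wide = grouped.unstack("label")
print(wide)
```

Calling `wide.plot(kind='bar')` on the result draws the labels side by side within each channel, which makes the two labels easier to compare than a single run of bars.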
Visualizations are a great way to tell a story, and they come in handy during exploratory data analysis.