The music industry is on the rise. Over the last few years, annual revenues have consistently increased, reaching $20 billion last year. Music streaming in particular has taken off in the past decade.
Spotify reported revenue of $7.44 billion in 2019, and competitors like Apple Music and YouTube Music also have high, growing revenues.
Clearly, there’s a lot of money to be made in this industry for artists, but only if they manage to become popular. What makes music popular is usually assumed to be subjective and unclear. Perhaps, however, there are hidden trends in the vast amount of available data that can help identify which aspects of a song make it popular.
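The main purpose of this project is to explore trends between popularity and certain sound features/attributes of songs. These attributes include acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence.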
Read more about the attributes here: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/
We will also investigate other trends in the data.
import warnings; warnings.simplefilter('ignore')
import pandas as pd
import numpy as np
from statistics import mean
from sklearn.model_selection import KFold
import statsmodels.formula.api as sm
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier as KNN
from scipy import stats as st
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
#Reading in data from SpotifyFeatures.csv into pandas dataframe
Song_Data_Frame = pd.read_csv("SpotifyFeatures.csv")
#Fixing inconsistent data records by making all instances of the same entry consistent
Song_Data_Frame['genre'] = Song_Data_Frame['genre'].apply(lambda x: "Children's Music" if x == "Children’s Music" else x)
Song_Data_Frame['genre'] = Song_Data_Frame['genre'].apply(lambda x: "Reggae" if x == "Reggaeton" else x)
#Dropping A Capella genre from the dataframe
Song_Data_Frame = Song_Data_Frame[Song_Data_Frame['genre'] != 'A Capella']
#Reading in data from SpotifyData.csv (tracks also have a year associated with them) into a pandas dataframe
Song_Data_Frame_Time = pd.read_csv("SpotifyData.csv")
#Merging the two dataframes into one by matching up tracks by their ID
Song_Data_Frame_Merged = Song_Data_Frame.merge(Song_Data_Frame_Time[['id','year','explicit']], how = "inner",
                                               left_on = "track_id", right_on = 'id')
#Creates a pandas Series mapping each artist name to the number of their tracks in the dataframe
artist_freq = Song_Data_Frame['artist_name'].value_counts()
#Adding a new column to store the number of tracks the artist has in the dataframe
Song_Data_Frame['artist_freq'] = Song_Data_Frame['artist_name'].map(artist_freq).astype(int)
#Adding a column to store the decade each track was released in to the Song_Data_Frame_Merged dataframe
Song_Data_Frame_Merged['decade'] = Song_Data_Frame_Merged['year'].apply(lambda x: str(x)[2] + '0s'
                                                                        if x/10 < 200 else
                                                                        "20" + str(x)[2] + '0s')
Song_Data_Frame
To obtain our data, we used a Spotify CSV file obtained from Kaggle, here: https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db
After reading the data into a Pandas dataframe, we noticed that there was an issue with the same genre being listed as two different strings. We used a lambda function to make the two string representations uniform.
We also dropped A Capella music from the dataset because the genre had too many data inconsistencies and bad observations.
Once we had clean data, we added an artist frequency column that records how many tracks each artist has in the dataframe. This lets us do things like filter out “one-hit wonders” and compare how often each artist appears.
We also created a new dataframe called Song_Data_Frame_Merged by merging our original dataframe with another Spotify CSV that includes release years. Since the second file doesn’t include genre, while our original one does, an inner merge of the two dataframes gives us the best of both worlds.
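To make the effect of the inner merge concrete, here is a minimal sketch on two tiny, hypothetical tables (the rows and values are invented for illustration; only the column names follow the code above). It shows that only tracks whose ID appears in both tables survive the merge.

```python
import pandas as pd

# Hypothetical toy tables standing in for SpotifyFeatures.csv and SpotifyData.csv
features = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "genre": ["Pop", "Jazz", "Rock"],
    "popularity": [80, 35, 60],
})
years = pd.DataFrame({
    "id": ["a1", "c3", "z9"],
    "year": [2015, 1987, 1999],
    "explicit": [0, 1, 0],
})

# An inner merge keeps only tracks whose ID appears in both tables,
# so "b2" (no year) and "z9" (no genre) are dropped
merged = features.merge(years[["id", "year", "explicit"]], how="inner",
                        left_on="track_id", right_on="id")
print(merged[["track_id", "genre", "year"]])
```

The trade-off of an inner merge is that any track missing from either file is silently discarded, so it is worth comparing the row counts before and after.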
In this dataframe, we also created a decade column so it would be easy to access songs based on the decade they appeared in, rather than only the specific year. This was done using a lambda function and some string manipulation.
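As an alternative to string slicing, the decade label can also be derived with integer arithmetic. The helper below is a hypothetical sketch (the name decade_label is ours, not part of the original notebook) that is intended to produce the same labels as the lambda above.

```python
def decade_label(year):
    """Return a label such as '90s' for 1995 or '2010s' for 2015."""
    # Round the year down to the start of its decade
    decade_start = (int(year) // 10) * 10
    if decade_start < 2000:
        # Two-digit label for decades before 2000, e.g. 1920 -> '20s'
        return f"{decade_start % 100:02d}s"
    # Four-digit label from 2000 onward, e.g. 2010 -> '2010s'
    return f"{decade_start}s"


for year in (1921, 1995, 2000, 2015):
    print(year, decade_label(year))
```

It could be applied with something like Song_Data_Frame_Merged['year'].apply(decade_label), assuming the year column holds numeric years.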
We began our data exploration by making a few plots where decade is the independent variable. Below we made a plot of popularity vs decade. As we can see, there seems to be a trend where the more modern the music is, the more popular it is right now.
#Grouping the data into groups of the same decade and calculating each group's average values
#(numeric_only = True skips text columns such as track names, which recent pandas versions refuse to average)
Decade_Groups = Song_Data_Frame_Merged.groupby(by = 'decade', as_index = False).mean(numeric_only = True).sort_values(by = 'year')
Decade_Groups.plot(kind='scatter',x='decade',y='popularity', s = 50,
                   color='blue', figsize=(10, 7), title = 'Popularity vs Decade')
Next, we will plot danceability vs decade in order to see how one of our attributes relates to the popularity trend we see above. As we can see from the graph below, there does seem to be some correlation between popularity and danceability, as both tend to increase over the decades. Apart from a few outliers (the 1920s and 1930s), the pattern closely follows the one in the graph above.
Decade_Groups.plot(kind='scatter',x='decade',y='danceability', s = 50,
color='purple', figsize=(10, 7), title = 'Danceability vs Decade')
In this next graph, we do the same thing as above, but with energy instead of danceability. As the graph below shows, energy follows a similar trend to popularity and danceability: as the decades progress, the average energy for each decade also seems to increase. Could this mean that certain attributes correlate with popularity?
Decade_Groups.plot(kind='scatter',x='decade',y='energy', s = 50,
color='green', figsize=(10, 7), title = 'Energy vs Decade')
Another interesting trend to look at is which genre from each time period is the most popular right now. To accomplish this, we simply iterate through the different decades and isolate the songs in each one. Using this new subset of data, we then group by the genre of music and find the average popularity of each genre. Finally, we can take the most popular entry from this subset to get the currently most popular genre of music from that decade. We can add this to our final dataframe.
#Collecting the most popular genre row of each decade (DataFrame.append was removed in pandas 2.0)
top_rows = []
#Looping through the different decades
for decade in Song_Data_Frame_Merged['decade'].unique():
    #Creating a temporary dataframe to store songs of the current decade only
    df = Song_Data_Frame_Merged[Song_Data_Frame_Merged['decade'] == decade]
    #Grouping the songs in the current decade by genre and calculating the mean values for each genre within the decade
    df = df.groupby(by = 'genre', as_index = False).mean(numeric_only = True).sort_values(by = 'popularity', ascending = False)
    #Extracting the most popular row (genre) from the grouped df
    top = df.iloc[0].copy()
    #Adding back the decade column that was lost in the grouping above
    top['decade'] = decade
    top_rows.append(top)
Decade_df = pd.DataFrame(top_rows)
Decade_df.sort_values(by = 'year')[['decade', 'genre', 'popularity']]
From the above table, we can see that for most decades the genre that is most popular right now is Pop. However, from the 1920s to the 1950s, other genres dominate. Also, from the average popularities listed, we can see that music from earlier decades tends to be less popular now.
To see which attribute best predicts popularity, a natural first step is to look at the popularity row of the correlation table. As that row shows, none of the attributes has a strong or even moderate correlation with popularity.
Song_Data_Frame.corr(method = 'spearman', numeric_only = True)
Above we used the Spearman method for calculating the correlation because, as detailed here: https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/, the Pearson method assumes a linear relationship and approximately normal data, whereas the Spearman method is rank-based and only assumes a monotonic relationship. We chose not to use the Kendall method because it is usually recommended for small sample sizes, and here we are working with rather large ones.
More on Spearman vs Kendall: https://www.statisticssolutions.com/kendalls-tau-and-spearmans-rank-correlation-coefficient/
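To illustrate the difference, the sketch below uses made-up data (not the Spotify dataset) in which y grows exponentially with x. The relationship is perfectly monotonic but far from linear, so Spearman reports a perfect correlation while Pearson does not. It also shows that Spearman is simply Pearson applied to the ranks.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y increases with x, but not linearly
x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman's coefficient equals Pearson's coefficient computed on the ranks
rank_pearson = toy.rank().corr(method="pearson").loc["x", "y"]

print("Pearson: ", pearson)
print("Spearman:", spearman)
print("Pearson on ranks:", rank_pearson)
```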
#Graphing popularity vs danceability to show visually that there is no correlation
Song_Data_Frame.sort_values(by = 'popularity').plot(kind='scatter',x='danceability',y='popularity',
                                                    color='red', figsize=(10, 7), title = 'Popularity vs Danceability')
Perhaps there is no strong correlation because different genres of music need different levels of each attribute to be more or less popular. As can be seen in the graph above, when we consider the entire dataset of songs, every value of danceability covers almost the entire range of popularity. Could this be because certain genres require different levels of danceability?
To see if the average popularity of each genre is throwing off the correlation, we will perform a one-sample t-test for each genre to see if its mean popularity is statistically different from the mean of the whole population.
for genre in Song_Data_Frame['genre'].unique():
    Genre_df = Song_Data_Frame[Song_Data_Frame['genre'] == genre]['popularity']
    print("Genre: "+genre,"\n", st.ttest_1samp(Genre_df, Song_Data_Frame['popularity'].mean()), "\n\n")
As can be seen above, all the p-values are extremely small, so we can reject the null hypothesis that a genre’s mean popularity is equal to the mean popularity of the entire population. This means we may have to consider genres separately so that the popularity measures are not skewed.
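For readers unfamiliar with the test, the following sketch shows what st.ttest_1samp computes, using randomly generated, hypothetical popularity scores rather than the real data. The t statistic is the difference between the sample mean and the reference mean, divided by the standard error of the sample mean.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)
# Hypothetical popularity scores: a whole "population" and one genre with a higher mean
population = rng.normal(loc=40, scale=15, size=5000)
genre_sample = rng.normal(loc=55, scale=10, size=300)

# Reference value for the test: the mean popularity of the whole population
mu = population.mean()
result = st.ttest_1samp(genre_sample, mu)

# The same t statistic computed by hand
standard_error = genre_sample.std(ddof=1) / np.sqrt(len(genre_sample))
t_manual = (genre_sample.mean() - mu) / standard_error

print("scipy t:", result.statistic, "manual t:", t_manual)
print("p-value:", result.pvalue)
```

One caveat worth keeping in mind: with very large samples, even small differences in means produce tiny p-values, so the size of the difference matters as much as its significance.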
Genre_Groups = Song_Data_Frame.groupby(by = 'genre', as_index = False).mean(numeric_only = True)
Genre_Groups.corr(method = 'spearman', numeric_only = True)
The above code was written to try to solve the problem of comparing individual songs of different genres. To do this, we created a new dataframe, Genre_Groups, by making each row correspond to the average entry of each genre in the Song_Data_Frame. This was done so a correlation could be made between each genre's popularity and its attributes instead of individual songs.
#Starting to see a trend form, but it is still unclear
Genre_Groups.plot(kind='scatter',x='danceability',y='popularity', s = 30,
                  color='red', figsize=(20, 10))
plt.title('Mean Popularity vs Mean Danceability', size = 25)
plt.xlabel('\nMean Danceability', size = 25)
plt.ylabel('Mean Popularity', size = 25)
plt.xticks(size = 20)
plt.yticks(size = 20)
It turns out the fix we attempted above (Genre_Groups) improved the correlation between popularity and other attributes. For example, as shown above, the correlation between genre mean popularity and genre mean danceability is about 0.68, which is not a very strong correlation but is an improvement.
#Mean popularities are different per genre so perhaps different levels of attributes contribute more to one genre's popularity
#than another's.
Genre_Groups['genre'] = Genre_Groups['genre'].apply(lambda x: x[0:5])
Genre_Groups.sort_values(by = 'popularity').plot(kind='scatter',x='genre',y='popularity', s = 50,
color='red', figsize=(20, 10), title = 'Popularity vs Genre ')
plt.title('Mean Popularity vs Genre', size = 25)
plt.xlabel('\nGenre', size = 25)
plt.ylabel('Mean Popularity', size = 25)
plt.xticks(size = 11)
plt.yticks(size = 20)
The above graph illustrates the large differences in average popularity between different genres. This leads us to believe that we may need to instead correlate popularity within a specific genre to attributes also within that specific genre. This way, attributes in one genre that correlate to its popularity won’t affect how those same attributes affect the popularity of another genre.
#Creating a correlation table with each row representing the popularity correlation within only one genre
rows = []
#Looping through each genre
for genre in Song_Data_Frame['genre'].unique():
    #Creating a dataframe consisting of songs from only one genre
    genre_songs = Song_Data_Frame[Song_Data_Frame['genre'] == genre]
    #Extracting the popularity row of the correlation table for the current genre
    row = genre_songs.corr(method = 'spearman', numeric_only = True).loc['popularity']
    row.name = genre
    rows.append(row)
#Building the overall correlation table (DataFrame.append was removed in pandas 2.0)
corr_Data_Frame = pd.DataFrame(rows).rename_axis('Genre Popularity corr').reset_index()
corr_Data_Frame
In the above dataframe, each row corresponds to the popularity correlation row of the dataframe consisting of only songs from that genre. In other words, we are able to observe the popularity correlation within a specific genre for all genres, without having one genre’s attributes affect another.
This was done by creating a dataframe consisting of only songs from each genre, creating a correlation table for that dataframe, extracting the popularity correlation from that genre’s correlation table, and appending it to the correlation dataframe seen above.
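The steps above can also be sketched with groupby. The example below uses a small, hypothetical dataset built so that danceability helps popularity in one genre and hurts it in another. The pooled correlation is zero even though each within-genre correlation is perfect, which is the kind of masking this per-genre table is meant to detect. The numbers are invented and say nothing about the real Spotify data.

```python
import pandas as pd

# Hypothetical songs: danceability raises popularity in Pop and lowers it in Jazz
toy = pd.DataFrame({
    "genre": ["Pop"] * 4 + ["Jazz"] * 4,
    "popularity": [10, 20, 30, 40, 40, 30, 20, 10],
    "danceability": [0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.5, 0.7],
})

# One popularity-correlation row per genre
rows = {}
for genre, songs in toy.groupby("genre"):
    rows[genre] = songs.corr(method="spearman", numeric_only=True).loc["popularity"]
per_genre = pd.DataFrame(rows).T

# Correlation over all songs pooled together
overall = toy.corr(method="spearman", numeric_only=True).loc["popularity", "danceability"]

print(per_genre)
print("Pooled correlation:", overall)
```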
As we can see above, this does not dramatically increase the correlations between the attributes and popularity. This could be due to the fact that the fanbase of each artist within each genre likes different attributes. We try accounting for this below.
#Creating a correlation table with each row representing the popularity correlation within only one artist's works
rows = []
#Creating a temporary dataframe consisting of songs by artists that appear more than 30 times in the data set
temp = Song_Data_Frame[Song_Data_Frame['artist_freq'] > 30]
#Looping through each artist
for artist in temp['artist_name'].unique():
    #Creating a dataframe consisting of songs from only one artist
    artist_songs = temp[temp['artist_name'] == artist]
    #Extracting the popularity row of the correlation table for the current artist
    row = artist_songs.corr(method = 'spearman', numeric_only = True).loc['popularity']
    row.name = artist
    rows.append(row)
#Building the overall correlation table (DataFrame.append was removed in pandas 2.0)
corr_Data_Frame = pd.DataFrame(rows).rename_axis('Artist Popularity corr').reset_index()
#Calculating the average correlation values between all the artists (on average how well does popularity
#correlate with the other attributes)
abs_corr = corr_Data_Frame.drop(columns = ['artist_freq', 'Artist Popularity corr']).dropna().abs()
print("Mean Correlation With Popularity:\n\n" + str(abs_corr.mean()))
print("\n\nVariance of Correlation With Popularity:\n\n" + str(abs_corr.var()))
Above we created another correlation dataframe, but this time each row corresponds to the songs of one particular artist. To make sure each artist has enough data points, we only include artists who appear in the dataset more than 30 times. This removes “one-hit-wonder” artists, whose single popular song can throw off the popularity measure. We did this using the artist_freq column.
Since the dataframe has thousands of rows, we took the mean of the absolute correlations to see how well popularity correlates with each attribute on average across the many artists.
As can be seen by the results above, there is still no strong correlation between a single variable and popularity. Perhaps this is because it is not one attribute but a combination of attributes that can predict popularity?
Below we want to see if we can model popularity based on multiple interaction terms. To do this, we use statsmodels’s OLS feature.
More on the statsmodels API: https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html
modelString = '''popularity ~ acousticness * danceability * duration_ms * energy * instrumentalness * liveness *
loudness * speechiness * tempo * valence'''
model = sm.ols(modelString, data = Song_Data_Frame).fit()
print("R-squared: " + str(model.rsquared) + "\tR-squared-adj: " + str(model.rsquared_adj) + "\n")
Above we attempted to model popularity using many interaction terms. As the R^2 value shows, the model does not fit well, so perhaps we again have to account for genre.
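It helps to know how large this model is. In the formula syntax, a * b expands to a + b + a:b, so chaining ten attributes with * produces every main effect and every interaction of every order. The sketch below counts those terms with plain Python; the helper name expand_full_interaction is ours and is only an illustration of the expansion, not a call into statsmodels.

```python
from itertools import combinations

predictors = ["acousticness", "danceability", "duration_ms", "energy",
              "instrumentalness", "liveness", "loudness", "speechiness",
              "tempo", "valence"]


def expand_full_interaction(names):
    """List every main effect and interaction implied by 'a * b * ...'."""
    terms = []
    for order in range(1, len(names) + 1):
        # Every subset of size `order` becomes one term, e.g. 'energy:tempo'
        for combo in combinations(names, order):
            terms.append(":".join(combo))
    return terms


terms = expand_full_interaction(predictors)
print(len(terms))  # 1023 terms (2**10 - 1), plus the intercept
```

With over a thousand coefficients, the model needs far more than a thousand songs before its R^2 says much about real predictive power.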
Maybe we can model popularity based on the interaction between the other attributes within each genre. This way, we can observe the R^2 value per genre, rather than over the entire dataset’s songs.
modelString = '''popularity ~ acousticness * danceability * duration_ms * energy * instrumentalness * liveness *
loudness * speechiness * tempo * valence'''
#Looping through the genres
for genre in Song_Data_Frame['genre'].unique():
    #Creating a model for popularity only considering songs of the current genre
    model = sm.ols(modelString, data = Song_Data_Frame[Song_Data_Frame['genre'] == genre]).fit()
    print(genre + " R-squared: " + str(model.rsquared) + "\nR-squared-adj: " + str(model.rsquared_adj) + "\n")
From our resulting R^2 values within each genre, we can see that the vast majority of the genres have a poor fit between their attributes and their popularity. Instead, we may need to go another level deeper, and find out how well these attributes can model popularity within each artist. To do this, we only observe artists who appear more than 100 times in the dataset, sample 20 of these artists randomly, and find the R^2 value corresponding to each of them.
modelString = '''popularity ~ acousticness * danceability * duration_ms * energy * instrumentalness * liveness *
loudness * speechiness * tempo * valence'''
#Creating a temporary dataframe consisting of songs by artists that appear more than 100 times in the data set
temp = Song_Data_Frame[Song_Data_Frame['artist_freq'] > 100]
#Looping through the artists
for artist in temp.drop_duplicates(subset=['artist_name'])['artist_name'].sample(20):
    #Creating a model for popularity only considering songs of the current artist
    model = sm.ols(modelString, data = Song_Data_Frame[Song_Data_Frame['artist_name'] == artist]).fit()
    #Some artists produce invalid (nan or infinite) results from the model generation, this prevents those from printing
    if np.isfinite(model.rsquared_adj):
        print(artist + " R-squared: " + str(model.rsquared) + "\nR-squared-adj: " + str(model.rsquared_adj) + "\n")
This yields much higher R^2 values. Most of them are close to 1, indicating a close fit between the interaction terms and popularity within an artist’s catalogue. This needs to be read with caution, though: the formula fully interacts ten attributes, which expands to 1,023 terms, so for an artist with only a few hundred songs the model has more parameters than observations and an R^2 near 1 is expected regardless of predictive power. With that caveat, the above may indicate that while the attributes discussed are not a good predictor of a song’s popularity in general, or within a genre, it may be possible to predict the popularity of a song in the scope of the artist’s other work. In other words, it may be possible to predict how popular a particular artist’s song will be compared to their other songs based on these attributes. This could be because the artist’s fanbase consistently enjoys specific combinations of qualities in the artist’s songs, so songs with these qualities tend to be more popular than the artist’s other songs.
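Because a high R^2 does not by itself guarantee a good set of predictors, it can help to see how R^2 behaves when a model has more terms than observations. The sketch below fits ordinary least squares with numpy on pure random noise. All data and names are hypothetical and unrelated to the Spotify dataset.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    centered = y - y.mean()
    return 1 - (residuals @ residuals) / (centered @ centered)


rng = np.random.default_rng(0)
n_songs = 50
y = rng.normal(size=n_songs)             # pure-noise "popularity"
few = rng.normal(size=(n_songs, 3))      # 3 noise predictors
many = rng.normal(size=(n_songs, 60))    # more predictors than songs

r2_few = r_squared(few, y)
r2_many = r_squared(many, y)
print("R^2 with 3 noise predictors: ", r2_few)
print("R^2 with 60 noise predictors:", r2_many)
```

The second fit reproduces the noise almost exactly, which is why a held-out test such as cross-validation is a more trustworthy check than in-sample R^2.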
Before testing whether we can effectively predict a song’s popularity within its artist’s work, however, we first run the same model on a larger sample of artists to confirm that we get a high R^2 value on average. Since there are too many artists to view effectively at once, we instead compute the mean R^2 over 100 randomly sampled artists.
modelString = '''popularity ~ acousticness * danceability * duration_ms * energy * instrumentalness * liveness *
loudness * speechiness * tempo * valence'''
Rsquared_Total = 0
Adj_Rsquared_Total = 0
counter = 0
#Creating a temporary dataframe consisting of songs by artists that appear more than 100 times in the data set
temp = Song_Data_Frame[Song_Data_Frame['artist_freq'] > 100]
#Looping through the artists
for artist in temp.drop_duplicates(subset=['artist_name'])['artist_name'].sample(100):
    #Creating a model for popularity only considering songs of the current artist
    model = sm.ols(modelString, data = temp[temp['artist_name'] == artist]).fit()
    #Some artists produce invalid (nan or infinite) results from the model generation, this skips those
    if np.isfinite(model.rsquared_adj):
        counter = counter + 1
        Rsquared_Total = Rsquared_Total + float(model.rsquared)
        Adj_Rsquared_Total = Adj_Rsquared_Total + float(model.rsquared_adj)
#Printing average R-squared and R-squared-adj values
print("R-squared average: " + str(Rsquared_Total/counter))
print("\nR-squared-adj average: " + str(Adj_Rsquared_Total/counter))
We get an average R^2 value of 0.97. This is a sign that, for most artists, the model with all attributes interacting fits popularity closely. Given this, we can try building prediction models using Random Forest, K-Nearest Neighbors, and Linear Regression.
We will train each of the models on the attributes discussed and try to predict popularity within a small margin of error. The margin we will use is the difference between the artist’s highest and lowest song popularity, multiplied by 0.05. In other words, a prediction counts as correct only if it differs from the song’s actual popularity by less than 5% of the artist’s total popularity range.
We evaluate each model with 10-Fold Cross Validation: in each round, 90% of the artist’s songs form the training dataset and the remaining 10% form the testing dataset. We run this 10 times so that each 10% portion of the artist’s songs is used as testing data exactly once, and then average the accuracy over all 10 rounds to get the model’s average accuracy for that artist. To start, we use three well-known artists: Meek Mill, Taylor Swift, and Wiz Khalifa.
For more on how the K-Nearest Neighbors algorithm works: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
For more on how the Random Forest algorithm works: https://www.wikiwand.com/en/Random_forest
For more on how sklearn's Linear Regression algorithm works: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
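To make the cross-validation procedure concrete, here is a hand-rolled sketch of unshuffled k-fold splitting using only numpy. It is meant to illustrate the idea behind KFold(n_splits=10, shuffle=False) rather than reproduce sklearn's implementation, and the helper name k_fold_indices is hypothetical.

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=10):
    """Yield (train, test) index arrays using contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    # Split the indices into n_splits nearly equal contiguous chunks
    for test in np.array_split(indices, n_splits):
        # Everything not in the current test chunk is used for training
        train = np.setdiff1d(indices, test)
        yield train, test


# Example: an artist with 25 songs
folds = list(k_fold_indices(25, n_splits=10))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Every song lands in exactly one test fold, so each prediction is made by a model that never saw that song during training.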
#This function returns the ratio of accepted predicted values to total values
#If a predicted value is within the specified accept range, it will count as a success
def score(predicted, actual, accept_range):
    count = 0
    for i in range(len(predicted)):
        if abs(predicted[i] - actual[i]) < accept_range:
            count += 1
    return count / len(actual)
Artist_List = ['Meek Mill', 'Taylor Swift', 'Wiz Khalifa']
cols_to_drop = ['genre', 'artist_name', 'track_name', 'track_id', 'popularity', 'key', 'mode', 'time_signature', 'artist_freq']
#Looping through artists
for artist in Artist_List:
    #Creating a temporary dataframe of only songs by the current artist
    Artist_Data_Frame = Song_Data_Frame[Song_Data_Frame['artist_name'] == artist]
    #Extracting the predictors columns
    x = Artist_Data_Frame.drop(columns = cols_to_drop).values
    #Extracting the actual popularity values
    y = Artist_Data_Frame['popularity'].values
    RF_scores = []
    KNN_scores = []
    reg_scores = []
    #Creating the partition indexing for k-fold
    Folds = KFold(n_splits=10, shuffle=False)
    #Calculating the accept range for this particular artist
    accept_range = .05*(y.max() - y.min())
    #Looping through the partitions for k-fold
    for train, test in Folds.split(x):
        #Random Forest training and predicting
        RF_class = RFC()
        RF_class.fit(x[train], y[train])
        RF_predicted = RF_class.predict(x[test])
        RF_scores.append(score(RF_predicted, y[test], accept_range))
        #k-NN classification training and predicting
        KNN_class = KNN(n_neighbors = 1) #k-NN with k = 1
        KNN_class.fit(x[train], y[train])
        KNN_predicted = KNN_class.predict(x[test])
        KNN_scores.append(score(KNN_predicted, y[test], accept_range))
        #linear regression from sklearn
        reg = LinearRegression().fit(x[train], y[train])
        reg_predicted = reg.predict(x[test])
        reg_scores.append(score(reg_predicted, y[test], accept_range))
    #Printing average score for each machine learning algorithm for the current artist
    print(artist+" RF average score: " +str(mean(RF_scores))+ " STD Error: "+ str(st.sem(RF_scores)))
    print("\n"+artist+" KNN average score: "+ str(mean(KNN_scores))+ " STD Error: "+ str(st.sem(KNN_scores)))
    print("\n"+artist+" reg average score: "+ str(mean(reg_scores))+ " STD Error: "+ str(st.sem(reg_scores))+"\n\n")
From this testing, we see that our Random Forest model was the most accurate across all three artists, closely followed by the K-Nearest Neighbors model. For both Meek Mill and Wiz Khalifa, the two models were over 95% accurate. For Taylor Swift, the models were less accurate, but we still obtained over 75% accuracy in popularity prediction, which is still a good result. This suggests that it may be possible to accurately predict a song’s popularity within its artist’s work using the attributes discussed above. To test this, we run the same models on a random sample of 100 artists who appear more than 100 times in the dataset, and then average each model’s accuracy over all sampled artists.
In this 10-Fold CV test, we use a fixed margin of error of 7.5 popularity points instead of a percentage of each artist’s range. Because the popularity range can be much larger or smaller depending on the artist, a common scale is fairer. For example, the model could predict a popularity of 95 for a song whose actual popularity is 96, yet if that artist’s popularity range is small enough, a 5% margin would count this as an incorrect prediction even though the model is very close.
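The difference between the two margins can be sketched on a small, hypothetical artist whose songs all sit in a narrow popularity range. The function tolerance_accuracy below is a vectorized stand-in for the score function used in this notebook, and the numbers are invented for illustration.

```python
import numpy as np


def tolerance_accuracy(predicted, actual, accept_range):
    """Fraction of predictions that differ from the actual value by less than accept_range."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual) < accept_range))


# Hypothetical artist whose popularity only ranges from 90 to 96
actual = np.array([90, 92, 94, 96])
predicted = np.array([91, 92, 96, 95])

# 5% of a 6-point range is only 0.3 points, so near misses are rejected
relative_margin = 0.05 * (actual.max() - actual.min())
fixed_margin = 7.5

acc_relative = tolerance_accuracy(predicted, actual, relative_margin)
acc_fixed = tolerance_accuracy(predicted, actual, fixed_margin)
print("Accuracy with 5% of range:", acc_relative)
print("Accuracy with 7.5 points: ", acc_fixed)
```

The fixed margin treats all artists alike, at the cost of being lenient for artists whose popularity barely varies.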
KNN_average = []
RF_average = []
reg_average = []
#Creating a temporary dataframe consisting of songs by artists that appear more than 100 times in the data set
temp = Song_Data_Frame[Song_Data_Frame['artist_freq'] > 100]
counter = 0
cols_to_drop = ['genre', 'artist_name', 'track_name', 'track_id', 'popularity', 'key', 'mode', 'time_signature', 'artist_freq']
#Looping through the artists
for artist in temp.drop_duplicates(subset=['artist_name'])['artist_name'].sample(100):
    #Creating a temporary dataframe of only songs by the current artist
    Artist_Data_Frame = temp[temp['artist_name'] == artist].dropna()
    counter = counter + 1
    #Extracting the predictors columns
    x = Artist_Data_Frame.drop(columns = cols_to_drop).values
    #Extracting the actual popularity values
    y = Artist_Data_Frame['popularity'].values
    RF_scores = []
    KNN_scores = []
    reg_scores = []
    #Creating the partition indexing for k-fold
    Folds = KFold(n_splits=10, shuffle=False)
    #Using a fixed accept range of 7.5 popularity points for every artist
    accept_range = 7.5
    #Looping through the partitions for k-fold
    for train, test in Folds.split(x):
        #Random Forest training and predicting
        RF_class = RFC()
        RF_class.fit(x[train], y[train])
        RF_predicted = RF_class.predict(x[test])
        RF_scores.append(score(RF_predicted, y[test], accept_range))
        #k-NN classification training and predicting
        KNN_class = KNN(n_neighbors = 1) #k-NN with k = 1
        KNN_class.fit(x[train], y[train])
        KNN_predicted = KNN_class.predict(x[test])
        KNN_scores.append(score(KNN_predicted, y[test], accept_range))
        #linear regression from sklearn
        reg = LinearRegression().fit(x[train], y[train])
        reg_predicted = reg.predict(x[test])
        reg_scores.append(score(reg_predicted, y[test], accept_range))
    #Appending each machine learning algorithm's average score for the current artist
    KNN_average.append(mean(KNN_scores))
    RF_average.append(mean(RF_scores))
    reg_average.append(mean(reg_scores))
#Printing the average of the average scores for each machine learning algorithm
print("\n\nRF average score: ", mean(RF_average), " STD Error: ", st.sem(RF_average))
print("\nKNN average score: ", mean(KNN_average), " STD Error: ", st.sem(KNN_average))
print("\nreg average score: ", mean(reg_average), " STD Error: ", st.sem(reg_average))
From the test conducted above, we can see that we can get a decent prediction of popularity from the song attributes. As seen before, however, the prediction is more accurate for some artists than for others. This could mean that some artists’ fanbases prefer a specific combination of attributes more consistently than other artists’ fanbases do.
Earlier, during our data exploration, we noticed that there might be some correlation between certain attributes and popularity. Through analysis using single variable correlation, we found that there is no strong relationship between a single variable and popularity. However, through the use of interaction terms, we found that we can get a decent prediction of popularity using these attributes within each artist.
Since our initial guess that popularity could be predicted overall was incorrect, we had to repeatedly narrow our scope: from overall popularity, to popularity within a genre, and finally to our most accurate model, popularity within an artist’s work.