Twitter Sentiment Analysis by Geographical Area
Hello, today we are going to perform a Twitter sentiment analysis over a custom geographical area. In this blog post we will not use a Twitter Developer Account; instead, I will use a simple and unlimited Twitter scraper written in Python.
I am interested in getting top tweets containing a custom keyword, geolocated less than 50 km from the White House in the United States. I would like to analyze the word “gas” around the White House during the month following February 24, 2022.
White House in the United States
Installation of Conda
First you need to install Anaconda from this link
into the location C:\Anaconda3. Then check that your terminal recognizes conda:
C:\conda --version
conda 4.12.0
Step 1. Environment creation
The environment that I will consider uses Python 3.7.
I will create an environment called twitter, but you can use any name you like.
conda create -n twitter python==3.7
then we activate it
conda activate twitter
then in your terminal type the following commands:
conda install ipykernel
then
python -m ipykernel install --user --name twitter --display-name "Python (Twitter)"
you will get something like
(twitter) C:\Users\ruslamv>python -m ipykernel install --user --name twitter --display-name "Python (Twitter)"
Installed kernelspec twitter in C:\Users\ruslanmv\AppData\Roaming\jupyter\kernels\twitter
If you want to delete your conda environment
conda env remove -n twitter
Step 2. Installation of the libraries
pip install certifi pandas python-dotenv chromedriver-autoinstaller geckodriver-autoinstaller urllib3 seaborn scikit-learn
pip install selenium==4.2.0
and finally the great scraper
pip install Scweet==1.8
and the geolocation
pip install geopy
and to analyze the text
pip install wordcloud nltk
then open the Jupyter notebook with the command
jupyter notebook&
Let us get top tweets with a custom keyword like ‘gas’, geolocated less than 50 km from the White House.
In this notebook we are going to classify important and unimportant Twitter messages by considering the number of likes. We are going to web-scrape one month of Twitter messages and analyze them. We will choose one geographical area, consider a radius around it, and extract all tweets that contain a keyword of interest.
For this blog post we will consider the keyword gas, the White House as the location, and the month after the war in Ukraine started as the date range, just out of curiosity. Let's get started. The interesting thing is that we do not need any Twitter account, and the requests are unlimited and free.
Step 3. Getting the data
After the notebook is open you can import and use the functions as follows:
#Loading Libraries
from Scweet.scweet import scrape
from Scweet.user import get_user_information, get_users_following, get_users_followers
from geopy.geocoders import Nominatim
The first step is to initialize the geolocator engine:
geolocator = Nominatim(user_agent="twitter")
Selection of place
Here we need to specify the address where you want to perform the analysis; for example, you can choose your home address or any street and number you want to research. In my case I chose the White House.
location = geolocator.geocode(" White House ")
print(location.address)
White House, 1600, Pennsylvania Avenue Northwest, Washington, District of Columbia, 20500, United States
print((location.latitude, location.longitude))
(38.897699700000004, -77.03655315)
and we set up the radius that we want to explore
distance="50km"
geolocation=str(location.latitude)+","+str(location.longitude)+","+str(distance)
geolocation
'38.897699700000004,-77.03655315,50km'
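If you plan to repeat the analysis for other places, the two steps above can be wrapped in a small helper. This is just a sketch (the function name build_geocode is mine, not part of Scweet or geopy):
from geopy.geocoders import Nominatim
# hypothetical helper: turn any address into the "latitude,longitude,radius"
# string that the scraper expects
def build_geocode(address, radius="50km", user_agent="twitter"):
    geolocator = Nominatim(user_agent=user_agent)
    location = geolocator.geocode(address)
    return f"{location.latitude},{location.longitude},{radius}"
# build_geocode("White House") -> '38.897699700000004,-77.03655315,50km'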
and also the period of time over which to perform the research. Then we create the dataset as follows; after this, we wait a while for the web scraping to run and extract all the data.
data = scrape(words=['gas'],
since="2022-02-24",
until="2022-03-24",
from_account = None,
interval=1,
headless=False,
display_type="Top",
save_images=False, lang="en",
resume=False,
filter_replies=False,
proximity=False,
geocode=geolocation)
looking for tweets between 2022-02-24 and 2022-02-25 ...
.
.
.
you can see the shape
data.shape
(420, 11)
you can save the data
data.to_csv('my_twitter.csv', index=False)
Now we get the main information for a given list of users. From the data, you can pick out a single user:
users = ['BrianJohnsonMPA',]
# this function return a list that contains :
# ["nb of following","nb of followers", "join date", "birthdate", "location", "website", "description"]
users_info = get_user_information(users, headless=True)
Scraping on headless mode.
--------------- BrianJohnsonMPA information : ---------------
Following : 3,370
Followers : 1,067
Location : Washington, DC
Join date : Joined July 2015
Birth date :
Description : VP Gov & Public Affairs / Head of Office, Veterans Guardian | Past: Vogel Group, American Petroleum Institute & Americans for Tax Reform | views my own
Website : https://t.co/SXSxU0WBXH
As you can see, you can get a lot of information that can be very useful for the analysis of the region.
Moreover, you can organize the information for a single user into a DataFrame.
import pandas as pd
users_df = pd.DataFrame(users_info, index = ["nb of following","nb of followers", "join date",
"birthdate", "location", "website", "description"]).T
users_df
| nb of following | nb of followers | join date | birthdate | location | website | description
---|---|---|---|---|---|---|---|
BrianJohnsonMPA | 3,370 | 1,067 | Joined July 2015 | | Washington, DC | https://t.co/SXSxU0WBXH | VP Gov & Public Affairs / Head of Office, Vete... |
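The same call accepts several accounts at once, so you could, for instance, look up a few of the handles that appear in the scraped data. A quick sketch (the handles are passed without the leading @):
# hypothetical: query the first few distinct accounts found in the scraped data
handles = [u.replace('@', '') for u in data['UserName'].dropna().unique()[:3]]
more_info = get_user_information(handles, headless=True)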
Interesting. Well, let us continue and clean our dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv('my_twitter.csv')
We replace NaN values with 0 for the whole DataFrame using pandas:
df=df.fillna(0)
df.head()
| UserScreenName | UserName | Timestamp | Text | Embedded_text | Emojis | Comments | Likes | Retweets | Image link | Tweet URL
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tommie Collins | @tommiecollins | 2022-02-24T01:30:26.000Z | Tommie Collins\n@tommiecollins\n·\nFeb 24 | So hold up...gas is almost $4 and y'all drivin... | 0 | 6.0 | 22.0 | 68 | ['https://pbs.twimg.com/ext_tw_video_thumb/149... | https://twitter.com/tommiecollins/status/14966... |
1 | David Nichols | @DavidNichols_31 | 2022-02-24T22:15:45.000Z | David Nichols\n@DavidNichols_31\n·\nFeb 24 | My only take on the War in the Ukraine is I be... | 0 | 0.0 | 1.0 | 4 | [] | https://twitter.com/DavidNichols_31/status/149... |
2 | KratosOfPortugal | @KratosPRT | 2022-02-24T03:35:14.000Z | KratosOfPortugal\n@KratosPRT\n·\nFeb 24 | Get ready for expensive gas.\nAFP News Agency\... | 0 | 0.0 | 0.0 | 7 | ['https://pbs.twimg.com/profile_images/1272776... | https://twitter.com/KratosPRT/status/149669015... |
3 | Samuel Hammond | @hamandcheese | 2022-02-24T17:26:52.000Z | Samuel Hammond\n@hamandcheese\n·\nFeb 24 | Atlantic Canada is sitting on trillions of cub... | 🌐 🏛 🏗 | 2.0 | 5.0 | 34 | ['https://pbs.twimg.com/profile_images/1437631... | https://twitter.com/hamandcheese/status/149689... |
4 | Brian M. Johnson MPA | @BrianJohnsonMPA | 2022-02-24T19:06:04.000Z | Brian M. Johnson MPA\n@BrianJohnsonMPA\n·\nFeb 24 | Replying to \n@JerryDunleavy\nOil & natural ga... | 0 | 3.0 | 3.0 | 16 | [] | https://twitter.com/BrianJohnsonMPA/status/149... |
and let us choose the information that is most relevant for us
dfa=df[['Embedded_text','Likes','Comments','Retweets','UserName']]
dfa.head()
| Embedded_text | Likes | Comments | Retweets | UserName
---|---|---|---|---|---|
0 | So hold up...gas is almost $4 and y'all drivin... | 22.0 | 6.0 | 68 | @tommiecollins |
1 | My only take on the War in the Ukraine is I be... | 1.0 | 0.0 | 4 | @DavidNichols_31 |
2 | Get ready for expensive gas.\nAFP News Agency\... | 0.0 | 0.0 | 7 | @KratosPRT |
3 | Atlantic Canada is sitting on trillions of cub... | 5.0 | 2.0 | 34 | @hamandcheese |
4 | Replying to \n@JerryDunleavy\nOil & natural ga... | 3.0 | 3.0 | 16 | @BrianJohnsonMPA |
df = dfa.sort_values(by=['Likes','Retweets','Comments'], ascending=False)
df.head(10)
| Embedded_text | Likes | Comments | Retweets | UserName
---|---|---|---|---|---|
378 | It doesn’t matter what a gallon of gas costs-r... | 473.0 | 58.0 | 2,739 | @JJCarafano |
323 | "We're banning all imports of Russian gas, oil... | 424.0 | 70.0 | 2,316 | @Liveuamap |
185 | These are also known as “pump gas with ya gloc... | 109.0 | 7.0 | 109 | @_JuiceJones |
322 | Gas is already at $4.33 a gallon here in DC, i... | 93.0 | 187.0 | 575 | @IntelDoge |
219 | In the search for new measures against becaus... | 80.0 | 24.0 | 534 | @carlbildt |
377 | Good piece. I’ll add that i saw a thread yeste... | 55.0 | 10.0 | 199 | @SaysSimonson |
47 | Seeing whole columns of Russian armor out of g... | 43.0 | 9.0 | 249 | @klonkitchen |
396 | The EU spent almost 13 million EUR on Russian ... | 41.0 | 5.0 | 88 | @SashaUstinovaUA |
151 | Multiple sources tell me investigators believe... | 40.0 | 3.0 | 16 | @Brad7News |
375 | Gas might 5+ but riding dick is freee shordy\n... | 38.0 | 3.0 | 98 | @nikkotank |
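One caveat before we move on: the scraper returns Retweets as text (note the thousands separator in “2,739” above), so the sort on that column is alphabetical rather than numeric. If you need a strictly numeric ordering, a conversion along these lines would do (a sketch, assuming the column only holds counts):
# hypothetical: turn "2,739"-style strings into integers before sorting
df['Retweets'] = pd.to_numeric(df['Retweets'].astype(str).str.replace(',', '', regex=False),
                               errors='coerce').fillna(0).astype(int)
df = df.sort_values(by=['Likes', 'Retweets', 'Comments'], ascending=False)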
We have created a single dataset of messages around the White House.
Step 4. Creation of the dataset
We are interested in classifying the Twitter messages by their importance or relevance:
- Important
- Unimportant
df=df.reset_index(drop=True)
df.plot(y='Likes', use_index=True)
# selecting rows based on condition
df_likes = df.loc[df['Likes'] != 0]
df_likes.shape
(116, 5)
# selecting rows based on condition
df_retweets = df.loc[df['Retweets'] != 0]
df_retweets.shape
(253, 5)
# selecting rows based on condition
df_comments = df.loc[df['Comments'] != 0]
df_comments.shape
(161, 5)
Step 5. Labelling the messages
In this part, we first take the unimportant messages: those with no likes, no retweets, and no comments.
df_unimportant=df.loc[(df['Likes'] == 0) & (df['Retweets'] == 0)&(df['Comments'] == 0)]
df_unimportant.shape
(135, 5)
and then the important messages, i.e. those with at least one like, retweet, or comment:
df_important=df.loc[(df['Likes'] != 0) | (df['Retweets'] != 0)|(df['Comments'] != 0)]
df_important.shape
(285, 5)
# We check if our dimensions are okay
assert df_important.shape[0] +df_unimportant.shape[0] == df.shape[0], 'something is wrong'
print('its okay, the number of total messages are:',df.shape[0])
its okay, the number of total messages are: 420
df_important=df_important.reset_index(drop=True)
df_important.plot(y='Comments', use_index=True)
df_important.plot(y='Likes', use_index=True)
If we split the set of important tweets into 3 categories, we can take this approach:
size=int(len(df_important)/3)
size
95
The importance of the messages could then be split into:
- Very Important
- Fairly Important
- Slightly important
- Not at all important
# splitting the dataframe by row index (using .copy() avoids a pandas SettingWithCopyWarning later)
df_1 = df_important.iloc[:size,:].copy()
df_2 = df_important.iloc[size:2*size,:].copy()
df_3 = df_important.iloc[2*size:,:].copy()
df_4 = df_unimportant.copy()
However, since we are dealing with very few messages, we cannot proceed with this approach; if you have a larger dataset, you can certainly proceed this way.
df_1.head()
| Embedded_text | Likes | Comments | Retweets | UserName
---|---|---|---|---|---|
0 | It doesn’t matter what a gallon of gas costs-r... | 473.0 | 58.0 | 2,739 | @JJCarafano |
1 | "We're banning all imports of Russian gas, oil... | 424.0 | 70.0 | 2,316 | @Liveuamap |
2 | These are also known as “pump gas with ya gloc... | 109.0 | 7.0 | 109 | @_JuiceJones |
3 | Gas is already at $4.33 a gallon here in DC, i... | 93.0 | 187.0 | 575 | @IntelDoge |
4 | In the search for new measures against becaus... | 80.0 | 24.0 | 534 | @carlbildt |
print("Shape of new dataframes - {} , {}, {}".format(df_1.shape, df_2.shape,df_3.shape))
Shape of new dataframes - (95, 5) , (95, 5), (95, 5)
# Create a sample dataframe with a text index
plotdata = pd.DataFrame(
    {"Twitter Classes": [len(df_1), len(df_2), len(df_3), len(df_4)]},
    index=["Very Important",
           "Fairly Important",
           "Slightly important",
           "Not at all important"])
# Plot a bar chart
plotdata.plot(kind="bar")
But for simplicity we choose only two classes: important and unimportant messages.
Step 6. Let us consider only two classes: important and unimportant
df_1['label'] = 0
df_2['label'] = 0
df_3['label'] = 1
df_4['label'] = 1
df_3.head()
| Embedded_text | Likes | Comments | Retweets | UserName | label
---|---|---|---|---|---|---|
190 | Replying to \n@janjowen\nWhat if I want to rem... | 0.0 | 2.0 | 1 | @mls1776 | 1 |
191 | Replying to \n@schlthss\nEven our EVs depend o... | 0.0 | 1.0 | 1 | @sogand_karbalai | 1 |
192 | Replying to \n@baecariss\nAll gas no break! \n... | 0.0 | 1.0 | 1 | @Team_Newby1 | 1 |
193 | Replying to \n@tharealcatmom\nCramps and gas. ... | 0.0 | 1.0 | 1 | @itstifftiara | 1 |
194 | Replying to \n@Jim__840\n and \n@Arriadna\nThi... | 0.0 | 1.0 | 1 | @sonofnels | 1 |
The data is slightly unbalanced.
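A quick sanity check, not in the original notebook, makes the imbalance explicit by counting the two classes before concatenating:
# label 0 = important, label 1 = unimportant
print("label 0:", len(df_1) + len(df_2))   # 95 + 95  = 190
print("label 1:", len(df_3) + len(df_4))   # 95 + 135 = 230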
# concatenating the four dataframes along rows
df_twitter = pd.concat([df_1, df_2,df_3,df_4], axis=0)
df_twitter.rename(columns = {'Embedded_text':'tweet'}, inplace = True)
df_twitter.to_csv('twitter_gas.csv', index=False)
# Load the data
tweets_df = pd.read_csv('twitter_gas.csv')
tweets_df.shape
(420, 6)
tweets_df.head()
| tweet | Likes | Comments | Retweets | UserName | label
---|---|---|---|---|---|---|
0 | It doesn’t matter what a gallon of gas costs-r... | 473.0 | 58.0 | 2,739 | @JJCarafano | 0 |
1 | "We're banning all imports of Russian gas, oil... | 424.0 | 70.0 | 2,316 | @Liveuamap | 0 |
2 | These are also known as “pump gas with ya gloc... | 109.0 | 7.0 | 109 | @_JuiceJones | 0 |
3 | Gas is already at $4.33 a gallon here in DC, i... | 93.0 | 187.0 | 575 | @IntelDoge | 0 |
4 | In the search for new measures against becaus... | 80.0 | 24.0 | 534 | @carlbildt | 0 |
tweets_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tweet 420 non-null object
1 Likes 420 non-null float64
2 Comments 420 non-null float64
3 Retweets 420 non-null object
4 UserName 420 non-null object
5 label 420 non-null int64
dtypes: float64(2), int64(1), object(3)
memory usage: 19.8+ KB
# Drop the columns that are no longer needed
tweets_df = tweets_df.drop(['Likes', 'Comments', 'Retweets', 'UserName'], axis=1)
tweets_df.head()
| tweet | label
---|---|---|
0 | It doesn’t matter what a gallon of gas costs-r... | 0 |
1 | "We're banning all imports of Russian gas, oil... | 0 |
2 | These are also known as “pump gas with ya gloc... | 0 |
3 | Gas is already at $4.33 a gallon here in DC, i... | 0 |
4 | In the search for new measures against becaus... | 0 |
tweets_df.hist(bins = 30, figsize = (13,5), color = 'r')
sns.countplot(tweets_df['label'], label = "Count")
# Let's get the length of the messages
tweets_df['length'] = tweets_df['tweet'].apply(len)
tweets_df.head()
| tweet | label | length
---|---|---|---|
0 | It doesn’t matter what a gallon of gas costs-r... | 0 | 131 |
1 | "We're banning all imports of Russian gas, oil... | 0 | 197 |
2 | These are also known as “pump gas with ya gloc... | 0 | 67 |
3 | Gas is already at $4.33 a gallon here in DC, i... | 0 | 125 |
4 | In the search for new measures against becaus... | 0 | 284 |
tweets_df.describe()
| label | length
---|---|---|
count | 420.000000 | 420.000000 |
mean | 0.547619 | 178.990476 |
std | 0.498321 | 116.781234 |
min | 0.000000 | 20.000000 |
25% | 0.000000 | 93.750000 |
50% | 1.000000 | 148.500000 |
75% | 1.000000 | 258.000000 |
max | 1.000000 | 677.000000 |
# Let's see the shortest message
tweets_df[tweets_df['length'] == 20]['tweet'].iloc[0]
'Premium gas is 5.30?'
# Let's view the message with mean length
tweets_df[tweets_df['length'] == 178]['tweet'].iloc[0]
'The EU spent almost 13 million EUR on Russian coal, oil, and gas since the Kremlin’s war on Ukraine began on February 24. \n\nStop financing Putin’s war machine! #StopPutin\n5\n41\n88'
# Plot the histogram of the length column
tweets_df['length'].plot(bins=100, kind='hist')
# With only two labels, the first two classes coincide (label 0 = important)
class1 = tweets_df[tweets_df['label']==0]
class2 = tweets_df[tweets_df['label']==0]
# and the last two coincide as well (label 1 = unimportant)
class3 = tweets_df[tweets_df['label']==1]
class4 = tweets_df[tweets_df['label']==1]
sentences = tweets_df['tweet'].tolist()
len(sentences)
420
sentences_as_one_string =" ".join(sentences)
sentences_as_one_string =sentences_as_one_string.replace('Replying', '')
from wordcloud import WordCloud
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
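Common English filler words still dominate the cloud at this point. As an optional refinement (not in the original notebook), wordcloud ships its own STOPWORDS set that you can pass in to filter them out:
from wordcloud import WordCloud, STOPWORDS
# optional: exclude common English stopwords from the cloud
wc = WordCloud(stopwords=set(STOPWORDS), width=800, height=400)
plt.figure(figsize=(20,20))
plt.imshow(wc.generate(sentences_as_one_string))
plt.axis('off')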
Step 7. Perform data cleaning - remove punctuation from text
Let us summarize some standard techniques to clean our text
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Test = '$I love AI & Machine learning!!'
Test_punc_removed = [char for char in Test if char not in string.punctuation]
Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join
'I love AI Machine learning'
Test = 'Good morning beautiful people :)... I am having fun learning Machine learning and AI!!'
Test_punc_removed = [char for char in Test if char not in string.punctuation]
# Join the characters again to form the string.
Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join
'Good morning beautiful people I am having fun learning Machine learning and AI'
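As a side note, the same punctuation stripping can be done in a single call with str.translate, which is usually faster than the character-by-character list comprehension:
# equivalent one-liner using a translation table
Test_no_punc = Test.translate(str.maketrans('', '', string.punctuation))
print(Test_no_punc)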
Perform Data Cleaning - Remove Stopwords
import nltk # Natural Language tool kit
nltk.download('stopwords')
# You have to download stopwords Package to execute this command
from nltk.corpus import stopwords
#stopwords.words('english')
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\rusla\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Test_punc_removed_join = 'I enjoy coding, programming and Artificial intelligence'
Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
Test_punc_removed_join_clean # Only important (no so common) words are left
['enjoy', 'coding,', 'programming', 'Artificial', 'intelligence']
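One practical note: stopwords.words('english') rebuilds the list on every call, so when cleaning hundreds of tweets it is worth caching it once as a set (a small optimization, not in the original notebook):
# cache the stopword list as a set for fast membership tests
stop_words = set(stopwords.words('english'))
Test_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stop_words]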
Perform Count Vectorization (Tokenization)
from sklearn.feature_extraction.text import CountVectorizer
sample_data = ['This is the first paper.','This document is the second paper.','And this is the third one.','Is this the first paper?']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)
print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'paper', 'second', 'the', 'third', 'this']
print(X.toarray())
[[0 0 1 1 0 1 0 1 0 1]
[0 1 0 1 0 1 1 1 0 1]
[1 0 0 1 1 0 0 1 1 1]
[0 0 1 1 0 1 0 1 0 1]]
Removing any URL within a string in Python
text='''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/'
The White House Gaslights on Gas | by @njhochman https://nationalreview.com/corner/the-white-house-gaslights-on-gas/?taid=6228b39e07024b000156be08&utm_campaign=trueanthem&utm_medium=social&utm_source=twitter…
'''
import re
text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)
print(text)
text1
text2
text3
text4
text5
text6
The White House Gaslights on Gas | by @njhochman
Step 8. Create a pipeline to remove Punctuations, Stopwords and perform Count Vectorization
Let’s define a pipeline to clean up all the messages. The pipeline performs the following:
- remove URLs
- remove punctuation
- remove stopwords
import re
def message_cleaning(message):
    # remove URLs
    message = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', message)
    # remove punctuation
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    # remove stopwords
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean
# Let's test the newly added function
tweets_df_clean = tweets_df['tweet'].apply(message_cleaning)
print(tweets_df_clean[5]) # show the cleaned up version
['Good', 'piece', 'I’ll', 'add', 'saw', 'thread', 'yesterday', '3', 'legacy', 'media', 'reporters', 'total', 'agreement', 'there’s', 'nothing', 'get', 'gas', 'ground', 'right', 'One', 'said', 'straight', 'face', '“even', 'permit', 'issued', 'takes', 'long', 'time', 'drill”', 'National', 'Review', 'NRO', '·', 'Mar', '9', 'White', 'House', 'Gaslights', 'Gas', 'njhochman', '10', '55', '199']
print(tweets_df['tweet'][5]) # show the original version
Good piece. I’ll add that i saw a thread yesterday with 3 legacy media reporters in total agreement that there’s nothing we can do to get more gas out of the ground right now. One of them said with a straight face “even after the permit is issued, it takes a long time to drill”
National Review
@NRO
· Mar 9
The White House Gaslights on Gas | by @njhochman https://nationalreview.com/corner/the-white-house-gaslights-on-gas/?taid=6228b39e07024b000156be08&utm_campaign=trueanthem&utm_medium=social&utm_source=twitter…
10
55
199
from sklearn.feature_extraction.text import CountVectorizer
# Use the cleaning pipeline we defined earlier as the analyzer
vectorizer = CountVectorizer(analyzer = message_cleaning, dtype = np.uint8)
tweets_countvectorizer = vectorizer.fit_transform(tweets_df['tweet'])
print(vectorizer.get_feature_names())
print(tweets_countvectorizer.toarray())
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
tweets_countvectorizer.shape
(420, 3463)
X = pd.DataFrame(tweets_countvectorizer.toarray())
X.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3453 | 3454 | 3455 | 3456 | 3457 | 3458 | 3459 | 3460 | 3461 | 3462
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 3463 columns
y = tweets_df['label']
Step 9. Train and evaluate a Naive Bayes Classifier Model
In statistics, naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve high accuracy levels.
The fundamental Naive Bayes assumption is that each feature makes an:
- independent
- equal
contribution to the outcome.
In relation to our dataset, this concept can be understood as follows:
- We assume that no pair of features is dependent. For example, the presence of the word ‘price’ in a tweet is assumed to have nothing to do with the presence of the word ‘oil’. Hence, the features are assumed to be independent.
- Secondly, each feature is given the same weight (or importance). For example, knowing the count of ‘price’ or ‘oil’ alone cannot predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.
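Put compactly, for a tweet with word features x_1, …, x_n the classifier picks the label y that maximizes the posterior under this independence assumption:
P(y | x_1, …, x_n) ∝ P(y) · P(x_1 | y) · … · P(x_n | y)
where, in our case, the x_i are the word counts produced by the vectorizer and y is the important/unimportant label.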
X.shape
(420, 3463)
y.shape
(420,)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)
MultinomialNB()
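As a side note, the dense DataFrame X built above is convenient for inspection but not strictly required: MultinomialNB also accepts the sparse matrix produced by CountVectorizer directly, which saves memory on larger scrapes. A sketch of that alternative (not what the notebook does):
# hypothetical alternative: train directly on the sparse count matrix
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    tweets_countvectorizer, tweets_df['label'], test_size=0.2)
MultinomialNB().fit(Xs_train, ys_train)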
from sklearn.metrics import classification_report, confusion_matrix
# Predicting the Test set results
y_predict_test = NB_classifier.predict(X_test)
fig, ax = plt.subplots(figsize=(5,5))
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot=True)
plt.xlabel('Predicted',fontsize=14)
plt.ylabel('True',fontsize=14)
Text(24.0, 0.5, 'True')
print(classification_report(y_test, y_predict_test))
              precision    recall  f1-score   support

           0       0.69      0.71      0.70        35
           1       0.79      0.78      0.78        49

    accuracy                           0.75        84
   macro avg       0.74      0.74      0.74        84
weighted avg       0.75      0.75      0.75        84
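Finally, the fitted model can score new, unseen text; the message below is a made-up example:
# hypothetical new tweet, cleaned and vectorized with the same pipeline
sample = vectorizer.transform(["Gas prices near the White House keep climbing"])
print(NB_classifier.predict(sample.toarray()))   # 0 = important, 1 = unimportant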
You can download the notebook here.
Congratulations! You have created a dataset from Twitter and analyzed which messages are relevant according to a given keyword around a geographical area.