Grouping Pandas DataFrame by location

I have a DataFrame whose rows I want to aggregate by the Location column (US states).

The Location column has the following format:

      Location  
0                       Texas, USA  
1                Middle of nowhere  
2                              NaN  
3                   Largo, Florida  
4                              NaN  
5                         Indiana   
6                       Upstate NY  
7     People's Republic of Chicago  
8               South Florida, USA  
9                       Texas, USA  
10                             NaN  
11                             NaN  
12                  Cardiff, Wales  
13                             NaN  
14                   Long Beach CA  
15                           Texas  
16                             NaN  
17     WithLove StandingWithIsrael  
18  Suffolk , Lake Ronkonkoma , NY  
19                   Illinois, USA  

Any tweet whose location does not identify a US location, such as Middle of nowhere, WithLove StandingWithIsrael, or NaN, will be treated as a missing value.

The real problem comes when filtering tweets by location, because the column does not follow a standard format. For example, tweets from Texas may appear as Texas USA, Tx, Texas, or Austin, Texas. How do I normalize the location so that it is easier to filter by US state? Any help would be greatly appreciated.
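
One possible starting point, as a minimal sketch rather than a complete solution: map each free-text location to a two-letter state code with a lookup table, then group on the new column. The names STATE_ABBREV, normalize_state, and the State column are my own and purely illustrative, the state table is a deliberately small subset, and the sample rows are made up from the formats mentioned above.

```python
import re

import numpy as np
import pandas as pd

# Illustrative subset only; a real table would list all 50 states (plus DC).
STATE_ABBREV = {
    "texas": "TX",
    "florida": "FL",
    "indiana": "IN",
    "new york": "NY",
    "california": "CA",
    "illinois": "IL",
}
VALID_CODES = set(STATE_ABBREV.values())

# Longest names first, so a longer name is tried before any shorter name it contains.
STATE_NAMES = sorted(STATE_ABBREV, key=len, reverse=True)


def normalize_state(location):
    """Return a two-letter state code, or NaN if no US state is recognized."""
    if not isinstance(location, str):
        return np.nan  # NaN and other non-string values stay missing

    # 1. Full state names, matched case-insensitively on word boundaries.
    lowered = location.lower()
    for name in STATE_NAMES:
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return STATE_ABBREV[name]

    # 2. Standalone two-letter tokens starting with a capital ("NY", "Tx").
    #    Assumption: lowercase tokens such as "of" or "in" are ordinary words.
    #    A capitalized word like "In" would still be read as Indiana.
    for token in re.findall(r"\b[A-Z][A-Za-z]\b", location):
        if token.upper() in VALID_CODES:
            return token.upper()

    return np.nan


# Made-up sample rows using the formats from the question.
df = pd.DataFrame({
    "Location": [
        "Texas, USA", "Middle of nowhere", np.nan, "Largo, Florida",
        "Upstate NY", "Tx", "Austin, Texas", "Cardiff, Wales",
    ]
})
df["State"] = df["Location"].apply(normalize_state)

# groupby drops NaN keys by default, so unrecognized locations are excluded.
counts = df.groupby("State").size()
```

Strings that name only a city, such as People's Republic of Chicago, stay missing under this approach; handling them would need a separate city-to-state table or a geocoding step.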
