Introduction to Data Science & Artificial Intelligence

What is it?

Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years. - Andrew Ng

I’d like to offer an alternative metaphor: machine learning has become alchemy. - Ali Rahimi

There's an anguish in the field ... Many of us feel like we're operating on an alien technology. - Ali Rahimi

If you’re arguing against AI, then you’re arguing against safer cars that aren’t going to have accidents, and you’re arguing against being able to better diagnose people when they’re sick. - Mark Zuckerberg

I believe AI is one of the more defining technologies of our time. - Satya Nadella

Is data science magic?

No. A data scientist is just a statistician who can program.

What do we do with it?

Descriptive Data Science

Descriptive statistics [...] summarize a given data set, which can be either a representation of the entire population or a sample of a population.
- Investopedia

Predictive Data Science

Predictive analytics is the use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. The goal is to go beyond knowing what has happened to providing a best assessment of what will happen in the future.
- SAS
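
To make the distinction concrete, here's a tiny sketch (my own toy example, not from either source) of both on a made-up set of view counts:

import numpy as np

views = np.array([100, 150, 210, 260, 330])  # hypothetical daily view counts

# Descriptive: summarize what has already happened
print(views.mean(), views.std())

# Predictive: fit a trend line to the history and extrapolate one day ahead
days = np.arange(len(views))
slope, intercept = np.polyfit(days, views, 1)
print(slope * len(views) + intercept)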

Applications of Data Science

  • Waze
  • Social media
  • Bioinformatics
    • AlphaFold
  • Recommender systems
    • Amazon
    • Search engines (Google, Bing)

How do we do it?

Tools

  • Scripting
    • Python
    • R, Julia
  • Data processing/storage
    • SQL
      • Google BigQuery
    • Spark

Tools (cont'd)

  • Libraries, Packages, Frameworks
    • Pandas + numpy/tidyverse
    • Scikit-learn
    • Tensorflow / PyTorch
  • Exploration / Sharing
    • Jupyter Notebook
    • Reveal.js
    • Kaggle

Intro to Pandas

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

Assignment 4

Here's a brief demonstration of why you might be interested in using pandas.

This is all the code necessary to load our .csv file. We can take a peek at our data using the df.head() method.

In [3]:
crime_df = pd.read_csv("A4/crime.csv")
crime_df.head()
Out[3]:
INCIDENT_ID OFFENSE_ID OFFENSE_CODE OFFENSE_CODE_EXTENSION OFFENSE_TYPE_ID OFFENSE_CATEGORY_ID FIRST_OCCURRENCE_DATE LAST_OCCURRENCE_DATE REPORTED_DATE INCIDENT_ADDRESS GEO_X GEO_Y GEO_LON GEO_LAT DISTRICT_ID PRECINCT_ID NEIGHBORHOOD_ID IS_CRIME IS_TRAFFIC
0 2016376978 2016376978521300 5213 0 weapon-unlawful-discharge-of all-other-crimes 6/15/2016 11:31:00 PM NaN 6/15/2016 11:31:00 PM NaN 3193983.00 1707251.00 -104.81 39.77 5 521 montbello 1 0
1 20186000994 20186000994239900 2399 0 theft-other larceny 10/11/2017 12:30:00 PM 10/11/2017 4:55:00 PM 1/29/2018 5:53:00 PM NaN 3201943.00 1711852.00 -104.78 39.79 5 522 gateway-green-valley-ranch 1 0
2 20166003953 20166003953230500 2305 0 theft-items-from-vehicle theft-from-motor-vehicle 3/4/2016 8:00:00 PM 4/25/2016 8:00:00 AM 4/26/2016 9:02:00 PM 2932 S JOSEPHINE ST 3152762.00 1667011.00 -104.96 39.66 3 314 wellshire 1 0
3 201872333 201872333239900 2399 0 theft-other larceny 1/30/2018 7:20:00 PM NaN 1/30/2018 10:29:00 PM 705 S COLORADO BLVD 3157162.00 1681320.00 -104.94 39.70 3 312 belcaro 1 0
4 2017411405 2017411405230300 2303 0 theft-shoplift larceny 6/22/2017 8:53:00 PM NaN 6/23/2017 4:09:00 PM 2810 E 1ST AVE 3153211.00 1686545.00 -104.96 39.72 3 311 cherry-creek 1 0

Here's the second .csv:

In [4]:
code_df = pd.read_csv("A4/offense_codes.csv")
code_df.head()
Out[4]:
OFFENSE_CODE OFFENSE_CODE_EXTENSION OFFENSE_TYPE_ID OFFENSE_TYPE_NAME OFFENSE_CATEGORY_ID OFFENSE_CATEGORY_NAME IS_CRIME IS_TRAFFIC
0 2804 1 stolen-property-possession Possession of stolen property all-other-crimes All Other Crimes 1 0
1 2804 2 fraud-possess-financial-device Possession of a financial device all-other-crimes All Other Crimes 1 0
2 2901 0 damaged-prop-bus Damaged business property public-disorder Public Disorder 1 0
3 2902 0 criminal-mischief-private Criminal mischief to private property public-disorder Public Disorder 1 0
4 2903 0 criminal-mischief-public Criminal mischief to public property public-disorder Public Disorder 1 0

Question 1

Write a function code_to_names(code) that takes an int crime code and returns a list of strings where each string contains the code extension and the full name of the crime.

In [5]:
def code_with_extension(row):
    return f"{row.OFFENSE_CODE_EXTENSION}: {row.OFFENSE_TYPE_NAME}"

def code_to_names(code):
    code_filtered = code_df[code_df.OFFENSE_CODE == code]
    return code_filtered.apply(code_with_extension, axis=1).tolist()

code_to_names(2804)
Out[5]:
['1: Possession of stolen property',
 '2: Possession of a financial device',
 '0: Recovered vehicle stolen outside Denver']

Question 2

Write a function code_from_keywords that takes a list of keywords and returns a list of all the crime codes whose descriptions contain any of the keywords.

In [6]:
def code_from_keywords(keywords):
    match_string = "|".join(keywords).lower()  # regex alternation: match any keyword
    print(match_string)

    kw_df = code_df[code_df.OFFENSE_TYPE_NAME.str.lower()
                    .str.contains(match_string, regex=True)]
    return kw_df.OFFENSE_CODE.tolist()
    
code_from_keywords(["stolen", "violent"])
stolen|violent
Out[6]:
[2804, 2402, 2801, 2803, 2804]

Question 3

Write a function crimes_by_code(code) that returns a list of all the crimes with the given code that have occurred. Each element of the return list is a tuple having the crime id, the abbreviated crime name, the date and time, and neighborhood id of the crime. (The pandas version below simply returns those columns as a data frame instead of building tuples.)

In [7]:
def crimes_by_code(code):
    selected_cols = ["INCIDENT_ID", "OFFENSE_TYPE_ID", \
                     "REPORTED_DATE", "NEIGHBORHOOD_ID"]
    return crime_df[crime_df.OFFENSE_CODE == code][selected_cols]

result = crimes_by_code(2399)
result.head()
Out[7]:
INCIDENT_ID OFFENSE_TYPE_ID REPORTED_DATE NEIGHBORHOOD_ID
1 20186000994 theft-other 1/29/2018 5:53:00 PM gateway-green-valley-ranch
3 201872333 theft-other 1/30/2018 10:29:00 PM belcaro
30 201872460 theft-other 1/30/2018 10:11:00 PM harvey-park-south
44 2016229783 theft-other 4/13/2016 6:02:00 PM windsor
53 20186000998 theft-other 1/29/2018 7:44:00 PM platt-park

YouTube Dataset

The Data

In [8]:
df = pd.read_csv('USvideos.csv')
df.head()
Out[8]:
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 2kyS6SvSYSE 17.14.11 WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat 22 2017-11-13T17:13:01.000Z SHANtell martin 748374 57527 2966 15954 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg False False False SHANTELL'S CHANNEL - https://www.youtube.com/s...
1 1ZAPwfrtAFY 17.14.11 The Trump Presidency: Last Week Tonight with J... LastWeekTonight 24 2017-11-13T07:30:00.000Z last week tonight trump presidency|"last week ... 2418783 97185 6146 12703 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg False False False One year after the presidential election, John...
2 5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146033 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3 puqaWrEC7tY 17.14.11 Nickelback Lyrics: Real or Fake? Good Mythical Morning 24 2017-11-13T11:00:04.000Z rhett and link|"gmm"|"good mythical morning"|"... 343168 10172 666 2146 https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg False False False Today we find out if Link is a Nickelback amat...
4 d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095731 132235 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho...

As we have some time-based data, we should convert these columns to datetime objects.

In [9]:
# http://strftime.org/
df.trending_date = pd.to_datetime(df.trending_date, format="%y.%d.%m") 
df.publish_time = pd.to_datetime(df.publish_time, format="%Y-%m-%dT%H:%M:%S.%fZ")

The df.describe() method lets us view summary statistics about our data.

In [10]:
df.describe()
Out[10]:
category_id views likes dislikes comment_count
count 40949.00 40949.00 40949.00 40949.00 40949.00
mean 19.97 2360784.64 74266.70 3711.40 8446.80
std 7.57 7394113.76 228885.34 29029.71 37430.49
min 1.00 549.00 0.00 0.00 0.00
25% 17.00 242329.00 5424.00 202.00 614.00
50% 24.00 681861.00 18091.00 631.00 1856.00
75% 25.00 1823157.00 55417.00 1938.00 5755.00
max 43.00 225211923.00 5613827.00 1674420.00 1361580.00

Data Manipulation

Let's add a column for the like percentage of each video. Note how operators can be used directly on columns; they are applied element-wise. (For videos with ratings disabled, likes + dislikes is 0, so 0/0 produces NaN here.)

In [11]:
df["like_pct"] = df.likes / (df.likes + df.dislikes)

If we only want to view a subset of the data frame, we can use a list of column names as our index.

In [12]:
selected_columns = ["title", "channel_title", "views", "likes", "dislikes", "like_pct", "video_id"]
df[selected_columns].head(10)
Out[12]:
title channel_title views likes dislikes like_pct video_id
0 WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat 748374 57527 2966 0.95 2kyS6SvSYSE
1 The Trump Presidency: Last Week Tonight with J... LastWeekTonight 2418783 97185 6146 0.94 1ZAPwfrtAFY
2 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 3191434 146033 5339 0.96 5qpjK5DgCt4
3 Nickelback Lyrics: Real or Fake? Good Mythical Morning 343168 10172 666 0.94 puqaWrEC7tY
4 I Dare You: GOING BALD!? nigahiga 2095731 132235 1989 0.99 d380meD0W0M
5 2 Weeks with iPhone X iJustine 119180 9763 511 0.95 gHZ1Qz0KiKM
6 Roy Moore & Jeff Sessions Cold Open - SNL Saturday Night Live 2103417 15993 2445 0.87 39idVpFF7NQ
7 5 Ice Cream Gadgets put to the Test CrazyRussianHacker 817732 23663 778 0.97 nc99ccSXST0
8 The Greatest Showman | Official Trailer 2 [HD]... 20th Century Fox 826059 3543 119 0.97 jr9QtXwC9vc
9 Why the rise of the robots won’t mean the end ... Vox 256426 12654 1363 0.90 TUmyygCMMGA

Let's take a look at the most viewed videos. We can do this with the df.sort_values() method.

In [13]:
df[selected_columns].sort_values("views", ascending=False).head()
Out[13]:
title channel_title views likes dislikes like_pct video_id
38547 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 225211923 5023450 343541 0.94 VYOjWnS4cMY
38345 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 220490543 4962403 338105 0.94 VYOjWnS4cMY
38146 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 217750076 4934188 335462 0.94 VYOjWnS4cMY
37935 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 210338856 4836448 326902 0.94 VYOjWnS4cMY
37730 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 205643016 4776680 321493 0.94 VYOjWnS4cMY

What does this tell us about our data? Is it wrong? Duplicated?
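
It isn't wrong: the dataset has one row per video per day it trended, so these are daily snapshots of the same video. If we only wanted each video's most recent snapshot, here's a sketch using drop_duplicates (my addition, not part of the original demo):

# Keep only the latest trending snapshot for each video
latest = df.sort_values("trending_date").drop_duplicates("video_id", keep="last")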

A Closer Look at Gambino's Music Video

We can filter our data to view only this video. The df.shape attribute tells us the size of our data frame - in this case, 25 rows by 17 columns.

In [14]:
gambino_df = df[df.video_id == "VYOjWnS4cMY"]
print(gambino_df.shape)
gambino_df.head(3)
(25, 17)
Out[14]:
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description like_pct
33351 VYOjWnS4cMY 2018-05-08 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 10 2018-05-06 04:00:07 Childish Gambino|"Rap"|"This Is America"|"mcDJ... 31648454 1405355 51547 149473 https://i.ytimg.com/vi/VYOjWnS4cMY/default.jpg False False False “This is America” by Childish Gambino http://s... 0.96
33557 VYOjWnS4cMY 2018-05-09 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 10 2018-05-06 04:00:07 Childish Gambino|"Rap"|"This Is America"|"mcDJ... 47169016 1841540 79717 194822 https://i.ytimg.com/vi/VYOjWnS4cMY/default.jpg False False False “This is America” by Childish Gambino http://s... 0.96
33772 VYOjWnS4cMY 2018-05-10 Childish Gambino - This Is America (Official V... ChildishGambinoVEVO 10 2018-05-06 04:00:07 Childish Gambino|"Rap"|"This Is America"|"mcDJ... 60776509 2183732 104377 232723 https://i.ytimg.com/vi/VYOjWnS4cMY/default.jpg False False False “This is America” by Childish Gambino http://s... 0.95

It's very easy to graph with pandas. Simply calling df.plot("x", "y") will work.

In [15]:
gambino_df.plot("trending_date", "views")
plt.gca().get_yaxis().get_major_formatter().set_scientific(False) # hide scientific notation on y-axis
plt.show() # renders plot to screen - like pygame.display.update()

There are lots of helpful methods we can apply to the columns of a data frame as well. Here, we can count the number of instances each video appears in our data.

In [16]:
most_trending = df.title.value_counts()
most_trending.head()
Out[16]:
WE MADE OUR MOM CRY...HER DREAM CAME TRUE!                                      30
Mission: Impossible - Fallout (2018) - Official Trailer - Paramount Pictures    29
The ULTIMATE $30,000 Gaming PC Setup                                            29
Sam Smith - Pray (Official Video) ft. Logic                                     29
Why I'm So Scared (being myself and crying too much)                            29
Name: title, dtype: int64

But that's not a data frame! That's a Series - the type pandas uses for columns. We can also convert it back to a data frame:

In [17]:
most_trending = most_trending.to_frame()
most_trending.head()
Out[17]:
title
WE MADE OUR MOM CRY...HER DREAM CAME TRUE! 30
Mission: Impossible - Fallout (2018) - Official Trailer - Paramount Pictures 29
The ULTIMATE $30,000 Gaming PC Setup 29
Sam Smith - Pray (Official Video) ft. Logic 29
Why I'm So Scared (being myself and crying too much) 29

Grouping

Sometimes, we don't want to look at each row individually, but instead at groups of data (like by YouTube channel). For this we can use df.groupby().

In [18]:
trending_channels = df.groupby("channel_title").video_id.size()
trending_channels.head()
Out[18]:
channel_title
12 News                    2
1MILLION Dance Studio     33
1theK (원더케이)              19
20th Century Fox         135
2CELLOS                    2
Name: video_id, dtype: int64
In [19]:
trending_channels = trending_channels.nlargest(10).to_frame()
trending_channels
Out[19]:
video_id
channel_title
ESPN 203
The Tonight Show Starring Jimmy Fallon 197
Netflix 193
TheEllenShow 193
Vox 193
The Late Show with Stephen Colbert 187
Jimmy Kimmel Live 186
Late Night with Seth Meyers 183
Screen Junkies 182
NBA 181

pandas has also recently introduced another helpful way to visualize your data, using df.style.

In [20]:
trending_channels.style.bar()
Out[20]:
video_id
channel_title
ESPN 203
The Tonight Show Starring Jimmy Fallon 197
Netflix 193
TheEllenShow 193
Vox 193
The Late Show with Stephen Colbert 187
Jimmy Kimmel Live 186
Late Night with Seth Meyers 183
Screen Junkies 182
NBA 181

By Category

Here's an example of some more complex data manipulation we might do, using another file to annotate our dataset.

In [21]:
import json

with open("US_category_id.json", "r") as f:
    category_labels = json.load(f)['items']

print(json.dumps(category_labels[:2], indent=2))
[
  {
    "kind": "youtube#videoCategory",
    "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/Xy1mB4_yLrHy_BmKmPBggty2mZQ\"",
    "id": "1",
    "snippet": {
      "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
      "title": "Film & Animation",
      "assignable": true
    }
  },
  {
    "kind": "youtube#videoCategory",
    "etag": "\"m2yskBQFythfE4irbTIeOgYYfBU/UZ1oLIIz2dxIhO45ZTFR3a3NyTA\"",
    "id": "2",
    "snippet": {
      "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
      "title": "Autos & Vehicles",
      "assignable": true
    }
  }
]

The df.column.apply method works like a for loop that builds a new column:

new_values = [my_function(value) for value in df.column] # pseudocode
df.column = new_values

This is the same as:

df.column = df.column.apply(my_function)
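
As a quick runnable illustration (my addition) on a throwaway Series:

s = pd.Series([1, 2, 3])
s.apply(lambda x: x * 10)  # 0: 10, 1: 20, 2: 30 - a new Series; s itself is unchanged
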
In [22]:
def match_category_label(cat_id):
    # referencing a global object - bad!
    for row in category_labels:
        if int(row['id']) == cat_id:
            return row['snippet']['title']

df['category'] = df.category_id.apply(match_category_label)
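
Since match_category_label re-scans the whole list for every row, a cleaner and faster alternative (my sketch, not what the notebook does) is to build a dict once and use the column's map method:

# Build an id -> title lookup once, then map it over the whole column
id_to_title = {int(item['id']): item['snippet']['title'] for item in category_labels}
df['category'] = df.category_id.map(id_to_title)
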
In [23]:
df.groupby("category").views.sum().nlargest(10).plot.bar()
plt.gca().get_yaxis().get_major_formatter().set_scientific(False)
plt.show()

There are other ways to group our data as well. We can group by time, for example.

In [24]:
time_grouper = pd.DatetimeIndex(df.publish_time)

best_post_times = df.groupby(time_grouper.hour)
best_post_times.views.mean().plot()

plt.gca().get_yaxis().get_major_formatter().set_scientific(False)
plt.show()
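
An equivalent and slightly more idiomatic spelling (my note) uses the .dt accessor, which exposes datetime components directly on the column:

# Same grouping without building a separate DatetimeIndex
best_post_times = df.groupby(df.publish_time.dt.hour)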

Why might the most successful videos be posted so early in the morning? (Keep in mind that publish_time is in UTC - note the Z suffix in the raw data - so these hours are shifted relative to US local time.)

It appears as though most videos are published early evening.

In [25]:
best_post_times.video_id.count().plot()

plt.gca().get_yaxis().get_major_formatter().set_scientific(False)
plt.show()

Brief Introduction to Natural Language Processing

(with a bit of deep learning)

"Tokenizing" means splitting our text into individual words, sentences, phrases, or even just letters - all referred to as "tokens."

In [26]:
from big_data_scripts.nlp import *

tokenized = tokenize(df['title'])
tokenized.to_frame().head()
Out[26]:
title
0 [we, want, to, talk, about, our, marriage]
1 [the, trump, presidency, last, week, tonight, ...
2 [racist, superman, rudy, mancuso, king, bach, ...
3 [nickelback, lyrics, real, or, fake]
4 [i, dare, you, going, bald]

An $n$-gram is a $(\mathtt{context}, \mathtt{target})$ pair of words, where the size of $\mathtt{context}$ is $n - 1$.

For example, (("yer", "a"), "wizard") represents that the word "wizard" is preceded by the two context words "yer" and "a". As there are $3$ words in question, we call this a trigram; with $2$ words it would be a bigram, with $1$ a unigram, and so on.

Even simply looking at trigram frequency can give a great look at what people are watching!
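
make_trigrams is also from the course scripts; a minimal version (my sketch) over a single token list could look like:

def trigrams_from_tokens(tokens):
    # Build ((w1, w2), w3) context/target pairs from consecutive words
    return [((tokens[i], tokens[i + 1]), tokens[i + 2])
            for i in range(len(tokens) - 2)]

trigrams_from_tokens(["yer", "a", "wizard", "harry"])
# [(('yer', 'a'), 'wizard'), (('a', 'wizard'), 'harry')]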

In [27]:
from collections import Counter

trigrams = make_trigrams(tokenized)

gram_df = pd.DataFrame(Counter(trigrams).most_common(10), columns=["trigram", "count"])
gram_df.style.bar()
Out[27]:
trigram count
0 (('official', 'music'), 'video') 441
1 (('official', 'trailer'), 'hd') 422
2 (('how', 'to'), 'make') 281
3 (('the', 'last'), 'jedi') 201
4 (('official', 'video'), 'ft') 179
5 (('official', 'lyric'), 'video') 154
6 (('avengers', 'infinity'), 'war') 147
7 (('official', 'teaser'), 'trailer') 142
8 (('trailer', 'hd'), 'netflix') 135
9 (('star', 'wars'), 'the') 135

Neural Networks and Deep Learning - Practical Examples

What is a vector?

What is a Neural Network?

Predicting Handwritten Digits

Can we teach a computer to recognize these digits?

(Source 3:35 - 4:13)

In [28]:
from IPython.display import IFrame
IFrame('https://streamable.com/s/74csi/plahrb', width=800, height=650)
Out[28]:

Neural Networks with YouTube Data

After tokenizing our text data, we need to represent it in a way a computer can understand. To do this, we assign each token (in our case, each word) a unique number.
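
The Vocab class is from the course scripts; here is a minimal sketch of the same idea (the real one evidently reserves the first few indices, likely for special tokens - as the output below shows, 'we' starts at 3 - which I've left out):

class SimpleVocab:
    def __init__(self):
        self.word2index = {}

    def addList(self, words):
        # Assign the next unused integer to each word we haven't seen yet
        for word in words:
            if word not in self.word2index:
                self.word2index[word] = len(self.word2index)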

In [29]:
from big_data_scripts.neural_network import *
from big_data_scripts.vocab import *

vocab = Vocab()
tokenized.apply(vocab.addList)

list(vocab.word2index.items())[:5]
Out[29]:
[('we', 3), ('want', 4), ('to', 5), ('talk', 6), ('about', 7)]

From here, we can feed these numbers (in trigram form) into a "neural network," which you may have heard about.

Note: Here, you see the network trained once on 25 examples. In reality, I trained it on about 92,000 examples for 6 epochs, and it still wasn't very good! That's why the big companies spend so much computing power on this stuff.
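
YoutubeNeuralNetwork lives in the course scripts, so here is only a rough PyTorch sketch of what such a trigram model might look like - an assumption on my part, not the actual implementation: embed both context words, concatenate them, and score every vocabulary word as the possible target.

import torch
import torch.nn as nn

class TrigramModelSketch(nn.Module):
    def __init__(self, vocab_size, embedding_dim=200):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(2 * embedding_dim, vocab_size)

    def forward(self, context):
        # context: tensor of 2 word indices -> log-probabilities over the vocab
        embeds = self.embeddings(context).view(1, -1)
        return torch.log_softmax(self.linear(embeds), dim=1)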

In [30]:
FancyNN = YoutubeNeuralNetwork(vocab)
FancyNN.train(EPOCHS=1, BATCH_SIZE=25)
2019-03-13 12:17:26.507 | INFO     | big_data_scripts.neural_network:train:52 - Begin training
2019-03-13 12:17:26.983 | INFO     | big_data_scripts.neural_network:train:85 - Epoch 1/1: 9.252068862915038
2019-03-13 12:17:26.983 | INFO     | big_data_scripts.neural_network:save_progress:91 - Saving model checkpoint to file: YT_NN_EP1_TEP1_BS25.torch

Out[30]:
[9.252068862915038]

Loading the saved model, we can take a glimpse at our data. This is our numeric representation of the word "youtube."

In [31]:
model = torch.load("YT_NN_FINAL.torch")

print(model['youtube'].shape)
model['youtube']
torch.Size([1, 200])
Out[31]:
tensor([[-1.9406, -0.0120, -0.3133,  0.0053,  0.5015, -1.5472,  2.1686, -0.4759,
         -0.4145,  0.3926,  2.1765, -0.8619, -0.4588, -0.1544,  0.2805, -1.6433,
         -0.6072,  0.3850,  2.7858, -0.8609, -1.9652, -0.4307,  0.7734, -0.0118,
          0.4662,  0.7570, -0.1216, -2.5097, -0.4403,  0.7610, -0.2263, -0.5579,
          0.8442,  1.8795,  1.3703, -0.1387, -1.2866, -1.2277,  0.9556,  0.1192,
         -1.0888,  2.7796,  1.2900,  1.0276, -1.1250,  3.8281,  0.8798, -0.7008,
          0.5990, -1.5850, -0.0440,  0.2456,  1.6168,  1.2200, -1.9468,  0.6593,
          1.1960,  0.4593, -0.6702, -1.8451,  2.2054,  1.5364, -1.5712,  1.7310,
          1.8256, -0.1310, -0.4584, -0.7093,  0.1162,  1.6298,  0.1817, -0.1406,
         -0.3664,  1.0069, -2.2419,  0.2265, -0.6465, -0.7956, -0.1883, -0.7869,
         -0.7542,  1.1626,  1.1267, -0.0078,  1.9863, -3.0799, -0.2717, -2.1401,
          0.5000,  0.1797,  0.5621, -0.5052, -1.0213, -1.1396,  0.3706,  0.2873,
         -1.0374, -1.0707,  1.6861,  1.1170, -0.0438, -0.5695,  0.4232, -0.1756,
          0.5864, -0.5788, -0.5489, -0.9428, -1.1567, -0.8392,  0.2723, -0.6071,
         -0.6823,  0.9999, -0.0997,  0.2406,  0.8948, -0.6845, -1.7521, -0.0114,
         -1.5581, -0.6389,  0.3049, -1.2423, -0.6244, -1.1311,  0.8315,  1.6257,
         -0.6005,  0.3280,  0.7026,  0.5396, -1.3772,  0.9477,  0.3131,  1.5963,
          0.4464,  0.3037,  0.1809,  1.0227, -0.1442, -0.7252,  0.0517, -0.8854,
          1.0309, -0.4741, -0.0853,  1.4625, -0.2310, -1.0477, -2.0578, -0.9728,
          1.0158, -2.5560,  0.4762,  0.3727,  0.3899,  0.2529, -0.3575,  0.7579,
         -0.6964, -0.1466, -1.3390, -1.7652,  0.9323,  1.1681, -0.0687, -1.1306,
         -0.0848,  0.5598,  0.3072,  0.7480, -0.1291, -1.7567, -0.5913,  1.0419,
         -0.3902,  0.5930, -0.0215,  0.2398, -0.2237,  0.6250,  1.2858, -0.5799,
         -1.2265, -0.0914, -0.2178,  1.3319, -0.4002, -0.0620,  0.4640, -0.0443,
          0.3477, -0.1807, -0.3323,  1.0229,  2.6406,  0.2830,  0.1710, -1.7539]],
       grad_fn=<IndexBackward>)

This lets us do all sorts of cool things! For example, we can predict the next word in a title, given the first two.
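
next_word is another course-script helper. A hypothetical sketch of the recipe, assuming the trained model can be called on a tensor of 2 context indices to get log-probabilities, and that the vocab has an index2word lookup (both assumptions on my part):

def next_word_sketch(context_words, model, vocab, k=5):
    idxs = torch.tensor([vocab.word2index[w] for w in context_words])
    log_probs = model(idxs)                    # assumed shape: (1, vocab_size)
    top = torch.topk(log_probs, k).indices[0]  # indices of the k likeliest targets
    return [vocab.index2word[int(i)] for i in top]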

In [32]:
next_word(["official", "music"], model)
official music =>
-----
video
nbc
awards
react
volcano

But maybe that was too easy. How about this example?

In [33]:
next_word(["star", "wars"], model)
star wars =>
-----
show
story
cast
episode
jedi

Visualization

If these representations are based on similarity, then we should be able to graph them. Just like (3, 5) and (4, 5) are close to each other on a $2$-d graph, words with similar meanings should sit close to each other as well.

However, we can't visualize $200$ dimensions. We can use the t-SNE algorithm for dimensionality reduction, projecting the vectors down to $3$-d, which we can then visualize.

Better results: https://projector.tensorflow.org/
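
word_viz is from the course scripts; the reduction step itself is a one-liner with scikit-learn's t-SNE (my sketch, assuming vectors is an (n_words, 200) array of stacked word embeddings):

from sklearn.manifold import TSNE

reduced = TSNE(n_components=3).fit_transform(vectors)  # -> (n_words, 3) coordinates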

In [34]:
from big_data_scripts.plot_vectors import *
output = word_viz(model)