PBPRDF - Linked Data for Basketball Analytics

Jacob Danovitch

What is PBPRDF?

Generate[s] RDF for NBA or WNBA basketball play-by-play data by reading a folder full of ESPN-style play-by-play HTML pages [...] and generating a Turtle file that contains all of the plays from each game.

From this ...

To this...

NBA game: Bucks (108) at Celtics (100) on 2017-10-18 - 475 events
TD Garden 2017-10-18T19:30:00.000-04:00

Team Pd. Time Scr Event
Celtics 1 12:00 0-0 Thon Maker vs. Al Horford (Marcus Smart gains possession)
Celtics 1 11:38 0-0 Marcus Smart misses 23-foot three point jumper
Bucks 1 11:33 0-0 Giannis Antetokounmpo defensive rebound
Celtics 1 11:33 0-0 Jaylen Brown personal foul
Bucks 1 11:26 0-0 Thon Maker misses 26-foot three point jumper
Celtics 1 11:22 0-0 Jaylen Brown defensive rebound
Celtics 1 11:16 0-0 Al Horford bad pass (Khris Middleton steals)
Bucks 1 11:10 2-0 Khris Middleton makes layup (Malcolm Brogdon assists)
Celtics 1 10:53 2-0 Kyrie Irving misses 17-foot pullup jump shot
Bucks 1 10:50 2-0 Giannis Antetokounmpo defensive rebound

And finally, this!

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix pbprdf: <http://stellman-greene.com/pbprdf#> .

<http://stellman-greene.com/pbprdf/games/2017-12-02_Pelicans_at_Trail_Blazers> a pbprdf:Game ;
    pbprdf:gameLocation "Moda Center" ;
    rdfs:label "NBA game: Pelicans (123) at Trail Blazers (116) on 2017-12-02 - 447 events" ;
    pbprdf:gameTime "2017-12-02T22:00:00.000-05:00"^^xsd:dateTime .

<http://stellman-greene.com/pbprdf/games/2017-12-02_Pelicans_at_Trail_Blazers/1> a pbprdf:JumpBall ;
    pbprdf:jumpBallHomePlayer <http://stellman-greene.com/pbprdf/players/Jusuf_Nurkic> ;
    pbprdf:jumpBallAwayPlayer <http://stellman-greene.com/pbprdf/players/DeMarcus_Cousins> ;
    pbprdf:jumpBallGainedPossession <http://stellman-greene.com/pbprdf/players/Damian_Lillard> ;
    a pbprdf:Play ;
    pbprdf:forTeam <http://stellman-greene.com/pbprdf/teams/Trail_Blazers> ;
    pbprdf:period "1"^^xsd:int ;
    a pbprdf:Event ;
    rdfs:label "Trail Blazers: DeMarcus Cousins vs. Jusuf Nurkic (Damian Lillard gains possession)" ;
    pbprdf:time "12:00" ;
    pbprdf:inGame <http://stellman-greene.com/pbprdf/games/2017-12-02_Pelicans_at_Trail_Blazers> ;
    pbprdf:secondsIntoGame "0"^^xsd:int ;
    pbprdf:secondsLeftInPeriod "720"^^xsd:int ;
    pbprdf:awayScore "0"^^xsd:int ;
    pbprdf:homeScore "0"^^xsd:int .

Why should I use PBPRDF?

  • Data availability
    • Aka: You don't really have a choice!

How do I use PBPRDF?

  1. Download the latest release
  2. Run ./fetch-(w)nba-play-by-plays.sh
    • 2b: Wait a really long time
  3. Run mv data/retrieved_*/wnba-2018-playoffs && ./pbprdf wnba-2018-playoffs/ wnba-2018-playoffs.ttl.
    • 3b: Wait again
  4. Try to set up Eclipse RDF4J and Apache Tomcat for 2 hours
  5. Fail, give up
  6. Just use Jena
  7. Do not try to load the dataset into memory
    • This will hurt
  8. Enjoy your data!

Example Use-Cases

From the author:

Clutch Shots

SELECT ?playerName ?shotsTaken ?shotsMade ?shotPercentage
WHERE 
{ 
  ?player a pbprdf:Player .
  ?player rdfs:label ?playerName .

  # Find the number of shots taken
  {
    SELECT ?player (COUNT(?shot) AS ?shotsTaken)
    WHERE 
    {
      ?shot a pbprdf:Shot .
      ?shot pbprdf:shotBy ?player .
      ?shot pbprdf:secondsLeftInPeriod ?secondsLeftInPeriod .
      FILTER (?secondsLeftInPeriod < 5)
    }
    GROUP BY ?player 
  }

  # Find the number of shots made
  {
    SELECT ?player (COUNT(?shot) AS ?shotsMade)
    WHERE 
    {
      ?shot a pbprdf:Shot .
      ?shot pbprdf:shotBy ?player .
      ?shot pbprdf:shotMade "true"^^xsd:boolean .
      ?shot pbprdf:secondsLeftInPeriod ?secondsLeftInPeriod .
      FILTER (?secondsLeftInPeriod < 5)
    }
    GROUP BY ?player 
  }

  # Calculate the shot percentage
  BIND ( (round((?shotsMade / ?shotsTaken) * 10000)) / 100 AS ?shotPercentage ) .

  # Only match players who took more than 10 shots just before the end of the period
  FILTER (?shotsTaken >= 15) .
}
ORDER BY DESC(?shotPercentage)

3-Pointers Following Opponent Attempts

From the author: gist

(query waaaaaaaaay too long to show)

PlayerName ThreePointShotPercentage ShotPercentageAfterMiss ShotPercentageAfterMake
"Karl-Anthony Towns" 41.99 45.16 62.50
"Aaron Gordon" 33.86 33.33 56.67
"Kevin Durant" 42.07 42.00 54.29
"Anthony Tolliver" 44.66 39.58 51.28
"Tobias Harris" 41.57 43.64 51.11
"Paul George" 39.64 47.27 50.79
"CJ McCollum" 39.76 46.88 50.00
"Nikola Jokic" 38.75 22.22 50.00
"Robert Covington" 36.99 33.33 50.00
"Allen Crabbe" 38.27 36.36 48.39
AVERAGE 37.28 37.29 36.95
STDEV 3.42 7.74 8.70

My Use-Case

Win Expectancy

  • Several SQL scripts
  • Hundreds of lines of awful
  • Would frequently crash my laptop
CREATE TABLE WinEx AS 
SELECT p.inning_side AS side,
       p.inning AS inn,
       substr(count,1,1) AS b,
       substr(count,3,4) AS s,
       atbat.o AS out,
       p.Occupied1b AS state_1b,
       p.Occupied2b AS state_2b,
       p.Occupied3b AS state_3b,
       (CAST(atbat.home_team_runs AS INT) - CAST(atbat.away_team_runs AS INT)) AS rd,
       (CAST(SUM(game.homeW) AS float)/ CAST(COUNT( * ) AS float)) AS w_Pct,
       atbat.off_rem_outs_h AS off_r_outs,
       atbat.def_rem_outs_h AS def_r_outs,
       count( * ) AS n
  FROM pitch AS p
       INNER JOIN
       atbat ON (p.gameday_link = atbat.gameday_link AND 
                 p.num = atbat.num) 
       INNER JOIN
       game ON atbat.gameday_link = game.gameday_link
 GROUP BY side,
          inn,
          count,
          out,
          Occupied1b,
          Occupied2b,
          Occupied3b,
          rd
 ORDER BY inn ASC,
          side DESC,
          rd ASC;

(yikes)

OR, INSTEAD:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pbprdf: <http://stellman-greene.com/pbprdf#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT (xsd:integer(?sec)/60 AS ?min) ?pd (SUM(?homeW) / COUNT(?homeW) AS ?wp) (COUNT(?sec) AS ?n)
WHERE {
    ?game rdf:type pbprdf:Play .
    ?game pbprdf:secondsIntoGame ?sec .
    ?game pbprdf:awayScore ?away .
    ?game pbprdf:homeScore ?home .

    ?game pbprdf:inGame ?g .
    ?final pbprdf:inGame ?g .
    ?final rdf:type pbprdf:EndOfGame .

    ?final pbprdf:awayScore ?awayF .
    ?final pbprdf:homeScore ?homeF .

    BIND(?home - ?away AS ?pd)
    BIND(xsd:integer(?homeF > ?awayF) AS ?homeW)
}
GROUP BY ?sec ?pd
ORDER BY DESC(?n)

These two queries produce the same result.

In [1]:
import pandas as pd

pd.read_csv("winex.csv").style.bar(subset=['pd ', 'wp '], color=['#d65f5f', '#5fba7d'])
Out[1]:
seconds min pd wp n
0 0 0 0 0.58 465
1 720 12 3 0.74 151
2 1440 24 0 0.55 142
3 720 12 -3 0.33 128
4 1440 24 5 0.75 115
5 720 12 -1 0.7 114
6 1440 24 -1 0.44 111
7 720 12 7 0.65 110
8 720 12 2 0.56 106
9 720 12 -6 0.33 100

True Linked Data

Goal - metadata from dbPedia

  • Fun trivia
    • Which colleges produce the most clutch players?
    • Which countries produce the ...
      • Tallest players?
      • Best shooters?
      • Best passers?
  • Interesting analysis
    • Are Kobe Bryant and Mark Cuban right about AAU/youth basketball?
    • Open-sourcing RPM
    • Comparing draft position to player value
      • Somebody should do this!
      • I'm just like really busy right now

Reality

Thanks, dbPedia!

Conclusion