Chronology: Data-driven Sports Image Indexing Research

From Wikibase.slis.ua.edu
Revision as of 17:39, 1 June 2023 by Admin

Linked Data Research Group

Steven L. MacCall, PhD and Huapu Liu
School of Library and Information Studies
College of Communication and Information Sciences
University of Alabama


Introduction and Background

This chronology reports on research, teaching, service, and entrepreneurial activities investigating data-driven sports image indexing led by Steven L. MacCall, PhD, Associate Professor in the School of Library and Information Studies (SLIS) at the University of Alabama.

This research program investigates the effectiveness and efficiency of a semantic indexing method designed to support image search queries that target specific game statistical situations. For example: retrieve all plays from the 2017 Alabama Crimson Tide football season that resulted in touchdowns and occurred in the 3rd quarter on 2nd down, either all such plays or just those having available video clips (to run a query, press the white arrow in the blue box at lower left). The method, based on a recently issued UA patent, departs from the traditional asset-by-asset approach, which creates an individual metadata record for each asset. Instead, it inverts the traditional process: each game is first organized on a play-by-play basis using structured time segments, and each asset is then “attached” to the appropriate play via an actual or simulated time-based parameter. Because there are far fewer games to organize than individual images to index, the time spent organizing is reduced, and this reduction is the source of our claim for efficiency and effectiveness.
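The "attach" step described above can be illustrated with a minimal Python sketch. It assumes each play carries a wall-clock start time and each asset carries a capture timestamp; the `Play` class, the `attach_to_play` function, and all identifiers and times below are invented for illustration and are not taken from the project's software.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Play:
    play_id: str       # hypothetical identifier, not a real Wikibase item ID
    start: datetime    # wall-clock time at which the play began
    description: str


def attach_to_play(plays, captured_at):
    """Return the most recently started play at `captured_at`.

    `plays` must be sorted by start time. Returns None when the asset
    was captured before the first play of the game.
    """
    starts = [p.start for p in plays]
    idx = bisect_right(starts, captured_at) - 1
    return plays[idx] if idx >= 0 else None


# Invented sample data: three consecutive plays from one game.
plays = [
    Play("play-001", datetime(2017, 9, 2, 19, 5, 0), "Kickoff"),
    Play("play-002", datetime(2017, 9, 2, 19, 6, 10), "1st and 10, rush"),
    Play("play-003", datetime(2017, 9, 2, 19, 6, 55), "2nd and 6, pass"),
]

photo_time = datetime(2017, 9, 2, 19, 6, 30)
match = attach_to_play(plays, photo_time)
print(match.play_id)  # play-002
```

In this sketch the game is organized once as a sorted list of plays, and every subsequent asset is placed with a single binary search rather than described individually.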

The key to the success of this semantic indexing method was the design and development of a data processing pipeline that takes raw statistical play-by-play datasets and converts them to RDF triples so that they can be uploaded into our linked data application using a Wikibase instance hosted here at the University of Alabama.
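As a rough illustration of the conversion step, the following Python sketch turns one JSON play record into N-Triples lines using only the standard library. The `https://example.org/` namespace, the field names, and the `to_ntriples` function are assumptions made for this example; the actual pipeline and the ontology used in our Wikibase instance are not reproduced here.

```python
import json

BASE = "https://example.org/"  # placeholder namespace (assumption)
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def to_ntriples(play):
    """Convert one play record (a dict with an 'id' key) to N-Triples lines."""
    subject = f"<{BASE}play/{play['id']}>"
    lines = []
    for key, value in play.items():
        if key == "id":
            continue
        predicate = f"<{BASE}prop/{key}>"
        if isinstance(value, int):
            obj = f'"{value}"^^<{XSD_INTEGER}>'
        else:
            # Escape backslashes and double quotes inside string literals.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


# Invented sample record; the field names are hypothetical.
raw = '{"id": "p42", "period": 3, "down": 2, "play_type": "Rushing Touchdown"}'
for line in to_ntriples(json.loads(raw)):
    print(line)
```

Each non-identifier field becomes one triple whose subject is the play, which is the general shape needed before loading into a graph database.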

Indexing research is a long-standing area for information science investigators. Applications of such research include the improvement of systems design (efficiency and effectiveness) as well as informing the teaching of information professionals to better prepare them for the technological intensities of the modern metadata work environment. In a professional school such as SLIS, it is vitally important that full-time faculty members have teaching and service activities informed by their research.

Current Research Questions

One would expect that research questions would evolve over time, and in that spirit, these are the current primary research questions under investigation in this research program:

  1. Effectiveness: Can data-driven indexing methods serve a maximal indexing objective in which all identifiable named entities are indexed based on all of their observable features and attributes?
  2. Efficiency: Can data-driven indexing methods scale to meet the maximal indexing objective while also matching the rate at which current sports images (photo and video) are captured by digital cameras and historical sports images (photo and video) are digitized from the historical record?
  3. Integration: Can current born-digital sports images and digitized historical images be indexed in such a way that they reside in a single database application rather than in multiple silos?
  4. Identification: Can a maximal semantic indexing method incorporating statistical play-by-play datasets provide sufficient context for identifying images based on visible entities and attributes?

Data Sources (and Their Original Material Condition) Used in our Data Processing Pipeline

The data-driven method under study incorporates statistical play-by-play datasets on a per-game basis into the image indexing process. We are currently focusing our investigation on Alabama Crimson Tide football based on publicly accessible play-by-play datasets for 2017 and also based on our access to the historical data sources available at the Paul W. Bryant Museum at the University of Alabama.

Data preparation involved the development of a data processing pipeline consisting of a series of steps including data reformatting, data cleaning, and the reconciling of entities with their respective properties in the ontology that we developed and implemented using our Wikibase instance.
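The cleaning and reconciling steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: entity labels are normalized by collapsing whitespace and ignoring case, then matched against a lookup table of known items. The function names and the item identifiers ("Q1", "Q2") are hypothetical and do not correspond to items in our Wikibase instance.

```python
def clean(label):
    """Normalize a label: collapse runs of whitespace and ignore case."""
    return " ".join(label.split()).casefold()


def reconcile(records, known_items):
    """Split records into those matched to a known item and those unmatched.

    `known_items` maps a canonical label to an item identifier.
    """
    index = {clean(label): item_id for label, item_id in known_items.items()}
    matched, unmatched = [], []
    for rec in records:
        item_id = index.get(clean(rec["team"]))
        if item_id is None:
            unmatched.append(rec)  # left for manual review
        else:
            matched.append({**rec, "team_item": item_id})
    return matched, unmatched


# Invented sample data.
known_items = {"Alabama Crimson Tide": "Q1", "Tulane Green Wave": "Q2"}
records = [
    {"team": "  alabama   crimson tide", "yards": 12},
    {"team": "Vanderbilt", "yards": 3},
]

matched, unmatched = reconcile(records, known_items)
print(matched)
print(unmatched)
```

Keeping unmatched records in a separate list, rather than dropping them, reflects the semi-automated nature of such a pipeline: whatever cannot be reconciled automatically remains available for a person to resolve.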

Our data sources varied based on their original material condition (JSON-formatted digital files or paper-based sources) and their completeness. As such, we identified three categories of datasets that fed into our data processing pipeline based on the following combination of conditions:

  1. Category 1 games with JSON-formatted play-by-play datasets: We obtained digital files containing JSON-formatted datasets for every game of the 2017 Alabama Crimson Tide season from publicly accessible play-by-play datasets. NOTE: These files are available for every NCAA football game since 2005, which figures into our future plans.
  2. Category 2 games with paper-based typewritten play-by-play datasets: We had volunteers transcribe the paper-based typewritten play-by-play datasets obtained from the documentary collection of the Paul W. Bryant Museum for the following two games of the 1992 Alabama Crimson Tide season:
    1. Alabama versus Tulane played on October 10, 1992 at the Louisiana Superdome in New Orleans
    2. Alabama versus LSU played on November 7, 1992 at Tiger Stadium in Baton Rouge, LA
  3. Category 3 games with no existing play-by-play datasets in any format: Without access to play-by-play datasets as either digital files or in typewritten paper-based sources, we turned to newspaper accounts of games in order to attempt to reconstruct a play-by-play dataset for the following two games of the 1961 Alabama Crimson Tide season:
    1. Alabama versus Tulane played on September 30, 1961 at Ladd Stadium in Mobile, AL (Newspaper source: "Alabama Nips Tulane 9 to 0" by Charles Land appearing in the October 1, 1961 edition of the Tuscaloosa News)
    2. Alabama versus Vanderbilt played on October 7, 1961 at Legion Field in Birmingham (Newspaper source: "Tide Engulfs Vandy, 35-6" by Charles Land appearing in the October 8, 1961 edition of the Tuscaloosa News)

During this investigation, we came across an additional game, the 1961 Iron Bowl played on December 2, 1961 at Legion Field in Birmingham, that had already been fully reconstructed for the "Bama's Greatest Games" TV series so we added all of the plays from that game to our database. We would like to thank the Paul W. Bryant Museum for providing us with a digitized copy of the VHS version of this game.

Research Result Highlights and Links to Working Software Demonstrations

See also Publications and Presentations

Semantic Indexing Era: Wikibase Software (2018 to present)

Since summer of 2018 (see entry for 2018 in chronology below), we have been able to take advantage of a state-of-the-art research and development software platform (Wikibase) that facilitates our investigation of research questions using semantic indexing methods rather than using conventional image indexing software (see subsection just below).

Since summer of 2018, we have achieved the following milestones related to our research questions (follow links for more information about each):

  1. Research questions 1 and 2: Image indexing efficiency and effectiveness
    1. Proof of concept (spring 2022) - Successfully ingested play-by-play data for about 12,500 college football games from 2005 - 2021: During the spring 2022 semester, we completed the design of a high-throughput ETL pipeline and successfully evaluated the pipeline by loading play-by-play statistical data for 17 college football seasons (2005-2021). This load resulted in a SPARQL-queryable graph database of over 2.5 million entities representing about 12,500 games and over 2.1 million plays that were run during those games. This successful database load is a scaled-up proof of concept for applying linked data technologies to organization of information challenges. Sample SPARQL queries:
      1. All plays from the 2020 Alabama Crimson Tide football season with their corresponding game clock and wall clock time
      2. All plays from the 2018 Michigan Wolverines football season with their corresponding game clock and wall clock time
      3. All plays from the 2010 Alabama Crimson Tide football season with their corresponding game clock time
      4. All plays from the 2019 SEC football season (all teams) with their corresponding game clock and wall clock time
      5. Rushing touchdowns run for over 50 yards in 2021 SEC season games (all teams)
    2. Pilot study (spring 2019) - Successfully ingested play-by-play data for all 15 games from the 2017 Alabama Crimson Tide football season: We have completed the semantic indexing of every play that occurred in all 15 games from the 2017 Alabama Crimson Tide football season. The method incorporated JSON-formatted, Web-accessible statistical play-by-play datasets into the indexing process via a semi-automated data processing pipeline. This effort was partially funded by two 2018 grants that we received; see chronology below. See 2018-19 Academic Year Research Report for detailed documentation and software demonstration pertaining to the accomplishment of this milestone.
    3. Pilot study of historical games (1) (spring 2020) - Successfully ingested partial play-by-play data from 2 games from the 1992 Alabama Crimson Tide football season: We have completed RGC grant-funded work (see 2018 entry in chronology below) that allowed us to investigate the recovery of play-by-play data from the historical record. For games in 1992, the historical record consisted of typed play-by-play datasets requiring manual transcription by volunteers, which allowed us to investigate the inclusion into our data processing pipeline of a crowdsourcing method for historical data transcription. See 2019-20 Academic Year Research Progress Report for detailed documentation and software demonstration pertaining to the accomplishment of this milestone.
    4. Pilot study of historical games (2) (spring 2020) - Successfully ingested partial play-by-play data from 2 games from the 1961 Alabama Crimson Tide football season: As part of the above-mentioned RGC grant, we also investigated games from the 1961 season. For games from 1961, there were no historical records available containing play-by-play data. However, there are other documentary sources of such data, including journalistic newspaper accounts of those games. In our study, we investigated the inclusion into our data processing pipeline of a crowdsourcing method for historical data extraction from newspaper articles. We chose to use articles published in the Tuscaloosa News that are available online. See 2019-20 Academic Year Research Progress Report for detailed documentation and software demonstration pertaining to the accomplishment of this milestone.
    5. PLEASE NOTE: One additional game from the 1961 Alabama Crimson Tide football season: When working with the 1961 season, we came across a fully reconstructed game, the "Iron Bowl" game against Auburn on December 2, 1961 that had been converted into a television production using digitized coaches film and historical accounts maintained at the Paul W. Bryant Museum. Based on the availability of this digitized full-game content, we were able to index every play of that game as if it were a game from the 2017 season (described above). See 2019-20 Academic Year Research Progress Report for detailed documentation and software demonstration pertaining to this game.
  2. Research question 3: Integration: After completing the semantic indexing of individual plays from the three seasons of Alabama Crimson Tide football as described above, we were able to successfully evaluate the querying of plays as if they were all in the same semantic database. Here are example SPARQL queries demonstrating this capability. (PLEASE NOTE: To run each query, click on the links below and then click on the Blue Arrow icon in the lower left portion of the screen):
    1. All rushing touchdowns that went for over 50 yards during the 3 seasons of Alabama Crimson Tide football in our database (just those for which there are video clips)
    2. All interceptions returned for 15 or more yards during the 3 seasons of Alabama Crimson Tide football in our database (just those for which there are video clips)
    3. All successful field goals from 30 yards or more during the 3 seasons of Alabama Crimson Tide football in our database (just those successful field goals for which there are video clips)
  3. Research question 4: Identification: Crimson Tide Photos, a unit in UA Athletics, provided us with a random sample of born-digital photos that were taken during games from the 2017 Alabama Crimson Tide football season, and we were able to identify each image in terms of the play context and the game statistical situation at the time that each photo was taken. (See Plays with Example UA Images May 2019 for documentation of this milestone.)
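To give a sense of what the sample queries listed above involve (for example, rushing touchdowns that went for over 50 yards), here is a minimal Python sketch that builds a parameterized SPARQL query string. The `ex:` prefix and the property names (`ex:playType`, `ex:season`, `ex:yardsGained`) are placeholders invented for this example; a real Wikibase instance uses its own numbered properties, and the linked queries may be structured quite differently.

```python
from string import Template

# Hypothetical vocabulary; not the properties used in the actual database.
QUERY = Template("""\
PREFIX ex: <https://example.org/prop/>
SELECT ?play ?yards WHERE {
  ?play ex:playType "Rushing Touchdown" ;
        ex:season $season ;
        ex:yardsGained ?yards .
  FILTER(?yards > $min_yards)
}
ORDER BY DESC(?yards)
""")


def rushing_td_query(season, min_yards):
    """Return a SPARQL query for long rushing touchdowns in one season."""
    # int() guards against injecting arbitrary text into the query.
    return QUERY.substitute(season=int(season), min_yards=int(min_yards))


print(rushing_td_query(2021, 50))
```

The same template approach extends to the other examples (interceptions returned for 15 or more yards, field goals from 30 yards or more) by changing the play type and the numeric threshold.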

Conventional Indexing Era: Omeka and ContentDM Software (2008 to 2018)

During this period, SLIS students in my annual spring semester Metadata course (LS 566) benefited from the application of my research, which I used to inform my teaching. Specifically, my research served as the basis for their course project work, in which they applied indexing theories by following the standard, conventional approach to indexing a set of images. These images were provided by Ken Gaddy, Director of the Paul W. Bryant Museum, and comprised both digitized black and white photos of Alabama Crimson Tide football games from the 1975 season and born-digital color images from the 2010 National Championship game.

The indexing of these images was accomplished using two different software applications.

  1. Omeka software (2011-2018): Here is a representative example of the end result of the last such indexing project, from the Spring 2018 semester of this course.
  2. ContentDM (2008-2010)

Research-related Accomplishments by Year

See also Publications and Presentations

The entries below are arranged reverse chronologically:

2022

During the spring 2022 semester, we completed the design of a high-throughput ETL pipeline and successfully evaluated the pipeline by loading play-by-play statistical data for 17 college football seasons (2005-2021). This load resulted in a SPARQL-queryable graph database (using Wikibase software) of over 2.5 million entities representing about 12,500 games and over 2.1 million plays that were run during those games. This successful database load is a scaled-up proof of concept for applying linked data technologies to organization of information challenges.

Sample SPARQL queries are listed under the spring 2022 proof of concept in the Research Result Highlights section above.

2021

Huapu Liu begins design of the ETL pipeline:

  • Python and R scripting for accessing and transforming game datasets from College Football Data's API
  • Optimized data upload speed to our Wikibase instance via API

Continuing work with Dr. Gan's group in the UA Department of Electrical and Computer Engineering. Completion of senior project work: Nguyen Nguyen, Jeff Reidy, and Noah Wagnon. "Final Presentation: Automated Jersey Number Recognition". Senior design team presentation for EECE 494.407 for 2020-21 academic year. PDF slidedeck

2020

MacCall, S.L. (2020). Systems and methods for digital asset organization. U.S. Patent number 10,534,812.

Partnership established with Dr. Yu Gan, Assistant Professor of Electrical and Computer Engineering at the University of Alabama. Dr. Gan is an expert in digital image processing, and he is working with master's student Alexander Ramey to automatically detect player numbers in order to extract player participation data from born-digital and digitized historical video of two games: the 2017 season National Championship game and the 1961 season Iron Bowl.

MacCall, S.L., Liu, H., & Anderson, C.M. (2020). Statistical data recovery from historical documentation of Alabama football games using Wikibase as a repository. Interactive demonstration accepted for Connecting Collections as Data: Transforming Communities, Sharing Knowledge, and Building Networks with International GLAM Labs, Washington, DC.

MacCall, S.L. (2020). Data-driven semantic DAM indexing incorporating statistical play-by-play game logs: A linked data application using Wikibase from the 2017 football season of the Alabama Crimson Tide. Conference paper accepted for presentation at the 2020 LD4 Conference on Linked Data in Libraries, College Station, TX.

Anderson, C.M., Liu, H., & MacCall, S.L. (2020). Crowdsourcing in a semantic indexing workflow for efficiently organizing historical multimedia sports collections. Poster accepted for the 2020 Annual Meeting of the Alabama Library Association, Birmingham, AL.

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with Wikibase software, as the basis for the major indexing project in my brand new spring 2020 Linked Data course (LS 590)

2019

MacCall, S.L., Liu, H., & Anderson, C.M. (2019). How much statistical data can be recovered from Alabama football history? Piloting a crowdsourced approach using Wikibase as data repository. Conference paper presented at 2019 Digitorium Digital Humanities Conference, Tuscaloosa, AL. [UA Institutional Repository deposit: https://ir.ua.edu/handle/123456789/6574]

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with OmekaS software, as the basis for the major metadata indexing project in my spring 2019 Metadata course (LS 566)

2018

MacCall, S.L. (2018). Investigation of a data-driven indexing method for multimedia asset collections in sports: Phase 2: Developing SLIS research capacity for key linked open data technologies. University of Alabama School of Library and Information Studies Research Fund - $1,000. Funded.

MacCall, S.L., & Bott, G. (2018). Investigation of a data-driven indexing method for multimedia asset collections in sports: Phase 1: How much data can be recovered from Alabama football history? University of Alabama Office of Research and Development Research Grants Committee Level 1 Program - $6,000. Funded.

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with OmekaS software, as the basis for the major metadata indexing project in my spring 2018 Metadata course (LS 566)

2017

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with OmekaS software, as the basis for the major metadata indexing project in my spring 2017 Metadata course (LS 566)

2016

MacCall, S.L., McMillan, D.J., Vargo, C.J., Bradley, S.B., & Aversa, E.A. (2016). Efficiency, integration, interoperability: A 21st century approach to organizing sports digital assets for all libraries. Knight Foundation’s News Challenge for Libraries: How Might Libraries Serve 21st Century Information Needs? - pre-budget grant submission. Not funded.

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with Omeka Classic software, as the basis for the major metadata indexing project in my spring 2016 Metadata course (LS 566)

2015

MacCall, S.L. (Filed December 15, 2015). Systems and methods for digital asset organization. U.S. Utility Patent Application number 14/971,463.

MacCall, S.L., Vargo, C.J., Bradley, S.B., & Aversa, E.A. (2015). Development of a novel digital asset organizing method in sports. National Science Foundation - Small Business Innovation Research (SBIR) Phase I Grant - $225,000 ($74,925 sub-award to University of Alabama). Not funded.

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with Omeka Classic software, as the basis for the major metadata indexing project in my spring 2015 Metadata course (LS 566)

2014

MacCall, S.L. [Chief Scientist for MaxOrg, LLC], Aversa, E.A. [CEO for MaxOrg, LLC], & McMillan, D.J. [Technology Officer for MaxOrg, LLC]. (2014, 2015, 2016). Crimson Canvas - MaxOrg, LLC. Program participation: Commercial Development of Faculty Developed UA Intellectual Property sponsored by Alabama Innovation and Mentoring of Entrepreneurs (AIME).

MaxOrg, LLC formed as a faculty-led startup to contribute to the commercial development of UA intellectual property (see 2020 and 2015 entries above for patent issued and patent filed data respectively)

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with Omeka Classic software, as the basis for the major metadata indexing project in my spring 2014 Metadata course (LS 566)

2013

MacCall, S.L. & Gaddy, K. (2013). Optimal organizing of digital images in sports: A project of the Paul W. Bryant Museum and UA SLIS. Presented at the 2013 University of Alabama Program in Sports Communication Sports Symposium, Tuscaloosa, AL. [Slideshare: http://tinyurl.com/k875g5j]

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with Omeka Classic software, as the basis for the major metadata indexing project in my spring 2013 Metadata course (LS 566)

2011

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with Omeka Classic software, as the basis for the major metadata indexing project in my spring 2011 Metadata course (LS 566)

2010

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with ContentDM software, as the basis for the major metadata indexing project in my spring 2010 Metadata course (LS 566)

2009

Used the Paul W. Bryant Museum image sets provided in 2008 by Ken Gaddy, along with ContentDM software, as the basis for the major metadata indexing project in my spring 2009 Metadata course (LS 566)

2008

Partnership begun with Ken Gaddy, director of the Paul W. Bryant Museum. Ken provided my students and me with a set of digitized black and white photos from the 1975 Alabama Crimson Tide football season and a set of born-digital color images from the 2010 National Championship game. These images have served as the training set for students in my LS 566 course on Metadata every year since 2008. Equally important, my own study of these images was instrumental in the development of my research program and subsequently led to the research output reported here.

Adopted use of ContentDM image repository software for the metadata indexing project in my LS 566 Metadata course at SLIS.

Team Members and Major Supporters

  1. Ken Gaddy, Director, Paul W. Bryant Museum (retired)
  2. Dr. Greg Bott, Assistant Professor, UA Culverhouse College of Business (co-author of RGC grant)
  3. David McMillan, Executive Director, Enterprise Development Application Support and MaxOrg, LLC (led install of Wikibase software and its software customization to this project)
  4. Dr. Yu Gan, Assistant Professor, UA Department of Electrical and Computer Engineering (digital image processing)
  5. Huapu Liu, MLIS, my former Graduate Research Assistant and frequent co-author (provided indispensable support from the very beginning of the Wikibase phase of this research project when we were faced with an empty database!)
  6. C. Melissa Anderson, my current Graduate Research Assistant and co-author (project manager for RGC grant)
  7. Christina Schultz-Richert, MLIS graduate (worked to extend the football related model to Public Television episode transcripts)
  8. Dr. Elizabeth Aversa, retired SLIS Director and MaxOrg, LLC